Apache Spark DataFrames Getting Started Guide: Working with DataFrames
In the previous article, we introduced how to create a DataFrame. This article describes how to operate on the data inside a DataFrame and how to print that data out.
Printing the DataFrame's schema

After creating a DataFrame, we generally want to look at the schema of its data. We can use the printSchema function for this; it prints the name and type of each column:

students.printSchema

The output is as follows:
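As a minimal sketch of the above, assuming Spark 1.x with the spark-csv package and a SQLContext named `sqlContext` (the file name and column names below are illustrative assumptions, not taken from the article):

```scala
// Load a csv file into a DataFrame, as in the previous article.
val students = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("StudentData.csv")

// Print the column names and types.
students.printSchema
// Without schema inference, spark-csv types every column as string, e.g.:
// root
//  |-- id: string (nullable = true)
//  |-- studentName: string (nullable = true)
//  |-- email: string (nullable = true)
```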
Sampling the data in a DataFrame

After printing the schema, the second thing we want to do is look at the data loaded into the DataFrame. There are several ways to sample data from a newly created DataFrame; let's go through them.

The simplest is the show method, which has four versions:
(1) The first requires us to specify the number of rows to print:
def show(numRows: Int)
(2) The second takes no parameters; in this case, show prints the first 20 rows by default:
def show()
(3) The third takes a boolean that controls whether cell values longer than 20 characters are truncated:
def show(truncate: Boolean)
(4) The last lets us specify both the number of rows to sample and whether to truncate:
def show(numRows: Int, truncate: Boolean)
In fact, the first three versions are implemented by calling this one.
The show function differs from the other sampling functions in that it not only displays the requested rows but also prints the header information, and it writes directly to the default output stream (standard output). Let's see how to use it:
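The four variants above can be exercised on the students DataFrame like this (an assumed example):

```scala
students.show()          // default: first 20 rows, long cells truncated
students.show(5)         // first 5 rows
students.show(false)     // first 20 rows, nothing truncated
students.show(10, false) // first 10 rows, nothing truncated
```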
Querying the columns of a DataFrame

As you can see, every column in a DataFrame has a name. The select function helps us pick the required columns from a DataFrame and returns a new DataFrame, as introduced below.

1. Selecting a single column. Suppose we only want to select the email column. Since a DataFrame is immutable, select returns a new DataFrame:
We can call printSchema on the result to verify that only the selected column is present. If the column name is invalid, an org.apache.spark.sql.AnalysisException will be thrown, as follows:
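A short sketch of single-column selection, continuing with the assumed students DataFrame:

```scala
// select returns a new DataFrame; students itself is unchanged.
val emails = students.select("email")
emails.printSchema   // only the email column remains

// A misspelled column name fails at analysis time, e.g.:
// students.select("emial")
//   => org.apache.spark.sql.AnalysisException: cannot resolve 'emial' ...
```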
Filtering the data by a condition

Now that we know how to select the required columns of a DataFrame, let's look at how to filter its data by a condition. Since the data is row-based, we can treat the DataFrame like an ordinary Scala collection and filter it with the relevant condition. To display the results clearly, the show function is called on each filtered result below.
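Two common filtering styles, shown as a sketch (the column names and values are assumptions):

```scala
// SQL-style predicate string:
students.filter("id = 1").show()

// Column-based predicate, equivalent in effect:
students.filter(students("studentName") === "John").show()
```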
Sorting the data in a DataFrame

Using the sort function, we can sort a DataFrame by the specified columns:
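For example (column names assumed as before):

```scala
students.sort("studentName").show()                // ascending by one column
students.sort(students("studentName").desc).show() // descending order
students.sort("studentName", "id").show()          // by several columns
```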
Renaming columns

If we are not happy with a DataFrame's default column names, we can rename them with as while selecting. Below, the studentName column is renamed to name, while the email column is selected under its original name:
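A sketch of renaming during selection:

```scala
// studentName is exposed as name; email keeps its name.
students
  .select(students("studentName").as("name"), students("email"))
  .show()
```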
Treating a DataFrame as a relational table

One of the strengths of DataFrame is that we can treat it as a relational table and then run SQL queries against it, as long as we complete the following two steps:

(1) Register the DataFrame as a table named student.

(2) Run ordinary SQL queries against that table through the SQL context.
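With the Spark 1.x API, the two steps look roughly like this:

```scala
// Step 1: registerTempTable makes the DataFrame visible to SQL
// under the name "student".
students.registerTempTable("student")

// Step 2: the table can now be queried with ordinary SQL.
sqlContext.sql("SELECT studentName, email FROM student").show()
```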
Joining two DataFrames

We already know how to register a DataFrame as a table; now let's look at how to join two DataFrames with ordinary SQL.

1. Inner join: the inner join is the default join operation. It returns only the rows of the two DataFrames that match each other. Take a look at the following example:
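As a sketch, assuming a second DataFrame named marks with a studentId column (both names are illustrative assumptions), registered as a table alongside student:

```scala
marks.registerTempTable("marks")

// The inner join keeps only students that have a matching row in marks.
sqlContext.sql(
  """SELECT s.studentName, m.mark
    |FROM student s
    |JOIN marks m ON s.id = m.studentId""".stripMargin).show()
```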
Saving a DataFrame to a file

Finally, let me introduce how to save a DataFrame into a file. We used the load function to load a csv file; to persist a DataFrame, we can use the save function. This takes the following two steps:

1. First, create a Map object to hold the properties the save function needs. Here I specify the path to save to and enable the csv header:
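The two steps can be sketched like this, following the spark-csv package's option names (the output folder name is an illustrative assumption):

```scala
// Step 1: a Map with the options the save needs.
val saveOptions = Map(
  "header" -> "true",
  "path"   -> "students-out"   // a folder, not a single file
)

// Step 2: write the DataFrame out in csv format using those options.
students.write
  .format("com.databricks.spark.csv")
  .options(saveOptions)
  .save()
```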
Note that the path parameter specified above names the folder to save into, not the name of the final saved file.