Spark SQL DataFrame
First, generating the DataFrame object

Spark SQL can generate DataFrame objects from other RDD objects, Parquet files, JSON files, Hive tables, and other JDBC-based relational databases as data sources. This article uses a MySQL database as the data source, generates a DataFrame object, and then performs the relevant operations on that DataFrame.
The code used in this article to generate the DataFrame is as follows:
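The original code block appears to have been lost from this copy. As a hedged sketch: a DataFrame is typically read from MySQL with `spark.read.jdbc`. The URL, credentials, and table name `spark_sql_test` below are placeholders; since no database is available here, the JDBC call is left commented out and a small in-memory DataFrame is built instead to stand in for `jdbcDF`.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("jdbc-df").getOrCreate()
import spark.implicits._

// Hedged sketch of the JDBC read (placeholder URL/credentials, not executed here):
// val prop = new Properties()
// prop.put("user", "root")
// prop.put("password", "secret")
// val jdbcDF = spark.read.jdbc("jdbc:mysql://localhost:3306/test", "spark_sql_test", prop)

// In-memory stand-in using the field names referenced later in the article
val jdbcDF = Seq(
  (1, "a", 12, 1.0),
  (2, "b", 17, 2.5)
).toDF("id", "c1", "c2", "c4")

jdbcDF.show()
```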
Second, Actions on the DataFrame object

1. show: display data in tabular form

jdbcDF.show displays the data in jdbcDF, similar to `select * from spark_sql_test`. There are four ways to call it:

(1) show: shows only the first 20 records.
(2) show(numRows: Int): shows the first numRows records.
(3) show(truncate: Boolean): whether to truncate displayed strings to at most 20 characters; the default is true.
(4) show(numRows: Int, truncate: Boolean): combines the number of records to display with the display format for overly long strings.
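The four call variants above can be sketched as follows, using a toy in-memory DataFrame (made-up data) as a stand-in for the article's jdbcDF:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("show-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq(
  (1, "a", "this is a fairly long string value"),
  (2, "b", "short"),
  (3, "c", "another value")
).toDF("id", "c1", "c3")

jdbcDF.show()            // first 20 rows, long cells truncated to 20 characters
jdbcDF.show(2)           // first 2 rows
jdbcDF.show(false)       // first 20 rows, no truncation
jdbcDF.show(2, false)    // first 2 rows, no truncation
```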
2. collect: get all the data as an array

Unlike the show method above, collect retrieves all of jdbcDF's data and returns an Array[Row]. Each record is a GenericRowWithSchema, which stores both the field names and the field values.
3. collectAsList: get all the data as a List

Similar to collect, but returns a java.util.List[Row] object instead.
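A minimal sketch of both collection methods, on a toy stand-in for jdbcDF:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("collect-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a"), (2, "b")).toDF("id", "c1")

// collect: everything back to the driver as an Array[Row]
val rows: Array[Row] = jdbcDF.collect()

// collectAsList: the same data as a java.util.List[Row]
val rowList: java.util.List[Row] = jdbcDF.collectAsList()

println(rows.mkString(", "))
```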
4. describe(cols: String*): get statistics for the specified fields

This method takes one or more String-type field names of the DataFrame; the result is still a DataFrame, containing statistics for numeric-type fields such as count, mean, stddev, min, max, and so on. In the examples, the c1 field is of character type, c2 is an integer, and c4 is a floating-point type.
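A hedged sketch of describe on fields of those three types (toy data; the summary DataFrame has one row each for count, mean, stddev, min, and max):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("describe-demo").getOrCreate()
import spark.implicits._

// c1 is a string, c2 an integer, c4 a double
val jdbcDF = Seq(("a", 10, 1.0), ("b", 20, 2.0), ("c", 30, 3.0)).toDF("c1", "c2", "c4")

val stats = jdbcDF.describe("c1", "c2", "c4")
stats.show()
```

For the string column c1, mean and stddev come back null; count, min, and max are still reported.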
5. first, head, take, takeAsList: get a number of row records

(1) first: gets the first row of records.
(2) head: gets the first row of records; head(n: Int) gets the first n rows.
(3) take(n: Int): gets the first n rows of data.
(4) takeAsList(n: Int): gets the first n rows of data as a java.util.List.

These methods return one or more rows of data in the form of Row or Array[Row]. first and head function the same. The take and takeAsList methods return the retrieved data to the Driver side, so when using these two methods you need to pay attention to the amount of data, lest the Driver hit an OutOfMemoryError. Usage and results are omitted.
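Since the original usage examples were omitted, here is a minimal sketch of the four methods on toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("head-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "c1")

val r1 = jdbcDF.first()            // first row
val r2 = jdbcDF.head()             // same as first
val top2 = jdbcDF.take(2)          // Array[Row] of the first 2 rows
val top2List = jdbcDF.takeAsList(2) // java.util.List[Row] of the first 2 rows
```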
Third, conditional queries, joins, and other operations on the DataFrame object

The methods below all return results of DataFrame type, so they can be chained.
1. where conditions

(1) where(conditionExpr: String): takes the expression that follows SQL's where keyword; and and or can be used in it. Returns a result of DataFrame type.
(2) filter: filter by field. Takes a filter condition expression and returns a result of DataFrame type. It works the same as where.
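A sketch of both forms on toy data; the condition string is the same kind of expression that would follow SQL's where keyword:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("where-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a"), (2, "b"), (3, "b")).toDF("id", "c1")

val w = jdbcDF.where("id = 1 or c1 = 'b'")   // where with and/or
val f = jdbcDF.filter("id = 1 or c1 = 'b'")  // filter behaves the same
w.show()
```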
2. Querying specified fields

(1) select: get the specified field values. Gets the values of the specified fields according to the passed-in String-type field names and returns a result of DataFrame type. There is also an overloaded select method that takes not String-type parameters but Column-type parameters; it can implement logic such as `select id, id+1 from test`. Column objects can be obtained with the col method, though using the apply method is simpler.
(2) selectExpr: special handling can be performed on the specified fields, such as directly calling a UDF on a specified field, or specifying an alias. Takes String-type parameters and returns a DataFrame object. Example: give the c3 field an alias.
(3) col: get a specified field. Only one field can be obtained at a time; the returned object is of Column type. val idCol = jdbcDF.col("id")
(4) apply: get a specified field. Only one field can be obtained at a time; the returned object is of Column type.
(5) drop: remove the specified field and keep the other fields. Returns a new DataFrame object that does not contain the removed field; only one field can be removed at a time.
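The five field-selection methods above can be sketched together on toy data (the alias `time` mirrors the article's c3 example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("select-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a", "2017-01-01"), (2, "b", "2017-01-02")).toDF("id", "c1", "c3")

val s1 = jdbcDF.select("id", "c3")                     // by String field names
val s2 = jdbcDF.select(jdbcDF("id"), jdbcDF("id") + 1) // Column form: select id, id+1
val s3 = jdbcDF.selectExpr("id", "c3 as time")         // expression with an alias

val idCol  = jdbcDF.col("id")   // col: returns a Column
val idCol2 = jdbcDF("id")       // apply: same thing, shorter

val dropped = jdbcDF.drop("c1") // remove one field, keep the rest
```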
3. limit

The limit method gets the first n rows of the specified DataFrame as a new DataFrame object. Unlike take and head, the limit method is not an Action.
4. order by

(1) orderBy and sort: sort by the specified fields, ascending by default. Example 1: sort by a specified field; prefixing a - gives descending order. orderBy and sort are used the same way. Example 2: sort by a field-name string; this form sorts in ascending order only.
(2) sortWithinPartitions: similar to the sort method above, the difference being that the sortWithinPartitions method returns a DataFrame sorted within each partition.
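A sketch of the sorting variants (and of limit, from the previous section) on toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("sort-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((3, 2.0), (1, 9.5), (2, 4.2)).toDF("id", "c4")

val asc   = jdbcDF.orderBy("c4")             // field-name string: ascending only
val desc  = jdbcDF.orderBy(- jdbcDF("c4"))   // minus sign on a numeric Column: descending
val desc2 = jdbcDF.sort(jdbcDF("c4").desc)   // equivalent, via Column.desc
val top2  = jdbcDF.limit(2)                  // transformation, not an Action
val partSorted = jdbcDF.sortWithinPartitions("c4") // sorted within each partition
```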
5. group by

(1) groupBy: group by fields. The groupBy method can be called in two ways: you can pass in String-type field names, or Column-type objects.
(2) cube and rollup: group-by extensions. Functionally similar to SQL's group by cube/rollup; details omitted.
(3) GroupedData object. The methods above return a GroupedData-type object. The GroupedData API provides a set of operations on the groups, for example:

max(colNames: String*): gets the maximum of the specified fields (or all numeric-type fields) within each group; applies only to numeric fields.
min(colNames: String*): gets the minimum of the specified fields (or all numeric-type fields) within each group; applies only to numeric fields.
mean(colNames: String*): gets the average of the specified fields (or all numeric-type fields) within each group; applies only to numeric fields.
sum(colNames: String*): gets the sum of the specified fields (or all numeric-type fields) within each group; applies only to numeric fields.
count(): gets the number of elements in each group.

Examples of running results are omitted. The two more complicated methods here are agg, which is similar to the agg method introduced below and can be used to aggregate specified fields, and pivot.
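A sketch of groupBy in both calling forms, plus a couple of the GroupedData aggregates, on toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("groupby-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq(("a", 1, 1.0), ("a", 3, 2.0), ("b", 5, 4.0)).toDF("c1", "c2", "c4")

val counts = jdbcDF.groupBy("c1").count()           // String field name
val maxes  = jdbcDF.groupBy(jdbcDF("c1")).max("c2", "c4") // Column object, numeric max

counts.show()
maxes.show()
```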
6. distinct

(1) distinct: returns a DataFrame that does not contain duplicate records; that is, it returns the unique Row records in the current DataFrame. This method gives the same result as calling dropDuplicates() with no arguments.
(2) dropDuplicates: deduplicate by the specified fields. Deduplicates according to the given fields, similar to `select distinct a, b`.
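A sketch of both deduplication forms on toy data containing one exact duplicate row:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("distinct-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "c1")

val uniq = jdbcDF.distinct()                  // drop fully duplicate rows
val byC1 = jdbcDF.dropDuplicates(Seq("c1"))   // dedupe on c1 only, like select distinct
```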
7. Aggregation

Aggregation operations call the agg method, which has a variety of usages. The following example is one of the simplest and most intuitive uses: take the maximum of the id field and the sum of the c4 field.
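That usage can be sketched with agg's (column -> function) pair form on toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("agg-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, 1.0), (2, 2.5), (3, 4.0)).toDF("id", "c4")

// max of id, sum of c4, in a single pass
val result = jdbcDF.agg("id" -> "max", "c4" -> "sum")
result.show()
```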
8. union

The unionAll method combines two DataFrames, similar to the UNION ALL operation in SQL.
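A sketch of the combination; note that unionAll is the Spark 1.x name, and in Spark 2+ it is deprecated in favor of union, which this example uses:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("union-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a"), (2, "b")).toDF("id", "c1")

// unionAll in Spark 1.x; union in Spark 2+. Duplicates are kept, as in UNION ALL.
val combined = jdbcDF.union(jdbcDF.limit(1))
```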
9. join

Here comes the key part. In the SQL language the join operation is used a great deal, and DataFrame provides join functionality as well. The six overloaded join methods in the DataFrame are introduced next.

(1) Cartesian product.
(2) using one field: the join below is similar to the `a join b using column1` form, and requires the two DataFrames to have a column with the same name. jdbcDF is joined with joinDF2 on the id field; note that in the result, the using field appears only once.
(3) using multiple fields: in addition to the single-field using case above, you can also use multiple fields for the join.
(4) specifying the join type: the join operation between two DataFrames has the inner, outer, left_outer, right_outer, and leftsemi types. In the multiple-field using join example above, you can pass a third String-type parameter that specifies the join type, as shown below.
(5) using a Column-type join field: if you do not use the using mode, you can flexibly specify the join fields in the following form.
(6) specifying the join type at the same time as a Column-type join field, as shown below.
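The six forms can be sketched together on toy data (the joinDF2 contents are made up; crossJoin is used for the Cartesian product since Spark 2.1+, where a bare join() without a condition requires cross joins to be enabled):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("join-demo").getOrCreate()
import spark.implicits._

val jdbcDF  = Seq((1, "a"), (2, "b")).toDF("id", "name")
val joinDF2 = Seq((1, "x"), (3, "y")).toDF("id", "name")

val cartesian = jdbcDF.crossJoin(joinDF2)               // (1) Cartesian product
val usingOne  = jdbcDF.join(joinDF2, "id")              // (2) using one field; id appears once
val usingMany = jdbcDF.join(joinDF2, Seq("id", "name")) // (3) using multiple fields
val typed     = jdbcDF.join(joinDF2, Seq("id"), "left_outer")            // (4) with a join type
val byColumn  = jdbcDF.join(joinDF2, jdbcDF("id") === joinDF2("id"))     // (5) Column join field
val byColTyped = jdbcDF.join(joinDF2, jdbcDF("id") === joinDF2("id"), "inner") // (6) both at once
```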
10. Getting statistics on specified fields

The stat method can be used to compute statistics between fields or on specified fields, such as variance, covariance, and so on. This method returns a DataFrameStatFunctions-type object. The following demonstrates, for the c4 field, finding the field values whose frequency of occurrence is above 30%. In jdbcDF, the content of the c1 field is "a, b, a, c, d, b", where a and b both appear at a frequency of 2/6, which is greater than 0.3.
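That frequent-items query can be sketched with stat.freqItems on the "a, b, a, c, d, b" data; note that freqItems is an approximate algorithm, so the result is guaranteed to include every value above the support threshold but may also include some false positives:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("stat-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq("a", "b", "a", "c", "d", "b").map(Tuple1.apply).toDF("c1")

// values of c1 that occur with frequency > 30%
val freq = jdbcDF.stat.freqItems(Seq("c1"), 0.3)
freq.show()
```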
11. Getting records common to two DataFrames

The intersect method computes the records that are present in both DataFrames.
12. Getting records that one DataFrame has and another does not

The except method returns the records that are present in the current DataFrame but absent from the other DataFrame.
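Both set-style operations can be sketched on a pair of overlapping toy DataFrames:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("set-demo").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "c1")
val df2 = Seq((2, "b"), (3, "c")).toDF("id", "c1")

val common  = df1.intersect(df2) // rows in both: (2, "b")
val onlyIn1 = df1.except(df2)    // rows in df1 but not df2: (1, "a")
```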
13. Operating on field names

(1) withColumnRenamed: rename the specified field in the DataFrame. If the specified field name does not exist, nothing is done.
(2) withColumn: add a column to the current DataFrame. The withColumn(colName: String, col: Column) method adds a column named colName to the DataFrame; if colName already exists, it overwrites the current column. The following code adds a new column to jdbcDF.
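Both field-name operations can be sketched on toy data; the new names idx and id2 here are illustrative placeholders, not from the original article:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("rename-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "a")).toDF("id", "c1")

val renamed = jdbcDF.withColumnRenamed("id", "idx")      // rename id to idx
val noop    = jdbcDF.withColumnRenamed("no_such", "x")   // nonexistent name: no change
val withId2 = jdbcDF.withColumn("id2", jdbcDF("id"))     // add a column named id2
```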
14. Splitting one row into multiple rows

Sometimes you need to generate multiple rows based on the contents of a field; you can then use explode. The following splits the contents of the c3 field on spaces and stores the segmented content in a new field c3_, as shown below.
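The article's Spark 1.x form was DataFrame.explode(inputColumn, outputColumn){ ... }, which is deprecated in newer Spark; the sketch below achieves the same effect with the functions.explode and split helpers (toy data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().master("local[1]").appName("explode-demo").getOrCreate()
import spark.implicits._

val jdbcDF = Seq((1, "x y z"), (2, "p q")).toDF("id", "c3")

// split c3 on spaces, then emit one row per segment into the new field c3_
val exploded = jdbcDF.withColumn("c3_", explode(split(jdbcDF("c3"), " ")))
exploded.show()
```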
15. Other operations

The API also has na, randomSplit, repartition, alias, as, and other methods, to be added later.