Spark SQL DataFrame
First, generating the DataFrame object
Spark SQL can generate DataFrame objects from other RDDs, Parquet files, JSON files, Hive tables, and JDBC-based relational databases. This article uses a MySQL database as the data source, generates a DataFrame object, and then performs various operations on that DataFrame. The code for generating the DataFrame used in this article can be found at:
http://m.blog.csdn.net/article/details?id=52802150
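The generation code itself is behind the link above; as a rough sketch (assuming Spark 2.x, a local MySQL database named `testdb`, and placeholder credentials — adjust all of these to your environment), it might look like:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder connection details -- replace with your own.
val spark = SparkSession.builder()
  .appName("spark-sql-dataframe-demo")
  .master("local[*]")
  .getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "spark_sql_test")
  .option("user", "root")
  .option("password", "password")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()
```

The examples below assume this `jdbcDF` with fields such as id, c1, c2, c3, c4.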
Second, Actions on the DataFrame object
1, show: display data
Displays the data of jdbcDF in tabular form, similar in effect to select * from spark_sql_test. There are four ways to call it:
(1) show: shows only the first 20 records.
(2) show(numRows: Int): shows the first numRows records.
(3) show(truncate: Boolean): whether to truncate strings longer than 20 characters; the default is true.
(4) show(numRows: Int, truncate: Boolean): combines the two above, specifying both how many records to display and whether to truncate long strings.
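The four call forms can be sketched as follows (assuming the `jdbcDF` defined earlier):

```scala
jdbcDF.show()           // first 20 rows, long strings truncated to 20 characters
jdbcDF.show(3)          // first 3 rows
jdbcDF.show(false)      // first 20 rows, strings shown in full
jdbcDF.show(3, false)   // first 3 rows, strings shown in full
```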
2, collect: get all data as an array
Unlike the show method above, collect retrieves all of jdbcDF's data and returns it as an Array[Row]. Each record is a GenericRowWithSchema object, which stores both the field names and the field values.
3, collectAsList: get all data as a List
Similar in function to collect, but returns the results as a java.util.List object.
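A small sketch of both methods (assuming `jdbcDF` has an integer id field):

```scala
import org.apache.spark.sql.Row

val rows: Array[Row] = jdbcDF.collect()
val rowList: java.util.List[Row] = jdbcDF.collectAsList()

// Each Row carries its schema, so fields can be read by name:
rows.foreach(r => println(r.getAs[Int]("id")))
```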
4, describe(cols: String*): compute statistics for the specified fields
This method takes one or more String field names and returns a DataFrame of summary statistics (count, mean, stddev, min, max) for the numeric fields among them. In the example below, c1 is a character field, c2 an integer field, and c4 a floating-point field.
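A sketch of that call (mean/stddev are only meaningful for the numeric fields; for a character field like c1 they come back null):

```scala
// Summary statistics for one character field and two numeric fields.
jdbcDF.describe("c1", "c2", "c4").show()
```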
5, first, head, take, takeAsList: get a number of rows
(1) first: gets the first row. (2) head: gets the first row; head(n: Int) gets the first n rows. (3) take(n: Int): gets the first n rows. (4) takeAsList(n: Int): gets the first n rows as a List. first and head behave the same; take returns an Array[Row] and takeAsList a List[Row]. Because take and takeAsList bring the data back to the Driver, requesting too much data can cause an OutOfMemoryError on the Driver, so use them with care.
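The four methods side by side (assuming `jdbcDF` from above):

```scala
import org.apache.spark.sql.Row

val r1: Row = jdbcDF.first()              // first row
val r2: Row = jdbcDF.head()               // same as first
val h3: Array[Row] = jdbcDF.head(3)       // first 3 rows as an Array
val t3: Array[Row] = jdbcDF.take(3)       // first 3 rows as an Array
val l3: java.util.List[Row] = jdbcDF.takeAsList(3) // first 3 rows as a List
```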
Third, conditional queries, joins, and other operations on the DataFrame object
The methods below all return a DataFrame, so they can be chained.
1, where: conditional filtering
(1) where(conditionExpr: String): takes the same kind of expression as the part after SQL's where keyword, and supports and and or. Returns a DataFrame.
(2) filter: filter by a condition expression; returns a DataFrame. Used the same way as where.
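Both forms in one sketch (field names as assumed earlier):

```scala
jdbcDF.where("id = 1 or c1 = 'b'").show()
jdbcDF.filter("id = 1 or c1 = 'b'").show()  // same result as where
```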
2, querying specified fields
(1) select: gets the values of the specified fields, given their names as Strings; returns a DataFrame.
There is also an overloaded select method that takes Column parameters instead of Strings; with it you can express logic such as select id, id+1 from test. Column objects can be obtained with the apply and col methods; apply is the simpler of the two.
(2)
selectExpr: performs special handling on the specified fields, such as calling a UDF directly or assigning an alias; takes String parameters and returns a DataFrame. Example: query the id field, alias the c3 field as time, and round the c4 field.
(3) col: gets the specified field. Only one field can be obtained; the return value is of type Column. Example: val idCol = jdbcDF.col("id")
(4) apply: gets the specified field. Only one field can be obtained; the return value is of type Column.
(5) drop: removes the specified field, keeping the others. Returns a new DataFrame that does not contain the removed field; only one field can be removed at a time.
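The field-selection methods above can be sketched together (assuming `jdbcDF` from above):

```scala
jdbcDF.select("id", "c3").show()                      // select by field name
jdbcDF.select(jdbcDF("id"), jdbcDF("id") + 1).show()  // Column form: select id, id+1
jdbcDF.selectExpr("id", "c3 as time", "round(c4)").show()

val idCol1 = jdbcDF.col("id")   // col method
val idCol2 = jdbcDF("id")       // apply method, shorter

jdbcDF.drop("c1").show()        // every field except c1
```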
3, limit
The limit method takes the first n rows of the DataFrame and returns a new DataFrame. Unlike take and head, limit is not an Action.
4, order by
(1) orderBy and sort: sort by the specified fields, ascending by default. Example 1: sort by a specified Column; prefix it with - for descending order. sort is used the same way as orderBy.
Example 2: sort by field name strings.
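Both examples as a sketch:

```scala
jdbcDF.orderBy(-jdbcDF("c4")).show()      // descending by c4, using unary minus
jdbcDF.orderBy(jdbcDF("c4").desc).show()  // equivalent descending form
jdbcDF.orderBy("c4").show()               // by field name string, ascending
jdbcDF.sort(jdbcDF("c4").desc).show()     // sort works the same way as orderBy
```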
(2) sortWithinPartitions: similar to the sort method above, except that it returns a DataFrame sorted within each partition.
5, group by
(1) groupBy: group by the specified fields. groupBy can be called in two ways: with String field names, or with Column objects.
(2) cube and rollup: group-by with cube/rollup, similar to SQL's group by cube/rollup; details are omitted here.
(3) The GroupedData object: the methods above return a GroupedData object, whose API provides aggregation operations on the groups, for example:
- max(colNames: String*): the maximum of the specified numeric fields (or of all numeric fields) in each group; applies only to numeric fields
- min(colNames: String*): the minimum of the specified numeric fields (or of all numeric fields) in each group; applies only to numeric fields
- mean(colNames: String*): the average of the specified numeric fields (or of all numeric fields) in each group; applies only to numeric fields
- sum(colNames: String*): the sum of the specified numeric fields (or of all numeric fields) in each group; applies only to numeric fields
- count(): the number of elements in each group
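A sketch of grouping and aggregating (assuming the fields used earlier):

```scala
import org.apache.spark.sql.functions.{max, min}

jdbcDF.groupBy("c1").count().show()           // group by String field name
jdbcDF.groupBy(jdbcDF("c1")).max("c4").show() // group by Column object
jdbcDF.groupBy("c1").agg(max("c4"), min("c2")).show() // several aggregates at once
```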
The two more complicated methods here are agg, which is similar to the agg operation described below and can be applied to grouped fields, and pivot.
6, distinct
(1) distinct: returns a DataFrame with duplicate records removed, i.e. the unique Rows of the current DataFrame. It gives the same result as calling the dropDuplicates() method below with no fields.
(2) dropDuplicates: deduplicates by the specified fields, similar to select distinct a, b.
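Both deduplication forms as a sketch:

```scala
jdbcDF.distinct().show()
jdbcDF.dropDuplicates().show()           // no fields: same as distinct
jdbcDF.dropDuplicates(Seq("c1")).show()  // deduplicate on c1 only
```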
7, aggregation
Aggregation is done with the agg method, which has several overloads and is often used together with groupBy. The example below is the simplest and most intuitive usage: get the maximum of the id field and the sum of the c4 field.
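That aggregation as a sketch, in both the function form and the equivalent Map form:

```scala
import org.apache.spark.sql.functions.{max, sum}

jdbcDF.agg(max("id"), sum("c4")).show()
jdbcDF.agg(Map("id" -> "max", "c4" -> "sum")).show()  // equivalent Map form
```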
8, union
The unionAll method combines two DataFrames, similar to UNION ALL in SQL.
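A minimal sketch (since Spark 2.0 the same operation is also available as union; unionAll is kept here to match the article):

```scala
// Appends the rows of the second DataFrame; duplicates are kept.
jdbcDF.unionAll(jdbcDF.limit(1)).show()
```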
9, join
Now for the key part. Joins are used heavily in SQL, and DataFrame provides join functionality as well. The DataFrame has six overloaded join methods.
(1) Cartesian product
(2) using one field: the following join is similar in form to a join b using column1, and requires the two DataFrames to have a column with the same name. When joinDF1 and joinDF2 are joined on the id field, the using column appears only once in the result.
(3)
using multiple fields: besides the single-field case above, multiple fields can be used, as follows.
(4) specifying the join type: the join operations between two DataFrames support the inner, outer, left_outer, right_outer, and leftsemi types. In the multiple-field example above, a third String parameter can be added to specify the join type, as follows.
(5) joining on Column-type expressions: if you do not use the using form, the join fields can be specified flexibly, as follows.
(6) specifying the join type together with Column-type join fields, as follows.
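The six overloads sketched together. Here joinDF1 and joinDF2 are hypothetical DataFrames where both share columns id and name, and joinDF2 additionally has a t1_id column:

```scala
joinDF1.join(joinDF2)                              // (1) Cartesian product
joinDF1.join(joinDF2, "id")                        // (2) using one field
joinDF1.join(joinDF2, Seq("id", "name"))           // (3) using several fields
joinDF1.join(joinDF2, Seq("id", "name"), "inner")  // (4) with an explicit join type
joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"))               // (5) Column expression
joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"), "left_outer") // (6) expression + type
```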
10, statistics on specified fields
The stat method computes statistics on specified fields, such as variance and covariance; it returns an object of type DataFrameStatFunctions. The following example finds the values of field c1 that occur with a frequency above 30%. The contents of field c1 in jdbcDF are "a, b, a, c, d, b": a and b each occur with frequency 2/6, which is greater than 0.3.
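A sketch of that frequent-items query using the stat method's freqItems, with the 0.3 support threshold from the text:

```scala
// Values of c1 occurring in at least 30% of rows (here: a and b).
jdbcDF.stat.freqItems(Seq("c1"), 0.3).show()
```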
11, get the records common to two DataFrames
The intersect method computes the records that appear in both DataFrames.
12, get the records in one DataFrame that are not in another
This is done with the except method.
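Both set operations as a sketch (the second operand is just an illustrative subset of jdbcDF):

```scala
jdbcDF.intersect(jdbcDF.limit(1)).show()  // rows present in both DataFrames
jdbcDF.except(jdbcDF.limit(1)).show()     // rows in jdbcDF but not in the other
```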
13, operations on field names
(1) withColumnRenamed: renames the specified field in the DataFrame. If the specified field does not exist, nothing is done. The following example renames the id field of jdbcDF to idx.
(2) withColumn: adds a column to the current DataFrame. The withColumn(colName: String, col: Column) method adds a column named colName to the DataFrame; if colName already exists, the existing column is overwritten. The following code adds a column named id2 to jdbcDF.
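Both field-name operations as a sketch:

```scala
jdbcDF.withColumnRenamed("id", "idx").show()   // rename id to idx
jdbcDF.withColumn("id2", jdbcDF("id")).show()  // new column id2 copying id
```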
14, turning one row into multiple rows
Sometimes you need to generate multiple rows from the contents of a single field; the explode method can be used for this. The following splits the contents of the c3 field on spaces and stores each piece in a new field named c3_, as shown below.
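One way to sketch this, using the split and explode functions from org.apache.spark.sql.functions (the DataFrame.explode method from older Spark versions achieves the same effect):

```scala
import org.apache.spark.sql.functions.{explode, split}

// Each space-separated piece of c3 becomes its own row in column c3_.
jdbcDF.withColumn("c3_", explode(split(jdbcDF("c3"), " "))).show()
```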
15, other operations
The API also has the na, randomSplit, repartition, alias, and as methods, to be covered later.