Writing Spark DataFrame data to a Hive partitioned table

Between Spark 1.2 and Spark 1.3, Spark SQL replaced SchemaRDD with DataFrame. DataFrame changed considerably relative to SchemaRDD, while providing a more useful and convenient API.
When DataFrame data is written to Hive, it goes to Hive's default database, because insertInto takes no database parameter. This article writes data to a Hive table, or to a Hive table partition, in the following ways, for reference only.
1. Writing DataFrame data to a Hive table. In the DataFrame class, the Hive-related write APIs are the following:

registerTempTable(tableName: String): Unit
insertInto(tableName: String): Unit
insertInto(tableName: String, overwrite: Boolean): Unit
saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

There are many overloads that are not enumerated here.
registerTempTable creates a Spark temporary table from the DataFrame.
insertInto writes data into an existing table; note that it cannot specify the database or the partition, so it cannot write to a partition directly.
Writing data to the Hive data warehouse requires specifying the database. The Hive table can be created in Hive itself, or with hiveContext.sql("create table ....").
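For example, a minimal sketch of creating the target table up front with hiveContext.sql, assuming the Person(name, col1, col2) schema used below; the table name is a placeholder:

// Hypothetical DDL matching the Person fields used in the example below.
hiveContext.sql("create table if not exists tableName (name string, col1 int, col2 string)")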
The following code writes data to a table in the specified database:
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
// The input path is a placeholder; the original line was truncated.
val data = sc.textFile("path/to/data").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().insertInto("tableName")
Create a case class to convert the RDD elements into the case class type, then convert to a DataFrame with toDF and call insertInto. By first selecting the database with the hiveContext.sql("use DataBaseName") statement, the DataFrame data is written into the Hive table.
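The insertInto(tableName, overwrite) overload listed above can replace the table contents instead of appending, for example:

// Overwrite the existing data in the table rather than appending to it.
data.toDF().insertInto("tableName", overwrite = true)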

2. Writing DataFrame data to a specified Hive table partition
The Hive table can be created in Hive itself, or with hiveContext.sql("create table ...."). The storage formats supported by saveAsTable are limited: the default is parquet, and json can be specified; if another format is needed, create the Hive table with a create table statement instead.
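For instance, a minimal sketch of the saveAsTable overload listed above with the json source; the table name and save mode are illustrative:

import org.apache.spark.sql.SaveMode
// Save the DataFrame as a table backed by json files; "jsonTable" is a placeholder name.
data.toDF().saveAsTable("jsonTable", "json", SaveMode.Overwrite, Map.empty[String, String])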
To write the data to a partitioned table, the idea is to first register the DataFrame as a temporary table, and then write the data into the Hive partition with a hiveContext.sql statement. The specific operation is as follows:
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
// The input path is a placeholder; the original line was truncated.
val data = sc.textFile("path/to/data").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition(date='2015-04-02') select name,col1,col2 from table1")
With the method above, DataFrame data is written into the Hive partitioned table.
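This assumes table2 already exists as a partitioned table; a minimal sketch of the hypothetical DDL, with columns mirroring the Person schema and the partition column named date to match the insert statement above:

// Hypothetical partitioned target table; adjust names and types as needed.
hiveContext.sql("create table if not exists table2 (name string, col1 int, col2 string) partitioned by (date string)")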
