Difference between Spark HiveContext and SQLContext

Goal:

This article explains the difference between Spark HiveContext and SQLContext.

Env:

The tests below were done on Spark 1.5.2.

Solution:

Per the Spark SQL programming guide, HiveContext is a superset of SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
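For instance, HiveContext can register and call a classic Hive UDF. Below is a minimal sketch; hiveContext is created later in this article, and the table name "people" is hypothetical:

// Register a UDF class that ships with Hive itself
hiveContext.sql("CREATE TEMPORARY FUNCTION my_lower AS 'org.apache.hadoop.hive.ql.udf.UDFLower'")
// Use it like a built-in function ("people" is a hypothetical Hive table)
hiveContext.sql("SELECT my_lower(name) FROM people").show()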

For example, one key difference is that with HiveContext you can use the new window function feature.
In Spark 1.5.2, once you launch spark-shell, the default SQL context is already a HiveContext, although the startup message still says "SQL context":

SQL context available as sqlContext.
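You can confirm this from the shell, assuming a pre-built Spark distribution (which includes Hive support):

scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res0: Boolean = true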
To avoid confusion, let's create 2 separate contexts explicitly in spark-shell:
import org.apache.spark.sql.hive.HiveContext
// In spark-shell, sc is already a SparkContext, so no JavaSparkContext conversion is needed
val hiveContext = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
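The examples below query a Hive table src with columns (key, value). If it does not exist yet, it can be created and loaded the same way the Spark SQL programming guide does, assuming the kv1.txt sample file from the Spark source is available:

// Create and load the sample table used by the queries below
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")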
HiveContext can use window functions:
scala> hiveContext.sql("SELECT value, dense_rank() OVER (PARTITION BY key ORDER BY value DESC) as rank FROM src").collect
res7: Array[org.apache.spark.sql.Row] = Array([val_431,1], [val_431,1], [val_431,1], [val_431,1], [val_431,1], [val_431,1], [val_432,1], [val_432,1], [val_33,1], [val_33,1], [val_233,1], [val_233,1], [val_233,1], [val_233,1], [val_34,1], [val_34,1], [val_35,1], [val_35,1], [val_35,1], [val_35,1], [val_35,1], [val_35,1], [val_235,1], [val_235,1], [val_435,1], [val_435,1], [val_436,1], [val_436,1], [val_37,1], [val_37,1], [val_37,1], [val_37,1], [val_237,1], [val_237,1], [val_237,1], [val_237,1], [val_437,1], [val_437,1], [val_238,1], [val_238,1], [val_238,1], [val_238,1], [val_438,1], [val_438,1], [val_438,1], [val_438,1], [val_438,1], [val_438,1], [val_239,1], [val_239,1], [val_239,1], [val_239,1], [val_439,1], [val_439,1], [val_439,1], [val_439,1], [val_41,1], [val_41,1], [val_241,1], ...
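The same ranking can also be written with the DataFrame window API (introduced in Spark 1.4), which in 1.5 likewise needs a HiveContext underneath. A sketch, using the Spark 1.5 function name denseRank() (renamed dense_rank() in later releases):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank values within each key, highest value first
val w = Window.partitionBy("key").orderBy(col("value").desc)
hiveContext.table("src").select(col("value"), denseRank().over(w).as("rank")).show()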
SQLContext cannot:
scala> sqlContext.sql("SELECT value, dense_rank() OVER (PARTITION BY key ORDER BY value DESC) as rank FROM src").collect
java.lang.RuntimeException: [1.33] failure: ``union'' expected but `(' found
 
SELECT value, dense_rank() OVER (PARTITION BY key ORDER BY value DESC) as rank FROM src
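This is purely a parser limitation: SQLContext still handles standard (non-HiveQL) SQL on registered temporary tables. A minimal sketch with hypothetical in-memory data:

// Build a small DataFrame and register it as a temporary table
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("key", "value")
df.registerTempTable("kv")
sqlContext.sql("SELECT key, value FROM kv WHERE key > 1").collect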


