Spark fails to parse a json object with multiple lines

Env:

Spark 1.3.1

Symptom:

Spark fails to parse a json object with multiple lines.

This issue can happen either when creating a DataFrame:
val people = sqlContext.jsonFile(path)
or when creating a temporary table using Spark SQL:
CREATE TEMPORARY TABLE jsonTable2
USING org.apache.spark.sql.json
OPTIONS (
  path "/xxx/test2.json"
);
A sample error message is:
java.lang.RuntimeException: Failed to parse record   "array" : [ {.
Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.

Root Cause:

As mentioned in the Spark documentation: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
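To see why this fails, here is a minimal Python sketch (not Spark code) that simulates the per-line parsing the pre-2.0 json data source performs; the helper name `parse_per_line` is made up for illustration:

```python
import json

# Spark's json data source (pre-2.0) treats each LINE of the file as one
# record. Simulate that per-line parsing with Python's json module.
multi_line = """{
  "array" : [ { "count" : "site1" } ]
}"""

single_line = '{ "array" : [ { "count" : "site1" } ] }'

def parse_per_line(text):
    results = []
    for line in text.splitlines():
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)  # this line alone is not valid JSON
    return results

print(parse_per_line(multi_line))   # every fragment fails in isolation
print(parse_per_line(single_line))  # parses as one complete record
```

Each fragment of the pretty-printed object (`{`, `"array" : [ ... ]`, `}`) is not valid JSON on its own, which is exactly what the "Failed to parse record" error reports.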

Solution:

Convert the JSON object from multiple lines to a single line.
For example, convert the JSON object below:
{
  "array" : [ {
    "count" : "site1",
    "sitename" : "sitename1"
  }, {
    "count" : "site2",
    "sitename" : "sitename2"
  } ]
}
to the single-line form below:
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
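The conversion can be scripted rather than done by hand. A minimal Python sketch, assuming the input file is one pretty-printed JSON document (the function name and paths are placeholders):

```python
import json

def flatten_json_file(src_path, dst_path):
    """Rewrite a pretty-printed JSON file as a single line."""
    with open(src_path) as src:
        obj = json.load(src)        # parse the whole multi-line document
    with open(dst_path, "w") as dst:
        dst.write(json.dumps(obj))  # json.dumps emits one line by default
        dst.write("\n")
```

Pointing Spark at the flattened output file then avoids the parse error.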
After that, Spark can successfully parse it. For example:
spark-sql> CREATE TEMPORARY TABLE jsonTable
         > USING org.apache.spark.sql.json
         > OPTIONS (
         >   path "/xxx/test_spark.json"
         > )
         > ;
Time taken: 3.738 seconds
 
spark-sql> select * from jsonTable ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 1.4 seconds, Fetched 1 row(s)
 
spark-sql> select array[0].count,array[0].sitename from jsonTable;
site1 sitename1
Time taken: 0.184 seconds, Fetched 1 row(s)

You can also put multiple single-line JSON objects into one file.
For example:
# cat test_spark_multiple.json
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
The query result would be:
spark-sql>  CREATE TEMPORARY TABLE jsonTable3
         > USING org.apache.spark.sql.json
         > OPTIONS (
         >   path "/xxx/test_spark_multiple.json"
         > );
Time taken: 0.234 seconds
 
spark-sql> select * from jsonTable3 ;
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]
Time taken: 0.153 seconds, Fetched 5 row(s)
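A file in this one-object-per-line layout (often called "JSON Lines") can also be generated programmatically. A minimal Python sketch, with a made-up helper name and placeholder records matching the example above:

```python
import json

# Five identical sample records, mirroring test_spark_multiple.json above.
records = [
    {"array": [{"count": "site1", "sitename": "sitename1"},
               {"count": "site2", "sitename": "sitename2"}]}
    for _ in range(5)
]

def write_json_lines(records, path):
    """Write one self-contained JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # each record stays on one line
```

Because `json.dumps` never emits newlines by default, every record is guaranteed to occupy exactly one line, which is the layout the json data source expects.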
