Spark fails to parse a JSON object with multiple lines
Env:
Spark 1.3.1

Symptom:
Spark fails to parse a JSON object with multiple lines.
This issue can happen when either creating a DataFrame using:

    val people = sqlContext.jsonFile(path)

or creating a temporary table in Spark SQL using:

    CREATE TEMPORARY TABLE jsonTable2
    USING org.apache.spark.sql.json
    OPTIONS (
      path "/xxx/test2.json"
    );

The error message looks like:

    java.lang.RuntimeException: Failed to parse record "array" : [ {. Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.

Root Cause:
As mentioned in the Spark documentation:
"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."

Solution:
Convert the JSON object from multiple lines to a single line.
For example, convert the JSON object below:

    {
      "array" : [ {
        "count" : "site1",
        "sitename" : "sitename1"
      }, {
        "count" : "site2",
        "sitename" : "sitename2"
      } ]
    }

to:

    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

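The conversion can also be scripted. The following is a minimal Python sketch and not part of the original setup: it assumes the input file holds exactly one JSON document, and the function name `to_single_line` is hypothetical. It re-serializes the document with the standard `json` module, which emits no line breaks when no indentation is requested.

```python
import json


def to_single_line(src_path, dst_path):
    """Rewrite a file holding one (possibly multi-line) JSON document as one line.

    Hypothetical helper for illustration; assumes the whole file is a single
    JSON value.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        # json.load parses the whole file, so line breaks in the input are fine.
        document = json.load(src)

    with open(dst_path, "w", encoding="utf-8") as dst:
        # Without an indent argument, json.dumps produces a single line;
        # newlines inside string values are escaped as \n.
        dst.write(json.dumps(document))
        dst.write("\n")
```

Calling `to_single_line(src, dst)` with the multi-line file as `src` writes a file at `dst` whose only line is a self-contained JSON object.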
Once the object is on a single line, Spark SQL can parse the file:

    spark-sql> CREATE TEMPORARY TABLE jsonTable
             > USING org.apache.spark.sql.json
             > OPTIONS (
             > path "/xxx/test_spark.json"
             > )
             > ;
    Time taken: 3.738 seconds
    spark-sql> select * from jsonTable ;
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    Time taken: 1.4 seconds, Fetched 1 row(s)
    spark-sql> select array[0].count, array[0].sitename from jsonTable;
    site1 sitename1
    Time taken: 0.184 seconds, Fetched 1 row(s)

You can also put multiple single-line JSON objects into one file.
For example:

    # cat test_spark_multiple.json
    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }
    { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

Each line of the file becomes one row:

    spark-sql> CREATE TEMPORARY TABLE jsonTable3
             > USING org.apache.spark.sql.json
             > OPTIONS (
             > path "/xxx/test_spark_multiple.json"
             > );
    Time taken: 0.234 seconds
    spark-sql> select * from jsonTable3 ;
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    [{ "count" : "site1" , "sitename" : "sitename1" },{ "count" : "site2" , "sitename" : "sitename2" }]
    Time taken: 0.153 seconds, Fetched 5 row(s)
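
Before loading a file, it can help to check that every line parses on its own, since that is the property the Spark documentation asks for. The sketch below is a Python approximation of that check and not Spark's actual parser. The function name `find_bad_lines` is hypothetical, and skipping blank lines is an assumption rather than documented Spark behaviour.

```python
import json


def find_bad_lines(path):
    """Return (line_number, text) pairs for lines that are not self-contained JSON.

    Hypothetical helper; it approximates the one-object-per-line requirement
    with Python's json module. Blank lines are skipped (assumption).
    """
    bad = []
    with open(path, "r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.strip()
            if not text:
                continue
            try:
                value = json.loads(text)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                bad.append((number, text))
                continue
            # The error message asks for a JSON object or an array of JSON
            # objects, so bare scalars such as 42 are reported as well.
            if not isinstance(value, (dict, list)):
                bad.append((number, text))
    return bad
```

An empty list means every non-blank line is a standalone JSON object or array. Any reported line number points at a fragment that needs to be merged into a single line first.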