Spark fails to parse a json object with multiple lines
Spark fails to parse a json object with multiple lines
Env:
Spark 1.3.1Symptom:
Spark fails to parse a json object with multiple lines.This issue can happen when either creating a DataFrame using:
1
| val people = sqlContext.jsonFile(path) |
1
2
3
4
5
| CREATE TEMPORARY TABLE jsonTable2USING org.apache.spark.sql.jsonOPTIONS ( path "/xxx/test2.json"); |
1
2
| java.lang.RuntimeException: Failed to parse record "array" : [ {. Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects. |
Root Cause:
As mentioned in Spark Documentation:Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.Solution:
Convert the json object from multiple lines to a single line.For example, convert below json object:
1
2
3
4
5
6
7
8
9
| { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, { "count" : "site2", "sitename" : "sitename2" } ]} |
1
| { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| spark-sql> CREATE TEMPORARY TABLE jsonTable > USING org.apache.spark.sql.json > OPTIONS ( > path "/xxx/test_spark.json" > ) > ;Time taken: 3.738 secondsspark-sql> select * from jsonTable ;[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]Time taken: 1.4 seconds, Fetched 1 row(s)spark-sql> select array[0].count,array[0].sitename from jsonTable;site1 sitename1Time taken: 0.184 seconds, Fetched 1 row(s) |
You can also put multiple single-line json objects into one file.
For example:
1
2
3
4
5
6
| # cat test_spark_multiple.json{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| spark-sql> CREATE TEMPORARY TABLE jsonTable3 > USING org.apache.spark.sql.json > OPTIONS ( > path "/xxx/test_spark_multiple.json" > );Time taken: 0.234 secondsspark-sql> select * from jsonTable3 ;[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}][{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}][{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}][{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}][{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]Time taken: 0.153 seconds, Fetched 5 row(s) |
Commentaires
Enregistrer un commentaire