Spark operators: RDD actions (5) - saveAsTextFile, saveAsSequenceFile, saveAsObjectFile

saveAsTextFile

def saveAsTextFile(path: String): Unit
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
saveAsTextFile stores the RDD as a text file in a file system.
The optional codec parameter specifies the compression codec class to use.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  scala> rdd1.saveAsTextFile("hdfs://cdh5/tmp/lxw1234.com/")  // save to HDFS

  hadoop fs -ls /tmp/lxw1234.com
  Found 2 items
  -rw-r--r--   2 lxw1234 supergroup          0 2015-07-10 09:15 /tmp/lxw1234.com/_SUCCESS
  -rw-r--r--   2 lxw1234 supergroup         21 2015-07-10 09:15 /tmp/lxw1234.com/part-00000

  hadoop fs -cat /tmp/lxw1234.com/part-00000
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
Note: if you save to the local file system with rdd1.saveAsTextFile("file:///tmp/lxw1234.com"), each Executor writes only to the local directory of the machine it runs on.
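If you need all the data in a single file on the driver machine, one workaround is to collect the elements to the driver and write them there yourself. A minimal sketch, reasonable only for RDDs small enough to fit in driver memory; the local path is made up for illustration:

  import java.io.PrintWriter
  // Collect to the driver (safe only for small RDDs), then write a local file.
  val pw = new PrintWriter("/tmp/lxw1234.com.txt")  // hypothetical local path
  rdd1.collect().foreach(x => pw.println(x))
  pw.close()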
// specify the compression format when saving
  rdd1.saveAsTextFile("hdfs://cdh5/tmp/lxw1234.com/", classOf[com.hadoop.compression.lzo.LzopCodec])

  hadoop fs -ls /tmp/lxw1234.com
  -rw-r--r--   2 lxw1234 supergroup          0 2015-07-10 09:20 /tmp/lxw1234.com/_SUCCESS
  -rw-r--r--   2 lxw1234 supergroup         71 2015-07-10 09:20 /tmp/lxw1234.com/part-00000.lzo

  hadoop fs -text /tmp/lxw1234.com/part-00000.lzo
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
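Reading the compressed output back needs no extra code: sc.textFile selects the decompression codec from the file extension. A minimal sketch, assuming the LZO codec is registered in the Hadoop configuration as above:

  // sc.textFile decompresses .lzo files transparently via the Hadoop codec factory.
  val rdd2 = sc.textFile("hdfs://cdh5/tmp/lxw1234.com/")
  rdd2.collect().foreach(println)  // prints 1 through 10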

saveAsSequenceFile

saveAsSequenceFile saves an RDD of key-value pairs to HDFS in the SequenceFile format.
Its usage is the same as that of saveAsTextFile; a sketch follows below.
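A minimal sketch, with a made-up /seq/ path: saveAsSequenceFile applies to RDDs of key-value pairs whose types can be converted to Hadoop Writables (e.g. String, Int).

  // Write a pair RDD as a SequenceFile.
  var rdd2 = sc.makeRDD(Array(("A", 1), ("B", 2), ("C", 3)), 2)
  rdd2.saveAsSequenceFile("hdfs://cdh5/tmp/lxw1234.com/seq/")

  // Read it back, stating the key and value types explicitly.
  sc.sequenceFile[String, Int]("hdfs://cdh5/tmp/lxw1234.com/seq/").collect()
  // Array((A,1), (B,2), (C,3))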

saveAsObjectFile

def saveAsObjectFile(path: String): Unit
saveAsObjectFile serializes the elements of the RDD and stores them in a file.
On HDFS, the output is written as a SequenceFile by default.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  rdd1.saveAsObjectFile("hdfs://cdh5/tmp/lxw1234.com/")

  hadoop fs -cat /tmp/lxw1234.com/part-00000
  SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable...  (binary SequenceFile header)
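Objects written this way can be read back with sc.objectFile, stating the element type. A minimal sketch, reusing the path above:

  // Deserialize the elements back into an RDD[Int].
  val rdd2 = sc.objectFile[Int]("hdfs://cdh5/tmp/lxw1234.com/")
  rdd2.collect()  // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)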
