Spark operators: RDD actions (5) - saveAsTextFile, saveAsSequenceFile, saveAsObjectFile

saveAsTextFile

def saveAsTextFile(path: String): Unit
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
saveAsTextFile stores an RDD as a text file in a file system.
The codec parameter specifies a compression codec class to apply to the output.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  scala> rdd1.saveAsTextFile("hdfs://cdh5/tmp/lxw1234.com/") // save to HDFS

  hadoop fs -ls /tmp/lxw1234.com
  Found 2 items
  -rw-r--r--   2 lxw1234 supergroup   0 2015-07-10 09:15 /tmp/lxw1234.com/_SUCCESS
  -rw-r--r--   2 lxw1234 supergroup  21 2015-07-10 09:15 /tmp/lxw1234.com/part-00000

  hadoop fs -cat /tmp/lxw1234.com/part-00000
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10

Note: if you save to a local file system with rdd1.saveAsTextFile("file:///tmp/lxw1234.com"), each partition is written to the local directory of the machine where its Executor runs; the output is not gathered on any single machine.
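If you do need all of the output in one place, a common workaround is to reduce the RDD to a single partition before saving. A minimal sketch, assuming a running SparkContext sc (the path is illustrative):

  // A minimal sketch: coalesce(1) yields a single part file, though it is still
  // written on whichever machine runs the task. Only reasonable for small RDDs,
  // since all data moves through one executor.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  rdd1.coalesce(1).saveAsTextFile("file:///tmp/lxw1234.com.single/")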
// save with a specified compression codec
  rdd1.saveAsTextFile("hdfs://cdh5/tmp/lxw1234.com/", classOf[com.hadoop.compression.lzo.LzopCodec])

  hadoop fs -ls /tmp/lxw1234.com
  -rw-r--r--   2 lxw1234 supergroup   0 2015-07-10 09:20 /tmp/lxw1234.com/_SUCCESS
  -rw-r--r--   2 lxw1234 supergroup  71 2015-07-10 09:20 /tmp/lxw1234.com/part-00000.lzo

  hadoop fs -text /tmp/lxw1234.com/part-00000.lzo
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10

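Reading the compressed output back needs nothing special on the Spark side: sc.textFile resolves the codec from the file extension through the Hadoop input format. A minimal sketch, assuming the LZO codec is still configured on the cluster:

  // A minimal sketch: the .lzo extension is resolved to LzopCodec, provided the
  // codec is registered on the classpath (e.g. via io.compression.codecs).
  val back = sc.textFile("hdfs://cdh5/tmp/lxw1234.com/")
  back.collect().foreach(println) // prints 1 .. 10 as strings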

saveAsSequenceFile

saveAsSequenceFile saves an RDD to HDFS in the SequenceFile format.
Its usage is the same as saveAsTextFile.
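Note that saveAsSequenceFile is defined on RDDs of key-value pairs whose types can be converted to Hadoop Writables (it comes from SequenceFileRDDFunctions). A minimal sketch, with an illustrative path:

  // A minimal sketch, assuming a running SparkContext `sc`; the path is illustrative.
  // saveAsSequenceFile needs a pair RDD whose key and value types map to Writables.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  rdd1.map(x => (x, x * 100)).saveAsSequenceFile("hdfs://cdh5/tmp/lxw1234.com/seq/")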

saveAsObjectFile

def saveAsObjectFile(path: String): Unit
saveAsObjectFile serializes the elements of an RDD and saves them to a file.
On HDFS, the output is written in SequenceFile format by default.
  var rdd1 = sc.makeRDD(1 to 10, 2)
  rdd1.saveAsObjectFile("hdfs://cdh5/tmp/lxw1234.com/")

  hadoop fs -cat /tmp/lxw1234.com/part-00000
  SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritableT
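The counterpart for reading such output back is sc.objectFile, which deserializes the elements again. A minimal sketch, with the element type given explicitly:

  // A minimal sketch: read the serialized elements back; the path matches the
  // save above, and the type parameter tells Spark what to deserialize into.
  val restored = sc.objectFile[Int]("hdfs://cdh5/tmp/lxw1234.com/")
  restored.collect().foreach(println) // 1 .. 10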
