Spark functions explained: checkpoint

Marks the current RDD for checkpointing. The RDD's contents are saved as binary files, one per partition, in the checkpoint directory, which is set with SparkContext.setCheckpointDir(). Once the checkpoint has been written, all references from the RDD to its parent RDDs (its lineage) are removed. Checkpointing does not happen immediately when checkpoint is called: an action must be run on the RDD to trigger it.

Function prototype

def checkpoint(): Unit
 
scala> val data = sc.parallelize(1 to 100000 , 15)
data: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[12] at parallelize at <console>:12
 
scala> sc.setCheckpointDir("/iteblog")
 
scala> data.checkpoint
 
scala> data.count
15/02/15 11:47:47 INFO RDDCheckpointData: Done checkpointing RDD 12 to
hdfs://iteblogcluster/iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12,
new parent is RDD 13
res17: Long = 100000
 
[iteblog.com@ ~]$ bin/hadoop fs -ls /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12
Found 15 items
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00000
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00001
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00002
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00003
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00004
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00005
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00006
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00007
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00008
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00009
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00010
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00011
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00012
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00013
-rw-r--r-- ... 2015-02-15 /iteblog/5f2053e9-a02f-4661-ad1d-2250a8473e92/rdd-12/part-00014
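
The lazy behaviour described above can also be observed from code rather than from the logs. The following is a minimal sketch, not taken from the original session: it assumes a local master (local[2]), a temporary local checkpoint directory instead of HDFS, and a hypothetical object name CheckpointSketch. It also caches the RDD first, since otherwise Spark recomputes the RDD when it writes the checkpoint.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns (checkpointed before the action?, element count, checkpointed after the action?).
  def run(): (Boolean, Long, Boolean) = {
    // Assumption: local mode with 2 threads; on a cluster the master comes from spark-submit.
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a throwaway local directory; on a cluster use an HDFS path instead.
      sc.setCheckpointDir(Files.createTempDirectory("checkpoint-sketch").toString)

      val squares = sc.parallelize(1 to 1000, 4).map(x => x * x)
      squares.cache()       // keep the data in memory so the checkpoint job need not recompute it
      squares.checkpoint()  // only marks the RDD; nothing is written yet

      val before = squares.isCheckpointed
      val n = squares.count()            // the action that triggers the checkpoint
      val after = squares.isCheckpointed

      // Inspect the lineage and the checkpoint location after the action.
      println(squares.toDebugString)
      println(squares.getCheckpointFile)
      (before, n, after)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under these assumptions, run() should return (false, 1000, true): the RDD is only marked when checkpoint is called, and it is reported as checkpointed once the count action has finished. Note that checkpoint has to be called before any job has been executed on the RDD.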
 
