Spark function explained: cache

cache() stores the RDD at the MEMORY_ONLY storage level; internally it simply calls the persist() function with the default level. The official documentation defines it as:
Persist this RDD with the default storage level (`MEMORY_ONLY`).

Function prototype

def cache(): this.type
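To illustrate the delegation described above, here is a minimal, Spark-free sketch: `FakeRDD`, its `persist`/`cache` methods, and the `StorageLevel` case class are all stand-ins written for this example, not the actual Spark source.

```scala
object CacheSketch {
  // Hypothetical stand-in for org.apache.spark.storage.StorageLevel:
  // the same five fields, in the same order as the REPL prints them.
  case class StorageLevel(useDisk: Boolean, useMemory: Boolean,
                          useOffHeap: Boolean, deserialized: Boolean,
                          replication: Int = 1)
  val NONE        = StorageLevel(false, false, false, false)
  val MEMORY_ONLY = StorageLevel(false, true, false, true)

  class FakeRDD {
    private var level: StorageLevel = NONE

    // persist() with no argument applies the default level
    def persist(newLevel: StorageLevel = MEMORY_ONLY): this.type = {
      level = newLevel
      this
    }

    // cache() is nothing more than persist() with the default level
    def cache(): this.type = persist()

    def getStorageLevel: StorageLevel = level
  }
}
```

Calling `cache()` on a `FakeRDD` moves its level from `NONE` to `MEMORY_ONLY`, mirroring the `getStorageLevel` output in the REPL session below.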
 -----------------------------
scala> var data = sc.parallelize(List(1,2,3,4))
data: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[44] at parallelize at <console>:12
 
scala> data.getStorageLevel
res65: org.apache.spark.storage.StorageLevel =
  StorageLevel(false, false, false, false, 1)
 
scala> data.cache
res66: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[44] at parallelize at <console>:12
 
scala> data.getStorageLevel
res67: org.apache.spark.storage.StorageLevel =
  StorageLevel(false, true, false, true, 1)
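The five values printed by `getStorageLevel` follow the order of the StorageLevel constructor fields: useDisk, useMemory, useOffHeap, deserialized, replication. The sketch below decodes the two REPL results above; the `Level` case class is a standalone stand-in used only for this illustration, no Spark required.

```scala
object StorageLevelFields {
  // Field names follow the Spark StorageLevel constructor order.
  case class Level(useDisk: Boolean, useMemory: Boolean,
                   useOffHeap: Boolean, deserialized: Boolean,
                   replication: Int)

  // Before cache(): nothing stored (corresponds to StorageLevel.NONE).
  val beforeCache = Level(false, false, false, false, 1)

  // After cache(): in memory, deserialized, one replica (MEMORY_ONLY).
  val afterCache = Level(false, true, false, true, 1)

  def main(args: Array[String]): Unit = {
    println(s"stored in memory:  ${afterCache.useMemory}")
    println(s"kept deserialized: ${afterCache.deserialized}")
  }
}
```

So `StorageLevel(false, true, false, true, 1)` reads as: not on disk, in memory, on-heap, deserialized Java objects, one replica, which is exactly MEMORY_ONLY.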
 
