Spark function to explain: cache

cache() stores an RDD at the default MEMORY_ONLY storage level; internally it simply calls the persist() function. Note that caching is lazy: the RDD is only materialized in memory the first time an action computes it. The official documentation defines it as:
Persist this RDD with the default storage level (`MEMORY_ONLY`).

Function prototype

def cache(): this.type

Example (spark-shell):
scala> var data = sc.parallelize(List(1,2,3,4))
data: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[44] at parallelize at <console>:12
 
scala> data.getStorageLevel
res65: org.apache.spark.storage.StorageLevel =
  StorageLevel(false, false, false, false, 1)
 
scala> data.cache
res66: org.apache.spark.rdd.RDD[Int] =
  ParallelCollectionRDD[44] at parallelize at <console>:12
 
scala> data.getStorageLevel
res67: org.apache.spark.storage.StorageLevel =
  StorageLevel(false, true, false, true, 1)
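The five fields printed by getStorageLevel are, in order: useDisk, useMemory, useOffHeap, deserialized, and replication. So StorageLevel(false, true, false, true, 1) means "in memory, deserialized, one replica", which is exactly MEMORY_ONLY. A minimal sketch of the cache/persist equivalence, assuming a SparkContext `sc` is already available (as it is in spark-shell):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(List(1, 2, 3, 4))

// data.cache() is shorthand for this call:
data.persist(StorageLevel.MEMORY_ONLY)

// The level now matches MEMORY_ONLY:
// useDisk=false, useMemory=true, useOffHeap=false,
// deserialized=true, replication=1
println(data.getStorageLevel == StorageLevel.MEMORY_ONLY)

// Nothing is actually cached until an action runs:
data.count()

// persist() throws if you try to change the storage level of an
// already-persisted RDD; call unpersist() first to free it.
data.unpersist()
```

A practical consequence of the laziness: calling cache() and then getStorageLevel only tells you what level *will* be used; the blocks appear in the web UI's Storage tab only after the first action materializes them.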
 
