Spark functions explained: cache

May 30, 2017

cache() marks an RDD to be cached using the MEMORY_ONLY storage level. Internally it simply calls the persist() function. The official documentation defines it as:

Persist this RDD with the default storage level (`MEMORY_ONLY`).

Function prototype

def cache(): this.type
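To make the relationship between cache() and persist() concrete, here is a minimal sketch. It assumes a local Spark installation; the object name, the helper `levelsMatch`, the `local[2]` master and the application name are illustrative choices, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  // Hypothetical helper: checks that cache() and persist(MEMORY_ONLY)
  // leave an RDD with the same storage level, and that cache()
  // returns the very RDD it was called on (hence the this.type return type).
  def levelsMatch(sc: SparkContext): Boolean = {
    val viaCache = sc.parallelize(List(1, 2, 3, 4))
    val returned = viaCache.cache()
    val viaPersist = sc.parallelize(List(1, 2, 3, 4)).persist(StorageLevel.MEMORY_ONLY)

    (returned eq viaCache) &&
      viaCache.getStorageLevel == StorageLevel.MEMORY_ONLY &&
      viaPersist.getStorageLevel == viaCache.getStorageLevel
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-vs-persist")
    val sc = new SparkContext(conf)
    try {
      println(s"cache() behaves like persist(MEMORY_ONLY): ${levelsMatch(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

Because cache() returns the RDD itself, it can be chained directly onto the expression that builds the RDD, for example `sc.parallelize(...).cache()`.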

Example

scala> var data = sc.parallelize(List(1,2,3,4))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at <console>:12

scala> data.getStorageLevel
res65: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)

scala> data.cache
res66: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at <console>:12

scala> data.getStorageLevel
res67: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
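The five values printed for a StorageLevel in the transcript above are, in order, useDisk, useMemory, useOffHeap, deserialized and replication. The sketch below names them explicitly and also shows two things the transcript does not: cache() only marks the RDD, so the data is actually stored when the first action runs, and unpersist() resets the storage level. The object name, the helper `describe`, the `local[2]` master and the application name are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLifecycle {
  // Hypothetical helper: spells out the fields of a StorageLevel by name.
  def describe(level: StorageLevel): String =
    s"useDisk=${level.useDisk}, useMemory=${level.useMemory}, " +
      s"useOffHeap=${level.useOffHeap}, deserialized=${level.deserialized}, " +
      s"replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-lifecycle")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List(1, 2, 3, 4))

      // cache() is lazy: it only records the requested storage level.
      data.cache()
      println(describe(data.getStorageLevel))

      // The first action computes the partitions and stores them in memory;
      // later actions on the same RDD can reuse the cached partitions.
      data.count()
      data.count()

      // unpersist() drops the cached blocks and resets the level to NONE.
      data.unpersist()
      println(data.getStorageLevel == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```

Calling unpersist() once the RDD is no longer needed frees executor memory for other cached data.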
