Spark operators: RDD basic transformation operations (3) - randomSplit, glom

randomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

This function splits one RDD into multiple RDDs according to the given weights.

The weights parameter is an array of Doubles, one weight per output RDD.
The second parameter is the random seed; it can usually be left at its default value.
    scala> var rdd = sc.makeRDD(1 to 10, 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at :21

    scala> rdd.collect
    res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

    scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
    splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at :23,
    MapPartitionsRDD[18] at randomSplit at :23,
    MapPartitionsRDD[19] at randomSplit at :23,
    MapPartitionsRDD[20] at randomSplit at :23)

    // Note: the result of randomSplit is an array of RDDs
    scala> splitRDD.size
    res8: Int = 4

    // Because the first argument to randomSplit contains four values, the RDD is split into four RDDs.
    // The original RDD is divided randomly according to the weights 1.0, 2.0, 3.0, 4.0;
    // an element is more likely to end up in an RDD with a higher weight.
    // The weights are relative (Spark normalizes them internally), so they do not have to sum to 1,
    // and because the sampling is random the actual sizes only roughly follow the weights.

    scala> splitRDD(0).collect
    res10: Array[Int] = Array(1, 4)

    scala> splitRDD(1).collect
    res11: Array[Int] = Array(3)

    scala> splitRDD(2).collect
    res12: Array[Int] = Array(5, 9)

    scala> splitRDD(3).collect
    res13: Array[Int] = Array(2, 6, 7, 8, 10)
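A common use of randomSplit is to carve one dataset into a training set and a test set. The sketch below assumes nothing beyond the spark-shell sc used above; the 70/30 weights, the seed value and the variable names are only illustrative:

    // Split an RDD into a 70/30 training/test pair.
    // Passing an explicit seed makes the split reproducible across runs.
    val data = sc.makeRDD(1 to 1000, 4)
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Since the weights are normalized, Array(7.0, 3.0) would give the same split.
    println(s"training: ${training.count()}, test: ${test.count()}")

With the same seed and the same input, the two resulting RDDs are identical on every run, which makes experiments reproducible.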

glom

def glom(): RDD[Array[T]]

This function gathers the elements of type T in each partition of the RDD into an Array[T], so that the resulting RDD has exactly one array element per partition.



    scala> var rdd = sc.makeRDD(1 to 10, 3)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at :21

    scala> rdd.partitions.size
    res33: Int = 3   // the RDD has 3 partitions

    scala> rdd.glom().collect
    res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
    // glom puts the elements of each partition into an array, so the result here is three arrays
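Because glom materializes each partition as one in-memory array, it is handy for per-partition computations. A minimal sketch along those lines (the partition count and the choice of max as the statistic are only illustrative):

    // Compute a per-partition statistic with glom: the maximum value in each partition.
    val rdd = sc.makeRDD(1 to 10, 3)
    val perPartition = rdd.glom()   // RDD[Array[Int]]: one array per partition
    // Skip empty partitions, then take the local max of each array.
    val partitionMaxes = perPartition.filter(_.nonEmpty).map(_.max).collect()
    println(partitionMaxes.mkString(", "))   // 3, 6, 10 for this partition layout

Keep in mind that each partition must fit in memory as a single array, so glom is best used when the partitions are reasonably small.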
