Spark operators: RDD basic transformation operations (3) - randomSplit, glom

randomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

This function splits one RDD into multiple RDDs according to the given weights.

The weights parameter is an array of Doubles, one weight per output RDD.
The second parameter is the random seed; it can usually be left at its default value.
    scala> var rdd = sc.makeRDD(1 to 10, 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at :21

    scala> rdd.collect
    res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

    scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
    splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at :23,
    MapPartitionsRDD[18] at randomSplit at :23,
    MapPartitionsRDD[19] at randomSplit at :23,
    MapPartitionsRDD[20] at randomSplit at :23)

    // Note: the result of randomSplit is an array of RDDs
    scala> splitRDD.size
    res8: Int = 4

    // Because the first argument to randomSplit contains four values, the RDD is split into four RDDs.
    // The original RDD is divided randomly according to the weights 1.0, 2.0, 3.0, 4.0;
    // an element is more likely to end up in an RDD with a higher weight.
    // The weights are relative (Spark normalizes them internally), so they do not have to sum to 1,
    // and because the sampling is random the actual sizes only roughly follow the weights.

    scala> splitRDD(0).collect
    res10: Array[Int] = Array(1, 4)

    scala> splitRDD(1).collect
    res11: Array[Int] = Array(3)

    scala> splitRDD(2).collect
    res12: Array[Int] = Array(5, 9)

    scala> splitRDD(3).collect
    res13: Array[Int] = Array(2, 6, 7, 8, 10)
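A common use of randomSplit is to carve one dataset into a training set and a test set. The sketch below assumes nothing beyond the spark-shell sc used above; the 70/30 weights, the seed value and the variable names are only illustrative:

    // Split an RDD into a 70/30 training/test pair.
    // Passing an explicit seed makes the split reproducible across runs.
    val data = sc.makeRDD(1 to 1000, 4)
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Since the weights are normalized, Array(7.0, 3.0) would give the same split.
    println(s"training: ${training.count()}, test: ${test.count()}")

With the same seed and the same input, the two resulting RDDs are identical on every run, which makes experiments reproducible.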

glom

def glom(): RDD[Array[T]]

This function gathers the elements of type T in each partition of the RDD into an Array[T], so that the resulting RDD has exactly one array element per partition.



    scala> var rdd = sc.makeRDD(1 to 10, 3)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at :21

    scala> rdd.partitions.size
    res33: Int = 3   // the RDD has 3 partitions

    scala> rdd.glom().collect
    res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
    // glom puts the elements of each partition into an array, so the result here is three arrays
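Because glom materializes each partition as one in-memory array, it is handy for per-partition computations. A minimal sketch along those lines (the partition count and the choice of max as the statistic are only illustrative):

    // Compute a per-partition statistic with glom: the maximum value in each partition.
    val rdd = sc.makeRDD(1 to 10, 3)
    val perPartition = rdd.glom()   // RDD[Array[Int]]: one array per partition
    // Skip empty partitions, then take the local max of each array.
    val partitionMaxes = perPartition.filter(_.nonEmpty).map(_.max).collect()
    println(partitionMaxes.mkString(", "))   // 3, 6, 10 for this partition layout

Keep in mind that each partition must fit in memory as a single array, so glom is best used when the partitions are reasonably small.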
