Spark operators: basic RDD transformations (3) - randomSplit, glom

randomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

This function splits an RDD into multiple RDDs according to the given weights.

The weights parameter is an array of Double values, one weight per output RDD.
The second parameter is the random seed; it defaults to a random value and can usually be left unspecified.
  scala> var rdd = sc.makeRDD(1 to 10, 10)
  rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21

  scala> rdd.collect
  res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
  splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23,
  MapPartitionsRDD[18] at randomSplit at <console>:23,
  MapPartitionsRDD[19] at randomSplit at <console>:23,
  MapPartitionsRDD[20] at randomSplit at <console>:23)

  // Note: the result of randomSplit is an array of RDDs.
  scala> splitRDD.size
  res8: Int = 4

  // The weights array has four values, so the original RDD is split into four RDDs.
  // Elements are assigned randomly according to the weights 1.0, 2.0, 3.0, 4.0;
  // an RDD with a higher weight has a higher probability of receiving each element.
  // Note: the weights need not sum to 1; they are normalized internally.
  scala> splitRDD(0).collect
  res10: Array[Int] = Array(1, 4)

  scala> splitRDD(1).collect
  res11: Array[Int] = Array(3)

  scala> splitRDD(2).collect
  res12: Array[Int] = Array(5, 9)

  scala> splitRDD(3).collect
  res13: Array[Int] = Array(2, 6, 7, 8, 10)

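A common use of randomSplit is carving a dataset into training and test sets. The sketch below is illustrative and not from the original post: it assumes a spark-shell session where sc is already defined, and the 70/30 ratio and variable names are arbitrary. Passing an explicit seed makes the split reproducible across runs:

  // Split a dataset roughly 70/30. The weights are normalized internally,
  // so Array(7.0, 3.0) would behave the same way.
  val data = sc.makeRDD(1 to 100)
  val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

  // With a fixed seed, re-running the split yields the same assignment.
  println(s"train: ${train.count()}, test: ${test.count()}")

Each element lands in exactly one of the resulting RDDs, so the counts of the splits always add up to the size of the original RDD.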

glom

def glom(): RDD[Array[T]]

This function gathers the elements of type T in each partition of the RDD into an Array[T], so that the resulting RDD has exactly one array element per partition.



    scala> var rdd = sc.makeRDD(1 to 10, 3)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21

    scala> rdd.partitions.size
    res33: Int = 3  // the RDD has 3 partitions

    scala> rdd.glom().collect
    res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
    // glom puts the elements of each partition into one array, so the result is three arrays
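
One practical use of glom is per-partition computation: once a partition's elements sit in a single array, ordinary Scala collection methods apply to them locally. The following sketch is illustrative and again assumes sc from a spark-shell session; it computes the maximum of each partition and then the global maximum:

    // glom turns each partition into one Array[Int], so arr.max runs locally
    // per partition instead of element by element. Empty partitions are
    // filtered out so arr.max never sees an empty array.
    val rdd = sc.makeRDD(1 to 10, 3)
    val partitionMaxes = rdd.glom().filter(_.nonEmpty).map(arr => arr.max)
    println(partitionMaxes.collect().mkString(", "))        // 3, 6, 10 given the partitioning above
    val globalMax = partitionMaxes.reduce((a, b) => math.max(a, b))
    println(globalMax)                                      // 10

Keep in mind that glom materializes an entire partition as a single in-memory array, so it only suits partitions that fit comfortably in memory.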
