Spark operators: basic RDD transformations (3) - randomSplit, glom
randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
This function splits one RDD into multiple RDDs according to the given weights. The weights parameter is an array of Doubles; the second parameter is the random seed, which can usually be left at its default.
- scala> var rdd = sc.makeRDD(1 to 10, 10)
- rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21
- scala> rdd.collect
- res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
- scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
- splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23,
- MapPartitionsRDD[18] at randomSplit at <console>:23,
- MapPartitionsRDD[19] at randomSplit at <console>:23,
- MapPartitionsRDD[20] at randomSplit at <console>:23)
- // Note: the result of randomSplit is an array of RDDs
- scala> splitRDD.size
- res8: Int = 4
- // Since the first argument to randomSplit contains four values, the original RDD is split into
- // four RDDs according to the weights 1.0, 2.0, 3.0, 4.0. The higher an RDD's weight, the higher
- // the probability that any given element ends up in it.
- // Note: the weights are normalized if they do not sum to 1.
- scala> splitRDD(0).collect
- res10: Array[Int] = Array(1, 4)
- scala> splitRDD(1).collect
- res11: Array[Int] = Array(3)
- scala> splitRDD(2).collect
- res12: Array[Int] = Array(5, 9)
- scala> splitRDD(3).collect
- res13: Array[Int] = Array(2, 6, 7, 8, 10)
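A typical use of randomSplit is dividing a dataset into training and test sets. Below is a minimal sketch (the variable names data, training, and test are illustrative, and it assumes a SparkContext named sc as in the session above); passing a fixed seed makes the split reproducible across runs:

- // A reproducible ~70/30 train/test split; randomSplit normalizes the weights
- val data = sc.makeRDD(1 to 1000)
- val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)
- // Every element of data lands in exactly one of the two resulting RDDs:
- // training.count() + test.count() == 1000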
glom
def glom(): RDD[Array[T]]
This function collects the elements of type T in each partition of the RDD into an Array[T], so that each partition of the result contains exactly one array element.
- scala> var rdd = sc.makeRDD(1 to 10, 3)
- rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21
- scala> rdd.partitions.size
- res33: Int = 3
- // the RDD has 3 partitions
- scala> rdd.glom().collect
- res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
- // glom puts the elements of each partition into an array, so the result is three arrays
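Beyond inspecting partition contents, glom is handy for per-partition aggregation. A minimal sketch, reusing the rdd from the session above (it assumes no partition is empty, since max on an empty array would throw):

- // One array per partition; map over the arrays to get each partition's maximum
- val maxPerPartition = rdd.glom().map(arr => arr.max).collect()
- // With partitions (1,2,3), (4,5,6), (7,8,9,10) as above, this yields Array(3, 6, 10)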