Spark operators: RDD key-value transformation operations (3) - groupByKey, reduceByKey, reduceByKeyLocally

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This function groups the V values for each key K in an RDD[(K, V)] into a single collection Iterable[V].
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partitioning function (a sketch exercising these two overloads follows the example below).
  1. ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
  2. Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [89] at makeRDD at: 21
  3.  
  4. Scala> rdd1.groupByKey (). Collect
  5. (B, CompactBuffer (2, 1)), (C, CompactBuffer (1))) ()
  6.  
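The example above only exercises the no-argument overload. Below is a minimal sketch of the other two overloads, assuming the same rdd1 and a spark-shell session (so sc is available); groupByKey(numPartitions) simply builds a HashPartitioner internally, so the two calls are equivalent:

  import org.apache.spark.HashPartitioner

  // Overload 2: request 2 output partitions; Spark creates a HashPartitioner(2) internally.
  val grouped2 = rdd1.groupByKey(2)

  // Overload 3: pass the Partitioner explicitly; equivalent to the call above.
  val groupedP = rdd1.groupByKey(new HashPartitioner(2))

  grouped2.partitions.size   // 2
  groupedP.partitions.size   // 2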

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This function merges the V values for each key K in an RDD[(K, V)] using the given reduce function func.
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partitioning function.
  1. ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
  2. Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [91] at makeRDD at: 21
  3.  
  4. Scala> rdd1.partitions.size
  5. Res82: Int = 15
  6.  
  7. Scala> var rdd2 = rdd1.reduceByKey ((x, y) => x + y)
  8. Rdd2: org.apache.spark.rdd.RDD [(String, Int)] = ShuffledRDD [94] at reduceByKey at: 23
  9.  
  10. Scala> rdd2.collect
  11. Res85: Array [(String, Int)] = Array ((A, 2), (B, 3), (C, 1))
  12.  
  13. Scala> rdd2.partitions.size
  14. Res86: Int = 15
  15.  
  16. Scala> var rdd2 = rdd1.reduceByKey (new org.apache.spark.HashPartitioner (2), (x, y) => x + y)
  17. Rdd2: org.apache.spark.rdd.RDD [(String, Int)] = ShuffledRDD [95] at reduceByKey at: 23
  18.  
  19. Scala> rdd2.collect
  20. Res87: Array [(String, Int)] = Array ((B, 3), (A, 2), (C, 1))
  21.  
  22. Scala> rdd2.partitions.size
  23. Res88: Int = 2
  24.  
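One point worth noting, shown here as a hedged sketch rather than session output: reduceByKey pre-aggregates values inside each partition before the shuffle (a map-side combine), while the equivalent groupByKey pipeline ships every (key, value) pair across the network and only then reduces. Assuming the same rdd1 as above:

  // Both pipelines compute the per-key sums Array((A,2), (B,3), (C,1)).
  val sums1 = rdd1.reduceByKey(_ + _)              // combines within partitions, then shuffles
  val sums2 = rdd1.groupByKey().mapValues(_.sum)   // shuffles all pairs, then sums
  sums1.collect.sorted sameElements sums2.collect.sorted   // true

For large data sets with many duplicate keys, the reduceByKey form is usually the cheaper of the two.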

reduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
This function merges the V values for each key K in an RDD[(K, V)] using the given reduce function, but it is an action: the result is returned to the driver as a scala.collection.Map[K, V] rather than as a new RDD[(K, V)].
  1. ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
  2. Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [91] at makeRDD at: 21
  3.  
  4. Scala> rdd1.reduceByKeyLocally ((x, y) => x + y)
  5. Res90: scala.collection.Map [String, Int] = Map (B -> 3, A -> 2, C -> 1)
  6.  
  7.  
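For a result this small, reduceByKeyLocally behaves like reducing in the cluster and then collecting; a minimal sketch of that equivalence, assuming the same rdd1 as above. Because reduceByKeyLocally merges the per-partition maps on the driver, the whole result must fit in driver memory:

  // Both expressions yield Map(A -> 2, B -> 3, C -> 1) as a scala.collection.Map[String,Int].
  val local  = rdd1.reduceByKeyLocally(_ + _)          // action: final merge happens on the driver
  val viaRdd = rdd1.reduceByKey(_ + _).collectAsMap()  // reduce in the cluster, then collect
  local == viaRdd   // true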
