Spark operator: RDD key conversion operation (3) -groupByKey, reduceByKey, reduceByKeyLocally

mai 26, 2017

GroupByKey

Def groupByKey (): RDD [(K, Iterable [V])]
Def groupByKey (numPartitions: Int): RDD [(K, Iterable [V])]
Def groupByKey (partitioner: Partitioner): RDD [(K, Iterable [V])]
This function is used to combine the V values corresponding to each K in RDD [K, V] into a set Iterable [V]
The numPartitions parameter is used to specify the number of partitions;
The parameter partitioner is used to specify the partition function;


  ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
 Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [89] at makeRDD at: 21
 
 Scala> rdd1.groupByKey (). Collect
 (B, CompactBuffer (2, 1)), (C, CompactBuffer (1))) ()

ReduceByKey

Def reduceByKey (func: (V, V) => V): RDD [(K, V)]
Def reduceByKey (func: (V, V) => V, numPartitions: Int): RDD [(K, V)]
Def reduceByKey (partitioner: Partitioner, func: (V, V) => V): RDD [(K, V)]
This function is used to calculate the V value corresponding to each K in RDD [K, V] according to the mapping function.
The numPartitions parameter is used to specify the number of partitions;
The parameter partitioner is used to specify the partition function;


  ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
 Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [91] at makeRDD at: 21
 
 Scala> rdd1.partitions.size
 Res82: Int = 15
 
 Scala> var rdd2 = rdd1.reduceByKey ((x, y) => x + y)
 Rdd2: org.apache.spark.rdd.RDD [(String, Int)] = ShuffledRDD [94] at reduceByKey at: 23
 
 Scala> rdd2.collect
 Res85: Array [(String, Int)] = Array ((A, 2), (B, 3), (C, 1))
 
 Scala> rdd2.partitions.size
 Res86: Int = 15
 
 Scala> var rdd2 = rdd1.reduceByKey (new org.apache.spark.HashPartitioner (2), (x, y) => x + y)
 Rdd2: org.apache.spark.rdd.RDD [(String, Int)] = ShuffledRDD [95] at reduceByKey at: 23
 
 Scala> rdd2.collect
 Res87: Array [(String, Int)] = Array ((B, 3), (A, 2), (C, 1))
 
 Scala> rdd2.partitions.size
 Res88: Int = 2

ReduceByKeyLocally

Def reduceByKeyLocally (func: (V, V) => V): Map [K, V]
This function calculates the V value corresponding to each K in RDD [K, V] according to the mapping function. The result is mapped to a Map [K, V] instead of RDD [K, V].


  ("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1) ))
 Rdd1: org.apache.spark.rdd.RDD [(String, Int)] = ParallelCollectionRDD [91] at makeRDD at: 21
 
 Scala> rdd1.reduceByKeyLocally ((x, y) => x + y)
 Res90: scala.collection.Map [String, Int] = Map (B -> 3, A -> 2, C -> 1)

Rechercher dans ce blog

Big data