Spark operators: key-value RDD transformations (3) - groupByKey, reduceByKey, reduceByKeyLocally
GroupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This operator groups the V values for each key K in an RDD[K, V] into a single collection Iterable[V].
The parameter numPartitions specifies the number of partitions;
The parameter partitioner specifies the partitioner to use.
- scala> var rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at <console>:21
- scala> rdd1.groupByKey().collect
- res: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
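The grouping semantics can be illustrated with plain Scala collections (a local sketch only, not Spark's distributed implementation):

```scala
// Local sketch of groupByKey semantics (illustration only, no Spark):
// collect the values of each key into one collection.
val pairs = Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1))

// analogous to rdd.groupByKey(): map each key to all of its values
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

println(grouped("A")) // List(0, 2)
```

Note that in Spark itself, when the goal is to aggregate the grouped values, reduceByKey is generally preferred over groupByKey, because it combines values on the map side before the shuffle.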
ReduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This operator merges the V values for each key K in an RDD[K, V] using the given reduce function func.
The parameter numPartitions specifies the number of partitions;
The parameter partitioner specifies the partitioner to use.
- scala> var rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
- scala> rdd1.partitions.size
- res82: Int = 15
- scala> var rdd2 = rdd1.reduceByKey((x, y) => x + y)
- rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at <console>:23
- scala> rdd2.collect
- res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
- scala> rdd2.partitions.size
- res86: Int = 15
- scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2), (x, y) => x + y)
- rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at <console>:23
- scala> rdd2.collect
- res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
- scala> rdd2.partitions.size
- res88: Int = 2
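The reduction above can likewise be sketched with plain Scala collections (a local illustration of the semantics, not Spark's implementation):

```scala
// Local sketch of reduceByKey semantics (illustration only, no Spark):
// merge the values of each key with a binary function, here addition.
val pairs = Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1))

val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }

println(reduced("B")) // 3
```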
ReduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
This operator merges the V values for each key K in an RDD[K, V] using the given reduce function func, but returns the result to the driver as a Map[K, V] instead of an RDD[K, V].
- scala> var rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
- scala> rdd1.reduceByKeyLocally((x, y) => x + y)
- res90: scala.collection.Map[String, Int] = Map(B -> 3, A -> 2, C -> 1)
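Since reduceByKeyLocally returns an ordinary Map on the driver, its semantics can be sketched as a simple fold over a local collection (reduceByKeyLocal below is a hypothetical helper for illustration, not part of Spark):

```scala
// Local sketch of reduceByKeyLocally semantics (illustration only):
// fold the pairs into a Map, merging values of the same key with func.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(func(_, v)).getOrElse(v))
  }

val result = reduceByKeyLocal(Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))(_ + _)
println(result == Map("A" -> 2, "B" -> 3, "C" -> 1)) // true
```

Because the whole result must fit in driver memory, this operator is only suitable when the number of distinct keys is modest.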