Spark operators: RDD key-value transformation operations (4) - cogroup, join

Cogroup

## The parameter is one other RDD
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
## The parameters are two other RDDs
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
## The parameters are three other RDDs
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
cogroup is equivalent to a full outer join in SQL: it returns one record for every key that appears in any of the RDDs, with an empty Iterable in the position of any RDD that does not contain that key.
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partition function.
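The numPartitions variant appears in the example below; the partitioner variant is not demonstrated there, so here is a minimal sketch of passing an explicit HashPartitioner (the variable name grouped is illustrative; HashPartitioner ships with Spark core):

import org.apache.spark.HashPartitioner

// Sketch: cogroup with an explicit partitioner instead of a partition count.
var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
var grouped = rdd1.cogroup(rdd2, new HashPartitioner(4))
grouped.partitions.size   // 4: the result follows the supplied partitioner, not the inputs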
## Example where the parameter is one RDD

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)

scala> var rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[12] at cogroup at <console>:25

scala> rdd3.partitions.size
res3: Int = 2

scala> rdd3.collect
res1: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)

scala> var rdd4 = rdd1.cogroup(rdd2, 3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[14] at cogroup at <console>:25

scala> rdd4.partitions.size
res5: Int = 3

scala> rdd4.collect
res6: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(C,(CompactBuffer(3),CompactBuffer(c))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(D,(CompactBuffer(),CompactBuffer(d))))
## Example where the parameters are two RDDs

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
scala> var rdd3 = sc.makeRDD(Array(("A", "A"), ("E", "E")), 2)

scala> var rdd4 = rdd1.cogroup(rdd2, rdd3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] =
MapPartitionsRDD[17] at cogroup at <console>:27

scala> rdd4.partitions.size
res7: Int = 2

scala> rdd4.collect
res9: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())),
(A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))),
(C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())),
(E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
  20.  
## The example where the parameters are three RDDs is omitted; it works the same as above (a minimal sketch follows).
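For completeness, here is what the three-RDD call looks like, reusing rdd1, rdd2, and rdd3 from the example above; the names rddW3 and grouped are illustrative:

// Sketch: cogroup with three other RDDs; each key maps to four Iterables,
// one per RDD, empty where the key is absent.
var rddW3 = sc.makeRDD(Array(("A", "x")), 2)
var grouped = rdd1.cogroup(rdd2, rdd3, rddW3)
// grouped: RDD[(String, (Iterable[String], Iterable[String], Iterable[String], Iterable[String]))]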

Join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
join is equivalent to an inner join in SQL: only keys that appear in both RDDs produce results. join associates exactly two RDDs; to associate more than two, apply it repeatedly (see the sketch after the example below).
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partition function.
  1. ("A", "1"), ("B", "2"), ("C", "3")), 2)
  2. ("A", "a"), ("C", "c"), ("D", "d")), 2)
  3.  
  4. Scala> rdd1.join (rdd2) .collect
  5. Res10: Array [(String, (String, String))] = Array ((A, (1, a)), (C, (3, c)))
  6.  
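As noted above, join only associates two RDDs at a time, so a three-way association chains two joins. A minimal sketch (the names a, b, and c are illustrative):

// Sketch: joining three RDDs by applying join twice. Each join keeps only
// keys present on both sides, so only keys in all three RDDs survive,
// and the values nest as the joins chain.
var a = sc.makeRDD(Array(("A", "1"), ("B", "2")), 2)
var b = sc.makeRDD(Array(("A", "a"), ("B", "b")), 2)
var c = sc.makeRDD(Array(("A", "x")), 2)
a.join(b).join(c).collect
// Array((A,((1,a),x))) : Array[(String, ((String, String), String))]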
For more information on Spark operators, refer to the Spark operator series: http://lxw1234.com/archives/2015/07/363.htm
