Spark operators: RDD key-value transformation operations (4) - cogroup, join

Cogroup

## The parameter is one other RDD
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
## The parameters are two other RDDs
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
## The parameters are three other RDDs
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
cogroup is equivalent to a full outer join in SQL: it returns one record for every key that appears in any of the RDDs, with an empty Iterable in the position of any RDD that does not contain that key.
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partition function.
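The numPartitions variant appears in the example below; the partitioner variant is not demonstrated there, so here is a minimal sketch of passing an explicit HashPartitioner (the variable name grouped is illustrative; HashPartitioner ships with Spark core):

import org.apache.spark.HashPartitioner

// Sketch: cogroup with an explicit partitioner instead of a partition count.
var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
var grouped = rdd1.cogroup(rdd2, new HashPartitioner(4))
grouped.partitions.size   // 4: the result follows the supplied partitioner, not the inputs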
## Example where the parameter is one RDD

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)

scala> var rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[12] at cogroup at <console>:25

scala> rdd3.partitions.size
res3: Int = 2

scala> rdd3.collect
res1: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)

scala> var rdd4 = rdd1.cogroup(rdd2, 3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[14] at cogroup at <console>:25

scala> rdd4.partitions.size
res5: Int = 3

scala> rdd4.collect
res6: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(C,(CompactBuffer(3),CompactBuffer(c))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(D,(CompactBuffer(),CompactBuffer(d))))
## Example where the parameters are two RDDs

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
scala> var rdd3 = sc.makeRDD(Array(("A", "A"), ("E", "E")), 2)

scala> var rdd4 = rdd1.cogroup(rdd2, rdd3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] =
MapPartitionsRDD[17] at cogroup at <console>:27

scala> rdd4.partitions.size
res7: Int = 2

scala> rdd4.collect
res9: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())),
(A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))),
(C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())),
(E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
  20.  
## The example where the parameters are three RDDs is omitted; it works the same as above (a minimal sketch follows).
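For completeness, here is what the three-RDD call looks like, reusing rdd1, rdd2, and rdd3 from the example above; the names rddW3 and grouped are illustrative:

// Sketch: cogroup with three other RDDs; each key maps to four Iterables,
// one per RDD, empty where the key is absent.
var rddW3 = sc.makeRDD(Array(("A", "x")), 2)
var grouped = rdd1.cogroup(rdd2, rdd3, rddW3)
// grouped: RDD[(String, (Iterable[String], Iterable[String], Iterable[String], Iterable[String]))]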

Join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
join is equivalent to an inner join in SQL: only keys that appear in both RDDs produce results. join associates exactly two RDDs; to associate more than two, apply it repeatedly (see the sketch after the example below).
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partition function.
  1. ("A", "1"), ("B", "2"), ("C", "3")), 2)
  2. ("A", "a"), ("C", "c"), ("D", "d")), 2)
  3.  
  4. Scala> rdd1.join (rdd2) .collect
  5. Res10: Array [(String, (String, String))] = Array ((A, (1, a)), (C, (3, c)))
  6.  
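As noted above, join only associates two RDDs at a time, so a three-way association chains two joins. A minimal sketch (the names a, b, and c are illustrative):

// Sketch: joining three RDDs by applying join twice. Each join keeps only
// keys present on both sides, so only keys in all three RDDs survive,
// and the values nest as the joins chain.
var a = sc.makeRDD(Array(("A", "1"), ("B", "2")), 2)
var b = sc.makeRDD(Array(("A", "a"), ("B", "b")), 2)
var c = sc.makeRDD(Array(("A", "x")), 2)
a.join(b).join(c).collect
// Array((A,((1,a),x))) : Array[(String, ((String, String), String))]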
For more information on Spark operators, refer to the Spark operator series: http://lxw1234.com/archives/2015/07/363.htm
