Spark operators: RDD key-value transformation operations (4) - cogroup, join
cogroup

## The parameter is one RDD

```scala
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
```
## The parameters are two RDDs

```scala
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
```
## The parameters are three RDDs

```scala
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
```
cogroup is equivalent to a full outer join in SQL: the result contains every key that appears in either RDD, and a key with no matching records in one RDD gets an empty Iterable on that side.
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partitioning function.
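To make the full-outer-join analogy concrete, here is a minimal sketch (assuming a spark-shell session, so sc is in scope; all names and data are illustrative) that rebuilds (Option[V], Option[W]) pairs, i.e. full-outer-join output, from cogroup's grouped result:

```scala
// Illustrative sketch: full outer join semantics derived from cogroup.
val left  = sc.makeRDD(Array(("A", "1"), ("B", "2")), 2)
val right = sc.makeRDD(Array(("A", "a"), ("D", "d")), 2)

val fullOuter = left.cogroup(right).flatMap { case (k, (vs, ws)) =>
  // An empty side contributes a single None; otherwise pair every v with every w.
  val vOpts: Seq[Option[String]] = if (vs.isEmpty) Seq(None) else vs.toSeq.map(Some(_))
  val wOpts: Seq[Option[String]] = if (ws.isEmpty) Seq(None) else ws.toSeq.map(Some(_))
  for (v <- vOpts; w <- wOpts) yield (k, (v, w))
}
// fullOuter.collect (order may vary):
// Array((A,(Some(1),Some(a))), (B,(Some(2),None)), (D,(None,Some(d))))
```

Spark's built-in fullOuterJoin (like leftOuterJoin and rightOuterJoin) is implemented on top of cogroup in essentially this way.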
## Example where the parameter is one RDD
- ("A", "1"), ("B", "2"), ("C", "3")), 2)
- ("A", "a"), ("C", "c"), ("D", "d")), 2)
- Scala> var rdd3 = rdd1.cogroup (rdd2)
- Rdd3: org.apache.spark.rdd.RDD [(String, (Iterable [String], Iterable [String])) = MapPartitionsRDD [12] at cogroup at: 25
- Scala> rdd3.partitions.size
- Res3: Int = 2
- Scala> rdd3.collect
- Res1: Array [(String, (Iterable [String], Iterable [String]))] = Array (
- (B, (CompactBuffer (2), CompactBuffer ())),
- (D, (CompactBuffer (), CompactBuffer (d))),
- (A, (CompactBuffer (1), CompactBuffer (a))),
- (C, (CompactBuffer (3), CompactBuffer (c)))
- )
- Scala> var rdd4 = rdd1.cogroup (rdd2,3)
- Rdd4: org.apache.spark.rdd.RDD [(String, (Iterable [String], Iterable [String])) = MapPartitionsRDD [14] at cogroup at: 25
- Scala> rdd4.partitions.size
- Res5: Int = 3
- Scala> rdd4.collect
- Res6: Array [(String, (Iterable [String], Iterable [String]))] = Array (
- (B, ( CompactBuffer (2) , CompactBuffer ())),
- (C, ( CompactBuffer (3) , CompactBuffer (c))),
- (A, ( CompactBuffer (1) , CompactBuffer (a))),
- (D, ( CompactBuffer () , CompactBuffer (d))))
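The partitioner variant works the same way. A minimal sketch, continuing the same session (HashPartitioner is Spark's standard hash-based partitioner; rdd5 is an illustrative name):

```scala
import org.apache.spark.HashPartitioner

// Partition the cogrouped result explicitly with a 4-way hash partitioner.
val rdd5 = rdd1.cogroup(rdd2, new HashPartitioner(4))
// rdd5.partitions.size => 4
```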
## Example where the parameters are two RDDs
- ("A", "1"), ("B", "2"), ("C", "3")), 2)
- ("A", "a"), ("C", "c"), ("D", "d")), 2)
- Var rdd3 = sc.makeRDD (Array ("A", "A"), ("E", "E")), 2)
- Scala> var rdd4 = rdd1.cogroup (rdd2, rdd3)
- Rdd4: org.apache.spark.rdd.RDD [(String, (Iterable [String], Iterable [String], Iterable [String]))] =
- MapPartitionsRDD [17] at cogroup at: 27
- Scala> rdd4.partitions.size
- Res7: Int = 2
- Scala> rdd4.collect
- Res9: Array [(String, (Iterable [String], Iterable [String], Iterable [String]))] = Array (
- (B, ( CompactBuffer (2) , CompactBuffer (), CompactBuffer () )),
- (D, ( CompactBuffer () , CompactBuffer (d), CompactBuffer () )),
- (A, ( CompactBuffer (1) , CompactBuffer (a), CompactBuffer (A) )),
- ( CompactBuffer (3) , CompactBuffer (c), CompactBuffer ( )),
- (E, ( CompactBuffer () , CompactBuffer (), CompactBuffer (E) )))
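For completeness, a minimal sketch of what that three-RDD call would look like (rdd0 and its contents are illustrative, assuming the same session as above):

```scala
// A fourth source of key-value pairs (illustrative data).
val rdd0 = sc.makeRDD(Array(("A", "x"), ("F", "f")), 2)

// cogroup with three other RDDs: each result value is a tuple of four
// Iterables, one per source RDD, empty where the key is absent.
val grouped = rdd1.cogroup(rdd2, rdd3, rdd0)
// grouped.lookup("A") => Seq((CompactBuffer(1),CompactBuffer(a),CompactBuffer(A),CompactBuffer(x)))
```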
join

```scala
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
```
join is equivalent to an inner join in SQL: only keys that can be matched in both RDDs appear in the result. join relates exactly two RDDs at a time; to associate more than two RDDs, apply it repeatedly (see the sketch after the example below).
The parameter numPartitions specifies the number of partitions of the result.
The parameter partitioner specifies the partitioning function.
- ("A", "1"), ("B", "2"), ("C", "3")), 2)
- ("A", "a"), ("C", "c"), ("D", "d")), 2)
- Scala> rdd1.join (rdd2) .collect
- Res10: Array [(String, (String, String))] = Array ((A, (1, a)), (C, (3, c)))
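As noted above, associating more than two RDDs means chaining join calls. A minimal sketch, continuing the same session (rdd0 and its contents are illustrative):

```scala
// A third source of key-value pairs (illustrative data).
val rdd0 = sc.makeRDD(Array(("A", "x"), ("C", "y")), 2)

// Keep only keys present in all three RDDs, flattening the nested pairs.
val joined = rdd1.join(rdd2).join(rdd0)
  .mapValues { case ((v1, v2), v3) => (v1, v2, v3) }
// joined.collect => Array((A,(1,a,x)), (C,(3,c,y))) (order may vary)
```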
For more information on Spark operators, refer to the Spark operator series.

Source: http://lxw1234.com/archives/2015/07/363.htm