Spark operators: RDD basic transformation operations (7) - zipWithIndex, zipWithUniqueId

zipWithIndex

def zipWithIndex(): RDD[(T, Long)]
This function zips the elements of the RDD with their indices (element IDs), producing key/value pairs of the form (element, index).
  scala> var rdd2 = sc.makeRDD(Seq("A", "B", "R", "D", "F"), 2)
  rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:21

  scala> rdd2.zipWithIndex().collect
  Array((A,0), (B,1), (R,2), (D,3), (F,4))
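Note that when the RDD has more than one partition, zipWithIndex has to trigger a Spark job first, because the global index of an element depends on the number of elements in all preceding partitions. A minimal usage sketch (assuming the same SparkContext sc as in the session above; the "header" data is hypothetical) that uses the indices to skip the first record:

  val rows = sc.makeRDD(Seq("header", "row1", "row2"), 2)
  // Pair each element with its global index, keep only indices > 0,
  // then drop the indices again: this effectively skips the first record.
  val noHeader = rows.zipWithIndex().filter { case (_, idx) => idx > 0 }.keys
  noHeader.collect()  // Array(row1, row2)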

zipWithUniqueId

def zipWithUniqueId(): RDD[(T, Long)]
This function zips the elements of the RDD with unique IDs, producing key/value pairs. The unique IDs are generated as follows:
The unique ID of the first element in each partition is the partition index;
The unique ID of the Nth element in each partition is the unique ID of the previous element plus the total number of partitions in the RDD.
Look at the following example:
  1. (A), "B", "C", "D", "E", "F"), 2)
  2. Rdd1: org.apache.spark.rdd.RDD [String] = ParallelCollectionRDD [44] at makeRDD at :twenty one
  3. // rdd1 has two partitions,
  4. Scala> rdd1.zipWithUniqueId (). Collect
  5. (B, 2), (C, 4), (D, 1), (E, 3), (F, 5))
  6. // The total number of partitions is 2
  7. // The first partition has the first element ID of 0 and the first partition ID of the second partition
  8. // The first partition has the second element ID of 0 + 2 = 2, the first partition has the third element ID of 2 + 2 = 4
  9. // The second partition has the second element ID of 1 + 2 = 3, the second partition has the third element ID of 3 + 2 = 5
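Unlike zipWithIndex, zipWithUniqueId does not trigger a Spark job, since each ID is computed locally from the partition index alone; in exchange, the IDs are unique but generally not consecutive. The rule above can be reproduced by hand with mapPartitionsWithIndex. A minimal sketch, assuming the sc and rdd1 from the session above:

  val n = rdd1.getNumPartitions
  // For the k-th element (0-based) of partition pid, the rule above
  // assigns the unique ID pid + k * n.
  val manualIds = rdd1.mapPartitionsWithIndex { (pid, iter) =>
    iter.zipWithIndex.map { case (elem, k) => (elem, pid + k.toLong * n) }
  }
  manualIds.collect()  // Array((A,0), (B,2), (C,4), (D,1), (E,3), (F,5))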
