Spark operators: RDD basic transformation operations (7) - zipWithIndex, zipWithUniqueId

zipWithIndex

def zipWithIndex(): RDD[(T, Long)]
This function zips the elements of the RDD with their indices (element IDs), producing key/value pairs of the form (element, index).
  scala> var rdd2 = sc.makeRDD(Seq("A", "B", "R", "D", "F"), 2)
  rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:21

  scala> rdd2.zipWithIndex().collect
  Array((A,0), (B,1), (R,2), (D,3), (F,4))
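Note that when the RDD has more than one partition, zipWithIndex has to trigger a Spark job first, because the global index of an element depends on the number of elements in all preceding partitions. A minimal usage sketch (assuming the same SparkContext sc as in the session above; the "header" data is hypothetical) that uses the indices to skip the first record:

  val rows = sc.makeRDD(Seq("header", "row1", "row2"), 2)
  // Pair each element with its global index, keep only indices > 0,
  // then drop the indices again: this effectively skips the first record.
  val noHeader = rows.zipWithIndex().filter { case (_, idx) => idx > 0 }.keys
  noHeader.collect()  // Array(row1, row2)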

zipWithUniqueId

def zipWithUniqueId(): RDD[(T, Long)]
This function zips the elements of the RDD with unique IDs, producing key/value pairs. The unique IDs are generated as follows:
The unique ID of the first element in each partition is the partition index;
The unique ID of the Nth element in each partition is the unique ID of the previous element plus the total number of partitions in the RDD.
Look at the following example:
  1. (A), "B", "C", "D", "E", "F"), 2)
  2. Rdd1: org.apache.spark.rdd.RDD [String] = ParallelCollectionRDD [44] at makeRDD at :twenty one
  3. // rdd1 has two partitions,
  4. Scala> rdd1.zipWithUniqueId (). Collect
  5. (B, 2), (C, 4), (D, 1), (E, 3), (F, 5))
  6. // The total number of partitions is 2
  7. // The first partition has the first element ID of 0 and the first partition ID of the second partition
  8. // The first partition has the second element ID of 0 + 2 = 2, the first partition has the third element ID of 2 + 2 = 4
  9. // The second partition has the second element ID of 1 + 2 = 3, the second partition has the third element ID of 3 + 2 = 5
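Unlike zipWithIndex, zipWithUniqueId does not trigger a Spark job, since each ID is computed locally from the partition index alone; in exchange, the IDs are unique but generally not consecutive. The rule above can be reproduced by hand with mapPartitionsWithIndex. A minimal sketch, assuming the sc and rdd1 from the session above:

  val n = rdd1.getNumPartitions
  // For the k-th element (0-based) of partition pid, the rule above
  // assigns the unique ID pid + k * n.
  val manualIds = rdd1.mapPartitionsWithIndex { (pid, iter) =>
    iter.zipWithIndex.map { case (elem, k) => (elem, pid + k.toLong * n) }
  }
  manualIds.collect()  // Array((A,0), (B,2), (C,4), (D,1), (E,3), (F,5))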
