Spark operators: basic RDD transformations (7) - zipWithIndex, zipWithUniqueId
zipWithIndex
def zipWithIndex(): RDD[(T, Long)]

This function pairs each element of the RDD with its index (position) in the RDD, producing key/value pairs.
scala> var rdd2 = sc.makeRDD(Seq("A", "B", "R", "D", "F"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at :21

scala> rdd2.zipWithIndex().collect
Array((A,0), (B,1), (R,2), (D,3), (F,4))
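Note that when the RDD has more than one partition, zipWithIndex has to trigger a Spark job first, because an element's global index depends on how many elements sit in the earlier partitions. The following is a minimal sketch of that two-pass idea using mapPartitionsWithIndex; zipWithIndexSketch is a hypothetical helper name for illustration, not Spark's actual implementation:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch: derive zipWithIndex-style global indices in two passes.
def zipWithIndexSketch[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] = {
  // Pass 1: count the elements in each partition (this is the extra job).
  val counts = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
  // Partition i's indices start after all elements of partitions 0..i-1.
  val offsets = counts.scanLeft(0L)(_ + _)
  // Pass 2: add the partition's offset to each element's local index.
  rdd.mapPartitionsWithIndex { (p, it) =>
    it.zipWithIndex.map { case (x, i) => (x, offsets(p) + i) }
  }
}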
zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]

This function pairs each RDD element with a unique Long ID as a key/value pair. The unique IDs are generated as follows:
The unique ID of the first element in each partition is the partition's index number.
The unique ID of the Nth element in each partition is the unique ID of the previous element plus the total number of RDD partitions.
Equivalently, element k (counting from 0) of partition p gets the ID p + k * numPartitions.
Look at the following example:
- (A), "B", "C", "D", "E", "F"), 2)
- Rdd1: org.apache.spark.rdd.RDD [String] = ParallelCollectionRDD [44] at makeRDD at :twenty one
- // rdd1 has two partitions,
- Scala> rdd1.zipWithUniqueId (). Collect
- (B, 2), (C, 4), (D, 1), (E, 3), (F, 5))
// The total number of partitions is 2
// The first element of the first partition gets ID 0; the first element of the second partition gets ID 1
// In the first partition, the second element gets ID 0 + 2 = 2 and the third element gets ID 2 + 2 = 4
// In the second partition, the second element gets ID 1 + 2 = 3 and the third element gets ID 3 + 2 = 5
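Because an element's ID depends only on its partition index and its position within the partition, zipWithUniqueId does not need an extra job to count partition sizes; the trade-off is that the IDs are unique but not consecutive. Below is a minimal sketch of the same rule (zipWithUniqueIdSketch is a hypothetical name, not Spark's implementation):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch: element k (0-based) of partition p gets the ID p + k * numPartitions.
def zipWithUniqueIdSketch[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] = {
  val n = rdd.partitions.length // total number of partitions
  rdd.mapPartitionsWithIndex { (p, it) =>
    it.zipWithIndex.map { case (x, k) => (x, p + k.toLong * n) }
  }
}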