Spark operators: basic RDD transformations (4): union, intersection, subtract

Union

def union(other: RDD[T]): RDD[T]

This function is straightforward: it merges the two RDDs, without removing duplicates.

    scala> var rdd1 = sc.makeRDD(1 to 2, 1)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at :21

    scala> rdd1.collect
    res42: Array[Int] = Array(1, 2)

    scala> var rdd2 = sc.makeRDD(2 to 3, 1)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at :21

    scala> rdd2.collect
    res43: Array[Int] = Array(2, 3)

    scala> rdd1.union(rdd2).collect
    res44: Array[Int] = Array(1, 2, 2, 3)
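Because union keeps duplicates (note the two 2s above), chaining a distinct after it gives true set-union semantics. A minimal sketch, assuming a running SparkContext `sc` (e.g. from spark-shell):

```scala
// union keeps duplicates; chain distinct() to deduplicate
val rdd1 = sc.makeRDD(1 to 2, 1)
val rdd2 = sc.makeRDD(2 to 3, 1)

// ++ is an alias for union
val merged = rdd1 ++ rdd2          // contains 1, 2, 2, 3
val deduped = merged.distinct()    // contains 1, 2, 3 (in some order)
```

distinct triggers a shuffle, so use it only when duplicates actually need to be removed.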

Intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function returns the intersection of the two RDDs, with duplicates removed.
The parameter numPartitions specifies the number of partitions of the returned RDD.
The parameter partitioner specifies the partitioner to use.
    scala> var rdd1 = sc.makeRDD(1 to 2, 1)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at :21

    scala> rdd1.collect
    res42: Array[Int] = Array(1, 2)

    scala> var rdd2 = sc.makeRDD(2 to 3, 1)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at :21

    scala> rdd2.collect
    res43: Array[Int] = Array(2, 3)

    scala> rdd1.intersection(rdd2).collect
    res45: Array[Int] = Array(2)

    scala> var rdd3 = rdd1.intersection(rdd2)
    rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at :25

    scala> rdd3.partitions.size
    res46: Int = 1

    scala> var rdd3 = rdd1.intersection(rdd2, 2)
    rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at :25

    scala> rdd3.partitions.size
    res47: Int = 2
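The transcript above exercises the numPartitions variant; the third overload takes a Partitioner object instead, which controls both the partition count and how keys are assigned to partitions. A minimal sketch using HashPartitioner, again assuming `sc` from spark-shell:

```scala
// intersection with an explicit partitioner: the result takes
// its partition count (here 4) from the partitioner
import org.apache.spark.HashPartitioner

val rdd1 = sc.makeRDD(1 to 2, 1)
val rdd2 = sc.makeRDD(2 to 3, 1)

val common = rdd1.intersection(rdd2, new HashPartitioner(4))
common.partitions.size  // 4
common.collect          // Array(2)
```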

Subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function is similar to intersection, but returns the elements that appear in this RDD and not in otherRDD; unlike intersection, it does not remove duplicates.
The parameters have the same meaning as in intersection.
    scala> var rdd1 = sc.makeRDD(Seq(1, 2, 2, 3))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at :21

    scala> rdd1.collect
    res48: Array[Int] = Array(1, 2, 2, 3)

    scala> var rdd2 = sc.makeRDD(3 to 4)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at :21

    scala> rdd2.collect
    res49: Array[Int] = Array(3, 4)

    scala> rdd1.subtract(rdd2).collect
    res50: Array[Int] = Array(1, 2, 2)
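As the transcript shows, subtract keeps every occurrence from the left-hand RDD (both 2s survive). When true set-difference semantics are wanted, distinct can be chained on afterwards. A minimal sketch, assuming `sc` from spark-shell:

```scala
// subtract keeps duplicate occurrences from the left-hand RDD;
// chain distinct() for set-difference semantics
val rdd1 = sc.makeRDD(Seq(1, 2, 2, 3))
val rdd2 = sc.makeRDD(3 to 4)

val diff = rdd1.subtract(rdd2)       // contains 1, 2, 2
val setDiff = diff.distinct()        // contains 1, 2 (in some order)
```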
