Spark operators: basic RDD transformations (4) - union, intersection, subtract

Union

def union(other: RDD[T]): RDD[T]
This function is straightforward: it merges two RDDs without removing duplicates.

  scala> var rdd1 = sc.makeRDD(1 to 2, 1)
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21

  scala> rdd1.collect
  res42: Array[Int] = Array(1, 2)

  scala> var rdd2 = sc.makeRDD(2 to 3, 1)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21

  scala> rdd2.collect
  res43: Array[Int] = Array(2, 3)

  scala> rdd1.union(rdd2).collect
  res44: Array[Int] = Array(1, 2, 2, 3)
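The same semantics can be mirrored with plain Scala collections, with no Spark cluster required (a local sketch for illustration only): union is just concatenation, and duplicates survive unless you call distinct explicitly afterwards.

```scala
object UnionSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2)
    val b = Seq(2, 3)

    // Like RDD.union, concatenation keeps duplicates
    val unioned = a ++ b
    println(unioned)           // List(1, 2, 2, 3)

    // Deduplication must be requested explicitly
    println(unioned.distinct)  // List(1, 2, 3)
  }
}
```

The same pattern applies on a real RDD: chain a distinct call after union if you want set semantics.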

Intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioner to use.
  scala> var rdd1 = sc.makeRDD(1 to 2, 1)
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21

  scala> rdd1.collect
  res42: Array[Int] = Array(1, 2)

  scala> var rdd2 = sc.makeRDD(2 to 3, 1)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21

  scala> rdd2.collect
  res43: Array[Int] = Array(2, 3)

  scala> rdd1.intersection(rdd2).collect
  res45: Array[Int] = Array(2)

  scala> var rdd3 = rdd1.intersection(rdd2)
  rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25

  scala> rdd3.partitions.size
  res46: Int = 1

  scala> var rdd3 = rdd1.intersection(rdd2, 2)
  rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25

  scala> rdd3.partitions.size
  res47: Int = 2
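The deduplicating behavior of intersection can likewise be sketched locally with plain Scala collections (an analogy only, not Spark's actual shuffle-based implementation): even if a value occurs multiple times on both sides, it appears only once in the result.

```scala
object IntersectionSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 2)
    val b = Seq(2, 2, 3)

    // Like RDD.intersection: values present on both sides, deduplicated
    val result = a.intersect(b).distinct
    println(result)  // List(2)
  }
}
```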

Subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function is similar to intersection, but returns the elements that appear in this RDD and not in otherRDD; note that duplicates are not removed.
The parameters have the same meaning as for intersection.
  scala> var rdd1 = sc.makeRDD(Seq(1, 2, 2, 3))
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21

  scala> rdd1.collect
  res48: Array[Int] = Array(1, 2, 2, 3)

  scala> var rdd2 = sc.makeRDD(3 to 4)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21

  scala> rdd2.collect
  res49: Array[Int] = Array(3, 4)

  scala> rdd1.subtract(rdd2).collect
  res50: Array[Int] = Array(1, 2, 2)
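Subtract behaves like a filter against the other RDD's values: every element of the left side whose value occurs on the right is dropped, but duplicates among the surviving left-side elements are kept. A local collection sketch (an analogy, not Spark's implementation):

```scala
object SubtractSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 2, 3)
    val b = Seq(3, 4)

    // Like RDD.subtract: drop values that occur in b,
    // keeping duplicates among the remaining elements of a
    val bValues = b.toSet
    val result = a.filterNot(bValues.contains)
    println(result)  // List(1, 2, 2)
  }
}
```

This is why the result above is Array(1, 2, 2) rather than Array(1, 2): unlike intersection, subtract does not deduplicate.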
