Spark operators: basic RDD transformations (4) - union, intersection, subtract

Union

def union(other: RDD[T]): RDD[T]
This function is straightforward: it merges two RDDs without removing duplicates.

  scala> var rdd1 = sc.makeRDD(1 to 2, 1)
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21

  scala> rdd1.collect
  res42: Array[Int] = Array(1, 2)

  scala> var rdd2 = sc.makeRDD(2 to 3, 1)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21

  scala> rdd2.collect
  res43: Array[Int] = Array(2, 3)

  scala> rdd1.union(rdd2).collect
  res44: Array[Int] = Array(1, 2, 2, 3)
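The same semantics can be mirrored with plain Scala collections, with no Spark cluster required (a local sketch for illustration only): union is just concatenation, and duplicates survive unless you call distinct explicitly afterwards.

```scala
object UnionSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2)
    val b = Seq(2, 3)

    // Like RDD.union, concatenation keeps duplicates
    val unioned = a ++ b
    println(unioned)           // List(1, 2, 2, 3)

    // Deduplication must be requested explicitly
    println(unioned.distinct)  // List(1, 2, 3)
  }
}
```

The same pattern applies on a real RDD: chain a distinct call after union if you want set semantics.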

Intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioner to use.
  scala> var rdd1 = sc.makeRDD(1 to 2, 1)
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21

  scala> rdd1.collect
  res42: Array[Int] = Array(1, 2)

  scala> var rdd2 = sc.makeRDD(2 to 3, 1)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21

  scala> rdd2.collect
  res43: Array[Int] = Array(2, 3)

  scala> rdd1.intersection(rdd2).collect
  res45: Array[Int] = Array(2)

  scala> var rdd3 = rdd1.intersection(rdd2)
  rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25

  scala> rdd3.partitions.size
  res46: Int = 1

  scala> var rdd3 = rdd1.intersection(rdd2, 2)
  rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25

  scala> rdd3.partitions.size
  res47: Int = 2
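The deduplicating behavior of intersection can likewise be sketched locally with plain Scala collections (an analogy only, not Spark's actual shuffle-based implementation): even if a value occurs multiple times on both sides, it appears only once in the result.

```scala
object IntersectionSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 2)
    val b = Seq(2, 2, 3)

    // Like RDD.intersection: values present on both sides, deduplicated
    val result = a.intersect(b).distinct
    println(result)  // List(2)
  }
}
```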

Subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function is similar to intersection, but returns the elements that appear in this RDD and not in otherRDD; note that duplicates are not removed.
The parameters have the same meaning as for intersection.
  scala> var rdd1 = sc.makeRDD(Seq(1, 2, 2, 3))
  rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21

  scala> rdd1.collect
  res48: Array[Int] = Array(1, 2, 2, 3)

  scala> var rdd2 = sc.makeRDD(3 to 4)
  rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21

  scala> rdd2.collect
  res49: Array[Int] = Array(3, 4)

  scala> rdd1.subtract(rdd2).collect
  res50: Array[Int] = Array(1, 2, 2)
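Subtract behaves like a filter against the other RDD's values: every element of the left side whose value occurs on the right is dropped, but duplicates among the surviving left-side elements are kept. A local collection sketch (an analogy, not Spark's implementation):

```scala
object SubtractSketch {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 2, 3)
    val b = Seq(3, 4)

    // Like RDD.subtract: drop values that occur in b,
    // keeping duplicates among the remaining elements of a
    val bValues = b.toSet
    val result = a.filterNot(bValues.contains)
    println(result)  // List(1, 2, 2)
  }
}
```

This is why the result above is Array(1, 2, 2) rather than Array(1, 2): unlike intersection, subtract does not deduplicate.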
