Spark operators: basic RDD transformations (4): union, intersection, subtract
Union
def union(other: RDD[T]): RDD[T]
This function simply merges two RDDs. It does not remove duplicates.
- scala> var rdd1 = sc.makeRDD(1 to 2,1)
- rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
- scala> rdd1.collect
- res42: Array[Int] = Array(1, 2)
- scala> var rdd2 = sc.makeRDD(2 to 3,1)
- rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
- scala> rdd2.collect
- res43: Array[Int] = Array(2, 3)
- scala> rdd1.union(rdd2).collect
- res44: Array[Int] = Array(1, 2, 2, 3)
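Since union keeps duplicates, a common follow-up is to chain distinct() when set semantics are wanted. The following is a minimal sketch of that, not part of the original walkthrough: the object name UnionSketch, the local[2] master, and the run() helper are assumptions made so the example runs outside spark-shell (inside spark-shell, the existing sc would be used instead).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionSketch {
  // Returns (sorted union, sorted de-duplicated union, partition count of the union).
  def run(): (Array[Int], Array[Int], Int) = {
    // Assumption: a throwaway local context, only for illustration.
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.makeRDD(1 to 2, 1)
      val rdd2 = sc.makeRDD(2 to 3, 1)

      // union concatenates the two RDDs; the shared element 2 appears twice.
      val merged = rdd1.union(rdd2)
      // distinct() removes the duplicates (this step involves a shuffle).
      val deduped = merged.distinct()

      // Neither input has a partitioner, so the union keeps the partitions
      // of both inputs side by side: 1 + 1 = 2 here.
      (merged.collect().sorted, deduped.collect().sorted, merged.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (merged, deduped, parts) = run()
    println(merged.mkString(", "))  // 1, 2, 2, 3
    println(deduped.mkString(", ")) // 1, 2, 3
    println(parts)                  // 2
  }
}
```

The results are sorted before printing because the order of elements after a shuffle should not be relied upon.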
Intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioner (partition function) to use.
- scala> var rdd1 = sc.makeRDD(1 to 2,1)
- rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
- scala> rdd1.collect
- res42: Array[Int] = Array(1, 2)
- scala> var rdd2 = sc.makeRDD(2 to 3,1)
- rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
- scala> rdd2.collect
- res43: Array[Int] = Array(2, 3)
- scala> rdd1.intersection(rdd2).collect
- res45: Array[Int] = Array(2)
- scala> var rdd3 = rdd1.intersection(rdd2)
- rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25
- scala> rdd3.partitions.size
- res46: Int = 1
- scala> var rdd3 = rdd1.intersection(rdd2,2)
- rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25
- scala> rdd3.partitions.size
- res47: Int = 2
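The session above exercises the numPartitions overload but not the partitioner one, and its inputs contain no repeated values. As a rough sketch covering both points, the example below uses inputs with duplicates and passes a HashPartitioner. The object name IntersectionSketch, the sample data, the local[2] master, and the choice of HashPartitioner(3) are illustrative assumptions, not taken from the original.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object IntersectionSketch {
  // Returns (sorted common elements,
  //          partition count via numPartitions,
  //          partition count via an explicit Partitioner).
  def run(): (Array[Int], Int, Int) = {
    // Assumption: a throwaway local context, only for illustration.
    val conf = new SparkConf().setAppName("intersection-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Both inputs contain the value 2 twice.
      val left = sc.makeRDD(Seq(1, 2, 2, 3))
      val right = sc.makeRDD(Seq(2, 2, 3, 4))

      // The result holds each common element once, even though 2 is repeated.
      val common = left.intersection(right).collect().sorted

      // numPartitions overload: the result has exactly that many partitions.
      val byCount = left.intersection(right, 2).partitions.length

      // Partitioner overload: the result takes its partition count
      // from the partitioner that was passed in.
      val byPartitioner =
        left.intersection(right, new HashPartitioner(3)).partitions.length

      (common, byCount, byPartitioner)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (common, byCount, byPartitioner) = run()
    println(common.mkString(", ")) // 2, 3
    println(byCount)               // 2
    println(byPartitioner)         // 3
  }
}
```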
Subtract
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function is similar to intersection, but it returns the elements that appear in this RDD and do not appear in the other RDD. Duplicates are not removed.
The parameters have the same meaning as for intersection.
- scala> var rdd1 = sc.makeRDD(Seq(1,2,2,3))
- rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21
- scala> rdd1.collect
- res48: Array[Int] = Array(1, 2, 2, 3)
- scala> var rdd2 = sc.makeRDD(3 to 4)
- rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21
- scala> rdd2.collect
- res49: Array[Int] = Array(3, 4)
- scala> rdd1.subtract(rdd2).collect
- res50: Array[Int] = Array(1, 2, 2)
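As the output shows, subtract leaves repeated elements of the left RDD in place (2 appears twice). The sketch below illustrates two related points under stated assumptions: every occurrence of a value present in the other RDD is dropped, and distinct() can be chained when a set-style difference is wanted. The object name SubtractSketch, the sample data, and the local[2] master are hypothetical choices for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubtractSketch {
  // Returns (sorted subtract result,
  //          sorted de-duplicated result,
  //          partition count when numPartitions = 2 is requested).
  def run(): (Array[Int], Array[Int], Int) = {
    // Assumption: a throwaway local context, only for illustration.
    val conf = new SparkConf().setAppName("subtract-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // The left side repeats both 2 (absent from the right side)
      // and 3 (present on the right side).
      val left = sc.makeRDD(Seq(1, 2, 2, 3, 3))
      val right = sc.makeRDD(3 to 4)

      val diff = left.subtract(right)

      // Both copies of 3 are removed; both copies of 2 are kept.
      val kept = diff.collect().sorted
      // Chaining distinct() yields a set-style difference.
      val asSet = diff.distinct().collect().sorted

      // numPartitions overload: the result has exactly that many partitions.
      val parts = left.subtract(right, 2).partitions.length

      (kept, asSet, parts)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (kept, asSet, parts) = run()
    println(kept.mkString(", "))  // 1, 2, 2
    println(asSet.mkString(", ")) // 1, 2
    println(parts)                // 2
  }
}
```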