Spark operators: RDD Action operations (1) - first, count, reduce, collect

First

def first(): T
first returns the first element of the RDD, without sorting; which element comes first depends on how the data is partitioned.
  1. ("A", "1"), ("B", "2"), ("C", "3")), 2)
  2. Rdd1: org.apache.spark.rdd.RDD [(String, String)] = ParallelCollectionRDD [33] at makeRDD at: 21
  3.  
  4. Scala> rdd1.first
  5. Res14: (String, String) = (A, 1)
  6.  
  7. Scala> var rdd1 = sc.makeRDD (Seq (10, 4, 2, 12, 3))
  8. Rdd1: org.apache.spark.rdd.RDD [Int] = ParallelCollectionRDD [0] at makeRDD at: 21
  9.  
  10. Scala> rdd1.first
  11. Res8: Int = 10
  12.  
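Note that first throws an exception on an empty RDD, so it can be worth guarding with isEmpty (available since Spark 1.3) before calling it. A minimal sketch in the same spark-shell style; the RDD id and res number are illustrative:

scala> var rdd2 = sc.makeRDD(Seq[Int]())
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at :21

scala> if (rdd2.isEmpty) None else Some(rdd2.first)  // avoids the UnsupportedOperationException that first throws on an empty RDD
res9: Option[Int] = None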

Count

def count(): Long
count returns the number of elements in the RDD.
  1. ("A", "1"), ("B", "2"), ("C", "3")), 2)
  2. Rdd1: org.apache.spark.rdd.RDD [(String, String)] = ParallelCollectionRDD [34] at makeRDD at: 21
  3.  
  4. Scala> rdd1.count
  5. Res15: Long = 3
  6.  
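Since count is an action, it is commonly chained after a transformation such as filter to count only the matching elements. A minimal sketch; the RDD id and res number are illustrative:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[35] at makeRDD at :21

scala> rdd1.filter(_ % 2 == 0).count  // count only the even elements
res16: Long = 5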

Reduce

def reduce(f: (T, T) ⇒ T): T
reduce aggregates the elements of the RDD pairwise using the function f and returns the result.
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at :21

scala> rdd1.reduce(_ + _)
res18: Int = 55

scala> var rdd2 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at makeRDD at :21

scala> rdd2.reduce((x, y) => {
     |   (x._1 + y._1, x._2 + y._2)
     | })
res21: (String, Int) = (CBBAA,6)
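Because reduce combines partition results in parallel, f should be commutative and associative; with a non-commutative function like the tuple example above, the String part of the result (CBBAA here) can come out in a different order on another run, although the Int sum stays 6. A minimal sketch with a function that satisfies both properties; the RDD id and res number are illustrative:

scala> var rdd1 = sc.makeRDD(Seq(10, 4, 2, 12, 3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[39] at makeRDD at :21

scala> rdd1.reduce(_ max _)  // max is commutative and associative, so the result is deterministic
res22: Int = 12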

Collect

def collect(): Array[T]
collect returns all elements of the RDD as an array on the driver.
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at :21

scala> rdd1.collect
res23: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
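Since collect pulls every element back to the driver, it should only be used when the result fits in the driver's memory; for a quick look at a large RDD, the take action returns just the first n elements. A minimal sketch; the RDD id and res number are illustrative:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at makeRDD at :21

scala> rdd1.take(3)  // fetch only the first three elements to the driver
res24: Array[Int] = Array(1, 2, 3)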
For more information on Spark operators, refer to the Spark operator series.
