Spark operators: RDD actions (1) - first, count, reduce, collect

May 26, 2017

First

def first(): T
first returns the first element of the RDD. No sorting is applied, so the result is simply the first element in partition order.


 scala> var rdd1 = sc.makeRDD(Seq(("A", "1"), ("B", "2"), ("C", "3")), 2)
 rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[33] at makeRDD at <console>:21
 
 scala> rdd1.first
 res14: (String, String) = (A,1)
 
 scala> var rdd1 = sc.makeRDD(Seq(10, 4, 2, 12, 3))
 rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:21
 
 scala> rdd1.first
 res8: Int = 10
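
One edge case the session above does not show: on an RDD with no elements, first throws an UnsupportedOperationException. The following is a minimal sketch, not part of the original session. It assumes a local master (local[2]), and the names numbers, empty, head, failed and maybeHead are illustrative only. It uses take(1).headOption as a fallback when the RDD may be empty.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Reuses the running context (for example in spark-shell) or starts a local one.
// The app name and local[2] master are assumptions made for this sketch.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("first-sketch").setMaster("local[2]"))

val numbers = sc.makeRDD(Seq(10, 4, 2, 12, 3))
val empty   = sc.parallelize(Seq.empty[Int])

// first() returns the leading element in partition order, with no sorting.
val head: Int = numbers.first()                        // 10

// first() on an empty RDD throws UnsupportedOperationException.
val failed: Boolean = Try(empty.first()).isFailure     // true

// take(1) returns an empty array instead of throwing, so headOption is safe.
val maybeHead: Option[Int] = empty.take(1).headOption  // None
```

If the smallest or largest element is wanted rather than the leading one, first is not the right action, since it never sorts.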

Count

def count(): Long
count returns the number of elements in the RDD.


 scala> var rdd1 = sc.makeRDD(Seq(("A", "1"), ("B", "2"), ("C", "3")), 2)
 rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[34] at makeRDD at <console>:21
 
 scala> rdd1.count
 res15: Long = 3
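
Because count is an action, calling it is what triggers the evaluation of any transformations defined on the RDD. A small sketch of that behaviour, assuming a local master (local[2]); the names numbers, evens, total and evenTotal are illustrative and do not come from the session above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context (for example in spark-shell) or starts a local one.
// The app name and local[2] master are assumptions made for this sketch.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("count-sketch").setMaster("local[2]"))

val numbers = sc.makeRDD(1 to 10, 2)

// filter is a transformation: it only describes the work, nothing runs yet.
val evens = numbers.filter(_ % 2 == 0)

// Each count() call runs a job and returns a Long to the driver.
val total: Long     = numbers.count()   // 10
val evenTotal: Long = evens.count()     // 5
```

Note that the result is a Long rather than an Int, which matters when the value is passed to code expecting an Int.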

Reduce

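def reduce(f: (T, T) ⇒ T): T
reduce combines the elements of the RDD two at a time using the binary function f and returns the final result. f should be commutative and associative, because partitions are reduced in parallel and their partial results can be merged in any order.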


 scala> var rdd1 = sc.makeRDD(1 to 10, 2)
 rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:21
 
 scala> rdd1.reduce(_ + _)
 res18: Int = 55
 
 scala> var rdd2 = sc.makeRDD(Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
 rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at makeRDD at <console>:21
 
 scala> rdd2.reduce((x, y) => {
      | (x._1 + y._1, x._2 + y._2)
      | })
 res21: (String, Int) = (CBBAA,6)
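
The concatenated key in the last result comes out as CBBAA rather than AABBC because string concatenation is not commutative, so its outcome depends on the order in which partition results are merged. The integer sum is unaffected. The sketch below illustrates this and shows one order-independent alternative. It assumes a local master (local[2]); the names pairs, keys, total, safeTotal and sortedKeys are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context (for example in spark-shell) or starts a local one.
// The app name and local[2] master are assumptions made for this sketch.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("reduce-sketch").setMaster("local[2]"))

val pairs = sc.makeRDD(Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)), 2)

// Addition is commutative and associative, so total is always 6.
// Concatenation is not commutative, so keys is some ordering of A, A, B, B, C
// that can change from run to run.
val (keys, total) = pairs.reduce((x, y) => (x._1 + y._1, x._2 + y._2))

// Order-independent alternative: reduce only the numeric part,
// and sort the keys on the driver after collecting them.
val safeTotal: Int     = pairs.map(_._2).reduce(_ + _)                // 6
val sortedKeys: String = pairs.map(_._1).collect().sorted.mkString    // "AABBC"
```

Like first, reduce throws an UnsupportedOperationException when the RDD is empty, so an emptiness check is worth adding where that can happen.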

Collect

def collect(): Array[T]
collect returns all the elements of the RDD to the driver as an array.


 scala> var rdd1 = sc.makeRDD(1 to 10, 2)
 rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:21
 
 scala> rdd1.collect
 res23: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
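
Since collect copies every element into the driver's memory, it is best kept for small results; take(n) fetches only a bounded number of elements. A minimal sketch of the difference, assuming a local master (local[2]); the names numbers, all, firstThree and doubled are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context (for example in spark-shell) or starts a local one.
// The app name and local[2] master are assumptions made for this sketch.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("collect-sketch").setMaster("local[2]"))

val numbers = sc.makeRDD(1 to 10, 2)

// collect() materialises the whole RDD on the driver, in partition order.
val all: Array[Int] = numbers.collect()              // Array(1, 2, ..., 10)

// take(n) stops as soon as n elements have been gathered.
val firstThree: Array[Int] = numbers.take(3)         // Array(1, 2, 3)

// collect() is often the last step after a transformation.
val doubled: Array[Int] = numbers.map(_ * 2).collect()
```

On a large RDD, calling collect can exhaust driver memory, while take(3) returns the same small array regardless of the RDD's size.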

For more on Spark operators, see the Spark operator series.
