Spark operators: basic RDD transformations (2) - coalesce, repartition

Coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
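
coalesce returns a new RDD whose data is combined into numPartitions partitions. With the default shuffle = false it is a narrow operation that only merges existing partitions, so it can shrink but never grow the partition count; pass shuffle = true to redistribute the data and increase it.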

  scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
  data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at <console>:21

  scala> data.collect
  res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)

  scala> data.partitions.size
  res38: Int = 2  // the RDD has two partitions by default

  scala> var rdd1 = data.coalesce(1)
  rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:23

  scala> rdd1.partitions.size
  res1: Int = 1  // rdd1 now has a single partition

  scala> var rdd1 = data.coalesce(4)
  rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:23

  scala> rdd1.partitions.size
  res2: Int = 2  // asking for more partitions than the original has no effect unless shuffle is true

  scala> var rdd1 = data.coalesce(4, true)
  rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:23

  scala> rdd1.partitions.size
  res3: Int = 4
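A typical use of coalesce with the default shuffle = false is compacting an RDD after a selective filter has left many sparse partitions. Below is a minimal standalone sketch mirroring the shell session above; the object name, master setting, and input path are illustrative.

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch: merge partitions locally after a selective filter.
  object CoalesceExample {
    def main(args: Array[String]): Unit = {
      // local[2]: two cores, so textFile defaults to two partitions for a small file
      val conf = new SparkConf().setAppName("CoalesceExample").setMaster("local[2]")
      val sc = new SparkContext(conf)
      val data = sc.textFile("/tmp/lxw1234/1.txt")      // illustrative path
      val hellos = data.filter(_.startsWith("hello"))   // may leave partitions sparse
      // shuffle = false (default): partitions are merged without a network shuffle
      val compact = hellos.coalesce(1)
      println(compact.partitions.size)                  // 1
      sc.stop()
    }
  }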


Repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
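
In Spark's source, repartition is a thin wrapper that delegates to coalesce with shuffle enabled, which is why it can both increase and decrease the partition count, always at the cost of a full shuffle. It is essentially defined as:

  // repartition as defined in Spark's RDD.scala:
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
    coalesce(numPartitions, shuffle = true)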

  scala> var rdd2 = data.repartition(1)
  rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:23

  scala> rdd2.partitions.size
  res4: Int = 1

  scala> var rdd2 = data.repartition(4)
  rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at <console>:23

  scala> rdd2.partitions.size
  res5: Int = 4
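Because repartition always shuffles the entire dataset, prefer coalesce(n) with the default shuffle = false when you only need fewer partitions, and use repartition (or coalesce(n, true)) when the partition count must grow or the data needs rebalancing.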
