Spark operators: RDD creation operations
Creating an RDD from a collection
parallelize
The parallelize method distributes a local Scala collection into an RDD; an optional second argument sets the number of partitions.
- scala> var rdd = sc.parallelize(1 to 10)
- rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21
- scala> rdd.collect
- res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
- scala> rdd.partitions.size
- res4: Int = 15
- // set the RDD to 3 partitions
- scala> var rdd2 = sc.parallelize(1 to 10, 3)
- rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:21
- scala> rdd2.collect
- res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
- scala> rdd2.partitions.size
- res6: Int = 3
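The same call works outside the shell. Below is a minimal sketch of a standalone application built around parallelize; the object name, application name, and master URL ("local[*]") are placeholders for a local test run, not values from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeExample {
  def main(args: Array[String]): Unit = {
    // "local[*]" and the app name are placeholders for a local test run
    val conf = new SparkConf().setAppName("ParallelizeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // distribute a local Scala collection into an RDD with 3 partitions
    val rdd = sc.parallelize(1 to 10, 3)
    println(s"partitions = ${rdd.partitions.size}")  // 3
    println(rdd.collect().mkString(", "))            // 1, 2, ..., 10

    sc.stop()
  }
}
```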
makeRDD
makeRDD behaves like parallelize; an additional overload takes (element, preferred hosts) pairs and records those hosts as the preferred locations of each partition (a sketch of both overloads follows the transcript below).
- scala> var collect = Seq((1 to 10, Seq("slave007.lxw1234.com", "slave002.lxw1234.com")),
- (11 to 15, Seq("slave013.lxw1234.com", "slave015.lxw1234.com")))
- collect: Seq[(scala.collection.immutable.Range.Inclusive, Seq[String])] = List((Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
- List(slave007.lxw1234.com, slave002.lxw1234.com)), (Range(11, 12, 13, 14, 15), List(slave013.lxw1234.com, slave015.lxw1234.com)))
- scala> var rdd = sc.makeRDD(collect)
- rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at makeRDD at <console>:23
- scala> rdd.partitions.size
- res33: Int = 2
- scala> rdd.preferredLocations(rdd.partitions(0))
- res34: Seq[String] = List(slave007.lxw1234.com, slave002.lxw1234.com)
- scala> rdd.preferredLocations(rdd.partitions(1))
- res35: Seq[String] = List(slave013.lxw1234.com, slave015.lxw1234.com)
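As a rough sketch (the hostnames below are placeholders, not nodes from the original cluster), the two makeRDD overloads look like this: the first is identical to parallelize, while the second creates one partition per (element, hosts) pair and records the hosts as that partition's preferred locations.

```scala
// plain overload: same behaviour as parallelize
val rdd1 = sc.makeRDD(1 to 10, 3)

// location-aware overload: one partition per element, each annotated with
// the hosts where the partition should preferably be scheduled
val withLocations = Seq(
  (1 to 10,  Seq("host1.example.com", "host2.example.com")),  // placeholder hostnames
  (11 to 15, Seq("host3.example.com"))
)
val rdd2 = sc.makeRDD(withLocations)
println(rdd2.partitions.size)                          // 2
println(rdd2.preferredLocations(rdd2.partitions(0)))   // List(host1.example.com, host2.example.com)
```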
Creating an RDD from external storage
textFile
textFile reads a text file from HDFS, the local file system, or any other Hadoop-supported URI and returns an RDD of strings, one element per line.
- // create an RDD from an HDFS file
- scala> var rdd = sc.textFile("hdfs:///tmp/lxw1234/1.txt")
- rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at textFile at <console>:21
- scala> rdd.count
- res48: Long = 4
- // create an RDD from a local file
- scala> var rdd = sc.textFile("file:///etc/hadoop/conf/core-site.xml")
- rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at textFile at <console>:21
- scala> rdd.count
- res49: Long = 97
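textFile also takes an optional minimum-partitions argument and accepts glob patterns and comma-separated path lists, and wholeTextFiles returns whole files as (path, content) pairs instead of individual lines. The paths below are hypothetical, not files from the original post:

```scala
// hypothetical paths; point these at files that exist in your environment
val logs   = sc.textFile("hdfs:///tmp/logs/*.log", 10)            // request at least 10 partitions
val merged = sc.textFile("hdfs:///tmp/a.txt,hdfs:///tmp/b.txt")   // comma-separated list of paths

// one (path, fileContent) pair per file instead of one element per line
val files = sc.wholeTextFiles("hdfs:///tmp/lxw1234")
```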
Creating an RDD from other Hadoop file formats
SparkContext provides similar methods for other Hadoop-supported file formats; a short sketch of objectFile and sequenceFile follows this list.
hadoopFile
sequenceFile
objectFile
newAPIHadoopFile
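A minimal sketch of two of these methods (the output paths are placeholders and the directories must not already exist): saveAsObjectFile/objectFile round-trip an RDD through Java serialization, while saveAsSequenceFile/sequenceFile go through Hadoop Writables. Depending on your Spark version, the Writable conversions may require importing org.apache.spark.SparkContext._ first.

```scala
// placeholder paths; the output directories must not already exist
val nums = sc.parallelize(1 to 100, 4)
nums.saveAsObjectFile("hdfs:///tmp/nums_obj")
val restored = sc.objectFile[Int]("hdfs:///tmp/nums_obj")
println(restored.count())                       // 100

// keys and values are converted to/from Hadoop Writables automatically
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs_seq")
val back = sc.sequenceFile[String, Int]("hdfs:///tmp/pairs_seq")
println(back.collect().mkString(", "))          // (a,1), (b,2)
```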
Creating an RDD through the Hadoop API
hadoopRDD
newAPIHadoopRDD
Example: creating an RDD from HBase
- scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
- import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
- scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
- import org.apache.hadoop.hbase.mapreduce.TableInputFormat
- scala> import org.apache.hadoop.hbase.client.HBaseAdmin
- import org.apache.hadoop.hbase.client.HBaseAdmin
- scala> val conf = HBaseConfiguration.create()
- scala> conf.set(TableInputFormat.INPUT_TABLE, "lxw1234")
- scala> var hbaseRDD = sc.newAPIHadoopRDD(
- conf, classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
- scala> hbaseRDD.count
- res52: Long = 1
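Each element of hbaseRDD is an (ImmutableBytesWritable, Result) pair. As a hedged sketch of pulling the row key and one cell out of each Result (the column family "cf" and qualifier "col1" are placeholders for whatever the table actually stores):

```scala
import org.apache.hadoop.hbase.util.Bytes

// "cf" and "col1" are placeholder column family / qualifier names
val rows = hbaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
  (rowKey, value)
}
rows.collect().foreach(println)
```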