Resilient Distributed Dataset (RDD) Overview


1. What is a Resilient Distributed Dataset (RDD)?
2. What are the characteristics of an RDD?
3. What are the benefits of RDDs?
4. What is the RDD programming interface?
5. What are the dependencies between RDDs?
6. How does an RDD manage data storage?






Resilient Distributed Dataset (RDD)
RDD (Resilient Distributed Dataset) is Spark's most basic abstraction: an abstraction over distributed memory that lets you manipulate a distributed dataset the way you would operate on a local collection. RDD is the core of Spark. It represents a dataset that is partitioned, immutable, and operable in parallel, and different dataset formats correspond to different RDD implementations. An RDD must be serializable. An RDD can be cached in memory, so the result of each operation on the dataset can be kept in memory and the next operation can read its input directly from memory, eliminating the large amount of disk I/O that MapReduce incurs. This brings a substantial efficiency gain for iterative workloads such as machine learning algorithms and interactive data mining.
You can think of an RDD as a large collection whose data is loaded into memory so that it can be conveniently reused many times. First, it is distributed: the data can be spread across multiple machines and computed on in parallel. Second, it is resilient: when a machine's memory is insufficient during a computation, data can be exchanged with the hard disk. This reduces performance to some extent, but it guarantees that the computation can continue.
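The following is a minimal sketch of this idea in Scala; the application name, local master setting, and dataset are arbitrary choices for illustration:

  import org.apache.spark.{SparkConf, SparkContext}

  // Set up a SparkContext (in spark-shell this already exists as `sc`).
  val conf = new SparkConf().setAppName("RDDIntro").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Parallelize a local Scala collection into an RDD spread over the partitions.
  val numbers = sc.parallelize(1 to 1000000)

  // Operate on it as if it were a local collection; the work runs in parallel,
  // one task per partition.
  val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
  println(sumOfSquares)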
RDD features
An RDD is a read-only, partitioned, distributed collection of objects. These collections are resilient: if part of the dataset is lost, it can be rebuilt. RDDs offer fault tolerance, location-aware scheduling, and scalability, and fault tolerance is the hardest of these to achieve. Most distributed datasets provide fault tolerance in one of two ways: data checkpointing or logging data updates. For a large-scale data analysis system, checkpointing is very expensive, mainly because of the cost of transferring large volumes of data between servers. Compared with logging fine-grained updates, an RDD supports only coarse-grained transformations and simply records how it was derived from other RDDs (its lineage), which is enough to recover any lost partition.
Its characteristics are:
  • The data storage structure is immutable
  • Distributed data operations across a cluster are supported
  • Data can be partitioned by key
  • Only coarse-grained transformation operations are provided
  • Data is stored in memory to ensure low latency
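Two of these characteristics, immutability and coarse-grained transformation, are visible directly in the API. A small sketch, assuming the SparkContext sc from the previous example:

  val words = sc.parallelize(Seq("spark", "rdd", "spark"))

  // A coarse-grained transformation: one operation applied to the whole
  // dataset, producing a new RDD. The original `words` RDD is never mutated.
  val pairs = words.map(w => (w, 1))

  // `words` is still available, unchanged, for other computations.
  println(s"${words.count()} words, ${pairs.count()} pairs")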

RDD benefits
  • An RDD can only be created from persistent storage or through Transformation operations, which makes it more efficient to protect than distributed shared memory (DSM): a lost data partition can simply be recomputed from its lineage, with no need for a dedicated checkpoint.
  • The immutability of RDDs makes it possible to implement speculative execution in the style of Hadoop MapReduce.
  • The partitioned nature of RDD data improves performance through data locality, just as in Hadoop MapReduce.
  • RDDs are serializable, so when memory is insufficient they can automatically degrade to disk storage. An RDD stored on disk performs much worse, but still no worse than today's MapReduce.
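To illustrate the last point together with lineage recovery, Spark exposes explicit storage levels and the lineage itself. A sketch, again assuming sc; MEMORY_AND_DISK keeps partitions in memory and spills them to disk when they do not fit:

  import org.apache.spark.storage.StorageLevel

  val big = sc.parallelize(1 to 1000000).map(n => (n % 100, n))

  // Keep partitions in memory, degrading to disk when memory is insufficient.
  big.persist(StorageLevel.MEMORY_AND_DISK)

  // The lineage used to recompute lost partitions can be inspected directly.
  println(big.toDebugString)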

RDD programming interface
An RDD supports two kinds of operations: Transformations and Actions. Their essential difference is:
A Transformation's return value is another RDD. Transformations follow a chained-call design pattern: a computation on one RDD converts it into another RDD, which can in turn be converted again. Transformations are lazy; the distributed computation they describe is only carried out when an Action triggers it.
An Action's return value is not an RDD. It is either an ordinary Scala collection, a single value, or empty, and it is ultimately either returned to the Driver program or used to write the RDD to the file system.
Transformations return an RDD, e.g. map, filter, union;
Actions return a result or persist the RDD, e.g. count, collect, save.
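A minimal sketch of the chained style, assuming sc and a hypothetical HDFS input path:

  // Transformations (lazy): each call returns a new RDD; nothing runs yet.
  val lines = sc.textFile("hdfs:///path/to/input.txt") // hypothetical path
  val longLines = lines.filter(_.length > 80)
  val upper = longLines.map(_.toUpperCase)

  // Action: triggers the distributed computation and returns a value
  // to the Driver program.
  val n = upper.count()
  println(s"$n long lines")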



RDD dependencies
Depending on the characteristics of the operations involved, there may be different kinds of dependencies. Dependencies between RDDs come in two forms:
  • Narrow Dependencies
    Each partition of the parent RDD is referenced by at most one partition of the child RDD: either one parent partition corresponds to one child partition, or partitions of several parent RDDs together correspond to one child partition. In other words, one partition of a parent RDD can never correspond to multiple partitions of a child RDD. Operations such as map, filter, and union create narrow dependencies.
  • Wide Dependencies
    A partition of the child RDD depends on several or all partitions of the parent RDD; that is, one partition of a parent RDD corresponds to multiple partitions of the child RDD. Operations such as groupByKey produce wide dependencies.
In the following figure, each blue solid box represents a partition and each blue-bordered rectangle represents an RDD:


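The kind of dependency a transformation produces can be inspected on the resulting RDD through its public dependencies field. A sketch, assuming sc:

  val kvPairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // map creates a narrow dependency (a OneToOneDependency in the output).
  val mapped = kvPairs.map { case (k, v) => (k, v * 2) }
  println(mapped.dependencies)

  // groupByKey creates a wide dependency (a ShuffleDependency in the output).
  val grouped = kvPairs.groupByKey()
  println(grouped.dependencies)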
Stage DAG
After a Job is submitted, Spark splits it into multiple stages, and there are dependencies between the stages. The dependencies between stages form a DAG (directed acyclic graph).
For narrow dependencies, Spark places the RDD transformations into the same stage whenever possible. For wide dependencies, since most of the work is the shuffle, Spark defines such a stage as a ShuffleMapStage and registers its shuffle operation with the MapOutputTracker. In general, Spark uses shuffle operations as the boundaries between stages.
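The stage boundary introduced by a shuffle can be seen in an RDD's lineage dump: in the output of toDebugString, a new indentation level marks a new stage. A sketch, assuming sc and a hypothetical input path:

  val counts = sc.textFile("hdfs:///path/to/input.txt") // hypothetical path
    .flatMap(_.split("\\s+"))
    .map(w => (w, 1))
    .reduceByKey(_ + _) // wide dependency: introduces a shuffle, hence a new stage

  // The indentation in the printed lineage reflects the stage boundaries.
  println(counts.toDebugString)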

RDD data storage management
An RDD can be thought of abstractly as a large array (Array), except that this array is distributed across the cluster. Logically, each fragment of an RDD is called a Partition.
In Spark's implementation, an RDD passes through one Transformation operator after another, and the computation is finally triggered by an Action operator. Logically, every transformation converts the RDD into a new RDD, and the new RDD records its derivation from the old one through lineage, which plays a very important role in fault tolerance. Both the input and the output of a transformation are RDDs. An RDD is divided into many partitions distributed to multiple nodes in the cluster. A partition is a logical concept: the old and new partitions before and after a transformation may physically be the same block of memory. This is a very important optimization that prevents the unlimited growth of memory requirements that immutable data would otherwise cause. Some RDDs are intermediate results of a computation, and their partitions do not necessarily have corresponding memory or disk data; if you want to reuse the data iteratively, you can cache it with the cache() function.

In the figure above, RDD1 contains five partitions (p1, p2, p3, p4, p5) stored on four nodes (Node1, Node2, Node3, Node4), and RDD2 contains three partitions (p1, p2, p3) distributed across three nodes (Node1, Node2, Node3).
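A sketch of cache() in an iterative computation, assuming sc and a hypothetical input file of comma-separated numbers:

  // Cache an RDD that is reused across iterations so it is materialized once.
  val points = sc.textFile("hdfs:///path/to/points.txt") // hypothetical path
    .map(_.split(",").map(_.toDouble))
    .cache() // keep the parsed partitions in memory for reuse

  var total = 0.0
  for (_ <- 1 to 10) {
    // Each iteration reads the cached partitions instead of re-reading
    // and re-parsing the file from HDFS.
    total += points.map(_.sum).reduce(_ + _)
  }
  println(total)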
Physically, an RDD object is essentially a metadata structure that stores the mapping between Blocks and Nodes along with other metadata. An RDD is a set of partitions. In physical data storage, each partition of an RDD corresponds to a Block, which can be kept in memory or, when memory is insufficient, stored on disk.
Each Block stores a subset of all the RDD's data items, which can be accessed either through a Block iterator (for example, a user can obtain a partition iterator via mapPartitions) or item by item (for example, computing on each data item in parallel through the map function). The underlying implementation details of this data management are covered in later sections of this book.
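A sketch of the two access patterns, assuming sc:

  // Four partitions, so four Blocks back this RDD.
  val data = sc.parallelize(1 to 100, 4)

  // Per-item access: the map function is applied to each data item in parallel.
  val doubled = data.map(_ * 2)
  println(doubled.take(3).mkString(", "))

  // Per-partition access: mapPartitions hands each partition to the user
  // as an iterator.
  val partitionSums = data.mapPartitions(iter => Iterator(iter.sum))
  println(partitionSums.collect().mkString(", "))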
If external storage such as HDFS is used as the input data source, the data is partitioned according to the data distribution policy in HDFS, and one block in HDFS corresponds to one Spark partition. Spark also supports repartitioning: a default or user-defined partitioner determines which nodes the data blocks are distributed to. Examples of partitioning strategies are hash partitioning (which computes a hash of each data item's key and places elements with the same hash value in the same partition) and range partitioning (which places data belonging to the same key range in the same partition).
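A sketch of both partitioning strategies, assuming sc:

  import org.apache.spark.{HashPartitioner, RangePartitioner}

  val kv = sc.parallelize((1 to 1000).map(i => (i, i.toString)))

  // Hash partitioning: same hash of the key => same partition.
  val hashed = kv.partitionBy(new HashPartitioner(8))

  // Range partitioning: keys in the same range => same partition
  // (the keys are sampled to estimate the range boundaries).
  val ranged = kv.partitionBy(new RangePartitioner(8, kv))

  println(hashed.partitioner + " / " + ranged.partitioner)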
When reposting, please credit the author, Jason Ding, and the source:
CSDN blog ( http://blog.csdn.net/jasonding1354 )
