RDD Transformations and Actions APIs in Apache Spark

1. Objective

This Spark API guide explains the important APIs of Apache Spark. The tutorial describes the transformations and actions used to process data in Spark. Spark is the next-generation Big Data tool; to learn more about Apache Spark, follow this introductory guide.

2. Transformation

Transformations build a new RDD (Resilient Distributed Dataset) from an existing RDD with the help of operations like filter, map, flatMap, etc. Transformations are lazy operations on an RDD, i.e. they do not execute immediately; they are executed only when an action is called. Transformations are functions that take an RDD as input and produce one or more "new" output RDDs.
The resulting RDD is always different from its parent RDD, and it can be smaller, bigger, or the same size. To improve the performance of computations, transformations are pipelined, which is an optimization technique.

2.1. Map:

It passes each element through a user-defined function and returns a new dataset formed by applying that function to every row / item of the RDD. The size of the input and output remains the same.
One -> one & size of B = size of A & one element in -> one element out.
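A minimal sketch in the Scala spark-shell, where sc is the SparkContext the shell provides:

val nums = sc.parallelize(Seq(1, 2, 3, 4))
val squares = nums.map(x => x * x)   // one element in -> one element out
// squares.collect() == Array(1, 4, 9, 16)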

2.2. FlatMap:

It does a similar job to map, but the difference is that flatMap returns a list of elements (0 or more) as an iterator, and the output of flatMap is flattened. The function in flatMap returns a list of elements, an array, or a sequence.
One -> many & one element in -> 0 or more elements out, so size of B can be smaller than, equal to, or larger than size of A.
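For illustration, splitting lines into words (same sc as above):

val lines = sc.parallelize(Seq("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))   // one element in -> 0 or more out
// words.collect() == Array("hello", "world", "hi")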

2.3. Filter:

It returns a new dataset formed by selecting those elements of the source on which the function returns true. It returns only the elements that satisfy a predicate; a predicate is a function that accepts a parameter and returns a Boolean value, either true or false. It keeps only those elements that satisfy the condition and filters out those that don't, so the new RDD is the set of elements for which the function returns true.
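A small example with a numeric predicate:

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(x => x % 2 == 0)   // keep only elements where the predicate is true
// evens.collect() == Array(2, 4, 6, 8, 10)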

2.4. MapPartitions:

It runs once on each partition or block of the RDD, so the function must be of type Iterator<T> => Iterator<U>. It can improve performance by reducing object creation compared to calling a function once per element in map.
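A sketch that sums each partition in one pass, assuming an RDD created with 3 partitions:

val nums = sc.parallelize(1 to 10, 3)
// the function receives an Iterator over a whole partition and returns a new Iterator
val partitionSums = nums.mapPartitions(iter => Iterator(iter.sum))
// partitionSums.collect() gives one sum per partition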

2.5. MapPartitionsWithIndex:

It is similar to mapPartitions, with one difference: the function takes two parameters. The first parameter is the partition index and the second is an iterator over all items within that partition (Int, Iterator<T>).
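A sketch that tags each element with the partition it came from:

val nums = sc.parallelize(1 to 6, 2)
val tagged = nums.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}
// tagged.collect() shows which partition produced each element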

2.6. Union:

It performs the standard set union operation. It is the same as the ‘++’ operator. It returns a new RDD by taking the union of this RDD with another RDD; duplicate elements are kept.
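For example:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
val u = a.union(b)   // equivalent to a ++ b
// u.collect() == Array(1, 2, 3, 3, 4, 5)   (duplicates are kept)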

2.7. Distinct:

Returns a new dataset containing only the unique elements of the source RDD, i.e. it removes duplicate values.
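A quick illustration:

val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
// distinct() removes duplicates (it shuffles data to do so)
// data.distinct().collect() == Array(1, 2, 3)   (order may vary)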

2.8. Intersection:

It returns the elements that are present in both RDDs, with duplicates removed (de-duplication).
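For example:

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5, 6))
// keeps only elements present in both RDDs, de-duplicated
// a.intersection(b).collect() == Array(3, 4)   (order may vary)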

2.9. GroupBy:

It groups the items of an RDD by a key derived from each element and returns a new dataset of grouped items. The new RDD is made up of a key (which identifies a group) and the list of items belonging to that group. The order of elements within a group may not be the same when you apply the same operation on the same RDD over and over. It is a wide operation, as it shuffles data across multiple partitions / divisions and creates another RDD.
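A sketch that groups strings by their first letter:

val names = sc.parallelize(Seq("apple", "avocado", "banana", "cherry"))
val grouped = names.groupBy(name => name.head)   // key = first letter, value = all items in that group
// grouped.collect() gives pairs like ('a', Seq("apple", "avocado")), ('b', Seq("banana")), ...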

2.10. ReduceByKey:

It uses an associative and commutative reduce function to merge the values of each key. It can be used only with RDDs of key-value pairs. It is a wide operation that shuffles data across multiple partitions / divisions and creates another RDD, but it first merges data locally within each partition using the same function, which reduces the amount of data shuffled. The result of the combination (e.g. a sum) is of the same type as the values, and the operation used to combine results from different partitions is the same as the operation used to combine values inside a partition.
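A minimal word-count style sketch:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
// values for the same key are merged inside each partition first, then across partitions
val counts = pairs.reduceByKey(_ + _)
// counts.collect() == Array(("a", 2), ("b", 1))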

2.11. AggregateByKey:

It combines the values for a particular key, and the result of that combination can be any object type that you specify. You need to specify how values are combined inside one partition (which executes on a single node) and how the results from different partitions (which may be on different nodes) are combined.
Aggregate the values of each key in an RDD, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
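A sketch computing (sum, count) per key, where the accumulator type (Int, Int) differs from the value type Int:

val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 4)))
val sumAndCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),      // merge a value into the accumulator within a partition
  (x, y)   => (x._1 + y._1, x._2 + y._2)     // merge accumulators from different partitions
)
// sumAndCount.collect() == Array(("a", (8, 2)), ("b", (4, 1)))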

2.12. SortByKey:

It works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering. The implicit ordering that is in the closest scope will be used.
When called on a dataset of (K, V) pairs where K is ordered, it returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the ascending argument.
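For example:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val asc  = pairs.sortByKey()                   // ascending by default
val desc = pairs.sortByKey(ascending = false)  // descending
// asc.collect() == Array(("a", 1), ("b", 2), ("c", 3))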

2.13. Join:

It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
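A small sketch with two pair RDDs (the names and values are made up for illustration):

val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("carol", "Rome")))
// inner join keeps only keys present in both RDDs
// ages.join(cities).collect() == Array(("alice", (30, "Paris")))
// ages.leftOuterJoin(cities) keeps every left-side key, with Option values for the right side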

2.14. Coalesce:

It changes the number of partitions in which the data is stored. It combines the original partitions into a smaller number of partitions, so it reduces the number of partitions. It is an optimized version of repartition that minimizes data movement, but it only works when you are decreasing the number of RDD partitions. It helps run operations more efficiently after filtering a large dataset.
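A sketch of shrinking the partition count after a heavy filter:

val big = sc.parallelize(1 to 1000, 100)
val filtered = big.filter(_ % 10 == 0)
val compact = filtered.coalesce(4)   // 100 partitions -> 4, without a full shuffle
// compact.getNumPartitions == 4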

2.15. Repartition:

Repartition reshuffles the data in your RDD to produce the number of partitions you request. It may decrease or increase the number of partitions, and it shuffles data across the network.
Before using these transformations / actions you have to install Spark; to install Spark, follow this Installation Guide.
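For example:

val data = sc.parallelize(1 to 1000, 2)
val spread = data.repartition(8)   // always shuffles; can increase or decrease the partition count
// spread.getNumPartitions == 8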

3. Actions

An action triggers computation and returns a final result of the RDD computations. It uses the lineage graph to load data from the original RDD, carries out all intermediate transformations, and returns the final value to the driver program or writes it out to the file system. Actions are synchronous, and only an action can materialize a value in a Spark program with real data. Actions run jobs using SparkContext.runJob or directly DAGScheduler.runJob.

3.1. Count ():

It returns the number of elements or items in the RDD, i.e. it counts the number of items present in the dataset and returns that number.
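For example:

sc.parallelize(Seq(1, 2, 3, 4)).count()   // returns 4 (a Long) to the driver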

3.2. Collect():

It returns all the data / elements present in an RDD to the driver in the form of an array. It is commonly used to print values back to the console when debugging programs.
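For example:

val nums = sc.parallelize(Seq(1, 2, 3))
nums.collect()   // Array(1, 2, 3) -- all elements come back to the driver, so use it only on small RDDs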

3.3. Reduce():

It takes a function of two arguments, an accumulator and a value, that should be commutative and associative in the mathematical sense. It reduces a list of elements to a single result. The function produces the same result when repeatedly applied to the same RDD data across multiple partitions, irrespective of element order. It is a wide operation.
It executes the provided function to combine the elements into a result. The function takes two arguments and returns one. It should be both commutative and associative so that it generates a reproducible result in parallel.
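A minimal sketch:

sc.parallelize(1 to 5).reduce(_ + _)   // 15; addition is commutative and associative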

3.4. Take(n):

It fetches the first n requested elements of the RDD and returns them as an array.
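For example:

sc.parallelize(1 to 100).take(3)   // Array(1, 2, 3)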

3.5. First():

Retrieves the very first element of the RDD. It is similar to take(1).
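For example:

sc.parallelize(Seq(10, 20, 30)).first()   // 10, like take(1) but returning the element itself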

3.6. TakeSample():

It is an action used to return a fixed-size random sample of an RDD. It includes a Boolean option for sampling with or without replacement and an optional random generator seed. It returns an array and internally randomizes the order of the elements returned.
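For example (the seed is optional and shown only for reproducibility):

val nums = sc.parallelize(1 to 100)
nums.takeSample(withReplacement = false, num = 5, seed = 42L)   // an Array of 5 random elements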

3.7. TakeOrdered(count & ordering):

Fetches the first n items of the RDD ordered by the specified ordering function, either the default natural order or a custom comparator.
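For example:

val nums = sc.parallelize(Seq(9, 1, 7, 3))
nums.takeOrdered(2)                          // Array(1, 3) -- natural ascending order
nums.takeOrdered(2)(Ordering[Int].reverse)   // Array(9, 7) -- custom ordering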

3.8. CountByKey():

It works on an RDD of two-component tuples (key-value pairs) and counts the values for each distinct key. It counts the number of elements for each key and returns the result to the master as a map of (key, count) pairs.
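For example:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.countByKey()   // Map("a" -> 2, "b" -> 1), returned to the driver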

3.9. Foreach():

It executes the given function on each item in the RDD. It is useful for writing to a database or publishing to a web service. The function is applied once per data item, and its return value is discarded.
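A sketch (note that println output appears in the executor logs, not necessarily on the driver console):

sc.parallelize(Seq(1, 2, 3)).foreach(x => println(x))   // the function runs on the executors for each element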

3.10. SaveAsTextfile():

It writes the content of the RDD to text files, i.e. it saves the RDD as a set of text files under the given directory path using the string representation of each element.
To practically implement and use these APIs, follow this beginner’s guide.
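A sketch; the output path here is only a placeholder:

sc.parallelize(Seq(1, 2, 3)).saveAsTextFile("hdfs:///tmp/numbers-output")   // writes one part-* file per partition under the directory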
See also: http://data-flair.training/blogs/introduction-spark-tutorial-quick-start/
