One of the things I like about Scala is its collections framework. As a non-CS graduate I only very lightly covered functional programming at university, and I'd never come across it until Scala. One of the benefits of Scala is that functional programming concepts can be introduced to the programmer gradually, and one of the first places you'll start to use functional constructs is with the collections framework.
Chances are your first collection will be a list of items, and you might want to apply a function to each item in the list in some way. map works by applying a function to each element in the list. flatMap works by applying a function that returns a sequence for each element in the list, and then flattening the results into a single list. This is easier to show than to explain:
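A minimal sketch of the two methods on a List (the post's original snippet isn't reproduced here, so the values are made up):

    val numbers = List(1, 2, 3)

    // map applies the function to every element, giving a list of the same length
    numbers.map(n => n * 2)                // List(2, 4, 6)

    // flatMap applies a function that returns a sequence for each element,
    // then flattens all of those sequences into one list
    numbers.flatMap(n => List(n, n * 10))  // List(1, 10, 2, 20, 3, 30)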
So with that all covered, let's look at how you can apply those concepts to a Map. A Map can be implemented in a number of different ways, but regardless of how it is implemented it can be thought of as a sequence of tuples, where each tuple is a pair of items: the key and the value.
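A tiny hypothetical example makes the tuple view concrete:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)

    m.toList  // List((1,1), (2,2), (3,3)) -- a sequence of (key, value) tuples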
So we want to think about using map and flatMap on our Map, but because of the way a Map works that often doesn't make quite the same sense. We probably don't want to apply a function to the whole tuple, but to the value side of the tuple, leaving the key as it is; for example, we might want to double all the values. Map provides us with a function to do exactly that.
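That function is mapValues; a minimal sketch, using the same hypothetical map as above:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)

    m.mapValues(_ * 2)  // doubles every value, keys untouched: 1 -> 2, 2 -> 4, 3 -> 6

(In Scala 2.13 and later mapValues returns a lazy MapView, so add .toMap if you want a strict Map back.)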
But in my case I wanted to do something more like flatMap: I want a map to come out that misses out the key 1, because its value is None. flatMap doesn't work on maps the way mapValues does; it gets passed the whole tuple, and if it returns a List of single items you'll get a list back, but if you return a tuple you'll get a Map back.
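The post's actual map and function aren't shown, so as a stand-in assume a map of Ints and a hypothetical f that returns an Option, with None for the value under key 1:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)
    def f(x: Int): Option[Int] = if (x == 1) None else Some(x * 2)

    // returning single values gives back a flat collection of values, not a Map
    m.flatMap(e => f(e._2))                  // the Some values only: 4 and 6

    // returning (key, value) tuples gives back a Map
    m.flatMap(e => List((e._1, e._2 * 10)))  // Map(1 -> 10, 2 -> 20, 3 -> 30)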
OK, so we are pretty close to using Options with flatMap; we just need to filter out our Nones. We can return a list by using just e => f(e._2), and we'll get the list of values without the Nones, but that isn't really what I want. What I need to do is return an Option containing a tuple. So here's our updated function:
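Roughly the following, reusing the hypothetical m and f from the sketch above:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)
    def f(x: Int): Option[Int] = if (x == 1) None else Some(x * 2)

    // the Option-of-tuple is flattened away, dropping the entry where f gave None
    m.flatMap(e => f(e._2).map(v => (e._1, v)))  // Map(2 -> 4, 3 -> 6)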
But this is pretty ugly; all those _1s and _2s make me sad. If only there were a nice way of unapplying the tuple into variables. Given that this works in Python, and in a number of places in Scala, I thought this code should work:
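Something like the following two-parameter form, again with the hypothetical m and f:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)
    def f(x: Int): Option[Int] = if (x == 1) None else Some(x * 2)

    // does not compile in Scala 2: flatMap expects a single-argument function
    // taking the whole tuple, not a two-argument function taking key and value
    m.flatMap((k, v) => f(v).map(r => (k, r)))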
I spent way too long today looking at this (in 5-minute chunks broken up by meetings, to be fair) before I gave in and asked a coworker what the hell I was missing. The answer, it seems, is that an unapply is normally only executed in a PartialFunction, which in Scala is most easily defined with a case statement. So this is the code that works as expected:
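A sketch of that form, with the hypothetical m and f as before:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)
    def f(x: Int): Option[Int] = if (x == 1) None else Some(x * 2)

    // the case pattern unapplies each tuple into k and v
    m.flatMap { case (k, v) => f(v).map(r => (k, r)) }  // Map(2 -> 4, 3 -> 6)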
Note that we switch to using curly braces, indicating a function block rather than parameters, and that the block is a case statement. In Scala a block of case statements is how you write a pattern-matching anonymous function (and a PartialFunction): each tuple passed in by flatMap is matched against the case statement, and the unapply method on Tuple2 extracts the contents of the tuple into the variables. This form of variable extraction is very common, and you'll see it used a lot.
There is of course another way of writing that code that doesn’t use
flatMap. Since what we are doing is removing all members of the map that
don’t match a predicate, this is a use for the filter method:
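Sticking with the hypothetical m and f, the predicate is just whether f yields a value:

    val m = Map(1 -> 1, 2 -> 2, 3 -> 3)
    def f(x: Int): Option[Int] = if (x == 1) None else Some(x * 2)

    // keep only the entries whose value f maps to a Some
    m.filter { case (_, v) => f(v).isDefined }  // Map(2 -> 2, 3 -> 3)

Note that this keeps the original values rather than the ones f produced, so it only matches the flatMap version when all you want is to drop the offending entries.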