Tips: Spark


Pass configuration values from a property file

  • spark-submit supports loading configuration values from a file
  • It reads whitespace-delimited key/value pairs from this file
  • Customize the exact location of the file using the --properties-file flag to spark-submit

$ bin/spark-submit \
--class com.example.MyApp \
--properties-file my-config.conf \
my-app.jar

## Contents of my-config.conf ##
spark.master local[4]
spark.app.name "My Spark App"
spark.ui.port 36000

Configuration values PRECEDENCE

set() > spark-submit > properties file > default 

  • Spark has a specific precedence order.
  • The highest priority is given to configurations declared explicitly in the user's code using the set() function on a SparkConf object.
  • Next are flags passed to spark-submit,
  • then values in the properties file, and
  • finally default values.
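As a sketch of the top of that precedence chain, a value set explicitly on the SparkConf in user code wins over anything supplied via spark-submit flags or the properties file (the app name and port values here are illustrative, and the snippet assumes a Spark runtime on the classpath):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf();
conf.set("spark.app.name", "My Spark App"); // set() in user code: highest precedence
conf.set("spark.ui.port", "36000");         // wins over --conf flags and my-config.conf
JavaSparkContext ctx = new JavaSparkContext(conf);
```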



Jobs, Stages and Tasks

  • User code defines a DAG (directed acyclic graph) of RDDs.
  • Actions force translation of the DAG to an execution plan.
  • Spark's scheduler submits a job to compute all needed RDDs.
  • That job will have one or more stages, which are parallel waves of computation composed of tasks.
  • A stage is a set of independent tasks all computing the same function that need to run as part of a Spark job.
  • The DAG run by the scheduler is split into stages at the boundaries where a shuffle occurs.
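One way to see those stage boundaries is toDebugString(), which prints an RDD's lineage with shuffle boundaries shown by indentation. A sketch (the word-count data is illustrative; assumes an existing JavaSparkContext ctx):

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaRDD<String> words = ctx.parallelize(Arrays.asList("a", "b", "a"));
JavaPairRDD<String, Integer> counts = words
    .mapToPair(w -> new Tuple2<>(w, 1))
    .reduceByKey((x, y) -> x + y);   // reduceByKey shuffles, so a new stage begins here
System.out.println(counts.toDebugString());
```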


Level of Parallelism

  • An RDD is divided into a set of partitions.
  • When Spark schedules and runs tasks, it creates a single task for data stored in one partition, and that task will require, by default, a single core in the cluster to execute.
  • Input RDDs typically choose parallelism based on the underlying storage system.
  • HDFS input RDDs have one partition for each block of the underlying HDFS file.
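For RDDs created in code, you can request an explicit partition count and inspect it. A sketch (the data and the count of 4 are illustrative; assumes an existing JavaSparkContext ctx):

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Request 4 partitions; an action on this RDD runs one task per partition
JavaRDD<Integer> rdd = ctx.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
System.out.println(rdd.partitions().size()); // 4
```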

Degree of parallelism can affect performance in two ways:

  • Too little parallelism => Spark might leave resources idle.
  • Too much parallelism => small overheads associated with each partition can add up and become significant.

Tune the degree of parallelism for operations:

  • During operations that shuffle data, you can always pass a degree of parallelism for the produced RDD as a parameter.
  • Any existing RDD can be redistributed to have more or fewer partitions. The repartition() operator will randomly shuffle an RDD into the desired number of partitions.
  • To reduce the number of partitions, use the coalesce() operator; this is more efficient than repartition() since it avoids a shuffle operation.
    eg:- the RDD returned by filter() will have the same number of partitions as its parent and might have many empty or small partitions. In this case you can improve the application's performance by coalescing down to a smaller number of partitions.
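A sketch of that filter-then-coalesce pattern (the data and partition counts are illustrative; assumes an existing JavaSparkContext ctx):

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> input = ctx.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 100); // many partitions, mostly empty
JavaRDD<Integer> evens = input.filter(x -> x % 2 == 0); // keeps the parent's 100 partitions
JavaRDD<Integer> packed = evens.coalesce(2);            // shrink to 2 partitions without a shuffle
System.out.println(packed.partitions().size()); // 2
```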


Serialization

  • When Spark is transferring data over the network or spilling data to disk, it needs to serialize objects into a binary format.
  • By default Spark will use Java's built-in serializer.
  • Spark also supports Kryo, a third-party serialization library that improves on Java serialization by offering both faster serialization times and a more compact binary representation.
  • Set spark.serializer to org.apache.spark.serializer.KryoSerializer.
    • Registering a class allows Kryo to avoid writing full class names with individual objects, a space savings that can add up over thousands or millions of serialized records.
    • If you want to force this type of registration, you can set spark.kryo.registrationRequired to true, and Kryo will throw errors if it encounters an unregistered class.
    • You may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface.
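A sketch of enabling Kryo and registering classes on a SparkConf (MyClass is a placeholder for one of your own serialized types):

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf();
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrationRequired", "true");       // fail fast on unregistered classes
conf.registerKryoClasses(new Class<?>[]{ MyClass.class }); // avoids writing full class names per object
```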

Fixing Errors

  1. java.lang.OutOfMemoryError: Java heap space

           Fix: Increase the driver memory using --driver-memory (or, if the error occurs on an executor, increase executor memory using --executor-memory)
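For example, to submit with a larger driver heap (the jar name and the 4g size are illustrative):

```shell
$ bin/spark-submit \
--class com.example.MyApp \
--driver-memory 4g \
my-app.jar
```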


Spark Actions


reduce()

  • Reduces the elements of this RDD using the specified commutative and associative binary operator.
  • reduce(func) takes a function that accepts two elements of the RDD's element type as arguments and returns an element of the same type.


JavaRDD<Integer> jrdd1 = ctx.parallelize(Arrays.asList(1, 2, 3, 4));

Integer sum = jrdd1.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});

System.out.println("SUM: " + sum);



fold()

  • Aggregates the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value".
  • fold() is similar to reduce() except that it takes a 'zero value' (think of it as a kind of initial value) which will be used in the initial call on each partition.
  • Both fold() and reduce() require that the return type of our result be the same type as that of the elements in the RDD we are operating over.

JavaRDD<Integer> rdd = ctx.parallelize(Arrays.asList(1, 2, 3, 4));
System.out.println("Num Partitions: " + rdd.partitions().size());

Integer result = rdd.fold(0,
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer x, Integer y) {
            return x + y;
        }
    });

System.out.println("FOLD Result: " + result);



About shalishvj : My Experience with BigData

  • 6+ years of experience using Bigdata technologies in Architect, Developer and Administrator roles for various clients.
  • Experience using Hortonworks, Cloudera, AWS distributions.
  • Cloudera Certified Developer for Hadoop.
  • Cloudera Certified Administrator for Hadoop.
  • Spark Certification from Big Data Spark Foundations.
  • SCJP, OCWCD.
  • Experience in setting up Hadoop clusters in PROD, DR, UAT, DEV environments.
This entry was posted in spark, Tips.
