Pass configuration values from a property file
- spark-submit supports loading configuration values from a file
- read whitespace-delimited key/value pairs from this file
- customize the exact location of the file using the --properties-file flag to spark-submit
$ bin/spark-submit \
  --class com.example.MyApp \
  --properties-file my-config.conf \
  myApp.jar

## Contents of my-config.conf ##
spark.app.name "My Spark App"
Precedence of configuration values
set() > spark-submit > properties file > default
- Spark applies a specific precedence order.
- The highest priority goes to configurations declared explicitly in the user's code using the set() function on a SparkConf object.
- Next come flags passed to spark-submit,
- then values in the properties file,
- and finally default values.
Jobs, stages, and tasks
- User code defines a DAG (directed acyclic graph) of RDDs.
- Actions force translation of the DAG into an execution plan.
- Spark's scheduler submits a job to compute all the RDDs needed by the action.
- A job has one or more stages, which are parallel waves of computation composed of tasks.
- A stage is a set of independent tasks that all compute the same function as part of a Spark job.
- The scheduler splits each job's DAG into stages at the boundaries where a shuffle occurs.
Level of Parallelism
- RDD is divided into a set of partitions
- When Spark schedules and runs tasks, it creates a single task for data stored in one partition, and that task will require, by default, a single core in the cluster to execute.
- Input RDDs typically choose parallelism based on the underlying storage systems.
- HDFS input RDDs have one partition for each block of the underlying HDFS file.
The degree of parallelism can affect performance in two ways:
- Too little parallelism => Spark might leave resources idle.
- Too much parallelism => the small overheads associated with each partition add up and become significant.
Tune the degree of parallelism for operations:
- Operations that shuffle data accept a degree of parallelism for the produced RDD as an optional parameter.
- Any existing RDD can be redistributed to have more or fewer partitions: the repartition() operator randomly shuffles an RDD into the desired number of partitions.
- When shrinking an RDD, prefer the coalesce() operator; it is more efficient than repartition() since it avoids a shuffle operation.
e.g., the RDD returned by filter() has the same number of partitions as its parent and might have many empty or small partitions. In this case you can improve the application's performance by coalescing down to fewer partitions.
Serialization format
- When Spark is transferring data over the network or spilling data to disk, it needs to serialize objects into a binary format.
- By default, Spark uses Java's built-in serializer.
- Spark also supports Kryo, a third-party serialization library that improves on Java serialization by offering both faster serialization times and a more compact binary representation.
- Set spark.serializer to org.apache.spark.serializer.KryoSerializer to enable it.
- Registering a class allows Kryo to avoid writing full class names with individual objects, a space savings that can add up over thousands or millions of serialized records.
- If you want to force this type of registration, you can set spark.kryo.registrationRequired to true, and Kryo will throw errors if it encounters an unregistered class.
- You may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface.