Developer’s template: Spark

The Developer's template series is intended to ease the life of Big Data developers during application development and leave behind the headache of starting from scratch. The following program helps you develop and execute an application using Apache Spark with Java.


Prerequisites

  • Hadoop cluster
  • Eclipse
  • Maven
  • Java


** This program has been tested with Hortonworks HDP 2.3.2, Spark 1.4.1, and Java 1.7.0_79.

Spark-Java code


import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {

    // Splits each input line into individual words
    private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) throws Exception {
                    return Arrays.asList(s.split(" "));
                }
            };

    // Maps each word to a (word, 1) pair
    private static final PairFunction<String, String, Integer> WORDS_MAPPER =
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) throws Exception {
                    return new Tuple2<String, Integer>(s, 1);
                }
            };

    // Sums the counts for each word
    private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer a, Integer b) throws Exception {
                    return a + b;
                }
            };

    public static void main(String[] args) {

        // Paths are hardcoded here; they can also be taken from args[0]/args[1]
        // as passed on spark-submit (see the run command below)
        String inpath = "hdfs://<hdfs>:8020/tmp/extTable/test.csv";
        String outpath = "hdfs://<hdfs>:8020/tmp/extTable/out/out1";

        SparkConf conf = new SparkConf().setAppName("sparkAction");
        JavaSparkContext context = new JavaSparkContext(conf);
        JavaRDD<String> file = context.textFile(inpath);
        JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
        JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
        JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);
        counter.saveAsTextFile(outpath);
        context.stop();
    }
}
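The three functions above form the classic word-count pipeline: extract words, emit (word, 1) pairs, and sum by key. As a plain-Java sketch of what the job computes over each line of the input (no cluster needed; `WordCountLocal` is a hypothetical helper class for illustration, not part of the template):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountLocal {

    // Mirrors the Spark job: split lines (WORDS_EXTRACTOR), emit (word, 1)
    // pairs (WORDS_MAPPER), and sum the counts per word (WORDS_REDUCER)
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                Integer prev = counts.get(word);
                counts.put(word, prev == null ? 1 : prev + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two lines of sample input: "a" and "b" appear twice, "c" once
        System.out.println(count(new String[]{"a b a", "b c"}));
    }
}
```

In the Spark version the same three steps run distributed across executors, with `reduceByKey` merging the per-partition counts.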


pom.xml

A minimal pom.xml for this template. The groupId below is a placeholder to adjust for your organization; the artifactId and version match the jar name used in the spark-submit command, and spark-core_2.10 1.4.1 matches the tested Spark version.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Placeholder groupId; change to your organization -->
  <groupId>com.example</groupId>
  <artifactId>sparkTest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.4.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <!-- Additional configuration. -->

</project>

Generate Jar

mvn clean package


Run the application

spark-submit --class WordCount --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 \
--conf spark.yarn.jar=hdfs://<hdfs>/sparkSamples/spark-assembly- \
--queue <queuename> \
hdfs://<hdfs>/sparkSamples/sparkTest-0.0.1-SNAPSHOT.jar <input dir hdfs> \
<out dir hdfs>
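The command above passes the input and output HDFS directories as application arguments, while main() above hardcodes its paths. If you want the job to honor those arguments, main() could resolve each path with a fallback default; a minimal sketch (the `ArgPaths` class and `pathOrDefault` helper are hypothetical names, not part of the template):

```java
public class ArgPaths {

    // Returns args[i] when provided, otherwise the hardcoded default
    static String pathOrDefault(String[] args, int i, String dflt) {
        return args.length > i ? args[i] : dflt;
    }

    public static void main(String[] args) {
        String inpath = pathOrDefault(args, 0, "hdfs://<hdfs>:8020/tmp/extTable/test.csv");
        String outpath = pathOrDefault(args, 1, "hdfs://<hdfs>:8020/tmp/extTable/out/out1");
        System.out.println("input:  " + inpath);
        System.out.println("output: " + outpath);
    }
}
```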


Stay tuned...


About shalishvj : My Experience with BigData

6+ years of experience using Big Data technologies in Architect, Developer and Administrator roles for various clients.
  • Experience using Hortonworks, Cloudera, AWS distributions.
  • Cloudera Certified Developer for Hadoop.
  • Cloudera Certified Administrator for Hadoop.
  • Spark Certification from Big Data Spark Foundations.
  • SCJP, OCWCD.
  • Experience in setting up Hadoop clusters in PROD, DR, UAT, DEV environments.
This entry was posted in Java-Maven-Hadoop, spark.
