Tag Archives: Hadoop

Hadoop / Spark on Windows

Hadoop on Windows

Download the required binaries (e.g., winutils.exe) necessary to run Hadoop.
Download link: https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip
Add winutils.exe to $HADOOP_HOME/bin.
Set $HADOOP_HOME and $JAVA_HOME under environment variables.
Reference: http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path

Spark on Windows

While running Spark, you can refer to a local path in … Continue reading
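As an alternative to setting the HADOOP_HOME environment variable, the Hadoop home directory can be set programmatically before Hadoop/Spark code runs. A minimal sketch, assuming a hypothetical install path `C:\hadoop` whose `bin` folder contains winutils.exe:

```java
public class WinutilsSetup {
    public static void main(String[] args) {
        // Hypothetical path: point this at the directory whose bin/ holds winutils.exe.
        // Hadoop reads this system property as a fallback for HADOOP_HOME.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        System.out.println(System.getProperty("hadoop.home.dir"));
    }
}
```

This must run before the first Hadoop class is loaded in the JVM, or the property is ignored.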

Posted in Hadoop, spark, Uncategorized

Spark -> Parquet, ORC

Create Java RDDs

String filePath = "hdfs://<HDFSName>:8020/user…";
String outFile = "hdfs://<HDFSName>:8020/user…";
SparkConf conf = new SparkConf().setAppName("appname");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> inFileRDD = jsc.textFile(filePath);

Remove initial empty lines from a file

import org.apache.spark.api.java.function.Function2;
Function2 removeSpace = new Function2<Integer, Iterator<String>, Iterator<String>>(){ … Continue reading
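The Function2 above is the kind of per-partition function you would hand to mapPartitionsWithIndex. A sketch of its core logic in plain Java (class and method names are illustrative, and no cluster is needed to see the behavior): drop blank lines only at the start of partition 0, i.e., the start of the file.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SkipLeadingEmpty {
    // Mirrors Function2<Integer, Iterator<String>, Iterator<String>>:
    // skip blank lines only while at the very beginning of partition 0.
    public static Iterator<String> call(Integer partitionIndex, Iterator<String> lines) {
        List<String> out = new ArrayList<>();
        boolean skipping = (partitionIndex == 0);
        while (lines.hasNext()) {
            String line = lines.next();
            if (skipping && line.trim().isEmpty()) {
                continue;           // drop a leading blank line
            }
            skipping = false;       // first non-blank line ends the skip
            out.add(line);
        }
        return out.iterator();
    }
}
```

In Spark this would run once per partition, so only the partition containing the start of the file loses its leading blanks.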

Posted in spark, Uncategorized

Hadoop File Formats

Text – RCFiles – Parquet – ORC Compression

Based on one study, relative file sizes for the same data were:
Text: original size
RCFile: 14% smaller
Parquet: 62% smaller
ORC: 78% smaller

Considerations for ORC over Parquet are: 1. ORC … Continue reading
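To make the percentages above concrete, here is a small worked example using an illustrative 1,000 MB text file (the figure is hypothetical; only the percentages come from the text):

```java
public class FormatSizes {
    // Size remaining after shrinking by pctSmaller percent.
    public static long smaller(long originalMb, int pctSmaller) {
        return originalMb * (100 - pctSmaller) / 100;
    }

    public static void main(String[] args) {
        long textMb = 1000; // illustrative original text file size
        System.out.println("RCFile:  " + smaller(textMb, 14) + " MB"); // 860 MB
        System.out.println("Parquet: " + smaller(textMb, 62) + " MB"); // 380 MB
        System.out.println("ORC:     " + smaller(textMb, 78) + " MB"); // 220 MB
    }
}
```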

Posted in Hadoop, hive

Spark: Configuration, Execution, Performance

Configuration

Pass configuration values from a property file:
spark-submit supports loading configuration values from a file
it reads whitespace-delimited key/value pairs from this file
customize the exact location of the file using the --properties-file flag to spark-submit

$ bin/spark-submit \
--class … Continue reading
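The whitespace-delimited key/value format that spark-submit reads happens to match what java.util.Properties accepts, so the parsing can be sketched without Spark at all. The keys and values below are illustrative:

```java
import java.io.StringReader;
import java.util.Properties;

public class ConfFileDemo {
    // Parse whitespace-delimited key/value pairs, as found in a
    // spark-submit --properties-file (e.g., conf/spark-defaults.conf).
    public static Properties parse(String text) throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    public static void main(String[] args) throws Exception {
        String conf = "spark.master local[2]\nspark.app.name MyApp\n";
        Properties p = parse(conf);
        System.out.println(p.getProperty("spark.master")); // local[2]
    }
}
```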

Posted in spark, Tips

Developer’s template: MapReduce (Java)

The Developer’s template series is intended to ease the life of Bigdata developers with their application development and leave behind the headache of starting from scratch. Here is a MapReduce Java program with its pom file. Prerequisites: Hadoop cluster, Eclipse, Maven, Java … Continue reading
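The classic starting point for such a template is word count. As a sketch of what the template's Mapper and Reducer jointly compute, here is the map-and-reduce logic condensed into plain Java with no Hadoop dependency (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountLogic {
    // "Map": tokenize into words; "reduce": sum the counts per word.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum); // per-key aggregation, as a Reducer does
        }
        return counts;
    }
}
```

In the real MapReduce program, the tokenizing loop lives in the Mapper (emitting (word, 1) pairs) and the summation in the Reducer.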

Posted in Java-Maven-Hadoop

Developer’s template: Spark

The Developer’s template series is intended to ease the life of Bigdata developers with their application development and leave behind the headache of starting from scratch. The following program helps you develop and execute an application using Apache Spark with Java. Prerequisites: Hadoop … Continue reading

Posted in Java-Maven-Hadoop, spark

Tips: Spark

Execute SparkPi

From the Spark directory (usually /usr/hdp/current/spark-client in the case of Hortonworks HDP 2.3.2), run:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Stay tuned…
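SparkPi estimates pi by Monte Carlo sampling: it throws random points at the unit square and counts how many land inside the unit circle. A single-JVM sketch of the same computation each executor performs (no cluster required; names are illustrative):

```java
import java.util.Random;

public class PiEstimate {
    // Sample points in [-1, 1] x [-1, 1]; the fraction inside the unit
    // circle approaches pi/4, so 4 * fraction approaches pi.
    public static double estimate(int samples, long seed) {
        Random rnd = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rnd.nextDouble() * 2 - 1;
            double y = rnd.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) inside++;
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_000_000, 42L)); // roughly 3.14
    }
}
```

The trailing `10` in the spark-submit command above is SparkPi's slice count; each slice runs a loop like this one in parallel.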

Posted in spark, Tips