Monthly Archives: May 2016

Spark -> Parquet, ORC

Create Java RDDs
String filePath = "hdfs://<HDFSName>:8020/user…";
String outFile = "hdfs://<HDFSName>:8020/user…";
SparkConf conf = new SparkConf().setAppName("appname");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> inFIleRDD = jsc.textFile(filePath);

Remove initial empty lines from a file
import org.apache.spark.api.java.function.Function2;
Function2 removeSpace = new Function2<Integer, Iterator<String>, Iterator<String>>(){ … Continue reading
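The body of removeSpace is cut off in this excerpt, so the following is only a minimal sketch of what such a function might do, written in plain Java without Spark so it can be run on its own. The class name LeadingEmptyLines and the method dropLeadingEmptyLines are hypothetical, and the assumption that only the first partition (index 0) holds the start of the file is mine, not the post's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LeadingEmptyLines {

    // Hypothetical helper: drops the run of empty (or whitespace-only) lines
    // at the start of partition 0 and passes other partitions through as-is.
    // Assumption: partition 0 contains the beginning of the file.
    public static Iterator<String> dropLeadingEmptyLines(int partitionIndex,
                                                         Iterator<String> lines) {
        if (partitionIndex != 0) {
            return lines;
        }
        // Buffers the partition for simplicity; a lazy iterator would avoid this.
        List<String> kept = new ArrayList<>();
        boolean seenContent = false;
        while (lines.hasNext()) {
            String line = lines.next();
            if (!seenContent && line.trim().isEmpty()) {
                continue; // still inside the leading run of empty lines
            }
            seenContent = true; // later empty lines are kept
            kept.add(line);
        }
        return kept.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> result = dropLeadingEmptyLines(
                0, Arrays.asList("", "  ", "header", "", "row").iterator());
        while (result.hasNext()) {
            System.out.println("[" + result.next() + "]");
        }
    }
}
```

The two parameters mirror the Function2<Integer, Iterator<String>, Iterator<String>> type shown above, which is the shape JavaRDD.mapPartitionsWithIndex accepts, so a function like this could be wrapped in a Function2 and applied to the input RDD.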

Posted in spark, Uncategorized

Hadoop File Formats

Text – RCFiles – Parquet – ORC

Compression
Based on a study conducted, file sizes relative to plain text were:
Text: original size
RCFiles: 14% smaller
Parquet: 62% smaller
ORC: 78% smaller

Considerations for ORC over Parquet are:
1. ORC … Continue reading

Posted in Hadoop, hive

Spark: Configuration, Execution, Performance

Configuration

Pass configuration values from a property file
spark-submit supports loading configuration values from a file. It reads whitespace-delimited key/value pairs from this file. The exact location of the file can be customized using the --properties-file flag to spark-submit.

$ bin/spark-submit \
  --class … Continue reading
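To illustrate the whitespace-delimited key/value format described above, here is a small sketch that parses such text with java.util.Properties, which accepts whitespace as a key/value separator. It is a stand-in for illustration only and makes no claim about how spark-submit reads the file internally. The class name PropertiesFileSketch and the sample values are hypothetical; spark.master, spark.app.name and spark.ui.port are used as example keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesFileSketch {

    // Parses properties-style text in which each line holds a key,
    // some whitespace, and then the value.
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Illustrative contents, not taken from the original post.
        String sample =
                "# comment lines start with '#'\n"
              + "spark.master      local[4]\n"
              + "spark.app.name    MyApp\n"
              + "spark.ui.port     36000\n";

        Properties props = parse(sample);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " -> " + props.getProperty(key));
        }
    }
}
```

A file with these contents would be the kind of file whose location is passed through the --properties-file flag.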

Posted in spark, Tips