Spark -> Parquet,ORC

Create Java RDDs

String filePath = “hdfs://<HDFSName>:8020/user…”
String outFile = “hdfs://<HDFSName>:8020/user…”
SparkConf conf = new SparkConf().setAppName(“appname”);
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> inFIleRDD = jsc.textFile(filePath);

Remove initial empty lines from a file

import org.apache.spark.api.java.function.Function2;
Function2 removeSpace = new Function2<Integer, Iterator<String>, Iterator<String>>(){
public Iterator<String> call (Integer ind, Iterator<String> iterator) throws Exception {
if (ind == 0 && iterator.hasNext() && (iterator.next().trim != “”) {
return iterator;
}
JavaRDD<String> fileWOSpaceRDD = inFileRDD.mapPartitionsWithIndex( removeSpace, false);
ind == 0 takes first partition of the RDD
iterator.hasNext() && (iterator.next().trim != “” makes sure that all empty lines are skipped

Remove initial non -empty lines including headers from a file

import org.apache.spark.api.java.function.Function2;
Function2 removeLines = new Function2<Integer, Iterator<String>, Iterator<String>>(){
public Iterator<String> call (Integer ind, Iterator<String> iterator) throws Exception {
if (ind == 0 && iterator.hasNext() && (iterator.next().trim != “”) {
iterator.next(); //include this as many times as the number of lines you need to remove
return iterator;
}
else return iterator;
}
JavaRDD<String> fileWOSpaceRDD = inFileRDD.mapPartitionsWithIndex( removeLines, false);
ind == 0 takes first partition of the RDD
include iterator.next(); as many times as the number of lines you need to remove

Store Files in Parquet Format

Create a java bean class (SparkBean) with the fields age, id
JavaRDD<String> inFIleRDD = jsc.textFile(filePath);
JavaRDD<SparkBean> sBRDD = inFileRDD.map(new Function<String, SparkBean> ()){
 public SparkBean call(String line) throws Exception{
String[] parts = line.trim().split(“,”);
SparkBean sb = new SparkBean();
sb.setAge();
sb.setId();
return sb;
}
}

Save as Parquet File

SQLContext sqlCtx = new SQLContext(sparkconf);
DataFrame sd = sqlCtx.createDataFrame(sBRDD, SparkBean.class);
sd.registerTempTable(“testTable”);
sd.saveAsParquetFile(outFile); //in HDFS

Versions and POM File

groupId : org.apache.spark
artifactId : spark-core_2.10
version : 1.6.1
groupId : org.apache.spark
artifactId : spark-sql_2.10
version : 1.3.1

 

groupId : org.apache.spark
artifactId : spark-hive_2.10
version : 1.4.1

Execute

spark-submit –class <fully qualified classname> –master yarn-client –queue <queuename>  –num-executors 3 –driver-memory 3g –executor-memory 1g \

–conf spark.sql.parquet.compression.codec=uncompressed –executor-cores 3

<jar namewithlocation>

Create HIVE Tables 

  • FIELD NAMES, TYPES SHOULD BE SAME AS PARQUET TABLE ( BEAN CLASS)
  • NOT CASE SENSITIVE
  • TABLE NAME CAN BE DIFFERENT
create external table parquetTable (
age string,
id string)
STORED AS PARQUET
LOCATION ‘<hdfs file location>’;

Hive Commands

describe parquetTable;
select * from parquetTable limit 5;

Save As ORC

JavaSparkContext jsc = new JavaSparkContext(conf);
HiveContext hc = new HiveContext(jsc.sc());
DataFrame sd = sqlCtx.createDataFrame(sBRDD, SparkBean.class);
sd.select().save(outFilePathHdfs,”org.apache.spark.sql.hive.orc”,SaveMode.Append);

ORC: Create HIVE Tables 

create external table ORCTable (
age string,
id string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS ORC
LOCATION ”
tblproperties (“orc.compress”=”NONE”);
Advertisements

About shalishvj : My Experience with BigData

6+ years of experience using Bigdata technologies in Architect, Developer and Administrator roles for various clients. • Experience using Hortonworks, Cloudera, AWS distributions. • Cloudera Certified Developer for Hadoop. • Cloudera Certified Administrator for Hadoop. • Spark Certification from Big Data Spark Foundations. • SCJP, OCWCD. • Experience in setting up Hadoop clusters in PROD, DR, UAT , DEV environments.
This entry was posted in spark, Uncategorized and tagged , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s