Hands-on Flume

Intro

  • Flume can be used to collect and transport massive quantities of event data.
  • A Flume event is a unit of data flow, consisting of a byte payload and an optional set of string headers.
  • A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination.
  • Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination.
  • The events are staged in a channel on each agent.
  • The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository.
  • A durable channel (such as the file channel, which is backed by the local filesystem) manages recovery from failure.
  • There is also a memory channel, which simply stores events in an in-memory queue. This is faster, but any events still in the memory channel when the agent process dies cannot be recovered.
  • Flume agent configuration is stored in a local configuration file.
    Configurations for one or more agents can be specified in the same configuration file.
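The multi-hop and durable-channel points above can be sketched as two chained agent configurations, with the first agent's Avro sink feeding the second agent's Avro source. This is an illustrative sketch, not from the original post: the agent names, ports, and file-channel directories are assumptions.

# Hypothetical first hop: netcat source -> file channel -> avro sink
agentOne.sources = s1
agentOne.channels = c1
agentOne.sinks = k1

agentOne.sources.s1.type = netcat
agentOne.sources.s1.bind = 0.0.0.0
agentOne.sources.s1.port = 44444
agentOne.sources.s1.channels = c1

# File channel: events survive an agent crash because they are staged on local disk
agentOne.channels.c1.type = file
agentOne.channels.c1.checkpointDir = /tmp/flume/checkpoint
agentOne.channels.c1.dataDirs = /tmp/flume/data

# Avro sink hands events to the next agent in the chain
agentOne.sinks.k1.type = avro
agentOne.sinks.k1.hostname = <second agent host>
agentOne.sinks.k1.port = 4545
agentOne.sinks.k1.channel = c1

# Hypothetical second hop: an Avro source on agentTwo receives what agentOne sends
agentTwo.sources = s1
agentTwo.sources.s1.type = avro
agentTwo.sources.s1.bind = 0.0.0.0
agentTwo.sources.s1.port = 4545

Events are removed from agentOne's file channel only after agentTwo has taken them, which is what makes the hop-by-hop delivery reliable.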

Hands-on:

Configure a Flume agent to listen on a port and push data into HDFS as it arrives:

  • The sample configuration below defines a single agent named netCatAgentOne.
  • netCatAgentOne has a source that listens for data on port 44444,
    a channel that buffers event data in memory,
    and a sink that writes event data to a directory in HDFS.

 

# Name the components on this agent
netCatAgentOne.sources = r1
netCatAgentOne.sinks = k1
netCatAgentOne.channels = c1

# Describe/configure the source
netCatAgentOne.sources.r1.type = netcat
netCatAgentOne.sources.r1.bind = <machine ip where agent runs>
netCatAgentOne.sources.r1.port = 44444

# Use a channel which buffers events in memory
netCatAgentOne.channels.c1.type = memory
netCatAgentOne.channels.c1.capacity = 1000
netCatAgentOne.channels.c1.transactionCapacity = 100

# HDFS sinks
netCatAgentOne.sinks.k1.type = hdfs
netCatAgentOne.sinks.k1.hdfs.fileType = DataStream
netCatAgentOne.sinks.k1.hdfs.path = /flume/dir
netCatAgentOne.sinks.k1.hdfs.filePrefix = filename
netCatAgentOne.sinks.k1.hdfs.fileSuffix = .txt
netCatAgentOne.sinks.k1.hdfs.batchSize = 1000

# Bind the source and sink to the channel
netCatAgentOne.sources.r1.channels = c1
netCatAgentOne.sinks.k1.channel = c1

Start Flume Agent

flume-ng agent --conf /tmp/flumeNetcatAgentOne --conf-file /tmp/flumeNetcatAgentOne/flumeNetcatHDFSAgent.conf --name netCatAgentOne

Pump data into the port

nc <machine ip where agent runs> 44444
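Once the agent is up, a quick end-to-end check (assuming the agent host and the /flume/dir path configured above) is to send a line through nc and then list the sink directory. Note that the HDFS sink inserts a timestamp between the configured prefix and suffix, so the exact filename will vary:

echo "hello flume" | nc <machine ip where agent runs> 44444
hadoop fs -ls /flume/dir
hadoop fs -cat /flume/dir/filename.*.txt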

 

More Tips

Limit the size of each HDFS file by the number of events it contains (100)

Note that hdfs.batchSize only controls how many events are flushed to HDFS per transaction; it is hdfs.rollCount that actually closes the file after a given number of events. The other roll triggers (rollSize, rollInterval) are set to 0 to disable size- and time-based rolling.

netCatAgentOne.sinks.si1.type = hdfs
netCatAgentOne.sinks.si1.hdfs.path =
netCatAgentOne.sinks.si1.hdfs.filePrefix = agent-
netCatAgentOne.sinks.si1.hdfs.batchSize = 100
netCatAgentOne.sinks.si1.hdfs.rollCount = 100
netCatAgentOne.sinks.si1.hdfs.rollSize = 0
netCatAgentOne.sinks.si1.hdfs.rollInterval = 0
netCatAgentOne.sinks.si1.hdfs.writeFormat = Text
netCatAgentOne.sinks.si1.hdfs.fileType = DataStream
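For comparison, the same sink can be made to roll by size or by time instead. A sketch with illustrative values (normally only one roll trigger should be non-zero):

# Roll every 128 MB, regardless of event count or elapsed time
netCatAgentOne.sinks.si1.hdfs.rollSize = 134217728
netCatAgentOne.sinks.si1.hdfs.rollCount = 0
netCatAgentOne.sinks.si1.hdfs.rollInterval = 0

# ...or roll every 5 minutes instead:
# netCatAgentOne.sinks.si1.hdfs.rollInterval = 300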

 

Stay tuned 🙂


About shalishvj : My Experience with BigData

6+ years of experience using Bigdata technologies in Architect, Developer and Administrator roles for various clients. • Experience using Hortonworks, Cloudera, AWS distributions. • Cloudera Certified Developer for Hadoop. • Cloudera Certified Administrator for Hadoop. • Spark Certification from Big Data Spark Foundations. • SCJP, OCWCD. • Experience in setting up Hadoop clusters in PROD, DR, UAT , DEV environments.
