- Flume can be used to transport massive quantities of event data.
- A Flume event is defined as a unit of data flow with a byte payload and an optional set of string attributes.
- A Flume agent is a JVM process that hosts the components through which events flow from an external source to the next destination.
- Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination.
- The events are staged in a channel on each agent, and the channel manages recovery from failure.
- The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository.
- Flume supports a durable file channel, which is backed by the local file system.
- There is also a memory channel, which simply stores the events in an in-memory queue. It is faster, but any events still left in the memory channel when an agent process dies cannot be recovered.
- Flume agent configuration is stored in a local configuration file.
- Configurations for one or more agents can be specified in the same configuration file.
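The points above about multi-hop flows, durable channels, and several agents sharing one file can be sketched in a single configuration. This is only an illustration under assumptions: the agent names agentA and agentB, the Avro port 4545, and the /var/flume paths are hypothetical and not part of the sample that follows.

```properties
# Hypothetical two-agent flow in one file; names, ports, and paths are assumptions.

# First hop: agentA reads netcat input and forwards it over Avro
agentA.sources = r1
agentA.channels = c1
agentA.sinks = k1

agentA.sources.r1.type = netcat
agentA.sources.r1.bind = <machine ip where agentA runs>
agentA.sources.r1.port = 44444
agentA.sources.r1.channels = c1

# Durable file channel: staged events are kept on local disk
agentA.channels.c1.type = file
agentA.channels.c1.checkpointDir = /var/flume/agentA/checkpoint
agentA.channels.c1.dataDirs = /var/flume/agentA/data

# The Avro sink points at the Avro source of the next hop
agentA.sinks.k1.type = avro
agentA.sinks.k1.hostname = <machine ip where agentB runs>
agentA.sinks.k1.port = 4545
agentA.sinks.k1.channel = c1

# Second hop: agentB receives Avro events and writes them to HDFS
agentB.sources = r1
agentB.channels = c1
agentB.sinks = k1

agentB.sources.r1.type = avro
agentB.sources.r1.bind = <machine ip where agentB runs>
agentB.sources.r1.port = 4545
agentB.sources.r1.channels = c1

# Memory channel: faster, but events are lost if the agent process dies
agentB.channels.c1.type = memory
agentB.channels.c1.capacity = 1000
agentB.channels.c1.transactionCapacity = 100

agentB.sinks.k1.type = hdfs
agentB.sinks.k1.hdfs.path = /flume/dir
agentB.sinks.k1.hdfs.fileType = DataStream
agentB.sinks.k1.hdfs.batchSize = 100
agentB.sinks.k1.channel = c1
```

Each agent would run as its own flume-ng process, selected from the shared file by its name.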
Configure a Flume agent to listen on a port and push data into HDFS as it arrives
- The sample configuration below defines a single agent named netCatAgentOne.
- netCatAgentOne has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that writes event data to a directory in HDFS.
# Name the components on this agent
netCatAgentOne.sources = r1
netCatAgentOne.sinks = k1
netCatAgentOne.channels = c1
# Describe/configure the source
netCatAgentOne.sources.r1.type = netcat
netCatAgentOne.sources.r1.bind = <machine ip where agent runs>
netCatAgentOne.sources.r1.port = 44444
# Use a channel which buffers events in memory
netCatAgentOne.channels.c1.type = memory
netCatAgentOne.channels.c1.capacity = 1000
netCatAgentOne.channels.c1.transactionCapacity = 100
# HDFS sinks
netCatAgentOne.sinks.k1.type = hdfs
netCatAgentOne.sinks.k1.hdfs.fileType = DataStream
netCatAgentOne.sinks.k1.hdfs.path = /flume/dir
netCatAgentOne.sinks.k1.hdfs.filePrefix = filename
netCatAgentOne.sinks.k1.hdfs.fileSuffix = .txt
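# Keep batchSize at or below the channel's transactionCapacity (100 above);
# a larger batch cannot be taken from the memory channel in one transaction
netCatAgentOne.sinks.k1.hdfs.batchSize = 100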
# Bind the source and sink to the channel
netCatAgentOne.sources.r1.channels = c1
netCatAgentOne.sinks.k1.channel = c1
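Before involving HDFS, it can help to confirm that the source is actually receiving events. One possible way, shown here as an assumed debugging variant rather than part of the original setup, is to temporarily point sink k1 at Flume's logger sink, which writes each event to the agent's log at INFO level.

```properties
# Assumed debugging variant: use these two lines in place of the
# netCatAgentOne.sinks.k1.* HDFS settings above
netCatAgentOne.sinks.k1.type = logger
netCatAgentOne.sinks.k1.channel = c1
```

Once events show up in the log, the HDFS sink settings can be restored.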
Start Flume Agent
flume-ng agent --conf /tmp/flumeNetcatAgentOne --conf-file /tmp/flumeNetcatAgentOne/flumeNetcatHDFSAgent.conf --name netCatAgentOne
Pump data into the port
nc <machine ip where agent runs> 44444
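Limit the size of each HDFS file by the number of events (100)
# Roll to a new file after 100 events; setting rollSize and rollInterval to 0
# disables size-based and time-based rolling
netCatAgentOne.sinks.k1.type = hdfs
netCatAgentOne.sinks.k1.hdfs.batchSize = 100
netCatAgentOne.sinks.k1.hdfs.filePrefix = agent-
netCatAgentOne.sinks.k1.hdfs.rollCount = 100
netCatAgentOne.sinks.k1.hdfs.rollSize = 0
netCatAgentOne.sinks.k1.hdfs.rollInterval = 0
netCatAgentOne.sinks.k1.hdfs.writeFormat = Text
netCatAgentOne.sinks.k1.hdfs.fileType = DataStream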
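The HDFS sink has three roll triggers (hdfs.rollInterval in seconds, hdfs.rollSize in bytes, hdfs.rollCount in events), and a value of 0 turns a trigger off. As far as I know the defaults are 30 seconds, 1024 bytes, and 10 events, so any trigger left unset can still roll files early. The sketch below shows two alternatives to count-based rolling for sink k1 of netCatAgentOne; the 5-minute and 128 MB values are assumptions chosen only for illustration.

```properties
# Variant 1 (assumed value): roll every 5 minutes, ignoring size and event count
netCatAgentOne.sinks.k1.hdfs.rollInterval = 300
netCatAgentOne.sinks.k1.hdfs.rollSize = 0
netCatAgentOne.sinks.k1.hdfs.rollCount = 0

# Variant 2 (assumed value): roll at roughly 128 MB instead.
# Uncomment these lines and remove Variant 1 to use it.
#netCatAgentOne.sinks.k1.hdfs.rollInterval = 0
#netCatAgentOne.sinks.k1.hdfs.rollSize = 134217728
#netCatAgentOne.sinks.k1.hdfs.rollCount = 0
```

Only one variant should be active at a time, because a repeated key in a properties file overrides the earlier value.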
Stay tuned 🙂