In the Hadoop world, streaming means publishing data to your cluster in real time. This is a necessary step before analyzing the data, of course. Static/historical data, such as the contents of a MySQL database, can be imported into Hadoop with Sqoop, for example.
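Such a Sqoop import can look roughly like this (the connection string, table name, and target directory below are just placeholders, not from a real setup):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table trades \
  --target-dir /user/hadoop/trades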

But when you want to get data from things like webserver logs, new stock trades, or new sensor data from IoT devices into your cluster in real-time, you need different tools, and Kafka and Flume are the right choices here. These tools allow you to process data as it’s being generated.

Keep in mind that these are two different tasks: 1) loading the data into your cluster, and 2) processing the data.

Kafka

Kafka is not just a Hadoop tool, but a more general-purpose system. A Kafka server listens for incoming messages from publishers, e.g. an app that listens to stock exchange trades and sends them to the Kafka server. It then publishes these messages to a data stream called a topic. Clients (called consumers; these are also apps, often pre-written ones) subscribe to these topics to receive the data as it is published. If a consumer goes offline or just “takes a break” for a while, Kafka remembers where it left off and resumes sending the missed data from that point.
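To get a feel for this flow, here is a rough sketch using Kafka’s bundled command-line tools (the topic name and broker address are made up, and the exact flag spellings differ between Kafka versions):

# create a topic for the trade data (older Kafka versions use --zookeeper instead of --bootstrap-server)
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic stock-trades --partitions 1 --replication-factor 1

# a publisher: every line you type becomes a message on the topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic stock-trades

# a consumer: reads the topic from the beginning, including messages it missed while offline
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic stock-trades --from-beginning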

Other things that can connect to Kafka are stream processors, which might listen to incoming data, transform it in some way, and then publish it to a new topic.

Flume

Since the Cloudera exam requires Flume but not Kafka, I’ll concentrate on this tool here.

Flume is another tool to stream data into your cluster. As opposed to Kafka, Flume was built with Hadoop integration in mind. It has built-in HDFS and HBase sinks, and was made for log aggregation.

A Flume agent consists of a source, a channel, and a sink, and connects a data source (such as a webserver’s logs) to some storage (such as HBase).

A source specifies where your data comes from. It can optionally include a channel selector, which splits the data according to some criteria and sends it to different channels, or interceptors, which preprocess the data before forwarding it. Built-in source types include spooling directories (listening for files that get dropped into a directory), Avro, exec (such as tail -F), and HTTP.
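As a small sketch, a spooling-directory source with a timestamp interceptor could look like this (the agent/component names and the directory path are just placeholders):

# watch a directory for new files and stamp each event with the current time
a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp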

A channel then transfers the data from the source to the sink. This can happen via memory, which is faster, or via files, which is more resilient if the Flume agent goes down.
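A file channel, for example, might be configured roughly like this (the directories are placeholders):

# persist events to disk so they survive an agent restart
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data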

A sink connects to exactly one channel, grabs the data from it, and stores it. The channel deletes an event as soon as a sink has received it. Sink types include HDFS, Hive, and Avro. Using an Avro sink makes it possible to chain several Flume agents together: one agent’s Avro sink sends events to the next agent’s Avro source.
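Such a chain could be sketched like this, assuming a second agent a2 running on a host called collector01 (host name and port are made up):

# on the first agent: forward events to the collector
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector01
a1.sinks.k1.port = 4545

# on the second agent: receive events from the first one
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545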

This is an example config file for Flume that sets up the agent a1 to listen on localhost:44444 for incoming data. You can connect there with telnet and just type things, and Flume will capture them to a logger sink.

# example.conf: A single-node Flume configuration
 
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
 
# Describe the sink
a1.sinks.k1.type = logger
 
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Usually your config files will look more complex, with sinks to HDFS, for example.
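As a rough sketch, an HDFS sink could look like this (the path and roll settings are only illustrative):

# write events to time-bucketed directories in HDFS
# the %Y-%m-%d escapes need a timestamp header on the events
# (e.g. from a timestamp interceptor, or set hdfs.useLocalTimeStamp = true)
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.channel = c1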

Start the Flume agent by issuing the command below (point --conf-file at your own config file):

cd /usr/hdp/current/flume-server  # this is for Hortonworks
bin/flume-ng agent --conf conf --conf-file ~/sparkstreamingflume.conf --name a1
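With the example config above, you can then test the setup from a second terminal; the logger sink prints the received events to Flume’s console/log output:

telnet localhost 44444
hello flume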

This data can now be analyzed in real time with Spark Streaming.