The HDFS allows very large files to be broken up into blocks and distributed across a cluster of computers. It also makes sure that computers process these blocks that are physically close to the computers.
It intelligently stores more than one copy of each block, so that you don’t lose any information if a single node goes down. This allows you to get commodity computers instead of high-reliability systems, because you don’t care too much about one node failing.
HDFS consists of one Name Node that knows what’s on all individual Data Nodes. Name node resilience is a problem, but there are solutions. These are more of a concern for Hadoop administrators, not the application developers/analysts, though.
The two main ways to use HDFS are:
- Through a UI such as Ambari. This makes it look like a giant hard drive. It’s very intuitive and you can manipulate files like through a file explorer.
Command-line interface. You can SSH to your master node and execute commands such as
hadoop fs -ls. All HDFS command start with
hadoop fs. Other alternatives include:
- HTTP/HDFS proxies
- Java interface. Since Hadoop is written in Java, there are Java APIs available for almost anything.
- NFS gateway: You can mount a HDFS cluster!
The sample data I will be working with is the MovieLens 100k data set. It contains 100’000 ratings for movies, in two tables:
u.data has the columns userID, movieID, rating, and timestamp. The first few lines look like this:
A second table,
u.item, has the titles (and more information) for the movieIDs. It looks like this:
It’s the failure-resilient default builtin for processing data on your cluster. Essentially, you program a mapper function that processes your data, and a reducer function that aggregates the partial results into one final answer.
The mapper will transform each row into a key/value pair. The key is what’s getting aggregated on (e.g. a user id), and the value is something you do aggregate (e.g. a rating). The mapper then automatically performs a shuffle and sort step, where it aggregates (and sorts) all the values per key. You end up with a unique user id, and a list of ratings per user. Finally, the reducer combines them into your result (e.g. the average rating of each user id).
Sometimes, it’s hard to transform a question into a MapReduce job. In these cases, other frameworks like Spark or Hive are appropriate, which enable you to ask SQL-style queries.
Under the hood, you kick off a MapReduce job from some client node on your cluster. It then talks to the YARN resource manager. He knows which machines are up and have capacity. The client might need to copy some data to HDFS. Then, YARN spawns a MapReduce application master. He works with YARN to keep track of all Map tasks.
The application manager monitors worker tasks for errors or hanging. If a worker task explodes, the application manager can restart it (preferably on a different node).
If the application master goes down, YARN can try to restart it.
If an entire node goes down, the resource manager will try to restart it.
The one scary thing that could happen is that the resource manager goes down. You could set up “high availability MapReduce” with Zookeeper. It will maintain a hot standby resource manager that jumps in if the first one fails.
A first MapReduce script in Python
This is a Python script that computes a frequency table of ratings for the ml100k dataset. It consists of a class inheriting from MRJob, and a mapper and reducer function.
More commonly, problems are not as easy as this one, and creating a plain MapReduce script is somewhat of a struggle. This is why nowadays people commonly use higher level systems instead (Hive or Spark, for example).