The number of tools available in the Hadoop ecosystem can be overwhelming. This post offers a few guidelines for choosing the appropriate tools and designing a well-fitting architecture for your cluster.

A nice paradigm is to work backwards. Don’t start with your data sources or favorite tools; start by defining your end user’s needs. Define how your users will access the data. Will there be many small transactions on single rows of your data? Or will there mostly be analytical queries over large subsets of your data?

Also, find out the demands regarding availability and consistency. If you work with financial transaction data, consistency is probably the most important factor. If you analyze trending hashtags, on the other hand, it will be okay if, during some periods, the counts are off by a small number.

These are some questions you should ask yourself:

  1. Is your data actually big enough to justify using a cluster? Don’t use one just because it’s the hip thing to do. Setting up and maintaining a cluster environment is a long and hard process, and comes with large overhead costs as opposed to a simpler, non-distributed solution.
  2. Are there technologies and skills already in use at your company? Always lean towards the simplest solution. For example, if you have five Hive experts in your company, but no one has worked with Pig so far, there has to be a very good reason to approach a new problem with Pig instead of Hive.
  3. What latency do your users require? If you need to render websites, you only have milliseconds to produce a result. In that case, a very fast distributed database like HBase or Cassandra is the proper tool.
  4. How old can the data you are querying be? If it’s okay to work on data from yesterday or from one hour ago, you can, for example, use Oozie to schedule Hive or Spark jobs that produce these datasets. If, on the other hand, you require real-time data, then Spark Streaming, Storm or Flink may be more appropriate.
  5. Will your solution be in use for a long time? Moving a large amount of data later will be very hard, so a proprietary solution or cloud-based storage that ties you to one vendor might not be the best choice.
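
To make the questions above a bit more concrete, here is a minimal Python sketch that turns questions 1 to 4 into a small decision helper. The `Requirements` class, the `suggest_tools` function and all three thresholds are hypothetical and only illustrate the reasoning; the right cut-off values depend on your hardware, budget and team. Question 5 is left out because it is a strategic decision that does not reduce to a number.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of the answers to the questions above."""
    data_size_gb: float      # rough size of the dataset
    max_latency_ms: float    # how fast a single result must come back
    max_data_age_s: float    # how stale the queried data may be
    team_skills: tuple = ()  # tools the team already knows, e.g. ("Hive",)


# Illustrative thresholds only; these are assumptions, not recommendations.
SINGLE_MACHINE_LIMIT_GB = 500
INTERACTIVE_LATENCY_MS = 100
REAL_TIME_AGE_S = 60


def suggest_tools(req: Requirements) -> list:
    """Return candidate tools, following the order of the questions above."""
    # Question 1: is a cluster justified at all?
    if req.data_size_gb < SINGLE_MACHINE_LIMIT_GB:
        return ["non-distributed solution"]

    candidates = []
    # Question 3: millisecond latency points to a distributed database.
    if req.max_latency_ms <= INTERACTIVE_LATENCY_MS:
        candidates += ["HBase", "Cassandra"]
    # Question 4: data freshness decides between streaming and scheduled batch.
    if req.max_data_age_s <= REAL_TIME_AGE_S:
        candidates += ["Spark Streaming", "Storm", "Flink"]
    else:
        candidates += ["Oozie + Hive", "Oozie + Spark"]

    # Question 2: list tools the team already knows first.
    # sorted() is stable, so the remaining order is preserved.
    def is_unknown(tool):
        return not any(skill in tool for skill in req.team_skills)

    return sorted(candidates, key=is_unknown)
```

For instance, calling `suggest_tools(Requirements(5000, 1000, 3600, ("Hive",)))` describes a large dataset with relaxed latency and freshness requirements and a team that knows Hive, so the scheduled Hive job ends up at the top of the list.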

Also, don’t treat the requirements as set in stone. If relaxing a requirement would save you months of development time, at least ask your project manager whether it is flexible, and present an argument that shows the time and/or budget saved if the requirement is dropped or relaxed.