Hive can make your Hadoop cluster look like a relational database. It is layered on top of MapReduce or Tez and translates your SQL queries into MapReduce or Tez jobs.

The language itself is called HiveQL. It is very similar to MySQL's SQL dialect, with some extensions (e.g., views). HiveQL is highly optimized and very extensible through user-defined functions and a Thrift server, and it exposes JDBC/ODBC drivers, so that Hive can look like any other relational database to client applications.

Hive also has a few disadvantages: it is not useful for OLTP (i.e., high-throughput INSERT, UPDATE, or DELETE transactions), because every query incurs the startup overhead of a MapReduce or Tez job. Also, SQL is a limited language; Pig and Spark are more appropriate for more complicated tasks.

Importing tables

If you have a data file on your hard disk (for example a .csv file) or on HDFS, you can import it into a table in one of two ways: through a UI such as Ambari, which is more convenient, or through SQL commands. Just for illustration purposes, this is how you would define a table for the ml-100k data set in Hive:

CREATE TABLE ratings (
  userID INT,
  movieID INT,
  rating INT,
  time INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

The command LOAD DATA LOCAL copies a file from your local filesystem into Hive. In contrast, LOAD DATA takes a file that is already on HDFS and moves it into Hive's warehouse directory, which avoids duplicating an actual big data set. In both cases Hive then owns the data, so a DROP TABLE command will delete it; be careful!
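As a sketch, the two variants could look like this (the paths are hypothetical; adjust them to wherever your copy of u.data actually lives):

-- copies u.data from the local filesystem into Hive
LOAD DATA LOCAL INPATH '/home/maria_dev/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;

-- moves u.data from its HDFS location into Hive's warehouse
LOAD DATA INPATH '/user/maria_dev/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;

OVERWRITE replaces any data already in the table; leave it out if you want to append instead.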

If you want Hive to work with data, but not “own” the data, you can create an external table (as opposed to a managed table). Hive then only stores the metadata (e.g., column names and types), while the data itself stays on HDFS, where it remains accessible to other tools; dropping the table removes only the metadata.
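For example, an external version of the ratings table could be defined like this (the LOCATION path, which must point to the HDFS directory holding the data files, is hypothetical):

CREATE EXTERNAL TABLE ratings (
  userID INT,
  movieID INT,
  rating INT,
  time INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/maria_dev/ml-100k';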

Querying tables

I previously said that SQL is a somewhat limited tool for data analysis. But: you can use views to store the result of a query in an intermediate, virtual table, and subsequent SQL statements can then access that view the same way they would access a “real” table. For example, in the following script you generate the view topMovieIDs, which contains each movie ID and its number of ratings, and then in a subsequent statement you join the movie name to it and select only the name and the rating count in your final result:

CREATE VIEW topMovieIDs AS
SELECT movieID, count(movieID) AS ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC;

SELECT n.title, ratingCount
FROM topMovieIDs t JOIN names n ON t.movieID = n.movieID;

You can run this script from the Ambari UI, or straight from the command line. On the command line, the result will look like this:

hive> CREATE VIEW topMovieIDs AS
    > SELECT movieID, count(movieID) AS ratingCount
    > FROM ratings
    > GROUP BY movieID
    > ORDER BY ratingCount DESC;
Time taken: 0.66 seconds
hive> SELECT n.title, ratingCount
    > FROM topMovieIDs t JOIN names n ON t.movieID = n.movieID
    > LIMIT 10;
Query ID = maria_dev_20170708103104_30ff06d3-d6df-4296-9cc7-6f430c06beb9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1499507771083_0005)
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 4 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 11.85 s    
Star Wars (1977)	584
Contact (1997)	509
Fargo (1996)	508
Return of the Jedi (1983)	507
Liar Liar (1997)	485
English Patient, The (1996)	481
Scream (1996)	478
Toy Story (1995)	452
Air Force One (1997)	431
Independence Day (ID4) (1996)	429
Time taken: 14.962 seconds, Fetched: 10 row(s)

Nice, huh? You get some diagnostic output, and the top 10 movies according to the number of ratings. Not surprisingly, Star Wars is the number one here.

If you run this from the Ambari UI, you get an HTML-formatted table as your output, but you can also download the results as a .csv file. This is useful for further analysis or visualization of your results.

Other nice things about Hive

  • You can assign a partitioning scheme to your data if you frequently analyze only specific partitions. If, for example, you have shopping data from many business units, and most of your analyses are specific to a single business unit, partitioning by business unit will significantly speed up your queries, because Hive only has to read the files of the requested partition.
  • You can have structs as column types. For example, to store an address as a structure containing street, city, and zip code, your CREATE TABLE statement would contain column definitions like
  name STRING,
  address STRUCT<street:STRING, city:STRING, zip:INT>

The individual fields are then accessed with dot notation, e.g. address.city.

  • You can execute a batch of Hive queries from the command line with hive -f /some/path/queries.hql.
  • Importing and exporting to/from other relational databases such as MySQL, Oracle, or PostgreSQL works with Sqoop, another tool in the Hadoop ecosystem. If your analyzed and processed output is not “big” anymore, you can export that to the standalone databases for other people to work with.
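The partitioning mentioned above can be sketched like this (table and column names are hypothetical):

CREATE TABLE orders (
  orderID INT,
  amount DOUBLE)
PARTITIONED BY (businessUnit STRING);

-- A query that filters on the partition column only reads
-- the files stored under the matching partition directory:
SELECT SUM(amount)
FROM orders
WHERE businessUnit = 'electronics';

Note that the partition column is not part of the regular column list; Hive derives its value from the directory a file is stored in.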