This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”.
Before using Sqoop to import data from MySQL, you should connect to the database and make sure everything exists as expected. From the command line, access a MySQL on localhost via
You could enter a database name at the end, but you don’t have to. Once inside, you can poke around and check the database contents:
Sqoop needs permissions to access the tables. You grant these permissions from MySQL like so:
Import from MySQL to HDFS
The core command here is
sqoop import, along with a lot of parameters. This is an extensive example command:
Additionally to importing a table, this command compresses the HDFS file using the snappy codec (alternatives to
It also imports the file as an Avro file (instead of the default, a text file).
Advanced import commands
--fields-terminated-by '|'changes the field delimiter from the default
\tto another one.
--hive-importcan import the data directly into the Hive warehouse (instead of HDFS, the default). Use
--hive-overwriteto replace an existing table.
--driveris only relevant if the connection string does not begin with
--null-non-string -99to recode NULL values from the database into -99 for non-string data types.
--null-string "NA"the same option for strings
--where "product_id > 10000"can select a subset from the table that gets imported
Export from HDFS to MySQL
To export a table, you must first create it in MySQL. Log into it as before, then create the table in your target database:
Then, a sample export command looks like this:
If you export from a Hive internal table, the export-dir is something like
Check your MySQL database if the table was imported correctly!
Advanced export commands
--input-fields-terminated-by '\0001'is an option you set when exporting from Hive, since the default field delimiter is ASCII value 1 there.
input-null-non-stringto recode from the corresponding values to NULL in the database.