Reading and writing data with Spark and Python
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise so that it can serve as a cheat sheet for me.
In this post, I’ll briefly summarize the core Spark functions necessary for the CCA175 exam. I also have a longer article on Spark available that goes into more detail and spans a few more topics.
The Spark context (often named sc) has methods for creating RDDs and is responsible for making RDDs resilient and distributed.
This is how you would use Spark and Python to create RDDs from different sources:
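A minimal sketch of the two most common cases, a Python collection and a text file. The function name create_rdds and the path in the usage comment are made up for illustration; sc is assumed to be an existing SparkContext, such as the one the pyspark shell provides.

```python
def create_rdds(sc, text_path):
    """Build two RDDs from an existing SparkContext `sc`."""
    # From a Python collection held in the driver's memory
    numbers = sc.parallelize([1, 2, 3, 4])
    # From a text file (local, HDFS, ...): one RDD element per line
    lines = sc.textFile(text_path)
    return numbers, lines

# Usage inside the pyspark shell, where `sc` already exists
# (the path is a placeholder):
# numbers, lines = create_rdds(sc, "hdfs:///user/cloudera/input.txt")
```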
Note that you cannot run this with your standard Python interpreter. Instead, you use spark-submit to submit it as a batch job, or call pyspark from the shell.
Other file sources include JSON, sequence files, and object files, but I won’t cover them here.
The RDD class has a saveAsTextFile method. However, this saves a string representation of each element. In Python, each line of the resulting text file is therefore the string form of one element, so a tuple is written out with its parentheses and quotes.
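To illustrate with a made-up element: the line written for a non-string element is its Python string form, so plain Python is enough to preview what ends up in the file. The (name, count) pair below is hypothetical, and the output shown assumes Python 3.

```python
# A made-up (name, count) pair standing in for one RDD element
element = ("alice", 3)

# The line written for this element is its string form
line = str(element)
print(line)  # under Python 3: ('alice', 3)
```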
If you want to save your data in CSV or TSV format, you can either use Python’s csv module (described in chapter 5 of the book “Learning Spark”), or, for simple data sets, just map each element (a vector) into a single string, e.g. like this:
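A sketch of both options, assuming Python 3 and elements that are tuples or lists of simple values. The helper names to_tsv_line, to_csv_line, and save_as_tsv are made up for this example; rdd stands for any existing RDD.

```python
import csv
import io

def to_tsv_line(fields):
    """Join the fields of one element with tabs (no quoting or escaping)."""
    return "\t".join(str(field) for field in fields)

def to_csv_line(fields):
    """Format one element with the csv module, which quotes fields if needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(fields)
    return buffer.getvalue()

def save_as_tsv(rdd, path):
    """Turn every element into one TSV line, then write plain text files."""
    rdd.map(to_tsv_line).saveAsTextFile(path)
```

The plain join is only safe when no field contains a tab or a newline; the csv variant handles a field that contains the delimiter by quoting it.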
Use metastore tables as input and output
Metastore tables store meta information about your stored data, such as the HDFS path to a table, column names and types. The HiveContext inherits from SQLContext and can find tables in the Metastore:
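A sketch of how this might look with the Spark 1.x API. The function names are made up for illustration, and sc is assumed to be an existing SparkContext; newer Spark versions offer other entry points, which this sketch does not cover.

```python
def make_hive_context(sc):
    """Wrap an existing SparkContext in a HiveContext."""
    # Imported here so that this module loads even where pyspark is absent
    from pyspark.sql import HiveContext
    return HiveContext(sc)

def list_tables(hive_ctx, database=None):
    """Return the names of the tables the metastore knows about."""
    if database is None:
        return hive_ctx.tableNames()
    return hive_ctx.tableNames(database)

# Usage inside the pyspark shell:
# sqlContext = make_hive_context(sc)
# print(list_tables(sqlContext))
```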
To write data from Spark into Hive, do this:
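One possible sequence, sketched under assumptions: the data has already been saved as tab-separated text in an HDFS directory, and the table name people with its two columns is purely hypothetical. hive_ctx stands for an existing HiveContext.

```python
def write_to_hive(hive_ctx, hdfs_dir, table="people"):
    """Create a tab-delimited Hive table and move the files in hdfs_dir into it."""
    # Table layout is made up for illustration: adapt columns to your data
    hive_ctx.sql(
        "CREATE TABLE IF NOT EXISTS {0} (name STRING, age INT) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'".format(table)
    )
    # LOAD DATA INPATH moves the files from hdfs_dir into the table's folder
    hive_ctx.sql("LOAD DATA INPATH '{0}' INTO TABLE {1}".format(hdfs_dir, table))
```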
These HiveQL commands of course work from the Hive shell, as well.
You can then load data from Hive into Spark with commands like
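For instance, a query against a hypothetical people table could be sketched as follows. The function name and the column names are assumptions for illustration; hive_ctx stands for an existing HiveContext.

```python
def load_from_hive(hive_ctx, table="people"):
    """Query a Hive table; the result of sql() is a DataFrame."""
    df = hive_ctx.sql("SELECT name, age FROM {0}".format(table))
    # .rdd exposes the same rows as an RDD of Row objects
    return df, df.rdd
```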
To write data from Spark into Hive, you can also transform it into a DataFrame and use this class’s
Hive tables, by default, are stored in the warehouse at /user/hive/warehouse. This directory contains one folder per table, which in turn stores a table as a collection of text files.
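The DataFrame route mentioned above could be sketched like this, assuming Spark 1.4 or later, where a DataFrame exposes a write attribute. The function name and its parameters are made up for illustration, and hive_ctx stands for an existing HiveContext.

```python
def save_as_hive_table(hive_ctx, rows, columns, table):
    """Turn rows (an RDD or list of tuples) into a DataFrame and save it as a table."""
    df = hive_ctx.createDataFrame(rows, columns)
    # "overwrite" replaces an existing table of the same name
    df.write.mode("overwrite").saveAsTable(table)
    return df

# Usage with made-up data:
# save_as_hive_table(sqlContext, [("alice", 3)], ["name", "count"], "people")
```

A table saved this way should get its own folder in the warehouse as well, although the files inside it may be in a format other than plain text, depending on the configured default data source.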
Interacting with the Hive Metastore
This requirement for the CCA175 exam is a fancy way of saying “create and modify Hive tables”. The metastore holds meta information about your tables, i.e. column names, numbers, and types.
So the requirement here is to get familiar with the CREATE TABLE and DROP TABLE commands from SQL. (ALTER TABLE does not work from within Spark and should be done from