This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise, so it can serve me as a cheat sheet.

This post serves as a brief introduction to Spark with Python. You can follow along and paste these lines into a pyspark shell.

# Create some data as an RDD:
# Each entry consists of a tuple (weekday, gas_price)

myRDD = sc.parallelize([
  ("Monday", 1.24),
  ("Tuesday", 1.42),
  ("Sunday", 1.33),
  ("Sunday", 1.21),
  ("Monday", 1.18)
])
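The variable sc is the SparkContext that the pyspark shell provides automatically. If you want to run the snippets as a standalone script instead, a minimal sketch like the following (assuming Spark 2.x and the SparkSession API, with a hypothetical app name) gives you an equivalent sc:

# Only needed outside the pyspark shell -- a minimal sketch, assuming Spark 2.x
from pyspark.sql import SparkSession

# "gas-prices" is just an illustrative application name
spark = SparkSession.builder.appName("gas-prices").getOrCreate()
sc = spark.sparkContext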

You can now convert the gas price from Euros per liter to USD per gallon, assuming 1.18 USD per Euro and 0.265 gallons per liter. Map the values to a tuple whose first element is the converted price and whose second element is a constant 1, which will be helpful for the reduceByKey step later on:

myRDD_USD = myRDD.mapValues(lambda price: (price * 1.18 / 0.265, 1))

View your data set at any time by issuing .collect() on an RDD:

myRDD_USD.collect()
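With the sample data above, the output should look roughly like this (values rounded to two decimals; the order follows the order of the input list):

[('Monday', (5.52, 1)), ('Tuesday', (6.32, 1)), ('Sunday', (5.92, 1)), ('Sunday', (5.39, 1)), ('Monday', (5.25, 1))]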

Now compute an RDD containing the average gas price per weekday and show its contents. The function passed to reduceByKey takes two arguments, which are two values belonging to the same key (not two different keys, and not the key and value of a single entry). So you first create a reduced RDD whose value holds the sum of all prices in the first element and the count of prices in the second, then map it again to compute the actual average:

price_per_weekday = myRDD_USD. \
    reduceByKey(lambda value1, value2: (value1[0]+value2[0], value1[1]+value2[1])). \
    mapValues(lambda x: x[0]/x[1])

ppw_list = price_per_weekday.collect()

for day in ppw_list:
	print(day)

And your result is a Python list in which each element is a tuple of the weekday and its corresponding average price!
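As a quick sanity check, the printed output should look roughly like this (prices rounded to two decimals; the order of the keys after reduceByKey may differ):

('Monday', 5.39)
('Tuesday', 6.32)
('Sunday', 5.66)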