A basic Spark/Python script
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise, so that it can serve as my cheat sheet.
This post serves as a brief introduction to Spark with Python. You can follow along and paste these lines into a pyspark shell.
You can now transform the gas price from euros per liter to USD per gallon, assuming 1.18 USD per euro and 0.265 gallons per liter: multiply the price by 1.18 and divide it by 0.265. Each value is mapped to a tuple whose first element is the converted price and whose second element is the constant 1, which serves as a counter in the reduceByKey step later on.
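As a minimal sketch of this conversion step, the helpers below could be pasted into the shell. The names EUR_TO_USD, GALLONS_PER_LITER, to_usd_per_gallon, to_price_count and with_counts are illustrative and not taken from the original script. The sketch assumes a pair RDD of (weekday, price in euros per liter), and the use of mapValues is an assumption as well.

```python
# Conversion factors stated in the text above.
EUR_TO_USD = 1.18
GALLONS_PER_LITER = 0.265


def to_usd_per_gallon(eur_per_liter):
    """Convert a price in euros per liter to USD per gallon."""
    # Euros -> USD: multiply. Per liter -> per gallon: divide by gallons per liter.
    return eur_per_liter * EUR_TO_USD / GALLONS_PER_LITER


def to_price_count(eur_per_liter):
    """Pair the converted price with a constant 1 that is summed up later."""
    return (to_usd_per_gallon(eur_per_liter), 1)


def with_counts(prices):
    """Assumes `prices` is a pair RDD of (weekday, eur_per_liter)."""
    return prices.mapValues(to_price_count)


# Plain-Python check of the arithmetic, no Spark needed:
print(to_price_count(1.30))

# In the pyspark shell (hypothetical sample data, `sc` is provided by the shell):
#   prices = sc.parallelize([("Mon", 1.30), ("Tue", 1.28), ("Mon", 1.34)])
#   price_counts = with_counts(prices)
```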
View your data set at any time by calling .collect() on an RDD.
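A short sketch of how this might look. The helper name peek is hypothetical and not part of Spark. Keep in mind that .collect() brings every element back to the driver as a Python list, so for larger data sets .take(n) is the gentler option.

```python
def peek(rdd, n=None):
    """Return the contents of an RDD as a Python list on the driver.

    With n=None everything is fetched via collect(); otherwise only
    the first n elements are fetched via take(n).
    """
    if n is None:
        return rdd.collect()
    return rdd.take(n)


# In the pyspark shell (hypothetical sample data, `sc` is provided by the shell):
#   prices = sc.parallelize([("Mon", 1.30), ("Tue", 1.28), ("Mon", 1.34)])
#   peek(prices)      # all three pairs
#   peek(prices, 2)   # only the first two
```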
Now compute an RDD containing the average gas price per weekday and show its contents. The function passed to reduceByKey is a lambda that takes two arguments. These are two values that share the same key (not the two elements of the tuple that makes up a single value). So you first create a reduced RDD whose values hold the sum of all prices in the first element and the number of prices in the second, then map it again to compute the actual average.
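The two steps can be sketched as follows. The function names add_pairs, average and average_per_weekday are illustrative, and using mapValues for the final step is an assumption. The input is assumed to be a pair RDD of (weekday, (price, 1)). The plain-Python part applies the same reduction to the values of a single key, which is essentially what reduceByKey does for each key.

```python
from functools import reduce


def add_pairs(a, b):
    """Combine two (price_sum, count) values that belong to the same key."""
    return (a[0] + b[0], a[1] + b[1])


def average(sum_and_count):
    """Turn a (price_sum, count) value into the average price."""
    total, count = sum_and_count
    return total / count


def average_per_weekday(price_counts):
    """Assumes `price_counts` is a pair RDD of (weekday, (price, 1))."""
    return price_counts.reduceByKey(add_pairs).mapValues(average)


# Plain-Python illustration for the values of one key, no Spark needed:
monday = [(5.0, 1), (6.0, 1), (7.0, 1)]
print(average(reduce(add_pairs, monday)))  # prints 6.0

# The same steps written with lambdas in the pyspark shell,
# assuming `price_counts` already exists:
#   sums = price_counts.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#   averages = sums.mapValues(lambda s: s[0] / s[1])
#   averages.collect()
```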
And your result is a Python list in which each element contains a weekday and the corresponding average price!