A basic Spark/Python script
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise so that it can double as my cheat sheet.
This post serves as a brief introduction to Spark with Python. You can follow along and paste these lines into a
You can now convert the gas price from euros per liter to USD per gallon, assuming 1.18 USD per euro and 0.265 gallons per liter: multiply the price by 1.18 and divide it by 0.265. We map each value to a tuple whose first element is the converted price and whose second element is a constant 1, which is helpful for the reduceByKey step later on:
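As a minimal sketch of the arithmetic, here is the conversion in plain Python, so it runs without a Spark session. The sample records and the function name to_usd_per_gallon are hypothetical; only the two conversion factors come from the text above.

```python
# Conversion factors as assumed in the text.
USD_PER_EUR = 1.18
GALLONS_PER_LITER = 0.265


def to_usd_per_gallon(eur_per_liter):
    """Convert a price in EUR per liter to USD per gallon."""
    # EUR -> USD multiplies by the exchange rate; per liter -> per gallon
    # divides by the number of gallons contained in one liter.
    return eur_per_liter * USD_PER_EUR / GALLONS_PER_LITER


# Hypothetical sample records: (weekday, price in EUR per liter).
sample = [("Mon", 1.30), ("Tue", 1.25), ("Mon", 1.40)]

# Pair each converted price with a constant 1 so counts can be summed later.
converted = [(day, (to_usd_per_gallon(price), 1)) for day, price in sample]
print(converted)

# In the pyspark shell, the analogous call on an RDD of (weekday, price)
# pairs (hypothetically named prices) would be:
#   prices.mapValues(lambda p: (to_usd_per_gallon(p), 1))
```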
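View your data set at any time by calling .collect() on an RDD (this returns all of its elements to the driver as a Python list, so use it on small data sets only):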
Now compute an RDD containing the average gas price per weekday and show its contents. The function passed to reduceByKey is a lambda that takes two arguments, which are two values that share the same key (here, two (price, count) tuples), not two different keys and not the two elements of a single tuple. So, you first create a reduced RDD that holds the sum of all prices in the first element and the number of prices in the second, then map it again to compute the actual average:
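To make the role of the two lambda arguments concrete, here is a plain-Python stand-in for the reduce-then-average step. The helper reduce_by_key is a hypothetical imitation of what reduceByKey does, not Spark's implementation, and the records are made-up sample data.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge values sharing a key."""
    merged = {}
    for key, value in pairs:
        # func always receives two values of the same key, never two keys.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical (weekday, (price, 1)) records.
records = [("Mon", (5.0, 1)), ("Tue", (6.5, 1)), ("Mon", (7.0, 1))]

# Sum the prices in the first element and the counts in the second.
sums = reduce_by_key(records, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Divide each summed price by its count to get the average per weekday.
averages = [(day, total / count) for day, (total, count) in sums.items()]
print(averages)

# In the pyspark shell, the analogous chain on an RDD of
# (weekday, (price, 1)) pairs (hypothetically named converted) would be:
#   converted.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
#            .mapValues(lambda s: s[0] / s[1]) \
#            .collect()
```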
And your result is a Python list in which each element is a tuple of the weekday and the corresponding average price!