# A basic Spark/Python script

*This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise, to serve me as a cheat sheet.*

This post serves as a brief introduction to Spark with Python. You can follow along and paste these lines into a `pyspark` shell.

You can now transform the gas price from Euro per liter to USD per gallon, assuming 1.18 USD per Euro and 0.265 gallons per liter. We map the values to a *tuple*: the first element is the transformed price, the second a constant 1, which will be helpful for the `reduceByKey` step later on:

View your data set at any time by issuing `.collect()` on an RDD:

Now compute an RDD containing the average gas price per weekday and show its contents. The function passed to `reduceByKey` is a *lambda* that takes two arguments, which correspond to the values of two entries sharing the same key (*not* the two elements of the tuple that makes up *one* value). So, you will create a reduced RDD that contains the *sum* of all prices in the first element and the *number* of prices in the second, then *map* it again to compute the actual *average*:

And your result is a Python list whose elements each contain a weekday and the corresponding average price!