DerbyPy Intro to PySpark

Published: Oct 25, 2018 by

This month at DerbyPy I provided a high level introduction to PySpark. For this talk I went over the Spark execution model at a high level, talked about the difference between the PySpark Dataframe and RDD api, and provided some examples of how to use both. As part of this I put together a jupyter notebook and some scripts that can be used via spark-submit along with instructions on how to run PySpark locally.

If you’re interested in the material and presentation they can be found here.