In this class, we cover the Apache Spark framework: Resilient Distributed Datasets (RDDs), Spark SQL, Spark MLlib, and how to interact with a Spark cluster. We use PySpark in a Jupyter notebook to explore RDDs and walk through an example of distributed K-Means.
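To give a feel for why K-Means distributes well, here is a minimal sketch of one way to structure it as map and reduce steps. It is plain Python rather than PySpark, so it runs without a cluster, and it is not the code from the class notebook. The function names (`closest`, `map_partition`, `merge`, `kmeans_step`, `kmeans`), the toy data, and the starting centers are all assumptions made for illustration. Each inner list stands in for one partition of an RDD: the per-partition work loosely corresponds to an RDD `map`, and combining the partial sums to a `reduceByKey`.

```python
def closest(point, centers):
    # Index of the nearest center by squared Euclidean distance.
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_partition(points, centers):
    # "Map" side: within one partition, accumulate a (sum vector, count)
    # pair for every cluster that receives at least one point.
    acc = {}
    for pt in points:
        k = closest(pt, centers)
        s, n = acc.get(k, ([0.0] * len(pt), 0))
        acc[k] = ([a + b for a, b in zip(s, pt)], n + 1)
    return acc


def merge(a, b):
    # "Reduce" side: combine two partial (sum, count) pairs of one cluster.
    return ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1])


def kmeans_step(partitions, centers):
    # One iteration: gather partial sums from every partition, then
    # recompute each center as the mean of its assigned points.
    totals = {}
    for part in partitions:
        for k, partial in map_partition(part, centers).items():
            totals[k] = merge(totals[k], partial) if k in totals else partial
    return [
        # A cluster that received no points keeps its previous center.
        tuple(x / totals[k][1] for x in totals[k][0]) if k in totals else centers[k]
        for k in range(len(centers))
    ]


def kmeans(partitions, centers, iterations=10):
    # Fixed number of iterations; a real run would also test convergence.
    for _ in range(iterations):
        centers = kmeans_step(partitions, centers)
    return centers


# Toy data (assumed): two well-separated groups spread over three partitions.
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (10.0, 12.0)],
    [(12.0, 10.0), (12.0, 12.0)],
]
initial_centers = [(0.0, 0.0), (10.0, 10.0)]
final_centers = kmeans(partitions, initial_centers)
print(final_centers)
```

The point of the design is that only the small per-cluster sums and counts need to be combined across partitions, not the points themselves, which is what keeps each iteration cheap in a distributed setting.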

Spark introduction

Spark notebook

Spark notebook on Colab