Introduction to Data Distribution
Course Overview
- Data Distribution & Big Data Processing
Harnessing the complexity of large amounts of data is a challenge in itself.
But Big Data processing goes further: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.
Objectives
By the end of this module, participants will be able to:
- Understand the differences between, and typical uses of, the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
- Implement the distribution of simple operations via the Map/Reduce principle in PySpark (a minimal sketch follows this list)
- Understand the principle of Kubernetes
- Deploy a Big Data Processing Platform on the Cloud
- Implement distributed data wrangling/cleaning and machine learning training using the PyData stack, Jupyter notebooks and Dask (see the Dask sketch below)
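To give a flavour of the Map/Reduce principle mentioned above, here is a minimal, illustrative PySpark sketch: a word count over a small in-memory dataset. The local Spark master, the application name and the sample lines are assumptions chosen only to make the example self-contained; the module itself covers this in more depth.

```python
from pyspark.sql import SparkSession

# Assumption: a local Spark session, only for illustration purposes
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Tiny in-memory dataset standing in for a large distributed one
lines = sc.parallelize([
    "big data processing",
    "distributed data processing",
])

counts = (
    lines.flatMap(lambda line: line.split())   # Map: split each line into words
         .map(lambda word: (word, 1))          # Map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # Reduce: sum the counts per word
)

print(counts.collect())
# e.g. [('big', 1), ('data', 2), ('processing', 2), ('distributed', 1)]

spark.stop()
```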
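Likewise, the last objective can be previewed with a small Dask sketch. It shows the "wrangle then compute" pattern used with the PyData stack: operations are written as in pandas, build a lazy task graph, and only execute on `.compute()`. The synthetic DataFrame and partition count are assumptions for a self-contained example.

```python
import pandas as pd
import dask.dataframe as dd

# Assumption: a tiny synthetic dataset standing in for real sensor data
pdf = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Split the data into partitions that can be processed in parallel
ddf = dd.from_pandas(pdf, npartitions=2)

# Cleaning and aggregation are expressed as in pandas, but evaluated lazily
result = ddf[ddf.value > 1.0].groupby("sensor").value.mean()

print(result.compute())   # triggers the actual (parallel) computation
```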