Introduction to Data Distribution
Course Overview
- Data Distribution & Big Data Processing
Harnessing the complexity of large amounts of data is a challenge in itself.
But Big Data processing goes further: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.
Objectives
By the end of this module, participants will be able to:
- Understand the differences between, and typical uses of, the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
- Implement the distribution of simple operations via the Map/Reduce principle in PySpark (a minimal sketch follows this list)
- Understand the principle of Kubernetes
- Deploy a Big Data Processing Platform on the Cloud
- Implement distributed data wrangling/cleaning and machine learning training using the PyData stack, Jupyter notebooks and Dask (see the Dask sketch below)
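To give a flavour of the Map/Reduce principle mentioned above, here is a minimal, illustrative PySpark sketch: a word count over a small in-memory dataset. The local Spark master, the application name and the sample lines are assumptions chosen only to make the example self-contained; the module itself covers this in more depth.

```python
from pyspark.sql import SparkSession

# Assumption: a local Spark session, only for illustration purposes
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Tiny in-memory dataset standing in for a large distributed one
lines = sc.parallelize([
    "big data processing",
    "distributed data processing",
])

counts = (
    lines.flatMap(lambda line: line.split())   # Map: split each line into words
         .map(lambda word: (word, 1))          # Map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # Reduce: sum the counts per word
)

print(counts.collect())
# e.g. [('big', 1), ('data', 2), ('processing', 2), ('distributed', 1)]

spark.stop()
```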
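Likewise, the last objective can be previewed with a small Dask sketch. It shows the "wrangle then compute" pattern used with the PyData stack: operations are written as in pandas, build a lazy task graph, and only execute on `.compute()`. The synthetic DataFrame and partition count are assumptions for a self-contained example.

```python
import pandas as pd
import dask.dataframe as dd

# Assumption: a tiny synthetic dataset standing in for real sensor data
pdf = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Split the data into partitions that can be processed in parallel
ddf = dd.from_pandas(pdf, npartitions=2)

# Cleaning and aggregation are expressed as in pandas, but evaluated lazily
result = ddf[ddf.value > 1.0].groupby("sensor").value.mean()

print(result.compute())   # triggers the actual (parallel) computation
```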