Statistical Foundations of Machine Learning

This 15-hour course covers the aspects of statistical modeling that are most relevant to machine learning.

Chapter 1 reviews the mathematical notions that underlie machine learning. In particular, random variables, probability densities and the empirical estimation of model parameters are rigorously defined and illustrated.

Chapter 2 is central to this course: it presents the linear regression model from a statistical point of view and raises questions that are essential in machine learning, such as overfitting and cross-validation.
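As a minimal sketch of the statistical view of linear regression (not taken from the course material), the snippet below fits a 1D model y = a + b x by ordinary least squares and checks numerically that this fit also maximizes the Gaussian likelihood. The simulated data, the noise level and the function names `fit_line` and `gaussian_log_likelihood` are assumptions made for the illustration.

```python
import math
import random


def fit_line(x, y):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def gaussian_log_likelihood(x, y, intercept, slope, sigma2):
    """Log-likelihood of the data under y = intercept + slope * x + N(0, sigma2) noise."""
    n = len(x)
    rss = residual_sum_of_squares(x, y, intercept, slope)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


# Simulated data (assumption): true intercept 1.0, true slope 2.0, noise std 0.5.
random.seed(0)
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

a_hat, b_hat = fit_line(x, y)
# Maximum-likelihood estimate of the noise variance: RSS / n (not RSS / (n - 2)).
sigma2_hat = residual_sum_of_squares(x, y, a_hat, b_hat) / len(x)

best = gaussian_log_likelihood(x, y, a_hat, b_hat, sigma2_hat)
perturbed = gaussian_log_likelihood(x, y, a_hat + 0.1, b_hat, sigma2_hat)
print(f"intercept={a_hat:.3f} slope={b_hat:.3f} sigma2={sigma2_hat:.3f}")
print("least-squares fit has the higher likelihood:", best > perturbed)
```

Moving the intercept away from its least-squares value increases the residual sum of squares, so the log-likelihood can only decrease. This is the sense in which least squares and Gaussian maximum likelihood agree.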

These issues are developed in chapter 3, which deals with regularization and cross-validation and also introduces the bias-variance trade-off and the curse of dimensionality.
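To make regularization and cross-validation concrete, here is a hedged sketch that selects a ridge penalty by K-fold cross-validation using only NumPy. It is one possible illustration, not the course's own code: the simulated data, the candidate penalties and the helper names `ridge_fit` and `kfold_mse` are assumptions, and the model has no intercept for simplicity.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def kfold_mse(X, y, lam, n_folds=5):
    """Average held-out mean squared error of ridge regression over n_folds folds."""
    indices = np.arange(len(y))
    folds = np.array_split(indices, n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(indices, held_out)
        beta = ridge_fit(X[train], y[train], lam)
        residuals = y[held_out] - X[held_out] @ beta
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


# Simulated data (assumption): 40 observations, 30 features, only 3 of them informative.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 30
X = rng.normal(size=(n_samples, n_features))
true_beta = np.zeros(n_features)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=1.0, size=n_samples)

# Candidate penalties (assumption); lam = 0.0 corresponds to ordinary least squares.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: kfold_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)

for lam in lambdas:
    print(f"lambda={lam:>6}: cross-validated MSE = {cv_errors[lam]:.3f}")
print("selected lambda:", best_lam)
```

With almost as many features as observations, the unpenalized fit tends to overfit, which is the setting where a penalty chosen on held-out data is most useful. Larger penalties shrink the coefficient vector toward zero, trading variance for bias.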

Chapters 4 and 5 take the statistical modeling aspects introduced in chapter 3 further. They show how the randomness carried by the different random variables makes it possible to build a statistical test (chapter 4, on ANOVA) or to estimate the parameters of a relatively complex model from observed data (chapter 5, on mixed models).
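As an illustration of how such a test is built, the following sketch computes the F statistic of a one-way ANOVA by hand: the variability between group means is compared with the variability inside the groups. The toy data and the function name `one_way_anova_f` are assumptions made for the example, and the p-value (which requires the F distribution) is deliberately left out.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Variability explained by the group means.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Residual variability inside each group.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (assumption): three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
# prints: F = 13.00 with (2, 6) degrees of freedom
```

Under the null hypothesis that all groups share the same mean, and with Gaussian noise of equal variance, this statistic follows an F distribution with the degrees of freedom shown, so a large value is evidence against the null.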

Finally, chapter 6 opens onto two classic linear models used in both machine learning and statistics: logistic regression and the PLS (partial least squares) method.

Schedule

In practice, the lectures and practicals are structured in four blocks, each containing 1 to 2 hours of lectures and 2 hours of practicals. All documents linked below are in French.

Block 1: Chapter 1 and the 1D linear regression part of chapter 2 are covered in class. The practicals deal with linear regression, outlier detection and an illustration of the concept of maximum likelihood.
Block 2: This block deals with multivariate linear regression (second part of chapter 2), regularization and cross-validation (chapter 3). These concepts are put into practice during the practicals.
Block 3: This block is more closely related to the statistical aspects of the linear model in data science and focuses on ANOVA (chapter 4). This method is studied during the practicals, and some time is also dedicated to further practice with the concepts of block 2.
Block 4: This block covers extensions of the previous methods: mixed models (chapter 5) and the mathematical construction of logistic regression and the PLS method (chapter 6). The practicals first deal with PLS, then use the logistic regression example to raise questions about the interpretability of decision rules in machine learning.

Practical sessions

All notebooks are in French.
Using scikit-learn for linear regression
Multiple linear regression and statistical inference
Multiple regression with regularization and cross-validation
Using Pandas and sklearn to analyze real data
ANOVA
Partial Least Squares
Logistic regression and explainability

Chapters

All documents are in French. These chapters cover exactly the same content as the four blocks above.
Chapter 1 Introduction
Chapter 2 Linear regression
Chapter 3 Model selection in multiple linear regression
Chapter 4 Analysis of variance
Chapter 5 Linear mixed model
Chapter 6 Further topics
Appendices