Committee-based methods

Decision trees

This class introduces decision trees as supervised learning methods for classification and regression. A decision tree is a simple model that can be visualized and understood by a human after training. It consists of a graph with a "tree" structure: a root node has a pair of child nodes, which themselves have pairs of child nodes, and so on down to the leaf nodes. The model processes a data point by passing it from node to node until it reaches a leaf, which corresponds to a class or a value. Each internal node holds a "split" created at training time, which tests some feature of the data point. The binary outcome of the test determines whether the next node visited is the first or the second child of the current node.
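As a minimal sketch of how a data point flows through an already trained tree, the example below builds a small tree by hand. The `Node` class, the feature indices, the thresholds and the class names are all illustrative assumptions; they are not the output of any training procedure from the class notebook.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    """A split node (feature, threshold, two children) or a leaf (prediction only)."""
    feature: Optional[int] = None      # index of the feature tested at this node
    threshold: Optional[float] = None  # the test is: x[feature] <= threshold
    left: Optional["Node"] = None      # first child, taken when the test is true
    right: Optional["Node"] = None     # second child, taken when the test is false
    prediction: Union[str, float, None] = None  # class or value stored in a leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, x: Sequence[float]) -> Union[str, float, None]:
    """Follow the binary tests from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Hypothetical tree over two features (x[0], x[1]) with made-up thresholds.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(prediction="A"),
    right=Node(
        feature=1, threshold=1.0,
        left=Node(prediction="B"),
        right=Node(prediction="C"),
    ),
)

print(predict(tree, [1.0, 5.0]))  # x[0] <= 2.5, so the first child (leaf "A")
print(predict(tree, [3.0, 0.5]))  # x[0] > 2.5, then x[1] <= 1.0 (leaf "B")
print(predict(tree, [3.0, 4.0]))  # x[0] > 2.5, then x[1] > 1.0 (leaf "C")
```

Training a tree amounts to choosing the feature and threshold stored in each split node; the traversal itself stays this simple, which is why a trained tree is easy to read.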

Notebook

References

Boosting

In this class, we introduce the principle of boosting, which can be seen as an incremental way to build a "strong" classifier from "weak" classifiers. As such, this technique is an ensemble method. More specifically, the "weak" classifiers are added sequentially, so that each new model compensates for the flaws of the ensemble formed by the previous models. Further, we introduce gradient boosting as a generalization of boosting that uses gradients of a loss function to drive the incremental addition of models.
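The sequential idea can be sketched for one specific case: gradient boosting on a one-dimensional regression problem with the squared loss, where the negative gradient is simply the residual. This is a toy illustration rather than the notebook's implementation. The weak learner (a one-split "stump"), the function names, the toy data, the learning rate and the number of rounds are all assumptions chosen for readability.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Weak learner: the one-split regression tree with the lowest squared error."""
    best_err, best = float("inf"), (xs[0], 0.0, 0.0)
    for t in sorted(set(xs))[:-1]:  # candidate thresholds keeping both sides non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (t, lmean, rmean)
    return best


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2, the negative gradient w.r.t. p is y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        shrunk = (t, learning_rate * lv, learning_rate * rv)  # store the scaled stump
        stumps.append(shrunk)
        preds = [p + stump_predict(shrunk, x) for p, x in zip(preds, xs)]
    return base, stumps


def predict(base: float, stumps: List[Stump], x: float) -> float:
    return base + sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds) -> float:
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (assumption): a noiseless quadratic.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

base, stumps = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(xs))
boosted_mse = mse(ys, [predict(base, stumps, x) for x in xs])
print(f"constant model MSE: {baseline_mse:.2f}")
print(f"boosted model MSE:  {boosted_mse:.2f}")
```

Each stump alone is a poor model, but because every round targets what the current ensemble still gets wrong, the training error of the sum decreases round after round.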

Notebook

References

The Boosting Approach to Machine Learning: An Overview. R. E. Schapire. MSRI Workshop on Nonlinear Estimation and Classification, (2002).

Gradient Boosting and XGBoost

In this class, you will learn to use the XGBoost library, which efficiently implements gradient boosting algorithms.

This practice course consists of 3 parts, each meant to take about 1 hour:

* In the first notebook, you will learn the basics of XGBoost, how to apply it to a dataset, and how to tune it for the best performance.
* In the second notebook, we will focus on ensemble methods and explain what makes XGBoost different from other models.
* Finally, in the last notebook, you will see how the choice of hyperparameters is a key element of the bias-variance tradeoff.

Notebook 1: Introduction to XGBoost

Notebook 2: XGBoost and ensemble models

Notebook 3: Regularization

References

XGBoost

Bagging

In this class, we will introduce the bootstrap method and its application to learning a predictor, known as Bagging (Bootstrap AGGregatING). First we review the bootstrap in statistics as a method to estimate the variance of an estimator of any statistic of a random variable (e.g. its mean). Then we extend this notion to machine learning, i.e., to learning a predictor for regression or classification. We discuss the pros and cons of bagging.
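Both steps can be sketched with the standard library only. The first part estimates the variance of the sample mean by resampling with replacement; the second trains one predictor per bootstrap sample and averages their outputs. The data values, the number of resamples, the seed and the base learner `fit_1nn` (a 1-nearest-neighbour regressor) are illustrative assumptions, not taken from the class notebook.

```python
import random
import statistics


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return [data[rng.randrange(len(data))] for _ in data]


def bootstrap_variance(data, estimator, n_resamples, rng):
    """Variance of the estimator across bootstrap resamples of the data."""
    estimates = [estimator(bootstrap_sample(data, rng)) for _ in range(n_resamples)]
    return statistics.pvariance(estimates)


def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour regression on (x, y) pairs."""
    def predictor(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predictor


def bagging(train, fit, n_models, rng):
    """Fit one model per bootstrap sample and average their predictions."""
    models = [fit(bootstrap_sample(train, rng)) for _ in range(n_models)]
    def predictor(x):
        return sum(model(x) for model in models) / len(models)
    return predictor


rng = random.Random(0)  # fixed seed so that the run is reproducible

# Part 1: bootstrap estimate of the variance of the sample mean.
values = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
boot_var = bootstrap_variance(values, statistics.mean, 2000, rng)
plug_in = statistics.pvariance(values) / len(values)  # closed form, for comparison
print(f"bootstrap variance of the mean: {boot_var:.4f}")
print(f"plug-in variance of the mean:   {plug_in:.4f}")

# Part 2: bagging a high-variance regressor on toy (x, y) pairs.
train = [(0.0, 1.0), (1.0, 1.2), (2.0, 2.9), (3.0, 3.1), (4.0, 5.2)]
bagged = bagging(train, fit_1nn, 25, rng)
print(f"bagged prediction at x=2.5: {bagged(2.5):.2f}")
```

For the mean, a closed-form variance exists, so the comparison is only a sanity check; the appeal of the bootstrap is that the same code works for estimators with no convenient formula, such as a median or a trained predictor.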

Notebook

References

Random Forests

In this class, we will introduce Random Forest as an ensemble algorithm that extends bagging, using randomness in two ways to grow each of its trees:

By drawing a random sample from the original training set, as in bagging methods.
By selecting, for each split, a random subset of the features on which the split may be performed.

The method is then showcased in simple classification tasks.
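The two sources of randomness listed above can be sketched in a compact, standard-library-only forest. This is a teaching sketch under stated assumptions, not the notebook's code: the tuple encoding of trees, the Gini criterion, the depth limit, the toy dataset and all function names are illustrative choices.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Lowest weighted-Gini (score, feature, threshold), searching only feature_ids."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in rows})[:-1]:  # keeps both sides non-empty
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def grow_tree(rows, labels, n_sub_features, rng, depth=0, max_depth=3):
    """Return a leaf (a label) or a split encoded as (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    # Randomness 2: each split only considers a fresh random subset of features.
    feature_ids = rng.sample(range(len(rows[0])), n_sub_features)
    split = best_split(rows, labels, feature_ids)
    if split is None:  # the sampled features are constant on this node
        return majority
    _, f, t = split
    go_left = [row[f] <= t for row in rows]
    children = []
    for side in (True, False):
        sub_rows = [r for r, g in zip(rows, go_left) if g == side]
        sub_labels = [y for y, g in zip(labels, go_left) if g == side]
        children.append(grow_tree(sub_rows, sub_labels, n_sub_features, rng, depth + 1, max_depth))
    return (f, t, children[0], children[1])


def tree_predict(tree, row):
    # Splits are tuples and leaves are plain labels, so labels must not be tuples.
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def fit_forest(rows, labels, n_trees, n_sub_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Randomness 1: each tree sees its own bootstrap sample of the training set.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                n_sub_features, rng))
    return forest


def forest_predict(forest, row):
    """Majority vote over the trees."""
    return Counter(tree_predict(tree, row) for tree in forest).most_common(1)[0][0]


# Toy data (assumption): features 0 and 1 separate the classes, feature 2 is noise.
rows = [[0.1, 0.3, 1.0], [0.4, 0.2, 0.0], [0.2, 0.5, 1.0], [0.5, 0.1, 0.0],
        [5.1, 5.3, 1.0], [5.4, 5.2, 0.0], [5.2, 5.5, 1.0], [5.5, 5.1, 0.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels, n_trees=25, n_sub_features=2)
print(forest_predict(forest, [0.0, 0.0, 0.5]))
print(forest_predict(forest, [6.0, 6.0, 0.5]))
```

Note that the trees are grown independently of one another, so the loop in `fit_forest` could run in parallel; only the final vote combines them.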

Notebook

References

Anomaly detection

This class introduces the problem framing and methodology of Anomaly Detection. It illustrates why classical supervised ML algorithms are not suitable for such problems, and presents alternative approaches based on outlier detection and novelty detection. By alternating theory and practical exercises, you will discover the major algorithms, principles and warning signs for such tasks, including One-Class SVMs, Local Outlier Factor and Isolation Forest. You will also discover semi-supervised approaches in which the error of a supervised learning model is turned into an anomaly score. At the end, a practical use case with anonymized aircraft sensor data is proposed, where you will have to develop the whole methodology without guidance. It will help you reflect on the main stakes and warning points of such tasks, to prepare you to address customers in your professional life.
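The semi-supervised idea of using a model's error as an anomaly score can be sketched as follows: fit a regression model on data assumed to be normal, then score new points by the size of their prediction error. The simple least-squares line, the toy data and the "mean plus three standard deviations" threshold are assumptions made for illustration; the class material may rely on different models and thresholds.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    spread = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / spread
    return slope, y_mean - slope * x_mean


def anomaly_score(model, x, y):
    """Absolute prediction error: large when (x, y) breaks the learned relation."""
    slope, intercept = model
    return abs(y - (slope * x + intercept))


# Training data assumed to contain only normal behaviour (roughly y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9]

model = fit_line(xs, ys)
train_scores = [anomaly_score(model, x, y) for x, y in zip(xs, ys)]

# Arbitrary thresholding rule (assumption): mean + 3 standard deviations of the
# scores observed on normal data.
threshold = statistics.mean(train_scores) + 3 * statistics.pstdev(train_scores)

for x, y in [(4.0, 9.0), (5.0, 20.0)]:
    score = anomaly_score(model, x, y)
    status = "anomaly" if score > threshold else "normal"
    print(f"x={x}, y={y}: score={score:.2f} ({status})")
```

The threshold is the delicate part in practice: it is set from normal data only, so it encodes how much error is considered ordinary rather than any knowledge of what anomalies look like.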

Lecture notes
Notebook for class exercises (colab)
Solutions (colab)

Practical use case instructions
Data for use case

Requirements for local installation

To set up the Anaconda environment with the required dependencies, run the following commands in an Anaconda prompt or a Linux shell.

# Clone this github repository on your machine
git clone https://github.com/jfabrice/ml-class-anomaly-detection.git

# Change working directory inside the repo
cd ml-class-anomaly-detection

# Create a new virtual environment
conda create -n anomalydetectionenv python==3.6

# Activate the environment
## For Linux / macOS
source activate anomalydetectionenv
## For Windows
activate anomalydetectionenv

# Install the requirements
pip install -r requirements.txt

References

A Fast Algorithm for the Minimum Covariance Determinant Estimator.
P. J. Rousseeuw and K. Van Driessen. Technometrics, 41(3), 212-223, (1999).
Estimating the support of a high-dimensional distribution.
B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Neural computation, 13(7), 1443-1471, (2001).
Isolation-based anomaly detection.
F. T. Liu, K. M. Ting, and Z. H. Zhou. ACM Transactions on Knowledge Discovery from Data, 6(1), 1-39, (2012).
LOF: identifying density-based local outliers.
M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander. In ACM SIGMOD International Conference on Management of Data, 93-104, (2000).