This course gives an introduction to data analysis via machine learning methods. The focus is on how to use the powerful existing machine learning methods well, so as to avoid "bad data analysis", rather than on giving an exhaustive account of all existing algorithms.
We will cover basic notions of data analysis (such as cross-validation or regularization), methods for supervised and unsupervised learning (classification, neural networks, recommendation, clustering), and transverse questions of interpretability, explainability and fairness in machine learning.
The notions are introduced and illustrated through labs and the grading is in part based on a project.
Prerequisites: basics of Python (incl. NumPy), basics of algebra, basics of probability
Date | Topic | Instructors | Material |
---|---|---|---|
7/2 (Warning: 8:30AM) | Lecture 1: Introduction to basic notions | Patrick | [James et al.] Chap. 2 and 5.1 |
14/2 | Lecture 2: Supervised learning | Eric | [James et al.] Chap. 3, 4.1, 4.2, 4.3 and 8.1 |
21/2 | Lab 1: Supervised learning on Scikit-learn + COMPAS dataset analysis | Eric, Oana | Supervised learning notebook (instructions), Supervised learning notebook (solution), COMPAS exploration notebook |
28/2 | No class (holiday) | ||
6/3 | No class | ||
13/3 (Warning: 9:30AM) | Lecture 3: Neural networks | Patrick | Slides, [Goodfellow et al.] Chap. 6 and 9 |
20/3 | Lab 2: Neural networks | Eric, Patrick | lab instructions, join the team chat |
27/3 | Lecture 4: Recommender systems | Oana | [Leskovec et al.] Chap. 9 |
3/4 | Lab 3: Project | Patrick, Oana | Project's instructions |
10/4 | Lecture 5: Unsupervised learning | Eric | |
17/4 | Lecture 6: Interpretability, Fairness | Patrick | [Barocas et al.] Chap. 1-2 |
24/4 | No class (holiday) | ||
1/5 | No class (bank holiday) | ||
5/5 | Lab 4: Grading the projects by groups | Patrick, Oana | |
You will examine the ProPublica COMPAS dataset, which consists of all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, various information fields (‘features’) were also gathered by ProPublica. Broadly, these fields are related to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also records whether the defendant actually recidivated.
Link to dataset: COMPAS ProPublica dataset
The file we will analyze is: compas-scores-two-years.csv
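As a first look at this kind of file, the following minimal sketch computes the observed recidivism rate per group. It rests on assumptions: the column names (`race`, `sex`, `priors_count`, `decile_score`, `two_year_recid`) are assumed to match the headers of compas-scores-two-years.csv and should be checked against the actual file, the helper names `load_rows` and `recidivism_rate_by` are hypothetical, and the rows shown are a tiny synthetic sample, not real ProPublica data. It uses only the standard library; pandas would be an equally reasonable choice.

```python
import csv
import io
from collections import defaultdict

# Tiny synthetic sample, NOT real ProPublica rows.
# It only mimics the column names assumed above.
SAMPLE_CSV = """race,sex,priors_count,decile_score,two_year_recid
GroupA,Male,0,2,0
GroupA,Female,3,7,1
GroupB,Male,1,4,0
GroupB,Male,5,9,1
GroupB,Female,0,1,0
"""


def load_rows(handle):
    """Read a CSV file-like object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def recidivism_rate_by(rows, group_col, label_col="two_year_recid"):
    """Return {group value: fraction of rows whose label column equals 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        positives[group] += int(row[label_col])  # label assumed to be 0 or 1
    return {group: positives[group] / totals[group] for group in totals}


rows = load_rows(io.StringIO(SAMPLE_CSV))
rates = recidivism_rate_by(rows, "race")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

To run the same computation on the real file, one could replace `io.StringIO(SAMPLE_CSV)` with `open("compas-scores-two-years.csv", newline="")`, assuming the file sits in the working directory.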
The project instructions can be found here
Grading is based on the project (60%) and a final exam (40%)