Topic

This course gives an introduction to data analysis with machine learning methods. The focus is on using powerful existing machine learning methods well, so as to avoid "bad data analysis", rather than on giving an exhaustive account of all existing algorithms.

We will cover basic notions of data analysis (such as cross-validation and regularization), methods for supervised and unsupervised learning (classification, neural networks, recommendation, clustering), and cross-cutting questions of interpretability, explainability and fairness in machine learning.

The notions are introduced and illustrated through labs, and grading is based in part on a project.

Prerequisites: basics of Python (incl. NumPy), basics of algebra, basics of probability

Schedule

Date	Topic	Instructors	Material
7/2 (Warning: 8:30AM)	Lecture 1: Introduction to basic notions	Patrick	[James et al.] Chap. 2 and 5.1
14/2	Lecture 2: Supervised learning	Eric	[James et al.] Chap. 3, 4.1, 4.2, 4.3 and 8.1
21/2	Lab 1: Supervised learning with scikit-learn + COMPAS dataset analysis	Eric, Oana	Supervised learning notebook (instructions), Supervised learning notebook (solution), COMPAS exploration notebook
28/2	No class (holiday)
6/3	No class
13/3 (Warning: 9:30AM)	Lecture 3: Neural networks	Patrick	Slides, [Goodfellow et al.] Chap. 6 and 9
20/3	Lab 2: Neural networks	Eric, Patrick	lab instructions, join the team chat
27/3	Lecture 4: Recommender systems	Oana	[Leskovec et al.] Chap. 9
3/4	Lab 3: Project	Patrick, Oana	Project instructions
10/4	Lecture 5: Unsupervised learning	Eric
17/4	Lecture 6: Interpretability, Fairness	Patrick	[Barocas et al.] Chap. 1-2
24/4	No class (holiday)
1/5	No class (bank holiday)
5/5	Lab 4: Grading the projects by groups	Patrick, Oana

Reading Material

[James et al.] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, with Applications in R.
[Hastie et al.] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
[Manning et al.] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval.
[VanderPlas] Jake VanderPlas. Python Data Science Handbook.
[Goodfellow et al.] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
[Leskovec et al.] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets.
[Barocas et al.] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities.

Course Project

Dataset

You will examine the ProPublica COMPAS dataset, which consists of all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, ProPublica also gathered various information fields (‘features’). Broadly, these fields relate to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, and risk of recidivism predicted by the COMPAS tool). Finally, the dataset records whether the defendant actually reoffended.

Link to dataset: COMPAS ProPublica dataset

The file we will analyze is: compas-scores-two-years.csv
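
As a first look at the file, the sketch below counts defendants per group and computes the share recorded as having reoffended. It is only an illustration under stated assumptions: the column names `race` and `two_year_recid` are assumed here and should be checked against the CSV header, and the helper names are hypothetical, not part of the course material.

```python
import csv
from collections import defaultdict


def recidivism_rate_by_group(rows, group_col="race", outcome_col="two_year_recid"):
    """Return {group: (n_defendants, share of rows with outcome == 1)}.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The default column names are assumptions; check the CSV header.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_defendants, n_positive]
    for row in rows:
        counts[row[group_col]][0] += 1
        counts[row[group_col]][1] += int(row[outcome_col])
    return {group: (n, positive / n) for group, (n, positive) in counts.items()}


def load_rows(path="compas-scores-two-years.csv"):
    """Read the CSV into a list of dicts, one per row of the file."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

With the file in the working directory, `recidivism_rate_by_group(load_rows())` gives one entry per group. These are raw rates observed in the data, not an evaluation of the COMPAS predictions.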

Instructions

The project instructions can be found here.

Introduction to Data Science
