Introduction to Data Science


Patrick Loiseau, Eric Gaussier, Oana Goga



This course gives an introduction to data analysis via machine learning methods. The focus is on how to use powerful existing machine learning methods well, so as to avoid "bad data analysis", rather than on giving an exhaustive account of all existing algorithms.

We will cover basic notions of data analysis (such as cross-validation or regularization), methods for supervised and unsupervised learning (classification, neural networks, recommendation, clustering), and transverse questions of interpretability, explainability and fairness in machine learning.
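As a rough illustration of two of the basic notions mentioned above, the sketch below uses k-fold cross-validation to choose a regularization strength for a one-dimensional ridge regression. It is a minimal example on synthetic data, not course material: the function names `ridge_fit_1d` and `k_fold_mse`, the toy data, and the candidate values are all assumptions made for illustration. It uses only the standard library, and assumes there are at least as many data points as folds.

```python
import random


def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge solution for the model y ~ w * x (no intercept).

    Minimizes sum((y - w*x)^2) + lam * w^2, which gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def k_fold_mse(xs, ys, lam, k=5):
    """Average held-out mean squared error over k folds (requires len(xs) >= k)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th point, starting at index `fold`, is held out for evaluation.
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = ridge_fit_1d(train_x, train_y, lam)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Synthetic toy data (hypothetical): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# Score each candidate regularization strength on held-out data only.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: k_fold_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print("cross-validated MSE per lambda:", scores)
print("selected lambda:", best_lam)
```

The point of the sketch is that the regularization strength is selected using error measured on data the model was not fitted on; selecting it on the training error alone would always favor the weakest regularization.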

The notions are introduced and illustrated through labs and the grading is in part based on a project.

Prerequisites: basics of Python (incl. NumPy), basics of algebra, basics of probability.


Date | Topic | Instructors | Material
7/2 (Warning: 8:30AM) | Lecture 1: Introduction to basic notions | Patrick | [James et al.] Chap. 2 and 5.1
14/2 | Lecture 2: Supervised learning | Eric | [James et al.] Chap. 3, 4.1, 4.2, 4.3 and 8.1
21/2 | Lab 1: Supervised learning on Scikit-learn + COMPAS dataset analysis | Eric, Oana | Supervised learning notebook (instructions), Supervised learning notebook (solution), COMPAS exploration notebook
28/2 | No class (holiday)
6/3 | No class
13/3 (Warning: 9:30AM) | Lecture 3: Neural networks | Patrick | Slides, [Goodfellow et al.] Chap. 6 and 9
20/3 | Lab 2: Neural networks | Eric, Patrick | Lab instructions, join the team chat
27/3 | Lecture 4: Recommender systems | Oana | [Leskovec et al.] Chap. 9
3/4 | Lab 3: Project | Patrick, Oana | Project's instructions
10/4 | Lecture 5: Unsupervised learning | Eric
17/4 | Lecture 6: Interpretability, Fairness | Patrick | [Barocas et al.] Chap. 1-2
24/4 | No class (holiday)
1/5 | No class (bank holiday)
5/5 | Lab 4: Grading the projects by groups | Patrick, Oana

Jupyter hub

Feel free to use any software to run the Jupyter notebooks. If you don't have any installed, you can use the UGA hub.

Reading Material

Course Project


You will examine the ProPublica COMPAS dataset, which covers all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, ProPublica also gathered various information fields (‘features’). Broadly, these fields relate to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also records whether the defendant actually recidivated.

Link to dataset: COMPAS ProPublica dataset

The file we will analyze is: compas-scores-two-years.csv
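A first look at this file could start from something like the sketch below, which computes the observed recidivism rate per demographic group. It is only a minimal example under stated assumptions: the function name `recidivism_rate_by_group` is hypothetical, the column names `race` and `two_year_recid` are assumed and should be checked against the header of the downloaded file, and the file is assumed to sit in the working directory.

```python
import csv
from collections import defaultdict
from pathlib import Path


def recidivism_rate_by_group(csv_path, group_col="race", label_col="two_year_recid"):
    """Return {group: fraction of rows with label 1} for a CSV file.

    The default column names are assumptions about the dataset's header;
    pass other names if the file uses different ones.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total rows]
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row[group_col]][0] += int(row[label_col])
            counts[row[group_col]][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


# Assumed location of the downloaded file; nothing runs if it is absent.
path = Path("compas-scores-two-years.csv")
if path.exists():
    for group, rate in sorted(recidivism_rate_by_group(path).items()):
        print(f"{group}: {rate:.2f}")
```

Comparing such base rates across groups is only a starting point; the same function can be pointed at another grouping column (for example a gender column) by changing `group_col`.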


The project's instructions can be found here.


Grading is based on the project (60%) and a final exam (40%).