Introduction to Data Science


Instructors

Patrick Loiseau, Eric Gaussier, Oana Goga

News

Topic

This course is an introduction to data analysis with machine learning methods. Rather than giving an exhaustive account of all existing algorithms, it focuses on how to use powerful existing methods well, so as to avoid "bad data analysis".

We will cover basic notions of data analysis (such as cross-validation and regularization), methods for supervised and unsupervised learning (classification, neural networks, recommendation, clustering), and cross-cutting questions of interpretability, explainability and fairness in machine learning.
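As a rough illustration of two of these basic notions, the sketch below uses k-fold cross-validation to compare regularization strengths for a one-dimensional ridge regression. It is not taken from the course material: the data is synthetic, and the helper names (`k_fold_indices`, `fit_ridge_1d`, `cv_mse`) are hypothetical. The labs use scikit-learn, which provides ready-made versions of these tools.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (no intercept).

    w = sum(x * y) / (sum(x^2) + lam); lam = 0 is ordinary least squares.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=5):
    """Validation mean squared error of the ridge fit, averaged over k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_error = sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Synthetic data (an assumption for illustration): y = 2x + small Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]

# Score each candidate regularization strength on held-out folds only.
scores = {lam: cv_mse(xs, ys, lam) for lam in (0.0, 0.1, 1.0, 10.0, 100.0)}
best_lam = min(scores, key=scores.get)
```

The point of the design is that each candidate is scored only on data it was not fitted on, so a heavily over-regularized model (here `lam = 100.0`) shows up with a clearly worse validation error.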

These notions are introduced and illustrated through labs, and part of the grade is based on a project.

Prerequisites: basics of Python (including NumPy), basics of algebra, and basics of probability.

Schedule

| Date | Topic | Instructors | Material |
|---|---|---|---|
| 7/2 (Warning: 8:30AM) | Lecture 1: Introduction to basic notions | Patrick | [James et al.] Chap. 2 and 5.1 |
| 14/2 | Lecture 2: Supervised learning | Eric | [James et al.] Chap. 3, 4.1, 4.2, 4.3 and 8.1 |
| 21/2 | Lab 1: Supervised learning on Scikit-learn + COMPAS dataset analysis | Eric, Oana | Supervised learning notebook (instructions), Supervised learning notebook (solution), COMPAS exploration notebook |
| 28/2 | No class (holiday) | | |
| 6/3 | No class | | |
| 13/3 (Warning: 9:30AM) | Lecture 3: Neural networks | Patrick | Slides, [Goodfellow et al.] Chap. 6 and 9 |
| 20/3 | Lab 2: Neural networks | Eric, Patrick | Lab instructions, join the team chat |
| 27/3 | Lecture 4: Recommender systems | Oana | [Leskovec et al.] Chap. 9 |
| 3/4 | Lab 3: Project | Patrick, Oana | Project's instructions |
| 10/4 | Lecture 5: Unsupervised learning | Eric | |
| 17/4 | Lecture 6: Interpretability, Fairness | Patrick | [Barocas et al.] Chap. 1-2 |
| 24/4 | No class (holiday) | | |
| 1/5 | No class (bank holiday) | | |
| 5/5 | Lab 4: Grading the projects by groups | Patrick, Oana | |

Jupyter hub

Feel free to use any software to run the Jupyter notebooks. If you do not have a suitable setup, you can use the UGA hub.

Reading Material

Course Project

Dataset

You will examine the ProPublica COMPAS dataset, which covers all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, ProPublica also gathered various information fields (‘features’). Broadly, these fields relate to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also records whether the defendant actually recidivated.

Link to dataset: COMPAS ProPublica dataset

The file we will analyze is: compas-scores-two-years.csv
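A first look at the file might start along the following lines. This is only a minimal sketch using the standard library, not part of the project instructions. It assumes the CSV has a `race` column and a binary `two_year_recid` column; check the header of your copy. The helper `recidivism_rate_by_group` is a hypothetical name, and the small in-memory sample is made up so that the snippet runs without the real file.

```python
import csv
import io
from collections import defaultdict


def recidivism_rate_by_group(rows, group_col="race", label_col="two_year_recid"):
    """Return {group: fraction of rows whose label equals 1}.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The default column names are assumptions about the file header.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        counts[row[group_col]][0] += int(row[label_col])
        counts[row[group_col]][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


# Made-up sample standing in for the real file (not actual COMPAS records).
sample_csv = io.StringIO(
    "race,priors_count,two_year_recid\n"
    "group_a,0,0\n"
    "group_a,3,1\n"
    "group_b,1,1\n"
    "group_b,5,1\n"
    "group_b,0,0\n"
)
sample_rates = recidivism_rate_by_group(csv.DictReader(sample_csv))

# With the real file, the same helper would be called as:
# with open("compas-scores-two-years.csv", newline="") as f:
#     rates = recidivism_rate_by_group(csv.DictReader(f))
```

In the labs you would more likely load the file with pandas and explore it in a notebook; the standard-library version is used here only to keep the sketch dependency-free.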

Instructions

The project instructions can be found here.

Grading

The grade is based on the project (60%) and a final exam (40%).