# Course Project

## Dataset

You will examine the ProPublica COMPAS dataset, which consists of all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, various information fields (‘features’) were also gathered by ProPublica. Broadly, these fields are related to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also contains information about whether the defendant did actually recidivate or not.

The COMPAS score uses answers to 137 questions to assign a risk score to defendants -- essentially a probability of re-arrest. The actual output is two-fold: a risk rating of 1-10 and a "low", "medium", or "high" risk label.

Link to dataset: https://github.com/propublica/compas-analysis

The file we will analyze is: compas-scores-two-years.csv

Link to the ProPublica article:

https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing


## Project goal

The project has three parts: 

- The COMPAS scores have been shown to have biases against certain racial groups. Analyze the dataset to highlight these biases. 

- Based on the features in the COMPAS dataset, train classifiers to predict who will re-offend (hint: no need to use all features, just the ones you find relevant). Study if your classifiers are more or less fair than the COMPAS classifier. 

- Build a fair classifier (the last lecture will cover fair classification techniques). Is excluding race from the feature set enough?


## Today

Explore the dataset and do some initial statistics. 

## Download the data

We first need to load the data from the ProPublica repo:
https://github.com/propublica/compas-analysis


In [1]:
import urllib.request  # a bare "import urllib" does not guarantee urllib.request is loaded
import os,sys
import numpy as np
import pandas as pd

from sklearn import feature_extraction
from sklearn import preprocessing
from random import seed, shuffle
#from __future__ import division
#from collections import defaultdict
#import utils as ut

# Fix the random seeds so that results are reproducible
SEED = 1234
seed(SEED)
np.random.seed(SEED)

def check_data_file(fname):
    """Download the COMPAS file from GitHub unless it is already present locally."""
    files = os.listdir(".") # get the current directory listing
    print("Looking for file '%s' in the current directory..." % fname)

    if fname not in files:
        print("'%s' not found! Downloading from GitHub..." % fname)
        addr = "https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"
        response = urllib.request.urlopen(addr)
        data = response.read()
        # write the raw bytes; the with-block closes the file even on error
        with open(fname, "wb") as fileOut:
            fileOut.write(data)
        print("'%s' downloaded and saved locally.." % fname)
    else:
        print("File found in current directory..")

COMPAS_INPUT_FILE = "compas-scores-two-years.csv"
check_data_file(COMPAS_INPUT_FILE)

Looking for file 'compas-scores-two-years.csv' in the current directory...
File found in current directory..


## Load data and clean it up

__Load the data__

Hint: the data is in CSV format; pandas is a Python library that can read CSV files.

You can choose to represent your data either as a DataFrame or as a dictionary.

- How many defendants does the dataset contain?

- What features does the dataset contain?

Hint (pandas): check the `shape` and `columns` attributes and the `head()` function.

Hint (dictionary): check the `keys()` function.
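The pandas hints above can be sketched as follows. This is only an illustration: the three rows in `toy_csv` are made up and stand in for the real file, so the same calls would be made on `pd.read_csv(COMPAS_INPUT_FILE)` instead.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for compas-scores-two-years.csv.
toy_csv = io.StringIO(
    "id,sex,race,age_cat,decile_score,two_year_recid\n"
    "1,Male,Other,Greater than 45,1,0\n"
    "2,Male,African-American,25 - 45,3,1\n"
    "3,Female,Caucasian,Less than 25,6,0\n"
)
toy = pd.read_csv(toy_csv)

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
feature_names = list(toy.columns)   # column labels, i.e. the features

print(n_rows, n_cols)
print(feature_names)
print(toy.head(2))                  # first two rows, for a quick look
```

Each row is one record, so `shape[0]` gives the number of records; whether that equals the number of distinct defendants depends on whether any person appears more than once.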

__Cleanup the data__

- Are there missing values (NaN)? Are there outliers?

Hint (pandas): check the `isnull` function.

Hint (dictionary): write a `for` loop and check whether each value is `None`.

- Does ProPublica mention how to clean the data?
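One possible way to look for missing values and suspicious extremes with pandas is sketched below. The frame `toy` is invented for illustration; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the COMPAS file.
toy = pd.DataFrame({
    "days_b_screening_arrest": [-1.0, np.nan, 0.0, 45.0],
    "decile_score": [1, 3, 6, 10],
    "score_text": ["Low", "Low", "Medium", None],
})

# isnull() marks NaN/None cells; summing the booleans counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# describe() reports min/max/quartiles of numeric columns, a quick outlier check.
summary = toy.describe()
print(summary)
```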

__What is the effect of the following code?__

df = pd.read_csv(COMPAS_INPUT_FILE)

print(df.shape)

df = df.dropna(subset=["days_b_screening_arrest"]) # dropping missing vals

df = df[
 (df.days_b_screening_arrest <= 30) & 
 (df.days_b_screening_arrest >= -30) & 
 (df.is_recid != -1) &
 (df.c_charge_degree != 'O') &
 (df.score_text != 'N/A')
]

df.reset_index(inplace=True, drop=True) # renumber the rows from 0 again
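To see what each condition does, it can help to run the same filter on a handful of hand-made rows. The sketch below uses an invented frame `toy` in which each row after the first violates exactly one condition; `between(-30, 30)` is used as a shorthand for the two inequality tests.

```python
import numpy as np
import pandas as pd

# Invented rows: row 0 passes every test, rows 1-4 each fail one of them.
toy = pd.DataFrame({
    "days_b_screening_arrest": [0.0, np.nan, 45.0, -3.0, 10.0],
    "is_recid": [0, 1, 1, -1, 1],
    "c_charge_degree": ["F", "M", "F", "M", "O"],
    "score_text": ["Low", "High", "Medium", "Low", "High"],
})

kept = toy.dropna(subset=["days_b_screening_arrest"])  # drops row 1 (NaN)
kept = kept[
    kept.days_b_screening_arrest.between(-30, 30)  # drops row 2 (45 days)
    & (kept.is_recid != -1)                        # drops row 3
    & (kept.c_charge_degree != "O")                # drops row 4
    & (kept.score_text != "N/A")
]
kept = kept.reset_index(drop=True)                 # row labels restart at 0

print(len(toy), "rows before,", len(kept), "rows after")
```

Printing `df.shape` again after the real filter gives the corresponding before/after comparison on the full dataset.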

## Basic analysis of demographics

- What are the different races present in the dataset? 

- What is the number of people by age category?

- What is the number of people by race?

- What is the number of people by COMPAS score (decile_score)?

- What is the number of people by COMPAS risk category (score_text)?
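Counting questions like the ones above map naturally onto `unique()` and `value_counts()`. A minimal sketch on invented rows (the category labels follow the dataset's conventions, the counts are not real):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "race": ["African-American", "Caucasian", "African-American",
             "Hispanic", "Caucasian", "African-American"],
    "age_cat": ["25 - 45", "Less than 25", "25 - 45",
                "Greater than 45", "25 - 45", "Less than 25"],
    "decile_score": [8, 2, 5, 1, 2, 10],
    "score_text": ["High", "Low", "Medium", "Low", "Low", "High"],
})

races = sorted(toy["race"].unique())        # distinct values in the column
by_race = toy["race"].value_counts()        # number of rows per race
by_age = toy["age_cat"].value_counts()      # number of rows per age category
by_decile = toy["decile_score"].value_counts().sort_index()  # ordered 1..10
by_risk = toy["score_text"].value_counts()  # number of rows per risk label

print(races)
print(by_race, by_age, by_decile, by_risk, sep="\n\n")
```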

## Basic investigations of gender and race bias in COMPAS scores

decile_score -- is the score given by the COMPAS algorithm that estimates the risk to re-offend.

score_text -- is the level of risk: Low, Medium, High

two_year_recid -- is the ground truth data on whether the offender recidivated or not

- What is the mean COMPAS score (decile_score) per race and gender? 

- What is the distribution (histogram) of decile_score per race and gender? 
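Per-group means and score distributions can be obtained with `groupby` and `pd.crosstab`. The sketch below runs on invented data with placeholder group labels (`group_a`, `group_b`), so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented scores; "group_a"/"group_b" are placeholder labels, not real categories.
toy = pd.DataFrame({
    "race": ["group_a", "group_a", "group_b", "group_b", "group_b", "group_a"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "decile_score": [8, 4, 2, 4, 1, 6],
})

mean_by_race = toy.groupby("race")["decile_score"].mean()
mean_by_race_sex = toy.groupby(["race", "sex"])["decile_score"].mean()

# Histogram as a table: one row per group, one column per score,
# each cell holding the number of people with that score.
hist_by_race = pd.crosstab(toy["race"], toy["decile_score"])

print(mean_by_race)
print(mean_by_race_sex)
print(hist_by_race)
```

If matplotlib is installed, `hist_by_race.T.plot(kind="bar")` would draw the same table as side-by-side bars.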

The two_year_recid field records whether or not each person was re-arrested within two years of screening, which is what COMPAS is trying to predict.

- How many people were re-arrested? 

- Compute the recidivism rates (i.e., the fraction of people who were re-arrested) by race and gender.

- How accurately do the COMPAS scores predict recidivism?

- Is the accuracy higher/lower if we look at particular races/genders?

- What about false positives and false negatives?
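Accuracy and error rates require turning the 1-10 score into a yes/no prediction. The sketch below assumes that a `decile_score` of 5 or higher counts as "predicted to recidivate"; this cutoff (`THRESHOLD`), the helper `error_rates`, and the data are all choices made here for illustration.

```python
import pandas as pd

# Invented data with placeholder group labels.
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "decile_score": [7, 9, 3, 6, 2, 8, 1, 4],
    "two_year_recid": [1, 0, 0, 1, 0, 1, 0, 1],
})

THRESHOLD = 5  # assumption: scores >= 5 are treated as a positive prediction

def error_rates(group):
    """Accuracy, false positive rate and false negative rate for one group.

    Assumes the group contains at least one actual positive and one actual
    negative; otherwise a denominator below would be zero.
    """
    predicted = group["decile_score"] >= THRESHOLD
    actual = group["two_year_recid"] == 1
    tp = (predicted & actual).sum()
    tn = (~predicted & ~actual).sum()
    fp = (predicted & ~actual).sum()    # flagged, but did not recidivate
    fn = (~predicted & actual).sum()    # not flagged, but did recidivate
    return pd.Series({
        "accuracy": (tp + tn) / len(group),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    })

overall = error_rates(toy)
per_group = pd.DataFrame(
    {name: error_rates(g) for name, g in toy.groupby("race")}
).T

print(overall)
print(per_group)
```

The toy numbers are arranged so that both groups have the same accuracy but opposite kinds of errors, which is why accuracy alone is not enough to compare groups.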
