{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Project\n",
    "\n",
    "## Dataset\n",
    "\n",
    "You will examine the ProPublica COMPAS dataset, which consists of all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, various information fields (‘features’) were also gathered by ProPublica. Broadly, these fields are related to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also contains information about whether the defendant did actually recidivate or not.\n",
    "\n",
    "The COMPAS score uses answers to 137 questions to assign a risk score to defendants -- essentially a probability of re-arrest. The actual output is two-fold: a risk rating of 1-10 and a \"low\", \"medium\", or \"high\" risk label.\n",
    "\n",
    "Link to dataset: https://github.com/propublica/compas-analysis\n",
    "\n",
    "The file we will analyze is: compas-scores-two-years.csv\n",
    "\n",
    "Link to the ProPublica article:\n",
    "\n",
    "https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing\n",
    "\n",
    "\n",
    "## Project goal\n",
    "\n",
    "The project has three parts: \n",
    "\n",
    "- The COMPAS scores have been shown to have biases against certain racial groups. Analyze the dataset to highlight these biases.  \n",
    "\n",
    "- Based on the features in the COMPAS dataset, train classifiers to predict who will re-offend (hint: no need to use all features, just the ones you find relevant).  Study if your classifiers are more or less fair than the COMPAS classifier. \n",
    "\n",
    "- Build a fair classifier (last lecture will cover fair classification techniques). Is excluding the race from the feature set enough?\n",
    "\n",
    "\n",
    "## Today\n",
    "\n",
    "Explore the dataset and do some initial statistics. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the data\n",
    "\n",
    "We first need to load the data from the ProPublica repo:\n",
    "https://github.com/propublica/compas-analysis\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking for file '%s' in the current directory... compas-scores-two-years.csv\n",
      "File found in current directory..\n"
     ]
    }
   ],
   "source": [
    "import urllib\n",
    "import os,sys\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn import feature_extraction\n",
    "from sklearn import preprocessing\n",
    "from random import seed, shuffle\n",
    "#from __future__ import division\n",
    "#from collections import defaultdict\n",
    "#import utils as ut\n",
    "\n",
    "SEED = 1234\n",
    "seed(SEED)\n",
    "np.random.seed(SEED)\n",
    "\n",
    "def check_data_file(fname):\n",
    "    files = os.listdir(\".\") # get the current directory listing\n",
    "    print(\"Looking for file '%s' in the current directory...\",fname)\n",
    "\n",
    "    if fname not in files:\n",
    "        print(\"'%s' not found! Downloading from GitHub...\",fname)\n",
    "        addr = \"https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv\"\n",
    "        response = urllib.request.urlopen(addr)\n",
    "        data = response.read()\n",
    "        fileOut = open(fname, \"wb\")\n",
    "        fileOut.write(data)\n",
    "        fileOut.close()\n",
    "        print(\"'%s' download and saved locally..\",fname)\n",
    "    else:\n",
    "        print(\"File found in current directory..\")\n",
    "    \n",
    "COMPAS_INPUT_FILE = \"compas-scores-two-years.csv\"\n",
    "check_data_file(COMPAS_INPUT_FILE)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data and clean it up\n",
    "\n",
    "__Load the data__\n",
    "\n",
    "hint: data is in csv format; pandas is a python library that can read csv files\n",
    "\n",
    "you can choose to represent your data either as a DataFrame or as a dictionary\n",
    "\n",
    "- The dataset contains data on how many convicts? \n",
    "\n",
    "- What are the features the dataset contains?\n",
    "\n",
    "hint pandas: check pandas functions shape, column, head\n",
    "\n",
    "hint dictionary: check keys() function\n",
    "\n",
    "__Cleanup the data__\n",
    "\n",
    "- Are there missing values (NaN)? are there outliers?  \n",
    "\n",
    "hint pandas: check isnull function in pandas\n",
    "\n",
    "hint dictionary: implement a for and check if the variable is None\n",
    "\n",
    "- Does ProPublica mentions how to clean the data?  \n",
    "\n",
    "__What is the effect of the following function?__\n",
    "\n",
    "df = pd.read_csv(COMPAS_INPUT_FILE)\n",
    "\n",
    "print(df.shape)\n",
    "\n",
    "df = df.dropna(subset=[\"days_b_screening_arrest\"]) # dropping missing vals\n",
    "\n",
    "df = df[\n",
    "    (df.days_b_screening_arrest <= 30) &  \n",
    "    (df.days_b_screening_arrest >= -30) &  \n",
    "    (df.is_recid != -1) &\n",
    "    (df.c_charge_degree != 'O') &\n",
    "    (df.score_text != 'N/A')\n",
    "]\n",
    "\n",
    "df.reset_index(inplace=True, drop=True) # renumber the rows from 0 again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic analysis of demographics\n",
    "\n",
    "- What are the different races present in the dataset? \n",
    "\n",
    "- What is the number of people by age category?\n",
    "\n",
    "- What is the number of people by race?\n",
    "\n",
    "- What is the number of people by COMPAS score (decile_score)?\n",
    "\n",
    "- What is the number of people by COMPAS risk category (score_text)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic investigations of gender and race bias in COMPAS scores\n",
    "\n",
    "decile_score -- is the score given by the COMPAS algorithm that estimates the risk to re-offend.\n",
    "\n",
    "score_text -- is the level of risk: Low, Medium, High\n",
    "\n",
    "two_years_recid -- is the ground truth data on whether the offender recidivated or not\n",
    "\n",
    "- What is the mean COMPAS score (decile_score) per race and gender? \n",
    "\n",
    "- What is the distribution (histogram) of decile_score per race and gender? \n",
    "\n",
    "The two_year_recid field records whether or not each person was re-arrested for a violent offense within two years, which is what COMPAS is trying to predict.\n",
    "\n",
    "- How many people were re-arrested? \n",
    "\n",
    "- Compute the recidivism (i.e., people that got re-arrested) rates by race and gender\n",
    "\n",
    "- What is the accuracy of the COMPAS scores to predict recidivism\n",
    "\n",
    "- Is the accuracy higher/lower if we look at particular races/genders?\n",
    "\n",
    "- What about false positives and false negatives?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}