{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this practical session is to learn how to use different classification and regression models from sklearn. It is based on the book \"Python Data Science Handbook\" by Jake VanderPlas. You are encouraged to play with the code provided.\n", "\n", "\n", "\n", "# Using sklearn for building ML models\n", "\n", "\n", "The Scikit-Learn API is designed with the following guiding principles in mind, as outlined in the Scikit-Learn API paper:\n", "\n", "* Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.\n", "\n", "* Inspection: All specified parameter values are exposed as public attributes.\n", "\n", "* Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.\n", "\n", "* Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.\n", "\n", "* Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.\n", "\n", "In practice, these principles make Scikit-Learn very easy to use, once the basic principles are understood. 
Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.\n", "\n", "## Basics of the API\n", "\n", "Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).\n", "\n", "* Choose a class of model by importing the appropriate estimator class from Scikit-Learn.\n", "* Choose model hyperparameters by instantiating this class with desired values.\n", "* Arrange data into a features matrix of shape [n_samples, n_features] and a target vector of length n_samples.\n", "* Fit the model to your data by calling the fit() method of the model instance.\n", "* Apply the model to new data:\n", " * For supervised learning, we often predict labels for unknown data using the predict() method.\n", " * For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.\n", "\n", "We will now step through several simple examples of applying supervised and unsupervised learning methods.\n", "\n", "## Supervised learning example: Simple linear regression\n", "\n", "As an example of this process, let's consider a simple linear regression, that is, the common case of fitting a line to (𝑥,𝑦) data. 
We will use the following simple data for our regression example:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x: [3.74540119 9.50714306 7.31993942 5.98658484 1.5601864 1.5599452\n", " 0.58083612 8.66176146 6.01115012 7.08072578 0.20584494 9.69909852\n", " 8.32442641 2.12339111 1.81824967 1.8340451 3.04242243 5.24756432\n", " 4.31945019 2.9122914 6.11852895 1.39493861 2.92144649 3.66361843\n", " 4.56069984 7.85175961 1.99673782 5.14234438 5.92414569 0.46450413\n", " 6.07544852 1.70524124 0.65051593 9.48885537 9.65632033 8.08397348\n", " 3.04613769 0.97672114 6.84233027 4.40152494 1.22038235 4.9517691\n", " 0.34388521 9.09320402 2.58779982 6.62522284 3.11711076 5.20068021\n", " 5.46710279 1.84854456]\n", "y: [ 7.22926896 18.18565441 13.52423055 10.67206599 0.64185082 1.4000462\n", " -0.29896653 17.38064514 11.36591852 11.3984114 -0.26422614 18.01311476\n", " 14.97193082 3.8584585 3.66749887 3.59937032 4.24562734 9.18591626\n", " 7.9701638 5.80012793 10.75788366 1.60421824 3.736558 5.13103024\n", " 8.93392551 16.05975926 2.92146552 10.28822167 11.2099274 -0.7161115\n", " 11.51229264 3.94851904 0.26520582 19.5423544 15.69289556 15.98984947\n", " 5.17932245 0.65443493 12.77642131 5.81548096 1.22109281 9.26065077\n", " 1.16566447 16.66813782 3.36710603 11.74868864 6.14962364 9.73011153\n", " 9.40444538 3.21035654]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "rng = np.random.RandomState(42)\n", "x = 10 * rng.rand(50)\n", "y = 2 * x - 1 + rng.randn(50)\n", "plt.scatter(x, y)\n", "print(\"x:\",x)\n", "print(\"y:\", y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the type of x? Of y? Print their values.\n", "\n", "x and y are one-dimensional NumPy arrays with 50 elements each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now walk through the process of building an ML model.\n", "\n", "1. Choose a class of model\n", "\n", "In Scikit-Learn, every class of model is represented by a Python class. So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Choose model hyperparameters\n", "\n", "An important point is that a class of model is not the same as an instance of a model.\n", "\n", "Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions like the following:\n", "\n", "* Would we like to fit for the offset (i.e., y-intercept)?\n", "* Would we like the model to be normalized?\n", "* Would we like to preprocess our features to add model flexibility?\n", "* What degree of regularization would we like to use in our model?\n", "* How many model components would we like to use?\n", "\n", "These are examples of the important choices that must be made once the model class is selected. These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data. 
In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation. We will explore how you can quantitatively motivate the choice of hyperparameters later.\n", "\n", "For our linear regression example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LinearRegression(fit_intercept=True)\n", "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values. In particular, we have not yet applied the model to any data: the Scikit-Learn API makes a clear distinction between the choice of a model and the application of that model to data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Arrange data into a features matrix and target vector\n", "\n", "Scikit-Learn expects a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to turn the data x into a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50, 1)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = x[:, np.newaxis]\n", "X.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Fit the model to your data\n", "\n", "Now it is time to apply our model to data. 
This can be done with the fit() method of the model:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example in this linear model, we have the following:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.9033107255311146" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# only the last expression of a cell is displayed, so print the slope explicitly\n", "print(\"slope:\", model.coef_)\n", "model.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two parameters represent the slope and intercept of the simple linear fit to the data. Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.\n", "\n", "5. Predict labels for unknown data\n", "\n", "Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our \"new data\" will be a grid of x values, and we will ask what y values the model predicts:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xfit: [-1. 
-0.75510204 -0.51020408 -0.26530612 -0.02040816 0.2244898\n", " 0.46938776 0.71428571 0.95918367 1.20408163 1.44897959 1.69387755\n", " 1.93877551 2.18367347 2.42857143 2.67346939 2.91836735 3.16326531\n", " 3.40816327 3.65306122 3.89795918 4.14285714 4.3877551 4.63265306\n", " 4.87755102 5.12244898 5.36734694 5.6122449 5.85714286 6.10204082\n", " 6.34693878 6.59183673 6.83673469 7.08163265 7.32653061 7.57142857\n", " 7.81632653 8.06122449 8.30612245 8.55102041 8.79591837 9.04081633\n", " 9.28571429 9.53061224 9.7755102 10.02040816 10.26530612 10.51020408\n", " 10.75510204 11. ]\n" ] } ], "source": [ "xfit = np.linspace(-1, 11)\n", "print(\"xfit: \",xfit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the effect of linspace?\n", "\n", "np.linspace(-1, 11) returns a one-dimensional array of 50 evenly spaced values (50 is the default number of points) from -1 to 11, both endpoints included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we need to coerce these x values into a [n_samples, n_features] features matrix, after which we can feed it to the model:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Xfit: [[-1. 
]\n", " [-0.75510204]\n", " [-0.51020408]\n", " [-0.26530612]\n", " [-0.02040816]\n", " [ 0.2244898 ]\n", " [ 0.46938776]\n", " [ 0.71428571]\n", " [ 0.95918367]\n", " [ 1.20408163]\n", " [ 1.44897959]\n", " [ 1.69387755]\n", " [ 1.93877551]\n", " [ 2.18367347]\n", " [ 2.42857143]\n", " [ 2.67346939]\n", " [ 2.91836735]\n", " [ 3.16326531]\n", " [ 3.40816327]\n", " [ 3.65306122]\n", " [ 3.89795918]\n", " [ 4.14285714]\n", " [ 4.3877551 ]\n", " [ 4.63265306]\n", " [ 4.87755102]\n", " [ 5.12244898]\n", " [ 5.36734694]\n", " [ 5.6122449 ]\n", " [ 5.85714286]\n", " [ 6.10204082]\n", " [ 6.34693878]\n", " [ 6.59183673]\n", " [ 6.83673469]\n", " [ 7.08163265]\n", " [ 7.32653061]\n", " [ 7.57142857]\n", " [ 7.81632653]\n", " [ 8.06122449]\n", " [ 8.30612245]\n", " [ 8.55102041]\n", " [ 8.79591837]\n", " [ 9.04081633]\n", " [ 9.28571429]\n", " [ 9.53061224]\n", " [ 9.7755102 ]\n", " [10.02040816]\n", " [10.26530612]\n", " [10.51020408]\n", " [10.75510204]\n", " [11. ]]\n", "yfit: [-2.88096733 -2.39664326 -1.9123192 -1.42799513 -0.94367106 -0.459347\n", " 0.02497707 0.50930113 0.9936252 1.47794926 1.96227333 2.44659739\n", " 2.93092146 3.41524552 3.89956959 4.38389366 4.86821772 5.35254179\n", " 5.83686585 6.32118992 6.80551398 7.28983805 7.77416211 8.25848618\n", " 8.74281024 9.22713431 9.71145837 10.19578244 10.68010651 11.16443057\n", " 11.64875464 12.1330787 12.61740277 13.10172683 13.5860509 14.07037496\n", " 14.55469903 15.03902309 15.52334716 16.00767122 16.49199529 16.97631936\n", " 17.46064342 17.94496749 18.42929155 18.91361562 19.39793968 19.88226375\n", " 20.36658781 20.85091188]\n" ] } ], "source": [ "Xfit = xfit[:, np.newaxis]\n", "yfit = model.predict(Xfit)\n", "print(\"Xfit: \",Xfit)\n", "print(\"yfit: \",yfit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the effect of xfit[:, np.newaxis]? Of model.predict(Xfit)? What's the type of Xfit? 
Of yfit?\n", "\n", "xfit[:, np.newaxis] turns the one-dimensional array xfit (shape (50,)) into a two-dimensional array of shape (50, 1), that is, a features matrix with a single column. model.predict(Xfit) returns the values that the model, fitted earlier on (X, y), predicts for the inputs contained in Xfit. These predictions are stored in yfit, which is a one-dimensional array with one value per row of Xfit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's visualize the results by plotting first the raw data, and then this model fit:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(x, y)\n", "plt.plot(xfit, yfit);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised classification: Naive Bayes on the Iris dataset\n", "\n", "Let's take a look at another example of this process, using the Iris dataset. Our question will be this: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?\n", "\n", "For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution. Because it is so fast and has no hyperparameters to choose, Gaussian naive Bayes is often a good model to use as a baseline classification, before exploring whether improvements can be found through more sophisticated models.\n", "\n", "We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the train_test_split utility function:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "dataset = load_iris()\n", "X_iris = dataset.data\n", "y_iris = dataset.target" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,\n", " random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the use of random_state?\n", "\n", "random_state is the seed used by the random number generator that shuffles the data before splitting. Fixing it makes the split reproducible from one run to the next."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data arranged, we can follow our recipe to predict the labels.\n", "\n", "Based on the following instructions, train a Naive Bayes model (see GaussianNB) on the Iris dataset and store the predictions made on Xtest in a vector called y_model:\n", "\n", "from sklearn.naive_bayes import GaussianNB # 1. choose model class\n", "\n", "model = GaussianNB() # 2. instantiate model\n", "\n", "model.fit(Xtrain, ytrain) # 3. fit model to data\n", "\n", "y_model = model.predict(Xtest) # 4. predict on new data" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y_model: [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 2 0 2 1 0 0 1 2 1 2 1 2 2 0 1\n", " 0]\n", "ytest: [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1\n", " 0]\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.model_selection import train_test_split\n", "\n", "#load data\n", "dataset = load_iris()\n", "X_iris = dataset.data\n", "y_iris = dataset.target\n", "\n", "#create train-test datasets\n", "Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,\n", " random_state=1)\n", "#train the model\n", "model = GaussianNB() \n", "model.fit(Xtrain, ytrain)\n", "\n", "#compute the values predicted by the trained model on Xtest\n", "y_model = model.predict(Xtest)\n", "\n", "#simple print to compare the results (predicted in y_model, and true in ytest)\n", "print(\"y_model:\",y_model)\n", "print(\"ytest: \",ytest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[13, 0, 0],\n", " [ 0, 15, 1],\n", " [ 0, 0, 9]])" ] }, 
"execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score\n", "# print the accuracy: a bare expression is displayed only when it is the last line of the cell\n", "print(\"accuracy:\", accuracy_score(ytest, y_model))\n", "\n", "#compute the confusion matrix\n", "from sklearn.metrics import confusion_matrix\n", "confusion_matrix(ytest, y_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is accuracy? Give its formula and explain it.\n", "\n", "Accuracy measures the proportion of instances that are correctly classified by the learned classifier. It is defined as the ratio of the number of correctly classified instances (#CCI) to the total number of instances (#I): Accuracy = #CCI/#I.\n", "\n", "What's the confusion matrix of the Naive Bayes model on the Iris dataset? Write code to compute and visualize this matrix.\n", "\n", "The confusion matrix C is a KxK matrix (where K is the number of classes). With Scikit-Learn's confusion_matrix(ytest, y_model), rows correspond to the actual classes and columns to the predicted classes: Cij is the number of instances that actually belong to class i and have been predicted in class j. Correct predictions therefore lie on the diagonal. In the matrix above, the single off-diagonal entry (row 1, column 2, counting from 0) is one instance of class 1 that was predicted as class 2.\n", "\n", "Redo the above steps with a k-NN model.\n", "\n", "This one's for you!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using sklearn to validate ML models and hyperparameters \n", "\n", "In the previous section, we saw the basic recipe for applying a supervised machine learning model:\n", "\n", "1. Choose a class of model\n", "2. Choose model hyperparameters\n", "3. Fit the model to the training data\n", "4. Use the model to predict labels for new data\n", "\n", "The first two steps, the choice of model and the choice of hyperparameters, are perhaps the most important part of using these tools and techniques effectively. In order to make an informed choice, we need a way to validate that our model and our hyperparameters are a good fit to the data. 
While this may sound simple, there are some pitfalls that you must avoid to do this effectively.\n", "\n", "## Thinking about Model Validation\n", "\n", "In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the prediction to the known value.\n", "\n", "The following sections first show a naive approach to model validation and why it fails, before exploring the use of holdout sets and cross-validation for more robust model evaluation.\n", "\n", "### Model validation the wrong way\n", "\n", "Let's demonstrate the naive approach to validation using the Iris data, which we saw in the previous section. We will start by loading the data:\n", "\n", "Load the Iris dataset storing the features in an array X and the targets in a vector y." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "#load data\n", "dataset = load_iris()\n", "X = dataset.data\n", "y = dataset.target\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we choose a model and hyperparameters. Here we'll use a k-neighbors classifier with n_neighbors=1. 
This is a very simple and intuitive model that says \"the label of an unknown point is the same as the label of its closest training point\":" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "model = KNeighborsClassifier(n_neighbors=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we train the model, and use it to predict labels for data we already know:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X, y)\n", "y_model = model.predict(X)\n", "\n", "#compute accuracy of the classifier\n", "from sklearn.metrics import accuracy_score\n", "accuracy_score(y, y_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we compute the fraction of correctly labeled points:\n", "\n", "Compute the accuracy of y_model. What do you observe? Why?\n", "\n", "The accuracy is 1.0: the classifier appears to be perfect. This is not surprising, as a 1-nearest-neighbor model predicts the label of the closest training instance. Because the model is evaluated on the very data it was trained on, the closest neighbor of each instance is the instance itself, so this score says nothing about performance on unseen data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model validation the right way: Holdout sets\n", "\n", "So what can be done? A better sense of a model's performance can be found using what's known as a holdout set: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance. 
This splitting can be done using the train_test_split utility in Scikit-Learn:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "# split the data with 50% in each set\n", "X1, X2, y1, y2 = train_test_split(X, y, random_state=0,\n", " train_size=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the model on the first subset (X1, y1).\n", "\n", "Predict the classes on the second subset (X2, y2) and compute the accuracy. What do you observe? What's the effect of random_state?\n", "\n", "In this case, the accuracy is no longer 1.0 (it is about 0.91): the test instances were not seen during training, so the nearest neighbor of a test instance is a different instance whose label may differ. As before, random_state is the seed used by the random number generator; fixing it makes the split reproducible." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9066666666666666" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X1, y1)\n", "y2_model = model.predict(X2)\n", "\n", "#compute accuracy of the classifier\n", "from sklearn.metrics import accuracy_score\n", "accuracy_score(y2, y2_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model validation via cross-validation\n", "\n", "One disadvantage of using a holdout set for model validation is that a portion of our data is no longer available for training the model. In the above case, half the dataset does not contribute to the training of the model! This is not optimal, and can cause problems, especially if the initial set of training data is small.\n", "\n", "One way to address this is to use cross-validation; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set. For instance, we can do two validation trials, alternately using each half of the data as a holdout set. 
Using the split data from before, we could implement it like this:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.96, 0.9066666666666666)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y2_model = model.fit(X1, y1).predict(X2)\n", "y1_model = model.fit(X2, y2).predict(X1)\n", "accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance. This particular form of cross-validation is a two-fold cross-validation, that is, one in which we have split the data into two sets and used each in turn as a validation set.\n", "\n", "We could expand on this idea to use even more trials, and more folds in the data. For example, we can split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data. This would be rather tedious to do by hand, and so we can use Scikit-Learn's cross_val_score convenience routine to do it succinctly:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ])" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "cross_val_score(model, X, y, cv=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.\n", "\n", "Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are implemented via iterators in the model_selection module. 
For example, we might wish to go to the extreme case in which our number of folds is equal to the number of data points: that is, we train on all points but one in each trial. This type of cross-validation is known as leave-one-out cross validation, and can be used as follows:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.\n", " 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.\n", " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", " 1. 1. 1. 1. 1. 1.]\n" ] }, { "data": { "text/plain": [ "0.96" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import LeaveOneOut\n", "scores = cross_val_score(model, X, y, cv=LeaveOneOut())\n", "print(\"scores: \", scores)\n", "np.average(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the form of scores? Why? Compute the mean.\n", "\n", "scores is a one-dimensional array with 150 elements, one per leave-one-out trial: element i is 1 if the ith instance is correctly predicted when it is left out of the training set, and 0 otherwise. Each validation fold contains a single instance, so its accuracy can only be 0 or 1. The mean of scores (0.96 here) is the estimated accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting the Best Model\n", "\n", "Now that we've seen the basics of validation and cross-validation, we will go into a little more depth regarding model selection and selection of hyperparameters. These issues are some of the most important aspects of the practice of machine learning.\n", "\n", "Of core importance is the following question: if our estimator is underperforming, how should we move forward? 
There are several possible answers:\n", "\n", "* Use a more complicated/more flexible model\n", "* Use a less complicated/less flexible model\n", "* Gather more training samples\n", "* Gather more data to add features to each sample\n", "\n", "The answer to this question is often counter-intuitive. In particular, sometimes using a more complicated model will give worse results, and adding more training samples may not improve your results! The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.\n", "\n", "### Validation curve\n", "\n", "In what follows, we present a way to determine the most suitable model complexity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we imagine that we have some ability to tune the model complexity, we would expect the training score and validation score to behave as illustrated in [this figure](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.03-validation-curve.png).\n", "\n", "The diagram shown here is often called a *validation curve*, and we see the following essential features:\n", "\n", "- The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.\n", "- For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.\n", "- For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.\n", "- For some intermediate value, the validation curve has a maximum. 
This level of complexity indicates a suitable trade-off between bias and variance.\n", "\n", "The means of tuning the model complexity varies from model to model; when we discuss individual models in depth in later sections, we will see how each model allows for such tuning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validation curves in Scikit-Learn\n", "\n", "Let's look at an example of using cross-validation to compute the validation curve for a class of models.\n", "Here we will use a *polynomial regression* model: this is a generalized linear model in which the degree of the polynomial is a tunable parameter.\n", "For example, a degree-1 polynomial fits a straight line to the data; for model parameters $a$ and $b$:\n", "\n", "$$\n", "y = ax + b\n", "$$\n", "\n", "A degree-3 polynomial fits a cubic curve to the data; for model parameters $a, b, c, d$:\n", "\n", "$$\n", "y = ax^3 + bx^2 + cx + d\n", "$$\n", "\n", "We can generalize this to any number of polynomial features.\n", "In Scikit-Learn, we can implement this with a simple linear regression combined with the polynomial preprocessor.\n", "We will use a *pipeline* to string these operations together:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.pipeline import make_pipeline\n", "\n", "def PolynomialRegression(degree=2, **kwargs):\n", " return make_pipeline(PolynomialFeatures(degree),\n", " LinearRegression(**kwargs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the effect of PolynomialRegression? What's the effect of kwargs?\n", "\n", "\n", "PolynomialRegression builds a pipeline that chains a polynomial preprocessor (PolynomialFeatures of the given degree) with a linear regression. **kwargs collects any additional keyword arguments passed to PolynomialRegression and forwards them unchanged to the LinearRegression constructor."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create some data to which we will fit our model:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def make_data(N, err=1.0, rseed=1):\n", "    # randomly sample the data\n", "    rng = np.random.RandomState(rseed)\n", "    X = rng.rand(N, 1) ** 2\n", "    y = 10 - 1. / (X.ravel() + 0.1)\n", "    if err > 0:\n", "        y += err * rng.randn(N)\n", "    return X, y\n", "\n", "X, y = make_data(40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now visualize our data, along with polynomial fits of several degrees:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "X_test = np.linspace(-0.1, 1.1, 500)[:, None]\n", "\n", "# plotting polynomials with increasing degrees\n", "plt.scatter(X.ravel(), y, color='black')\n", "axis = plt.axis()\n", "for degree in [1, 3, 5, 7]:\n", "    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)\n", "    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))\n", "plt.xlim(-0.1, 1.0)\n", "plt.ylim(-2, 12)\n", "plt.legend(loc='best');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot polynomial models with increasing degrees. What do you observe?\n", "\n", "The fit to the training data improves as the degree of the polynomial increases. For the polynomial of degree 7, however, the sharp decrease towards x=1 may not reflect the underlying trend. Indeed, a model that fits the training data too closely may generalize poorly to unseen data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The knob controlling model complexity in this case is the degree of the polynomial, which can be any non-negative integer. A useful question to answer is this: what is a good degree of polynomial?\n", "\n", "We can make progress in this by visualizing the validation curve for this particular data and model; this can be done straightforwardly using the validation_curve convenience routine provided by Scikit-Learn. Given a model, data, parameter name, and a range to explore, this function will automatically compute both the training score and validation score across the range:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'score')" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.model_selection import validation_curve\n", "degree = np.arange(0, 21)\n", "# param_name and param_range are keyword-only in recent scikit-learn versions\n", "train_score, val_score = validation_curve(PolynomialRegression(), X, y,\n", "                                          param_name='polynomialfeatures__degree',\n", "                                          param_range=degree, cv=7)\n", "\n", "plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')\n", "plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')\n", "plt.legend(loc='best')\n", "plt.ylim(0, 1)\n", "plt.xlabel('degree')\n", "plt.ylabel('score')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this correspond to the expected result?\n", "\n", "Yes! The score on the training set keeps increasing (if only slightly) with the degree, while the score on the validation set eventually decreases (bias-variance trade-off).\n", "\n", "What's the best polynomial to use? Plot the dataset with this polynomial.\n", "\n", "The best polynomial is a third-order polynomial. Its plot is given below.\n", "\n", "Redo this study by varying the size of the dataset. How do the results vary?\n", "\n", "This one's for you!" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-0.05673314103942452,\n", " 0.994263633135634,\n", " -0.7459943120970807,\n", " 10.918045992764213)" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(X.ravel(), y, color='black')\n", "lim = plt.axis()\n", "y_test = PolynomialRegression(3).fit(X, y).predict(X_test)\n", "plt.plot(X_test.ravel(), y_test)\n", "plt.axis(lim)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }