Scikit-learn Crash Course

One of the most convenient and valuable tools in a data scientist’s arsenal is the Python library Scikit-learn. Built on NumPy, SciPy, and matplotlib, Scikit-learn is an optimized and standardized collection of many common machine learning algorithms. The workflows for fitting each of its algorithms to a set of data are very similar and, with some practice, become second nature, but that doesn’t mean there isn’t a steep learning curve.

As with most topics in the field, getting started is the hardest part. Most tutorials and walkthroughs I have found are plagued with what I affectionately call the “College Professor Teaching Physics 101” effect. That is, the instructor is so far beyond a simple understanding of the basics that they have a difficult time explaining the bare-bones fundamentals all their knowledge is built upon. It is so easy it’s basically second nature for them, so why isn’t it second nature for you too?

The truth is that physics 101 (or Scikit-learn) will likely become second nature for you too, eventually. Initially, though, both can be intimidating.

My goal for this blog is to help any reader get over the initial learning curve of Scikit-learn with a walkthrough of a simple linear model. With the bare-bones basics laid out in simple terms, it becomes much easier to piece together more complicated models.

The Coding Environment

Before starting, be sure you have all the necessary libraries installed. You could pick and choose which packages to download and install using pip or Miniconda, but honestly I feel like it’s usually worth it to install all of Anaconda. Their website is super helpful, and if you’d like to slim your coding environment down in the future, installing packages individually is not difficult.

Start a Jupyter Notebook

Once you have all of your packages installed, the easiest way to work through a Scikit-learn model is in a Jupyter Notebook. If you’ve never used a Jupyter Notebook, I’ll take a second to give a crash course in starting one. If you have used notebooks before, feel free to open a notebook and jump to the next section.

For everyone else, we will skip the frills and make this as easy as possible. Here is a step-by-step to making a directory and your first notebook.

Open a terminal window. On a Mac, you can find this in your Applications folder, or just press Command-Space and type “terminal.” On a Windows machine, I’ve heard Git Bash is comparable.

Navigate to the folder you want to work in.

You can browse around your computer in the terminal using the command ls to see what is in your current directory and cd <folder name> to enter a folder in your current directory. cd .. will take you up a directory. If you want to be fancy and create a new folder in your Documents folder, you might have a workflow like this:

cd Documents
mkdir sklearn_practice
cd sklearn_practice

Once in the desired directory, enter the command jupyter notebook and create a new notebook using the home page menu.

Gather the Data

To help out beginners, I went ahead and created a very simple 1,001-row dataset with two independent variables [X1, X2] and one dependent variable [y]. The csv file can be downloaded here, found on the GitHub repository I made for this blog, or created/modified from scratch using the code in my data_creation notebook.

To make things a little too easy, the dataset is clean, normalized, and ready to use.

If you are just downloading the csv file I made, download it into the directory that we are working in.

Now let’s get down to business and fit a model to our data. It may be easiest to keep this blog and my linear_model notebook open for reference while creating your own linear regression model.

Explore the Data

Our first step is to import our data into our notebook for analysis and model fitting. The Pandas library is very convenient for this, and it also gives us tools to examine our data and, if need be, clean, normalize, and transform it so that models will perform better. In my sample notebook, I included some light exploratory data analysis just to show that all variables are normally distributed, the independent variables are not significantly correlated, and there are no missing values.

[Figure: correlation between variables]
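In the notebook, that import and light EDA might look like this. A minimal sketch (I’m assuming the csv was saved as sample_data.csv, so adjust the filename to match your download):

import pandas as pd

# Load the csv into a DataFrame (filename is hypothetical)
df = pd.read_csv('sample_data.csv')

# Quick checks: summary statistics, missing values, and correlations
print(df.describe())
print(df.isna().sum())
print(df.corr())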

Split the Data

Once we have explored our data a bit, it is useful to randomly split it into a training set and a testing set. For supervised learning (machine learning where we know the labels, at least in the training data), having a testing dataset is useful to ensure our model isn’t overfit. If the model performs exceptionally with training data but poorly with testing data, it is likely overfit.

Scikit-learn has a very convenient function called train_test_split that can be used to split your independent and dependent variables. It is important to note that if you plan to do any data transformation or normalization, it should be done AFTER splitting to avoid any data leakage. You can drop values and columns and change column datatypes before the split, but if the values of any column are going to change, that process should happen after the data is split.

To use train_test_split, first separate the dataset into dependent and independent values. Leaving all function parameters at their defaults, the code looks like this:

from sklearn.model_selection import train_test_split
X = df.drop(columns=['y'])
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y)
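
Our sample dataset is already normalized, so no transformation is needed here, but if yours were not, the leak-free pattern mentioned above looks something like this (a minimal sketch using Scikit-learn’s StandardScaler, fit on the training set only):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the scaling parameters from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply those same parameters to the testing data
X_test_scaled = scaler.transform(X_test)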

With data properly split, we can finally create a model. 

The Easiest Scikit-learn Model You May Ever Create

Here is an important detail that needs to be burned into a new data scientist’s memory: when importing models from Scikit-learn, you are importing an object class, not a function.

This is critical because it means a model must be INSTANTIATED. In other words, you have imported the blueprint for a linear regression model, and must create an object using that blueprint.

Once the object has been created, it can be fitted to your training data using the built-in .fit() method. Once fit to data, the model can be used to predict dependent variable values given independent variable values.

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_train_predict = linreg.predict(X_train)
y_test_predict = linreg.predict(X_test)

Coefficients and Evaluations

Before evaluating the model’s performance, it is valuable to look at the coefficients and intercept that our model estimated. When the model is fit to training data, it calculates .coef_ and .intercept_ values. The coefficients print as an array where each value corresponds to the training data’s independent variable columns, in column order. The intercept represents the constant y-intercept.
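
For example (the exact values will depend on your random split):

# One coefficient per independent variable, in column order
print(linreg.coef_)

# The constant y-intercept
print(linreg.intercept_)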

A model can be evaluated in almost countless ways; for now we will look at one. The Scikit-learn linear regression model comes with a built-in .score() method that returns the coefficient of determination (R²), which is a measure of how well a model captures the variance of the data. In other words, it describes how well a model fits the data, with 1.0 being a perfect fit.

It is important to check both the training score and the testing score. You can usually expect the training score to be better than the testing score, but if the difference is very large, you are likely dealing with overfitting.
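
In code, that check is just two more method calls:

# R² on data the model was fit to vs. data it has never seen;
# a large gap between the two suggests overfitting
print(linreg.score(X_train, y_train))
print(linreg.score(X_test, y_test))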

That may be a topic for a different day.

Conclusions

I sincerely hope this blog sheds some light on the cloudy learning path that is Scikit-learn. The model we worked with is very simple, and we are admittedly barely scratching the surface of the machine learning library, but even the smallest foothold on the workflow can be the catalyst to becoming an expert.

If you have any questions please feel free to reach out and ask. Learning is easiest when it is done together.
