Linear Regression in Python using scikit-learn


Linear regression can feel more tedious in Python than in R, but the scikit-learn module makes it much easier.

Let’s start off by importing the required modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
# %matplotlib inline is a Jupyter magic that displays plots inside the notebook
%matplotlib inline

Loading the data

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# Looking at the first six rows
data.head(6)
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
6    8.7   48.9       75.0    7.2

The 4 variables can be split into:

  • Independent variables: TV, Radio, Newspaper
  • Dependent variable: Sales
print (data.shape)  # Like the dim() function in R

(200, 4)

There are 200 rows and 4 columns.
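
Before plotting, it can also help to glance at summary statistics for each column. This is a small optional check, not part of the original walkthrough, using the same data DataFrame:

# Optional: summary statistics for each advertising channel and for Sales
print(data.describe())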

Plotting the data

fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(20, 10))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
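
As an optional companion to the scatter plots (not in the original post), pandas can compute the pairwise correlations directly:

# Correlation of each column with Sales (1.0 for Sales itself)
print(data.corr()['Sales'])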

From the graphs above, it can be seen that TV and Sales have a linear correlation between them. Let's now fit the regression model using the sklearn package; it's very simple. I will use the least squares method to estimate the coefficients.

Y = Sales (also called the "target" data in Python)

and

X = All other variables (the independent variables)

First, I am going to import linear regression from the scikit-learn module. Then I am going to drop the Sales column, as I want only the predictors as my X values. I am going to store the linear regression object in a variable called lm.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
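
LinearRegression fits an ordinary least squares model and includes an intercept by default. As a hypothetical variation (not used anywhere in this post), the intercept can be dropped:

# Hypothetical variation: force the fit through the origin (not used below)
lm_no_intercept = LinearRegression(fit_intercept=False)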

Preparing the X and Y variables

X = data.drop(['Sales'], axis=1)
print (X.head())
Y = data.Sales

Fitting the model

model = lm.fit(X, Y)
# The intercept and coefficients are
print (model.coef_)
print (model.intercept_)
X.columns

Output

[ 0.04576465  0.18853002 -0.00103749]
2.93888936946
Index(['TV', 'Radio', 'Newspaper'], dtype='object')
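
To make the output easier to read, the coefficients can be paired with their column names (a small convenience, not in the original post). Note how the Newspaper coefficient is close to zero:

# Pair each predictor with its estimated coefficient
print(list(zip(X.columns, model.coef_)))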

Getting the R-squared value for the model

model.score(X,Y)
0.89721063817895208

#Prediction can be done through model.predict
#Let's predict the first 5 sales values
model.predict(X)[0:5]
array([ 20.52397441,  12.33785482,  12.30767078,  17.59782951,  13.18867186])
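
model.predict also works on new observations. The budget values below are made up purely for illustration:

# Predict Sales for a hypothetical advertising budget (illustrative values)
new_budget = pd.DataFrame({'TV': [100.0], 'Radio': [25.0], 'Newspaper': [25.0]})
print(model.predict(new_budget))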

Calculating the root mean squared error (RMSE).

# As seen from the results, they are close to the original values
# Let's calculate the root mean squared error
from sklearn.metrics import mean_squared_error
from math import sqrt
z = model.predict(X)  # predictions for all 200 rows
rms = sqrt(mean_squared_error(data.Sales, z))
print (rms)
1.6685701407225697

The RMSE is fairly low, at about 1.67.
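
Keep in mind that this RMSE is computed on the same data the model was trained on, so it is an optimistic estimate. A quick way to get a more honest number is to hold out part of the data; here is a sketch (the split fraction and random seed are arbitrary choices):

# Estimate out-of-sample error with a simple train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
holdout_model = LinearRegression().fit(X_train, Y_train)
holdout_rmse = sqrt(mean_squared_error(Y_test, holdout_model.predict(X_test)))
print(holdout_rmse)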
