Linear Regression in Python using scikit-learn

Linear regression can feel tedious in Python compared with R. Using the scikit-learn module makes it much easier, though.

Let’s start off by importing the required modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

Loading the data

data = pd.read_csv('', index_col=0)
#Looking at the first six rows
print (data.head(6))
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
6    8.7   48.9       75.0    7.2

The 4 variables can be divided into:

  • Independent variables: TV, Radio, Newspaper
  • Dependent variable: Sales
print (data.shape) #Like the dim function in R

(200, 4)

There are 200 rows and 4 columns.

Plotting the data

fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(20, 10))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])

From the graphs above, it can be seen that TV and Sales have a linear correlation between them. Let’s now fit the regression model using the sklearn package. It’s very simple. I will use the least squares method to estimate the coefficients.
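That visual impression can be checked numerically: pandas’ corr() method reports the pairwise Pearson correlations between columns. A minimal sketch; since the CSV path is not shown above, a synthetic frame with a built-in TV-to-Sales relationship stands in for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the advertising data
# (the CSV path is not shown above, so this keeps the example runnable)
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
sales = 7 + 0.05 * tv + rng.normal(0, 1.5, 200)
frame = pd.DataFrame({'TV': tv, 'Sales': sales})

# Pearson correlation between every pair of numeric columns
print(frame.corr())
```

A correlation near 1 for the TV/Sales pair confirms a strong linear relationship; on the real data the same call would quantify what the scatter plots suggest.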

Y = Sales (also called the “target” data in Python)

X = all other variables (the independent variables)

First, I am going to import linear regression from the scikit-learn module. Then I am going to drop the Sales column, as I want only the predictors as my X values. I am going to store the linear regression object in a variable called lm.

from sklearn.linear_model import LinearRegression

Preparing the X and Y variables

X = data.drop('Sales', axis=1)
Y = data.Sales
print (X.head())

Fitting the model

lm = LinearRegression()
model =, Y)
#The coefficients and their corresponding columns are
print (model.coef_)
print (X.columns)


[ 0.04576465  0.18853002 -0.00103749]
Index(['TV', 'Radio', 'Newspaper'], dtype='object')

Getting the R-squared value for the model
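In scikit-learn, the fitted estimator’s score method returns R-squared directly. A minimal self-contained sketch; synthetic data stands in for the CSV, whose path is not shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data (CSV path not shown above)
rng = np.random.default_rng(1)
X = pd.DataFrame({'TV': rng.uniform(0, 300, 200),
                  'Radio': rng.uniform(0, 50, 200),
                  'Newspaper': rng.uniform(0, 100, 200)})
Y = 3 + 0.046 * X['TV'] + 0.19 * X['Radio'] + rng.normal(0, 1.5, 200)

model = LinearRegression().fit(X, Y)

# score() returns the coefficient of determination R^2 on the given data
print(model.score(X, Y))
```

On the real data, model.score(X, Y) after the fit above would report how much of the variance in Sales the three predictors explain.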


#Prediction can be done through model.predict
#Let's predict the first 5 sales values
z = model.predict(X)
z[:5]

array([ 20.52397441,  12.33785482,  12.30767078,  17.59782951,  13.18867186])

Calculating the root mean squared error

#As seen from the results, the predictions are close to the original values
#Let's calculate the root mean squared error
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(data.Sales, z))
print (rms)

The RMSE seems to be low at 1.66.
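One caveat: that 1.66 is measured on the same rows the model was fitted to, so it can be optimistic. A common cross-check is to hold out a test set with scikit-learn’s train_test_split and compute the RMSE on unseen rows. A sketch under the same synthetic-data assumption (the real CSV path is not shown above):

```python
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (CSV path not shown above)
rng = np.random.default_rng(2)
X = pd.DataFrame({'TV': rng.uniform(0, 300, 200),
                  'Radio': rng.uniform(0, 50, 200),
                  'Newspaper': rng.uniform(0, 100, 200)})
Y = 3 + 0.046 * X['TV'] + 0.19 * X['Radio'] + rng.normal(0, 1.5, 200)

# Hold out a quarter of the rows so the RMSE reflects unseen data
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, Y_train)
rmse = sqrt(mean_squared_error(Y_test, model.predict(X_test)))
print(rmse)
```

If the held-out RMSE is close to the in-sample one, the model is not simply memorizing the training rows.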
