Linear regression is often tedious in Python compared to R. Using the scikit-learn module makes it much easier, though.
Let’s start off by importing the required modules.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline
```
Loading the data
```python
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# Looking at the first six rows.
data.head(6)
```

```
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
6    8.7   48.9       75.0    7.2
```
The four variables fall into two groups:

- Independent variables: TV, Radio, Newspaper
- Dependent variable: Sales
```python
print(data.shape)  # Like the dim() function in R
```

```
(200, 4)
```
There are 200 rows and 4 columns
Plotting the data:
```python
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(20, 10))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
```
From the graphs above, it can be seen that TV and Sales have a linear correlation between them. Let's now fit the regression model using the sklearn package; it's very simple. I will use the least squares method to estimate the coefficients.
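To make the least squares idea concrete, here is a minimal sketch of what the fit actually solves: minimizing the sum of squared residuals, which NumPy's `lstsq` computes directly. The data here is synthetic stand-in data (not the advertising CSV), so the specific numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with three predictors.
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = X @ np.array([0.05, 0.19, -0.001]) + 3.0 + rng.normal(0, 0.01, 50)

# Least squares minimizes ||Xb - y||^2; prepend a column of ones
# so the intercept is estimated alongside the slopes.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# sklearn's LinearRegression arrives at the same solution.
lm = LinearRegression().fit(X, y)
print(beta[0], lm.intercept_)  # intercepts agree
print(beta[1:], lm.coef_)      # slopes agree
```

This is why the sklearn fit is "simple": under the hood it is the same closed-form least squares problem.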
Y = Sales (also called the "target" data in Python)

X = all other variables (the independent variables)
First, I am going to import LinearRegression from the scikit-learn module. Then I am going to drop the Sales column, as I want only the predictors as my X values. I am going to store the linear regression object in a variable called lm.
```python
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
```
Preparing the X and Y variables
```python
X = data.drop(['Sales'], axis=1)
print(X.head())
Y = data.Sales
```
Fitting the model
```python
model = lm.fit(X, Y)
# The intercept and coefficients
print(model.coef_)
print(model.intercept_)
X.columns
```

```
[ 0.04576465  0.18853002 -0.00103749]
2.93888936946
Index(['TV', 'Radio', 'Newspaper'], dtype='object')
```
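A bare coefficient array is easy to misread, since you have to line it up with `X.columns` by eye. A small sketch of one way to label the coefficients, using synthetic stand-in data (the CSV URL may not be reachable, and the generating coefficients here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.random((20, 3)) * 100,
                    columns=['TV', 'Radio', 'Newspaper'])
data['Sales'] = 0.05 * data.TV + 0.19 * data.Radio + 3.0

X = data.drop(['Sales'], axis=1)
Y = data.Sales
model = LinearRegression().fit(X, Y)

# A Series keyed by column name is easier to read than a bare array.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
```

Each value is then clearly the expected change in Sales per unit change in that predictor, holding the others fixed.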
Getting the R-squared value for the model
```python
model.score(X, Y)
```

```
0.89721063817895208
```

```python
# Prediction is done through model.predict.
# Let's predict the first 5 Sales values.
model.predict(X)[0:5]
```

```
array([ 20.52397441,  12.33785482,  12.30767078,  17.59782951,  13.18867186])
```
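The value returned by `score` is the coefficient of determination, R² = 1 − SS_res / SS_tot. A quick sketch verifying that formula against `score`, on synthetic stand-in data (the coefficients and noise level here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear signal plus noise.
rng = np.random.default_rng(2)
X = rng.random((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2, model.score(X, y))  # the two values match
```

So an R² of about 0.897 means the three predictors explain roughly 90% of the variance in Sales.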
Calculating the root mean squared error.
```python
# As seen from the results, they are close to the original values.
# Let's calculate the root mean squared error.
from sklearn.metrics import mean_squared_error
from math import sqrt

z = model.predict(X)
rms = sqrt(mean_squared_error(data.Sales, z))
print(rms)
```

```
1.6685701407225697
```
The RMSE seems to be low at about 1.67, i.e. the model's predictions are off by roughly 1.67 units of Sales on average.
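RMSE is just the square root of the mean squared residual, so the `mean_squared_error` route above is equivalent to computing it by hand. A short sketch showing the two agree, on synthetic stand-in data (the generating coefficients are made up for illustration):

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Synthetic data: linear signal plus noise.
rng = np.random.default_rng(3)
X = rng.random((200, 3))
y = X @ np.array([0.05, 0.19, -0.001]) + 3.0 + rng.normal(0, 1.5, 200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# RMSE = sqrt(mean((y - yhat)^2)), computed two ways.
rmse_manual = sqrt(np.mean((y - pred) ** 2))
rmse_sklearn = sqrt(mean_squared_error(y, pred))
print(rmse_manual, rmse_sklearn)  # identical
```

Because RMSE is in the same units as the target, it is a convenient way to judge how far off predictions are on average.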