*Linear regression often feels more tedious in Python than in R, but the scikit-learn module makes it much easier.*

*Let’s start off by importing the required modules.*

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline
```

*Loading the data*

```
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# Looking at the first six rows (head() with no argument returns five).
data.head(6)
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
6    8.7   48.9       75.0    7.2
```

*The 4 variables can be divided into-*

*Independent variables: TV, Radio, Newspaper*

*Dependent variable: Sales*

```
print(data.shape)  # Like the dim function in R
(200, 4)
```

*There are 200 rows and 4 columns*

*Plotting the data-*

```
# One row of three scatter plots that share the Sales axis
fig, axs = plt.subplots(1, 3, sharey=True, figsize=(20, 10))
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0])
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
```

*From the graphs above, it can be seen that TV and Sales have a roughly linear correlation. Let’s now fit the regression model using the sklearn package. It’s very simple. I will use the least squares method to estimate the coefficients.*

*Y = Sales (also called the “target” in scikit-learn)*

*and*

*X = all other variables (the independent variables)*

*First, I am going to import LinearRegression from the scikit-learn module and store a linear regression object in a variable called lm. Then I am going to drop the Sales column, as I want only the predictors as my X values.*

```
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
```
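*LinearRegression estimates its coefficients by ordinary least squares, that is, by choosing the line that minimises the sum of squared residuals. For intuition, here is a minimal sketch of the one-predictor case in plain Python. The numbers and the names ad_spend, sales_toy, slope and intercept are made up for illustration and are not taken from the Advertising data.*

```python
# Toy data (hypothetical), roughly following sales = 1 + 2 * ad_spend
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales_toy = [2.9, 5.1, 7.0, 9.2, 10.8]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales_toy) / n

# Closed-form least squares for one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales_toy))
s_xx = sum((x - mean_x) ** 2 for x in ad_spend)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

*The slope and intercept come out close to the 2 and 1 used to invent the data. With several predictors, as in the Advertising data, the same idea applies but the solution is found with linear algebra rather than this one-line formula.*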

*Preparing the X and Y variables*

```
# Predictors: every column except Sales
X = data.drop(['Sales'], axis=1)
print(X.head())
# Target: the Sales column
Y = data.Sales
```

*Fitting the model*

```
model=lm.fit(X,Y)
#The intercept and coefficients are
print (model.coef_)
print (model.intercept_)
X.columns
```

*Output-*

```
[ 0.04576465 0.18853002 -0.00103749]
2.93888936946
Out[32]:
Index(['TV', 'Radio', 'Newspaper'], dtype='object')
```
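*The intercept and coefficients define the fitted equation: Sales ≈ intercept + 0.0458 × TV + 0.1885 × Radio − 0.0010 × Newspaper. As a small sketch, the snippet below plugs the first row of the data (TV 230.1, Radio 37.8, Newspaper 69.2) into that equation by hand, using the values printed above. The variable names are illustrative only.*

```python
# Coefficients and intercept copied from the output above
coef = {'TV': 0.04576465, 'Radio': 0.18853002, 'Newspaper': -0.00103749}
intercept = 2.93888936946

# First row of the Advertising data, as shown by data.head()
first_row = {'TV': 230.1, 'Radio': 37.8, 'Newspaper': 69.2}

# Prediction = intercept + sum of (coefficient * feature value)
manual_prediction = intercept + sum(coef[name] * first_row[name] for name in coef)
print(round(manual_prediction, 2))  # about 20.52; the observed Sales for this row is 22.1
```

*Read this way, each coefficient is the change in predicted Sales for a one-unit increase in that predictor, with the other two held fixed.*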

*Getting the R-squared value for the model*

```
model.score(X,Y)
0.89721063817895208
#Prediction can be done through model.predict
#Let's predict the first 5 sales values
model.predict(X)[0:5]
array([ 20.52397441, 12.33785482, 12.30767078, 17.59782951, 13.18867186])
```
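*For a regressor, score returns the coefficient of determination R²: one minus the residual sum of squares divided by the total sum of squares. The sketch below works that formula out by hand on a few made-up values (not the Advertising data), with illustrative variable names.*

```python
# Made-up observed and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: how far predictions are from observations
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: how far observations are from their mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

*A value near 1 means the predictions account for most of the variation in the observed values.*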

*Calculating the root mean squared error.*

```
# As seen from the predictions above, they are close to the original values.
# Let's calculate the root mean squared error.
from sklearn.metrics import mean_squared_error
from math import sqrt
z = model.predict(X)  # predictions for all 200 rows
rms = sqrt(mean_squared_error(data.Sales, z))
print(rms)
1.6685701407225697
```

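*The RMSE seems low at about 1.67, given that the Sales values shown earlier run from roughly 7 to 22. Note that it is computed on the same data the model was fitted on.*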
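*To make the RMSE formula concrete, here is a sketch that computes it by hand for just the first five rows, using the observed Sales values from data.head() and the five predictions printed earlier. Because it covers five rows rather than all 200, the result is not expected to match the figure above.*

```python
from math import sqrt

# Observed Sales for rows 1-5 and the matching model predictions,
# both copied from the outputs shown earlier in this post
observed = [22.1, 10.4, 9.3, 18.5, 12.9]
predicted = [20.52397441, 12.33785482, 12.30767078, 17.59782951, 13.18867186]

# RMSE = square root of the mean of the squared residuals
squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
rmse_first_five = sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse_first_five, 2))
```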
