# Linear Regression in Python Using scikit-learn

Linear regression can feel tedious in Python compared to R, but the scikit-learn module makes it much easier.

Let’s start off by importing the required modules.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline
```

```
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# Looking at the first six rows
data.head(6)

      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
6    8.7   48.9       75.0    7.2
```

The 4 variables can be split into:

• Dependent variable: Sales
• Independent variables: TV, Radio, Newspaper
```
print(data.shape)  # Like the dim() function in R

(200, 4)
```

There are 200 rows and 4 columns.

Plotting Sales against each feature:

```
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(20, 10))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
```

From the graphs above, it can be seen that TV and Sales have a linear correlation between them. Let's now fit the regression model using the sklearn package; it's very simple. I will use the least squares method to estimate the coefficients.
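The visual impression can be backed up numerically with pairwise correlations. The sketch below rebuilds only the six rows shown earlier as an inline DataFrame (an assumption for self-containedness; the full 200-row dataset would give slightly different numbers):

```python
import pandas as pd

# Inline copy of the six Advertising rows shown above (illustrative sample,
# not the full dataset)
data = pd.DataFrame({
    'TV':        [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    'Radio':     [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
    'Sales':     [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})

# Pearson correlation of each feature with Sales
corr_with_sales = data.corr()['Sales'].drop('Sales')
print(corr_with_sales)
```

Even on this tiny sample, TV shows a strong positive correlation with Sales, matching the scatter plot.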

Y = Sales (also called the “target” in scikit-learn)

and

X = all other variables (the independent variables)

First, I am going to import linear regression from the scikit-learn module. Then I am going to drop the Sales column, as I want only the predictors as my X values. I am going to store the linear regression object in a variable called lm.

```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
```

Preparing the X and Y variables:

```
X = data.drop(['Sales'], axis=1)
Y = data.Sales
```

Fitting the model:

```
model = lm.fit(X, Y)
# The intercept and coefficients are
print(model.coef_)
print(model.intercept_)
X.columns
```

Output:

```
[ 0.04576465  0.18853002 -0.00103749]
2.93888936946
Index(['TV', 'Radio', 'Newspaper'], dtype='object')
```
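The bare coefficient array is easier to read when each value is paired with its column name. A minimal sketch, refitting on an inline copy of the six rows shown earlier (an assumption; the coefficients will therefore not match the full-data output above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Inline sample of the six rows shown earlier (the full CSV has 200 rows)
data = pd.DataFrame({
    'TV':        [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    'Radio':     [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
    'Sales':     [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})
X = data.drop(['Sales'], axis=1)
Y = data.Sales
model = LinearRegression().fit(X, Y)

# Pair each column name with its fitted coefficient
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
print('intercept:', model.intercept_)
```

A pandas Series indexed by column names makes it immediately clear which coefficient belongs to which advertising channel.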

Getting the R-squared value for the model:

```
model.score(X, Y)
0.89721063817895208

# Prediction can be done through model.predict
# Let's predict the first 5 Sales values
model.predict(X)[0:5]
array([ 20.52397441,  12.33785482,  12.30767078,  17.59782951,  13.18867186])
```
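Note that the score above is computed on the same rows the model was fitted on, so it is an optimistic estimate of performance on new data. A common remedy is a train/test split. The sketch below uses synthetic stand-in data with a known linear relationship (an assumption, since the tutorial's CSV may not be reachable in every environment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Advertising data (assumption): 200 rows,
# 3 features, and a known linear signal plus Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=200)

# Hold out 25% of the rows so the score measures generalisation,
# not memorisation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on unseen rows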

Calculating the root mean squared error (RMSE):

```
# As seen from the results, they are close to the original values
# Let's calculate the root mean squared error
from sklearn.metrics import mean_squared_error
from math import sqrt
z = model.predict(X)
rms = sqrt(mean_squared_error(data.Sales, z))
print(rms)
1.6685701407225697
```

The RMSE is fairly low at about 1.67.
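As with the R² above, this RMSE is measured on the training rows, so it understates the error on new data. Cross-validation gives a less optimistic estimate. A sketch on synthetic stand-in data with a known noise level (an assumption for self-containedness):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): linear signal plus noise of std 1.5
rng = np.random.RandomState(1)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=200)

# 5-fold cross-validation; sklearn returns negated MSE, so flip the sign
neg_mse = cross_val_score(LinearRegression(), X, y,
                          cv=5, scoring='neg_mean_squared_error')
rmse_cv = np.sqrt(-neg_mse).mean()
print(rmse_cv)  # should land near the true noise level of 1.5
```

Each fold's RMSE is computed on rows held out of that fold's fit, so the average reflects out-of-sample error rather than memorisation.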