In the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART I – INTRO’, I explained how regression can be applied in business to improve operations and revenue. The data a business produces needs to be analyzed with different regression models, because no single model fits every dataset; each model has its own strengths and its own conditions. In this article, I go into detail about the linear regression model and its application using R and Python.

**LINEAR REGRESSION**

Linear regression is the most basic and most commonly used regression model in data analysis. It describes the relationship between independent variables (x1, x2, …) and a dependent variable (y), often read as a **‘CAUSE AND EFFECT’** relationship, although a fitted model shows association and does not by itself prove causation. Linear regression models this relationship by fitting a linear equation to the observed data.

**SIMPLE LINEAR REGRESSION**

Here the value of the **response variable (y)** depends on the value of a single **explanatory variable (x)**.

**MULTIPLE LINEAR REGRESSION**

Here the value of the **response variable (y)** depends on the values of multiple **explanatory variables (x1, x2, x3, …)**.
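
For reference, the two model forms described above are commonly written as follows. The symbols $b_0$ (intercept), $b_1, \dots, b_n$ (coefficients) and $\varepsilon$ (error term) are notation assumed here for illustration; they do not appear in the original text.

```latex
\begin{align}
\text{Simple:}   \quad & y = b_0 + b_1 x + \varepsilon \\
\text{Multiple:} \quad & y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
\end{align}
```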

**LEAST SQUARES METHOD**

The most common method for fitting a regression line is the **‘METHOD OF LEAST SQUARES’**. This method finds the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
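
As a rough illustration of the idea, the sketch below fits a line with one explanatory variable using the standard closed-form least squares formulas. It uses only R&D Spend and Profit from the five sample rows listed in the DATASET section, so the numbers it prints are not the model fitted in this article; the function name `fit_least_squares` is made up for this example.

```python
# Closed-form least squares for a single explanatory variable.
# x = R&D Spend, y = Profit, taken from the five sample rows of the dataset.
x = [165349.2, 162597.7, 153441.51, 144372.41, 142107.34]
y = [192261.83, 191792.06, 191050.39, 182901.99, 166187.94]


def fit_least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_least_squares(x, y)

# Vertical deviations (residuals) and the quantity being minimized.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)
print(f"intercept={intercept:.2f}, slope={slope:.4f}, SSE={sse:.2f}")
```

Any other choice of slope and intercept would give a larger sum of squared deviations on these five points, which is what "best-fitting" means here.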

**DATASET**

A dataset with 5 variables and 50 observations has been selected for this analysis. In this dataset, ‘**Profit**’ is the dependent/response variable (y), and the remaining variables are the independent/explanatory variables (x1, x2, x3, and x4). The first 5 observations of the dataset are shown in the table below.

| R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|
| 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |

**IMPORTING DATASET IN R AND PYTHON**

The data, which is in ‘CSV’ format, has been imported using the following code.

**R**

    # Importing the dataset
    dataset = read.csv('50_Startups.csv')

Here the **‘read.csv’** function is used to import the file.

**PYTHON**

    # Importing the dataset
    import pandas
    dataset = pandas.read_csv('50_Startups.csv')
    X = dataset.iloc[:, :-1].values  # all columns except the last one (Profit)
    y = dataset.iloc[:, 4].values    # the Profit column

Here the **‘pandas’** library is used to import the file through its ‘read_csv’ function. **‘iloc’** is used to select the required columns by position.

**DATA PRE-PROCESSING**

Before fitting any regression model to the dataset, data should be pre-processed as explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART II – DATA’.

Steps involved in Data Preprocessing

- Replacing the missing data
- Converting the categorical variable to a numerical variable
- Splitting the data into training and test sets
- Adding dummy variables (only for Python)

‘**Feature Scaling**’ is not needed for the linear regression model, because the fitted coefficients adjust to the scale of each variable.
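
The steps listed above could look roughly like this in Python. This is only a sketch, not the exact procedure from Part II: the mean imputation, the `drop_first=True` option, the 20% test fraction and `random_state=0` are all assumptions made for illustration, and the data frame holds just the five sample rows shown earlier instead of the full CSV file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Five sample rows from the dataset table; the full file has 50 rows.
dataset = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.8, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.1, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Replace missing numeric values with the column mean
# (assumed strategy; nothing is missing in these five rows).
numeric_cols = ["R&D Spend", "Administration", "Marketing Spend"]
dataset[numeric_cols] = dataset[numeric_cols].fillna(dataset[numeric_cols].mean())

# Convert the categorical State column into 0/1 dummy columns.
# drop_first=True keeps k-1 dummies for k states, so the dummy
# columns are not perfectly collinear with the intercept.
X = pd.get_dummies(dataset.drop(columns="Profit"), columns=["State"], drop_first=True)
y = dataset["Profit"]

# Hold out 20% of the rows as the test set (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

On the full 50-row file, a 20% test fraction yields 40 training rows and 10 test rows.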

**RESULT**

After data pre-processing, the dataset with 50 observations and 5 variables is divided as follows:

- Training set of 40 observations and 5 variables
- Test set of 10 observations and 5 variables

**LINEAR REGRESSION USING R**

**STEP- ONE**

    # Fitting Multiple Linear Regression to the Training set
    regressor = lm(formula = Profit ~ State + Marketing.Spend + Administration + R.D.Spend,
                   data = training_set)

Here **‘lm’** (linear models) is the function used to fit the linear regression. The dependent variable is written before **‘~’** and the independent variables after it.

**STEP- TWO**

Now the fitted **‘regressor’** model is used to predict the profits of the test set from the independent variables of the test set.

    # Predicting the Test set results
    y_pred = predict(regressor, newdata = test_set)

Here the **‘predict’** function is used to calculate the dependent variable for the test set.

**LINEAR REGRESSION USING PYTHON**

**STEP- ONE**

    # Fitting Multiple Linear Regression to the Training set
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

‘**LinearRegression**’ is the class used to fit a linear model to the dataset; its ‘fit’ method estimates the coefficients from the training set.

**STEP- TWO**

    # Predicting the Test set results
    y_pred = regressor.predict(X_test)

Here the **‘predict’** method is used to calculate the dependent variable for the test set.
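
To see what the two steps do, it can help to run them on data where the answer is known in advance. The sketch below uses hypothetical toy data generated from y = 1 + 2·x1 + 3·x2 with no noise (these numbers are invented for illustration and are unrelated to the startup dataset) and then reads the fitted values back from the `intercept_` and `coef_` attributes of the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X_train = np.array(
    [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]], dtype=float
)
y_train = 1 + 2 * X_train[:, 0] + 3 * X_train[:, 1]

# Step one: fit the model to the training set.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# With noise-free linear data, the fit recovers the generating
# intercept and coefficients (up to floating-point error).
print(regressor.intercept_, regressor.coef_)

# Step two: predict for unseen inputs.
X_test = np.array([[3.0, 3.0]])
y_pred = regressor.predict(X_test)  # 1 + 2*3 + 3*3 = 16
print(y_pred)
```

With real data such as the startup dataset, the coefficients are estimates, and the predictions will differ from the real values by some error.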

**FINAL RESULT**

The predicted and real profits of the test set are shown below.

| Y_Predict | Y_Real |
|---|---|
| 173981.09 | 182901.99 |
| 172655.64 | 166187.94 |
| 160250.02 | 155752.6 |
| 135513.9 | 146121.95 |
| 146059.36 | 129917.04 |
| 114151.03 | 122776.86 |
| 117081.62 | 118474.03 |
| 110671.31 | 108733.99 |
| 98975.29 | 99937.59 |
| 96867.03 | 97483.56 |
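
One way to put a number on how close the predictions are is to compute a few standard error measures from the ten pairs above. The sketch below is an addition for illustration, not part of the original analysis; it computes the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R²) on the test set.

```python
import math

# Predicted and real profits of the test set, copied from the table above.
y_pred = [173981.09, 172655.64, 160250.02, 135513.9, 146059.36,
          114151.03, 117081.62, 110671.31, 98975.29, 96867.03]
y_real = [182901.99, 166187.94, 155752.6, 146121.95, 129917.04,
          122776.86, 118474.03, 108733.99, 99937.59, 97483.56]

n = len(y_real)
errors = [p - r for p, r in zip(y_pred, y_real)]

# Average size of the error, in the same units as Profit.
mae = sum(abs(e) for e in errors) / n

# Like MAE, but penalizes large errors more heavily.
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# Share of the variation in real profits explained by the predictions.
mean_real = sum(y_real) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((r - mean_real) ** 2 for r in y_real)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```

An R² close to 1 and an error that is small relative to the profit values would support reading the predictions as approximately equal to the real values.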

**CONCLUSION**

Since the predicted values are close to the real values, ‘**regressor**’ can be used to estimate ‘profits’ for values of the independent variables (R&D Spend, Administration, Marketing Spend and State) that lie within the range of the training data. The accuracy of ‘**regressor**’ generally improves as the number of observations increases.

**APPLICATIONS**

- Evaluating Trends and Sales Estimates
- Analyzing the Impact of Price Changes
- Assessing Risk
- Effect of interest rates on stock price
- Sensitivity of Sales on advertising expenditures
- Predicting future sales, requirements, etc.

**-Avinash Reddy**