In the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART I – INTRO’, I explained how regression can be applied in business to improve operations and revenues. In this article, I briefly introduce POLYNOMIAL REGRESSION and its application using R and Python.
When the response variable (y) changes linearly with the explanatory variable (x), linear regression is applied, as explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART III – LINEAR REGRESSION’. But when the response variable (y) changes exponentially with the explanatory variable (x), a simple linear regression model cannot predict accurate results. For this kind of dataset, POLYNOMIAL REGRESSION is used.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^{2} + \varepsilon$$
In the above polynomial regression equation, β_0 is the intercept, β_1 and β_2 are the coefficients of the powers of the variable (x), and ε is the error term. The highest power of x, here two, is the degree of the polynomial.
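The same form extends to higher degrees. As a sketch in the same notation, the degree-five model fitted later in this article can be written as:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^{2} + \beta_3 X_1^{3} + \beta_4 X_1^{4} + \beta_5 X_1^{5} + \varepsilon
```

The extra terms only add coefficients. The model is still linear in the β values, which is why ordinary linear regression routines can fit it.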
For example, the fixed cost per unit of a firm decreases as the number of units produced increases. In this case, a nonlinear curve fits better than a straight line.
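The fixed-cost example can be illustrated with a small sketch. The figures here are hypothetical (a fixed cost of 100000 spread over 1 to 10 units), and NumPy's `polyfit` is used only as a convenient stand-in for the regression tools shown later. It compares the squared error of a straight line with that of a degree-two polynomial.

```python
import numpy as np

# Hypothetical figures: a fixed cost of 100000 spread over 1..10 units.
fixed_cost = 100_000.0
units = np.arange(1, 11, dtype=float)
cost_per_unit = fixed_cost / units


def residual_sum_of_squares(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coeffs = np.polyfit(units, cost_per_unit, degree)
    fitted = np.polyval(coeffs, units)
    return float(np.sum((cost_per_unit - fitted) ** 2))


rss_line = residual_sum_of_squares(1)   # straight line
rss_curve = residual_sum_of_squares(2)  # degree-two polynomial

print(f"Straight line RSS: {rss_line:.1f}")
print(f"Degree-two RSS:    {rss_curve:.1f}")
```

On data like this, the curved fit should leave a smaller residual than the straight line, which is the point of the example above.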
DATASET
For this analysis, a dataset with 2 variables and 10 different levels is selected. In this dataset, ‘SALARY’ is the dependent/response variable (y) and ‘LEVEL’ is the independent/explanatory variable (x).
Position | Level | Salary
Business Analyst | 1 | 45000
Junior Consultant | 2 | 50000
Senior Consultant | 3 | 60000
Manager | 4 | 80000
Country Manager | 5 | 110000
Region Manager | 6 | 150000
Partner | 7 | 200000
Senior Partner | 8 | 300000
C-level | 9 | 500000
CEO | 10 | 1000000
IMPORTING DATASET IN R AND PYTHON
The data, which is in CSV format, is imported using the following code.
R
# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
# Keep only columns 2 and 3 (Level and Salary)
dataset = dataset[2:3]
Here, the ‘read.csv’ function is used to import the file. The ‘Position’ column is dropped since it is not used as a variable.
PYTHON
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values  # Level, kept as a 2-D array
y = dataset.iloc[:, 2].values    # Salary, as a 1-D array
Here, the ‘pandas’ library is used for importing the file. ‘iloc’ is used to select the required columns by position. The ‘Position’ column is left out since it is not used as a variable.
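The difference between the two ‘iloc’ selections is easy to miss, so here is a minimal sketch. It assumes the table from the DATASET section is typed inline instead of being read from the CSV file, and it simply prints the shapes of the two arrays.

```python
import pandas as pd

# The article's table, typed inline so that no CSV file is needed.
dataset = pd.DataFrame({
    "Position": ["Business Analyst", "Junior Consultant", "Senior Consultant",
                 "Manager", "Country Manager", "Region Manager", "Partner",
                 "Senior Partner", "C-level", "CEO"],
    "Level": list(range(1, 11)),
    "Salary": [45000, 50000, 60000, 80000, 110000,
               150000, 200000, 300000, 500000, 1000000],
})

X = dataset.iloc[:, 1:2].values  # the slice 1:2 keeps a column: shape (10, 1)
y = dataset.iloc[:, 2].values    # a single index gives a vector: shape (10,)

print(X.shape, y.shape)
```

scikit-learn estimators expect the features as a 2-D array, which is why the slice `1:2` is used for X rather than the single index `1`.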
DATA PRE-PROCESSING
Before fitting any regression model to the dataset, the data should be pre-processed as explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART II – DATA’. However, the given dataset does not need any of these pre-processing steps, as it is already clean and ordered.
LINEAR & POLYNOMIAL REGRESSION USING R
STEP- ONE: Fitting both simple linear and polynomial regression lines to the datasets
# Fitting Linear Regression to the dataset
lin_reg = lm(formula = Salary ~ Level, data = dataset)

# Fitting Polynomial Regression to the dataset
# Add the powers of Level (2 to 5) as extra columns
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
dataset$Level5 = dataset$Level^5
poly_reg = lm(formula = Salary ~ ., data = dataset)
Here, the ‘lm’ (Linear Model) function is used for both fits. The dependent variable is written before ‘~’ and the independent variables after it. ‘.’ stands for all the remaining variables in the dataset.
STEP- TWO: Visualization of Regression lines
# Visualising the Linear and Polynomial Regression results
install.packages('ggplot2')  # only needed once
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'navyblue') +
  geom_line(aes(x = dataset$Level, y = predict(lin_reg, newdata = dataset)),
            colour = 'green4') +
  geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
            colour = 'red3') +
  ggtitle('Polynomial Regression with Degree FIVE') +
  xlab('Level') +
  ylab('Salary')
‘ggplot2’ is used for plotting the points and the two regression lines.
RESULT
STEP- THREE: Predicting the values
# Predicting a new result with Linear Regression
predict(lin_reg, data.frame(Level = 6.25))

# Predicting a new result with Polynomial Regression
predict(poly_reg, data.frame(Level = 6.25,
                             Level2 = 6.25^2,
                             Level3 = 6.25^3,
                             Level4 = 6.25^4,
                             Level5 = 6.25^5))
Here, the ‘predict’ function is used to calculate the dependent variable for a level of 6.25.
RESULT
Linear Regression= Rs.310159.1
Polynomial Regression= Rs.163491.7
LINEAR & POLYNOMIAL REGRESSION USING PYTHON
STEP- ONE: Fitting both LINEAR & POLYNOMIAL regression lines to the datasets
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=5)
X_poly = poly_reg.fit_transform(X)  # columns: 1, x, x^2, ..., x^5
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
‘LinearRegression’ is a class used for fitting a linear model to the dataset. ‘PolynomialFeatures’ is a class that generates the polynomial terms of x up to the chosen degree. ‘fit_transform’ is used to fit the transformer and transform the dataset in one step.
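To make the role of ‘PolynomialFeatures’ concrete, the sketch below expands two made-up level values (2 and 3) to degree three instead of five, so that the output stays small enough to read.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative level values, one per row (not taken from the dataset).
levels = np.array([[2.0], [3.0]])

# Each row becomes [1, x, x^2, x^3]; the leading 1 is the bias column.
expanded = PolynomialFeatures(degree=3).fit_transform(levels)
print(expanded)
```

Each input value is turned into a row of its powers, and it is this expanded matrix that the second ‘LinearRegression’ object is fitted on.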
STEP- TWO: Visualization of regression lines
# Visualising the Linear and Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color='red')
plt.plot(X, lin_reg.predict(X), color='green')
plt.title('Polynomial Regression with degree FIVE')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
‘matplotlib.pyplot’ is used for plotting the points and the two regression lines.
RESULT
STEP- THREE: Predicting the values
# Predicting a new result with Linear Regression
# (scikit-learn expects a 2-D array, hence the double brackets)
lin_reg.predict([[6.25]])

# Predicting a new result with Polynomial Regression
lin_reg_2.predict(poly_reg.fit_transform([[6.25]]))
RESULT
Linear Regression= Rs.310159.09090909
Polynomial Regression= Rs.163491.70775529
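As a cross-check that does not need the CSV file, the two predictions can be recomputed with NumPy alone. This is a sketch: it uses `np.polyfit`, not the scikit-learn classes above, and types the dataset inline. Both approaches solve the same least-squares problem, so the values should agree up to rounding.

```python
import numpy as np

# Levels and salaries from the DATASET section, typed inline.
levels = np.arange(1, 11, dtype=float)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000], dtype=float)

# Least-squares fits of degree one (straight line) and degree five.
linear_coeffs = np.polyfit(levels, salaries, 1)
poly_coeffs = np.polyfit(levels, salaries, 5)

# Evaluate both fitted polynomials at level 6.25.
linear_prediction = float(np.polyval(linear_coeffs, 6.25))
poly_prediction = float(np.polyval(poly_coeffs, 6.25))

print(f"Linear Regression     = Rs.{linear_prediction:.2f}")
print(f"Polynomial Regression = Rs.{poly_prediction:.2f}")
```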
HOW TO DETERMINE THE RIGHT DEGREE?
The polynomial equation has to be built with the right degree for accurate predictions. Here, the power of x was increased step by step until the line fitted the dataset. For the given dataset, the polynomial equation with degree FIVE gives the best fit, as shown in the figure below.
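The step-by-step search described above can be sketched as a small loop. This is an illustration only, not the procedure used to produce the figure. It uses `np.polyfit` and the R-squared of the fit on the same ten points, with the dataset typed inline.

```python
import numpy as np

# Levels and salaries from the DATASET section, typed inline.
levels = np.arange(1, 11, dtype=float)
salaries = np.array([45000, 50000, 60000, 80000, 110000,
                     150000, 200000, 300000, 500000, 1000000], dtype=float)


def r_squared(degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(levels, salaries, degree)
    fitted = np.polyval(coeffs, levels)
    ss_res = np.sum((salaries - fitted) ** 2)
    ss_tot = np.sum((salaries - salaries.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Try degrees one to five and report how well each fits the data.
scores = {degree: r_squared(degree) for degree in range(1, 6)}
for degree, score in scores.items():
    print(f"degree {degree}: R^2 = {score:.4f}")
```

Because R-squared measured on the fitted points cannot decrease as the degree rises, it is best read together with the plot rather than on its own.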
CONCLUSION
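For a level of 6.25, the value predicted by polynomial regression (Rs. 1,63,491.71) is more accurate than the value predicted by linear regression (Rs. 3,10,159.09): it lies between the salaries at levels 6 and 7, while the linear prediction is higher by about Rs. 1,46,667. From this, we can conclude that polynomial regression is the better choice when the response variable (y) changes exponentially with the explanatory variable (x).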
APPLICATION OF POLYNOMIAL REGRESSION
- Predicting the average cost for a particular output
- Predicting the output for the labor hired
- Estimating the ordering cost for a specific number of units
- Estimating the salaries of job applicants with specific years of experience, etc.
-Avinash Reddy