REGRESSION FOR BUSINESS, USING R & PYTHON: PART V – LOGIT & PROBIT REGRESSION


In the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART I – INTRO’, I explained how regression can be applied in business to improve operations and revenue. In this article, I will cover Logit regression and Probit regression and how to model them using R and Python.

When the response variable (y) is dichotomous, i.e. a variable with only two values such as 0 or 1, or TRUE or FALSE, limited dependent variable models such as Logit and Probit are used to relate it to one or more explanatory variables (x).

Logit Regression Equation:

Logit(P) = ln(P / (1 − P)) = β0 + β1X1 + β2X2 + ….. + βnXn

where P denotes the probability of the event and ln is the natural logarithm; the logit is the inverse of the cumulative distribution function of the standard logistic distribution.

Probit Regression Equation

Probit(P) = Φ⁻¹(P) = β0 + β1X1 + β2X2 + ….. + βnXn

Here P denotes the probability of the event, and Φ⁻¹ is the inverse of the Cumulative Distribution Function (CDF) of the standard normal distribution.
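
As a quick worked example with purely hypothetical values (an intercept β0 = −1, a single slope β1 = 0.5 and a predictor value x = 3, none of which come from the dataset used later), the probability is recovered by applying the inverse link function to the linear predictor. A minimal Python sketch:

# Hypothetical coefficients, for illustration only (not estimated from the dataset)
from scipy.special import expit   # logistic CDF, i.e. the inverse of the logit link
from scipy.stats import norm      # standard normal distribution

beta0, beta1, x = -1.0, 0.5, 3.0
eta = beta0 + beta1 * x           # linear predictor = 0.5

p_logit = expit(eta)              # logit model:  P = 1 / (1 + exp(-eta)) ≈ 0.62
p_probit = norm.cdf(eta)          # probit model: P = Φ(eta)             ≈ 0.69
print(p_logit, p_probit)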

 

As explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART III – LINEAR REGRESSION’, the simple linear regression equation has an error term that is estimated by the method of least squares. In the latent-variable interpretation of the logit and probit models, by contrast, the logit model assumes errors that follow a standard logistic distribution, while the probit model assumes errors that follow a standard normal distribution.

Both models produce sigmoid, i.e. ‘S’-shaped, curves that map the linear predictor to a probability between 0 and 1, which is then thresholded to give the 0 or 1 outcome, as illustrated by the plot sketched below.
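
A minimal sketch that plots the two sigmoid curves, assuming matplotlib and scipy are available:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from scipy.stats import norm

z = np.linspace(-4, 4, 200)                        # range of the linear predictor
plt.plot(z, expit(z), label='Logit (logistic CDF)')
plt.plot(z, norm.cdf(z), label='Probit (normal CDF)')
plt.xlabel('Linear predictor')
plt.ylabel('P(y = 1)')
plt.title('Sigmoid curves of the logit and probit links')
plt.legend()
plt.show()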

DATASET

For this analysis, a dataset with 5 variables and 400 observations is used. ‘Purchased’ is the dependent/response variable (y), and ‘EstimatedSalary’ and ‘Age’ are the independent/explanatory variables (x). The dataset is therefore reduced to these 3 variables, since only they are used in the analysis. The head of the dataset looks like this:

User ID Gender Age EstimatedSalary Purchased
15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0

IMPORTING DATASET IN R AND PYTHON

The data, which is in ‘CSV’ format, has been imported using the following code.

R

# Importing the dataset

dataset = read.csv('Social_Network_Ads.csv')

dataset = dataset[3:5]   # keep only Age, EstimatedSalary and Purchased

Here, the ‘read.csv’ function is used to import the file, and the indexing on the next line keeps only the three columns needed for the analysis.

PYTHON

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, [2, 3]].values   # Age and EstimatedSalary

y = dataset.iloc[:, 4].values        # Purchased

Here, the ‘pandas’ library is used to import the file, and ‘iloc’ selects the required columns by position.
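
To confirm that the import worked as expected, the first rows and the array shapes can be inspected (an optional check, not part of the original code):

print(dataset.head())     # first five rows, as shown in the table above
print(dataset.shape)      # expected: (400, 5)
print(X.shape, y.shape)   # expected: (400, 2) (400,)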

DATA PRE-PROCESSING

Before fitting any regression model to the dataset, the data should be pre-processed as explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART II – DATA’. However, the given dataset needs only two of these pre-processing steps:

  1. Feature Scaling
  2. Splitting data into training and test sets

R

# Splitting the dataset into the Training set and Test set

install.packages('caTools')

library(caTools)

split = sample.split(dataset$Purchased, SplitRatio = 0.75)

training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)
# Feature Scaling

training_set[-3] = scale(training_set[-3])

test_set[-3] = scale(test_set[-3])

PYTHON

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split   # cross_validation is deprecated in newer scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Result

The dataset is split into a training set with 300 observations and a test set with 100 observations; in R each set keeps the 3 selected variables, while in Python the predictors and the response are held in separate arrays.
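
The Python split can be verified by checking the array shapes (an optional check, not in the original code):

print(X_train.shape, X_test.shape)   # expected: (300, 2) (100, 2)
print(y_train.shape, y_test.shape)   # expected: (300,) (100,)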

LOGIT & PROBIT REGRESSION USING R

STEP-ONE: Fitting both Logit and Probit regression models to the training set

# Fitting Logistic Regression to the Training set

Regressor.Logit = glm(formula = Purchased ~ .,

                 family = binomial(link="logit"),

                 data = training_set)
# Fitting Probit Regression to the Training set

Regressor.Probit = glm(formula = Purchased ~ .,

                 family = binomial(link="probit"),

                 data = training_set)

‘glm’ is the generalized linear model function; the ‘link’ argument of the binomial family selects the logit or probit link.

STEP-TWO: Predicting the test set results

# Predicting the Test set results using Logit

prob_pred_Logit = predict(Regressor.Logit,type = 'response', newdata = test_set[-3])

y_pred_Logit = ifelse(prob_pred_Logit > 0.5, 1, 0)
# Predicting the Test set results using Probit

prob_pred_Probit = predict(Regressor.Probit, type = 'response', newdata = test_set[-3])

y_pred_Probit = ifelse(prob_pred_Probit > 0.5, 1, 0)

STEP-THREE: Validation Score

# Making the Confusion Matrix for logit

cm_Logit = table(test_set[, 3], y_pred_Logit > 0.5)
# Making the Confusion Matrix for probit

cm_Probit = table(test_set[, 3], y_pred_Probit > 0.5)

 

LOGIT REGRESSION USING PYTHON

STEP-ONE: Fitting the Logit regression model to the training set

# Fitting Logistic Regression to the Training set

from sklearn.linear_model import LogisticRegression

regressor = LogisticRegression(random_state = 0)

regressor.fit(X_train, y_train)

STEP-TWO: Predicting the test set results

# Predicting the Test set results

y_pred = regressor.predict(X_test)

STEP-THREE: Validation Score

# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
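
scikit-learn’s LogisticRegression covers only the logit link, so the Python steps above fit just the logit model. If a probit model is also wanted in Python, one option is the statsmodels package; the sketch below assumes statsmodels is installed, reuses the scaled X_train/X_test arrays from the pre-processing step, and is not part of the original workflow.

import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

# Fit the probit model on the training data and inspect the coefficients
probit_results = sm.Probit(y_train, X_train_sm).fit()
print(probit_results.summary())

# Predicted probabilities on the test set, thresholded at 0.5
probit_probs = probit_results.predict(X_test_sm)
probit_pred = (probit_probs > 0.5).astype(int)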

RESULT

R

(rows: actual ‘Purchased’, columns: predicted class)

        FALSE   TRUE
0          57      7
1          10     26

PYTHON

(rows: actual ‘Purchased’, columns: predicted class)

            0      1
0          65      3
1           8     24

The tables above are confusion matrices, which show the correct and incorrect predictions of each model (the off-diagonal cells correspond to type I and type II errors). The R model made 83% correct predictions ((57 + 26) / 100), while the Python model’s correct prediction rate is 89% ((65 + 24) / 100).
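
The reported accuracies follow directly from the confusion matrices; in Python, for example, they can be computed as follows (a small check using the objects created above):

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))   # (65 + 24) / 100 = 0.89 for the Python model

# Equivalently, from the confusion matrix itself:
print(cm.trace() / cm.sum())            # correct predictions (diagonal) over all predictions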

CONCLUSION

Despite using different link functions, the Logit and Probit regressions come up with very similar results, and with more observations the models can generally be estimated more precisely.

APPLICATIONS

  • Predicting customer retention (yes or no)
  • Modelling the pattern of customers buying a product
  • Detecting whether a credit card transaction is fraudulent
  • Predicting whether a customer will default on a loan
  • Predicting whether a given user will buy an insurance product
  • Predicting whether viewers will click on an advertisement

-Avinash Reddy
