In the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART I – INTRO’, I explained how regression is applied in business to improve operations and revenues. In this article, I will briefly explain Logit regression and Probit regression and how to model them in R and Python.
If, for one or more explanatory variables (x), the response variable (y) is dichotomous, i.e. a variable with only two values such as 0 or 1, or TRUE or FALSE, then limited dependent variable models such as Logit and Probit are used.
Logit Regression Equation:
Logit(P) = ln(P / (1 − P)) = β0 + β1X1 + β2X2 + … + βnXn
where P denotes the probability and ln is the natural logarithm; the logit function is the inverse of the Cumulative Distribution Function of the logistic distribution.
Probit Regression Equation:

Probit(P) = Φ⁻¹(P) = β0 + β1X1 + β2X2 + … + βnXn

Here P denotes the probability, and Φ is the Cumulative Distribution Function (CDF) of the standard normal distribution; the probit function is its inverse, Φ⁻¹.
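The two link functions can be compared numerically. Here is a minimal sketch using NumPy and SciPy (assuming both are installed); the values of z are made up for the example:

```python
import numpy as np
from scipy.stats import logistic, norm

# Values of the linear predictor z = b0 + b1*x1 + ... + bn*xn
z = np.array([-2.0, 0.0, 2.0])

# Logit model: P = 1 / (1 + exp(-z)), i.e. the logistic CDF
p_logit = logistic.cdf(z)

# Probit model: P = Phi(z), i.e. the standard normal CDF
p_probit = norm.cdf(z)

# Both links map the whole real line onto (0, 1) and equal 0.5 at z = 0
print(p_logit)   # approx. [0.119, 0.5, 0.881]
print(p_probit)  # approx. [0.023, 0.5, 0.977]
```

The probit curve rises more steeply than the logit curve, but both produce the same kind of ‘S’ shape.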
As discussed in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART III – LINEAR REGRESSION’, the simple linear regression equation has an error term, which is estimated by the Method of Least Squares. In the latent variable interpretation of the logit and probit models, by contrast, Logit assumes a standard logistic distribution of errors while Probit assumes a standard normal distribution of errors.
Both models produce sigmoid curves, i.e. ‘S’-shaped curves that map any input to a probability between 0 and 1, as shown in the figure above.
DATASET
For this analysis, a dataset with 5 variables and 400 observations is selected. In this dataset, ‘Purchased’ is the dependent/response variable (y), and ‘Age’ and ‘EstimatedSalary’ are the independent/explanatory variables (x). The dataset has been reduced to these 3 variables, since only they are used in the analysis. The head of the dataset looks like this:
User ID  Gender  Age  EstimatedSalary  Purchased 
15624510  Male  19  19000  0 
15810944  Male  35  20000  0 
15668575  Female  26  43000  0 
15603246  Female  27  57000  0 
15804002  Male  19  76000  0 
15728773  Male  27  58000  0 
IMPORTING DATASET IN R AND PYTHON
The data, which is in CSV format, has been imported using the following code.
R
# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
Here, the ‘read.csv’ function is used to import the file, and ‘dataset[3:5]’ keeps only columns 3 to 5 (Age, EstimatedSalary and Purchased).
PYTHON
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Here, the ‘pandas’ library is used to import the file, and ‘iloc’ is used to select the required columns by integer position.
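The column selection can be seen on a small throwaway DataFrame that mirrors the head of the dataset shown above:

```python
import pandas as pd

# A throwaway DataFrame mirroring the first rows of the dataset
df = pd.DataFrame({
    'User ID': [15624510, 15810944],
    'Gender': ['Male', 'Male'],
    'Age': [19, 35],
    'EstimatedSalary': [19000, 20000],
    'Purchased': [0, 0],
})

# iloc selects by integer position: columns 2 and 3 are Age and EstimatedSalary
X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 4].values   # column 4 is Purchased

print(X.tolist())  # [[19, 19000], [35, 20000]]
print(y.tolist())  # [0, 0]
```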
DATA PREPROCESSING
Before fitting any regression model to the dataset, the data should be preprocessed as explained in the article ‘REGRESSION FOR BUSINESS, USING R & PYTHON: PART II – DATA’. However, the given dataset needs only two of these preprocessing steps:
- Feature scaling
- Splitting the data into training and test sets
R
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)  # fix the random seed so the split is reproducible
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling (scale the feature columns, not ‘Purchased’ in column 3)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
PYTHON
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
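What StandardScaler does can be reproduced by hand: it subtracts each column’s mean and divides by its standard deviation. A sketch with NumPy on a few made-up (Age, EstimatedSalary) rows:

```python
import numpy as np

# Three sample rows of (Age, EstimatedSalary)
X_sample = np.array([[19., 19000.],
                     [35., 20000.],
                     [26., 43000.]])

# Standardize: subtract the column mean, divide by the column std
mean = X_sample.mean(axis=0)
std = X_sample.std(axis=0)     # population std, as StandardScaler uses
X_scaled = (X_sample - mean) / std

# After scaling, every column has mean ~0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Scaling matters here because EstimatedSalary is several orders of magnitude larger than Age and would otherwise dominate the model.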
Result
The dataset is split into a training set of 300 observations and a test set of 100 observations, each with the 3 selected variables.
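The 300/100 split follows directly from the 0.75/0.25 ratio on 400 rows. What train_test_split does can be sketched with NumPy alone (a shuffled index split on 400 synthetic rows):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400                                  # total observations

indices = rng.permutation(n)             # shuffle all row indices
n_test = int(n * 0.25)                   # 25% held out for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))     # 300 100
```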
LOGIT & PROBIT REGRESSION USING R
STEP ONE: Fitting both the Logit and Probit regression models to the training set
# Fitting Logistic Regression to the Training set
Regressor.Logit = glm(formula = Purchased ~ .,
                      family = binomial(link = "logit"),
                      data = training_set)

# Fitting Probit Regression to the Training set
Regressor.Probit = glm(formula = Purchased ~ .,
                       family = binomial(link = "probit"),
                       data = training_set)
‘glm’ is R’s generalized linear model function; the ‘link’ argument of the binomial family selects the logit or probit link.
STEP TWO: Predicting the test set
# Predicting the Test set results using Logit
prob_pred_Logit = predict(Regressor.Logit, type = 'response', newdata = test_set[-3])
y_pred_Logit = ifelse(prob_pred_Logit > 0.5, 1, 0)

# Predicting the Test set results using Probit
prob_pred_Probit = predict(Regressor.Probit, type = 'response', newdata = test_set[-3])
y_pred_Probit = ifelse(prob_pred_Probit > 0.5, 1, 0)
STEP THREE: Validation score
# Making the Confusion Matrix for logit
cm_Logit = table(test_set[, 3], y_pred_Logit > 0.5)

# Making the Confusion Matrix for probit
cm_Probit = table(test_set[, 3], y_pred_Probit > 0.5)
LOGIT REGRESSION USING PYTHON
STEP ONE: Fitting the Logit regression model to the training set
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression(random_state = 0)
regressor.fit(X_train, y_train)
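Note that scikit-learn’s LogisticRegression supports only the logit link; a probit model in Python is typically fitted with the statsmodels package instead, which is not used in this article. To illustrate the underlying idea, here is a minimal maximum-likelihood sketch of a probit fit on synthetic data, using only NumPy and SciPy (all names and data here are made up for the example):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data: y = 1 when the latent variable b0 + b1*x + noise > 0
n = 1000
x = rng.normal(size=n)
true_beta = np.array([0.5, 1.5])              # intercept, slope
latent = true_beta[0] + true_beta[1] * x + rng.normal(size=n)
y = (latent > 0).astype(float)

X = np.column_stack([np.ones(n), x])          # add an intercept column

def neg_log_likelihood(beta):
    p = norm.cdf(X @ beta)                    # probit link: P(y=1) = Phi(X.beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(result.x)  # estimates should land close to [0.5, 1.5]
```

This is the same latent-variable interpretation described earlier: the probit model assumes standard normal errors on the latent variable.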
STEP TWO: Predicting the test set
# Predicting the Test set results
y_pred = regressor.predict(X_test)
STEP THREE: Validation score
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
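confusion_matrix simply tabulates actual labels against predicted labels. A tiny self-contained example with made-up labels shows the layout (rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels just to show the layout
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]
```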
RESULT
R
    FALSE  TRUE
  0    57     7
  1    10    26
PYTHON
        0     1
  0    65     3
  1     8    24
The tables above are confusion matrices, which give the numbers of true and false predictions (and hence the type I and type II errors of each model). The R model made 83% correct predictions, while the Python model’s correct prediction rate is 89%.
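The accuracy figures follow from the matrices: correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total. A quick check using the numbers reported above:

```python
import numpy as np

# Confusion matrices reported above (rows: actual, columns: predicted)
cm_r      = np.array([[57,  7],
                      [10, 26]])   # R (logit)
cm_python = np.array([[65,  3],
                      [ 8, 24]])   # Python (sklearn logistic regression)

def accuracy(cm):
    # Correct predictions sit on the diagonal
    return np.trace(cm) / cm.sum()

print(accuracy(cm_r))       # 0.83
print(accuracy(cm_python))  # 0.89
```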
CONCLUSION
Despite following different methods, both Logit and Probit regression come up with similar results. As the number of observations increases, the accuracy generally improves.
APPLICATIONS
- To predict customer retention (yes or no)
- To model the pattern of customers buying a product
- To detect whether a credit card transaction is fraudulent
- To predict whether a customer will default on a loan
- To predict whether a given user will buy an insurance product
- To predict whether viewers will click an advertisement
Avinash Reddy
In Python, the sklearn module ‘cross_validation’ is deprecated; ‘model_selection’ has to be used instead.
As of sklearn 0.18 it is deprecated, but in earlier versions it still works.