# REGRESSION FOR BUSINESS, USING R & PYTHON: PART II – DATA

This article continues my previous article, REGRESSION FOR BUSINESS, USING R & PYTHON. Here I explain data and data pre-processing, since data is the foundation of any regression model.

W. E. Demling

### WHAT DOES DATA MEAN?

Data is a collection of values that are quantitative, qualitative, or a mix of both. Quantitative data can be measured numerically, whereas qualitative data is observed and described with labels or categories rather than measured as a number.

### HOW IS DATA GENERATED?

Every business produces a large volume of data through its operations, sales, financial activities, workforce and so on. This data then has to be refined and selected according to the aim of the research.

### WHY PRE-PROCESS THE DATA AND HOW TO?

Data collected from business activities is known as raw data. It typically contains outliers (abnormal values) and other irregularities, which distort the results of the analysis and reduce its accuracy. To avoid these effects, we pre-process the data before analyzing it by cleaning and adjusting the variables. The cleaned data is called cooked data.

I use a small sample dataset to explain the pre-processing steps. In the following dataset, ‘Purchased’ is the dependent variable and the other three variables (Country, Age, and Salary) are the independent variables.

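| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |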
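The Python snippets in this article work on arrays named `X` and `y` without showing where they come from. As a minimal sketch, assuming the sample table is typed in by hand with pandas (the way the data is actually loaded is not shown in this article, so this construction is an assumption), they could be built like this:

```python
import numpy as np
import pandas as pd

# Sample data from the table above; np.nan marks the two missing cells
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# X holds the independent variables, y the dependent variable
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Count the gaps in each column
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

Here `X` is an object array with Country, Age, and Salary in columns 0 to 2, which is the layout the later Python snippets appear to expect.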

### DATA REDUCTION

Data reduction is needed when the data contains irrelevant variables or values. It can be done by deleting variables (columns) or extraneous observations (rows) with basic commands in R and Python. Random sampling can also be used to scale the data down.
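The three reduction steps can be sketched with pandas. The frame below is hypothetical: the `CustomerID` column and the age cut-off are invented purely to have something irrelevant to remove.

```python
import pandas as pd

# Hypothetical data: 'CustomerID' stands in for an irrelevant variable
data = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105, 106],
    "Age": [44, 27, 30, 38, 35, 48],
    "Salary": [72000, 48000, 54000, 61000, 58000, 79000],
})

# Delete a variable (column) that is irrelevant to the aim of the research
reduced = data.drop(columns=["CustomerID"])

# Delete extraneous observations (rows); here, anyone younger than 30
reduced = reduced[reduced["Age"] >= 30]

# Random sampling: keep three of the remaining rows, reproducibly
sampled = reduced.sample(n=3, random_state=0)
print(sampled)
```

Which columns and rows count as irrelevant depends entirely on the research question, so the filters above are placeholders rather than rules.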

### MISSING DATA

Missing data are gaps in the observations, usually caused by a lack of response from the source. A gap can be filled with the mean, median, or mode of the remaining values of that variable.
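To see how the three choices differ, here is a small pandas sketch on the Age column of the sample dataset. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], name="Age")

# Each statistic is computed from the nine observed values only
filled_mean = ages.fillna(ages.mean())
filled_median = ages.fillna(ages.median())
# mode() can return several values; take the first (smallest) one
filled_mode = ages.fillna(ages.mode().iloc[0])

print(round(filled_mean[6], 1))  # 38.8
print(filled_median[6])          # 38.0
```

The mean and median are close here, whereas the mode is not very informative for a variable in which every value occurs once; it tends to be more useful for categorical variables.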

#### R CODE

```r
# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
```

Here ‘ifelse’ is a vectorized conditional rather than a loop: for each value it checks the condition and returns one of two results. ‘is.na’ finds the missing values, and ‘ave’ returns the column mean (computed with ‘na.rm = TRUE’ so that the gaps are ignored), which is used to fill them.

#### PYTHON

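```python
# Taking care of missing data
# X is assumed to be a NumPy array with Country, Age, Salary in columns 0 to 2
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
```

Here ‘SimpleImputer’ is the class that deals with missing data. It lives in the ‘sklearn.impute’ module of the ‘sklearn’ (scikit-learn) library and replaces the older ‘Imputer’ class from ‘sklearn.preprocessing’, which recent scikit-learn releases no longer include. ‘fit_transform’ computes the column means and fills the gaps in one step.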

#### RESULT

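| Country | Age  | Salary | Purchased |
|---------|------|--------|-----------|
| France  | 44   | 72000  | No        |
| Spain   | 27   | 48000  | Yes       |
| Germany | 30   | 54000  | No        |
| Spain   | 38   | 61000  | No        |
| Germany | 40   | 63777  | Yes       |
| France  | 35   | 58000  | Yes       |
| Spain   | 38.8 | 52000  | No        |
| France  | 48   | 79000  | Yes       |
| Germany | 50   | 83000  | No        |
| France  | 37   | 67000  | Yes       |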

### CATEGORICAL DATA

Categorical variables cannot be entered into a regression equation as text, because most machine-learning algorithms cannot read label data directly. They therefore have to be assigned numeric values.

For example, when we have three categories of countries namely India, Australia, and USA we assign numeric values to them like India as 1, Australia as 2, and the USA as 3. Here it does not imply that one country is greater than the other. It is just assigning arbitrary numerical values to general categorical data.
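The country example can be sketched with a plain mapping in pandas. The codes 1, 2, and 3 are the arbitrary ones from the text; the short series of countries is made up for illustration.

```python
import pandas as pd

countries = pd.Series(["India", "Australia", "USA", "India", "USA"])

# Arbitrary codes taken from the example in the text
codes = {"India": 1, "Australia": 2, "USA": 3}
encoded = countries.map(codes)

print(encoded.tolist())  # [1, 2, 3, 1, 3]
```

The numbers are labels only. A model that treats them as quantities would read USA as "three times" India, which is why dummy variables are often used instead for categories that have no natural order.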

#### R CODE

```r
# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
```

The ‘factor’ function converts a categorical variable into a factor whose levels carry the arbitrary numeric labels given in ‘labels’.

#### RESULT

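| Country | Age  | Salary | Purchased |
|---------|------|--------|-----------|
| 1       | 44   | 72000  | 0         |
| 2       | 27   | 48000  | 1         |
| 3       | 30   | 54000  | 0         |
| 2       | 38   | 61000  | 0         |
| 3       | 40   | 63777  | 1         |
| 1       | 35   | 58000  | 1         |
| 2       | 38.8 | 52000  | 0         |
| 1       | 48   | 79000  | 1         |
| 3       | 50   | 83000  | 0         |
| 1       | 37   | 67000  | 1         |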
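#### PYTHON

```python
# Encoding categorical data
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding the independent variable: one-hot encode column 0 (Country)
# and pass Age and Salary through unchanged
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = np.array(ct.fit_transform(X), dtype=float)

# Encoding the dependent variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
```

Here the ‘LabelEncoder’ class turns the labels of the dependent variable into 0 and 1. Since one country is not greater than another, ‘OneHotEncoder’ is used to introduce dummy variables, each of which is 1 when the row belongs to that country and 0 otherwise. ‘ColumnTransformer’ applies the encoder to column 0 only; the ‘categorical_features’ argument that ‘OneHotEncoder’ once accepted for this purpose is no longer available in recent scikit-learn releases.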
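To see what dummy variables look like in isolation, here is a short sketch using `pandas.get_dummies`, a different tool from the scikit-learn encoder used in this article. The four-row series is illustrative.

```python
import pandas as pd

country = pd.Series(["France", "Spain", "Germany", "Spain"], name="Country")

# One 0/1 column per category, in alphabetical order
dummies = pd.get_dummies(country, dtype=int)

print(dummies.columns.tolist())  # ['France', 'Germany', 'Spain']
print(dummies.iloc[0].tolist())  # [1, 0, 0]
```

Each row contains exactly one 1, marking the country it belongs to.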

#### RESULT

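| France | Germany | Spain | Age  | Salary  | Purchased |
|--------|---------|-------|------|---------|-----------|
| 1      | 0       | 0     | 44   | 72000   | 0         |
| 0      | 0       | 1     | 27   | 48000   | 1         |
| 0      | 1       | 0     | 30   | 54000   | 0         |
| 0      | 0       | 1     | 38   | 61000   | 0         |
| 0      | 1       | 0     | 40   | 63777.8 | 1         |
| 1      | 0       | 0     | 35   | 58000   | 1         |
| 0      | 0       | 1     | 38.8 | 52000   | 0         |
| 1      | 0       | 0     | 48   | 79000   | 1         |
| 0      | 1       | 0     | 50   | 83000   | 0         |
| 1      | 0       | 0     | 37   | 67000   | 1         |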

### SPLITTING OF DATA

The data has to be split into two sets: a training set for fitting the model and a test set for testing it. Keeping a separate test set reduces the effect of data discrepancies on our assessment and helps us understand the characteristics of the model better. A common choice is 80% of the data for the training set and the remaining 20% for the test set, but the ratio can vary according to preference.
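The idea behind an 80/20 split can be sketched by hand with NumPy: shuffle the row numbers, then cut the shuffled list in two. This is only an illustration of the principle, not what the library functions do internally; the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the split is repeatable
n_rows = 10                         # the sample dataset has ten observations
shuffled = rng.permutation(n_rows)  # row numbers 0..9 in random order

n_test = int(round(0.2 * n_rows))   # 20% of the rows go to the test set
test_idx = np.sort(shuffled[:n_test])
train_idx = np.sort(shuffled[n_test:])

print(len(train_idx), len(test_idx))  # 8 2
```

Every row lands in exactly one of the two sets, so no observation is used for both fitting and testing.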

#### R CODE

```r
# Splitting the dataset into the training set and test set
# install.packages('caTools')  # run once if the package is not installed
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```

The ‘caTools’ library contains the ‘sample.split’ function, which returns a TRUE/FALSE vector marking each row for the training set (TRUE) or the test set (FALSE). ‘subset’ then picks the rows for each set.

#### RESULT

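##### Training Set

| Country | Age  | Salary  | Purchased |
|---------|------|---------|-----------|
| 2       | 27   | 48000   | 1         |
| 3       | 30   | 54000   | 0         |
| 2       | 38   | 61000   | 0         |
| 3       | 40   | 63777.8 | 1         |
| 1       | 35   | 58000   | 1         |
| 2       | 38.8 | 52000   | 0         |
| 1       | 48   | 79000   | 1         |
| 3       | 50   | 83000   | 0         |

##### Test Set

| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| 1       | 44  | 72000  | 0         |
| 1       | 37  | 67000  | 1         |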

#### PYTHON

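```python
# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

The ‘train_test_split’ function splits the data into training and test sets. It is imported from ‘sklearn.model_selection’; the older ‘sklearn.cross_validation’ module is no longer part of recent scikit-learn releases.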

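##### Training Set

X Training: France, Germany, Spain, Age, Salary. Y Training: Purchased.

| France | Germany | Spain | Age  | Salary  | Purchased |
|--------|---------|-------|------|---------|-----------|
| 0      | 0       | 1     | 27   | 48000   | 1         |
| 0      | 1       | 0     | 30   | 54000   | 0         |
| 0      | 0       | 1     | 38   | 61000   | 0         |
| 0      | 1       | 0     | 40   | 63777.8 | 1         |
| 1      | 0       | 0     | 35   | 58000   | 1         |
| 0      | 0       | 1     | 38.8 | 52000   | 0         |
| 1      | 0       | 0     | 48   | 79000   | 1         |
| 1      | 0       | 0     | 37   | 67000   | 1         |

##### Test Set

X Test: France, Germany, Spain, Age, Salary. Y Test: Purchased.

| France | Germany | Spain | Age | Salary | Purchased |
|--------|---------|-------|-----|--------|-----------|
| 1      | 0       | 0     | 44  | 72000  | 0         |
| 0      | 1       | 0     | 50  | 83000  | 0         |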

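### FEATURE SCALING

Feature scaling rescales the values of the variables so that they fall on a common scale. It is often loosely called normalization. The aim is to reduce the effect of measuring units and scales on the analysis. The code in this section standardizes each variable, so that it is centered on 0 with a standard deviation of 1.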
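As a minimal sketch of one common form of scaling, standardization subtracts the mean of a variable and divides by its standard deviation. The arrays below are the eight complete Age and Salary pairs from the sample dataset; the `standardize` helper is a hypothetical name introduced for this example.

```python
import numpy as np

age = np.array([44, 27, 30, 38, 35, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, 58000, 79000, 83000, 67000],
                  dtype=float)

def standardize(x):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    # x.std() is the population standard deviation (divides by n)
    return (x - x.mean()) / x.std()

age_z = standardize(age)
salary_z = standardize(salary)

# Both variables now have mean 0 and standard deviation 1,
# although they started on very different scales
print(age_z.round(2))
print(salary_z.round(2))
```

NumPy's `std` divides by n by default, whereas R's `scale` divides by n - 1, so the two give slightly different numbers on small samples.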

#### R CODE

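```r
# Feature Scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
```

The ‘scale’ function standardizes the selected columns. Scaling is applied to the numeric independent variables only (Age and Salary, columns 2 and 3); Country and Purchased are factors and are left unchanged. Note that the test set is scaled here with its own mean and standard deviation, not with those of the training set.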

#### RESULT

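##### Training Set

| Country | Age         | Salary     | Purchased |
|---------|-------------|------------|-----------|
| 2       | -1.42857869 | -1.1397581 | 1         |
| 3       | -1.05088836 | -0.6631119 | 0         |
| 2       | -0.04371416 | -0.1070247 | 0         |
| 3       | 0.20807939  | 0.1136448  | 1         |
| 1       | -0.42140448 | -0.3453478 | 1         |
| 2       | 0.05420556  | -0.821994  | 0         |
| 1       | 1.2152536   | 1.3229138  | 1         |
| 3       | 1.46704715  | 1.640678   | 0         |

##### Test Set

| Country | Age        | Salary     | Purchased |
|---------|------------|------------|-----------|
| 1       | 0.7071068  | 0.7071068  | 0         |
| 1       | -0.7071068 | -0.7071068 | 1         |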
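The identical ±0.7071068 values in the test set come from scaling two rows with their own mean and standard deviation: any pair of distinct values standardized this way gives the same result. The NumPy sketch below reproduces that for the Age column and, as an alternative that is not used in the R code above, scales the test rows with the training statistics instead.

```python
import numpy as np

# Age values from the R training and test sets shown above
train_age = np.array([27, 30, 38, 40, 35, 38.8, 48, 50])
test_age = np.array([44.0, 37.0])

# Scaling the two test rows with their own mean and sample standard
# deviation (ddof=1, as R's scale() does) always gives +/-0.7071
own = (test_age - test_age.mean()) / test_age.std(ddof=1)

# Alternative: reuse the training mean and standard deviation
reused = (test_age - train_age.mean()) / train_age.std(ddof=1)

print(np.round(own, 4))  # [ 0.7071 -0.7071]
print(np.round(reused, 4))
```

With the training statistics, the test values keep their position relative to the training data, which is what the Python code in the next section does with ‘transform’.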

#### PYTHON

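```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
# Scale only Age and Salary (columns 3 and 4); the dummy columns stay as 0/1
X_train[:, 3:] = sc_X.fit_transform(X_train[:, 3:])
# Reuse the training mean and standard deviation for the test set
X_test[:, 3:] = sc_X.transform(X_test[:, 3:])
# The dependent variable is a 0/1 label, so it is left unscaled
```

Here the ‘StandardScaler’ class is used for scaling the variables. ‘fit_transform’ learns the mean and standard deviation from the training set and scales it in one step; ‘transform’ then applies the same values to the test set. Only Age and Salary are scaled: the dummy columns and the 0/1 dependent variable are left as they are, since scaling is meant for the numeric independent variables alone.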

#### RESULT

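##### Training Set

X Training: France, Germany, Spain, Age, Salary. Y Training: Purchased.

| France | Germany | Spain | Age     | Salary     | Purchased |
|--------|---------|-------|---------|------------|-----------|
| 0      | 0       | 1     | -1.4286 | -1.1397581 | 1         |
| 0      | 1       | 0     | -1.0509 | -0.6631119 | 0         |
| 0      | 0       | 1     | -0.0437 | -0.1070247 | 0         |
| 0      | 1       | 0     | 0.2081  | 0.1136448  | 1         |
| 1      | 0       | 0     | -0.4214 | -0.3453478 | 1         |
| 0      | 0       | 1     | 0.0542  | -0.821994  | 0         |
| 1      | 0       | 0     | 1.2153  | 1.3229138  | 1         |
| 1      | 0       | 0     | 1.467   | 1.640678   | 1         |

##### Test Set

X Test: France, Germany, Spain, Age, Salary. Y Test: Purchased.

| France | Germany | Spain | Age   | Salary  | Purchased |
|--------|---------|-------|-------|---------|-----------|
| 1      | 0       | 0     | 0.71  | 0.70711 | 0         |
| 0      | 1       | 0     | -0.71 | -0.7071 | 0         |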

In upcoming articles, I will explain each regression model in detail, using machine learning in R and Python. Stay tuned. Thank you for reading.

– Avinash Reddy