Artificial Neural Network for Sentiment Analysis using Keras & Tensorflow in Python

With social media usage at an all-time high, businesses rely on digital marketing and follow trends in the digital world, since it is the fastest and most effective channel for customers to express themselves. Every company’s ultimate goal is customer satisfaction, which can be measured through customers’ feedback and reviews about new products, advertisements and so on. SENTIMENT ANALYSIS makes it straightforward for a company to understand the opinion/reaction of the customer. The question, then, is: how do we automate the classification of reviews/comments into ‘Positive’ or ‘Negative’?

DATASET:

A Twitter dataset containing tweets about UNITED AIRLINES has been selected; it is a subset of Kaggle’s Twitter data on US airlines. It has 3125 observations and 2 variables. The first five observations are shown below:

airline_sentiment | text
positive | Yes!! Thanks so much!!! 💜“@united: @whitterbug We see you spoke with our Reservations team and they’ve reinstated the flight. Thanks. ^EY”
negative | Why even ask me to DM you and offer help if you “can’t do anything” @united #terriblecustomerservice #unitedairlines http://t.co/feC4i3Vwq7
negative | we all watched as the crew was escorted from the back to 1st class. Prebooked? Really? @united: @JenniferWalshPR Your dissatisfaction is und
negative | Wanted to get my bag benefit, but instead get $25 pricing on all three tickets. When adding a card, MP Visa is only option. @united
positive | Very quick! TY. @united: @auciello I am sorry to hear this. Can you please follow and DM me the details of what transpired? ^JH

SENTIMENT ANALYSIS

IMPORT DATA

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Tweets_united.csv')
Result:

The required CSV file has been imported as ‘dataset’.

DATA CLEANING

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()                          # create the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))  # build the stopword set once, outside the loop
corpus = []
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    # drop every occurrence of 'united' (it appears in all tweets)
    review = [word for word in review if word != 'united']
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

Here the ‘for’ loop selects each comment (observation) for cleaning. The ‘re.sub’ function replaces every character that is not a letter (‘A–Z’ or ‘a–z’) with a space.

‘.lower()’ converts all letters to lower case, and ‘.split()’ breaks each sentence into a list of words. For this dataset the word ‘united’ is removed, since it appears in every observation.

‘PorterStemmer’ is a class for stemming, i.e. reducing words to their roots; for example, the word ‘playing’ becomes ‘play’, since play is the root. Finally, the list of words is joined back into a sentence with ‘.join’.
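To see these steps in isolation, here is a minimal sketch of the same pipeline applied to a single tweet. To keep it self-contained it uses a tiny hard-coded stopword list in place of NLTK’s full English list, and it skips the stemming step:

```python
import re

# Toy stopword list standing in for nltk's stopwords.words('english')
stop_words = {'you', 'for', 'your'}

tweet = "Thank you @united for your prompt assistance"

words = re.sub('[^a-zA-Z]', ' ', tweet)  # replace non-letters with spaces
words = words.lower().split()            # lower-case and split into words
words = [w for w in words if w != 'united']        # drop the ever-present 'united'
words = [w for w in words if w not in stop_words]  # drop stopwords

print(words)  # ['thank', 'prompt', 'assistance']
```

With the stemming step included, ‘assistance’ would further be reduced to its root ‘assist’, matching the result shown below.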

Result:

The 13th observation, ‘Thank you @united for your prompt assistance’, is reduced to ‘thank prompt assist’.

CREATE A BAG OF WORDS AND AN ARRAY

Create an array with each word as a column, thereby treating each word as an independent variable.

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,0].values
Result:

An array with 3125 observations and 1500 variables is created.

DATA PREPROCESSING

Encode the dependent variable (y), since it has two categories; split X and y into training and test sets; and finally scale the independent variables (X) for the ANN.

# Encoding the Dependent Variable
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Result:

A scaled training set with 2500 observations and a test set with 625 observations are created, with the labels encoded as 0/1.
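LabelEncoder assigns integers to the classes in alphabetical order, so ‘negative’ becomes 0 and ‘positive’ becomes 1. A small sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['positive', 'negative', 'negative', 'positive']

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # ['negative' 'positive']  ->  negative = 0, positive = 1
print(encoded)      # [1 0 0 1]
```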

 

ARTIFICIAL NEURAL NETWORKS:

 

An artificial neural network is a sequence of layers of nodes (neurons) that are connected to each other. As training repeats, the machine learns by adjusting the weights, thereby increasing the accuracy.
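The core computation can be sketched with plain NumPy: each layer multiplies its inputs by a weight matrix, adds a bias, and passes the result through an activation function. The tiny weights below are made up purely for illustration; in a real network they would be learned during training:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One input with 3 features, a hidden layer of 2 units, one output unit.
x  = np.array([1.0, 0.5, -1.0])
W1 = np.array([[ 0.2, -0.3],
               [ 0.4,  0.1],
               [-0.5,  0.2]])        # 3 inputs -> 2 hidden units
b1 = np.array([0.1, 0.0])
W2 = np.array([[0.7], [-0.6]])       # 2 hidden units -> 1 output
b2 = np.array([0.0])

h = relu(x @ W1 + b1)                # hidden activations
y = sigmoid(h @ W2 + b2)             # output: a probability in (0, 1)
print(h, y)
```

Training consists of repeating this forward pass over the data and nudging W1, b1, W2, b2 in the direction that reduces the loss.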

CREATE LAYERS FOR ANN

Initially, the input layer and the first hidden layer are created; then further hidden layers and the output layer are attached to the network. This model is then fitted to the X and y training sets.

# Importing the Keras libraries and packages
import tensorflow
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 750, kernel_initializer = 'uniform', activation = 'relu', input_dim = 1500))

# Adding the second hidden layer
classifier.add(Dense(units = 750, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer (sigmoid, since there is a single binary output)
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

Here 1500 variables are taken as inputs and the first layer has 750 units; the final output layer has a single unit. ‘relu’ is the rectifier activation function, and ‘sigmoid’ squashes the output to a probability between 0 and 1 (a softmax over a single unit would always output 1, so sigmoid is the right choice for a one-unit binary output). ‘.compile’ configures how the weights are adjusted, and the loss is set to ‘binary_crossentropy’ for binary classification. While fitting, ‘batch_size’ sets the number of samples processed at a time and ‘epochs’ is the number of times the entire dataset is passed through for adjusting the weights.

Result:

The artificial neural network is created and fitted to the training set by adjusting its weights.

PREDICTION AND ACCURACY CHECK

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Result:

The calculated accuracy of the model is ~89%, since 557 out of the 625 test observations are correctly predicted.

                    Predicted Negative   Predicted Positive
Actual Negative            515                   21
Actual Positive             47                   42
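The accuracy can be read off the confusion matrix directly: correct predictions sit on the diagonal. Using the numbers above:

```python
import numpy as np

cm = np.array([[515, 21],    # actual negative: 515 correct, 21 misclassified
               [ 47, 42]])   # actual positive: 47 misclassified, 42 correct

accuracy = np.trace(cm) / cm.sum()   # (515 + 42) / 625
print(accuracy)  # 0.8912
```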

CONCLUSION

The model’s accuracy can be improved by using more observations, by tuning parameters such as ‘batch_size’, and by a better text-cleaning process.

 

-Avinash Reddy
