There are many frameworks for implementing deep learning in R and Python, and libraries like TensorFlow or Keras can prove quite challenging for beginners. H2O in R, by contrast, is easy to use and has some nifty features –
- It’s open source
- Supports R, Python and Flow
- Supports multiple hidden layers
- It’s quite fast
Below is a small implementation of H2O on the MNIST data-set (http://yann.lecun.com/exdb/mnist/).
The MNIST dataset
The MNIST data-set is a database of handwritten digits. I have used Kaggle’s training and test sets of MNIST. The training set has 42,000 records, each with a label indicating the digit. The test set has 28,000 records.
Let’s start by initializing H2O and reading in the data-set.
train <- read.csv(file.choose(), header = TRUE)
test  <- read.csv(file.choose(), header = TRUE)

library(h2o)
localH2O <- h2o.init(ip = "localhost", port = 54321, min_mem_size = "6g")  # an H2O cluster with 6 GB of memory

train$label <- as.factor(train$label)  # treat the digit label as a class, not a number
train.h2o <- as.h2o(train)
Now we build the model. The model takes an input of 784 nodes (one per pixel). It has 4 hidden layers with 500 neurons each. The activation is ReLU (with dropout) and the final output layer is a softmax layer. I have also put in some dropout to guard against overfitting: each hidden layer has a dropout ratio of 15% and the input layer has a dropout ratio of 25%. Finally, the model is trained for 100 passes over the data (epochs = 100).
dl <- h2o.deeplearning(
  x = 2:785,                          # pixel columns
  y = 1,                              # label column
  training_frame = train.h2o,
  activation = "RectifierWithDropout",
  input_dropout_ratio = 0.25,
  hidden_dropout_ratios = c(0.15, 0.15, 0.15, 0.15),
  balance_classes = TRUE,
  hidden = c(500, 500, 500, 500),     # four hidden layers of 500 neurons each
  nfolds = 5,
  fold_assignment = "Modulo",
  keep_cross_validation_predictions = TRUE,
  epochs = 100
)
After the model has trained, it’s time to make predictions.
test.h2o <- as.h2o(test)
pred <- h2o.predict(dl, test.h2o)

# Build the Kaggle submission: one row per test image, with the predicted label
predictions <- data.frame(
  ImageId = 1:28000,
  Label = as.data.frame(pred)[, 1]
)
write.csv(predictions, file = "DNN_pred.csv", row.names = FALSE)
This particular model scored 97.432% on the Kaggle leaderboard. This can be improved by tuning the hyperparameters, and by stacking models if necessary.
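As a starting point for that tuning, here is a minimal sketch of a grid search using H2O's h2o.grid function. It assumes the train.h2o frame from the steps above; the hyperparameter values are illustrative, not tuned recommendations.

```r
library(h2o)

# Hypothetical search space: dropout ratios and layer shapes are illustrative.
hyper_params <- list(
  input_dropout_ratio = c(0.1, 0.25),
  hidden = list(c(500, 500, 500, 500), c(1024, 512, 256))
)

grid <- h2o.grid(
  algorithm = "deeplearning",
  grid_id = "dl_grid",
  x = 2:785, y = 1,
  training_frame = train.h2o,        # assumes the frame built earlier
  activation = "RectifierWithDropout",
  nfolds = 5,
  epochs = 20,                       # fewer epochs while searching, to save time
  hyper_params = hyper_params
)

# Rank the trained models by cross-validated error and keep the best one
sorted <- h2o.getGrid("dl_grid", sort_by = "err", decreasing = FALSE)
best   <- h2o.getModel(sorted@model_ids[[1]])
```

The best model found this way can then be retrained with the full 100 epochs, or combined with other models via h2o.stackedEnsemble.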