Introduction

My first contact with Natural Language Processing (NLP) came in a Kaggle competition whose goal was to classify Wikipedia comments as toxic or non-toxic, insulting or non-insulting, and so on. Specifically, the problem is one of multilabel classification: there is more than one outcome label, and each comment can be assigned an arbitrary number of them. In this challenge, labels included “toxic”, “threat”, and “insult”, and a comment classified as an insult often also belongs to the toxic class.
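To make that concrete, here is a tiny, made-up illustration of what such a multilabel target looks like (the real competition has more labels and, of course, far more comments):

```python
import pandas as pd

# Made-up toy data: each comment gets its own 0/1 flag per label,
# and a single comment can carry several labels at once.
toy = pd.DataFrame({
    "comment_text": ["you are an idiot", "have a nice day", "i will find you"],
    "toxic":  [1, 0, 1],
    "insult": [1, 0, 0],
    "threat": [0, 0, 1],
})

print(toy)
```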

I started with a separate logistic regression for each label, which already yields a pretty good score. But because the input is natural-language text, a neural network is the more appropriate tool. Current state-of-the-art models focus on a good word embedding mechanism and an appropriate recurrent neural network (RNN) architecture, and the discussion forum showed that most other contestants chose the same route.
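As a rough sketch of that baseline (file name, feature sizes and hyperparameters are illustrative, not the exact code from my repository): TF-IDF features plus one independent logistic regression per label.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

labels = ["toxic", "insult", "threat"]          # subset of the competition's labels
train = pd.read_csv("train.csv")                # assumed file name

X_text_tr, X_text_va, y_tr, y_va = train_test_split(
    train["comment_text"], train[labels], test_size=0.2, random_state=42
)

# Bag-of-words style features: TF-IDF over word unigrams.
vectorizer = TfidfVectorizer(max_features=50000)
X_tr = vectorizer.fit_transform(X_text_tr)
X_va = vectorizer.transform(X_text_va)

# One independent binary classifier per label.
for label in labels:
    clf = LogisticRegression(C=1.0, solver="liblinear")
    clf.fit(X_tr, y_tr[label])
    pred = clf.predict_proba(X_va)[:, 1]
    print(label, roc_auc_score(y_va[label], pred))
```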

Outline

To be able to implement a first RNN, I needed to collect and organize a bunch of information. This post outlines my journey. This is not an elaborate tutorial, but a short summary of the most important points and a collection of resources for further study.

The rough outline of this post is:

  1. Extensions to neural networks
  2. Software implementations
  3. Training a small RNN locally
  4. Training the full RNN on AWS

1. Extensions to neural networks

  • The best introduction to “standard” neural networks I’ve come across is Andrew Ng’s MOOC on Machine Learning. There, you learn the basics of how a network operates and even code up your own implementation of gradient descent to estimate its parameters.
  • Then, there is an entire deep learning specialization on Coursera, which is an excellent follow-up. Both courses take quite some time, but the insights are well worth it!
  • Deep neural networks are simply neural networks with more than one (usually many) hidden layers. The more interesting special cases extend the basic neural network in two directions:
    • Convolutional neural networks (CNNs) transform and subset the input space (e.g. the pixels of an image) into many smaller sets (e.g. small squares of the image). They are good tools for tasks like object detection and image classification, since this makes it irrelevant whether the cat sits in the upper left or lower right corner of the image. One example is Microsoft’s ResNet-50, a CNN variant that uses a technique called residual learning, which makes it possible to stack 100 or more layers. CNNs can also be used for some uncommon but aesthetic tasks, such as recreating a photo in the learned style of an artist (a process called neural style transfer).
    • Recurrent neural networks (RNNs) feed their output back in as input to a copy of themselves at the next step, so the same network is applied repeatedly along a sequence. A special class of RNNs called LSTMs are good models for natural language processing, e.g. classifying comments as toxic or insulting (a minimal sketch follows this list). They can be used for fun stuff too, e.g. generating artificial Shakespeare text or even Linux source code (shown in this blog post, but of course it’s not functional code).
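To give a feel for the RNN route, here is a minimal Keras sketch of an LSTM classifier for the multilabel comment task, written against the Keras 2 API that was current at the time. It assumes the comments have already been tokenized and padded to a fixed length; all names and hyperparameters are illustrative.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

MAX_WORDS, MAX_LEN, N_LABELS = 20000, 100, 3   # illustrative sizes

model = Sequential([
    # Learn a dense vector for each of the MAX_WORDS most frequent tokens.
    Embedding(input_dim=MAX_WORDS, output_dim=128, input_length=MAX_LEN),
    # The LSTM reads the sequence of word vectors and keeps a running state.
    LSTM(64),
    # One sigmoid output per label: multilabel, not multiclass.
    Dense(N_LABELS, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```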

2. Software implementations

The multitude of deep learning frameworks can be intimidating in the beginning. After a brief review, I chose to work with Python and Keras (on top of TensorFlow). Keras seems to be the most minimalist and flexible framework, allowing you to quickly try out different network architectures and inspect their effects on your predictions. You can check some sample code for yourself - this is my GitHub repository for the Kaggle competition described above.
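One small illustration of that flexibility: swapping out the recurrent core of a model is essentially a one-line change. The snippet below is a hedged sketch, not the exact architecture from my repository.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Bidirectional, Dense

def build_model(recurrent_layer, max_words=20000, max_len=100, n_labels=3):
    """Same skeleton, different recurrent core."""
    return Sequential([
        Embedding(input_dim=max_words, output_dim=128, input_length=max_len),
        recurrent_layer,
        Dense(n_labels, activation="sigmoid"),
    ])

# Trying out three variants is just three calls:
lstm_model   = build_model(LSTM(64))
gru_model    = build_model(GRU(64))
bilstm_model = build_model(Bidirectional(LSTM(64)))
```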

3. Training a small RNN locally

It doesn’t make sense to fire up an AWS instance right away.

First, you should subset your training data to a very small slice, and develop the data processing pipeline locally. This includes reading/writing the raw data, preprocessing (from within Python, in my case), and preparing a first very basic “placeholder” model, just to verify the pipeline works correctly.
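A hedged sketch of what that local pipeline can look like (file name, slice size and preprocessing choices are placeholders):

```python
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

LABELS = ["toxic", "insult", "threat"]

# 1. Read only a small slice of the raw data while developing.
train = pd.read_csv("train.csv").head(2000)

# 2. Preprocess: fit a tokenizer and pad all comments to the same length.
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(train["comment_text"])
X = pad_sequences(tokenizer.texts_to_sequences(train["comment_text"]), maxlen=100)
y = train[LABELS].values

# 3. A deliberately tiny "placeholder" model, just to prove the pipeline runs end to end.
model = Sequential([
    Embedding(input_dim=20000, output_dim=16, input_length=100),
    GlobalAveragePooling1D(),
    Dense(len(LABELS), activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=32, epochs=1, validation_split=0.1)
```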

In my case, I developed a logistic regression script as well as a first RNN locally, and switched over to AWS only after both worked well on a small slice of the data.

4. Training the full RNN on AWS

As soon as the processing pipeline works well locally, it’s time to start up an AWS instance and move all the data and processing there. I recommend starting with a less expensive instance, for example a t2.large machine without GPU support and only moderate RAM. That way, the time-consuming first setup steps cost you a bit less.

You’ll have to create a storage volume, attach it to your instance, upload your local data, and get your scripts up there as well. I spawned and killed close to 20 instances in my first days, so I quickly set up a GitHub repo for my AWS scripts that contains a step-by-step manual as well as the shell scripts needed to quickly configure a fresh AWS instance to serve a Jupyter Notebook.
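If you prefer scripting over clicking through the console, the launch and attach steps can also be driven from Python with boto3. The sketch below is an assumption-laden outline (region, AMI ID, key pair name, zone and device name are placeholders you have to fill in); copying the data itself is typically done separately with scp or rsync.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")   # region is an assumption

# Launch a cheap CPU instance for the initial setup work.
reservation = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder: pick a current Deep Learning AMI
    InstanceType="t2.large",
    KeyName="my-key-pair",         # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
instance_id = reservation["Instances"][0]["InstanceId"]

# Create an EBS volume for the data and attach it to the instance
# (the instance has to reach the "running" state before attaching).
volume = ec2.create_volume(AvailabilityZone="eu-central-1a", Size=100, VolumeType="gp2")
ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId=instance_id, Device="/dev/sdf")
```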

Training on a GPU instead of CPU

The switch from CPU to GPU can increase your training speed considerably - common speed-up factors I’ve seen range from 80x to 100x.

Unfortunately, a standard AWS account doesn’t allow you to start a GPU-enabled instance. You must first request a limit increase via a support ticket. This requires human intervention and takes a few days (two days in my case).

Once that is done, however, you can start up such an instance. The smallest GPU instance as of this writing is the p2.xlarge, with one GPU and 61 GiB of RAM.

I’ve published a script on GitHub that sets up a newly started p2.xlarge instance with the necessary software and starts a Jupyter Notebook that you can access from your local browser. From there, most things proceed exactly as before.

The only difference is that you should increase the batch size in your model training to make use of the large number of cores on a GPU. I increased mine from 512 to 4096, for example; at a batch size of 2048, I still wasn’t using all of the GPU’s capacity, which meant wasted resources. Also note that increasing the batch size leads to fewer parameter updates of gradient descent within one epoch, so you’ll probably need a few more epochs to get the same performance from your model.
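The effect on the number of updates is simple arithmetic; here is a small sketch (the sample count is made up):

```python
import math

n_samples = 160000            # made-up training set size

for batch_size in (512, 2048, 4096):
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, updates_per_epoch)

# With ~160k samples: 512 -> 313 updates/epoch, 4096 -> 40 updates/epoch,
# i.e. roughly 8x fewer gradient steps per epoch at the larger batch size.
# In Keras the batch size is just an argument to fit():
#   model.fit(X, y, batch_size=4096, epochs=10, validation_split=0.1)
```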