I’ve been experimenting with convolutional neural networks (CNNs) for the past few months on the CIFAR-10 dataset (object recognition). CNNs have been around since the 90s but seem to be getting more attention ever since ‘deep learning’ became a hot new buzzword.
Most of my time was spent learning the architecture and writing my own code so I could understand CNNs better. My first attempt was a CPU version, which worked correctly but was not fast enough for any serious use. CNNs with complex architectures are notoriously slow to train, which is why everyone these days uses a GPU. It wasn’t until recently that I got a CUDA version of my code up and running. To keep things simple I didn’t do any fancy optimisation. In fact, I didn’t even use shared memory, mostly due to the way I structured my algorithm. Despite that, it was about 10-11x faster than the single-threaded CPU version. But hang on, there’s already an excellent CUDA CNN code on the net, namely cuda-convnet, so why bother rolling my own? Well, because my GPU is a GTS 360M (from a circa-2010 laptop), which only supports CUDA compute capability 1.2, well below the minimum requirement of cuda-convnet. I could get a new computer but where’s the fun in that 🙂 Besides, re-inventing the wheel is a great way to learn.
As mentioned previously, I’m working with the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images, each a tiny 32×32 RGB image. I split the 50,000 training images into 40,000 for training and 10,000 for validation. The dataset has 10 categories, including dogs, cats, cars, planes …
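For anyone who wants to reproduce the split, here’s a minimal Python/NumPy sketch of loading the ‘python version’ batches and carving off the validation set. My actual code is C++/CUDA, and `DATA_DIR` is just a placeholder for wherever the extracted batches live:

```python
import os
import pickle
import numpy as np

def load_batch(path):
    # each CIFAR-10 "python version" batch file is a pickled dict of 10,000 images
    with open(path, 'rb') as f:
        d = pickle.load(f, encoding='bytes')
    x = d[b'data'].reshape(-1, 3, 32, 32).astype(np.float32)  # N x channels x 32 x 32
    y = np.array(d[b'labels'])
    return x, y

DATA_DIR = 'cifar-10-batches-py'  # placeholder: wherever the extracted batches live
xs, ys = zip(*[load_batch(os.path.join(DATA_DIR, 'data_batch_%d' % i)) for i in range(1, 6)])
x_all, y_all = np.concatenate(xs), np.concatenate(ys)

x_train, y_train = x_all[:40000], y_all[:40000]   # 40,000 for training
x_val,   y_val   = x_all[40000:], y_all[40000:]   # 10,000 held out for validation
x_test,  y_test  = load_batch(os.path.join(DATA_DIR, 'test_batch'))
```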
The images were pre-processed by subtracting the average image over the whole training set from each image, to centre the data.
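In NumPy this is a couple of lines thanks to broadcasting; continuing the loading sketch above, note that the training-set mean is the one subtracted from every split:

```python
# continuing the loading sketch above
mean_image = x_train.mean(axis=0)   # the average 3x32x32 image over the 40,000 training images
x_train = x_train - mean_image      # broadcasting subtracts it from every image
x_val   = x_val   - mean_image      # the *training* mean is reused here
x_test  = x_test  - mean_image      # ... and here
```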
The architecture I used was inspired by cuda-convnet and is as follows (there’s a forward-pass sketch after the list):
Input – 32×32 image, 3 channels
Layer 1 – 5×5 convolution filter, 32 output channels/features, Rectified Linear Unit neurons
Layer 2 – 2×2 max pool, non-overlapping
Layer 3 – 5×5 convolution filter, 32 output channels/features, Rectified Linear Unit neurons
Layer 4 – 2×2 max pool, non-overlapping
Layer 5 – 5×5 convolution filter, 64 output channels/features, Rectified Linear Unit neurons
Layer 6 – fully connected neural network hidden layer, 64 output units, Rectified Linear Unit neurons
Layer 7 – fully connected neural network hidden layer, 10 output units, linear neurons
Layer 8 – softmax, 10 outputs
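To make the dimensions concrete, below is a forward pass of this exact stack in plain NumPy. One caveat: I’m assuming ‘valid’ convolutions (no zero-padding), under which the spatial sizes work out neatly as 32 → 28 → 14 → 10 → 5 → 1. The weights here are random placeholders, not my trained values:

```python
import numpy as np

def conv_valid(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in, K, K) -> (C_out, H-K+1, W-K+1), no zero-padding
    c_out, _, k, _ = w.shape
    h, v = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.empty((c_out, h, v))
    for i in range(h):
        for j in range(v):
            out[:, i, j] = np.tensordot(w, x[:, i:i+k, j:j+k], axes=3) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    # non-overlapping 2x2 max pool
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p = lambda *shape: rng.standard_normal(shape) * 0.01  # random placeholder weights

x = rng.standard_normal((3, 32, 32))              # input: one (centred) image
h = relu(conv_valid(x, p(32, 3, 5, 5), p(32)))    # layer 1: 32 x 28 x 28
h = maxpool2(h)                                   # layer 2: 32 x 14 x 14
h = relu(conv_valid(h, p(32, 32, 5, 5), p(32)))   # layer 3: 32 x 10 x 10
h = maxpool2(h)                                   # layer 4: 32 x 5 x 5
h = relu(conv_valid(h, p(64, 32, 5, 5), p(64)))   # layer 5: 64 x 1 x 1
h = relu(p(64, 64) @ h.ravel() + p(64))           # layer 6: 64 hidden units
z = p(10, 64) @ h + p(10)                         # layer 7: 10 linear outputs
print(softmax(z))                                 # layer 8: class probabilities
```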
I trained using a mini-batch size of 128, with a learning rate of 0.001 and momentum of 0.9. At the start of each epoch (one pass through the training data) the data is randomly shuffled, and the weights are updated after every mini-batch. At around the 62nd epoch I reduced the learning rate to 0.0001. The plot below shows the validation error vs epoch.
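In pseudocode, that schedule looks roughly like the Python sketch below. `params` (a dict of the network’s weight arrays) and `backprop` (which would return a matching dict of gradients) are hypothetical placeholders standing in for the real network:

```python
import numpy as np

rng = np.random.default_rng(0)
lr, momentum, batch = 0.001, 0.9, 128
velocity = {k: np.zeros_like(w) for k, w in params.items()}  # params: placeholder weight dict

for epoch in range(85):
    if epoch == 62:
        lr = 0.0001                                  # drop the learning rate late in training
    order = rng.permutation(len(x_train))            # reshuffle once per epoch
    for start in range(0, len(order), batch):
        idx = order[start:start + batch]
        grads = backprop(params, x_train[idx], y_train[idx])  # placeholder gradient computation
        for k in params:                              # momentum SGD, one update per mini-batch
            velocity[k] = momentum * velocity[k] - lr * grads[k]
            params[k] += velocity[k]
```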
After 85 epochs the results are:
– training error 7995/40000 ~ 20%
– validation error 3156/10000 = 31.56%
– test error 3114/10000 = 31.14%
The results seemed okay until I compared them with those reported for cuda-convnet’s simplest architecture: ~8 epochs (?), 80 seconds, 26% test error. Whereas mine took a few hours and many more epochs; clearly I’m doing something wrong!!! But what? A rough back-of-the-envelope calculation, based on the timing values they reported, suggests their GPU code runs about 33x faster than mine. Which means my CUDA code and hardware suck badly.
On the plus side, I did manage to generate a cool visualisation of the layer 1 weights, which are the convolution filters it learnt. The result is typical of what you’ll find published in the literature, so I’m confident I’m doing something right.
You can see it has learnt some edge and colour filters.
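For anyone who wants to make a similar montage, here’s a short NumPy/matplotlib sketch. I’m assuming the trained layer 1 weights sit in a (32, 3, 5, 5) array called `w1` (a stand-in name), and the rescaling to [0, 1] is purely for display:

```python
import numpy as np
import matplotlib.pyplot as plt

def tile_filters(w, rows=4, cols=8):
    # w: (32, 3, 5, 5) -> one RGB image, each 5x5 filter a tile with a 1-pixel border
    w = (w - w.min()) / (w.max() - w.min())  # rescale weights into [0, 1] for display
    k = w.shape[-1]
    grid = np.ones((rows * (k + 1) + 1, cols * (k + 1) + 1, 3))
    for n in range(w.shape[0]):
        r, c = divmod(n, cols)
        y, x = 1 + r * (k + 1), 1 + c * (k + 1)
        grid[y:y + k, x:x + k] = w[n].transpose(1, 2, 0)  # C,H,W -> H,W,C
    return grid

plt.imshow(tile_filters(w1), interpolation='nearest')  # w1: trained layer 1 weights
plt.axis('off')
plt.show()
```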
One thing I really want to try is getting my hands on a newer Nvidia card and seeing how much speed-up I get without changing the code at all.
I’m not releasing any code yet because it’s very experimental and too ugly to show.