Convolutional neural network and CIFAR-10, part 2

Spent the last 2 weeks or so trying to find a bug in the code that prevented it from learning. Somehow it miraculously works now, but I haven’t been able to figure out why. The first thing I did was commit it to my private git in case I mess it up again. I’ve also ordered a new laptop to replace my non-gracefully aging Asus: a Clevo/Sager, which sports a GTX 765M. Never tried this brand before, so fingers crossed I won’t have any problems within 2 years of purchase, unlike every other laptop I’ve had …

I’ve gotten better results now by using a slightly different architecture than before. But what noticeably improved performance was augmenting the training samples with mirrored versions, effectively doubling the size of the training set. Here’s the architecture I used:

Layer 1 – 5×5 convolution, Rectified Linear units, 32 output channels

Layer 2 – Average pool, 2×2

Layer 3 – 5×5 convolution, Rectified Linear units, 32 output channels

Layer 4 – Average pool, 2×2

Layer 5 – 4×4 convolution, Rectified Linear units, 64 output channels

Layer 6 – Average pool, 2×2

Layer 7 – Hidden layer, Rectified Linear units, 64 output neurons

Layer 8 – Hidden layer, Linear units, 10 output neurons

Layer 9 – Softmax
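The mirroring augmentation mentioned above is simple enough to sketch in a few lines. This is a NumPy illustration, not my actual code, and the `(N, H, W, C)` array layout is an assumption:

```python
import numpy as np

def add_mirrored(images, labels):
    """Horizontally flip each image and append it to the training set.

    images: array of shape (N, H, W, C); labels: array of shape (N,).
    Returns a dataset twice the size, with labels duplicated.
    """
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

# e.g. 40,000 CIFAR-10 training images would become 80,000
X = np.random.rand(4, 32, 32, 3)
y = np.array([0, 1, 2, 3])
X2, y2 = add_mirrored(X, y)
print(X2.shape)  # (8, 32, 32, 3)
```

Mirroring is safe for CIFAR-10 because none of the 10 classes change label when flipped left-to-right.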

The training parameters changed a bit as well:

  • learning rate = 0.01, changed to 0.001 at epoch 28
  • momentum = 0.9
  • mini batch size = 64
  • all weights initialised using a Gaussian with μ = 0 and stdev = 0.1

For some reason my network is very sensitive to the weight initialisation. If I use stdev = 0.01, the network simply does not learn at all, sitting at a constant error of 90% (basically random chance). My first guess is that it has something to do with 32-bit floating point precision, particularly when small numbers keep getting multiplied by other small numbers as they pass through each layer.
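A rough way to see why a tiny init stdev can stall learning is to push random data through a stack of linear + ReLU layers and watch the activation magnitudes. The layer count and width here are made up for the demo, not my actual network:

```python
import numpy as np

def activation_scale(stdev, n_layers=8, width=256, seed=0):
    """Mean absolute activation after a stack of linear+ReLU layers
    whose weights are drawn from N(0, stdev^2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, width)).astype(np.float32)
    for _ in range(n_layers):
        W = (stdev * rng.standard_normal((width, width))).astype(np.float32)
        x = np.maximum(x @ W, 0.0)  # ReLU
    return float(np.abs(x).mean())

print(activation_scale(0.1))   # activations stay at a usable scale
print(activation_scale(0.01))  # activations collapse towards zero
```

With stdev = 0.01 the signal shrinks geometrically with depth, so by the output layer the gradients are effectively zero and nothing learns, which matches the constant 90% error.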

The higher learning rate of 0.01 works quite well and speeds up the learning process compared to the rate of 0.001 I used previously. Using a batch size of 64 instead of 128 means I perform twice as many updates per epoch, which should be a good thing. A mini-batch of 128 should in theory give a smoother gradient than 64, but doing twice as many updates sort of compensates.

At epoch 28 I reduce the learning rate to 0.001 to get a bit more improvement. The final results are:

  • training error – 9%
  • validation error – 23.3%
  • testing error – 24.4%

The results are similar to those reported by cuda-convnet for this kind of architecture. The training error being much lower than the validation and test errors indicates the network has enough capacity to model most of the training data, but is limited by how well it generalises to unseen data.

Numbers alone are a bit boring to look at so I thought it’d be cool to see visually how the classifier performs. I’ve made it output 20 correct/incorrect classifications on the test dataset, along with the probability of each image belonging to a particular category (10 in total).

Correctly classified

[Grid of 20 correctly classified test images, each shown with its predicted class probabilities]

Incorrectly classified

[Grid of 20 incorrectly classified test images, each shown with its predicted class probabilities]

The misclassifications are interesting because they give us some idea of what trips up the neural network. For example, the animals tend to get mixed up with each other because they share similar physical characteristics, e.g. eyes, legs, body.

Next thing I’ll try is adding translated versions of the training data. This is done by cropping the original 32×32 image into, say, 9 overlapping 24×24 images, evenly sampled, and mirroring each of the cropped images as well. This improves robustness to translation and has been reported to give a big boost in classification accuracy. It’ll expand the training data up to 18 times (9 crops, each plus its mirror)! Going to take a while to run …
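The cropping scheme could be sketched like this. It's a NumPy illustration under the assumption of a 3×3 grid of evenly spaced offsets, not my actual implementation:

```python
import numpy as np

def crop_and_mirror(image, crop=24, grid=3):
    """Take grid x grid evenly spaced crops of an image, plus mirrors.

    image: array of shape (32, 32, 3). Returns an array of shape
    (2 * grid * grid, crop, crop, 3), i.e. 18 samples per original
    image with the default settings.
    """
    h, w = image.shape[:2]
    offsets = np.linspace(0, h - crop, grid).astype(int)  # [0, 4, 8]
    out = []
    for dy in offsets:
        for dx in offsets:
            patch = image[dy:dy + crop, dx:dx + crop]
            out.append(patch)
            out.append(patch[:, ::-1])  # mirrored copy
    return np.stack(out)

patches = crop_and_mirror(np.zeros((32, 32, 3)))
print(patches.shape)  # (18, 24, 24, 3)
```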

I’m also in the process of cleaning up the code. Not sure on a release date, if ever. There are probably better implementations of convolutional neural networks out there (EBlearn, cuda-convnet), but if you’re really keen to use my code leave a comment below.

Convolutional neural network and CIFAR-10

I’ve been experimenting with convolutional neural networks (CNNs) for the past few months or so on the CIFAR-10 dataset (object recognition). CNNs have been around since the 90s but seem to be getting more attention ever since ‘deep learning’ became a hot new buzzword.

Most of my time was spent learning the architecture and writing my own code so I could understand CNNs better. My first attempt was a CPU version, which worked correctly but was not fast enough for any serious use. CNNs with complex architectures are notoriously slow to train, which is why everyone these days uses the GPU. It wasn’t until recently that I got a CUDA version of my code up and running. To keep things simple I didn’t do any fancy optimisation. In fact, I didn’t even use shared memory, mostly due to the way I structured my algorithm. Despite that, it was about 10-11x faster than the CPU version (single thread). But hang on, there’s already an excellent CUDA CNN code on the net, namely cuda-convnet, so why bother rolling my own? Well, because my GPU is a laptop GTS 360M (circa 2010), which only supports CUDA compute 1.2, well below the minimum requirements of cuda-convnet. I could get a new computer but where’s the fun in that 🙂 And besides, it’s fun to re-invent the wheel for learning reasons.


As mentioned previously, I’m working with the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images. Each image is a tiny 32×32 RGB image. I split the 50,000 training images into 40,000 and 10,000 for training and validation, respectively. The dataset has 10 categories, ranging from dogs, cats, cars, planes …

The images were pre-processed by subtracting the average image, computed over the whole training set, from each image to centre the data.
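In NumPy terms the preprocessing is just the following (illustrative, with stand-in random data in place of the real CIFAR-10 arrays):

```python
import numpy as np

# Centre the data: compute the per-pixel mean image over the training
# set and subtract it from every image (the same mean would also be
# subtracted from validation and test images).
train = np.random.rand(100, 32, 32, 3).astype(np.float32)  # stand-in data
mean_image = train.mean(axis=0)            # shape (32, 32, 3)
train_centred = train - mean_image
print(np.abs(train_centred.mean(axis=0)).max())  # ~0 after centring
```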

The architecture I used was inspired by cuda-convnet and is:

Input – 32×32 image, 3 channels

Layer 1 – 5×5 convolution filter, 32 output channels/features, Rectified Linear Unit neurons

Layer 2 – 2×2 max pool, non-overlapping

Layer 3 – 5×5 convolution filter, 32 output channels/features, Rectified Linear Unit neurons

Layer 4 – 2×2 max pool, non-overlapping

Layer 5 – 5×5 convolution filter, 64 output channels/features, Rectified Linear Unit neurons

Layer 6 – fully connected neural network hidden layer, 64 output units, Rectified Linear Unit neurons

Layer 7 – fully connected neural network hidden layer, 10 output units, linear neurons

Layer 8 – softmax, 10 outputs
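For reference, the softmax at the end turns the 10 linear outputs into class probabilities. A standard numerically stable version looks like this (a sketch, not my actual code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)  # guard against exp overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 10 linear outputs from the previous layer (made-up values)
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, -0.3, 1.2, 0.0, -2.0])
probs = softmax(logits)
print(probs.sum())  # 1.0
```

Subtracting the per-row maximum before exponentiating changes nothing mathematically but keeps `exp` from overflowing, which matters in 32-bit floats.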

I trained using a mini-batch size of 128, with a learning rate of 0.001 and momentum of 0.9. At each epoch (one pass through the training data), the data is randomly shuffled. At around the 62nd epoch I reduced the learning rate to 0.0001. The weights are updated for each mini-batch processed. Below shows the validation error vs epoch.
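The per-mini-batch update with momentum is the classic one. Roughly, as a NumPy sketch (not my actual code), for each weight array:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.001, mu=0.9):
    """One SGD-with-momentum update for a single weight array.

    v is the running velocity, carried over between mini-batches.
    Returns the updated weights and velocity.
    """
    v = mu * v - lr * grad
    return w + v, v

w = np.zeros(3)                 # weights
v = np.zeros(3)                 # velocity starts at zero
g = np.array([1.0, -2.0, 0.5])  # gradient from one mini-batch
w, v = momentum_step(w, v, g)
print(w)  # first step is just -lr * grad: [-0.001, 0.002, -0.0005]
```

With zero initial velocity the first step is plain gradient descent; on later steps the velocity accumulates, which smooths out the noise from small mini-batches.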


After 85 epochs the results are:

– training error 7995/40000 ≈ 20%

– validation error 3156/10000 = 31.56%

– test error 3114/10000 = 31.14%

Results seemed okay until I compared them with the results reported for cuda-convnet’s simplest architecture [1] [2]: ~8 epochs (?), 80 seconds, 26% testing error. Whereas mine took a few hours and many more epochs, so clearly I’m doing something wrong!!! But what? I did a rough back-of-the-envelope calculation based on the timing values they reported and determined that their GPU code runs 33x faster than mine. Which means my CUDA code and hardware suck badly.

On the plus side I did manage to generate some cool visualisations of the weights for layer 1. These are the convolution filters it learnt. The result is typical of what you’ll find published in the literature, so I’m confident I’m doing something right.

Features learnt by Layer 1

You can see it has learnt some edge and colour filters.

One thing I really want to try at the moment is to get my hands on a newer Nvidia card and see how much speed up I get without doing anything to the code.

I’m not releasing any code yet because it’s very experimental and too ugly to show.

Fun with ABS datapack, top 20 Viet suburbs in Victoria

Just downloaded the 2011 ABS (Australian Bureau of Statistics) data pack the other day. I first heard of it from Slashdot, where they mentioned it was a pain in the ass to download the data directly. The alternative is to fork out $200 to get a DVD delivered!! Fortunately, someone was being a true aussie and packaged it all up into a single 4.9GB torrent file. When decompressed it expands to a whopping 22 GB of CSV and some sort of map file.

Navigating the CSV files is a bit tricky because they make heavy use of acronyms and id codes that require a separate lookup file. Nonetheless, after 30 min or so I thought I’d compile some simple stats. For fun I made a list of the top 20 Viet suburbs in Victoria, Australia. Why? coz I’m Viet.

Rank  Suburb  2011 count (possible random noise added by ABS)
1 Springvale 4183
2 St Albans – South 3111
3 Braybrook 2891
4 Sunshine North 2462
5 St Albans – North 2386
6 Noble Park 2293
7 Springvale South 2227
8 Sunshine West 2144
9 Keysborough 2005
10 Kings Park (Vic.) 1639
11 Deer Park – Derrimut 1575
12 Cairnlea 1565
13 Richmond (Vic.) 1343
14 Footscray 1239
15 Maribyrnong 1125
16 Thomastown 1051
17 Sunshine 990
18 Keilor East 891
19 West Footscray – Tottenham 824
20 Lalor 790

I believe the count is based on people born in Vietnam; I’m not sure whether it includes Viets born in Australia. The data above tends to correlate with what I’ve observed.

What I found interesting about the data in general is the issue of confidentiality. To prevent the data from being traced back to individuals, they added random noise and even advised against using stats with small counts. How small is small? I have no idea. Also of interest is that it’s a fairly old Act:

Under the Census and Statistics Act (1905) it is an offence to release any information collected under the Act that is likely to enable identification of any particular individual or organisation. Introduced random error is used to ensure that no data are released which could risk the identification of individuals in the statistics.

Here are some links of interest on this topic:

I’ll probably spend more time playing with the data trying to come up with more racially targeted stats, because they’re cool, interesting and this is Australia 🙂