Convolutional neural network and CIFAR-10, part 3

This is a continuation from the last post. This time I implemented translation + horizontal flipping. The translation works by cropping the 32×32 image into smaller 24×24 sub-images (9 to be exact) to expand the training set and avoid over fitting.

This is the network I used

  • Layer 1 – 5×5 Rectified Linear Unit, 64 output maps
  • Layer 2 – 2×2 Max-pool
  • Layer 3 – 5×5 Rectified Linear Unit, 64 output maps
  • Layer 4 – 2×2 Max-pool
  • Layer 5 – 3×3 Rectified Linear Unit, 64 output maps
  • Layer 6 – Fully connected Rectified Linear Unit, 64 output neurons
  • Layer 7 – Fully connected linear units, 10 output neurons
  • Layer 8 – Softmax

Below is the validation error during training. I’ve changed the way the training data is loaded to save memory. Each dataset is loaded and trained one at a time, instead of loading it all into memory. After a dataset is trained I output the validation error to file. Since I use 4 datasets for training each data point on the graph represents 1/4 of an epoch (one pass through all the training data).


I used an initial learning rate of 0.01, then changed to 0.001 at x=26 then finally 0.0001 at x=30. The other training parameters are

  • momentum = 0.9
  • mini batch size = 64
  • all the data centred (mean subtracted) prior to training
  • all weights initialised using a Gaussian of u=0 and stdev=0.1 (for some reason it doesn’t work with 0.01 like most people do)

The final results are:

  • training error ~ 17.3%
  • validation error ~ 18.5%
  • testing error ~ 20.1%

My last testing error was 24.4% so there is some slight improvement, though at the cost of much more computation. The classification code has been modified to better suit the 24×24 cropped sub-images. Rather than classify using only the centre sub-image all 9 sub-images are used. The softmax results from each sub-image is accumulated and the highest score picked. This works much better than using the centre image only. This is idea is borrowed from cuda-convnet.

Here are the features learnt for the first layer.


Using cropped sub-images and horizontal flipping the training set has expanded 18 times. The error gap between training error and validation error is now much smaller than before. This suggests I can gain improvements by using a neural network with a larger modeling capacity. This is true for the network used by cuda-convnet to get < 13% training error. Their network is more complex than what I’m using to achieve those results. This seems to be a ‘thing’ with neural networks where to get that extra bit of oomph the complexity of the network can grow monstrously, which is rather off putting.

Based on results collected for the CIFAR-10 dataset by this blog post the current best is using something called a Multi-Column Deep Neural Network, which achieves an error of 11.21%. It uses 8 different convolution neural networks (dubbed ‘column’) and aggregate the results together (like a random forest?). Each column receives the original RGB images plus some pre-processed variations. The individual neural network column themselves are fairly big beasts consisting of 10 layers.

I think there should be a new metric (or maybe there already is) along the lines of “best bangs for bucks”, where the state of the art algorithms are ranked based on something like [accuracy]/[number of model parameters], which is of particular interest in resource limited applications.



To compile and run this code you’ll need

  • CodeBlocks
  • OpenCV 2.x
  • C++ compiler that supports the newer C++11 stuff, like GCC

Instructions are in the README.txt file.

Convolutional neural network and CIFAR-10, part 2

Spent like the last 2 weeks trying to find a bug in the code that prevented it from learning. Somehow it miraculously works now but I haven’t been able to figure out why. First thing I did immediately was commit it to my private git in case I messed it up again. I’ve also ordered a new laptop to replace my non-gracefully aging Asus laptop with a Clevo/Sager, which sports a GTX 765M. Never tried this brand before, crossing my fingers I won’t have any problems within 2 years of purchase, unlike every other laptop I’ve had …

I’ve gotten better results now by using a slightly different architecture than before. But what improved performance noticeably was increasing the training samples by generating mirrored versions, effectively doubling the size. Here’s the architecture I used

Layer 1 – 5×5 convolution, Rectified Linear units, 32 output channels

Layer 2 – Average pool, 2×2

Layer 3 – 5×5 convolution, Rectified Linear units, 32 output channels

Layer 4 – Average pool, 2×2

Layer 5 – 4×4 convolution, Rectified Linear units, 64 output channels

Layer 6 – Average pool, 2×2

Layer 7 – Hidden layer, Rectified Linear units, 64 output neurons

Layer 8 – Hidden layer, Linear units, 10 output neurons

Layer 9 – Softmax

The training parameters changed a bit as well:

  • learning rate = 0.01, changed to 0.001 at epoch 28
  • momentum = 0.9
  • mini batch size = 64
  • all weights initialised using a Gaussian of u=0 and stdev=0.1

For some reason my network is very sensitive to the weights initialised. If I use a stdev=0.01, the network simply does not learn at all, constant error of 90% (basically random chance). My first guess is maybe something to do with 32bit floating point precision, particularly when small numbers keep getting multiply with other smaller numbers as they pass through each layer.

cnn2The higher learning rate of 0.01 works quite well and speeds up the learning process compared to using a rate of 0.001 I used previously. Using a batch size of 64 instead of 128 means I perform twice as many updates per epoch, which should be a good thing. A mini batch of 128 in theory should give a smoother gradient than 64 but since we’re doing twice as many updates it sort of compensates.

At epoch 28 I reduce the learning rate to 0.001 to get a bit more improvement. The final results are:

  • training error – 9%
  • validation error – 23.3%
  • testing error – 24.4%

The results are similar to the ones by cuda-convnet for that kind of architecture. The training error being much lower than the other values indicates the network has enough capacity to model most of the data, but is limited by how well it generalises to unseen data.

Numbers alone are a bit boring to look at so I thought it’d be cool to see visually how the classifier performs. I’ve made it output 20 correct/incorrect classifications on the test datase4t with the probability of it belonging to a particular category (10 total).

Correctly classified

correct-19 correct-18 correct-17 correct-16 correct-15 correct-14 correct-13 correct-12 correct-11 correct-10 correct-09 correct-08 correct-07 correct-06 correct-05 correct-04 correct-03 correct-02 Correctly classifiedcorrect-20

Incorrectly classified

error-00 Incorrectly classified error-18 error-17 error-16 error-15

error-14 error-13 error-12 error-11 error-10 error-09 error-08 error-07 error-06 error-05 error-04 error-03 error-02 error-01

The miss classification are interesting because it gives us some idea what trips up the neural network. For example, the animals tend to get mix up a bit because they share similar physical characteristics eg. eyes, legs, body.

Next thing I’ll try is to add translated versions of the training data. This is done by cropping the original 32×32 image into say 9 overlapping 24×24 images, evenly sampled. For each of the cropped images we can mirror them as well. This improves robustness to translation and has been reported to give a big boost in classification accuracy. It’ll expand the training data up to 18 times (9 images, plus mirror) ! Going to take a while to run …

I’m also in the process of cleaning the code. Not sure on a release date, if ever. There are probably better implementation of convolutional neural network (EBlearn, cuda-convnet) out there but if you’re really keen to use my code leave a comment below.

Convolutional neural network and CIFAR-10

I’ve been experimenting with convolutional neural networks (CNN) for the past few months or so on the CIFAR-10 dataset (object recognition). CNN have been around since the 90s but seem to be getting more attention ever since ‘deep learning’ became a hot new buzzword.

Most of my time was spent learning the architecture and writing my own code so I could understand them better. My first attempt was a CPU version, which worked correctly but was not fast enough for any serious use. CNN with complex architectures are notoriously slow to train, that’s why everyone these days use the GPU.  It wasn’t until recently that I got a CUDA version of my code up and running. To keep things simple I didn’t do any fancy optimisation. In fact, I didn’t even use shared memory, mostly due to the way I structured my algorithm. Despite that, it was about 10-11x faster than the CPU version (single thread). But hang on, there’s already an excellent CUDA CNN code on the net, namely cuda-convnet,  why bother rolling out my own one? Well, because my GPU is a laptop GTS 360M (circa 2010 laptop), which only supports CUDA compute 1.2. Well below the minimum requirements of cuda-convnet. I could get a new computer but where’s the fun in that 🙂 And also, it’s fun to re-invent the wheel for learning reasons.


As mentioned previously I’m working with the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images. Each image is a tiny 32×32 RGB image. I split the 50,000 training images into 40,000 and 10,000 for training and validation, respectively. The dataset has 10 categories ranging from dogs, cats, cars, planes …

The images were pre-processed by subtracting each image by the average image over the whole training set, to centre the data.

The architecture I used was inspired from cuda-convnet and is

Input – 32×32 image, 3 channels

Layer 1 – 5×5 convolution filter, 32 output  channels/features, Rectified Linear Unit neurons

Layer 2 – 2×2 max pool, non-overlapping

Layer 3 – 5×5 convolution filter, 32 output  channels/features, Rectified Linear Unit neurons

Layer 4 – 2×2 max pool, non-overlapping

Layer 5 – 5×5 convolution filter, 64 output  channels/features, Rectified Linear Unit neurons

Layer 6 – fully connected neural network hidden layer, 64 output units, Rectified Linear Unit neurons

Layer 7 – fully connected neural network hidden layer, 10 output units, linear neurons

Layer 8 – softmax, 10 outputs

I trained using a mini-batch of 128, with a learning rate of 0.001 and momentum of 0.9. At each epoch (one pass through the training data), the data is randomly shuffled. At around the 62th epoch I reduced the learning rate to 0.0001. The weights are updated for each mini-batch processed. Below shows the validation errors vs epoch.


After 85 epochs the results are:

– training error 7995/40000 ~ 20%

– validation error 3156/10000 = 31.56%

– test error 3114/10000 = 31.14%

Results seem okay until I compared them with results reported by cuda-convnet simplest architecture [1] [2]: ~8 epochs (?), 80 seconds, 26% testing error. Where as mine took a few hours and many more epochs, clearly I’m doing something wrong!!! But what? I did a rough back of the envelope calculation and determined that their GPU code runs 33x faster than mine, based on timing values they reported. Which means my CUDA code and hardware sucks badly.

On the plus side I did manage to generate some cool visualisation of the weights for layer 1. These are the convolution filters it learnt. This result is typical of what you will find published in the literature, so I’m confident I’m doing something right.

Features learnt by Layer 1

You can see it has learnt some edge and colour filters.

One thing I really want to try at the moment is to get my hands on a newer Nvidia card and see how much speed up I get without doing anything to the code.

I’m not releasing any code yet because it’s very experimental and too ugly to show.

Fun with ABS datapack, top 20 Viet suburbs in Victoria

Just downloaded the 2011 ABS (Australian Bureau of Statistics) data pack the other day. I first heard of it from Slashdot, where they mentioned it was a pain in the ass to download the data directly. The alternative is to fork out $200 to get a DVD delivered!! Fortunately, someone was being a true aussie and packaged it all up into a single 4.9GB torrent file. When decompressed it expands to a whopping 22 GB of CSV and some sort of map file.

Navigating the CSV files is a bit tricky because they make heavy use of acronyms and id codes that require a separate lookup file. Nonetheless, after 30 min or so I thought I’d compile some simple stats. For fun I made a list of the top 20 Viet suburbs in Victoria, Australia. Why? coz I’m Viet.

Suburb 2011 count (possible random noise added by ABS)
1 Springvale 4183
2 St Albans – South 3111
3 Braybrook 2891
4 Sunshine North 2462
5 St Albans – North 2386
6 Noble Park 2293
7 Springvale South 2227
8 Sunshine West 2144
9 Keysborough 2005
10 Kings Park (Vic.) 1639
11 Deer Park – Derrimut 1575
12 Cairnlea 1565
13 Richmond (Vic.) 1343
14 Footscray 1239
15 Maribyrnong 1125
16 Thomastown 1051
17 Sunshine 990
18 Keilor East 891
19 West Footscray – Tottenham 824
20 Lalor 790

I believe the count is based on people born in Vietnam, not sure about Viets born in Australia. The data above tends to correlate with what I’ve observed.

What I found interesting about the data in general is the issue of confidentiality. To protect the data from pin pointing back to individuals they added random noise to the data and even advised against using stats that have small numbers. How small is small? I have no idea. Also of interest is it’s a fairly old Act:

Under the Census and Statistics Act (1905) it is an offence to release any information collected under the Act that is likely to enable identification of any particular individual or organisation. Introduced random error is used to ensure that no data are released which could risk the identification of individuals in the statistics.

Here are some links of interest of this topic:

I’ll probably spend more time playing with the data trying to come up with more racially targeted stats, because they’re cool, interesting and this is Australia 🙂

Octave’s amusing fact() function

Using Octave today I was taking a guess at the function for factorial might be fact() only to find this amusing snippet:

— Command:  fact
— Function File: T = fact()
Display an amazing and random fact about the world’s greatest

Running it a few time came up with Chuck Norris equivalent jokes for Richard Stallman:

octave:4&gt; fact

When Richard Stallman executes ps -e, you show up.

octave:5&gt; fact

Richad Stallman's pinky finger is really a USB memory stick. 

octave:6&gt; fact

Richard Stallman discovered extra-terrestrial life but killed them because they
used non-free software.

octave:7&gt; fact

Behind Richard Stallman's beard there is another fist, to code faster.

octave:8&gt; fact

Richard Stallman's doesn't kill a process, he just dares it to stay running.

That’s pretty cute.

Sparse model denoising in action

I would like to share some denoising results using sparse modelling based on the SPAMS package. This was inspired by the topic of sparse modeling from the course ‘Image and video processing: From Mars to Hollywood with a stop at the hospital’ at Coursera. I highly recommend this course because it presents some state of the art methods in computer vision that I would have missed otherwise.

I was interested to see how well it would work on real noisy images so I found an old photo I took using a Canon Powershot S3 shot at night on 1600 ISO. Here is a 100% crop of the photo. As you can see, the noise is pretty awful!

Since there is a black sky in the background I thought it’ll be interesting to see what the noise distribution looks like. What do you know, it’s very Gaussian like! This is great because the square error formulation is well suited to this type of noise.

There are a few sparse model parameters one can fiddle around with, but in my experiement I’ve the kept the following fixed and only adjusted lambda since it seems to have the most pronounced effect

  • K = 200, dictionary size (number of atoms)
  • iterations = 100 – Ithink it’s used to optimize the dictionary + sparse vector
  • patch size = 8 (8×8 patch)
  • patch stride = 2 (skip every 2 pixels)

Here is with lambda = 0.1

lambda = 0.2

lambda = 0.9

It’s amazing how well it preserves the detail, especially the edges.

Python code can be downloaded here denoise.py_ (right click save as and remove the trailing _)

RBM and sparsity penalty

This is a follow up on my last post about trying to use RBM to extract meaningful visual features. I’ve been experimenting with a sparsity penalty that encourages the hidden units to rarely activate. I had a look at ‘A Practical Guide to Training Restricted Boltzmann Machines’ for some inspiration but had trouble figuring out how their derived formula fits into the training process. I also found some interesting discussions on MetaOptimize pointing to various implementations. But in the end I went for a simple approach and used the tools I learnt from the Coursera course.

The sparsity is the average probability of a unit being active, so they are applicable to sigmoid/logistic units. For my RBM this will be the hidden layer. If you look back in my previous post you can see the weights generate random visual patterns. The hidden units are active about 50% of the time, hence the random looking pattern.

What we want to do is reduce the sparsity so that the units are activated on average a small percentage of the time, which we’ll call the ‘target sparsity’. Using a typical square error, we can formulate the penalty as:

p = K\frac{1}{2} \left(s - t\right)^2

s = average\left(sigmoid\left(wx + b\right)\right)

  • K is a constant multiplier to tune the gradient step.
  • s is the current sparsity and is a scalar value, it is the average of all the MxN matrix elements.
  • t is the target sparsity, between [0,1].

Let the forward pass starting from the visible layer to the hidden layer be:

z = wx + b

h = sigmoid(z)

s = \frac{1}{M} \frac{1}{N} \sum\limits_{i} \sum\limits_{j} h_{ij}

  • w is the weight matrix
  • x is the data matrix
  • b is the bias vector
  • z is the input to the hidden layer

The derivative of the sparsity penalty, p, with respect to the weights, w, using the chain rule is:

\frac{\partial p}{\partial w} = \frac{\partial p}{\partial s} \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial w}

The derivatives are:

\frac{\partial p}{\partial s} = K\left(s - t\right) \;\; \leftarrow scalar

\frac{\partial s}{\partial h} = \frac{1}{NM} \;\; \leftarrow constant \; scalar

\frac{\partial h}{\partial z} = h\left(1-h\right)

\frac{\partial z}{\partial w} = x

The derivative of the sparsity penalty with respect to the bias is the same as above except the last partial derivative is replaced with:

\frac{\partial z}{\partial b} = 1

In actual implementation I omitted the constant  \frac{\partial s}{\partial h} = \frac{1}{NM}  because it made the gradients very small, and I had to crank K up quite high (in the 100s). If I take it out, good working values of K are around 1.0, which is a nicer range.


I used an RBM with the following settings:

  • 5000 input images, normalized to \mu=0, \sigma=1
  • no. of visible units (linear) = 64 (16×16 greyscale images from the CIFAR the database)
  • no. of hidden units (sigmoid) = 100
  • sparsity target = 0.01 (1%)
  • sparsity multiplier K = 1.0
  • batch training size = 100
  • iterations = 1000
  • momentum = 0.9
  • learning rate = 0.05
  • weight refinement using an autoencoder with 500 iterations and a learning rate of 0.01

and this is what I got

RBM sparsityQuite a large portion of the weights are nearly zerod out. The remaining ones have managed to learn a general contrast pattern. It’s interesting to see how smooth they are. I wonder if there is an implicit L2 weight decay, like we saw in the previous post, from introducing the sparsity penalty. There are also some patterns that look like they’re still forming but not quite finished.

The results are certainly encouraging but there might be an efficiency drawback. Seeing as a lot of the weights are zeroed out, it means we have to use more hidden units in the hope of finding more meaningful patterns. If I keep the same number of units but increase the sparsity target then it approaches the random like patterns.



Have a look at the README.txt for instructions on obtaining the CIFAR dataset.

RBM, L1 vs L2 weight decay penalty for images

I’ve been fascinated for the past months or so on using RBM (restricted Boltzmann machine) to automatically learn visual features, as oppose to hand crafting them. Alex Krizhevsky’s master thesis, Learning Multiple Layers of Features from Tiny Images, is a good source on this topic. I’ve been attempting to replicate the results on a much smaller set of data with mix results. However, as a by product of I did manage generate some interesting results.

One of the tunable parameters of an RBM (neural network as well) is a weight decay penalty. This regularisation penalises large weight coefficients to avoid over-fitting (used conjunction with a validation set). Two commonly used penalties are L1 and L2, expressed as follows:

weight\;decay\;L1 = \displaystyle\sum\limits_{i}\left|\theta_i\right|

weight\;decay\;L2 = \displaystyle\sum\limits_{i}\theta_i^2

where theta is the coefficents of the weight matrix.

L1 penalises the absolute value and L2 the squared value. L1 will generally push a lot of the weights to be exactly zero while allowing some to grow large. L2 on the other hand tends to drive all the weights to smaller values.


To see the effect of the two penalties I’ll be using a single RBM with the following configuration:

  • 5000 input images, normalized to \mu=0, \sigma=1
  • no. of visible units (linear) = 64 (16×16 greyscale images from the CIFAR the database)
  • no. of hidden units (sigmoid) = 100
  • batch training size = 100
  • iterations = 1000
  • momentum = 0.9
  • learning rate = 0.01
  • weight refinement using an autoencoder with 500 iterations and learning rate of 0.01

The weight refinement step uses a 64-100-64 autoencoder with standard backpropagation.


For reference, here are the 100 hidden layer patterns without any weight decay applied:

As you can see they’re pretty random and meaningless There’s no obvious structure. Though what is amazing is that even with such random patterns you can reconstruct the original 5000 input images quite well using a weighted linear combinations of them.

Now applying an L1 weight decay with a weight decay multiplier of 0.01 (which gets multiplied with the learning rate) we get something more interesting:

We get stronger localised “spot” like features.

And lastly, applying L2 weight decay with a multiplier of 0.1 we get

which again looks like a bunch of random patterns except smoother. This has the effect of smoothing the reconstruction image.

Despite some interesting looking patterns I haven’t really observed the edge or Gabor like patterns reported in the literature. Maybe my training data is too small? Need to spend some more time …

HP Pavilion DM1 + AMD E-450 APU, you both suck!

Last year I bought myself a small HP Pavilion DM1 for traveling overseas. On paper the specs look great compared to any of Asus EEE PC offering in terms of processing power and consumption. It’s got a dual core AMD E-450, Radeon graphics card,  2GB RAM and about 4-5 hrs of battery life. In reality? I’ve had to take this laptop in for warranty repair TWICE in a few short months. First time was a busted graphics chip that only worked if I plugged in an external monitor. The second time was a faulty hard disk singing the click of the death. Both were repaired within a day, so I can’t get too mad. But when I finally got around to installing all the tools I need to do some number crunching under Linux, this happens …

Yes that’s right, Octave crashes on a matrix multiply!!! WTF?!?

I ran another program I wrote in GDB and it reveals this interesting snippet

My guess is it’s trying to call some AMD assembly instruction that doesn’t exist on this CPU. Octave uses /usr/lib/libblas as well, which explains the crash earlier. Oh well, bug report time …

RBM Autoencoders

I’ve just finished the wonderful “Neural Networks for Machine Learning” course on Coursera and wanted to apply what I learnt (or what I think I learnt). One of the topic that I found fascinating was an autoencoder neural network. This is a type of neural network that can “compress” data similar to PCA. An example of the network topology is shown below.

The network is fully connected and symmetrical, but I’m too lazy to draw all the connections. Given some input data the network will try to reconstruct it as best as it can on the output. The ‘compression’ is controlled mainly by the middle bottleneck layer. The above example has 8 input neurons, which gets squashed to 4 then to 2. I will use the notation 8-4-2-4-8 to describe the above autoencoder networks.

An autoencoder has the potential to do a better job of PCA for dimensionality reduction, especially for visualisation since it is non-linear.

My autoencoder

I’ve implemented a simple autoencoder that uses RBM (restricted Boltzmann machine) to initialise the network to sensible weights and refine it further using standard backpropagation. I also added common improvements like momentum and early termination to speed up training.

I used the CIFAR-10 dataset to train 100 small images of dogs. The images are 32×32 (1024 vector) colour images, which I converted to grescale. The network I train on is:


The input, output and bottleneck are linear with the rest being sigmoid units. I expected this autoencoder to reconstruct the image better than PCA, because it has much more parameters. I’ll compare the results with PCA using the first 8 principal components.


Here are 10 random results from the 100 images I trained on.

The autoencoder does indeed give a better reconstruction than PCA. This gives me confidence that my implementation is somewhat correct.

The RMSE (root mean squared error) for the autoencoder is 9.298, where as for PCA it is 30.716, pixel values range from [0,255].

All the parameters used can be found in the code.


You can download the code here

Last update: 27/07/2013


You’ll need the following libraries installed

  • Armadillo (
  • OpenBLAS (or any other BLAS alternative, but you’ll need to edit the Makefile/Codeblocks project)
  • OpenCV (for display)

On Ubuntu 12.10 I use the OpenBLAS package in the repo. Use the latest Armadillo from the website if the Ubuntu one doesn’t work, I use some newer function introduced recently. I recommend using OpenBLAS over Atlas with Armadillo on Ubuntu 12.10, because multi-core support works straight out of the box. This provides a big speed up.

You’ll also need the dataset

Edit main.cpp and change DATASET_FILE to point to your CIFAR dataset path. Compile via make or using CodeBlocks.

All parameter variables can be found in main.cpp near the top of the file.