The recently rebuilt Favcol presented a surprisingly interesting challenge: how to analyze the images.
Image processing at scale is effectively a solved problem. The algorithms are well optimized, and it's trivial to scale horizontally by adding more hardware to your image processing cluster. Sites like Flickr and Picasa have optimized the process enough to resize images on the fly if needed while serving thousands of requests a second.
Scaling image processing down is a different story. I think everyone I've ever talked to about processing images on a small site has a horror story. The story of Favcol is fairly typical.
The first version of Favcol was a Rails application, and used RMagick to manipulate images in memory. It was a disaster. Memory leaks caused processes to grow until the box crashed hard. Reaping processes helped a little, but the server I was running it on was supposed to be doing other things at the same time, and couldn't really wait 60 seconds to recover.
The next version shelled out to the gm GraphicsMagick command to manipulate files, then read the results back from disk. In theory this should have been slower and more expensive; in practice it was significantly more efficient. If there's one piece of advice I can give to anyone thinking about doing any kind of handling of large images, it's to do the hard work in a separate process unless you really know what you're doing. And if you think you know what you're doing, do the hard work in a separate process anyway, because you're probably wrong.
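The shell-out approach is simple enough to sketch. This is a minimal illustration, not the actual Favcol code; the function names are mine, and the gm flags shown are just the standard resize incantation:

```python
import subprocess

def gm_resize_command(src_path, dest_path, size=20):
    # Build the GraphicsMagick command line; gm's "convert -resize WxH"
    # scales the image to fit within a size x size box.
    return ["gm", "convert", src_path,
            "-resize", "%dx%d" % (size, size), dest_path]

def resize_with_gm(src_path, dest_path, size=20):
    # Run gm as a separate, short-lived process: any memory bloat or
    # crash dies with the child instead of the long-running server.
    subprocess.check_call(gm_resize_command(src_path, dest_path, size))
```

The point is the process boundary: whatever the image library does with memory, it does it in a child that exits after each file.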
Even so, reading a few hundred huge files every five minutes was still killing my server. One day Favcol crashed the machine again, and the cron job got disabled. The intent was to fix it quickly, but kids and work and life got in the way, and that never happened.
Eventually I started looking at alternatives. Upgrading my virtual server was more expensive than I was willing to pay to host something like Favcol. I could make the bills cheaper by bringing up an EC2 instance to batch process images for half an hour each day, but part of the fun of Favcol is seeing your photo appear within a few minutes. I looked for online services for image processing, and found many different ways to resize or post-process images and no services to give me an average color. I even briefly considered doing the work on visitors' computers with &lt;canvas&gt;.
Google App Engine kept bubbling up as a potential solution - it's free if you stay below a quota and has a built-in image manipulation API. The only problem was that App Engine offers no easy way to get at the raw pixel data for an image that has been processed, which is the only data I needed.
Eventually I realized there is a workaround.
The trick is that PNG files are easy to read, even from high-level scripting languages like Python. So you can use the App Engine Image Manipulation Service to convert an image into a smallish PNG, then read the raw data using a pure-Python library like pypng:
```python
import urllib2
import png
from google.appengine.api import images

# go grab the image
result = urllib2.urlopen(url)

# resize to a 20px thumbnail
img = images.Image(result.read())
img.resize(width=20, height=20)
thumbnail = img.execute_transforms(output_encoding=images.PNG)

# read the thumbnail
r = png.Reader(bytes=thumbnail)
png_w, png_h, pixels, info = r.asDirect()
```
It's a hack, but it works well enough to process a few thousand images throughout the day without costing me any money.
The full code I use is up on GitHub. It only does a basic mean RGB average at the moment, but it should be easy to add other metrics like dominant colour.
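For what it's worth, dominant colour could be approximated the same way. This is a sketch of one possible approach, not anything in the Favcol code: quantise each channel into coarse buckets so near-identical shades count together, then pick the most common bucket:

```python
from collections import Counter

def dominant_rgb(rows, planes, bucket=32):
    # Count pixels by coarsely quantised colour, then return the
    # centre of the most common bucket as the "dominant" colour.
    counts = Counter()
    for row in rows:
        for i in range(0, len(row), planes):
            key = (row[i] // bucket,
                   row[i + 1] // bucket,
                   row[i + 2] // bucket)
            counts[key] += 1
    r, g, b = counts.most_common(1)[0][0]
    half = bucket // 2
    return (r * bucket + half, g * bucket + half, b * bucket + half)
```

The bucket size is a trade-off: too small and every shade is its own colour, too large and distinct colours merge.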
I hope it's useful.