@misc{ sentdex_image_2013,
title = {Image Recognition and Python Part 1},
month = oct,
year = {2013},
abstract = {Sample code for this series: http://pythonprogramming.net/image-recognition-python/

There are many applications for image recognition. One of the most familiar is facial recognition, the art of matching faces in pictures to identities. Image recognition goes much further, however. It can allow computers to translate written text on paper into digital text, and it can power machine vision, where robots and other devices recognize people and objects.

Here, our goal is to begin to use machine learning, in the form of pattern recognition, to teach our program what text looks like. In this case, we'll use numbers, but this could extend to all letters of the alphabet, words, faces, really anything at all. The more complex the image, the more complex the code will need to become. When it comes to letters and characters, though, it is relatively simple.

How is it done? As with any problem, especially in programming, we just need to break it down into steps, and it becomes much easier to solve. Let's break it down!

First, we know we want to show the program an image and have it compare that image to patterns it knows, in order to make an educated guess about what the current image is. This means we're going to need some "memory" of sorts, filled with examples. In the case of this tutorial, we'd like to do image recognition for the numbers zero through nine. We'd like to be able to show the program any random 2 and have it recognize the image as a 2, based on the previous examples of 2's it has seen and memorized.

Next, we need to consider how we'll do this. A computer doesn't read text the way we read text. We naturally put things together into a pattern, but a machine just reads the data. In the case of a picture, it reads in the image data and displays, pixel by pixel, what it is told to display. Beyond that, a machine makes no attempt to decide whether it is showing a couch or a bird. So our database of examples will actually be pixel information. To keep things simple, we should probably "threshold" the images, meaning we store everything as black or white. In {RGB} values, that's (255, 255, 255) or (0, 0, 0), per pixel. Sometimes there is an alpha channel too! What we can then do is take any image and, if a pixel's value is, say, greater than 125, call it more of a "white" and convert the entire pixel to 255. If it is less than or equal to 125, we call it more of a "black" and convert it to 0. This can be problematic in some circumstances, such as a dark color on an even darker background, often a type of image meant to fool machines. Instead, we could find the average "middle" color for the current image, and threshold anything lighter to white and anything darker to black. This works very well for two-dimensional images of things like characters, but less well for images with meaningful shading, say of something like a ball.
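The thresholding idea above can be sketched in plain Python, operating on a flat list of grayscale values rather than a real image file; the function names and the tiny 3x3 sample are illustrative, not from the series:

```python
def threshold(pixels, cutoff=125):
    """Map each grayscale value to pure white (255) or black (0).

    Values strictly greater than the cutoff become white; the rest black.
    """
    return [255 if p > cutoff else 0 for p in pixels]

def adaptive_threshold(pixels):
    """Same idea, but split around the image's own average value."""
    mid = sum(pixels) / len(pixels)
    return [255 if p > mid else 0 for p in pixels]

# A hypothetical 3x3 "image", flattened to a list of grayscale values.
sample = [30, 200, 126, 125, 0, 255, 90, 180, 10]
print(threshold(sample))           # [0, 255, 255, 0, 0, 255, 0, 255, 0]
print(adaptive_threshold(sample))  # average is about 113, so 125 and 126 also go white
```

The adaptive variant is what handles the dark-on-darker case: the cutoff moves with the image instead of sitting at a fixed 125.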

Once we've done this, all we need to do is save the string of pixel definitions for a bunch of "example" texts. We can start with a variety of fonts, plus some hand-drawn examples; there are also public data dumps of such examples. This is how we "train" our program.
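The example store described above could be as simple as a dictionary keyed by character label; a sketch, where `build_examples` is a made-up helper rather than anything from the series:

```python
def build_examples(labeled_pixels):
    """Group thresholded pixel lists by label to form the program's memory.

    `labeled_pixels` is an iterable of (label, pixel_list) pairs, e.g. one
    pair per font rendering or hand-drawn sample of each digit.
    """
    examples = {}
    for label, pixels in labeled_pixels:
        examples.setdefault(label, []).append(pixels)
    return examples

# Toy data: two samples of "2" and one of "7", already thresholded.
store = build_examples([("2", [0, 255]), ("2", [255, 0]), ("7", [255, 255])])
print(store)  # {'2': [[0, 255], [255, 0]], '7': [[255, 255]]}
```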

If we have a decently sized database, then we are ready to try to compare some numbers. A good idea would be to hand-draw an example for your program to compare against. To compare, we'd simply do the same thing to the question image: threshold it into black or white pixels, then take that pixel list and compare it to all of our examples. In the end, each character will have accumulated some number of "hits." Whichever character has the most "hits" is likely to be the correct one. Done, we've recognized that image.
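The comparison step above, counting per-pixel "hits" against every stored example and picking the label with the highest tally, might look like this under the same toy pixel-list representation (names are my own, not from the series):

```python
from collections import Counter

def count_hits(question, example):
    """Count positions where two thresholded pixel lists agree."""
    return sum(q == e for q, e in zip(question, example))

def recognize(question, examples):
    """Return the label whose stored examples accumulate the most hits.

    `examples` maps a label (e.g. "2") to a list of thresholded pixel lists.
    """
    tally = Counter()
    for label, samples in examples.items():
        for pixels in samples:
            tally[label] += count_hits(question, pixels)
    return tally.most_common(1)[0][0]

# Toy "memory" with one 4-pixel example per digit.
examples = {"0": [[255, 0, 255, 0]], "1": [[0, 255, 0, 255]]}
print(recognize([255, 0, 255, 255], examples))  # 0  (3 hits vs 1)
```

Note this is a nearest-template vote, not a learned model: accuracy depends entirely on how many examples the memory holds.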

If you think about it, this is actually very similar to how we humans recognize things. Naturally, many children do not immediately distinguish between couches and love seats. What is the difference, many of them ask. There is a bit of a grey area between them, and they have many similarities. Generally, a lot of learning comes by example. After seeing hundreds of couches, thousands of chairs, and hundreds of love seats, a person soon begins to easily distinguish between them, because they have quite a bit of sample data to compare against. This is even how we read text. A number 5 really means nothing to a baby. They only begin to learn what a 5 is as they are shown it over and over, being told it is "5." Eventually, they understand it to be a 5, and they can see a 5 in multiple font types and still recognize it.

Sentdex.com},
}