This page is a part of CVprimer.com, a wiki devoted to computer vision. It focuses on low level computer vision, digital image analysis, and applications. It is designed as an online textbook but the exposition is informal. It is geared towards software developers, especially beginners, and CS students. The wiki contains mathematics, algorithms, code examples, source code, compiled software, and some discussion. If you have any questions or suggestions, please contact me directly.

# Challenge: image recognition

Image test collections such as Caltech101 [1] make image recognition too easy by, for example, placing the object in the middle of the image. A quick look at Caltech101 reveals how extremely limited it is. In the sample images the objects are indeed centered. This means that the photographer gives the computer a hand. Also, there is virtually no background – it’s either all white or very simple, like grass behind the elephant. Now the size of the collection: 101 categories with most having about 50 images. So far this looks too easy.

Let’s look at the pictures now. It turns out that there is another problem elsewhere. The task is in fact too hard! The computer is supposed to see that the side view of a crocodile represents the same object as the front view. How? By training. Suggested number of training images: 1, 3, 5, 10, 15, 20, 30.

The idea of “training” (or machine learning) is that you collect as much information about the image as possible and then let the computer sort it out by some sort of clustering [2]. One approach is appropriately called “a bag of words” - patches in images are treated the way Google treats words in text, with no understanding of the content. You can only hope that you have captured the relevant information that will make image recognition possible. Since there is no understanding of what that relevant information is, there is no guarantee.
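To make the “bag of words” idea concrete, here is a toy sketch, not anyone's actual method. It assumes a binary image stored as a list of lists and treats every 2x2 patch as a “word”; the function name `bag_of_patches` and the two sample images are made up for illustration. The point is that the bag records which patches occur and how often, but not where.

```python
from collections import Counter

def bag_of_patches(image, size=2):
    """Count every size x size patch of a binary image, ignoring its position."""
    rows, cols = len(image), len(image[0])
    bag = Counter()
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            patch = tuple(
                tuple(image[r + dr][c + dc] for dc in range(size))
                for dr in range(size)
            )
            bag[patch] += 1
    return bag

# Hypothetical 5x5 images: the same 2x2 square in two different places.
square_top_left = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
square_bottom_right = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

same_bag = bag_of_patches(square_top_left) == bag_of_patches(square_bottom_right)
print(same_bag)
```

The two images differ, yet their bags coincide: the representation keeps the “vocabulary” and throws away the layout.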

Then how come some researchers claim that their methods work? Good question. My guess is that by tweaking your algorithm long enough you can make it work with a small collection of images. Also, just looking at color distribution could give you enough information to “categorize” some images – in a very small collection, with very few categories. Recall one of our principles: always start with black-and-white images.

So, what to do? In this wiki we start at a very low level – find the objects, their locations, sizes, shapes, etc. Once you’ve got those abstract objects, you can try to figure out what those objects represent.

Often the recognition task is very simple. For example, for a home security system you don’t need to detect faces to sound the alarm. The right range of sizes will do. A moving object larger than a dog and smaller than a car will trigger the alarm. All you need is a couple of sliders to set it up.
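The size rule can be sketched in a few lines. This assumes the moving object has already been found and its area measured in pixels; the names `should_alarm`, `MIN_AREA` and `MAX_AREA` and the numbers are hypothetical stand-ins for the two sliders.

```python
# Hypothetical slider settings, in pixels of object area.
MIN_AREA = 2_000    # assumed "larger than a dog"
MAX_AREA = 40_000   # assumed "smaller than a car"

def should_alarm(object_area, min_area=MIN_AREA, max_area=MAX_AREA):
    """Return True when the moving object's area falls between the two sliders."""
    return min_area < object_area < max_area

for area in (500, 10_000, 90_000):
    print(area, should_alarm(area))
```

The actual thresholds depend on the camera and the scene, which is why they belong on sliders rather than in the code.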

# Challenge: learning addition

The approach is generally as follows. Imagine you have a task that people do easily but you don’t understand how they do it. Now you solve the problem in these three "easy" steps.

• You set up a program that supposedly behaves like the human brain (something you don’t really understand),
• you teach it how to do the task by providing nothing but feedback (because you don’t understand how it’s done),
• the program runs pattern recognition and solves the problem for you.

If you can teach a computer to recognize objects, you can teach it simpler things.

How about teaching a computer how to add based entirely on feedback?

This would be a better test than image recognition because

• it is simpler and faster,
• the feedback is unambiguous,
• the ability to add is verifiable.

Essentially, you supply it with the sums of all pairs of numbers from 0 to 99 and then see if it can compute 100+100.

The answer, I believe, is: numerically this is easy, symbolically impossible.
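Here is one way to read that claim, as a sketch under my own assumptions. “Numerically” is taken to mean fitting a linear model f(a, b) = w1*a + w2*b + w0 by least squares to the sums of all pairs from 0 to 99; “symbolically” is taken to mean memorizing pairs of digit strings. The helper names `fit_linear` and `predict` are made up for the example.

```python
from fractions import Fraction

def fit_linear(samples):
    """Least-squares fit of f(a, b) = w1*a + w2*b + w0 via the normal equations."""
    n = 3
    xtx = [[0] * n for _ in range(n)]   # X^T X for the features (a, b, 1)
    xty = [0] * n                       # X^T y
    for a, b, y in samples:
        x = (a, b, 1)
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    # Gauss-Jordan elimination in exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

def predict(weights, a, b):
    w1, w2, w0 = weights
    return w1 * a + w2 * b + w0

# Training data: the sums of all pairs of numbers from 0 to 99.
samples = [(a, b, a + b) for a in range(100) for b in range(100)]

# Numerical learner: the fit extrapolates to a pair it has never seen.
weights = fit_linear(samples)
print(predict(weights, 100, 100))

# Symbolic learner: a table of strings has no entry for an unseen pair.
table = {(str(a), str(b)): str(a + b) for a, b, _ in samples}
print(table.get(("100", "100")))
```

Note that the numerical version works only because the person who chose the linear model already knew what addition looks like. The table knows nothing beyond what it was shown.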

Questions:

• Given a function with f('1','1')='2', will the computer figure out that f('1','2')='3'?
• Where would the idea of “is bigger than” or “is the following number of” come from if not from the person who creates the network?

If it comes from the training examples, then computers should be able to form concepts.

As for trying to imitate human vision, here are my thoughts. We simply don't know how a person looking at an apple forms the word ‘apple’ in his brain. This is also a part of another pattern – trying to emulate nature to create new technology. The idea is very popular but when has it ever been successful? Do cars have legs? Do planes flap their wings? Do ships have fins? What about the electric bulb, the radio, the phone? One exception may be stereo vision...

# Challenge: counting objects in an image

This time, let’s challenge machine vision with a very simple computer vision problem.

Given an image, find out whether it contains one object or more.

Let’s assume that the image is binary so that the notion of “object” is unambiguous. Essentially, you have one object if any two 1’s can be connected by a sequence of adjacent 1’s. Anyone with a minimal familiarity with computer vision (some even without) can write a program that solves this problem. But that’s irrelevant here because the computer is supposed to learn on its own, as follows.

• You have a computer with some general purpose machine learning program (meaning that no person provides insight into the nature of the problem).
• Then you show the computer images one by one and tell it which ones contain one object.
• After you’ve had enough images, the computer will gradually start to classify images on its own.
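For reference, the hand-written program mentioned above (the one that needs no learning) can be sketched as a flood fill. It assumes a binary image stored as a list of lists and takes “adjacent” to mean sharing an edge (4-adjacency); with diagonal neighbors allowed the counts may differ. The name `count_objects` is made up for the example.

```python
from collections import deque

def count_objects(image):
    """Count groups of 1's connected through edge-adjacent 1's."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # A new object: mark every pixel reachable from (r, c).
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

image = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
print(count_objects(image) == 1)  # does the image contain exactly one object?
```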
Questions:

First, why "gradually"? Why not give the computer all information at once and have it instantly become good at the task? One drawback is that you can’t keep tweaking the algorithm. But I think the main reason why machine learning is popular is that everyone likes to teach. It’s fun to see your child/student/computer learn something new and become better and better at it. This is very human – and also totally misplaced.

My second question is: this can work, but what would guarantee that it will? More narrowly, what information about the image should be passed to the computer to ensure that the computer will succeed more than 50% of the time, sooner or later?

Option 1: we pass all information. Then, this could work. For example, the computer represents every 100x100 image as a point in the 10,000-dimensional space and then runs clustering. First, this may be impractical and, second, does it really work? Will the one-object images form a cluster? Or maybe a hyperplane? One thing is clear: these images will be very close to the rest because the difference between one object and two may be just a single pixel.
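The single-pixel remark is easy to check. In this sketch (the image contents and the name `as_point` are made up for illustration) a horizontal bar across a 100x100 image is one object; removing one pixel from the middle leaves two objects. As points in the 10,000-dimensional space, the two images are at distance 1.

```python
import math

SIZE = 100

# One object: a horizontal bar across row 50.
one_object = [[0] * SIZE for _ in range(SIZE)]
for c in range(SIZE):
    one_object[50][c] = 1

# Two objects: the same bar with a single pixel removed from the middle.
two_objects = [row[:] for row in one_object]
two_objects[50][50] = 0

def as_point(image):
    """Flatten an image into a point in SIZE*SIZE-dimensional space."""
    return [pixel for row in image for pixel in row]

p, q = as_point(one_object), as_point(two_objects)
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
print(len(p), distance)
```

Any clustering based on this distance has to separate two points that are as close as two distinct binary images can be.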

Option 2: we pass some information. What if you pass information that cannot possibly help to classify the images the way we want? For example, you may pass just the value of the (1,1) pixel, or the number of 1’s, or 0’s, or their proportion. Who will make sure that the relevant information (adjacency) isn’t left out? The computer doesn’t know what is relevant – it hasn’t learned “yet”. If it’s the human, then he would have to solve the problem first, if not algorithmically then at least mathematically. Then the point of machine learning as a way to solve problems is lost.
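A small sketch of how such features fail. The two images and their labels are made up for illustration, and (1,1) is assumed to mean the top-left pixel. Both images produce exactly the same feature values, yet the teacher's labels differ, so no learner that sees only these features can tell them apart.

```python
def features(image):
    """The features listed above: the (1,1) pixel, counts of 1's and 0's, proportion of 1's."""
    pixels = [p for row in image for p in row]
    ones = sum(pixels)
    zeros = len(pixels) - ones
    return (image[0][0], ones, zeros, ones / len(pixels))

# Two adjacent 1's: one object (label supplied by the teacher).
image_a = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
label_a = "one object"

# Two separated 1's: two objects (label supplied by the teacher).
image_b = [
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
label_b = "more than one object"

print(features(image_a) == features(image_b), label_a == label_b)
```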

BTW, this “simple” challenge of counting the number of objects may also be posed for texts instead of images.

My conclusion is: don’t apply machine learning in image search, image recognition, etc., or anywhere software is expected to replace a human and solve a problem.

So, when can machine learning be useful? In cases where the problem can’t be, or hasn’t been, solved by a human. For example, a researcher is trying to find a cause of a certain phenomenon and there is a lot of unexplained data. Then - with some luck - machine learning may suggest to the researcher a way to proceed. Pattern recognition is a better name then.

The problem of collecting enough information from the image is also shared by CBIR (Content Based Image Retrieval).

Our approach to computer vision does not use machine learning or related ideas, see Overview.