October 22, 2009

A common view of digital imaging

February 16, 2009

Object recognition demo from Numenta

Filed under: computer vision/machine vision/AI, image search, rants, reviews — Peter @ 6:01 pm

The link to this demo was sent to me by Ricardo Niederberger Cabral (thanks!). The demo program is called Vision4 and was created by Numenta. This is its main point:

This program demonstrates some capabilities of Numenta’s Hierarchical Temporal Memory (HTM) technology applied to visual object recognition. … The HTM network contained in this demo has been trained to recognize four types of objects: cell phones, sailboats, cows, and rubber ducks.

Every image is given four ratings. Each represents how much the image resembles one of the four types.

As you can see, the goal is modest and there are no unsubstantiated claims of how this is ready to be applied in real life (and don’t get me started on academic publications!). This is refreshing. The program is also fun to play with. You can load your own images, you can add noise, blur etc to the images and see the effect on the recognition. The recognition results are often good and when they aren’t, it’s still interesting.

For serious purposes, it is unclear where this is going though.

It’s fine with me that there are only four categories – just one would be enough to test the concept. It does not bother me when a face is rated high in the cow category and another face high in the duck category. My main complaint is the instability of recognition under image transformations. For example, after turning “sailboat” a few degrees it became “cell phone”. A few degrees more and it became mixed – half “cow” (first image below). Adding noise, occlusion, etc. has a similar effect (second image).

Certainly, one does not expect rotations to affect image recognition. Meanwhile, a mixed recognition is a failed recognition and should be presented as such.
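One way to present a mixed recognition as a failure can be sketched as follows. This is only an illustration: the function name `report`, the 0-to-1 rating scale, and both thresholds are assumptions made up for the example, not anything taken from the Vision4 demo.

```python
def report(ratings, min_top=0.5, min_margin=0.2):
    """Return the winning label, or None when the result is weak or mixed.

    `ratings` maps each category to a score; the 0..1 scale and the two
    thresholds are hypothetical choices for this sketch.
    """
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    (top_label, top), (_, runner_up) = ranked[0], ranked[1]
    if top < min_top or top - runner_up < min_margin:
        return None  # no clear winner: report a failed recognition
    return top_label


clear = {"cell phone": 0.05, "sailboat": 0.85, "cow": 0.05, "rubber duck": 0.05}
mixed = {"cell phone": 0.45, "sailboat": 0.05, "cow": 0.45, "rubber duck": 0.05}

print(report(clear))
print(report(mixed))
```

With these made-up numbers the first image is reported as a sailboat and the second, half “cell phone” and half “cow”, is reported as not recognized.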

I am certainly biased here. I don’t believe in “build[ing] machines that work on principles used by the brain”. I don’t believe in trying to imitate the brain and I’ve written about that a few times. Traditionally, a scientist tries to understand nature by observing it, analyzing it, etc. Instead, it is suggested that we try to understand nature by first understanding how the brain understands it? Seems like a roundabout route to me, bordering on a vicious circle. I also have serious reservations about the use of machine learning in computer vision.

Annoying bug: every time I start it, the program turns on my webcam and keeps it on even after I shut the program down.

February 10, 2009

Image search engines keep launching: Milabra

Filed under: computer vision/machine vision/AI, image search, rants, reviews — Peter @ 2:26 am

TechCrunch is happy to do PR for another visual search company: Milabra.

Milabra claims that it can categorize images, “from puppies to porn”:

…when searching through a library of images for dogs, Milabra doesn’t need to constantly compare each image with its database of known ‘dog’ images – instead, it can look for traits that it has learned to associate with “doggyness”…

The two examples in the demo are “beach” and “dog”. You upload an image with people on the beach, click “Search” and you get a page of beach photos… Wait, you don’t get to upload anything – this is just a video! So, there is no way to test their claims. Unfortunately, this is not unusual in this area and in computer vision in general.

If your software can recognize a puppy in an image (95% of the time as you claim), it should be easy for you to demonstrate this ability. Create a little web application (or desktop, I don’t care) that allows me to upload my own image which is then identified as “puppy” (or “tree”, or “street”, I don’t care). There is no such program. Why not? The answer is obvious.

In response to some skepticism, this is what one of the founders wrote:

…if you think that this cannot be done, then you are completely clueless: object classifiers have been made for more than 10 years now at leading CS labs around the world.

That reminds me of the episode of Seinfeld when Kramer decides to build levels in his apartment:

KRAMER: It’s a simple job. Why, you don’t think I can?

JERRY: Oh, no. It’s not that I don’t think you can. I know that you can’t, and I’m positive that you won’t.

This is Milabra’s team:

  • MBA
  • MS in Biological Engineering and PhD in neuroscience
  • MS in Computer Science and Ph.D. in Biophysics
  • Professional Project Manager
  • Expert in computer networking, user interface design

JERRY: I don’t see it happening.

And what about TechCrunch? Same story again and again since I started to keep track a couple of years ago: they publish an enthusiastic report about a company doing image analysis/search/recognition, and then silence. The company slips into obscurity and there is no follow-up, nothing. These people never learn…

The people who do seem to learn, slowly, are the investors: Riya (like.com) $20 million or more, Polar Rose $5 million, Milabra $1.4 million. Or maybe this is just the effect of the economic downturn?

December 22, 2008

Algebraic topology and digital image analysis

Filed under: computer vision/machine vision/AI, mathematics, rants — Peter @ 10:23 pm

In my last paper, I made a comment about topology of binary images: “These issues have been studied over the last 100 years or so and they are well understood”. It was pointed out to me that digital image analysis didn’t start until the 1960s, so how come?

Let me set the record straight.

The history is this. Algebraic topology was founded by Poincaré around 1900 (the title of his book “Analysis Situs” converted from Latin to Greek turns into “topology”). There was no talk about binary images, obviously. What they studied was cell complexes, collections of cells attached to each other in an appropriate way. The cells were initially only triangular but later of any shape. It was also informally assumed that all topological theorems are independent of the cell decomposition or representation. This fact was formally proven by the 1950s, roughly. By then all the issues had been settled and algebraic topology had become one of the central disciplines in mathematics. The first monographs were written in the 1930s (Alexandroff & Hopf) and the first (graduate) textbooks were written in the 1960s (Hilton & Wylie, Mac Lane, Spanier, and many more).

Undergraduate books are rare (one that I like the most and use is Topology of Surfaces by Kinsey). Courses are even rarer. As a result, computer scientists (and even mathematicians) are often unfamiliar with the well established ways of dealing with even the most elementary topological issues (and I mean really elementary: how many objects, which ones have holes or tunnels and how many, etc.)
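To make the “really elementary” questions concrete, here is a minimal sketch that counts objects and holes in a small binary image. It uses plain flood fill rather than homology, and it rests on assumptions that are not in the post: the image is a list of rows of 0s and 1s, objects are connected through 8 neighbors, background through 4 neighbors, and everything outside the frame counts as background. The function names are made up for the example.

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def components(image, value, neighbors):
    """Flood-fill every region of pixels equal to `value`.

    Returns one boolean per region: True if the region touches the border.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != value or seen[r][c]:
                continue
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(touches_border)
    return regions


def count_objects_and_holes(image):
    """Objects: 8-connected regions of 1s. Holes: enclosed regions of 0s."""
    objects = len(components(image, 1, EIGHT))
    holes = sum(1 for touches in components(image, 0, FOUR) if not touches)
    return objects, holes


# A ring (one object with one hole) next to a single isolated pixel.
example = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(count_objects_and_holes(example))  # (2, 1)
```

The sketch only reports totals; attributing each hole to its object, or handling tunnels in 3D, is where the cell complex machinery becomes worth the effort.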

Even though relevant papers pop up once in a while, the connection of image analysis to algebraic topology is not common knowledge among practitioners of computer vision and image analysis. I know this from personal experience…

The main reference on the subject is Computational Homology by Kaczynski, Mischaikow, and Mrozek. This is still very much a graduate text. Hopefully, our wiki is more accessible.

November 28, 2008

Image search engines still keep launching: Incogna

Filed under: image search, rants, reviews — Peter @ 1:19 am

The screenshot tells the whole story. The image of a table in the upper left corner is the query image. The rest are supposed to be “similar”. What is the image filled with numbers doing here you ask? Hmm… Oh yes, it’s a table of numbers!

Previous posts on the topic are here.

September 15, 2008

Image search engines still keep launching

Filed under: computer vision/machine vision/AI, image search, rants — Peter @ 12:08 am

Last time I noticed that image-to-image search engines launch in batches was in May. Of course, “launch” usually means private beta. I also found it interesting that there are so many of them and yet they never mention or discuss each other.

Now, another batch – within a few days from each other.

First, Gazopa (what an awful name!) from Hitachi. Private beta.

Second, Imprezzeo. “Coming soon”.

Third, Picasa launched a face recognition feature. By most accounts it does not work well.

Fourth, VideoSurf “Unveils First Computer Vision Search for Video”. Private beta.

Finally, Idee updated its TinEye. Apparently, now it can match an image and its rotated version. That was my main problem with the application.

June 15, 2008

What is image segmentation?

Filed under: computer vision/machine vision/AI, mathematics, rants, reviews — Peter @ 2:11 pm

Let’s go to Wikipedia. The first sentence is:

“Image segmentation is partitioning a digital image into multiple regions”.

This description isn’t what I would call a definition as it suffers from a few very serious flaws.

First, what does “partitioning” mean? A partition is a representation of something as the union of non-overlapping pieces. Then partitioning is a way of obtaining a partition. The part about the regions not overlapping each other is missing elsewhere in the article: “The result of image segmentation is a set of regions that collectively cover the entire image” (second paragraph).
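The difference between “regions that collectively cover the image” and a partition can be checked mechanically. The sketch below assumes each region is given as a set of (x, y) pixel coordinates inside a width × height grid; the function names are hypothetical.

```python
def is_cover(regions, width, height):
    """True if the regions together contain every pixel of the grid and nothing else."""
    grid = {(x, y) for x in range(width) for y in range(height)}
    return set().union(*regions) == grid


def is_partition(regions, width, height):
    """True if the regions cover the grid and no pixel belongs to two regions."""
    total = sum(len(region) for region in regions)
    # If the union is exactly the grid, equal counts mean no pixel is repeated.
    return is_cover(regions, width, height) and total == width * height


left = {(0, 0), (0, 1)}
right = {(1, 0), (1, 1)}
overlapping = [left | {(1, 0)}, right]

print(is_partition([left, right], 2, 2))   # True
print(is_cover(overlapping, 2, 2))         # True
print(is_partition(overlapping, 2, 2))     # False
```

The last example covers the whole image, as the Wikipedia sentence requires, yet it is not a partition.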

Then, is image segmentation a process (partitioning) or the output of that process? The description clearly suggests the former. That’s a problem because it emphasizes “how” over “what”. That suggests human involvement in the process that is supposed to be objective and reproducible.

Next, a segmentation is a result of partitioning but not every partitioning results in a segmentation. A segmentation is supposed to have something to do with the content of the image.

More nitpicking. Do the regions have to be “multiple”? The image may be blank or contain a single object. Does the image have to be “digital”? Segmentation of analogue images makes perfect sense.

A slightly better “definition” I could suggest is this:

A segmentation of an image is a partition of the image that reveals some of its content.

This is far from perfect. First, strictly speaking, what we partition isn’t the image but what’s often called its “carrier” – the rectangle itself. Also, the background is a very special element of the partition. It shouldn’t count as an object…

Another issue is with the output of the analysis. The third sentence is “Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images.” It is clear that “boundaries” should be read “their boundaries” here – boundaries of the objects. The image does not contain boundaries – it contains objects and objects have boundaries. (A boundary without an object is like Cheshire Cat’s grin.)

Once the object is found, finding its boundary is an easy exercise. This does not work the other way around. The article says: “The result of image segmentation [may be] a set of contours extracted from the image.” But contours are simply level curves of some function. They don’t have to be closed (like a circle). If a curve isn’t closed, it does not enclose anything – it’s a boundary without an object! More generally, searching for boundaries instead of objects is called “edge detection”. In the presence of noise, one ends up with just a bunch of pixels – not even curves… And by the way, the language of “contours”, “edges”, etc. limits you to 2D images. Is segmentation of 3D images out the window?
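The “easy exercise” can be sketched in a few lines. The assumptions here are not from the post: the object is given as a binary mask (a list of rows of 0s and 1s), and a boundary pixel is defined as an object pixel with at least one of its four neighbors outside the object.

```python
def boundary(mask):
    """Return the set of (row, col) object pixels that touch a non-object pixel."""
    rows, cols = len(mask), len(mask[0])

    def inside(y, x):
        return 0 <= y < rows and 0 <= x < cols and mask[y][x] == 1

    edge = set()
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] != 1:
                continue
            neighbors = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if not all(inside(ny, nx) for ny, nx in neighbors):
                edge.add((y, x))
    return edge


# A 3×3 square object inside a 5×5 image.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(len(boundary(square)))  # 8: every object pixel except the center
```

Nothing in this definition depends on the dimension: for a 3D image one would list six neighbors instead of four.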

I plan to write a few posts about specific image segmentation methods in the coming weeks.

June 2, 2008

Pattern recognition in computer vision, part 3

In part 1 and part 2 I discussed a paper on face recognition and the methods it relies on. Recall that each 100×100 gray scale image is a table of 100×100 = 10,000 numbers that can be rearranged into a 10,000-vector or a point in the 10,000-dimensional Euclidean space. As we discovered in part 2, using the closeness of these points as a measurement of similarity between images ignores the way the pixels are attached to each other. A deeper problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. The proper term for this alignment is image registration.

The similarity between images represented this way will be entirely based on their overlap. As a result, the distance can be large even between images that we would consider similar. In part 2 we had examples of one-pixel images. More realistic examples are these:

  • an image with an object in one corner and one with the same object in another corner;
  • image of a cross and the same cross turned 45 degrees;
  • etc.
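The first example in the list can be checked numerically. This is a small sketch under assumptions chosen for illustration: 8×8 images, a 2×2 block of 1s as the “object”, and the Euclidean distance between the flattened images. The helper names are made up.

```python
import math


def flatten(image):
    """Rearrange the rows of the table into one long vector."""
    return [pixel for row in image for pixel in row]


def distance(a, b):
    """Euclidean distance between two images of the same size."""
    return math.dist(flatten(a), flatten(b))


def with_block(size, top, left, block=2):
    """A size×size image of 0s with a block×block square of 1s at (top, left)."""
    return [
        [1 if top <= r < top + block and left <= c < left + block else 0
         for c in range(size)]
        for r in range(size)
    ]


blank = with_block(8, 0, 0, block=0)
corner_a = with_block(8, 0, 0)   # object in the upper left corner
corner_b = with_block(8, 6, 6)   # same object in the lower right corner

print(distance(corner_a, blank))     # 2.0
print(distance(corner_a, corner_b))  # sqrt(8), about 2.83
```

Since the two copies of the object do not overlap, the image is farther from its own shifted copy than it is from a blank image.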

Back to face identification. As the faces are points in the 10,000-dimensional space, these points should be grouped somehow. The point is that all images of the same individual should belong to one group and not any other. It is common to consider “clusters” of points, i.e., groups formed of points close to each other. This was discussed above.

Now, in this paper the approach is different: a new point (the face to be identified) is represented as a linear combination of all other points (all faces in the collection).

As we know from linear algebra, this implies the following: (1) the entire collection has to be linearly dependent, and (2) you can find a subcollection that, with suitable nonzero coefficients, adds up to 0! In other words, everything cancels out and you end up with a blank photo. Is it possible? If the dimension is low or the collection is large (the images are small relative to the number of images), maybe. What if the collection is small? (It is small – see below.) It seems unlikely. Why do I think so? Consider this very extreme case: you may need the negative for each face to cancel it: same shape with dark vs. light hair, skin, eyes, teeth (!)…

Second, the new image in the collection has to be a linear combination of training images of the same person. In other words, any image of person A is represented as a linear combination of other images of A in the collection, ideally. (More likely this image is supposed to be closer to the linear space spanned by these images.) The approach could only work under the assumption that people are linearly independent:

No face in the collection can be represented as a linear combination of the rest of the faces.

It’s a bold assumption.
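Both conditions are easy to test on toy data. The sketch below is not the method of the paper; it is plain Gaussian elimination over exact fractions, with tiny 4-pixel “faces” invented for the example and hypothetical function names.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of vectors, by Gaussian elimination with exact arithmetic."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(found, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(len(rows)):
            if i != found and rows[i][col] != 0:
                factor = rows[i][col] / rows[found][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[found])]
        found += 1
        if found == len(rows):
            break
    return found


def is_linearly_independent(vectors):
    """True if no vector is a linear combination of the others."""
    return rank(vectors) == len(vectors)


def in_span(vector, vectors):
    """True if `vector` is a linear combination of `vectors`."""
    return rank(vectors + [vector]) == rank(vectors)


# Three made-up "faces", each a 2×2 image flattened to 4 numbers.
faces = [[1, 0, 0, 2], [0, 1, 0, 1], [0, 0, 1, 0]]
new_face = [2, 3, 0, 7]   # equals 2*faces[0] + 3*faces[1]
stranger = [0, 0, 0, 1]

print(is_linearly_independent(faces))  # True
print(in_span(new_face, faces))        # True
print(in_span(stranger, faces))        # False
```

With real 10,000-pixel images an exact test like this is of little use, because noise makes almost any collection independent; that is presumably why closeness to the spanned subspace is used instead.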

If it is true, then the challenge is to make the algorithm efficient enough. The idea is that you don’t need all of those pixels/features and they in fact could be random. That must be the point of the paper.

The testing was done on two collections with several thousand images each. That sounds OK, but the number of individuals in these collections was 38 and 114!

To summarize, there is nothing wrong with the theory but its assumptions are unproven and the results are untested.

P.S. It’s strange but after so many years computer vision still looks like an academic discipline and not an industry.

May 12, 2008

Pattern recognition in computer vision, part 2

Let’s review part 1 first. If you have a 100×100 gray scale image, it is simply a table of 100×100 = 10,000 numbers. You rearrange the rows of this table into a 10,000-vector and represent the image as a point in the 10,000-dimensional Euclidean space. This enables you to measure distances between images, discover patterns, match images, etc. Now, what is wrong with this approach?

Suppose A, B, and C are images with a single black pixel in the left upper corner, next to it, and the right bottom corner respectively. Then the distances will be equal: d(A,B) = d(B,C) = d(C,A), for the Euclidean distance and for any other distance d(,) that treats all coordinates alike. The conclusion: if A and B are in the same cluster, then so is C. So adjacency of pixels and distance between them is lost in this representation!
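The three-image example can be verified directly. This sketch assumes 4×4 images with 1 for the black pixel and 0 elsewhere, and tries three standard distances on the flattened vectors; the helper names are made up.

```python
import math


def one_pixel_image(size, row, col):
    """Flattened size×size image with a single 1 at (row, col)."""
    return [1 if (r, c) == (row, col) else 0
            for r in range(size) for c in range(size)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


A = one_pixel_image(4, 0, 0)  # upper left corner
B = one_pixel_image(4, 0, 1)  # next to it
C = one_pixel_image(4, 3, 3)  # lower right corner

for d in (euclidean, manhattan, chebyshev):
    print(d.__name__, d(A, B), d(B, C), d(C, A))
```

Each of the three distances gives the same value for all three pairs, so none of them can tell that A and B are neighbors while C is far away.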

Of course this can be explained, as follows. The three images are essentially blank so it’s not surprising that they are close to the blank image and to each other. So as long as pixels are “small” the difference between these four images is justifiably negligible.

Of course, “small” pixels means “small” with respect to the size of the image. This means high resolution. High resolution means larger image (for the same “physical” object), which means higher dimension of the Euclidean space, which means higher computational costs. Not a good sign.

To take this line of thought all the way to the end, we have to ask the question: what if we keep increasing resolution?

The image will simply turn into an exact copy of the “physical” object. Initially, the image is a table of numbers. Now, you can think of the table as a rectangle subdivided into small squares, then the image is a function to the reals constant on each of these squares. As the resolution grows, the rectangle remains the same but the squares become smaller. In the end we have a – possibly continuous – function (as the limit of this sequence of functions). This is the “real” image and the rest are its approximations.

It’s not as clear what happens to the representations of images in the Euclidean space. The dimension of this space grows and in the end becomes infinite! It also seems that this new space should be made of infinite strings of numbers. That does not work out.

Indeed, consider this (“real”) image: a white square with a black upper left quarter. Let’s represent it first as a 2×2 image. Then in the 4-dimensional Euclidean space this image is (1,0,0,0). Now let’s increase the resolution. If this is a 4×4 image, it is (1,1,0,0,1,1,0,0,0,…,0) in the 16-dimensional space. For an 8×8 image, in the 64-dimensional space, it’s (1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,…,0). You can see the pattern. But what is the end result (as the limit of this sequence of points)? It can’t be (1,1,1,…), can it? It definitely isn’t the original image. That image can’t even be represented as a string of numbers, not in any obvious way…
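The sequence of vectors can be generated for any resolution. A minimal sketch, assuming an n×n image with n even, rows read left to right and top to bottom, 1 for black and 0 for white; the function name is made up.

```python
def quarter_image_vector(n):
    """Row-by-row vector of an n×n image whose upper left quarter is black (1)."""
    half = n // 2
    return [1 if r < half and c < half else 0
            for r in range(n) for c in range(n)]


for n in (2, 4, 8, 16):
    v = quarter_image_vector(n)
    print(n, len(v), v[:n], sum(v) / len(v))
```

The dimension grows as n², the leading run of 1s keeps getting longer, and the fraction of 1s stays at one quarter. The vector itself has no obvious limit, which is the problem described above.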

OK, these are just signs that there may be something wrong with this approach. A more tangible problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. About that in the next post.

May 7, 2008

Image search engines keep launching

Filed under: image search, news, rants — Peter @ 10:25 pm

After Google “launched” its ImageRank (by presenting a paper about it), there are now two more.

First, Idée “publicly launched” its image search engine (report here). If you want to try it, they’ll put you on a waiting list. How is it different from what we saw before?

Second, “Pixsta launches image search engine” (report here). Testing is also closed. What is the difference from what we saw before?

The only good thing here is that I discovered a better term for visual image search, CBIR, etc. It’s “image-to-image search”, as opposed to text-to-text and text-to-image we are familiar with.

March 30, 2008

TechCrunch finally realizes that image recognition has not been solved

Filed under: computer vision/machine vision/AI, image search, news, rants — Peter @ 4:34 am

A new post at TechCrunch just appeared: Image Recognition Problem Finally Solved: Let’s Pay People To Tag Photos. A new company apparently provides image recognition for photo tagging – but only with human help! That’s not surprising to me. What is interesting is the change of attitude at TechCrunch: “A trail of failed startups have tried to tackle the problem… Google has effectively thrown in the towel…” After so many enthusiastic articles about image recognition technology somebody finally saw the light. And so did the founder of Riya. For a much longer list of “failed startups” in this area try this article about visual image search engines.

P.S. When I tried to reply to their post with a two-sentence comment, it was rejected. How odd! 

P.P.S. The TechCrunch post was about TagCow, now I see a very recent post (elsewhere) about Picollator. They claim they have a visual image search engine for faces. People will keep trying…

March 23, 2008

Computers can’t imitate brains

Filed under: computer vision/machine vision/AI, news, rants — Peter @ 6:00 pm

I liked this recent blog post 10 Important Differences Between Brains and Computers. The reason is that it gives plentiful evidence in favor of my contention that in designing computer vision systems one shouldn’t try to imitate the human. There are two main reasons. Firstly, computers and brains are very different. Secondly and more importantly, we don’t really know how brains operate!

March 4, 2008

ImageJ vs. Pixcavator, a follow-up

Filed under: image processing/image analysis software, rants, reviews, updates — Peter @ 2:16 pm

In the last post I provided a list that compared the capabilities of ImageJ (without plug-ins) and Pixcavator 2.4 in analysis of gray scale images. Then I submitted the link to the ImageJ forum.

The premise was very simple. The list contained enough of ImageJ’s features to show that the two are comparable (in a certain narrow sense). It also contained some features of Pixcavator that ImageJ doesn’t have, to make the comparison interesting. I expected people to try it and give me some feedback. This is done every day because it’s a fair trade: people get to try something new and I get to learn something new. That didn’t happen.

My post was taken as an attack on ImageJ. The responses were along these lines:

  1. ImageJ is free (as in “free speech”, as it was explained to me).
  2. ImageJ works on all platforms not just Windows.
  3. ImageJ’s plug-ins include “particle tracking, deconvolution, fourier transform, FRET analysis, 3D reconstruction, neuron tracing…”

Clearly, this wasn’t the kind of feedback I expected. I thought they were simply off topic.

To resolve the issue somewhat I added the first two items to the table and also promised to have a post to compare ImageJ with plug-ins to Pixcavator SDK (it’s free but unlike free speech it will only stay free for some time…).

Even though this was very unsatisfying, it wasn’t all bad – there were a few positive/neutral responses (thanks!) and there were spikes in the number of visits and downloads.

In retrospect, I should have made it clear that the comparison was from the point of view of a user not a developer. In this light, the main advantage of Pixcavator becomes evident – its simplicity!

So I didn’t learn anything new and didn’t get to improve my software, so what? I can turn this around and say that the end result is in fact good news:

None of the statements in the post has been refuted.

The only statement that has been refuted – multiple times – is: “Pixcavator is better than ImageJ”, the statement I never made or implied.

One interesting reaction came from Mark Burge: “I would hazard to say that everything in Pixcavator is surely available through a plugin”. I wagered $100 that he was wrong. No response so far. How about we make this a bit more interesting? Here is a challenge:

$300 for the first person who shows that all of these features of Pixcavator are reproducible with existing ImageJ plug-ins!

Meanwhile life goes on. We had a couple of milestones recently. First, we reached 30,000 downloads of Pixcavator since January 2007 (versions 2.2 – 2.4). Second, the wiki – the main page – has been visited 10,000 times since August 2007. Recently we are getting over 80 daily visitors.

January 26, 2008

“Computer vision not as good as thought”, who thought?!

Filed under: computer vision/machine vision/AI, news, rants — Peter @ 7:05 pm

A study came out of MIT a couple of days ago. According to the press release the study “cautions that this apparent success may be misleading because the tests being used are inadvertently stacked in favor of computers”. The image test collections such as Caltech101 make image recognition too easy by, for example, placing the object in the middle of the image.

The titles of the press releases were “Computer vision may not be as good as thought” or similar. I must ask, Who thought that computer vision was good? Who thought that testing image recognition on such a collection proves anything?

A quick look at Caltech101 reveals how extremely limited it is. In the sample images the objects are indeed centered. This means that the photographer gives the computer a hand. Also, there is virtually no background – it’s either all white or very simple, like grass behind the elephant. Now the size of the collection: 101 categories with most having about 50 images. So far this looks too easy.

Let’s look at the pictures now. It turns out that there is another problem elsewhere. The task is in fact too hard! The computer is supposed to see that the side view of a crocodile represents the same object as the front view. How? By training. Suggested number of training images: 1, 3, 5, 10, 15, 20, 30.

The idea of “training” (or machine learning) is that you collect as much information about the image as possible and then let the computer sort it out by some sort of clustering. One approach is appropriately called “a bag of words” – patches in images are treated the way Google treats words in text, with no understanding of the content. You can only hope that you have captured the relevant information that will make image recognition possible. Since there is no understanding of what that relevant information is, there is no guarantee.
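A toy version of the “bag” idea shows what gets thrown away. This is a deliberate simplification and an assumption of the example: real systems cluster patch descriptors into a vocabulary, while here the “words” are just the exact 2×2 blocks of a small binary image, and the function name is made up.

```python
from collections import Counter


def bag_of_patches(image, size=2):
    """Count the non-overlapping size×size patches of an image, ignoring position."""
    rows, cols = len(image), len(image[0])
    bag = Counter()
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            patch = tuple(tuple(image[r + i][c:c + size]) for i in range(size))
            bag[patch] += 1
    return bag


original = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
# The same four patches, placed in different positions.
scrambled = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(original == scrambled)                                   # False
print(bag_of_patches(original) == bag_of_patches(scrambled))   # True
```

The two images are different, yet their bags are identical: where the patches sit relative to each other is exactly the information this representation does not keep.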

Then how come some researchers claim that their methods work? Good question. My guess is that by tweaking your algorithm long enough you can make it work with a small collection of images. Also, just looking at color distribution could give you enough information to “categorize” some images – in a very small collection, with very few categories.
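The guess about color can be illustrated with a coarse color histogram. Everything here is an assumption made for the sketch: images are flat lists of (r, g, b) pixels with values 0 to 255, each channel is split into 4 bins, similarity is histogram intersection, and the three “images” are invented.

```python
from collections import Counter


def color_histogram(pixels, bins=4):
    """Fraction of pixels falling into each coarse (r, g, b) bin."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity between 0 (no shared colors) and 1 (same distribution)."""
    return sum(min(share, h2.get(color_bin, 0.0)) for color_bin, share in h1.items())


# Made-up images: mostly grass with a gray animal, mostly grass with a
# white animal, and plain blue sky.
elephant = [(30, 200, 40)] * 90 + [(120, 120, 120)] * 10
sheep = [(35, 210, 50)] * 80 + [(250, 250, 250)] * 20
sky = [(40, 60, 220)] * 100

h_elephant, h_sheep, h_sky = map(color_histogram, (elephant, sheep, sky))
print(histogram_intersection(h_elephant, h_sheep))  # high: both are mostly green
print(histogram_intersection(h_elephant, h_sky))    # zero: no colors in common
```

The elephant and the sheep come out as “similar” only because both stand on grass, which is the kind of accidental success a small collection can reward.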

My suggestion: try black-and-white images first!

January 4, 2008

Google’s text-in-image patent – another one-click fiasco?

Filed under: computer vision/machine vision/AI, image search, news, rants, reviews — Peter @ 5:43 pm

Via TechCrunch. Google filed a patent application “Recognizing Text In Images”. It is supposed to improve indexing of internet images and also read text in images from street views, store shelves etc. Since it is related to computer vision, I decided to take a look.

First I looked at the list of claims of the patent. Those are legal statements (a few dozen in this case) that should clearly define what is new in this invention. That’s what would be protected by the patent. The claims seem a bit strange. For example it seems that the “independent” claims (ones that don’t refer to other claims) end with “and performing optical character recognition on the enhanced image”. So, there seem to be no new OCR algorithms here…

According to some comments, the idea is that there is no “OCR platform being able to do images on the level that is suggested here”. It may indeed be about a “platform” because the claims are filled with generalities. Can you patent a “platform”? Basically you put together some well known image processing-manipulation-indexing methods plus some unspecified OCR algorithms and it works! This sounds like a “business method” patent. In fact, in spite of its apparent complexity the patent reminds me of the one-click patent. And what is the point of this patent? Is it supposed to prevent Yahoo or MS from doing the same? Are we supposed to forget that it has been done before?
