June 15, 2008
Let’s go to Wikipedia. The first sentence is:
“Image segmentation is partitioning a digital image into multiple regions”.
This description isn’t what I would call a definition as it suffers from a few very serious flaws.
First, what does “partitioning” mean? A partition is a representation of something as the union of non-overlapping pieces. Then partitioning is a way of obtaining a partition. The part about the regions not overlapping each other is missing elsewhere in the article: “The result of image segmentation is a set of regions that collectively cover the entire image” (second paragraph).
Then, is image segmentation a process (partitioning) or the output of that process? The description clearly suggests the former. That's a problem because it emphasizes "how" over "what". It also suggests human involvement in a process that is supposed to be objective and reproducible.
Next, a segmentation is a result of partitioning but not every partitioning results in a segmentation. A segmentation is supposed to have something to do with the content of the image.
More nitpicking. Do the regions have to be "multiple"? The image may be blank or contain a single object. Does the image have to be "digital"? Segmentation of analogue images makes perfect sense.
A slightly better “definition” I could suggest is this:
A segmentation of an image is a partition of the image that reveals some of its content.
This is far from perfect. First, strictly speaking, what we partition isn’t the image but what’s often called its “carrier” – the rectangle itself. Also, the background is a very special element of the partition. It shouldn’t count as an object…
Another issue is with the output of the analysis. The third sentence is “Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images.” It is clear that “boundaries” should be read “their boundaries” here - boundaries of the objects. The image does not contain boundaries – it contains objects and objects have boundaries. (A boundary without an object is like Cheshire Cat’s grin.)
Once the object is found, finding its boundary is an easy exercise. This does not work the other way around. The article says: "The result of image segmentation [may be] a set of contours extracted from the image." But contours are simply level curves of some function. They don't have to be closed (like a circle). If a curve isn't closed, it does not enclose anything – it's a boundary without an object! More generally, searching for boundaries instead of objects is called "edge detection". In the presence of noise, one ends up with just a bunch of pixels – not even curves… And by the way, the language of "contours", "edges", etc. limits you to 2D images. Is segmentation of 3D images out the window?
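To make the distinction concrete, here is a minimal sketch (Python with NumPy/SciPy, my own choice of tools; the toy image and the thresholds are invented): labeling a thresholded image assigns every pixel to exactly one region, i.e., it partitions the carrier, while thresholding a gradient magnitude leaves a set of edge pixels with no guarantee of closed curves.

```python
import numpy as np
from scipy import ndimage

# Toy image: two bright blobs on a dark background, plus a little noise.
img = np.zeros((50, 50))
img[5:15, 5:15] = 1.0
img[30:45, 20:40] = 1.0
img += 0.1 * np.random.rand(50, 50)

# Segmentation: threshold, then label connected regions.
# Every pixel gets exactly one label; label 0 is the background,
# the "very special element of the partition" mentioned above.
labels, num_objects = ndimage.label(img > 0.5)

# Edge detection: gradient magnitude, then threshold.
# What remains is just a set of pixels; nothing forces them into closed curves.
gx, gy = ndimage.sobel(img, axis=0), ndimage.sobel(img, axis=1)
edges = np.hypot(gx, gy) > 1.0
```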
I plan to write a few posts about specific image segmentation methods in the coming weeks.
June 2, 2008
In part 1 and part 2 I discussed a paper on face recognition and the methods it relies on. Recall that each 100×100 gray scale image is a table of 100×100 = 10,000 numbers that can be rearranged into a 10,000-vector or a point in the 10,000-dimensional Euclidean space. As we discovered in part 2, using the closeness of these points as a measure of similarity between images ignores the way the pixels are attached to each other. A deeper problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. The proper term for this alignment is image registration.
The similarity between images represented this way will be entirely based on their overlap. As a result, the distance can be large even between images that we would consider similar. In part 2 we had examples of one-pixel images. More realistic examples are these:
- an image with an object in one corner and one with the same object in another corner (see the sketch after this list);
- image of a cross and the same cross turned 45 degrees;
- etc.
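A numerical sketch of the first example (the image size and the pixel values are made up, just for illustration):

```python
import numpy as np

blank = np.zeros((100, 100))
a = blank.copy(); a[0:10, 0:10] = 1        # object in the upper left corner
b = blank.copy(); b[90:100, 90:100] = 1    # the same object in the lower right corner

# No overlap, so the two images are as far from each other as they can be:
# sqrt(2) times the distance from either one to the blank image.
print(np.linalg.norm(a - b), np.sqrt(2) * np.linalg.norm(a - blank))
```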
Back to face identification. As the faces are points in the 10,000-dimensional space, these points should be grouped somehow, so that all images of the same individual belong to one group and no other. It is common to consider "clusters" of points, i.e., groups formed of points close to each other. This was discussed above.
Now, in this paper the approach is different: a new point (the face to be identified) is represented as a linear combination of all other points (all faces in the collection).
As we know from linear algebra, this implies the following: (1) the entire collection has to be linearly dependent, and (2) you can find a subcollection that, with suitable coefficients, adds up to 0! In other words, everything cancels out and you end up with a blank photo. Is it possible? If the dimension is low or the collection is large (the images are small relative to the number of images), maybe. What if the collection is small? (It is small – see below.) It seems unlikely. Why do I think so? Consider this very extreme case: you may need the negative for each face to cancel it: same shape with dark vs. light hair, skin, eyes, teeth (!)…
Second, the new image has to be a linear combination of training images of the same person. In other words, any image of person A is, ideally, represented as a linear combination of other images of A in the collection. (More likely, this image is supposed to be closer to the linear space spanned by these images than to the others.) The approach could only work under the assumption that people are linearly independent:
No face in the collection can be represented as a linear combination of the rest of the faces.
It’s a bold assumption.
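For what it's worth, here is a dense least-squares sketch of the "closer to the spanned linear space" idea from the previous paragraph. It is my own rough approximation, not the paper's algorithm; the function and its signature are hypothetical.

```python
import numpy as np

def classify(new_face, training_faces, labels):
    """Assign new_face to the person whose training images span the subspace
    closest to it (smallest least-squares residual).

    new_face:        vector of length d (d = 10,000 for a 100x100 image)
    training_faces:  d x n matrix, one column per training image
    labels:          length-n array of person ids, one per column
    """
    best, best_residual = None, np.inf
    for person in np.unique(labels):
        A = training_faces[:, labels == person]           # this person's images
        coeffs, *_ = np.linalg.lstsq(A, new_face, rcond=None)
        residual = np.linalg.norm(new_face - A @ coeffs)  # distance to span(A)
        if residual < best_residual:
            best, best_residual = person, residual
    return best
```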
If it is true, then the challenge is to make the algorithm efficient enough. The idea is that you don’t need all of those pixels/features and they in fact could be random. That must be the point of the paper.
The testing was done on two collections with several thousand images each. That sounds OK, but the number of individuals in these collections was 38 and 114!
To summarize, there is nothing wrong with the theory but its assumptions are unproven and the results are untested.
P.S. It’s strange but after so many years computer vision still looks like an academic discipline and not an industry.
May 12, 2008
Let’s review part 1 first. If you have a 100×100 gray scale image, it is simply a table of 100×100 = 10,000 numbers. You rearrange the rows of this table into a 10,000-vector and represent the image as a point in the 10,000-dimensional Euclidean space. This enables you to measure distances between images, discover patterns, match images, etc. Now, what is wrong with this approach?
Suppose A, B, and C are images with a single black pixel in the upper left corner, next to it, and the lower right corner, respectively. Then the distances will be equal: d(A,B) = d(B,C) = d(C,A), no matter how you define the distance d(,) between points in this space. The conclusion: if A and B are in the same cluster, then so is C. So the adjacency of pixels and the distance between them are lost in this representation!
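A quick check with the Euclidean distance and 10×10 images (the size is arbitrary):

```python
import numpy as np

def one_pixel_image(row, col, n=10):
    img = np.zeros((n, n))
    img[row, col] = 1
    return img.ravel()              # the image as a point in R^(n*n)

A = one_pixel_image(0, 0)           # black pixel in the upper left corner
B = one_pixel_image(0, 1)           # next to it
C = one_pixel_image(9, 9)           # in the lower right corner

# All three pairwise distances are sqrt(2), even though A and B are adjacent
# pixels and C is at the opposite corner.
print(np.linalg.norm(A - B), np.linalg.norm(B - C), np.linalg.norm(C - A))
```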
Of course this can be explained, as follows. The three images are essentially blank so it’s not surprising that they are close to the blank image and to each other. So as long as pixels are “small” the difference between these four images is justifiably negligible.
Of course, “small” pixels means “small” with respect to the size of the image. This means high resolution. High resolution means larger image (for the same “physical” object), which means higher dimension of the Euclidean space, which means higher computational costs. Not a good sign.
To take this line of thought all the way to the end, we have to ask the question: what if we keep increasing resolution?
The image will simply turn into an exact copy of the "physical" object. Initially, the image is a table of numbers. If you think of the table as a rectangle subdivided into small squares, then the image is a real-valued function constant on each of these squares. As the resolution grows, the rectangle remains the same but the squares become smaller. In the end we have a (possibly continuous) function, the limit of this sequence of functions. This is the "real" image and the rest are its approximations.
It’s not as clear what happens to the representations of images in the Euclidean space. The dimension of this space grows and in the end becomes infinite! It also seems that this new space should be made of infinite strings of numbers. That does not work out.
Indeed, consider this ("real") image: a white square with a black upper left quarter. Let's represent it first as a 2×2 image. Then in the 4-dimensional Euclidean space this image is (1,0,0,0). Now let's increase the resolution. As a 4×4 image, it is (1,1,0,0,1,1,0,0,…,0) in the 16-dimensional space. As an 8×8 image, it is (1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,…,0) in the 64-dimensional space. You can see the pattern. But what is the end result (as the limit of this sequence of points)? It can't be (1,1,1,…), can it? It definitely isn't the original image. That image can't even be represented as a string of numbers, not in any obvious way…
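The pattern is easy to generate (a tiny sketch of mine; it also shows the dimension running through 4, 16, 64, … without bound):

```python
import numpy as np

def quarter_black(n):
    """n x n raster of the white square with a black (=1) upper left quarter."""
    img = np.zeros((n, n), dtype=int)
    img[:n // 2, :n // 2] = 1
    return img

for n in (2, 4, 8):
    v = quarter_black(n).ravel()    # the image as a point in R^(n*n)
    print(len(v), v)
# prints vectors of length 4, 16, 64:
# (1,0,0,0), (1,1,0,0,1,1,0,0,0,...,0), (1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,...,0)
```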
OK, these are just signs that there may be something wrong with this approach. A more tangible problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. About that in the next post.
May 7, 2008
After Google "launched" its ImageRank – by presenting a paper about it – two more have appeared.
First, Idée “publicly launched” its image search engine (report here). If you want to try it, they’ll put you on a waiting list. How is it different from what we saw before?
Second, “Pixsta launches image search engine” (report here). Testing is also closed. What is the difference from what we saw before?
The only good thing here is that I discovered a better term for visual image search, CBIR, etc. It's "image-to-image search", as opposed to the text-to-text and text-to-image searches we are familiar with.
March 30, 2008
A new post at TechCrunch just appeared: Image Recognition Problem Finally Solved: Let’s Pay People To Tag Photos. A new company apparently provides image recognition for photo tagging - but only with human help! That’s not surprising to me. What is interesting is the change of attitude at TechCrunch: “A trail of failed startups have tried to tackle the problem… Google has effectively thrown in the towel…” After so many enthusiastic articles about image recognition technology somebody finally saw the light. And so did the founder of Riya. For a much longer list of “failed startups” in this area try this article about visual image search engines.
P.S. When I tried to reply to their post with a two-sentence comment, it was rejected. How odd!
P.P.S. The TechCrunch post was about TagCow, but now I see a very recent post (elsewhere) about Picollator. They claim they have a visual image search engine for faces. People will keep trying…
March 23, 2008
I liked this recent blog post 10 Important Differences Between Brains and Computers. The reason is that it gives plentiful evidence in favor of my contention that in designing computer vision systems one shouldn't try to imitate the human brain. There are two main reasons. First, computers and brains are very different. Second, and more importantly, we don't really know how brains operate!
March 4, 2008
In the last post I provided a list that compared the capabilities of ImageJ (without plug-ins) and Pixcavator 2.4 in the analysis of gray scale images. Then I submitted the link to the ImageJ forum.
The premise was very simple. The list contained enough of ImageJ's features to show that the two are comparable (in a certain narrow sense). It also contained some of Pixcavator's features that ImageJ doesn't have, to make the comparison interesting. I expected people to try it and give me some feedback. This is done every day because it's a fair trade: people get to try something new and I get to learn something new. That didn't happen.
My post was taken as an attack on ImageJ. The responses were along these lines:
- ImageJ is free (as in "free speech", as it was explained to me).
- ImageJ works on all platforms not just Windows.
- ImageJ’s plug-ins include “particle tracking, deconvolution, fourier transform, FRET analysis, 3D reconstruction, neuron tracing…”
Clearly, this wasn’t the kind of feedback I expected. I thought they were simply off topic.
To resolve the issue somewhat, I added the first two items to the table and also promised a post comparing ImageJ with plug-ins to the Pixcavator SDK (it's free, but unlike free speech it will only stay free for some time…).
Even though this was very unsatisfying, it wasn’t all bad - there were a few positive/neutral responses (thanks!) and there were spikes in the number of visits and downloads.
In retrospect, I should have made it clear that the comparison was from the point of view of a user not a developer. In this light, the main advantage of Pixcavator becomes evident – its simplicity!
So I didn't learn anything new and didn't get to improve my software – so what? I can turn this around and say that the end result is in fact good news:
None of the statements in the post has been refuted.
The only statement that has been refuted – multiple times – is: “Pixcavator is better than ImageJ”, the statement I never made or implied.
One interesting reaction came from Mark Burge: "I would hazard to say that everything in Pixcavator is surely available through a plugin". I wagered $100 that he was wrong. No response so far. How about we make this a bit more interesting? Here's a challenge:
$300 for the first person who shows that all of these features of Pixcavator's are reproducible by the existing ImageJ plug-ins!
Meanwhile life goes on. We had a couple of milestones recently. First, we reached 30,000 downloads of Pixcavator since January 2007 (versions 2.2 – 2.4). Second, the main page of the wiki has been visited 10,000 times since August 2007. Recently we have been getting over 80 visitors a day.
January 26, 2008
A study came out of MIT a couple of days ago. According to the press release the study “cautions that this apparent success may be misleading because the tests being used are inadvertently stacked in favor of computers”. The image test collections such as Caltech101 make image recognition too easy by, for example, placing the object in the middle of the image.
The titles of the press releases were "Computer vision may not be as good as thought" or similar. I must ask: who thought that computer vision was good? Who thought that testing image recognition on such a collection proves anything?
A quick look at Caltech101 reveals how extremely limited it is. In the sample images the objects are indeed centered. This means that the photographer gives the computer a hand. Also, there is virtually no background – it’s either all white or very simple, like grass behind the elephant. Now the size of the collection: 101 categories with most having about 50 images. So far this looks too easy.
Let's look at the pictures now. It turns out there is another problem: the task is in fact too hard! The computer is supposed to see that the side view of a crocodile represents the same object as the front view. How? By training. Suggested numbers of training images: 1, 3, 5, 10, 15, 20, 30.
The idea of "training" (or machine learning) is that you collect as much information about the image as possible and then let the computer sort it out by some sort of clustering. One approach is appropriately called "a bag of words" – patches in images are treated the way Google treats words in text, with no understanding of the content. You can only hope that you have captured the relevant information that will make image recognition possible. Since there is no understanding of what that relevant information is, there is no guarantee.
Then how come some researchers claim that their methods work? Good question. My guess is that by tweaking your algorithm long enough you can make it work with a small collection of images. Also, just looking at color distribution could give you enough information to “categorize” some images – in a very small collection, with very few categories.
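For the record, "categorizing by color distribution" can be as crude as the following sketch, nearest neighbor on RGB histograms. Everything here (the bin count, the distance, the function names) is a placeholder of mine, not something from the study:

```python
import numpy as np

def color_histogram(img, bins=8):
    """Normalized RGB histogram of an (h, w, 3) uint8 image, flattened."""
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def categorize(img, labeled_examples):
    """Nearest neighbor by histogram distance.
    labeled_examples: list of (image, category) pairs."""
    h = color_histogram(img)
    return min(labeled_examples,
               key=lambda ex: np.linalg.norm(color_histogram(ex[0]) - h))[1]
```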
My suggestion: try black-and-white images first!
January 4, 2008
Via TechCrunch. Google filed a patent application “Recognizing Text In Images”. It is supposed to improve indexing of internet images and also read text in images from street views, store shelves etc. Since it is related to computer vision, I decided to take a look.
First I looked at the list of claims of the patent. Those are legal statements (a few dozen in this case) that should clearly define what is new in this invention. That’s what would be protected by the patent. The claims seem a bit strange. For example it seems that the “independent” claims (ones that don’t refer to other claims) end with “and performing optical character recognition on the enhanced image”. So, there seems to be no new OCR algorithms here…
According to some comments, the idea is that there is no "OCR platform being able to do images on the level that is suggested here". It may indeed be about a "platform" because the claims are filled with generalities. Can you patent a "platform"? Basically, you put together some well-known image processing, manipulation, and indexing methods plus some unspecified OCR algorithms, and it works! This sounds like a "business method" patent. In fact, in spite of its apparent complexity, the patent reminds me of the one-click patent. And what is the point of this patent? Is it supposed to prevent Yahoo or MS from doing the same? Are we supposed to forget that it has been done before?
December 16, 2007
I am being sarcastic here, of course. I own a Roomba and I am aware that its navigation is based on bumping into walls and furniture. So, it doesn't use computer vision in the usual sense of the word. The point here is that computer vision (and AI in general) is a field which is about to change the world – always. But so far it hasn't. Compare that to the Roomba. It's quite dumb, but it does its job, and now you can have a robot in your house for the first time ever! So I'd say that the Roomba has changed the world, a bit.
A candidate for a commercial success of "real" computer vision may be face detection. It is now built into many cameras, such as Canon's. However, for the user it's just a minor feature. Unlike the Roomba. My guess is that a success of this magnitude is more likely to come from a vision system that is just a notch more complex than the Roomba's.
BTW, the Roomba does have vision, however rudimentary. It does not detect vertical changes, so it is fair to say that its vision is 1-dimensional. Taking time into account, it's 2-dimensional. Another 1D vision system is radar.
December 9, 2007
For a while I’ve been planning to resume reviews of image search applications but couldn’t decide which one to start with. The decision was made for me – coverage of Polar Rose (last reviewed in March) appeared. Was it in TechCrunch? No, it appeared in Time Magazine!
They should be embarrassed.
Let’s start with this quote at the top of the page in large letters: “these Tech Pioneers show that the best technology is often just a new way of thinking about an old problem” (bold face theirs). So you don’t need expertise or hard work, all you need is to be original. A feel-good idea for little kids…
Now about the article itself. It occupies just a half-page, but there is also a whole-page picture on the front page of the "Tech Pioneers" series. As far as image search is concerned, the article repeats the old promises of the founder: "click on a photo to search for other photos", "turns photographs into 3-D images". A new promise is this: the plug-in for Firefox is given away and the one for IE is coming soon. OK fine, let's go to the site, download it, and test it. Imagine my surprise when I discovered that there is only beta testing going on! The testing is closed and there is no way to try it. How come it's not available? "It's a very slow business…", according to the founder.
My guess is that the reporter didn’t try it either. Apparently, the whole thing came to Time from the World Economic Forum. Here Polar Rose promises to launch face matching “later this year”. We’ll see…
November 30, 2007
Yes, part 6! I thought I was done with the topic (Part 1, Part 2, Part 3,…), but a couple of days ago I ran into this blog post: General connectivity (MATLAB Central) . The issue is connectivity in digital images: 4-connectivity, 8-connectivity, and other “connectivities”. The issue (adjacency of pixels) is discussed in the wiki. When I wrote this article though I did not realize that the topic is related to measuring lengths of curves. Indeed, the 8-connectivity produces curves that go only horizontally or vertically while the 4-connectivity allows diagonal edges as well. In the post the curves appear as “perimeters” of objects. More accurately, they should be called contours or boundaries of objects as the perimeter is mathematically the length of the boundary (that’s where “meter” comes from). But bwperim is the name of the standard MATLAB command for finding the boundary and we will have to live with that…
My problem is with this idea: if you are a coder, it is important to make the right choice of connectivity. Let's break this into two statements:
- The choice of connectivity is important.
- The choice of connectivity is up to the coder.
The second statement reflects a general attitude that is very common - be it MATLAB or OpenCV. Important decisions are left to the coder and are hidden from the user. If something is important, I as a user would want software that produces output independent of a particular implementation. The two statements simply contradict each other.
Now, is the choice of connectivity really important?
My answer is NO.
The change of connectivity changes the boundary of the object and, therefore, its perimeter. This seems important. But we are interested in the perimeter of a "real" object, which should be independent (as much as possible) of the digital representation. This perimeter is the length of a "real" curve – the boundary of the object. We showed (in Part 3) that the relative error of the computation of length does not diminish with the growth of the image resolution. The accuracy is improved only by choosing more and more complex ways to compute the length (roughly, increasing the degree of the approximation of the curve). The choice of connectivity is determined by a 3×3 "matrix" (it's not a matrix, it's just a table! – another annoying thing about MATLAB). With finitely many choices the error can't be made arbitrarily low. You might conceivably improve the accuracy by choosing larger and larger "matrices" (tables!), but that seems pointless…
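A quick experiment along these lines (my own sketch, not MATLAB's bwperim): rasterize a disk, count the horizontal and vertical boundary steps, and compare with the true circumference. The ratio settles near 4/π ≈ 1.27 and stays there no matter how high the resolution.

```python
import numpy as np

def disk(radius, n):
    """Rasterize a disk of the given radius (in pixels) on an n x n grid."""
    y, x = np.mgrid[:n, :n]
    c = (n - 1) / 2.0
    return (x - c) ** 2 + (y - c) ** 2 <= radius ** 2

def staircase_perimeter(img):
    """Count the pixel edges between foreground and background, i.e. measure
    the boundary with horizontal and vertical steps only."""
    img = img.astype(int)
    padded = np.pad(img, 1)
    # for every foreground pixel, count how many of its 4 neighbors are background
    fg_neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                    padded[1:-1, :-2] + padded[1:-1, 2:])
    return int(np.sum(img * (4 - fg_neighbors)))

for radius in (10, 100, 1000):
    estimated = staircase_perimeter(disk(radius, 2 * radius + 5))
    print(radius, estimated / (2 * np.pi * radius))   # ~1.27 at every resolution
```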
There is another reason to think that this choice isn’t important. About that in the next post.
P.S. To clarify, every matrix is a table but not every table is a matrix (even if it contains only numbers). It is my view that tables should be called matrices only in the context of matrix operations, especially multiplication. In particular, a digital image is a table, not a matrix.
November 4, 2007
In the last post I asked: "If you can teach a computer to recognize objects, you can teach it simpler things. How about teaching a computer how to add based entirely on feedback?" I was going to write a post about this, but the next day I got an opportunity to approach it differently. At Hacker News I read the post How to teach a Bayesian spam filter to play chess. So I presented this (implicit) challenge.
How about teaching it how to do ADDITION?
This would be a better experiment because (1) it is simpler and faster, (2) the feedback is unambiguous, (3) the ability to add is verifiable.
Essentially you supply it with all sums of all pairs of numbers from 0 to 99 and then see if it can compute 100+100.
You can see the whole discussion here (my comments are under ‘pixcavator’). Let me give you some highlights.
First, it was educational for me. Books and sites were generously recommended. This one made me laugh:
My personal recommendation on machine learning is ‘Pattern Recognition and Machine Learning’ by Chris Bishop. But you definately do need a solid mathematical background for that.
I read some of what was recommended about Bayesian methods and neural networks – they were mostly irrelevant but interesting nonetheless. The discussion helped me formulate the answer to my own challenge:
Numerically this is easy, symbolically impossible.
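To back up the "numerically easy" half, here is about all it takes (a throwaway sketch of mine, not anything from the thread):

```python
import numpy as np

# Training data: all sums of pairs of numbers from 0 to 99.
pairs = np.array([(a, b) for a in range(100) for b in range(100)], dtype=float)
sums = pairs.sum(axis=1)

# Fit y = w1*a + w2*b + c by least squares -- treating the inputs as numbers.
X = np.column_stack([pairs, np.ones(len(pairs))])
w, *_ = np.linalg.lstsq(X, sums, rcond=None)

print(np.max(np.abs(X @ w - sums)))     # essentially zero on the training data
print(np.dot(w, [100.0, 100.0, 1.0]))   # ~200.0: extrapolating to 100+100 is trivial
```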
The OP did not take up the challenge but a few people responded positively – yes, the problem can be solved. I started to pose ‘naïve’ questions and it was fun to watch people dig themselves deeper and deeper. This one is my favorite (really deep):
Question: So given a function with f('1','1')='2', the computer will figure out that f('1','2')='3', right?
Answer: Yes, that’s what I’m talking about.
Another one:
Question: Where would the idea of “is bigger than” or “is the following number of” come from if not from the person who creates the network?
Answer: Training examples.
Question: Computers can form concepts, really?
Answer: If you want them to learn a specific concept that we know, they can learn it, yes.
This ‘yes’ is in fact a ‘no’.
As much fun as it was, I was kind of hoping that at least one person would see the light. No such luck…
November 2, 2007
Recently I read this press release: computers get “common sense” to tell a tennis ball from a lemon, right! I quickly dismissed it as another over-optimistic report about a project that will go nowhere. Then I read a short comment (by ivankirgin) here about it.
Common sense reasoning is one of the hardest parts of AI. I don't think top-down solutions will work. …You can't build a top-down taxonomy of ideas and expect everything to work. You can't just "hard code" the ideas.
I can’t agree more. But then he continues:
I think building tools from the ground up, with increasingly complicated and capable recognition and modeling, might work. For example, a visual object class recognition suite that first learned faces, phones, cars, etc. and eventually moved on to be able to recognize everything in a scene, might be able to automatically perhaps with some training build up the taxonomy for common sense.
First, he simply does not go far enough. I'd start at an even lower level – find the "objects", their locations, sizes, shapes, etc. This is the "dumb" approach I've been suggesting in this blog and the wiki. Once you've got those abstract objects, you can try to figure out what they represent. In fact, often you don't even need that. For example, a home security system doesn't need to detect faces to sound the alarm. The right range of sizes will do: a moving object larger than a dog and smaller than a car will trigger the alarm. All you need is a couple of sliders to set it up. Maybe the problem with this approach is that it's too cheap?
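Here is roughly what those couple of sliders amount to (a sketch; the area bounds and the threshold are invented numbers, not calibrated against actual dogs or cars):

```python
import numpy as np
from scipy import ndimage

MIN_AREA = 500    # "larger than a dog"  -- slider 1 (placeholder value)
MAX_AREA = 5000   # "smaller than a car" -- slider 2 (placeholder value)

def motion_alarm(prev_frame, frame, threshold=30):
    """Sound the alarm if a moving blob of the right size appears.
    prev_frame, frame: 2D arrays of gray scale values."""
    moving = np.abs(frame.astype(int) - prev_frame.astype(int)) > threshold
    labels, count = ndimage.label(moving)       # connected pieces of the motion mask
    areas = np.bincount(labels.ravel())[1:]     # their sizes, skipping the background
    return bool(np.any((areas >= MIN_AREA) & (areas <= MAX_AREA)))
```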
Another point was about training and machine learning. I have big doubts about the whole thing. It is very much like trying to imitate the brain (or something else we observe in nature). Imagine you have a task that people do easily but you don't understand how they do it. Now you solve the problem in these three easy steps:
- You set up a program that supposedly behaves like the human brain (something you don’t really understand),
- you teach it how to do the task by providing nothing but feedback (because you don’t understand how it’s done),
- the program runs pattern recognition and solves the problem for you.
Nice! "Set it and forget it!" (Here is another example of this approach.) If you can teach a computer to recognize objects, you can teach it simpler things. How about teaching a computer to add based entirely on feedback? Well, this topic deserves a separate post…
October 24, 2007
Caught a CSI show and started to reminisce. Tons of cool equipment, including computers of all kinds and software from the 22nd century. Fingerprint identification? In seconds! DNA analysis? A snap! Even face recognition (ha!). At the same time you see them staring at those old-fashioned light microscopes. The CSI people do their investigations in Second Life now but never use digital microscopes. Fiction?