September 15, 2008
Last time I noticed that image-to-image search engines launch in batches was in May. Of course, “launch” usually means private beta. I also found it interesting that there are so many of them and yet they never mention or discuss each other.
Now, another batch - within a few days of each other.
First, Gazopa (what an awful name!) from Hitachi. Private beta.
Second, Imprezzeo. “Coming soon”.
Third, Picasa launched a face recognition feature. By most accounts it does not work well.
Fourth, VideoSurf “Unveils First Computer Vision Search for Video”. Private beta.
Finally, Idee updated its TinEye. Apparently, now it can match an image and its rotated version. That was my main problem with the application.
June 2, 2008
In part 1 and part 2 I discussed a paper on face recognition and the methods it relies on. Recall, each 100×100 gray scale image is a table of 100×100 = 10,000 numbers that can be rearranged into a 10,000-vector or a point in the 10,000-dimensional Euclidean space. As we discovered in part 2, using the closeness of these points as a measurement of similarity between images ignores the way the pixels are attached to each other. A deeper problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. The proper term for this alignment is image registration.
The similarity between images represented this way will be entirely based on their overlap. As a result, the distance can be large even between images that we would consider similar. In part 2 we had examples of one-pixel images. More realistic examples are these:
- image with an object in one corner and one with the same object in another corner;
- image of a cross and the same cross turned 45 degrees;
- etc.
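The first example in the list can be checked numerically. The sketch below is only an illustration under assumed sizes (a 10×10 image and a 3×3 "object", both made up): it compares the same object placed in two corners and shows that the two images are farther apart than either is from a blank image.

```python
import math

def flatten(image):
    """Rearrange the rows of a 2D table into one long vector."""
    return [value for row in image for value in row]

def euclidean(u, v):
    """Ordinary Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def blank(size):
    return [[0] * size for _ in range(size)]

def with_square(size, top, left, side=3):
    """Blank size x size image with a side x side square of ones at (top, left)."""
    image = blank(size)
    for r in range(top, top + side):
        for c in range(left, left + side):
            image[r][c] = 1
    return image

size = 10  # assumed toy resolution
corner_a = flatten(with_square(size, 0, 0))  # object in the upper left corner
corner_b = flatten(with_square(size, 7, 7))  # same object in the lower right corner
empty = flatten(blank(size))

d_same_object = euclidean(corner_a, corner_b)  # no overlap, so all 18 object pixels differ
d_to_blank = euclidean(corner_a, empty)        # only 9 pixels differ

print(d_same_object, d_to_blank)
```

Since the two copies of the object do not overlap, the representation treats them as less similar than an object and nothing at all.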
Back to face identification. As the faces are points in the 10,000-dimensional space, these points should be grouped somehow. The point is that all images of the same individual should belong to one group and not any other. It is common to consider “clusters” of points, i.e., groups formed of points close to each other. This was discussed above.
Now, in this paper the approach is different: a new point (the face to be identified) is represented as a linear combination of all other points (all faces in the collection).
As we know from linear algebra, this implies the following. First, (1) the entire collection has to be linearly dependent, and (2) you can find a subcollection with a nontrivial linear combination that adds up to 0! In other words, everything cancels out and you end up with a blank photo. Is it possible? If the dimension is low or the collection is large (the images are small relative to the number of images), maybe. What if the collection is small? (It is small – see below.) It seems unlikely. Why do I think so? Consider this very extreme case: you may need the negative for each face to cancel it: same shape with dark vs. light hair, skin, eyes, teeth (!)…
Second, the new image in the collection has to be a linear combination of training images of the same person. In other words, any image of person A is represented as a linear combination of other images of A in the collection, ideally. (More likely this image is supposed to be closer to the linear space spanned by these images.) The approach could only work under the assumption that people are linearly independent:
No face in the collection can be represented as a linear combination of the rest of the faces.
It’s a bold assumption.
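To make the "closer to the linear space spanned by these images" idea concrete, here is a minimal sketch. It is not the paper's algorithm, only a nearest-span classifier written for illustration; the 4-pixel "faces" and all numbers are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are already in the span
            basis.append([wi / norm for wi in w])
    return basis

def distance_to_span(x, vectors):
    """Distance from x to the linear space spanned by the vectors."""
    residual = list(x)
    for b in orthonormal_basis(vectors):
        coeff = dot(residual, b)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return math.sqrt(dot(residual, residual))

# Hypothetical training collection: two images per person, 4 pixels each.
training = {
    "A": [[1, 0, 0, 0], [0, 1, 0, 0]],
    "B": [[0, 0, 1, 0], [0, 0, 0, 1]],
}
new_face = [2, 3, 0.1, 0]  # almost a linear combination of A's images

distances = {person: distance_to_span(new_face, images)
             for person, images in training.items()}
identified = min(distances, key=distances.get)
print(distances, identified)
```

The toy works only because the spans of A and B were chosen to be independent; that is exactly the assumption questioned above.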
If it is true, then the challenge is to make the algorithm efficient enough. The idea is that you don’t need all of those pixels/features and they in fact could be random. That must be the point of the paper.
The testing was done on two collections with several thousand images each. That sounds OK, but the number of individuals in these collections was 38 and 114!
To summarize, there is nothing wrong with the theory but its assumptions are unproven and the results are untested.
P.S. It’s strange but after so many years computer vision still looks like an academic discipline and not an industry.
May 27, 2008
TinEye is an image-to-image search engine from Idée. It is in closed testing but I got to try it a couple of days ago. After a very positive review at TechCrunch, I decided to write up my impressions (a review of an earlier version is here).
They don’t make wild claims about being able to do face identification or similar (unsolved) problems. The goal seems very simple: find copies of images. With this task TinEye does a fairly good job. It finds even ones that have been modified - noise, color, stretch, crop, some photoshopping. It does not do well with rotation. That’s a major drawback (compare to Lincoln from MS Research).
These are the images that I tried.
  
Barbara: found both color and bw copies and a slightly cropped version.
Marilyn: found cropped and stretched versions, and even an edited (defaced) version.
Lenna: found both color and bw, but not partial or rotated versions (even though a rotated version is in the index).
May 12, 2008
Let’s review part 1 first. If you have a 100×100 gray scale image, it is simply a table of 100×100 = 10,000 numbers. You rearrange the rows of this table into a 10,000-vector and represent the image as a point in the 10,000-dimensional Euclidean space. This enables you to measure distances between images, discover patterns, match images, etc. Now, what is wrong with this approach?
Suppose A, B, and C are images with a single black pixel in the left upper corner, next to it, and the right bottom corner respectively. Then the distances will be equal: d(A,B) = d(B,C) = d(C,A), for any of the usual choices of the distance d(,) between points in this space (Euclidean, taxicab, maximum). The conclusion: if A and B are in the same cluster, then so is C. So adjacency of pixels and distance between them is lost in this representation!
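A quick check of the claim about A, B, and C, as a sketch only. The three distance functions are common textbook choices picked for illustration, and the 100×100 size comes from the example in part 1.

```python
import math

def one_pixel_image(size, row, col):
    """Flattened size x size image that is blank except for one black pixel."""
    image = [[0] * size for _ in range(size)]
    image[row][col] = 1
    return [value for r in image for value in r]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def taxicab(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def maximum(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

size = 100
A = one_pixel_image(size, 0, 0)                # left upper corner
B = one_pixel_image(size, 0, 1)                # next to it
C = one_pixel_image(size, size - 1, size - 1)  # right bottom corner

for d in (euclidean, taxicab, maximum):
    print(d.__name__, d(A, B), d(B, C), d(C, A))
```

Each of these distances only counts how the coordinates differ, not where the coordinates sit in the picture, so the neighbor B is exactly as far from A as the opposite corner C is.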
Of course this can be explained, as follows. The three images are essentially blank so it’s not surprising that they are close to the blank image and to each other. So as long as pixels are “small” the difference between these four images is justifiably negligible.
Of course, “small” pixels means “small” with respect to the size of the image. This means high resolution. High resolution means larger image (for the same “physical” object), which means higher dimension of the Euclidean space, which means higher computational costs. Not a good sign.
To take this line of thought all the way to the end, we have to ask the question: what if we keep increasing resolution?
The image will simply turn into an exact copy of the “physical” object. Initially, the image is a table of numbers. Now, you can think of the table as a rectangle subdivided into small squares, then the image is a function to the reals constant on each of these squares. As the resolution grows, the rectangle remains the same but the squares become smaller. In the end we have a - possibly continuous – function (as the limit of this sequence of functions). This is the “real” image and the rest are its approximations.
It’s not as clear what happens to the representations of images in the Euclidean space. The dimension of this space grows and in the end becomes infinite! It also seems that this new space should be made of infinite strings of numbers. That does not work out.
Indeed, consider this (“real”) image: a white square with a black upper left quarter. Let’s represent it first as a 2×2 image. Then in the 4-dimensional Euclidean space this image is (1,0,0,0). Now let’s increase the resolution. If this is a 4×4 image, it is (1,1,0,0,1,1,0,0,..,0) in the 16-dimensional space. For an 8×8 image, in the 64-dimensional space, it’s (1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,…,0). You can see the pattern. But what is the end result (as the limit of this sequence of points)? It can’t be (1,1,1,…), can it? It definitely isn’t the original image. That image can’t even be represented as a string of numbers, not in any obvious way…
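The sequence of vectors is easy to generate. A small sketch, with the function name made up for illustration: it builds the row-by-row vector of the quarter-black square at resolution n×n, with 1 for black and 0 for white.

```python
def quarter_black(n):
    """Flattened n x n image (n even) of a white square with a black upper left quarter."""
    half = n // 2
    return [1 if (r < half and c < half) else 0
            for r in range(n) for c in range(n)]

for n in (2, 4, 8):
    v = quarter_black(n)
    print(len(v), v)
```

An n×n image gives a vector of length n², so the 2×2, 4×4, and 8×8 versions live in spaces of dimension 4, 16, and 64. The runs of ones keep getting longer and there is no obvious limit vector.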
OK, these are just signs that there may be something wrong with this approach. A more tangible problem is that unless the two images are aligned first, there is no way to use this representation to discover that they depict the same or similar thing. About that in the next post.
May 7, 2008
After Google “launched” its ImageRank - by presenting a paper about it, now there are two more.
First, Idée “publicly launched” its image search engine (report here). If you want to try it, they’ll put you on a waiting list. How is it different from what we saw before?
Second, “Pixsta launches image search engine” (report here). Testing is also closed. What is the difference from what we saw before?
The only good thing here is that I discovered a better term for visual image search, CBIR, etc. It’s “image-to-image search”, as opposed to text-to-text and text-to-image we are familiar with.
April 29, 2008
A paper appeared recently on how to improve Google search. It has received a lot of media coverage including NY Times and TechCrunch. Since this is a topic that interests me a lot, I decided to write a few words.
The most important thing to understand here is that the paper isn’t about improving image search in general (especially visual image search and CBIR, see here). It is specifically about Google image search (and indirectly other search engines, MSN, Yahoo, etc). The goal is to improve it (because it sucks). It is currently based on surrounding text and as a result you get a lot of irrelevant images. Essentially, they add to this approach some image analysis. What kind? Not the best kind – “descriptors”. So there will be no analysis of the content of the image (see Fields related to computer vision). Even so, the descriptors will help to evaluate similarity between images - to a certain degree.
To summarize, some similarity measure plus hyperlinks - that will help with improving the search results for sure. Meanwhile, image search, image recognition etc remain unsolved.
April 7, 2008
I kept thinking about the issue of image analysis vs. computer vision. This is how it was interpreted in the article:
- Image Analysis: image in -> features out.
- Computer Vision: image in -> interpretation out.
The problem I had with this approach comes from this example: even though computing the distribution of colors in the image is analysis, it does not tell us anything about the contents of the image. My take was:
- Low level image analysis = image processing.
- High level image analysis = low level computer vision.
- High level computer vision = image understanding.
I want now to clarify this idea. The difference between low level analysis and high level analysis is that the latter reveals the content of the image – possibly on a low level. But how? My answer is:
Low level analysis is local and high level analysis is global.
There is a simple test for that:
The analysis is local when cutting the image into pieces and reassembling them in an arbitrary way does not affect the results.
You can even imagine that you arrange the pixels in a single row. You can analyze those pixels all you want but they can’t reveal the content of the picture! Here are some examples.
Local analysis:
- anything based on color/intensity histogram,
- statistics (mean, standard deviation, etc);
- anything based on local filtering, in particular edge detection.
Global analysis:
- Image segmentation;
- Fourier analysis;
- texture and pattern;
- morphological analysis (but only if the output is still image segmentation).
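The locality test above can be sketched as follows. The 4×4 image, the particular way of reassembling the pixels, and the use of connected-component counting as a crude stand-in for segmentation are all assumptions made for illustration.

```python
from collections import Counter

def histogram(image):
    """Local analysis: how many pixels there are of each value."""
    return Counter(value for row in image for value in row)

def count_components(image):
    """Global analysis: number of 4-connected groups of black (1) pixels."""
    rows, cols = len(image), len(image[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and (r, c) not in seen:
                count += 1
                stack = [(r, c)]
                seen.add((r, c))
                while stack:  # flood fill one component
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count

def rearrange(image):
    """Cut the image into pixels and reassemble: even-indexed pixels first, then odd."""
    cols = len(image[0])
    flat = [value for row in image for value in row]
    mixed = flat[0::2] + flat[1::2]
    return [mixed[i:i + cols] for i in range(0, len(mixed), cols)]

original = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
scrambled = rearrange(original)

print(histogram(original) == histogram(scrambled))               # histogram survives
print(count_components(original), count_components(scrambled))   # object count does not
```

The histogram cannot tell the two images apart, while the component count sees one object turn into four.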
It is interesting that in ImageJ’s Features page, we find no mention of image segmentation:
Analysis:
- Measure area, mean, standard deviation, min and max of selection or entire image.
- Measure lengths and angles.
- Use real world measurement units such as millimeters.
- Calibrate using density standards.
- Generate histograms and profile plots.
The only global item on that list is #2. And one still needs to find something to measure – it would have to come from image segmentation.
In visual image search (CBIR) image analysis is typically local: color distribution, edge distribution, other “descriptors”. Studying patches instead of pixels is still local if you measure the patches in pixels (filtering, morphology). But suppose you cut the image into 100 patches and then collect global information from each patch. Rearranging these patches is unlikely to produce a real life image. Lincoln from MS Research and some others operate this way.
To summarize,
High level analysis = global analysis = low level computer vision.
March 30, 2008
A new post at TechCrunch just appeared: Image Recognition Problem Finally Solved: Let’s Pay People To Tag Photos. A new company apparently provides image recognition for photo tagging - but only with human help! That’s not surprising to me. What is interesting is the change of attitude at TechCrunch: “A trail of failed startups have tried to tackle the problem… Google has effectively thrown in the towel…” After so many enthusiastic articles about image recognition technology somebody finally saw the light. And so did the founder of Riya. For a much longer list of “failed startups” in this area try this article about visual image search engines.
P.S. When I tried to reply to their post with a two-sentence comment, it was rejected. How odd!
P.P.S. The TechCrunch post was about TagCow; now I see a very recent post (elsewhere) about Picollator. They claim they have a visual image search engine for faces. People will keep trying….
January 24, 2008
cellAnalyst 1.0 is out! 
It is well known that the digital microscope has led to an explosion of the number of images that the researcher has to deal with. On the one hand, this opens opportunities of gathering enormous amounts of data. On the other hand, tools for automatic image analysis and data management become a necessity.
cellAnalyst is such a tool. It is intended to help the researcher to analyze his microscopy images and manage the output data.
It is especially suitable for analysis of cell images, cell counting and classification. In each image, cells are detected, captured, and measured. This data is presented as a table that lists all cells along with their characteristics. Each such table is recorded as an entry in a searchable database. Image processing and image management tools are also provided.
More features are planned for the future with the goal of a complete toolbox for high content analysis (HCA) and high content screening (HCS). A web based application is also under development. You can download cellAnalyst at http://cellAnalyst.net. The software comes with a complete user’s guide and a tutorial. Feedback will be appreciated.
After this formal introduction, it is interesting to observe that cellAnalyst can also be seen as a visual search engine. Indeed, the search is based entirely on the data extracted from images instead of text, tags, etc.
January 4, 2008
Via TechCrunch. Google filed a patent application “Recognizing Text In Images”. It is supposed to improve indexing of internet images and also read text in images from street views, store shelves etc. Since it is related to computer vision, I decided to take a look.
First I looked at the list of claims of the patent. Those are legal statements (a few dozen in this case) that should clearly define what is new in this invention. That’s what would be protected by the patent. The claims seem a bit strange. For example it seems that the “independent” claims (ones that don’t refer to other claims) end with “and performing optical character recognition on the enhanced image”. So, there seems to be no new OCR algorithms here…
According to some comments, the idea is that there is no “OCR platform being able to do images on the level that is suggested here”. It may indeed be about a “platform” because the claims are filled with generalities. Can you patent a “platform”? Basically you put together some well known image processing-manipulation-indexing methods plus some unspecified OCR algorithms and it works! This sounds like a “business method” patent. In fact, in spite of its apparent complexity the patent reminds me of the one-click patent. And what is the point of this patent? Is it supposed to prevent Yahoo or MS from doing the same? Are we supposed to forget that it has been done before?
ALIPR is Automatic Image Tagging and Visual Image Search. Unlike with many others, uploading and, therefore, testing is possible. The last report was about a year ago. Let’s see what progress has been made.
First I tried a simple image of ten coins on dark background. These are the tags: landscape, ice, waterfall, building, historical, ocean, texture, rock, natural, marble, sky, snow, frost, man-made, indoor. For a portrait of Einstein: animal, snow, landscape, mountain, lake, cloud, building, tree, predator, wild_life, rock, natural, pattern, mineral, people. Not very encouraging.
Then I read “About us”. Turns out ALIPR “is not designed for black&white photos”. Strange idea considering that b&w images are simpler than color and if you can’t solve a simpler problem how can you expect to solve a more complex one? OK, fine. Let’s try a color image (on the right). These are the tags: texture, red, food, indoor, natural, candies, bath, kitchen, painting, fruit, people, cloth, face, female, hair_style.
At this point I got bored…
This application was supposed to learn from its users. Clearly it hasn’t learnt anything. In fact there seems to be no change at all after a whole year, and there have been no blog posts since last January. Is it dead?
December 23, 2007
Yesterday TechCrunch announced the finalists of Crunchies - its awards for best start-up companies/product. Twenty categories - all boring! Except for one at the top - Best technology innovation / achievement. The interesting part is that 3 out of 5 finalists have computer vision related products! This seems fair to me but, sadly, there is nothing to celebrate here. Certainly not from the point of view of technology. These are the companies:
Like.com – “likeness” image search for shopping. Searches sometimes make sense but also seem cooked up. Even then the engine is easy to trick: a search for a watch with a secondary dial returned many watches without. Bottom line, nothing new after a whole year.
Earthmine – reconstructing cities from street views, “first geospatially accurate and complete street-level 3D data”. Well, conversion of 2D to 3D isn’t going to work. If they collect images continuously by driving through the streets and then patch the images together (that’s unclear), they get a 3rd dimension. Bottom line, even their demo shows only static (panoramic) shots, not a true 3D reconstruction.
Viewdle – face recognition in videos. Makes a good demo but there is no way to test it. Can such technology ever be reliable? Once again it’s the 2D-to-3D issue. Bottom line, how is it better than Polar Rose?
None gets my vote.
December 9, 2007
For a while I’ve been planning to resume reviews of image search applications but couldn’t decide which one to start with. The decision was made for me – coverage of Polar Rose (last reviewed in March) appeared. Was it in TechCrunch? No, it appeared in Time Magazine!
They should be embarrassed.
Let’s start with this quote at the top of the page in large letters: “these Tech Pioneers show that the best technology is often just a new way of thinking about an old problem” (bold face theirs). So you don’t need expertise or hard work, all you need is to be original. A feel-good idea for little kids…
Now about the article itself. It occupies just a half-page but there is also a whole page picture on the front page of the series “Tech Pioneers”. As far as image search is concerned the article repeats the old promises of the founder: “click on a photo to search for other photos”, “turns photographs into 3-D images”. A new promise is this: plug-in for FF is given away and the one for IE coming soon. OK fine, let’s go to the site, download and test it. Imagine my surprise when I discovered that there is only beta testing going on! The testing is closed and there is no way to try it. How come it’s not available? “It’s a very slow business…”, according to the founder.
My guess is that the reporter didn’t try it either. Apparently, the whole thing came to Time from the World Economic Forum. Here Polar Rose promises to launch face matching “later this year”. We’ll see…
December 5, 2007
A few recent developments:
The number of downloads of Pixcavator 2.3 has reached 10,000.
Idee has re-launched its visual image search application. There is no image uploading and as a result no way to test it. But even searching within its own collection of images indicates that shapes aren’t given serious consideration. More on image search here. More updates to come.
Last Sunday Ash Pahwa and I attended the exhibit of the annual meeting of the American Society for Cell Biology in Washington DC. We talked with some companies that make image analysis and related software. Nothing spectacular. Most interesting was the conversation with Ilya Goldberg, the representative of OME (Open Microscopy Environment).
A quick analysis revealed that the time complexity of the image analysis algorithm that runs Pixcavator is O(n²), where n is the number of pixels in the image.
November 15, 2007
A few days ago I received an e-mail from Lenny Kontsevich, CTO of Cognisign. They run xcavator.net (an unfortunate choice of name…) which is a visual image search site. Dr. Kontsevich expressed his disagreement with the review of their application in our wiki. The review is just three short sentences describing the application and my experience with it:
xcavator.net: Semi-manual search based on recording the color at key points chosen by the user. Seems like a good idea, why doesn’t it work? Experiment: an image of rose with 20 key [points] returns an image of a girl in a red bikini.
I want to describe exactly where this came from and give an update.
When I tried the application last year, it seemed that the approach to image matching and search was the following: by placing key points on the image you help the computer concentrate on the important parts. The computer can match images by studying them as a whole; for example, using the color histogram of the image. However you may be interested in matching only a part of the image in order to find different images depicting the same object. Then you need to have the capability for the user to select that object or a part of it. That’s the way I understood this application.
First, I carefully watched the video introduction. The sample image they showed was the Golden Gate Bridge and they put points in the right parts of the bridge: the metal parts of the frame and the holes in the frame. The end result was a series of very good matches of the bridge. The reason was that the points (fewer than a dozen) were chosen so well that there may be no other image that would have pixels of those colors located the same way with respect to each other. This approach made perfect sense to me.
Next I followed the instructions exactly. I picked an image of a flower and placed key points around the center (red) and in the middle (black). The result was a few red roses but at the same time each time I added a point to the original (10, then 20) one image would never disappear among the matches. It was the image of a girl in a red bikini.
Clearly that was not supposed to happen. The colors were matched but not the locations of the colors. My conclusion was that either this was a bug or I simply misunderstood the method.
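The difference between matching colors and matching colors together with their locations can be sketched as follows. This is not xcavator.net's algorithm, which I have no description of; both functions, the key points, and the coordinates are hypothetical and only illustrate the distinction.

```python
def colors_only_match(query_points, candidate_points):
    """True if both sets of key points carry the same colors, positions ignored."""
    return (sorted(color for _, color in query_points)
            == sorted(color for _, color in candidate_points))

def colors_and_layout_match(query_points, candidate_points):
    """True if the same colors appear in the same positions relative to each other."""
    def normalized(points):
        # Shift so the smallest row and column become 0: makes the test translation-proof.
        r0 = min(r for (r, _), _ in points)
        c0 = min(c for (_, c), _ in points)
        return sorted(((r - r0, c - c0), color) for (r, c), color in points)
    return normalized(query_points) == normalized(candidate_points)

# Hypothetical key points: ((row, column), color).
rose = [((5, 5), "black"), ((5, 3), "red"), ((5, 7), "red"),
        ((3, 5), "red"), ((7, 5), "red")]
shifted_rose = [((r + 10, c + 10), color) for (r, c), color in rose]
bikini = [((0, 5), "black"), ((2, 4), "red"), ((2, 6), "red"),
          ((6, 5), "red"), ((6, 6), "red")]

print(colors_only_match(rose, bikini))              # same colors, so it "matches"
print(colors_and_layout_match(rose, bikini))        # different arrangement
print(colors_and_layout_match(rose, shifted_rose))  # same arrangement, moved
```

With colors alone, adding more red points to the query never removes an image that has enough red in it, which would be consistent with what I saw.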
This experiment was what I reported in the blog (more than a year ago!). Later I also copied the report to the wiki. Unfortunately, the word ‘points’ in “20 key points” was lost. That may give (and it did) the impression that I chose 20 key words. The meaning changes completely. The review was also copied without the date. That may also have contributed to the misunderstanding because since then many things have changed. There is no Golden Gate Bridge anymore in the introductory video. It seems that the whole application is now much less about matching color+position and more about colors alone and especially tags.
The instructions still mention shapes. Repeating my experiment I didn’t get the girl in a red bikini this time. However I was unable to evaluate how much shapes contribute to image matching. The reason is that the matches are made first on the basis of color and on the basis of tags. You would never know whether the match was based entirely on colors and tags or on shape as well. There is no image upload and that’s what prevented me from testing it any further. This situation is not uncommon in this area. There are also no links that would help you understand how things work and I was not about to engage in “investigative reporting”.
Of course I’m biased here - I simply do not care about matching based on colors. What I am interested in is the image search technology that is independent of color and of course independent of tags. In some areas, such as radiology, all images are gray scale. That said, the purpose of xcavator.net is to help customers to search those huge collections of stock images. That seems to work fine.
The review of course will have to be updated.