Precision and recall
From Intelligent Perception
Pixcavator image search finds images based on similarity; one potential application of this software is searching for copyrighted images. Although there are numerous applications of this type, a discussion of performance evaluation is usually absent.
Let’s consider the performance evaluation commonly used in information retrieval. Below, we just replace “document” with “image”.
For a given query image, we assume the following:
- There are M relevant images in the collection (the correct matches).
- When the query is executed, N images are retrieved.
- Out of those, only R are relevant.
Then the following two measurements quantify the quality of the search:
Recall = R / M = Number of retrieved images that are also relevant / Total number of relevant images.
Precision = R / N = Number of retrieved images that are also relevant / Total number of retrieved images.
The recall is the answer to the question: How close am I to getting all good matches? The precision is the answer to the question: How close am I to getting only good matches?
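The two definitions can be computed directly from the set of relevant images and the set of retrieved images. A minimal sketch in Python (the image names here are made up for illustration):

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) given the set of relevant images
    and the set of images actually retrieved by the query."""
    relevant, retrieved = set(relevant), set(retrieved)
    r = len(relevant & retrieved)   # R: retrieved images that are also relevant
    precision = r / len(retrieved)  # R / N
    recall = r / len(relevant)      # R / M
    return precision, recall

# Hypothetical example: 3 relevant images, 4 retrieved, 2 in common.
p, r = precision_recall({"a.jpg", "b.jpg", "c.jpg"},
                        {"a.jpg", "b.jpg", "x.jpg", "y.jpg"})
# precision = 2/4, recall = 2/3
```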
In the example below, lenaC.jpg is the query, and the program searched for its modified versions (7 modified versions of the original: stretched, rotated, noised, etc.). Then N = 27, M = 8, R = 7, so recall = 7/8 and precision = 7/27.
Ideally, recall and precision would each be equal to 1. In reality, however, they tend to move in opposite directions. When the query is broad, recall is high but precision is low; when the query is restrictive, precision is high but recall is low. We have all experienced this effect searching Google, Yahoo, etc.
But what does “restrictive” mean in the image search context? The query is an image, so what is a “restrictive image”?
It turns out that, for this particular algorithm, images with strong, distinctive features work best. That is certainly vague; more specifically, these are images containing large, high-contrast objects, and enough of them: one large object, several medium-sized ones, or the whole image filled with dots. The last isn't something you'd call an image of good quality, but it is very stable under noise, blur, and other transformations.
In Pixcavator image search, the matches are simply ordered by their distance from the query (just like Google). Then we need to choose a cut-off. In the example considered above, the cut-off was implicitly "all that fit on one page". This is a reasonable standard for user-oriented applications. For experimentation and testing, however, we may want to use the distance instead. For example, below I may choose a cut-off distance of 80: all images within distance 80 of the query are declared matches and retrieved; the rest are not. Then recall = 7/8 and precision = 7/14.
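With a distance-based cut-off, retrieval reduces to keeping every image whose distance from the query is at most the threshold, in order of closeness. A sketch of this step, assuming the search produces (image, distance) pairs; the file names and distances below are hypothetical:

```python
def retrieve(matches, cutoff):
    """Keep the images whose distance from the query is within the cutoff,
    ordered by distance (closest first), as the matches are ranked."""
    within = [(img, d) for img, d in matches if d <= cutoff]
    return [img for img, d in sorted(within, key=lambda m: m[1])]

# Hypothetical matches as (image, distance-from-query) pairs.
matches = [("lena_rotated.jpg", 12), ("lena_noised.jpg", 35),
           ("unrelated.jpg", 95), ("lena_stretched.jpg", 60)]
print(retrieve(matches, 80))  # the image at distance 95 is excluded
```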
The choice was made by examining the search results for this particular image, in an attempt to include as many "good" matches as possible while excluding as many "bad" matches as possible. Further experimentation showed that 80 works reasonably well for other queries too. In general, however, this cannot be expected.
My conclusion is that the main drawback of precision and recall as a measure of the quality of the algorithm is that they require a cut-off to separate the retrieved images from the rest; the evaluation results then depend on this choice. In fact, this measure ends up being a measure of the quality of the query image, not of the algorithm.
See also the article on precision and recall, which provides a probabilistic interpretation of these measures.