This page is a part of CVprimer.com, a wiki devoted to computer vision. It focuses on low-level computer vision, digital image analysis, and applications. It is designed as an online textbook, but the exposition is informal. It is geared towards software developers, especially beginners, and CS students. The wiki contains mathematics, algorithms, code examples, source code, compiled software, and some discussion. If you have any questions or suggestions, please contact me directly.

Overview


Our image analysis algorithm and Pixcavator [1] were initially created to illustrate and test how an elementary tool of algebraic topology, homology theory, can be used in computer vision. The idea is that just as Mathematics rests on Topology (and Algebra), Computer Vision should be built on a simple topological foundation. Using Pixcavator as such a platform, one can build more advanced image analysis applications.

The scope of the project grew as soon as I realized that computer vision lacks a solid foundation. I think there are many reasons, but the biggest one is the emphasis on different “solutions”, “methods” and “algorithms”. Suppose you have many solutions for a computer vision problem. That may sound good to the developer, but is it good for the user? As a user I don't mind having different ways to solve my problem - but only if they all give the same answer! So, here is the first principle we'll follow in this project.

We focus more on WHAT than HOW.

Of course, for each WHAT there may be several HOWs and we need to find the best one. But for now just one will do. So, here is the best part:

Our method is based on a single algorithm.

But what is this WHAT? It is mathematics. We build our algorithms based entirely on mathematical understanding of the problems of computer vision (which forces us to stay away for now from things like AI, machine learning, pattern recognition, fuzzy logic, etc). We develop our methods from scratch starting from the most elementary (or most fundamental if you prefer - I do) and try to do it in such a way that you wouldn’t have to redo it.

The algorithm detects and captures objects in images. But what is an object? The answer is quite obvious in the case of a binary image: it is either a connected cluster of black pixels on a white background or a connected cluster of white pixels on a black background. What about a gray scale image? Our approach follows human perception - a dark region on a light background may be an object, and so is a light region on a darker background (this approach is also valid for color images). For more see Objects in gray scale images.
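Pixcavator's internals aren't reproduced here, but the binary case can be sketched with a minimal connected-component labeling - a breadth-first flood fill that gathers each cluster of 1-pixels (the function name and 4-connectivity default are my illustrative choices, not the wiki's):

```python
from collections import deque

import numpy as np

def label_components(binary, connectivity=4):
    """Label connected clusters of 1-pixels in a binary image.

    Breadth-first flood fill; 4-connectivity by default,
    8-connectivity on request.
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    labels = np.zeros(binary.shape, dtype=int)
    current = 0
    rows, cols = binary.shape
    for r in range(rows):
        for c in range(cols):
            if binary[r, c] and labels[r, c] == 0:
                current += 1                      # start a new object
                labels[r, c] = current
                queue = deque([(r, c)])
                while queue:                      # flood the cluster
                    y, x = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels, current

img = np.array([[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 1],
                [1, 0, 0, 0]], dtype=bool)
labels, n = label_components(img)
print(n)  # 3 objects under 4-connectivity
```

The same routine finds the white-on-black objects when run on the complement of the image.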

It is clear that these "objects" aren't real objects. For example, black pants and a white shirt will be two separate objects. Then what's the point? This is the point:

Objects are building blocks instead of pixels.

In other words, we are taking care of the very first step in image analysis (see also Fields related to computer vision).

The existing methods of algebraic topology apply only to objects and images that have no attributes such as gray level, color, or time, i.e., still binary images. Instead of trying to generalize these methods for gray scale, then for color images, then for videos, etc, we adopt the following general approach to "parametric images".

Every image should be represented as a combination of binary images.
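One way this representation can be sketched is thresholding at every gray level, so that a grayscale image becomes a nested stack of binary "slices" - dark regions on a light background show up as connected clusters in the early slices. This is a minimal sketch under that lower-level-set assumption; the function name is mine, not the wiki's:

```python
import numpy as np

def lower_level_sets(gray):
    """Decompose a grayscale image into nested binary images.

    Slice t contains every pixel with gray level <= t, so dark
    regions appear as connected clusters in the early slices.
    Nothing is lost: the image is recoverable from its slices.
    """
    levels = range(int(gray.min()), int(gray.max()) + 1)
    return {t: gray <= t for t in levels}

gray = np.array([[9, 9, 9],
                 [9, 2, 9],
                 [9, 9, 9]])
slices = lower_level_sets(gray)

# The darkest slice contains only the single dark pixel.
assert slices[2].sum() == 1

# Lossless: each pixel's gray level is the lowest threshold at
# which it first appears in a slice.
recon = sum(~s for s in slices.values()) + int(gray.min())
assert (recon == gray).all()
```

Upper-level sets (`gray >= t`) play the symmetric role for light objects on a dark background.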

This approach leads to another general principle:

Every technique should first be applied to binary images.

Examples are Motion tracking, Stereo vision, Image search, etc (for more see our software projects).

The mathematical tools that we use make the following goal possible.

Image analysis should be lossless.

This term is justified by the fact that nothing is removed from or ignored in the image unless specifically requested by the user. The user retains complete control of what happens! (See also Computation error.)

What the user chooses to keep is preserved without deformation, smoothing, blurring, etc. There is also no iteration, (almost) no approximation, no interpolation, and no floating-point arithmetic! This is one of many things that differentiate our approach from those common in the computer vision and image processing industry. Patent pending...

The algorithm detects objects in the image and finds their locations and measurements. Its first version is for Binary Images and the second for Grayscale Images. Extensions of the algorithm to other types of images are in progress.

As a mathematician I am always skeptical about the applicability of a particular piece of mathematics to a particular real life problem. So, a couple of disclaimers.

First, the losslessness of the topological analysis has a flip side. The algorithm will never treat two objects as one, no matter how close they are (though that's rarely an issue in gray scale images). For example, a thin scratch cuts an object in two, and a person's shirt and pants are always treated separately. So, the first limitation of our approach is the following.

We do not attempt to group objects.

The simplest way to group two adjacent objects into one is via dilation/erosion. At this time this part of our approach has not been sufficiently developed. In the near future we will address how morphological operations can evaluate Robustness of topology.
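As a sketch of that grouping idea, a morphological closing (dilation followed by erosion) merges components separated by a small gap. This is my minimal illustration in pure numpy, not the wiki's implementation; a 3x3 square structuring element is assumed:

```python
import numpy as np

def dilate(binary):
    """Dilation with a 3x3 square structuring element, via shifts."""
    h, w = binary.shape
    padded = np.pad(binary, 1)
    out = np.zeros_like(binary)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # OR in the image shifted by (dy, dx)
            out |= padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return out

def erode(binary):
    """Erosion is dilation of the complement, complemented."""
    return ~dilate(~binary)

# Two clusters separated by a one-pixel gap...
img = np.array([[1, 1, 0, 1, 1]], dtype=bool)

# ...become a single object after closing (dilate, then erode).
closed = erode(dilate(img))
assert closed.all()
```

Comparing the objects found before and after closing is one way to judge which gaps are "noise" and which separations are real.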

The second limitation is more profound.

We do not attempt to extract 3D information from a 2D image.

This means that the analysis is meaningful only when the image can be interpreted as 2-dimensional. (Of course, we do extract 3D information from multiple 2D images of the same scene - via stereo vision.)

Examples of appropriate images are:

In inappropriate images the third dimension is essential. They may contain:

  • occluded objects,
  • objects well lit on one side and very dark on the other,
  • X-rays.

Mugshots may serve as an example of appropriate images of 3D objects because they are always taken at the same angle with similar lighting. For the same reason, applications in machine vision for industrial inspection are also appropriate. Images of 3D objects may contain 2D items: text, scratches, other imperfections of the photograph or the lenses, noise, etc.

For more, check out our image gallery.

Finally, why would this "theory" ever be of any use? For the answer I'll refer you to this old essay The Unreasonable Effectiveness of Mathematics in the Natural Sciences (or just run Pixcavator [2]).


For a general discussion of computer vision issues, see Related approaches (also Human vision vs. computer vision).

To get started with the wiki, continue to Binary Images.

See also our blog: Computer Vision for Dummies
