This site contains: mathematics courses and book; covers: image analysis, data analysis, and discrete modelling; provides: image analysis software. Created and run by Peter Saveliev.

# Topological data analysis

Revision as of 15:46, 15 November 2011; view current revision

Suppose we have conducted 1000 experiments with a set of 100 various measurements in each. Then each experiment is a string of 100 numbers or simply a vector of dimension 100. The result is a collection of disconnected 1000 points, called a point cloud, in the 100-dimensional Euclidean space.

It is impossible to visualize this data as any representation is limited to dimension 3 (by using colors one gets 6, time 7). Yet we still need to answer questions about the object behind the point cloud:

• Is it one piece or more?
• Is there a tunnel?
• Or a void?
• And what about possible 100-dimensional topological features?

Through clustering (and related approaches) statistics answers the first question. The rest need a topological approach to the problem.

For a point cloud in a euclidean space, suppose we are given a threshold $r$ so that any two points within $r$ from each other are to be considered "close". Then each pair of such points is connected by an edge. If three points are “close”, we add a face, etc. The result is a simplicial complex that approximates the, possible, manifold $M$ behind the point cloud.

The Betti numbers count the number of topological features in $M$: the number of connected components, the number tunnels, the number of voids, etc. This information is contained in the homology group of the complex.

Further, to deal with noise and other uncertainty one needs to evaluate the significance of these topological features. For each value of the threshold $r$ we build a separate simplicial complex, then combine the homology groups of these complexes in a single structure, and count the features with a high measure of robustness. This measure, called persistence, is the length of the interval of values of $r$ for which the topological feature is present.

Even more important than these "global" properties may be the local topology of the data. For example, in both of the images above the datasets are 3-dimensional but what's behind is 2-dimensional (surface). This is called manifold learning and is related to dimensionality reduction. See this recent project: The topology of data.

Some issues with the traditional approaches to manifold learning are:

• They rely on local linearity of the dataset and may fail when the manifold is highly curved. But it is in exactly this situation when finding this manifold is especially important. Indeed, knowing the manifold allows us to measure distances along geodesics of this manifold.
• They assume that the dataset is a manifold in the first place. There are numerous examples of datasets where this assumption is violated, such as graphs.
• They assume that the topological space behind the data set inherits the topology of the Euclidean space that is supposed to contain it.
• Even when non-Euclidean metrics (such as the Manhattan metric) are used, the topology is still Euclidean. The possibility of non-Euclidean topology is ignored.

Broader issues are discussed under Computational topology.

Related topics:

Compare to Geometry of data.