Topological features applied to the MNIST data set
Persistent homology is a fascinating mathematical tool that continues to be studied, developed, and applied. The purpose of this article is to give a friendly introduction to using persistent homology that does not require substantial knowledge of topological methods.
To illustrate the use of persistent homology in machine learning, we apply it to the MNIST data set of handwritten digits. It is an example of extracting topological features to distinguish between images of handwritten digits. The diagram in Figure 2 illustrates the main ideas underlying the proposed technique, which are discussed in greater detail in this article.

The aim of this example is to demonstrate the classification potential of the technique and not to outperform the existing models for the classification of handwritten digits.
For a more interesting example of using this technique on a clinical data set to classify hepatic lesions, see [1]. A very similar approach can be applied to any point cloud data and can be generalized to higher dimensions.
I have made all the scripts that I wrote for this tutorial publicly available, including a processed version of the data set. I also use a publicly available package that implements the computation of persistent homology.
Motivation
Topology applied to real-world data sets through persistent homology has begun to find applications in machine learning, including deep learning [2]. It is mainly used as a pre-processing step to provide robust topological features for learning.
Our data is often a finite set of noisy samples from some underlying space. The topological techniques developed so far mostly deal with point clouds, i.e. finite sets of data points in space.
Point clouds are typically produced by a variety of imaging devices, such as MRI or CT scanners. With the greater availability of such data capture devices, this type of data is being generated at an increasing rate. The data sets are often also very noisy and contain a lot of missing information, especially biological data sets.
Our ability to analyze this data, both in terms of the amount and the nature of the data, is clearly out of step with the data we generate [3]. Topology can make a useful contribution to the analysis of such data sets and can be especially helpful in studying them qualitatively.
Terminology
A list of short definitions which we will expand later when necessary:
Topology is a branch of mathematics that deals with qualitative geometric information. This includes the classification of loops and higher-dimensional surfaces.
Topological data analysis and computational topology deal with the study of topology using a computer.
Persistent homology is an algebraic method for discerning topological features of data. A connected component (or connected cluster of points) is a 0-dimensional feature and a cycle (or loop) is a 1-dimensional feature.
A simplicial complex is a set composed of points, line segments, triangles, and their n-dimensional counterparts.
A filtration is a sequence of simplicial complexes, with an inclusion map from each simplicial complex to the next.
A barcode is a visual representation of the persistence of the topological features. Longer bars represent significant features of the data. Shorter bars are due to irregularities or noise.
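To make the 0-dimensional and 1-dimensional features concrete, here is a minimal sketch that counts them for a graph, that is, a simplicial complex containing only points and line segments. It is not taken from the tutorial's scripts: the function name `betti_numbers` and the two toy graphs are illustrative assumptions. For a graph, the number of independent loops equals the number of edges minus the number of vertices plus the number of connected components.

```python
def betti_numbers(vertices, edges):
    """Return (b0, b1) for a graph: connected components and independent loops."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    # Euler characteristic of a graph: V - E = b0 - b1.
    loops = len(edges) - len(parent) + components
    return components, loops


# A hollow triangle: one component and one loop (like the digit 0).
triangle = betti_numbers([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
# Two triangles sharing a vertex: one component and two loops (like the digit 8).
figure_eight = betti_numbers(
    [0, 1, 2, 3, 4],
    [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)],
)
print(triangle, figure_eight)
```

These plain counts are what a purely topological description gives; persistent homology additionally records when each feature appears in a filtration.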
Introduction
The main problem we are trying to solve is how to extract topological features that can be used as input to standard machine learning algorithms. We use an approach similar to the one described in [1].
From each image, we first construct a graph, where pixels of the image correspond to vertices of the graph and we add edges between adjacent pixels.
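The graph construction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the tutorial's repository: the function name `pixel_graph` is hypothetical, pixels are given as (row, col) pairs, and adjacency is taken to mean the 8 surrounding pixels. Because the article later says that cycles of length 3 are removed, the sketch skips a diagonal edge whenever the two pixels already share a horizontal or vertical neighbour, which is one possible way to avoid triangles.

```python
def pixel_graph(pixels):
    """Build (vertices, edges) from a set of (row, col) foreground pixels."""
    pixels = set(pixels)
    edges = set()
    for (r, c) in pixels:
        # Horizontal and vertical neighbours are always connected.
        for q in ((r, c + 1), (r + 1, c)):
            if q in pixels:
                edges.add(((r, c), q))
        # Diagonal neighbours are connected only if no shared horizontal or
        # vertical neighbour already links them (this avoids 3-cycles).
        for dc in (-1, 1):
            q = (r + 1, c + dc)
            if q in pixels and (r + 1, c) not in pixels and (r, c + dc) not in pixels:
                edges.add(((r, c), q))
    return sorted(pixels), sorted(edges)


# An L-shaped corner: the diagonal is skipped because (1, 0) links both ends.
corner_vertices, corner_edges = pixel_graph({(0, 0), (1, 0), (1, 1)})
# Two pixels touching only at a corner: the diagonal edge is kept.
diag_vertices, diag_edges = pixel_graph({(0, 1), (1, 0)})
print(corner_edges, diag_edges)
```

Only "forward" neighbours (right, down, and the two lower diagonals) are inspected, so each edge is produced exactly once.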
A purely topological classification cannot distinguish between individual digits, as the digits are topologically too similar. For example, the digits 1, 2, and 3 are topologically the same if we use this style for writing numbers. Persistent homology, however, gives us more information.
We define a filtration on the vertices of the graph corresponding to the image pixels, adding vertices and edges as we sweep across the image in the vertical or horizontal direction. This adds spatial information to the topological features. For example, though 6 and 9 both have a single loop, the loop appears at different locations in the filtration.
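A sweep filtration of this kind can be sketched as follows. The details are assumptions made for illustration: the function name `sweep_filtration`, the `size` parameter (28 for MNIST images), and the exact value assigned for each direction are not taken from the article. Vertices are (x, y) points with y increasing upward, a vertex enters when the sweep line reaches it, and an edge enters once both of its endpoints are present.

```python
def sweep_filtration(vertices, edges, direction="top", size=28):
    """Return a sorted list of (value, simplex) pairs for one sweep direction.

    vertices are (x, y) points and edges are pairs of vertices.
    """
    height = {
        "top": lambda p: p[1],                # sweep upward: low y enters first
        "bottom": lambda p: size - 1 - p[1],  # sweep downward
        "right": lambda p: p[0],              # sweep from left to right
        "left": lambda p: size - 1 - p[0],    # sweep from right to left
    }[direction]
    stream = [(height(v), (v,)) for v in vertices]
    # An edge can only enter once both of its endpoints are present.
    stream += [(max(height(u), height(v)), (u, v)) for u, v in edges]
    # Sort by value, and put vertices before edges at equal value.
    stream.sort(key=lambda item: (item[0], len(item[1]), item[1]))
    return stream


points = [(0, 0), (1, 0), (1, 1)]
segments = [((0, 0), (1, 0)), ((1, 0), (1, 1))]
upward = sweep_filtration(points, segments, direction="top")
leftward = sweep_filtration(points, segments, direction="left", size=2)
for value, simplex in upward:
    print(value, simplex)
```

The same graph yields a different ordering for each direction, which is how the spatial position of a loop ends up in the barcode.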
We then compute the persistent homology of the simplex stream given by the filtration to get the so-called Betti barcodes. The persistent homology was computed using the computational topology package Dionysus 2.
The Betti k barcode is a finite set of intervals. Each interval records the first filtration level at which a topological feature of dimension k appears and the filtration level at which it disappears. These are called the birth and death times of the topological feature, respectively.
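For a graph, the Betti 0 and Betti 1 barcodes can be computed with a union-find structure, which makes the birth and death times easy to see. The sketch below is a pure-Python stand-in written for illustration and is not the Dionysus 2 computation used in the article; the function name `graph_barcodes` and the toy "digit 8" are assumptions. A vertex gives birth to a component, an edge joining two components kills the younger one, and an edge inside one component gives birth to a loop that never dies because a graph has no triangles to fill it. Bars of zero length are dropped here as a simplifying choice.

```python
import math


def graph_barcodes(stream):
    """Compute Betti 0 and Betti 1 barcodes of a graph filtration.

    stream is a list of (value, simplex) pairs sorted by value, with every
    vertex listed before the edges that use it.
    """
    parent, birth = {}, {}
    betti0, betti1 = [], []

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for value, simplex in stream:
        if len(simplex) == 1:
            v = simplex[0]
            parent[v], birth[v] = v, value    # a new component is born
            continue
        ru, rv = find(simplex[0]), find(simplex[1])
        if ru == rv:
            betti1.append((value, math.inf))  # the edge closes a loop
            continue
        # Elder rule: the younger component dies, the older one survives.
        old, young = (ru, rv) if birth[ru] <= birth[rv] else (rv, ru)
        if birth[young] < value:              # skip zero-length bars
            betti0.append((birth[young], value))
        parent[young] = old
    # Components that never merge live forever.
    roots = {find(v) for v in parent}
    betti0 += [(birth[r], math.inf) for r in roots]
    return sorted(betti0), sorted(betti1)


# A toy "8": two diamonds stacked on top of each other, swept upward (value = y).
a, b, c, d, e, f, g = (1, 0), (0, 1), (2, 1), (1, 2), (0, 3), (2, 3), (1, 4)
edges = [(a, b), (a, c), (b, d), (c, d), (d, e), (d, f), (e, g), (f, g)]
stream = [(v[1], (v,)) for v in (a, b, c, d, e, f, g)]
stream += [(max(u[1], v[1]), (u, v)) for u, v in edges]
stream.sort(key=lambda item: (item[0], len(item[1])))
betti0, betti1 = graph_barcodes(stream)
print(betti0, betti1)
```

In the toy example there is a single infinite Betti 0 bar starting at the lowest point, and one infinite Betti 1 bar for each loop, born at the height where that loop closes.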
From each k-dimensional barcode we extract 4 features based on the invariants discussed in [1]. We do this for each of the 4 sweep directions (top, bottom, right, left) and for dimensions 0 and 1, which gives a total of 4 × 2 × 4 = 32 features per image. We extract the features from a set of images and then apply a support vector machine (SVM) to classify the images.
We show an example of applying this technique to one image of a handwritten digit and then give the empirical classification results on a subset of the MNIST database.
Extraction of topological features from handwritten digits

The pre-processing steps we used, shown in Figure 3, are the following:
- Produce the binary image by thresholding.
- Reduce the binary image to a skeleton of 1-pixel width, which exposes its topology, using the popular Zhang-Suen thinning algorithm.
- Transform the pixels of the skeleton to points in the plane.
- Construct an embedded graph G in the plane: treat the points as vertices, add edges between adjacent points, and then remove all cycles of length 3. Intuitively, we connect the points while trying not to create new topological features.
- Construct a simplex stream for computing the persistent homology using a filtration on the vertices of the graph G. The filtration is defined as follows: we add the vertices and edges of the embedded graph G as we sweep across the plane. In this example, we sweep across the plane toward the top, in the vertical direction.
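The first three steps above can be sketched in plain Python. The article's scripts rely on skimage for image pre-processing, so the code below is only an illustrative, dependency-free version: the function names, the threshold level of 128, and the convention that y increases upward are assumptions. The thinning function follows the two sub-iterations of the Zhang-Suen algorithm, where P2 to P9 are the 8 neighbours of a pixel listed clockwise starting from the pixel directly above it.

```python
def threshold(image, level=128):
    """Return the set of (row, col) pixels whose intensity is at least level."""
    return {(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value >= level}


def zhang_suen_thin(pixels):
    """Thin a set of (row, col) foreground pixels to a 1-pixel-wide skeleton."""
    pixels = set(pixels)

    def neighbours(r, c):
        # P2..P9: clockwise, starting from the pixel directly above.
        ring = [(r - 1, c), (r - 1, c + 1), (r, c + 1), (r + 1, c + 1),
                (r + 1, c), (r + 1, c - 1), (r, c - 1), (r - 1, c - 1)]
        return [1 if p in pixels else 0 for p in ring]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = set()
            for (r, c) in pixels:
                p2, p3, p4, p5, p6, p7, p8, p9 = n = neighbours(r, c)
                b = sum(n)  # number of foreground neighbours
                # Number of 0 -> 1 transitions around the ring P2, ..., P9, P2.
                a = sum(1 for i in range(8) if n[i] == 0 and n[(i + 1) % 8] == 1)
                if step == 0:
                    ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                else:
                    ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                if 2 <= b <= 6 and a == 1 and ok:
                    to_delete.add((r, c))
            if to_delete:
                # Delete only after the whole pass, as the algorithm requires.
                pixels -= to_delete
                changed = True
    return pixels


def to_points(pixels, height):
    """Map (row, col) pixels to (x, y) points with y increasing upward."""
    return sorted((c, height - 1 - r) for r, c in pixels)


# A 3-pixel-thick horizontal stroke inside a 5 x 9 image.
image = [[255 if 1 <= r <= 3 and 1 <= c <= 7 else 0 for c in range(9)]
         for r in range(5)]
skeleton = zhang_suen_thin(threshold(image))
points = to_points(skeleton, height=len(image))
print(points)
```

The thick stroke collapses to a short horizontal run of single pixels along its middle row, which is then turned into points in the plane.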
Given the simplex stream from the final pre-processing step, we compute the persistent homology to get the so-called Betti barcodes shown in Figure 4. See also Figure 2, which uses the same example.

The Betti 0 barcode consists of one interval, [4.0, inf), which shows the single connected component, born at time 4 when the sweep detects the first point.
The Betti 1 barcode consists of two intervals, [12, inf) and [20, inf). The birth times 12 and 20 correspond to the births of the 2 cycles in the drawing of the number 8 (the value of y at which each loop closes) as we sweep to the top.
Denote the endpoints of the Betti barcode intervals by (x_i, y_i), for i = 1, ..., n,

where x represents the beginning and y the end of each interval. From the endpoints we compute 4 features, based on the invariants discussed in [1], that take into account all of the bars, their lengths, and their endpoints.
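The feature formulas themselves are not reproduced in this copy of the article, so the following sketch should be read as an assumption rather than as the author's definition. It uses four polynomial expressions in the interval endpoints x and y of the kind associated with the barcode invariants of [1], and it caps infinite intervals at a maximum filtration value `y_max` (a hypothetical parameter, for example the image size) so that every bar has a finite length. The exact expressions should be checked against [1] before relying on them.

```python
import math


def barcode_features(bars, y_max):
    """Four polynomial features of one barcode (formulas assumed, see above)."""
    # Cap infinite bars at y_max so that every bar has a finite length.
    capped = [(x, min(y, y_max)) for x, y in bars]
    f1 = sum(x * (y - x) for x, y in capped)
    f2 = sum((y_max - y) * (y - x) for x, y in capped)
    f3 = sum(x ** 2 * (y - x) ** 4 for x, y in capped)
    f4 = sum((y_max - y) ** 2 * (y - x) ** 4 for x, y in capped)
    return [f1, f2, f3, f4]


# The Betti 1 barcode of the digit 8 from the example, capped at y_max = 28.
loops = [(12, math.inf), (20, math.inf)]
loop_features = barcode_features(loops, y_max=28)
# A single finite bar, for comparison.
finite_features = barcode_features([(1, 3)], y_max=5)
print(loop_features, finite_features)
```

Each expression weights the bar lengths by where the bar starts or ends, so two barcodes with the same number of bars but different birth times give different feature values.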

For each of the 4 sweep directions (top, bottom, right, left) and for each k-dimensional barcode, k = 0, 1, we compute the 4 features. This gives us a total of 32 features per image.
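Assembling the 32-dimensional feature vector is then a matter of fixing an order for the directions and dimensions. The sketch below is illustrative: the names `DIRECTIONS`, `DIMENSIONS`, and `feature_vector` are hypothetical, and the `toy_summary` function is only a stand-in for the real 4 features so that the example runs on its own.

```python
import math

DIRECTIONS = ("top", "bottom", "right", "left")
DIMENSIONS = (0, 1)


def feature_vector(barcodes, features_of):
    """Concatenate 4 features for each (direction, dimension) pair.

    barcodes maps (direction, k) to a list of (birth, death) intervals;
    features_of maps one such list to a list of 4 numbers.
    """
    vector = []
    for direction in DIRECTIONS:
        for k in DIMENSIONS:
            values = features_of(barcodes.get((direction, k), []))
            if len(values) != 4:
                raise ValueError("expected 4 features per barcode")
            vector.extend(values)
    return vector


def toy_summary(bars):
    """Stand-in features (NOT the ones from [1]): count and birth statistics."""
    births = [birth for birth, _ in bars]
    return [len(births), sum(births), min(births, default=0), max(births, default=0)]


# Only the upward sweep of the digit 8 example is filled in here.
barcodes = {
    ("top", 0): [(4, math.inf)],
    ("top", 1): [(12, math.inf), (20, math.inf)],
}
vector = feature_vector(barcodes, toy_summary)
print(len(vector), vector[:8])
```

Keeping the order fixed across images matters, because the classifier treats each position in the vector as the same feature for every image.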
Empirical classification results on a subset of the MNIST database
To create a data set for input to a standard machine learning algorithm, we extracted topological features for 10000 images of handwritten digits. We split the data set 50:50 into a training set and a test set, so that each had 5000 examples, and classified the images based only on the extracted topological features, using an SVM with an RBF kernel.
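The evaluation protocol described here can be sketched with scikit-learn. This is not the article's Classification.ipynb notebook: the feature matrix below is synthetic data from `make_classification` standing in for the real 32 topological features, and the feature scaling step, the default SVM hyperparameters, and the random seeds are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix: 32 features, 10 classes.
X, y = make_classification(n_samples=1000, n_features=32, n_informative=16,
                           n_classes=10, n_clusters_per_class=1, random_state=0)

# 50:50 split into training and test sets, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Scaling is an assumption here: RBF kernels are sensitive to feature ranges.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 10-fold cross-validation on the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print("CV accuracy: %.2f (std %.2f)" % (cv_scores.mean(), cv_scores.std()))

# Fit on the full training set and evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: %.2f" % test_accuracy)
```

The numbers printed by this sketch describe the synthetic data only and are unrelated to the accuracies reported in the article.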
Accuracy on the training set using 10-fold cross-validation was 0.88 (+/- 0.05). Accuracy on the test set was 0.89. We examined the common misclassifications.
There were 3 examples of the number 2 being mistaken for the number 0, shown in Figure 5. The reason was that the number 2 was written with a loop that appears in a region close to where the loop of the number 0 appears.

For number 5 we got the lowest F1 score of 0.75. It was misclassified as number 2 in 32 examples in the test set. The first three examples are shown in Figure 6. This was expected since these two numbers are topologically the same with no topological features (e.g. loops) appearing in different regions.

The number 8 was misclassified as 4 in 21 examples from the test set. The first three examples are shown in Figure 7. We see the stylistic problems that caused the misclassifications: the top loop of the number 8 was not closed, which made it topologically more similar to a number 4 written with a loop.

Source code and data
The repository containing all the Python scripts that I wrote for this tutorial, including the processed version of the data set, is available here:
https://github.com/markolalovic/tda-digits
Its dependencies are:
Python (2 or 3)
Dionysus 2 for computing persistent homology
Boost version 1.55 or higher for Dionysus 2
matplotlib for plotting the data
networkx for plotting the graphs
numpy for loading data and computing
sklearn for machine learning algorithms
skimage for image pre-processing
mayavi if you want to draw the torus from Figure 1 in 3D
To generate the figures in the example of topological feature extraction, run scripts/tda_digits.py:
$ cd scripts
$ ./tda_digits.py
For details on how to use the functions and classes, see the Jupyter notebooks Example.ipynb and Classification.ipynb in the scripts directory.
Acknowledgments
This work was part of the project I did for the Summer School on Computational Topology and Topological Data Analysis in Ljubljana. This project was presented by Dr. Primoz Skraba.
Source: https://medium.com/swlh/topological-features-applied-to-mnist-data-set-d3d01bdb2298