Note: this post reproduces English source material on "Linear Algebra · SVD" (originally accompanied by an unproofread machine translation). If anything looks off, please refer to the original.
Because of CSDN's length limit, the post is split into two parts; this is part 1.
Annoying Precision
“A good stock of examples, as large as possible, is indispensable for a thorough understanding of any concept, and when I want to learn something new, I make it my first job to build one.” – Paul Halmos
Singular value decomposition
March 13, 2017 by Qiaochu Yuan
As a warm-up to the subject of this blog post, consider the problem of how to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to change of basis in both the source ($\mathbb{R}^m$) and the target ($\mathbb{R}^n$). In other words, the problem is to describe the equivalence classes of the equivalence relation on $n \times m$ matrices given by

$$M \sim N \Leftrightarrow M = P N Q^{-1}, \quad P \in GL_n(\mathbb{R}),\ Q \in GL_m(\mathbb{R}).$$
It turns out that the equivalence class of $M$ is completely determined by its rank $r = \mathrm{rank}(M)$. To prove this we construct some bases by induction. For starters, let $x_1 \in \mathbb{R}^m$ be a vector such that $y_1 = M x_1 \neq 0$; this is always possible unless $M = 0$. Next, let $x_2 \in \mathbb{R}^m$ be a vector such that $y_2 = M x_2$ is linearly independent of $y_1$; this is always possible unless $\mathrm{rank}(M) = 1$.
Continuing in this way, we construct vectors $x_1, \dots, x_r \in \mathbb{R}^m$ such that the vectors $y_1 = M x_1, \dots, y_r = M x_r \in \mathbb{R}^n$ are linearly independent, hence a basis of the column space of $M$. Next, we complete the $x_i$ and $y_i$ to bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ in whatever manner we like (for the claim below to hold literally, complete the $x_i$ by a basis of the kernel of $M$, which is complementary to $\mathrm{span}(x_1, \dots, x_r)$; the $y_i$ really can be completed arbitrarily). With respect to these bases, $M$ takes a very simple form: we have $M x_i = y_i$ if $1 \leq i \leq r$ and otherwise $M x_i = 0$. Hence, in these bases, $M$ is a block matrix where the top left block is an $r \times r$ identity matrix and the other blocks are zero.
Explicitly, this means we can write $M$ as a product

$$M = P D Q^{-1}, \quad P \in GL_n(\mathbb{R}),\ Q \in GL_m(\mathbb{R}),$$
where $D$ has the block form above, the columns of $P$ are the basis of $\mathbb{R}^n$ we found by completing $M x_1, \dots, M x_r$, and the columns of $Q$ are the basis of $\mathbb{R}^m$ we found by completing $x_1, \dots, x_r$. This decomposition can be computed by row and column reduction on $M$, where the row operations we perform give $P$ and the column operations we perform give $Q$.
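As a concrete illustration (not part of the original post), here is a minimal NumPy sketch of the inductive construction just described; the function name `rank_normal_form` and the example matrix are my own, and a practical implementation would use row and column reduction rather than repeated calls to `matrix_rank`.

```python
import numpy as np

def rank_normal_form(M, tol=1e-10):
    """Sketch of the inductive construction: returns P, D, Q with
    M = P @ D @ inv(Q), where D is the block matrix [[I_r, 0], [0, 0]]."""
    n, m = M.shape
    xs, ys = [], []                               # the x_i and y_i = M x_i
    for e in np.eye(m):                           # try standard basis vectors
        if np.linalg.matrix_rank(np.column_stack(ys + [M @ e]), tol=tol) > len(ys):
            xs.append(e); ys.append(M @ e)        # M e independent of previous y_j
    r = len(xs)
    X = np.column_stack(xs) if r else np.zeros((m, 0))
    Y = np.column_stack(ys) if r else np.zeros((n, 0))

    # Complete the x_i to a basis of R^m by vectors in the kernel of M:
    # for a candidate e, subtract off the part of M e lying in span(y_i).
    ext_x = []
    for e in np.eye(m):
        c = np.linalg.lstsq(Y, M @ e, rcond=None)[0] if r else np.zeros(0)
        x = e - X @ c if r else e                 # now M x = 0
        if np.linalg.matrix_rank(np.column_stack(xs + ext_x + [x]), tol=tol) > r + len(ext_x):
            ext_x.append(x)

    # Complete the y_i to a basis of R^n in any manner at all.
    ext_y = []
    for e in np.eye(n):
        if np.linalg.matrix_rank(np.column_stack(ys + ext_y + [e]), tol=tol) > r + len(ext_y):
            ext_y.append(e)

    P = np.column_stack(ys + ext_y)               # basis of R^n: y_1..y_r, then extras
    Q = np.column_stack(xs + ext_x)               # basis of R^m: x_1..x_r, then kernel vectors
    D = np.zeros((n, m)); D[:r, :r] = np.eye(r)
    return P, D, Q

M = np.array([[1., 2., 3.],
              [2., 4., 6.]])                      # a rank-1 example
P, D, Q = rank_normal_form(M)
print(np.allclose(M, P @ D @ np.linalg.inv(Q)))   # True
```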
Conceptually, the question we’ve asked is: what does a linear transformation $T : X \to Y$ between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of $X$ or $Y$? The answer, stated in a basis-independent form, is the following. First, we can factor $T$ as a composite

$$X \stackrel{p}{\to} \mathrm{im}(T) \stackrel{i}{\to} Y$$
where $\mathrm{im}(T)$ is the image of $T$. Next, we can find direct sum decompositions $X \cong \mathrm{im}(T) \oplus X'$ and $Y \cong \mathrm{im}(T) \oplus Y'$ such that $p$ is the projection of $X$ onto its first factor and $i$ is the inclusion of the first factor into $Y$. Hence every linear transformation “looks like” a composite

$$\mathrm{im}(T) \oplus X' \stackrel{p_{\mathrm{im}(T)}}{\to} \mathrm{im}(T) \stackrel{i_{\mathrm{im}(T)}}{\to} \mathrm{im}(T) \oplus Y'$$
of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in $T$ is the dimension of the image $\mathrm{im}(T)$, or equivalently the rank of $T$. (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)
The actual problem this blog post is about is more interesting: it is to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to orthogonal change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

$$M \sim N \Leftrightarrow M = U N V^{-1}, \quad U \in O(n),\ V \in O(m).$$
Conceptually, we’re now asking: what does a linear transformation $T : X \to Y$ between finite-dimensional Hilbert spaces “look like”?
Inventing singular value decomposition
As before, we’ll answer this question by picking bases with respect to which $M$ is as easy to understand as possible, only this time we need to deal with the additional restriction of choosing orthonormal bases. We will follow roughly the same inductive strategy as before. For starters, we would like to pick a unit vector $v_1 \in \mathbb{R}^m$, $\|v_1\| = 1$, such that $M v_1 \neq 0$; this is possible unless $M$ is identically zero, in which case there’s not much to say. Now, there’s no guarantee that $M v_1$ will be a unit vector, but we can always use

$$u_1 = \frac{M v_1}{\|M v_1\|} \in \mathbb{R}^n$$
as the beginning of an orthonormal basis of $\mathbb{R}^n$. The question remains which of the many possible values of $v_1$ to use. In the previous argument it didn’t matter because they were all related by change of coordinates, but now it very much does because the length $\|M v_1\|$ may differ for different choices of $v_1$. A natural choice is to pick $v_1$ so that $\|M v_1\|$ is as large as possible (hence equal to the operator norm (https://en.wikipedia.org/wiki/Operator_norm) $\|M\|$ of $M$); writing $\sigma_1 = \|M v_1\|$, we then have

$$M v_1 = \sigma_1 u_1, \quad \|v_1\| = \|u_1\| = 1.$$
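As a quick numerical check (mine, not the post's; it uses NumPy's built-in SVD as an oracle rather than solving the maximization directly), the largest value of $\|M v\|$ over unit vectors is attained at the first right singular vector and equals the operator norm:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))

# v_1 maximizes ||M v|| over unit vectors; numerically it is the first
# right singular vector returned by np.linalg.svd.
U, s, Vt = np.linalg.svd(M)
v1 = Vt[0]                         # first right singular vector
sigma1 = np.linalg.norm(M @ v1)    # = ||M v_1||

print(np.isclose(sigma1, s[0]))                  # equals the first singular value
print(np.isclose(sigma1, np.linalg.norm(M, 2)))  # equals the operator norm of M

# Sanity check: no random unit vector does better.
vs = rng.standard_normal((3, 10000))
vs /= np.linalg.norm(vs, axis=0)
print(np.linalg.norm(M @ vs, axis=0).max() <= sigma1 + 1e-12)
```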
$\sigma_1$ is called the first singular value of $M$, $v_1$ is called its first right singular vector, and $u_1$ is called its first left singular vector. (The singular vectors aren’t unique in general, but we’ll ignore this for now.) To continue building orthonormal bases we need to find a unit vector

$$v_2 \in \mathbb{R}^m, \quad \|v_2\| = 1, \quad \langle v_1, v_2 \rangle = 0$$
orthogonal to $v_1$ such that $M v_2$ is linearly independent of $M v_1$; this is possible unless $\mathrm{rank}(M) = 1$, in which case we’re already done and $M$ is completely describable as $M_1 = \sigma_1 u_1 \otimes v_1^{*}$; equivalently, in this case we have

$$M x = M_1 x = \sigma_1 u_1 \langle v_1, x \rangle.$$
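In coordinates, $u_1 \otimes v_1^{*}$ is just the outer product $u_1 v_1^{T}$, so the rank-one case reads $M x = \sigma_1 u_1 \langle v_1, x \rangle$. A tiny sketch (the particular vectors are made up for illustration):

```python
import numpy as np

u1 = np.array([3., 4.]) / 5.0          # unit vector in R^2
v1 = np.array([1., 2., 2.]) / 3.0      # unit vector in R^3
sigma1 = 7.0

M1 = sigma1 * np.outer(u1, v1)         # sigma_1 u_1 (x) v_1^*, a rank-1 matrix

x = np.array([0.5, -1.0, 2.0])
print(np.allclose(M1 @ x, sigma1 * u1 * np.dot(v1, x)))   # M x = sigma_1 u_1 <v_1, x>
print(np.linalg.matrix_rank(M1))                          # 1
```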
We’ll pick $v_2$ using the same strategy as before: we want the value of $v_2$ such that $\|M v_2\|$ is as large as possible. Note that since $M_1 v_2 = 0$, this is equivalent to finding the value of $v_2$ such that $\|(M - M_1) v_2\|$ is as large as possible. Call this largest possible value $\sigma_2$ and write

$$M v_2 = \sigma_2 u_2, \quad \|v_2\| = \|u_2\| = 1.$$
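As an aside (a numerical sketch of mine, again using NumPy's SVD as an oracle): since $M v = (M - M_1) v$ whenever $v \perp v_1$, this second maximum $\sigma_2$ is exactly the operator norm of the deflated matrix $M - M_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 4))

U, s, Vt = np.linalg.svd(M)
M1 = s[0] * np.outer(U[:, 0], Vt[0])          # sigma_1 u_1 v_1^T

# sigma_2 = max ||M v|| over unit v orthogonal to v_1
#         = max ||(M - M1) v|| over all unit v = ||M - M1||_op
print(np.isclose(np.linalg.norm(M - M1, 2), s[1]))   # True
```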
At this point we are in trouble unless $\langle u_1, u_2 \rangle = 0$; if this weren’t the case then our strategy would fail to actually build an orthonormal basis of $\mathbb{R}^n$. Very importantly, this turns out to be the case.
Key lemma #1: Suppose $v_1$ is a unit vector maximizing $\|M v_1\|$. Let $v$ be a unit vector orthogonal to $v_1$. Then $M v$ is also orthogonal to $M v_1$.
Proof. Consider the function

$$f(t) = \| M(v_1 \cos t + v \sin t) \|^2.$$
The vectors $v_1 \cos t + v \sin t$ are all unit vectors since $v_1, v$ are orthonormal, so by construction (of $v_1$) this function is maximized when $t = 0$. In particular, its derivative at $t = 0$ is zero. On the other hand, we can expand $f(t)$ out using dot products as

$$f(t) = \|M v_1\|^2 \cos^2 t + 2 \langle M v_1, M v \rangle \cos t \sin t + \|M v\|^2 \sin^2 t.$$
Now we can compute the first-order Taylor series expansion of this function around $t = 0$, giving

$$f(t) = \|M v_1\|^2 + 2 \langle M v_1, M v \rangle\, t + O(t^2),$$
so setting the first derivative at $t = 0$ to zero gives $2 \langle M v_1, M v \rangle = 0$, as desired.
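Key lemma #1 is easy to check numerically. The following sketch (the random matrix and tolerances are my own choices) verifies both the orthogonality $\langle M v_1, M v \rangle = 0$ and the vanishing derivative $f'(0) = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))

# v_1: unit vector maximizing ||M v|| (first right singular vector).
v1 = np.linalg.svd(M)[2][0]

# v: a random unit vector orthogonal to v_1.
w = rng.standard_normal(3)
v = w - np.dot(w, v1) * v1
v /= np.linalg.norm(v)

print(np.isclose(np.dot(M @ v1, M @ v), 0.0, atol=1e-10))   # key lemma #1

# f(t) = ||M (v_1 cos t + v sin t)||^2 has a critical point at t = 0.
f = lambda t: np.linalg.norm(M @ (v1 * np.cos(t) + v * np.sin(t))) ** 2
h = 1e-6
print(abs((f(h) - f(-h)) / (2 * h)) < 1e-4)                  # f'(0) ~ 0
```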
This is the technical heart of singular value decomposition, so it’s worth understanding in some detail. Michael Nielsen (http://cognitivemedium.com/emm/emm.html) has a very nice interactive demo/explanation of this. Geometrically, the points $M v_1 \cos t + M v \sin t$ trace out an ellipse centered at the origin, and by hypothesis $M v_1$ describes the semimajor axis of the ellipse: the point furthest away from the origin. As we move away from $t = 0$, to first order we are moving slightly in the direction of $M v$, and so if $M v$ were not orthogonal to $M v_1$ it would be possible to move slightly further away from the origin than $M v_1$ by moving either in the positive or negative $t$ direction, depending on whether the angle between $M v_1$ and $M v$ is greater than or less than $90^\circ$. The only way to ensure that moving in the direction of $M v$ does not, to first order, get us further away from the origin is if $M v$ is orthogonal to $M v_1$.
Note that this gives a proof that the semiminor axis of an ellipse – the point closest to the origin – is always orthogonal to its semimajor axis. We can think of key lemma #1 above as more or less equivalent to this fact, which in the plane is known as the principal axis theorem (https://en.wikipedia.org/wiki/Principal_axis_theorem), and which is closely related to, but slightly weaker than, the spectral theorem for symmetric matrices.
Thanks to key lemma #1, we can continue our construction. With $r = \mathrm{rank}(M)$ as before, we inductively produce orthonormal vectors $v_1, \dots, v_r \in \mathbb{R}^m$ such that $\|M v_i\|$ is maximized subject to the condition that $\langle v_i, v_j \rangle = 0$ for all $j \leq i - 1$, and write

$$M v_i = \sigma_i u_i, \quad \|u_i\| = 1,$$
where $\sigma_i$ is the maximum value of $\|M v\|$ over all unit vectors $v$ orthogonal to $v_1, \dots, v_{i-1}$; note that this implies that

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0.$$
The $\sigma_i$ are the singular values of $M$, the $v_i$ are its right singular vectors, and the $u_i$ are its left singular vectors. Repeated application of key lemma #1 shows that the $u_i$ are an orthonormal basis of the column space of $M$, so the construction stops here: $M$ is identically zero on the orthogonal complement of $\mathrm{span}(v_1, \dots, v_r)$, because if it weren’t then it would take a non-zero value orthogonal to $\mathrm{span}(u_1, \dots, u_r)$. This means we can write $M$ as a sum

$$M = \sum_{i=1}^{r} \sigma_i u_i \otimes v_i^{*}.$$
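The inductive construction can also be run numerically. The sketch below is my own brute-force rendering of it: at each step it maximizes $\|R v\|$ for the current deflated matrix $R = M - \sum_j \sigma_j u_j v_j^{T}$ (equivalent, by the argument above, to maximizing over the orthogonal complement of the earlier $v_j$), using power iteration on $R^T R$ as the maximizer; in practice one would of course just call `np.linalg.svd`.

```python
import numpy as np

def greedy_svd(M, tol=1e-12, iters=500):
    """Peel off singular triples one at a time by maximizing ||R v|| over
    unit vectors, where R is the deflated matrix M - sum_j sigma_j u_j v_j^T.
    The maximization here is power iteration on R^T R (my choice; any
    maximizer would do)."""
    R = M.astype(float)
    triples = []
    rng = np.random.default_rng(0)
    while True:
        v = rng.standard_normal(M.shape[1])
        for _ in range(iters):                    # power iteration on R^T R
            w = R.T @ (R @ v)
            nw = np.linalg.norm(w)
            if nw < tol:
                return triples                    # R ~ 0: construction stops
            v = w / nw
        sigma = np.linalg.norm(R @ v)
        if sigma < tol:
            return triples
        u = R @ v / sigma
        triples.append((sigma, u, v))
        R = R - sigma * np.outer(u, v)            # deflate and continue

M = np.array([[3., 1., 0.],
              [1., 3., 0.],
              [0., 0., 0.]])                      # a rank-2 example
triples = greedy_svd(M)
print([round(s, 6) for s, _, _ in triples])       # [4.0, 2.0]
approx = sum(s * np.outer(u, v) for s, u, v in triples)
print(np.allclose(M, approx))                     # M = sum_i sigma_i u_i v_i^T
```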
This is one version of the Singular Value Decomposition (SVD for short) of $M$, and it has the benefit of being as close to unique as possible. A more familiar version of SVD is obtained by completing the $v_i$ and $u_i$ to orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ (necessarily highly non-unique in general). With respect to these bases, $M$ takes, similarly to the warm-up, a block form where the top left block is the diagonal matrix with entries $\sigma_1, \dots, \sigma_r$ and the remaining blocks are zero. Hence we can write $M$ as a product

$$M = U \Sigma V^{T}, \quad U \in O(n),\ V \in O(m),$$
where $\Sigma$ has the above block form, $U$ has columns given by $u_1, \dots, u_n$, and $V$ has columns given by $v_1, \dots, v_m$.
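This is the form numerical libraries return directly. A short sketch using `np.linalg.svd` with `full_matrices=True`, which yields genuinely orthogonal $U \in O(n)$ and $V \in O(m)$ (the example matrix is mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 6
M = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(M, full_matrices=True)   # U: (n, n), Vt: (m, m)

Sigma = np.zeros((n, m))
Sigma[:len(s), :len(s)] = np.diag(s)              # block form: diag(sigma_i), rest zero

print(np.allclose(U @ U.T, np.eye(n)))            # U in O(n)
print(np.allclose(Vt @ Vt.T, np.eye(m)))          # V in O(m)
print(np.allclose(M, U @ Sigma @ Vt))             # M = U Sigma V^T
print(s)                                          # sigma_1 >= ... >= sigma_r > 0
```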
So, stepping back a bit: what have we learned about what a linear transformation $T : X \to Y$ between Hilbert spaces looks like? Up to orthogonal change of basis, we’ve learned that they all look like “weighted projections”: we are almost projecting onto the image as in the warm-up, except with weights given by the singular values $\sigma_i$ to account for changes in length. The only orthogonal-basis-independent information contained in a linear transformation turns out to be its singular values.
Looking for more analogies between singular value decomposition and the warm-up, we might think of the singular values as a quantitative refinement of the rank, since there are $r$ of them, where $r$ is the rank, and if some of them are small then $T$ is close (in the operator norm) to a linear transformation having lower rank.
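This can be made quantitative (the Eckart–Young theorem, which the post doesn't prove): truncating the smallest singular values gives a lower-rank matrix whose operator-norm distance from the original is exactly the largest discarded singular value. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(M)
k = 3
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # keep the k largest singular values

print(np.linalg.matrix_rank(M_k))                         # k
print(np.isclose(np.linalg.norm(M - M_k, 2), s[k]))       # error = sigma_{k+1}
```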
Geometrically, one way to describe the answer provided by singular value decomposition to the question “what does a linear transformation look like” is that the key to understanding $T$ is to understand what it does to the unit sphere of $X$. The image of the unit sphere is an $r$-dimensional ellipsoid, and its principal axes have directions given by the left singular vectors $u_i$ and lengths given by the singular values $\sigma_i$. The right singular vectors $v_i$ map to these principal axes.
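A two-dimensional sketch of this picture (the example matrix is mine): push the unit circle through $M$ and check that the farthest and closest points of the image ellipse have lengths $\sigma_1$ and $\sigma_2$ and line up with the left singular vectors.

```python
import numpy as np

M = np.array([[2., 1.],
              [0., 1.]])                          # an invertible 2x2 example

U, s, Vt = np.linalg.svd(M)

theta = np.linspace(0, 2 * np.pi, 20001)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # unit circle in R^2
image = M @ circle                                    # an ellipse centred at 0
radii = np.linalg.norm(image, axis=0)

print(np.isclose(radii.max(), s[0], atol=1e-6))   # semimajor axis length = sigma_1
print(np.isclose(radii.min(), s[1], atol=1e-6))   # semiminor axis length = sigma_2

# The farthest image point is (up to sign) sigma_1 u_1, and it is the image
# of the first right singular vector v_1.
far = image[:, radii.argmax()]
print(np.isclose(abs(np.dot(far, U[:, 0])), s[0], atol=1e-4))
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))
```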
Because of CSDN's length limit, the post is continued in part 2:
- Linear Algebra · SVD | Annoying Precision, part 2 (CSDN blog)
  https://blog.youkuaiyun.com/u013669912/article/details/152056758
via:
- Singular value decomposition | Annoying Precision
https://qchu.wordpress.com/2017/03/13/singular-value-decomposition/
