Notes on Chapter 18 of Introduction to Information Retrieval (in English)

Matrix decompositions and latent semantic indexing

term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.

  1. develop a class of operations from linear algebra, known as matrix decomposition
  2. use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
  3. examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing

Linear algebra review

  • eigenvalues of C

For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying

$$Cx = \lambda x$$

are called the eigenvalues of C; a vector x satisfying this equation for an eigenvalue λ is a corresponding (right) eigenvector.

The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
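As a side note (not from the chapter), the principal eigenvector can be approximated by power iteration: repeatedly multiply a starting vector by C and renormalize. A minimal numpy sketch, assuming the largest-magnitude eigenvalue strictly dominates; the helper name is mine:

```python
import numpy as np

def principal_eigenvector(C, num_iters=100, seed=0):
    """Power iteration: repeatedly apply C and renormalize; converges to
    the principal eigenvector when the largest-magnitude eigenvalue
    strictly dominates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(C.shape[0])
    for _ in range(num_iters):
        x = C @ x
        x /= np.linalg.norm(x)
    return x, x @ C @ x  # eigenvector estimate, Rayleigh-quotient eigenvalue

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x, lam = principal_eigenvector(C)
print(lam)  # ~3.0: the largest eigenvalue of this C
```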

In a similar fashion, the left eigenvectors of C are the M-vectors y such that

$$y^T C = \lambda y^T.$$

The number of nonzero eigenvalues of C is at most rank(C).

Note:

  1. The eigenvalues are found by solving the characteristic equation $|C - \lambda I_M| = 0$, where $I_M$ is the M × M identity matrix and $|\cdot|$ denotes the determinant.
  2. The effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small: writing $x = \sum_i a_i x_i$ in the basis of eigenvectors $x_i$, we get $Cx = \sum_i a_i \lambda_i x_i$, so terms with small $\lambda_i$ contribute little.
  3. For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real. Both properties are verified in the sketch below.
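A small numpy check of these properties (my own illustration): construct a symmetric matrix with eigenvalues 30, 20, and 1, confirm that the eigenvalues come out real and the eigenvectors orthonormal, and see that dropping the smallest eigenvalue's component barely changes a matrix–vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
# Symmetric matrix with prescribed eigenvalues, one of them small.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal basis
S = Q @ np.diag([30.0, 20.0, 1.0]) @ Q.T

vals, vecs = np.linalg.eigh(S)                 # eigh: for real symmetric matrices
print(vals)                                    # real eigenvalues, ~[1, 20, 30]
print(np.allclose(vecs.T @ vecs, np.eye(3)))   # True: orthonormal eigenvectors

# Dropping the smallest eigenvalue's component barely changes S @ x.
x = rng.standard_normal(3)
a = vecs.T @ x                          # coordinates of x in the eigenbasis
keep = np.array([0.0, 1.0, 1.0])        # zero out the lambda = 1 term
Sx_approx = vecs @ (vals * a * keep)
print(np.linalg.norm(S @ x - Sx_approx) / np.linalg.norm(S @ x))  # small
```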

Matrix decompositions

a square matrix can be factored into the product of matrices derived from its eigenvectors

  • Two theorems
  1. Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

$$S = U \Lambda U^{-1},$$

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order,

$$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M), \quad \lambda_i \ge \lambda_{i+1}.$$

If the eigenvalues are distinct, then this decomposition is unique.

The decomposition follows by stacking the eigenvector equations column by column: $S(u_1 \; u_2 \; \cdots \; u_M) = (\lambda_1 u_1 \; \lambda_2 u_2 \; \cdots \; \lambda_M u_M)$, i.e., $SU = U\Lambda$, hence $S = U \Lambda U^{-1}$.

  2. Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

$$S = Q \Lambda Q^T,$$

where the columns of Q are the orthogonal, normalized (unit-length, real) eigenvectors of S and Λ is the diagonal matrix whose entries are the eigenvalues of S. All entries of Q are real, and $Q^{-1} = Q^T$ (verified numerically below).
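A quick numpy illustration (mine, not the book's) of the symmetric diagonal decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
S = (A + A.T) / 2                       # any real symmetric matrix

lam, Q = np.linalg.eigh(S)              # eigenvalues, orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # True: S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))          # True: Q^{-1} = Q^T
```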

We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.

Term–document matrices and singular value decompositions

An M × N term–document matrix C is in general not square (M ≠ N), and even when square it is very unlikely to be symmetric; the symmetric diagonal decomposition therefore does not apply directly.

  • Theorem (SVD)

Let r be the rank of the M × N matrix C. Then there is a singular value decomposition of C of the form

$$C = U \Sigma V^T,$$

where the columns of U are the orthogonal eigenvectors of $CC^T$ and the columns of V are the orthogonal eigenvectors of $C^T C$; the eigenvalues $\lambda_1, \ldots, \lambda_r$ of $CC^T$ are the same as the eigenvalues of $C^T C$; and Σ is the M × N matrix with $\Sigma_{ii} = \sigma_i = \sqrt{\lambda_i}$ for $1 \le i \le r$ (with $\sigma_i \ge \sigma_{i+1}$) and zeros elsewhere.

  • Illustration of the SVD

[Figure: illustration of the singular value decomposition, showing the shapes of U, Σ, and $V^T$ for the two cases below.]

there are two cases (see the numpy sketch below):

  1. M > N: the bottom M − N rows of Σ are all zeros;
  2. M < N: the rightmost N − M columns of Σ are all zeros.
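A minimal numpy sketch (the toy matrix is made up) of the SVD of a small term–document matrix; full_matrices=False requests the reduced form, whose shapes adapt to whether M > N or M < N:

```python
import numpy as np

# Toy 5-term x 3-document matrix (M > N); entries are term counts.
C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
print(U.shape, sigma.shape, Vt.shape)            # (5, 3) (3,) (3, 3)
print(np.allclose(U @ np.diag(sigma) @ Vt, C))   # True: C = U Sigma V^T

# The squared singular values are the eigenvalues of C^T C (and of C C^T).
print(np.allclose(np.sort(sigma**2), np.linalg.eigvalsh(C.T @ C)))  # True
```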

Low-rank approximations

  • Frobenius norm

Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

$$\|X\|_F = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} X_{ij}^2}.$$

The Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy.

When k is far smaller than r (the rank of C), we refer to Ck as a low-rank approximation.
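To make the definition concrete, a two-line check (illustrative) that the double-sum formula agrees with numpy's built-in Frobenius norm:

```python
import numpy as np

X = np.array([[1.0, -2.0],
              [3.0,  0.5]])
print(np.sqrt((X ** 2).sum()))     # Frobenius norm from the definition
print(np.linalg.norm(X, 'fro'))    # the same value via numpy's built-in
```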

  • The SVD can be used to solve the low-rank matrix approximation problem.
    We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:

  1. Given C, construct its SVD as above: $C = U \Sigma V^T$.
  2. Derive from Σ the matrix $\Sigma_k$ formed by replacing by zeros the $r - k$ smallest singular values on the diagonal of Σ.
  3. Compute and output $C_k = U \Sigma_k V^T$ as the rank-k approximation to C.

The rank of Ck is at most k.

This procedure yields the matrix of rank k with the lowest possible Frobenius error:

$$\min_{Z:\, \mathrm{rank}(Z) \le k} \|C - Z\|_F = \|C - C_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.$$

  • the form of Ck

$$C_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$

where $u_i$ and $v_i$ are the ith columns of U and V, respectively. Each $u_i v_i^T$ is a rank-1 matrix, so we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
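A numpy sketch of the three-step procedure (the random matrix is a stand-in for a real term–document matrix), also checking the rank-1 expansion and the Frobenius error formula above:

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.standard_normal((6, 4))     # stand-in for a term-document matrix
k = 2

# Step 1: full SVD.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Step 2: zero out all but the k largest singular values.
sigma_k = sigma.copy()
sigma_k[k:] = 0.0

# Step 3: the rank-k approximation C_k = U Sigma_k V^T.
Ck = U @ np.diag(sigma_k) @ Vt
print(np.linalg.matrix_rank(Ck))    # 2

# Same matrix as the sum of k singular-value-weighted rank-1 matrices.
Ck_sum = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
print(np.allclose(Ck, Ck_sum))      # True

# Frobenius error = sqrt of the sum of the squared dropped singular values.
print(np.isclose(np.linalg.norm(C - Ck, 'fro'),
                 np.sqrt((sigma[k:] ** 2).sum())))  # True
```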

Latent semantic indexing

The low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in it. This process is known as latent semantic indexing (LSI).

  1. use SVD to construct a low-rank approximation Ck to the term-document matrix

  2. map each row/column to a k-dimensional space

  3. use the new k-dimensional LSI representation to compute similarities between vectors

$$\vec{q}_k = \Sigma_k^{-1} U_k^T \vec{q},$$

by which a query vector $\vec{q}$ is mapped into its representation $\vec{q}_k$ in the k-dimensional LSI space; query–document similarities are then computed in this space (e.g., by cosine similarity).
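Putting the pieces together, a minimal LSI sketch; the count matrix and query are made up. Documents are represented by the rows of $V_k$, which is what the mapping $\Sigma_k^{-1} U_k^T c_j$ yields for each document column $c_j$ of C:

```python
import numpy as np

# Tiny 5-term x 4-document count matrix (rows = terms, columns = documents).
C = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
k = 2

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sigma_k = U[:, :k], sigma[:k]
docs_k = Vt[:k, :].T                # row j: document j in k-dim LSI space

# Map a query over terms 0 and 1 into LSI space: q_k = Sigma_k^{-1} U_k^T q.
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
q_k = (Uk.T @ q) / sigma_k

# Rank documents by cosine similarity to the query in LSI space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_k, d) for d in docs_k]
print(np.argsort(scores)[::-1])     # document ids, best match first
```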

Note:

  1. The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection.
  2. A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
  3. LSI works best in applications where there is little overlap between queries and documents.

  • soft clustering

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
