Because of the tremendous diversity in the words people use to describe the same document, lexical matching methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD) of large, sparse term-by-document matrices, one can exploit the implicit higher-order structure in the association of terms with documents. Terms and documents represented by the 200-300 largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI), because the subspace captures important associative relationships between terms and documents that are not evident in individual documents.
LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice.
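As a minimal illustration of this setup (a sketch, not the original experiments; the corpus and term list are hypothetical), a term-by-document matrix can be built from raw term counts and decomposed with a truncated SVD:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["human", "interface", "computer", "user", "system"]
docs = [
    "human computer interface",
    "user interface system",
    "human system system",
]

# Term-by-document matrix of raw counts (m terms x n documents).
A = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[terms.index(word), j] += 1

# Truncated SVD keeping k factors (the text keeps 200-300 on large
# collections; k=2 suffices for this toy example).
k = 2
Uk, sk, VkT = svds(csc_matrix(A), k=k)
print(Uk.shape, sk, VkT.shape)  # (5, 2), 2 singular values, (2, 3)
```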
SVD (Singular Value Decomposition):
Given an $m \times n$ matrix $A$, where without loss of generality $m \ge n$ and $\mathrm{rank}(A) = r$, the singular value decomposition of $A$, denoted $\mathrm{SVD}(A)$, is defined as

$$A = U \Sigma V^T$$

where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_i > 0$ for $1 \le i \le r$ and $\sigma_j = 0$ for $j \ge r + 1$.
The first $r$ columns of the orthogonal matrices $U$ and $V$ are the orthonormal eigenvectors associated with the $r$ nonzero eigenvalues of $AA^T$ and $A^T A$, respectively. The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are the diagonal elements of $\Sigma$, which are the nonnegative square roots of the $n$ eigenvalues of $A^T A$.
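A quick numerical check of these properties (a sketch using a random matrix, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                      # m >= n, as assumed above
A = rng.standard_normal((m, n))

# full_matrices=False gives U (m x n), s (n,), Vt (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Orthonormal columns: U^T U = V^T V = I_n.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))

# A = U diag(sigma) V^T.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Singular values are the nonnegative square roots of the
# eigenvalues of A^T A (the nonzero eigenvalues of A A^T agree).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # descending order
assert np.allclose(s, np.sqrt(np.clip(eigvals, 0, None)))
```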
Interpretation of the SVD components within LSI:

$A_k$ = best rank-$k$ approximation to $A$ (illustrated in the sketch after this list)
$U_k$ = term vectors
$\Sigma_k$ = singular values
$V_k$ = document vectors
$m$ = number of terms
$n$ = number of documents
$k$ = number of factors
$r$ = rank of $A$
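These components combine to give $A_k = U_k \Sigma_k V_k^T$, the best rank-$k$ approximation to $A$ in the least-squares (Frobenius) sense. A short sketch of this truncation, again with a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular triplets.
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, Ak minimizes ||A - X||_F over all
# rank-k matrices X; the error equals the discarded singular values.
err = np.linalg.norm(A - Ak, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```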
A user query $q$, given as a length-$m$ vector of term weights, can be represented in the $k$-dimensional LSI space by

$$\hat{q} = q^T U_k \Sigma_k^{-1},$$

and the folded-in query $\hat{q}$ is then compared against the document vectors (the rows of $V_k$), typically by cosine similarity.
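A minimal sketch of this fold-in and ranking step, with a random decomposition standing in for a real term-by-document SVD (the function names here are illustrative, not from the original text):

```python
import numpy as np

def fold_in_query(q, Uk, sk):
    """Project a raw query vector q (length m, term weights) into
    the k-dimensional LSI space: q_hat = q^T U_k Sigma_k^{-1}."""
    return q @ Uk / sk          # dividing by sk applies Sigma_k^{-1}

def rank_documents(q_hat, Vk):
    """Cosine similarity between the folded-in query and each
    document vector (row of V_k); higher means more relevant."""
    sims = (Vk @ q_hat) / (
        np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat)
    )
    return np.argsort(-sims), sims

# Example with a random matrix standing in for a real collection.
rng = np.random.default_rng(2)
A = rng.random((5, 3))                     # 5 terms, 3 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # rows of Vk = documents

q = np.zeros(5)
q[0] = q[3] = 1.0                          # query contains terms 0 and 3
order, sims = rank_documents(fold_in_query(q, Uk, sk), Vk)
print(order, sims)                         # documents ranked by similarity
```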
This article has introduced an information retrieval method, Latent Semantic Indexing (LSI), which uses the singular value decomposition (SVD) to capture the implicit higher-order structure between terms and documents. The approach looks past inconsistencies in word choice at the lexical level and reveals important associative relationships between terms and documents.