Because of the tremendous diversity in the words people use to describe the same document, lexical matching methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD) of large, sparse term-by-document matrices, one can exploit the implicit higher-order structure in the association of terms with documents. Terms and documents represented by the 200-300 largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI), because the subspace captures important associative relationships between terms and documents that are not evident in individual documents.
LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice.
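As a minimal illustration of this setup (a sketch, not the original experiments; the corpus and term list are hypothetical), a term-by-document matrix can be built from raw term counts and decomposed with a truncated SVD:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["human", "interface", "computer", "user", "system"]
docs = [
    "human computer interface",
    "user interface system",
    "human system system",
]

# Term-by-document matrix of raw counts (m terms x n documents).
A = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[terms.index(word), j] += 1

# Truncated SVD keeping k factors (the text keeps 200-300 on large
# collections; k=2 suffices for this toy example).
k = 2
Uk, sk, VkT = svds(csc_matrix(A), k=k)
print(Uk.shape, sk, VkT.shape)  # (5, 2), 2 singular values, (2, 3)
```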
SVD (Singular Value Decomposition):
Given an $m \times n$ matrix $A$, where without loss of generality $m \ge n$ and $\mathrm{rank}(A) = r$, the singular value decomposition of $A$, denoted $\mathrm{SVD}(A)$, is defined as

$$A = U \Sigma V^T$$

where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_i > 0$ for $1 \le i \le r$ and $\sigma_j = 0$ for $j \ge r + 1$.
The first $r$ columns of the orthogonal matrices $U$ and $V$ are the orthonormal eigenvectors associated with the $r$ nonzero eigenvalues of $AA^T$ and $A^T A$, respectively. The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are the diagonal elements of $\Sigma$, which are the nonnegative square roots of the $n$ eigenvalues of $A^T A$.
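A quick numerical check of these properties (a sketch using a random matrix, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                      # m >= n, as assumed above
A = rng.standard_normal((m, n))

# full_matrices=False gives U (m x n), s (n,), Vt (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Orthonormal columns: U^T U = V^T V = I_n.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))

# A = U diag(sigma) V^T.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Singular values are the nonnegative square roots of the
# eigenvalues of A^T A (the nonzero eigenvalues of A A^T agree).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # descending order
assert np.allclose(s, np.sqrt(np.clip(eigvals, 0, None)))
```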
Interpretation of the SVD components within LSI:

$A_k$ = best rank-$k$ approximation to $A$ (illustrated in the sketch after this list)
$U_k$ = term vectors
$\Sigma_k$ = singular values
$V_k$ = document vectors
$m$ = number of terms
$n$ = number of documents
$k$ = number of factors
$r$ = rank of $A$
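These components combine to give $A_k = U_k \Sigma_k V_k^T$, the best rank-$k$ approximation to $A$ in the least-squares (Frobenius) sense. A short sketch of this truncation, again with a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular triplets.
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, Ak minimizes ||A - X||_F over all
# rank-k matrices X; the error equals the discarded singular values.
err = np.linalg.norm(A - Ak, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```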
A user query $q$, given as a length-$m$ vector of term weights, can be represented in the $k$-dimensional LSI space by

$$\hat{q} = q^T U_k \Sigma_k^{-1},$$

and the folded-in query $\hat{q}$ is then compared against the document vectors (the rows of $V_k$), typically by cosine similarity.
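A minimal sketch of this fold-in and ranking step, with a random decomposition standing in for a real term-by-document SVD (the function names here are illustrative, not from the original text):

```python
import numpy as np

def fold_in_query(q, Uk, sk):
    """Project a raw query vector q (length m, term weights) into
    the k-dimensional LSI space: q_hat = q^T U_k Sigma_k^{-1}."""
    return q @ Uk / sk          # dividing by sk applies Sigma_k^{-1}

def rank_documents(q_hat, Vk):
    """Cosine similarity between the folded-in query and each
    document vector (row of V_k); higher means more relevant."""
    sims = (Vk @ q_hat) / (
        np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat)
    )
    return np.argsort(-sims), sims

# Example with a random matrix standing in for a real collection.
rng = np.random.default_rng(2)
A = rng.random((5, 3))                     # 5 terms, 3 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # rows of Vk = documents

q = np.zeros(5)
q[0] = q[3] = 1.0                          # query contains terms 0 and 3
order, sims = rank_documents(fold_in_query(q, Uk, sk), Vk)
print(order, sims)                         # documents ranked by similarity
```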
This article has introduced an information retrieval method, Latent Semantic Indexing (LSI), which uses the singular value decomposition (SVD) to capture the implicit higher-order structure between terms and documents. The approach looks past inconsistencies in word choice at the lexical level and reveals important associative relationships between terms and documents.