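Latent Semantic Indexing (LSI)

This post introduces latent semantic indexing (LSI), an information retrieval method that uses the singular value decomposition (SVD) to capture the implicit higher-order structure in the association between terms and documents. The method can look past inconsistencies at the lexical level and reveal important associations between terms and documents.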

 

Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term-by-document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents.
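As a minimal sketch of the starting point described above, the following builds a small dense term-by-document count matrix. The helper name `term_document_matrix` and the three-document toy corpus are invented for illustration; a real collection would use a sparse matrix and usually some term weighting, which this sketch leaves out.

```python
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Return (sorted vocabulary, m x n count matrix) for a list of strings.

    Rows are terms, columns are documents. Hypothetical helper that
    tokenizes by lowercasing and splitting on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    row = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            A[row[term], j] = count
    return vocab, A


# Toy corpus, invented for illustration only.
docs = [
    "car engine repair",
    "automobile engine maintenance",
    "banana bread recipe",
]
vocab, A = term_document_matrix(docs)
```

Here `A` has one row per distinct term and one column per document, which is the m x n layout the SVD below operates on.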

 

 

LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice.

 

 

Singular value decomposition (SVD):

 

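Given an $m \times n$ matrix $A$, where without loss of generality $m \ge n$ and $\mathrm{rank}(A) = r$, the singular value decomposition of $A$, denoted by SVD($A$), is defined as

$$A = U \Sigma V^T$$

where $U^T U = V^T V = I$

and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$,

$\sigma_i > 0$ for $1 \le i \le r$,

$\sigma_j = 0$ for $j \ge r+1$.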
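The definition can be checked numerically. The sketch below uses NumPy's `np.linalg.svd` on a small made-up 4 x 3 matrix whose third column is the sum of the first two, so that r = 2 < n and the last singular value is zero up to rounding. The matrix values are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up 4 x 3 matrix (m >= n). Column 3 = column 1 + column 2, so rank is 2.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 0.0, 2.0],
])
n = A.shape[1]

# full_matrices=False gives U as m x n, s of length n, and Vt = V^T as n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(s)

reconstruction_ok = np.allclose(A, U @ Sigma @ Vt)   # A = U Sigma V^T
u_orthonormal = np.allclose(U.T @ U, np.eye(n))      # U^T U = I
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(n))    # V^T V = I
r = np.linalg.matrix_rank(A)                         # number of nonzero sigma_i
```

NumPy returns the singular values in descending order, so the first `r` entries of `s` are the positive ones and the remaining entries are zero within floating-point tolerance.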

 

The first $r$ columns of the orthogonal matrices $U$ and $V$ define the orthonormal eigenvectors associated with the $r$ nonzero eigenvalues of $AA^T$ and $A^T A$, respectively. The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are defined as the diagonal elements of $\Sigma$, which are the nonnegative square roots of the $n$ eigenvalues of $A^T A$ (the nonzero eigenvalues of $AA^T$ are the same).
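The link between singular values and eigenvalues can be verified with a short NumPy sketch. The 4 x 3 matrix is made up for illustration. Since m > n here, A A^T (m x m) has one more eigenvalue than A^T A (n x n), and that extra eigenvalue is zero.

```python
import numpy as np

# Made-up full-rank 4 x 3 matrix, for illustration only.
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [4.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
])
n = A.shape[1]

s = np.linalg.svd(A, compute_uv=False)        # singular values, descending

# eigvalsh handles symmetric matrices and returns ascending eigenvalues,
# so reverse them to line up with the descending singular values.
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]   # n eigenvalues
eig_AAt = np.linalg.eigvalsh(A @ A.T)[::-1]   # m eigenvalues

match_AtA = np.allclose(s**2, eig_AtA)
match_AAt = np.allclose(s**2, eig_AAt[:n])
extra_is_zero = np.isclose(eig_AAt[n], 0.0, atol=1e-9)
```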

 

 

Interpretation of SVD components within LSI.

 

$A_k$ = best rank-$k$ approximation to $A$

$U$ = term vectors

$\Sigma$ = singular values

$V$ = document vectors

$m$ = number of terms

$n$ = number of documents

$k$ = number of factors

$r$ = rank of $A$
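A rank-k approximation can be formed by keeping only the k largest singular values and their vectors. The sketch below assumes a small made-up matrix and a hypothetical helper `truncated_svd`. It also checks the Eckart-Young result that the Frobenius-norm error of this truncation equals the square root of the sum of the squared discarded singular values, which is the sense in which the approximation is "best".

```python
import numpy as np


def truncated_svd(A, k):
    """Return (U_k, s_k, V_k^T): the k leading singular triplets of A.

    Hypothetical helper. Uses a dense SVD, so it suits small matrices only.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Made-up 4 x 3 term-by-document matrix of rank 3, for illustration only.
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [4.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
])
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
Ak = Uk @ np.diag(sk) @ Vtk                      # A_k = U_k Sigma_k V_k^T

s_all = np.linalg.svd(A, compute_uv=False)
error = np.linalg.norm(A - Ak, "fro")            # actual approximation error
expected_error = np.sqrt(np.sum(s_all[k:] ** 2)) # error from discarded sigmas
```

For the large sparse matrices mentioned earlier, an iterative solver such as `scipy.sparse.linalg.svds` would be a more realistic choice than a dense SVD.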

 

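The user query can be represented by

$$\hat{q} = q^T U_k \Sigma_k^{-1}$$

where $q$ is the query's term vector of length $m$, $U_k$ holds the first $k$ columns of $U$, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values.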
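The query projection can be sketched as follows. The 5 x 3 toy matrix, the term order, and the helper names `fold_in_query` and `cosine_scores` are assumptions made for illustration. Ranking documents by cosine similarity against the rows of V_k is one common choice, not something the text above prescribes.

```python
import numpy as np


def fold_in_query(q, Uk, sk):
    """Project a length-m term vector q into the k-dim space: q^T U_k Sigma_k^{-1}."""
    return (q @ Uk) / sk


def cosine_scores(q_hat, doc_vectors):
    """Cosine similarity between q_hat and each row of doc_vectors."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_hat)
    return (doc_vectors @ q_hat) / norms


# Toy term-by-document matrix (rows: car, automobile, engine, banana, bread).
# Documents: 0 = "car engine", 1 = "automobile engine", 2 = "banana bread".
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T        # rows of Vk are document vectors

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])          # query containing only "car"
q_hat = fold_in_query(q, Uk, sk)
scores = cosine_scores(q_hat, Vk)

# Folding in an existing document column reproduces its row of V_k.
doc0_hat = fold_in_query(A[:, 0], Uk, sk)
```

In this toy case, document 1 never mentions "car" but shares "engine" with document 0, so both land in the same direction of the reduced space and score equally against the query, while document 2 scores near zero.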

