Notes on Chapter 18 of Introduction to Information Retrieval (in English)

Matrix decompositions and latent semantic indexing

term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.

  1. develop a class of operations from linear algebra, known as matrix decomposition
  2. use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
  3. examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing

Linear algebra review

  • eigenvalues of C

For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying

$$Cx = \lambda x$$

are called the eigenvalues of C; a vector x satisfying this equation for an eigenvalue λ is a corresponding (right) eigenvector.

The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
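As a side note (not from the chapter), the principal eigenvector can be approximated by power iteration: repeatedly multiply a starting vector by C and renormalize. A minimal numpy sketch, assuming the largest-magnitude eigenvalue strictly dominates; the helper name is mine:

```python
import numpy as np

def principal_eigenvector(C, num_iters=100, seed=0):
    """Power iteration: repeatedly apply C and renormalize; converges to
    the principal eigenvector when the largest-magnitude eigenvalue
    strictly dominates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(C.shape[0])
    for _ in range(num_iters):
        x = C @ x
        x /= np.linalg.norm(x)
    return x, x @ C @ x  # eigenvector estimate, Rayleigh-quotient eigenvalue

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x, lam = principal_eigenvector(C)
print(lam)  # ~3.0: the largest eigenvalue of this C
```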

In a similar fashion, the left eigenvectors of C are the M-vectors y such that

$$y^T C = \lambda y^T.$$

The number of nonzero eigenvalues of C is at most rank(C).

Note:

  1. The eigenvalues are found by solving the characteristic equation $|C - \lambda I_M| = 0$, where $I_M$ is the M × M identity matrix and $|\cdot|$ denotes the determinant.
  2. The effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small: writing $x = \sum_i a_i x_i$ in the basis of eigenvectors $x_i$, we get $Cx = \sum_i a_i \lambda_i x_i$, so terms with small $\lambda_i$ contribute little.
  3. For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real. Both properties are verified in the sketch below.
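A small numpy check of these properties (my own illustration): construct a symmetric matrix with eigenvalues 30, 20, and 1, confirm that the eigenvalues come out real and the eigenvectors orthonormal, and see that dropping the smallest eigenvalue's component barely changes a matrix–vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
# Symmetric matrix with prescribed eigenvalues, one of them small.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal basis
S = Q @ np.diag([30.0, 20.0, 1.0]) @ Q.T

vals, vecs = np.linalg.eigh(S)                 # eigh: for real symmetric matrices
print(vals)                                    # real eigenvalues, ~[1, 20, 30]
print(np.allclose(vecs.T @ vecs, np.eye(3)))   # True: orthonormal eigenvectors

# Dropping the smallest eigenvalue's component barely changes S @ x.
x = rng.standard_normal(3)
a = vecs.T @ x                          # coordinates of x in the eigenbasis
keep = np.array([0.0, 1.0, 1.0])        # zero out the lambda = 1 term
Sx_approx = vecs @ (vals * a * keep)
print(np.linalg.norm(S @ x - Sx_approx) / np.linalg.norm(S @ x))  # small
```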

Matrix decompositions

a square matrix can be factored into the product of matrices derived from its eigenvectors

  • Two theorems
  1. Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

$$S = U \Lambda U^{-1},$$

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order,

$$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M), \quad \lambda_i \ge \lambda_{i+1}.$$

If the eigenvalues are distinct, then this decomposition is unique.

The decomposition follows by stacking the eigenvector equations column by column: $S(u_1 \; u_2 \; \cdots \; u_M) = (\lambda_1 u_1 \; \lambda_2 u_2 \; \cdots \; \lambda_M u_M)$, i.e., $SU = U\Lambda$, hence $S = U \Lambda U^{-1}$.

  2. Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

$$S = Q \Lambda Q^T,$$

where the columns of Q are the orthogonal, normalized (unit-length, real) eigenvectors of S and Λ is the diagonal matrix whose entries are the eigenvalues of S. All entries of Q are real, and $Q^{-1} = Q^T$ (verified numerically below).
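A quick numpy illustration (mine, not the book's) of the symmetric diagonal decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
S = (A + A.T) / 2                       # any real symmetric matrix

lam, Q = np.linalg.eigh(S)              # eigenvalues, orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # True: S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))          # True: Q^{-1} = Q^T
```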

We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.

Term–document matrices and singular value decompositions

An M × N term–document matrix C is in general not square (M ≠ N), and even when square it is very unlikely to be symmetric; the symmetric diagonal decomposition therefore does not apply directly.

  • Theorem (SVD)

Let r be the rank of the M × N matrix C. Then there is a singular value decomposition of C of the form

$$C = U \Sigma V^T,$$

where the columns of U are the orthogonal eigenvectors of $CC^T$ and the columns of V are the orthogonal eigenvectors of $C^T C$; the eigenvalues $\lambda_1, \ldots, \lambda_r$ of $CC^T$ are the same as the eigenvalues of $C^T C$; and Σ is the M × N matrix with $\Sigma_{ii} = \sigma_i = \sqrt{\lambda_i}$ for $1 \le i \le r$ (with $\sigma_i \ge \sigma_{i+1}$) and zeros elsewhere.

  • Illustration of the SVD

[Figure: illustration of the singular value decomposition, showing the shapes of U, Σ, and $V^T$ for the two cases below.]

there are two cases (see the numpy sketch below):

  1. M > N: the bottom M − N rows of Σ are all zeros;
  2. M < N: the rightmost N − M columns of Σ are all zeros.
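A minimal numpy sketch (the toy matrix is made up) of the SVD of a small term–document matrix; full_matrices=False requests the reduced form, whose shapes adapt to whether M > N or M < N:

```python
import numpy as np

# Toy 5-term x 3-document matrix (M > N); entries are term counts.
C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
print(U.shape, sigma.shape, Vt.shape)            # (5, 3) (3,) (3, 3)
print(np.allclose(U @ np.diag(sigma) @ Vt, C))   # True: C = U Sigma V^T

# The squared singular values are the eigenvalues of C^T C (and of C C^T).
print(np.allclose(np.sort(sigma**2), np.linalg.eigvalsh(C.T @ C)))  # True
```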

Low-rank approximations

  • Frobenius norm

Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

$$\|X\|_F = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} X_{ij}^2}.$$

The Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy.

When k is far smaller than r (the rank of C), we refer to Ck as a low-rank approximation.
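To make the definition concrete, a two-line check (illustrative) that the double-sum formula agrees with numpy's built-in Frobenius norm:

```python
import numpy as np

X = np.array([[1.0, -2.0],
              [3.0,  0.5]])
print(np.sqrt((X ** 2).sum()))     # Frobenius norm from the definition
print(np.linalg.norm(X, 'fro'))    # the same value via numpy's built-in
```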

  • The SVD can be used to solve the low-rank matrix approximation problem.
    We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:

  1. Given C, construct its SVD as above: $C = U \Sigma V^T$.
  2. Derive from Σ the matrix $\Sigma_k$ formed by replacing by zeros the $r - k$ smallest singular values on the diagonal of Σ.
  3. Compute and output $C_k = U \Sigma_k V^T$ as the rank-k approximation to C.

The rank of Ck is at most k.

This procedure yields the matrix of rank k with the lowest possible Frobenius error:

$$\min_{Z:\, \mathrm{rank}(Z) \le k} \|C - Z\|_F = \|C - C_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.$$

  • the form of Ck

$$C_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$

where $u_i$ and $v_i$ are the ith columns of U and V, respectively. Each $u_i v_i^T$ is a rank-1 matrix, so we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
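A numpy sketch of the three-step procedure (the random matrix is a stand-in for a real term–document matrix), also checking the rank-1 expansion and the Frobenius error formula above:

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.standard_normal((6, 4))     # stand-in for a term-document matrix
k = 2

# Step 1: full SVD.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Step 2: zero out all but the k largest singular values.
sigma_k = sigma.copy()
sigma_k[k:] = 0.0

# Step 3: the rank-k approximation C_k = U Sigma_k V^T.
Ck = U @ np.diag(sigma_k) @ Vt
print(np.linalg.matrix_rank(Ck))    # 2

# Same matrix as the sum of k singular-value-weighted rank-1 matrices.
Ck_sum = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
print(np.allclose(Ck, Ck_sum))      # True

# Frobenius error = sqrt of the sum of the squared dropped singular values.
print(np.isclose(np.linalg.norm(C - Ck, 'fro'),
                 np.sqrt((sigma[k:] ** 2).sum())))  # True
```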

Latent semantic indexing

The low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in it. This process is known as latent semantic indexing (LSI).

  1. use SVD to construct a low-rank approximation Ck to the term-document matrix

  2. map each row/column to a k-dimensional space

  3. use the new k-dimensional LSI representation to compute similarities between vectors

$$\vec{q}_k = \Sigma_k^{-1} U_k^T \vec{q},$$

by which a query vector $\vec{q}$ is mapped into its representation $\vec{q}_k$ in the k-dimensional LSI space; query–document similarities are then computed in this space (e.g., by cosine similarity).
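Putting the pieces together, a minimal LSI sketch; the count matrix and query are made up. Documents are represented by the rows of $V_k$, which is what the mapping $\Sigma_k^{-1} U_k^T c_j$ yields for each document column $c_j$ of C:

```python
import numpy as np

# Tiny 5-term x 4-document count matrix (rows = terms, columns = documents).
C = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
k = 2

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sigma_k = U[:, :k], sigma[:k]
docs_k = Vt[:k, :].T                # row j: document j in k-dim LSI space

# Map a query over terms 0 and 1 into LSI space: q_k = Sigma_k^{-1} U_k^T q.
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
q_k = (Uk.T @ q) / sigma_k

# Rank documents by cosine similarity to the query in LSI space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_k, d) for d in docs_k]
print(np.argsort(scores)[::-1])     # document ids, best match first
```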

Note:

  1. The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection.
  2. A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
  3. LSI works best in applications where there is little overlap between queries and documents.

  • soft clustering

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
