Introduction to Information Retrieval: Reading Notes (2)

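This note covers the steps of building an inverted index, including document collection, tokenization, and linguistic preprocessing. It also looks at how skip pointers speed up postings list intersection and how a positional index handles phrase queries.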

The term vocabulary and postings lists

Inverted index construction steps:

1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
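The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the regex tokenizer, the lowercase-only normalization, and the names `tokenize`, `normalize`, and `build_inverted_index` are assumptions made for the example, not something prescribed by the book.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a token is a maximal run of word characters.
    # Real tokenizers are language-specific.
    return re.findall(r"\w+", text)


def normalize(token):
    # Assumption: case-folding is the only linguistic preprocessing applied.
    return token.lower()


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():          # step 1: collected documents
        for token in tokenize(text):           # step 2: tokenize
            index[normalize(token)].add(doc_id)  # steps 3 and 4
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Romans and friends",
}
index = build_inverted_index(docs)
```

Here `index["romans"]` is the postings list `[1, 3]`; keeping postings sorted by docID is what makes the linear-time intersection later in these notes possible.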

 

2.1 Document delineation and character sequence decoding

Encoding problems: how to detect the character encoding automatically.

Text format problems: documents may arrive as DOC, PDF, XML, HTML, and so on.

Sequence problems: in languages such as Arabic, text takes on some two-dimensional and mixed-order characteristics.

 

Choosing a document unit: this is a precision/recall tradeoff. The problems with large document units can be alleviated by the use of explicit or implicit proximity search.

 

2.2 Determining the vocabulary of terms

Token: given a character sequence, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

Difference between token and type:

A token is an instance of a sequence of characters in some particular document.

A type is the class of all tokens containing the same character sequence.

This is like the difference between an instance and a class in OOP.
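A small sketch makes the distinction concrete. Whitespace splitting is assumed here purely for illustration; it is not a real tokenizer.

```python
text = "to sleep perchance to dream"

# Every occurrence is a token (an instance).
tokens = text.split()

# Each distinct character sequence is a type (a class of tokens).
types = set(tokens)

token_count = len(tokens)
type_count = len(types)
```

The phrase has 5 tokens but only 4 types, because the two occurrences of "to" are separate tokens of the same type.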

Tokenization is language-specific: language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns.

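Chinese word segmentation: forward/backward maximum matching.

Recognizing proper nouns, IP addresses, URLs, email addresses, and phone numbers.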
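Forward and backward maximum matching can be sketched as below. The tiny dictionary, the `max_len` cap, and the fallback to single characters for unknown text are all assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Same idea, but scanning from the end of the text toward the start."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy dictionary.
dictionary = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"  # "research the origin of life"

forward = forward_max_match(text, dictionary)    # ['研究生', '命', '起源']
backward = backward_max_match(text, dictionary)  # ['研究', '生命', '起源']
```

The two directions disagree on this input: the forward pass greedily takes 研究生 ("graduate student") and strands 命, while the backward pass finds the intended segmentation. Comparing both passes is one common way to spot ambiguous spans.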

2.2.2 Dropping common terms: stop words

Stop words: extremely common words that have little value in helping select documents matching a user need.

How to collect them:

The general strategy for determining a stop list is to sort the terms by collection frequency and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list.
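The frequency-sorting part of this strategy might look like the sketch below. The function name `candidate_stop_list` and the toy documents are made up for illustration, and the hand-filtering step is deliberately left to a human.

```python
from collections import Counter


def candidate_stop_list(tokenized_docs, size):
    """Return the `size` terms with the highest collection frequency."""
    collection_frequency = Counter()
    for tokens in tokenized_docs:
        # Collection frequency counts every occurrence, not just one per document.
        collection_frequency.update(tokens)
    return [term for term, _ in collection_frequency.most_common(size)]


tokenized_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "and", "the", "cat"],
    ["a", "dog", "on", "the", "mat"],
]
candidates = candidate_stop_list(tokenized_docs, 1)
```

With this toy collection the top candidate is "the", which occurs five times. In practice the returned list is only a starting point for manual review.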

2.2.3 Normalization (equivalence classing of terms)

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.

Spelling variants: anti-discriminatory and antidiscriminatory

Synonyms: car and automobile

Accents and diacritics

Capitalization/case-folding
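Three of the cases above (hyphen variants, accents, and case) can be handled with a simple normalization function, sketched here with Python's standard `unicodedata` module. Dropping hyphens and stripping all combining marks are assumptions chosen for the example; they can conflate words that should stay distinct in some languages. Synonyms such as car and automobile need a separate equivalence list and are not covered.

```python
import unicodedata


def normalize_token(token):
    """Map superficially different tokens to one canonical form."""
    # Case-folding: 'Automobile' and 'automobile' become the same term.
    token = token.casefold()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: hyphens are simply removed.
    return stripped.replace("-", "")


same_class = normalize_token("Anti-Discriminatory") == normalize_token("antidiscriminatory")
unaccented = normalize_token("Naïve")
```

The same function must be applied to both indexed tokens and query tokens, otherwise the equivalence classes will not line up at query time.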

2.2.4 Stemming and lemmatization

The goal of both stemming and lemmatization is to reduce inflectional
forms and sometimes derivationally related forms of a word to a common
base form.

e.g.:

am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car

Some common algorithms for stemming English:

Porter stemmer, Lovins stemmer, Paice stemmer
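To give a feel for how rule-based stemming works, here is a sketch of only the first rule group (step 1a) of the Porter stemmer: SSES → SS, IES → I, SS → SS, S → (nothing). It is not a full stemmer; the later steps and their measure conditions are omitted, and the function name `porter_step_1a` is just a label for this example.

```python
# Rules are tried in order; the first matching suffix wins, which amounts to
# picking the longest matching suffix among these four.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word


stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
```

The result for these four words is `caress`, `poni`, `caress`, `cat`. Note that a stem such as `poni` does not have to be a real word; it only has to be the same for all related forms.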

 

2.3 Faster postings list intersection via skip pointers

Postings lists intersection with skip pointers:

INTERSECTWITHSKIPS(p1, p2)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                     then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                          do p1 ← skip(p1)
                     else p1 ← next(p1)
              else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                     then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                          do p2 ← skip(p2)
                     else p2 ← next(p2)
  return answer
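The pseudocode above translates fairly directly into Python. In this sketch, postings lists are plain sorted lists of docIDs and skip pointers are kept in a side dictionary from list index to target index; that representation, the evenly spaced skips of length about √P, and the helper name `skip_targets` are assumptions made for the example.

```python
import math


def skip_targets(postings):
    """Place evenly spaced skip pointers, each spanning about sqrt(P) entries."""
    n = len(postings)
    step = int(math.sqrt(n))
    if step < 2:
        return {}
    # Index i may skip to index i + step; targets always stay inside the list.
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    skips1, skips2 = skip_targets(p1), skip_targets(p2)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                # Follow skips as long as they do not overshoot p2's docID.
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer


p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
result = intersect_with_skips(p1, p2)
```

For these two lists the result is `[2, 8]`, the same as a plain merge; skips only change how many comparisons are made, never the answer.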

2.4 Positional postings and phrase queries

Biword indexes: one approach to handling phrases is to treat every pair of consecutive terms in a document as a phrase. (This is not the standard solution.)

Biword extension:

The concept of a biword index can be extended to longer sequences of words, and if the index includes variable-length word sequences, it is generally referred to as a phrase index.
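A biword index and a phrase lookup against it can be sketched as follows. The function names and the toy documents are invented for this example; documents are assumed to be already tokenized and normalized.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Index every pair of consecutive terms as a single vocabulary entry."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index


def phrase_candidates(index, phrase_tokens):
    """Documents containing every biword of the phrase (may be false positives)."""
    biwords = list(zip(phrase_tokens, phrase_tokens[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result


docs = {
    1: ["stanford", "university", "palo", "alto"],
    2: ["stanford", "university", "press"],
    3: ["university", "palo", "verde", "palo", "alto"],
    4: ["stanford", "university", "x", "university", "palo", "y", "palo", "alto"],
}
index = build_biword_index(docs)
hits = phrase_candidates(index, ["stanford", "university", "palo", "alto"])
```

The lookup returns documents 1 and 4. Document 4 contains all three biwords but never the full four-word phrase, which illustrates why longer phrase queries on a biword index can produce false positives.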

 

Positional indexes (the most commonly employed approach):

For each term, store postings of the form docID: <position1, position2, ...>

 

An algorithm for proximity intersection of postings lists p1 and p2:

 

POSITIONALINTERSECT(p1, p2, k)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then l ← ()
            pp1 ← positions(p1)
            pp2 ← positions(p2)
            while pp1 != NIL
            do while pp2 != NIL
               do if |pos(pp1) − pos(pp2)| ≤ k
                    then ADD(l, pos(pp2))
                    else if pos(pp2) > pos(pp1)
                           then break
                  pp2 ← next(pp2)
               while l != () and |l[0] − pos(pp1)| > k
               do DELETE(l[0])
               for each ps ∈ l
               do ADD(answer, <docID(p1), pos(pp1), ps>)
               pp1 ← next(pp1)
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
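A Python sketch of the same proximity intersection follows. It assumes each postings list is a list of `(docID, positions)` pairs sorted by docID, with positions sorted in ascending order; that layout and the variable names are choices made for this example. The inner scan stops only once a position of the second term lies beyond the current position of the first term by more than k, so positions that are still too far to the left are skipped over rather than ending the scan.

```python
from collections import deque


def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples where the two terms are within k words."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window = deque()  # positions of term 2 near the current pos1
            b = 0             # scan index into positions2, never moves back
            for pos1 in positions1:
                while b < len(positions2):
                    if abs(pos1 - positions2[b]) <= k:
                        window.append(positions2[b])
                    elif positions2[b] > pos1:
                        break  # too far right; later pos1 values may reach it
                    b += 1
                # Drop positions that have fallen too far behind pos1.
                while window and abs(window[0] - pos1) > k:
                    window.popleft()
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical positional postings for two terms.
to_postings = [(1, [4, 20]), (4, [8, 16])]
be_postings = [(1, [5, 30]), (2, [1]), (4, [17, 40])]
matches = positional_intersect(to_postings, be_postings, 1)
```

With k = 1 the result is `[(1, 4, 5), (4, 16, 17)]`. For a strict phrase query such as "to be", a caller would additionally keep only the triples where the second position equals the first plus one.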
 

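Combination schemes:

Biword indexes and positional indexes can be combined: a phrase (or biword) index answers certain phrase queries, and a positional index handles the rest.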

 
