2. Glossary
Analysis
The process of breaking the text of a field into individual terms to be indexed.
This consists of tokenizing the text into terms, and then optionally filtering the tokenized terms (for example, lowercasing and removing stop words). Whoosh includes several different analyzers.
Translator's vocabulary notes: "consists of" means "is made up of"; "tokenizing" means splitting text into tokens.
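The steps described for analysis (tokenize, then lowercase, then remove stop words) can be sketched in plain Python. This is only an illustration of the idea, not Whoosh's analyzer API; the names `tokenize`, `analyze`, and `STOP_WORDS` are assumptions made up for this example, and the stop-word list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def tokenize(text):
    """Split text into raw tokens, one per run of word characters."""
    return re.findall(r"\w+", text)


def analyze(text, stop_words=STOP_WORDS):
    """Tokenize the text, lowercase each token, then drop stop words."""
    terms = []
    for token in tokenize(text):
        term = token.lower()
        if term not in stop_words:
            terms.append(term)
    return terms


terms = analyze("The Process of Breaking the Text into Terms")
print(terms)
```

Only the terms that survive the filters would be handed to the indexer, which is why the same analysis normally has to be applied to query text as well.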
Corpus
The set of documents you are indexing.
Document
The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything: articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.
Fields
Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
Forward index
A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.
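A forward index as defined here can be pictured as a mapping from each document to its words. The sketch below is plain Python under simple assumptions (whitespace tokenization, lowercasing); `build_forward_index` and the sample `docs` are hypothetical names for illustration and are not part of Whoosh.

```python
def build_forward_index(docs):
    """Map each document number to the sorted list of distinct words in it."""
    return {
        docnum: sorted(set(text.lower().split()))
        for docnum, text in docs.items()
    }


# Hypothetical two-document corpus keyed by document number.
docs = {0: "whoosh is a search library", 1: "a search index"}
forward = build_forward_index(docs)

for docnum, words in forward.items():
    print(docnum, words)
```

The lookup direction is document to words, which is the opposite of the reverse index used for searching.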
Indexing
The process of examining documents in the corpus and adding them to the reverse index.
Postings
The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.
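To make the idea of a posting concrete, here is a minimal sketch in plain Python where each posting holds a document number plus one piece of extra information, the term frequency. The `Posting` type and `build_postings` function are assumptions for this example; Whoosh's actual posting formats are configured per field and are not shown here.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    docnum: int     # which document the word appears in
    frequency: int  # extra information: how often it appears there


def build_postings(docs):
    """Return a mapping of word -> list of postings, ordered by document number."""
    postings = defaultdict(list)
    for docnum in sorted(docs):
        counts = Counter(docs[docnum].lower().split())
        for word, freq in counts.items():
            postings[word].append(Posting(docnum, freq))
    return dict(postings)


# Hypothetical two-document corpus keyed by document number.
docs = {0: "red fish blue fish", 1: "one fish"}
postings = build_postings(docs)
print(postings["fish"])
```

Storing more per posting (for example positions) makes richer queries possible at the cost of a larger index.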
Reverse index
Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
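The table described above can be sketched as a dictionary from word to documents, here also recording the positions at which the word appears. This is a simplified illustration in plain Python, not how Whoosh stores its index on disk; `build_reverse_index` and `documents_containing` are hypothetical helper names.

```python
from collections import defaultdict


def build_reverse_index(docs):
    """Map each word to {document number: [positions of the word]}."""
    index = defaultdict(dict)
    for docnum, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(docnum, []).append(position)
    return dict(index)


def documents_containing(index, word):
    """Return the sorted document numbers in which the word appears."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical two-document corpus keyed by document number.
docs = {0: "to be or not to be", 1: "to do is to be"}
index = build_reverse_index(docs)

print(index["to"])
print(documents_containing(index, "not"))
```

Searching for a word becomes a single dictionary lookup instead of a scan over every document.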
Schema
Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.
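The role of a schema, associating field names with metadata such as whether a field is indexed or stored, can be mimicked with a small plain-Python model. Everything here (`FieldSpec`, `split_document`, the field choices) is a hypothetical stand-in for illustration and should not be read as Whoosh's `Schema` class or its field types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical per-field metadata: searchable and/or kept for display."""
    indexed: bool = True
    stored: bool = False


# Fields must be declared before any document is indexed.
schema = {
    "title": FieldSpec(indexed=True, stored=True),     # searchable and displayable
    "content": FieldSpec(indexed=True, stored=False),  # searchable only
    "url": FieldSpec(indexed=False, stored=True),      # displayable only
}


def split_document(schema, document):
    """Split a document into the values to index and the values to store."""
    unknown = set(document) - set(schema)
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")
    to_index = {name: value for name, value in document.items() if schema[name].indexed}
    to_store = {name: value for name, value in document.items() if schema[name].stored}
    return to_index, to_store


doc = {"title": "Glossary", "content": "The set of documents", "url": "/glossary"}
to_index, to_store = split_document(schema, doc)
print(sorted(to_index), sorted(to_store))
```

Rejecting undeclared fields mirrors the requirement that the fields be specified up front.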
Term vector
A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.




