What changes in Lucene 4.0

  • Pre-3.0 indices are no longer supported. 

  • MIGRATE.txt describes how to update your application code. 

  • The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta. 

Please try the release and report back! 

Pluggable Codec  

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis. 

There are some fun core codecs:
  • Lucene40 is the default codec. 

  • Lucene3x (read-only) reads any index written with Lucene 3.x. 

  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!). 

  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.).

  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup. 

  • AppendingCodec avoids seeking while writing, which is necessary for filesystems such as Hadoop DFS.

If you create your own Codec, it's easy to confirm that all of Lucene/Solr's tests pass with it. If tests fail, your Codec likely has a bug!
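
For instance, here's a minimal sketch (against the 4.0 APIs; the "id" field and the choice of the Memory format are just illustrative) of customizing the postings format for one field while keeping the default codec for everything else:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40Codec;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class PerFieldCodecDemo {
      public static IndexWriterConfig makeConfig() {
        // Default codec, except the "id" field uses the RAM-resident
        // Memory postings format (a good fit for a primary-key field):
        Codec codec = new Lucene40Codec() {
          @Override
          public PostingsFormat getPostingsFormatForField(String field) {
            if ("id".equals(field)) {
              return PostingsFormat.forName("Memory");
            }
            return super.getPostingsFormatForField(field);
          }
        };
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
            new StandardAnalyzer(Version.LUCENE_40));
        iwc.setCodec(codec);
        return iwc;
      }
    }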

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API. 
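
A rough sketch of walking all four dimensions on an AtomicReader, assuming a field (here called "body") that was indexed with positions:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DocsAndPositionsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    public class PostingsWalk {
      // Visit every term, document and position for one field.
      static void walk(AtomicReader reader, String field) throws IOException {
        Terms terms = reader.terms(field);               // the field's terms
        if (terms == null) return;                       // field was not indexed
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {      // each term
          DocsAndPositionsEnum dp =
              termsEnum.docsAndPositions(reader.getLiveDocs(), null);
          if (dp == null) continue;                      // positions not indexed
          int doc;
          while ((doc = dp.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { // each doc
            for (int i = 0; i < dp.freq(); i++) {
              int position = dp.nextPosition();          // each position
            }
          }
        }
      }
    }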

Flexible scoring  

Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.

These changes were part of a 2011 Google Summer of Code project (thank you David!).
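
For example, a minimal sketch of plugging in BM25:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;

    public class Bm25SearcherDemo {
      // Swap the default TF/IDF Similarity for BM25 at search time; the
      // same Similarity should also be set on IndexWriterConfig at index
      // time so norms are computed consistently.
      static IndexSearcher newSearcher(IndexReader reader) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(new BM25Similarity());
        return searcher;
      }
    }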

These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.

Document Values  

The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed-width 8, 16, 32 or 64 bit ints, or variable-width (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).
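
A minimal indexing-side sketch, assuming the 4.0 field classes (the "id" and "price" field names are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.PackedLongDocValuesField;
    import org.apache.lucene.document.StringField;

    public class DocValuesDemo {
      // Store a per-document long in column-stride doc values, next to a
      // normal inverted field.
      static Document makeDoc(String id, long price) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new PackedLongDocValuesField("price", price));
        return doc;
      }
    }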

New Field APIs  

The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of the Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly used fields:
  • StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on. 

  • TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field. 

  • StoredField is a field whose value is just stored. 

  • The XXXDocValuesField classes create typed document values fields.

  • IntField, FloatField, LongField and DoubleField create typed numeric fields for efficient range queries and filters.

If none of these field classes apply, you can always create your own FieldType (typically by starting from the exposed FieldTypes of the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.

Note that the old APIs (using the Index, Store and TermVector enums) are still present (deprecated), to ease migration.

These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).
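
Here's a small sketch putting these classes together, assuming the 4.0 GA constructors (which take an explicit Field.Store flag); the field names and values are illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class FieldApiDemo {
      static Document makeDoc() {
        Document doc = new Document();
        doc.add(new StringField("id", "42", Field.Store.YES));            // one token, no norms
        doc.add(new TextField("body", "some full text", Field.Store.NO)); // tokenized, positions
        doc.add(new StoredField("timestamp", 1341100800L));               // stored only
        doc.add(new IntField("pageCount", 250, Field.Store.NO));          // numeric range queries

        // Custom behavior: start from an exposed FieldType and tweak it.
        FieldType withVectors = new FieldType(TextField.TYPE_STORED);
        withVectors.setStoreTermVectors(true);
        doc.add(new Field("title", "Lucene 4.0", withVectors));
        return doc;
      }
    }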

Other big changes  

Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).

A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).
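
A minimal sketch (the field, term and suggestion count are illustrative; DirectSpellChecker lives in the suggest module):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spell.DirectSpellChecker;
    import org.apache.lucene.search.spell.SuggestWord;

    public class SpellDemo {
      // Suggest up to 5 corrections for a possibly misspelled term,
      // straight from the main index (no sidecar spellcheck index).
      static SuggestWord[] suggest(IndexReader reader) throws IOException {
        DirectSpellChecker spell = new DirectSpellChecker();
        return spell.suggestSimilar(new Term("body", "luccene"), 5, reader);
      }
    }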

Term offsets (the start and end character position of each term) may now be stored in the postings, by indexing the field with FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).
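
A minimal sketch of enabling offsets on a field, assuming the 4.0 FieldType API:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.FieldInfo.IndexOptions;

    public class OffsetsDemo {
      // Index a text field whose postings also record start/end character
      // offsets for each term occurrence.
      static Field makeBodyField(String text) {
        FieldType ft = new FieldType(TextField.TYPE_STORED);
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        return new Field("body", text, ft);
      }
    }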

A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/, or /expression/ to search the default field.
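
For example, a minimal sketch of building the same query programmatically (the field and expression are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;

    public class RegexpDemo {
      // Programmatic equivalent of typing  body:/luc[aeiou]ne/  into the
      // classic QueryParser.
      static Query make() {
        return new RegexpQuery(new Term("body", "luc[aeiou]ne"));
      }
    }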

Optimizations  

Beyond the fun new features there are some incredible performance gains. 

If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.

If you search with a Filter, you can see gains of up to 3X (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.

If you use multiple threads for indexing, you should see stunning throughput gains (265% in one measured case), thanks to concurrent flushing. You are also now able to use an IndexWriter RAM buffer larger than 2048 MB (as long as you use multiple threads).
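
A minimal config sketch (the 4096 MB figure is just an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class BigBufferConfig {
      static IndexWriterConfig makeConfig() {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
            new StandardAnalyzer(Version.LUCENE_40));
        // Buffers over 2048 MB are now legal, provided you index with
        // multiple threads so concurrent flushing can keep up.
        iwc.setRAMBufferSizeMB(4096.0);
        return iwc;
      }
    }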

The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.

IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.

MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance, though it hasn't been directly tested.

If a BooleanQuery consists of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving a ~25% speedup.

Reducing merge IO impact  

Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
  • Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec.

  • Use the new NativeUnixDirectory which bypasses the OS's IO cache for all merge IO, by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!). 

Remember to also set swappiness to 0 on Linux if you want to maximize search responsiveness.

More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.

These changes were part of a 2011 Google Summer of Code project (thank you Varun!).

Consolidated modules  

The diverse sources, previously scattered across Lucene's and Solr's core and contrib, have been consolidated into modules. Especially noteworthy are the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).

Other changes  

Here's a long list of additional changes:
  • The classic QueryParser now interprets term~N, where N is an integer >= 1, as a FuzzyQuery with edit distance N. 

  • The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd, which can handle multi-valued fields. 

  • Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final in Analyzer); a sketch follows this list. 

  • IndexReaders are now read-only (you cannot delete documents by id, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader. 

  • The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index. 

  • Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits). 

  • String-based field cache APIs have been replaced with BytesRef-based APIs. 

  • ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed. 

  • All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application. 

  • Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead). 

  • *SpanFilter has been removed: they created too many objects during searching and were not scalable. 

  • IndexSearcher.close has been removed. IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close. 

  • You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter. 

  • FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API. 
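
Here's the Analyzer sketch referenced in the list above: a minimal custom Analyzer on the 4.0 API (the whitespace/lowercase chain is just illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    public class LowercaseWhitespaceAnalyzer extends Analyzer {
      // createComponents is invoked once per thread; Lucene caches and
      // reuses the returned components across tokenStream calls.
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        return new TokenStreamComponents(source,
            new LowerCaseFilter(Version.LUCENE_40, source));
      }
    }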