- Pre-3.0 indices are no longer supported.
- MIGRATE.txt describes how to update your application code.
- The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.
Pluggable Codec
The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.
There are some fun core codecs:
- Lucene40 is the default codec.
- Lucene3x (read-only) reads any index written with Lucene 3.x.
- SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
- MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (a primary key (id) field, a date field, etc.).
- PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
- AppendingCodec avoids seeking while writing, which is necessary for file systems such as Hadoop DFS.
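Because the postings format is customizable per field (as noted above), you can mix these. Here's a minimal sketch (assuming the 4.0 alpha Codec APIs and an existing analyzer and Directory) that uses the Memory postings format for just the primary-key field, keeping the default everywhere else:

```java
// A sketch, not the only way to do it: subclass the default codec and
// override its per-field postings format hook.
Codec codec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("id".equals(field)) {
      return PostingsFormat.forName("Memory");  // all-in-RAM FST postings
    }
    return super.getPostingsFormatForField(field);
  }
};

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
config.setCodec(codec);
IndexWriter writer = new IndexWriter(dir, config);
```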
A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.
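For example, here's a minimal sketch of walking those four dimensions for one field (assuming an open 4.0 IndexReader named reader):

```java
// fields -> terms -> documents -> (optionally) positions
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("body");
TermsEnum termsEnum = terms.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
  DocsEnum docs = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
  int docID;
  while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    // process (term, docID); use docsAndPositions() to also visit positions
  }
}
```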
Flexible scoring
Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.
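For instance, here's a minimal sketch of swapping in BM25 (assuming the 4.0 similarities module and existing analyzer and reader variables):

```java
// Use the same Similarity at index and search time.
Similarity similarity = new BM25Similarity();

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
config.setSimilarity(similarity);

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(similarity);
```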
These changes were part of a 2011 Google Summer of Code project (thank you David!).
These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.
Document Values
The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).
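A minimal sketch of adding document values at index time (the field names are made up; the classes follow the XXXDocValuesField pattern described below, whose exact names may still shift before GA):

```java
Document doc = new Document();
// fixed-width 64-bit ints, stored column-stride
doc.add(new LongDocValuesField("timestamp", 1333664040L));
// sorted byte[] values, useful e.g. for sorting without the field cache
doc.add(new SortedBytesDocValuesField("category", new BytesRef("books")));
writer.addDocument(doc);
```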
New Field APIs
The API for creating document fields has changed:
Fieldable and AbstractField have been removed, and a new FieldType, factored out of the Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
- StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.
- TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.
- StoredField is a field whose value is just stored.
- The XXXDocValuesField classes create typed document values fields.
- IntField, FloatField, LongField and DoubleField create typed numeric fields for efficient range queries and filters.
If none of these fits your needs, you can always create your own FieldType (typically by starting from the exposed FieldTypes of the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.
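Putting it together, here's a minimal sketch of the new field APIs (constructors shown as of the 4.0 alpha and may still shift before GA; thumbnailBytes is an assumed byte[]):

```java
Document doc = new Document();
doc.add(new StringField("id", "42", Field.Store.YES));           // single token, stored
doc.add(new TextField("body", "the quick brown fox", Field.Store.NO));
doc.add(new IntField("year", 2012, Field.Store.NO));             // numeric range queries
doc.add(new StoredField("thumbnail", thumbnailBytes));           // stored only

// Or build a custom FieldType, starting from an exposed one and tweaking:
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setStoreTermVectors(true);
doc.add(new Field("title", "Lucene 4.0", ft));
```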
Note that the old APIs (using the Index, Store and TermVector enums) are still present, though deprecated, to ease migration.
These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).
Other big changes
Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).
A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).
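A minimal sketch (assuming the 4.0 suggest module and an open IndexReader named reader; the field name and misspelling are made up):

```java
DirectSpellChecker spellChecker = new DirectSpellChecker();
// Top 5 corrections for a misspelled term, straight from the main index.
SuggestWord[] suggestions =
    spellChecker.suggestSimilar(new Term("body", "lucnee"), 5, reader);
for (SuggestWord suggestion : suggestions) {
  System.out.println(suggestion.string + " (freq=" + suggestion.freq + ")");
}
```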
Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).
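For example, a minimal sketch of enabling offsets on a field, using the FieldType API described above:

```java
FieldType offsetsType = new FieldType(TextField.TYPE_NOT_STORED);
// Record start/end character offsets of each term in the postings.
offsetsType.setIndexOptions(
    FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
doc.add(new Field("body", "some text to highlight later", offsetsType));
```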
A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/, or /expression/ against the default field.
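For example (a minimal sketch; the field name is made up), these two queries both match documents containing a term starting with "luc":

```java
Query wildcard = new WildcardQuery(new Term("body", "luc*"));
Query regexp = new RegexpQuery(new Term("body", "luc.*"));
// Both compile their pattern down to an automaton and run as an AutomatonQuery.
```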
Optimizations
Beyond the fun new features there are some incredible performance gains.
If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.
If you search with a Filter, you can see gains of up to 3X (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.
If you use multiple threads for indexing, you should see stunning throughput gains (265% in one case), thanks to concurrent flushing. You are also now able to use more than 2048 MB of IndexWriter RAM buffer (as long as you use multiple threads).
The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.
IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.
MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance, though it hasn't been directly tested.
If a BooleanQuery consists only of MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.
Reducing merge IO impact
Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
- Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec (see the sketch after this list).
- Use the new NativeUnixDirectory, which bypasses the OS's IO cache for all merge IO by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!)
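For example, a minimal sketch of the first option (assuming the 4.0 FSDirectory API; the path and analyzer are placeholders):

```java
FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
// Throttle merge writes to 10 MB/sec so merging doesn't starve searches.
dir.setMaxMergeWriteMBPerSec(10.0);
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_40, analyzer));
```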
More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs. merge), so you can create a custom Directory that changes its behavior depending on the context.
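For illustration, a minimal sketch of the new signatures in action (the directory path and file name are made up; IOContext.DEFAULT and IOContext.READONCE are built-in contexts):

```java
Directory dir = FSDirectory.open(new File("/path/to/dir"));

// Writes now declare their context:
IndexOutput out = dir.createOutput("custom.dat", IOContext.DEFAULT);
out.writeInt(42);
out.close();

// ...and so do reads:
IndexInput in = dir.openInput("custom.dat", IOContext.READONCE);
int value = in.readInt();
in.close();
```

A custom Directory can inspect the IOContext it receives (e.g., distinguishing merge IO from flush IO) to route each request differently; that is how NativeUnixDirectory applies direct IO only to merges.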
These changes were part of a 2011 Google Summer of Code project (thank you Varun!).
Consolidated modules
The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).
Other changes
Here's a long list of additional changes:
- The classic QueryParser now interprets term~N, where N is an integer >= 1, as a FuzzyQuery with edit distance N.
- The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd, which can handle multi-valued fields.
- Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analyzer); see the sketch after this list.
- IndexReaders are now read-only (you cannot delete documents by id, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader.
- The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.
- Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).
- String-based field cache APIs have been replaced with BytesRef-based APIs.
- ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.
- All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.
- Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).
- The *SpanFilter classes have been removed: they created too many objects during searching and were not scalable.
- Removed IndexSearcher.close: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close.
- You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.
- FieldSelector (to load only certain stored fields) has been replaced with a simpler StoredFieldVisitor API.
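As promised above, here's a minimal sketch of a reusable Analyzer under the new API (assuming the 4.0 analysis module; the tokenizer/filter chain is just an example):

```java
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // One tokenizer plus any chain of filters; Lucene reuses these per thread.
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    TokenStream filtered = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, filtered);
  }
};
```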