Mandatory, prohibited, and optional clauses
Lucene has an unusual way of combining multiple clauses in a query string. It is tempting to think of this as a mundane detail common to boolean operations in programming languages, but Lucene doesn't quite work that way.
A query expression is decomposed into a set of unordered clauses of three types:
A clause can be mandatory (for example, only artists containing the word Smashing):
+Smashing
A clause can be prohibited (for example, all documents except those with Smashing):
-Smashing
A clause can be optional:
Smashing
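The three clause types can be illustrated with a small Python sketch. It is not Lucene code: the function names are hypothetical, matching is done on whole lowercased words of an artist name, and a query made only of prohibited clauses is treated as "everything except", which is assumed here to mirror what Solr does for top-level queries.

```python
def parse_clauses(query):
    """Split a whitespace-separated query into (kind, term) pairs."""
    clauses = []
    for token in query.split():
        if token.startswith("+"):
            clauses.append(("mandatory", token[1:].lower()))
        elif token.startswith("-"):
            clauses.append(("prohibited", token[1:].lower()))
        else:
            clauses.append(("optional", token.lower()))
    return clauses


def matches(name, query):
    """Return True if the artist name satisfies the clauses of the query."""
    words = set(name.lower().split())
    clauses = parse_clauses(query)
    mandatory = [term for kind, term in clauses if kind == "mandatory"]
    prohibited = [term for kind, term in clauses if kind == "prohibited"]
    optional = [term for kind, term in clauses if kind == "optional"]

    # A single prohibited term that is present rules the document out.
    if any(term in words for term in prohibited):
        return False
    # Once a mandatory clause exists, optional clauses no longer decide
    # whether the document matches (they would only affect its score).
    if mandatory:
        return all(term in words for term in mandatory)
    # With no mandatory clause, at least one optional clause has to match.
    if optional:
        return any(term in words for term in optional)
    # Only prohibited clauses were given (assumed top-level behavior).
    return True
```

For instance, matches("The Smashing Pumpkins", "+Smashing -Atoms") is true, while matches("Green Day", "Smashing Pumpkins") is false because neither optional clause matches.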
Boolean operators
When the AND or && operator is used between clauses, both the left and right sides of the operator become mandatory, if not already marked as prohibited. So:
Smashing AND Pumpkins is equivalent to:
+Smashing +Pumpkins
Similarly, if the OR or || operator is used between clauses, both the left and right sides of the operator become optional, unless they are marked mandatory or prohibited. If the default operator is already OR, then this syntax is redundant. If the default operator is AND, then this is the only way to mark a clause as optional.
To match artist names that contain Smashing or Pumpkins try:
Smashing || Pumpkins
The NOT operator is equivalent to the - syntax. So to find artists with Smashing but not Atoms in the name, you can do this:
Smashing NOT Atoms
Sub-expressions (aka sub-queries)
You can use parentheses to compose a query of smaller queries. The following example satisfies the intent of the previous example:
(Smashing AND Pumpkins) OR (Green AND Day)
Using what we learned earlier, this could also be written as:
(+Smashing +Pumpkins) (+Green +Day)
But this is not the same as:
+(Smashing Pumpkins) +(Green Day)
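To see why the last two forms differ, here is a minimal Python sketch that evaluates nested clauses against the words of an artist name. The tuple-based query representation and the names evaluate, words_of, EITHER_BAND, and BOTH_GROUPS are illustrative assumptions, not part of Lucene or Solr.

```python
def words_of(name):
    """Lowercased words of an artist name."""
    return set(name.lower().split())


def evaluate(node, words):
    """Evaluate a term (str) or a list of (marker, child) clauses.

    marker is "+" for mandatory, "-" for prohibited, "" for optional.
    """
    if isinstance(node, str):
        return node.lower() in words
    must = [evaluate(child, words) for marker, child in node if marker == "+"]
    must_not = [evaluate(child, words) for marker, child in node if marker == "-"]
    should = [evaluate(child, words) for marker, child in node if marker == ""]
    if any(must_not):
        return False
    if must:
        return all(must)
    # No mandatory clause: at least one optional clause has to match.
    return any(should)


# (+Smashing +Pumpkins) (+Green +Day)
EITHER_BAND = [
    ("", [("+", "Smashing"), ("+", "Pumpkins")]),
    ("", [("+", "Green"), ("+", "Day")]),
]

# +(Smashing Pumpkins) +(Green Day)
BOTH_GROUPS = [
    ("+", [("", "Smashing"), ("", "Pumpkins")]),
    ("+", [("", "Green"), ("", "Day")]),
]
```

Under this model, the name The Smashing Pumpkins satisfies EITHER_BAND but not BOTH_GROUPS, because the second form requires at least one word from each parenthesized group.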
Limitations of prohibited clauses in sub-expressions
Lucene doesn't actually support a pure negative query, for example:
-Smashing -Pumpkins
Solr enhances Lucene to support this, but only for the top-level query expression, as in the example above. Consider the following admittedly strange query:
Smashing (-Pumpkins)
This query attempts to ask the question: Which artist names either contain Smashing or do not contain Pumpkins? However, it doesn't work and only matches the first clause (4 documents). The second clause should match most documents, so the query as a whole should match nearly every document. The artist named Wild Pumpkins at Midnight is the only one in my index that does not contain Smashing but does contain Pumpkins, so this query should match every document except that one. To make this work, you have to take the sub-expression containing only negative clauses and add the all-documents query clause, *:*, as shown below:
Smashing (-Pumpkins *:*)
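The effect of adding *:* can be pictured with plain set arithmetic. The following Python sketch uses a tiny made-up index (the document IDs and the four names are assumptions for illustration only) and models a negative-only sub-expression as matching nothing, which is the behavior described above.

```python
# Hypothetical four-document index: id -> artist name.
INDEX = {
    1: "The Smashing Pumpkins",
    2: "Wild Pumpkins at Midnight",
    3: "Green Day",
    4: "Smashing Atoms",
}


def docs_with(word):
    """IDs of documents whose name contains the given word."""
    return {doc_id for doc_id, name in INDEX.items()
            if word.lower() in name.lower().split()}


all_docs = set(INDEX)            # what *:* matches
smashing = docs_with("Smashing")
pumpkins = docs_with("Pumpkins")

# Smashing (-Pumpkins): the negative-only sub-expression has nothing to
# subtract from, so it contributes no documents.
without_all_docs = smashing | set()

# Smashing (-Pumpkins *:*): start from every document, then remove Pumpkins.
with_all_docs = smashing | (all_docs - pumpkins)
```

In this toy index, without_all_docs contains only the two Smashing documents, while with_all_docs contains every document except Wild Pumpkins at Midnight.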
Field qualifier
To have a clause explicitly search a particular field, precede the relevant clause with the field's name, and then add a colon. Spaces may be used in-between, but that is generally not done.
a_member_name:Corgan
This matches bands containing a member with the name Corgan. To match both Billy and Corgan:
+a_member_name:Billy +a_member_name:Corgan
Or use this shortcut to match multiple words:
a_member_name:(+Billy +Corgan)
Phrase queries and term proximity
A clause may be a phrase query (a contiguous series of words to be matched in that order) instead of just one word at a time. In the previous examples, we've searched for text containing multiple words like Billy and Corgan, but let's say we wanted to match Billy Corgan (that is the two words adjacent to each other in that order). This further constrains the query. Double quotes are used to indicate a phrase query, as shown below:
"Billy Corgan"
Related to phrase queries is the notion of term proximity, also known as the slop factor or a near query. In our previous example, if we wanted to permit these words to be separated by no more than, say, three words in between, then we could do this:
"Billy Corgan"~3
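As a rough illustration of the slop factor, the Python sketch below checks whether two words appear in order with at most a given number of words between them. It is a deliberate simplification: the function name is hypothetical, only two-word phrases are handled, and the way Lucene's actual slop calculation treats reordered words is not modeled.

```python
def phrase_match(text, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` words between."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        0 <= q - p - 1 <= slop      # number of words strictly in between
        for p in first_positions
        for q in second_positions
    )
```

With slop=0 this behaves like the exact phrase "Billy Corgan"; with slop=3 a name such as Billy Patrick Corgan also matches under this simplified model.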
Wildcard queries
A Lucene index fundamentally stores analyzed terms (words after lowercasing and other processing), and that is generally what you are searching for. However, if you really need to, you can search on partial words. But there are issues with this:
...
To find artists containing words starting with Smash, you can do:
smash*
Or perhaps those starting with sma and ending with ing:
sma*ing
The asterisk matches any number of characters (perhaps none). You can also use ? to force a match of any character at that position:
sma??*
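The meaning of * and ? here is the same as in shell-style patterns, so Python's standard fnmatch module can be used to try the three patterns against a list of terms. This is only a sketch of the matching rule, not of how Lucene executes wildcard queries; the term list is made up, and note that fnmatch additionally gives [ ] a special meaning that the query syntax does not.

```python
from fnmatch import fnmatchcase

# Hypothetical analyzed (lowercased) terms from an index.
TERMS = ["smash", "smashing", "smashed", "small", "smoking"]


def wildcard(pattern, terms=TERMS):
    """Return the terms matching a pattern that uses * and ?."""
    return [term for term in terms if fnmatchcase(term, pattern)]
```

Here wildcard("sma??*") keeps every term that starts with sma and has at least two more characters, so it drops only smoking.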
Fuzzy queries
Fuzzy queries are useful when your search term needn't be an exact match, but the closer the better. The fewer the character insertions, deletions, or substitutions relative to the search term length, the better the score. The algorithm used is known as the Levenshtein distance algorithm. Fuzzy queries suffer from some of the same problems as the wildcard queries just described, but less severely.
For example:
Smashing~
Notice the tilde character at the end. Without this notation, simply Smashing would match only four documents because only that many artist names contain that word. Smashing~ matched 578 words, and it took my computer 706 milliseconds. You can modify the proximity threshold, which is a number between 0 and 1, defaulting to 0.5. For instance, changing the proximity to a more stringent 0.7:
Smashing~0.7
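To make the threshold more concrete, here is a Python sketch of the Levenshtein distance together with one way of turning it into a similarity between 0 and 1. The normalization (dividing the distance by the length of the shorter word) is an assumption made for illustration; the text above does not specify the exact formula Lucene uses.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 minus distance over the shorter length."""
    if a == b:
        return 1.0
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return max(0.0, 1.0 - levenshtein(a, b) / shortest)
```

Under this assumed formula, slashing is one substitution away from smashing and scores 0.875, so it would pass both the 0.5 and the 0.7 thresholds.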
Range queries
Lucene lets you query for numeric, date, and even text ranges. The following query matches all of the bands formed in the 1990s:
a_type:2 AND a_begin_date:[1990-01-01T00:00:00.000Z TO 1999-12-31T23:59:59.999Z]
t_duration:[300000 TO *]
somefield:[B TO C]
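Date range bounds are easy to mistype by hand, so it can help to generate them. The following Python sketch builds an inclusive range clause covering one decade; the helper names solr_timestamp and decade_range are hypothetical, and the timestamp layout simply copies the one used in the example above.

```python
from datetime import datetime, timedelta, timezone


def solr_timestamp(moment):
    """Format a UTC datetime with millisecond precision and a trailing Z."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def decade_range(field, first_year):
    """Inclusive range clause from the first to the last instant of a decade."""
    start = datetime(first_year, 1, 1, tzinfo=timezone.utc)
    # Last millisecond before the next decade begins.
    end = (datetime(first_year + 10, 1, 1, tzinfo=timezone.utc)
           - timedelta(milliseconds=1))
    return f"{field}:[{solr_timestamp(start)} TO {solr_timestamp(end)}]"
```

Calling decade_range("a_begin_date", 1990) produces the date clause for bands formed in the 1990s.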
Date math
Solr extended Lucene with some date-time math that is especially useful in specifying date ranges. In addition, there is a way to specify the current date-time using NOW. The syntax offers addition, subtraction, and rounding at various levels of date granularity (years, seconds, and so on). The operations can be chained together as needed, in which case they are executed from left to right. Spaces aren't allowed. For example:
r_event_date:[* TO NOW-2YEAR]
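The left-to-right evaluation can be sketched in Python as follows. This is a simplified stand-in for Solr's own parser, under several assumptions: only the units YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND (with an optional trailing S) are handled, rounding is written as a slash followed by a unit and always rounds down, and a day that does not exist in the target month is clamped to that month's last day. The function names are hypothetical.

```python
import calendar
import re
from datetime import datetime, timedelta, timezone

FIELDS = ["year", "month", "day", "hour", "minute", "second"]
FLOOR = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}
TOKEN = re.compile(r"([+-]\d+|/)(YEAR|MONTH|DAY|HOUR|MINUTE|SECOND)S?")


def shift_months(moment, months):
    """Add (or subtract) calendar months, clamping the day if needed."""
    total = moment.year * 12 + (moment.month - 1) + months
    year, month_index = divmod(total, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return moment.replace(year=year, month=month_index + 1,
                          day=min(moment.day, last_day))


def add_units(moment, amount, unit):
    if unit == "year":
        return shift_months(moment, 12 * amount)
    if unit == "month":
        return shift_months(moment, amount)
    return moment + timedelta(**{unit + "s": amount})


def round_down(moment, unit):
    """Reset every field finer than `unit` to its smallest value."""
    finer = FIELDS[FIELDS.index(unit) + 1:]
    return moment.replace(microsecond=0, **{name: FLOOR[name] for name in finer})


def evaluate_date_math(expression, now):
    """Apply each operation after NOW in order, from left to right."""
    if not expression.startswith("NOW"):
        raise ValueError("expression must start with NOW")
    moment, rest, position = now, expression[3:], 0
    while position < len(rest):
        match = TOKEN.match(rest, position)
        if match is None:
            raise ValueError(f"cannot parse {rest[position:]!r}")
        operation, unit = match.group(1), match.group(2).lower()
        if operation == "/":
            moment = round_down(moment, unit)
        else:
            moment = add_units(moment, int(operation), unit)
        position = match.end()
    return moment
```

Passing the current time explicitly (rather than reading the clock inside the function) keeps the sketch deterministic, which makes it easy to check what an expression such as NOW-2YEAR resolves to.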
Score boosting
You can easily modify the degree to which a clause in the query string contributes to the ultimate score by adding a multiplier. This is called boosting. A value between 0 and 1 reduces the score, and numbers greater than 1 increase it.
Scoring details are described later in this chapter. In the following example, we search for artists (a band is a type of artist in MusicBrainz) that either have a member named Billy, or have a name containing the word Smashing.
a_member_name:Billy^2 OR Smashing
Here we search for artists named Billy, and either Bob or Corgan, but we're less interested in those that are also named Corgan:
+Billy Bob Corgan^0.7
Existence (and non-existence) queries
This is actually not a new syntax case, but an application of range queries. Suppose you wanted to match all of the documents that have a value in a field (whatever that value is, it doesn't matter). Here we find all of the documents that have a_name:
a_name:[* TO *]
As a_name is the default field, just [* TO *] will do.
This can be negated to find documents that do not have a value for a_name, as shown below:
-a_name:[* TO *]
Escaping special characters
The following characters are used by the query syntax, as described in this chapter:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
In order to use any of these without their syntactical meaning, you need to escape them with a preceding \:
id:Artist\:11650
In some cases such as this one where the character is part of the text that is indexed, the double-quotes phrase query will also work, even though there is only one term:
id:"Artist:11650"
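When query text comes from user input, escaping is usually done programmatically. The Python sketch below prefixes each special character with a backslash; the function name is hypothetical, and escaping every single & and | (rather than only the doubled && and || operators) is a conservative assumption.

```python
# Characters given special meaning by the query syntax.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_text(text):
    """Put a backslash in front of every special character."""
    return "".join(
        "\\" + char if char in SPECIAL_CHARACTERS else char
        for char in text
    )
```

For example, escape_query_text("Artist:11650") yields the value with the colon escaped, ready to be placed after the id: field qualifier.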
Filtering
Filtering in Solr is really quite simple. Let's say you are dispatching a user's query to Solr, but you want to limit the scope of that query further than what the query might be doing. As an example, let's say we wanted to make a search form for MusicBrainz that lets the user search for bands, not individual artists. Let's also say that the user's query string is Green. In the index, a_type is 1 for an individual, 2 for a band, and 0 if unknown. Therefore, a query that finds non-individuals, combining the restricting clauses with the user's query, would be this:
+Green +type:Artist -a_type:1
However, you should not use this approach.
Instead, leave the user's query in q and put each restriction in its own fq query parameter:
q=Green&fq=type%3AArtist&fq=-a_type%3A1
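A client would normally assemble these parameters with a URL-encoding helper instead of by hand. Here is a minimal Python sketch using the standard library; the function name build_query_string is hypothetical, and only the parameter part of the URL is produced (the host and path of the Solr instance are left out).

```python
from urllib.parse import urlencode


def build_query_string(user_query, filters):
    """Encode q plus one fq parameter per filter, in the given order."""
    parameters = [("q", user_query)] + [("fq", f) for f in filters]
    return urlencode(parameters)
```

Passing the parameters as a list of pairs, rather than a dictionary, is what allows the fq key to be repeated.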
Sorting
The sort order is specified with the sort query parameter. The default is to sort by score in descending order. To sort by score in ascending order instead, you would put this in the URL:
sort=score+asc
In the following example, suppose we searched for artists that are not individuals (a previous example in the chapter), then we might want to ensure that those that are surely bands get top placement ahead of those that are unknown (2's then 0's). Secondly, we want the typical score descending search. This would simply be:
sort=a_type+desc,score+desc
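The effect of a two-part sort specification can be mimicked on a list of results in Python. The documents and scores below are made up for illustration, and the sketch only reproduces the ordering rule (a_type descending, then score descending), not anything Solr does internally.

```python
# Hypothetical search results with their type and relevance score.
RESULTS = [
    {"name": "Green Lizard", "a_type": 0, "score": 2.4},
    {"name": "Green Day", "a_type": 2, "score": 1.9},
    {"name": "Green River", "a_type": 2, "score": 2.1},
    {"name": "Green Apple", "a_type": 0, "score": 0.7},
]


def sort_results(docs):
    """Order by a_type descending, breaking ties by score descending."""
    return sorted(docs, key=lambda doc: (-doc["a_type"], -doc["score"]))
```

Known bands (a_type 2) come first regardless of score, and within each a_type the higher-scoring documents lead.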