<p>In Solr, we’re normally used to stopwords just kind of magically working. If you enter a stopword in a query, it is silently stripped out and ignored (unlike my legacy OPAC, which gives you zero results whenever you include a stopword!). If you include a stopword in a <em>phrase</em> search, it does even better: “kill a mockingbird” basically turns into “kill * mockingbird”, that is, kill and mockingbird separated by one word, and it successfully matches documents indexed with “kill a mockingbird” (along with any other “kill * mockingbird”).</p>
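<p>To make the “kill * mockingbird” behavior concrete, here is a minimal Python sketch of the idea, not Solr’s actual code. It assumes the analyzer drops stopwords but keeps the original token positions, so a removed stopword leaves a one-position hole. The stopword list and the function names are made up for illustration.</p>

```python
# Toy model of phrase matching when stopwords are removed but positions are kept.
STOPWORDS = {"a", "an", "the", "of"}  # hypothetical stopword list


def analyze_with_positions(text):
    """Lowercase, split on whitespace, drop stopwords, keep original positions."""
    return [(pos, tok) for pos, tok in enumerate(text.lower().split())
            if tok not in STOPWORDS]


def phrase_matches(phrase, document):
    """True if the document has the phrase terms at the same relative offsets."""
    query = analyze_with_positions(phrase)
    if not query:
        return False
    positions = {}
    for pos, tok in analyze_with_positions(document):
        positions.setdefault(tok, set()).add(pos)
    first_pos, first_tok = query[0]
    for start in positions.get(first_tok, ()):
        if all(start + (pos - first_pos) in positions.get(tok, ())
               for pos, tok in query):
            return True
    return False


print(phrase_matches("kill a mockingbird", "To Kill a Mockingbird"))  # True
print(phrase_matches("kill a mockingbird", "kill the mockingbird"))   # True
print(phrase_matches("kill a mockingbird", "kill mockingbird"))       # False
```

<p>The hole left by “a” is what makes the last case fail: the two remaining words have to be exactly one position apart, and any stopword in between will do.</p>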
<p></p>
<p>Great! So normally we don’t have to think about it too much.</p>
<p></p>
<p>An exception is when you throw dismax into the mix. Dismax lets you search multiple Solr fields at once (the qf parameter). It also lets you search with a multi-clause query where, depending on your “mm” settings, only SOME of those clauses have to match for a document to be included in the hit list.</p>
<p></p>
<p>So you have multiple Solr fields involved. As long as each of those fields is configured for stopwords (and the <em>same</em> stopwords), everything Just Works the way you’d expect. But if one of those fields does <em>not</em> have stopwords configured, then (depending on your mm settings) you can easily end up getting zero hits for any query with a (non-phrase) clause that is a stopword. This kind of makes sense when you think about it: since at least one field didn’t strip the stopword, a clause for the stopword you entered stays in the query. That clause can’t possibly match on any of your stopword-filtered fields, because the word was never indexed there, so it can only match in the fields without stopwords. Depending on your mm (and the contents of all your fields, phew), that unmatched clause will result in no hits.</p>
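<p>Here is a rough Python simulation of that failure mode. It is a toy model under stated assumptions, not how Solr is implemented: each query word becomes one clause, a field contributes to a clause only if its analyzer keeps the word, and <code>mm</code> is simplified to a fraction of clauses that must match (real mm syntax is richer). The field names, stopword list and sample document are invented.</p>

```python
# Toy model of dismax clause building across fields with different stopword settings.
STOPWORDS = {"a", "an", "the", "of"}  # hypothetical stopword list


def analyze(text, use_stopwords):
    """Lowercase and split; optionally drop stopwords."""
    tokens = text.lower().split()
    if use_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def build_clauses(query, field_config):
    """One clause per query word; each clause lists the (field, token) pairs that survive analysis."""
    clauses = []
    for word in query.split():
        alternatives = [(field, tok)
                        for field, use_sw in field_config.items()
                        for tok in analyze(word, use_sw)]
        if alternatives:  # a word dropped by every field yields no clause at all
            clauses.append(alternatives)
    return clauses


def search(query, docs, field_config, mm=1.0):
    """Return docs matching at least mm (a fraction, simplified) of the clauses."""
    clauses = build_clauses(query, field_config)
    if not clauses:
        return []
    required = max(1, int(mm * len(clauses)))
    hits = []
    for doc in docs:
        indexed = {f: set(analyze(doc.get(f, ""), sw))
                   for f, sw in field_config.items()}
        matched = sum(any(tok in indexed[f] for f, tok in clause)
                      for clause in clauses)
        if matched >= required:
            hits.append(doc)
    return hits


DOCS = [{"title": "To Kill a Mockingbird", "author": "Harper Lee"}]
ALL_STOP = {"title": True, "author": True}    # every qf field filters stopwords
NO_STOP = {"title": False, "author": False}   # no qf field filters stopwords
MIXED = {"title": True, "author": False}      # the problem configuration

print(len(search("a mockingbird", DOCS, ALL_STOP)))       # 1
print(len(search("a mockingbird", DOCS, NO_STOP)))        # 1
print(len(search("a mockingbird", DOCS, MIXED)))          # 0
print(len(search("a mockingbird", DOCS, MIXED, mm=0.5)))  # 1
```

<p>In the mixed case the clause for “a” survives only as <code>author:a</code>, which this document can’t satisfy, so requiring all clauses gives zero hits. Loosening mm brings the hit back.</p>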
<p></p>
<p>A bit more information in <a href="http://n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html">this solr listserv thread</a>.</p>
<p></p>
<p>If you have fields included in a dismax qf that all have stopwords configured, but with <em>different</em> stopwords lists, the results could be even <em>more</em> confusing.</p>
<p></p>
<p>The solution?</p>
<p></p>
<p>If you are using dismax, make sure all fields included in a qf have <em>exactly the same</em> stopwords settings. Either they all need to have stopwords configured with the same stopwords file, or they all need to have stopwords <em>not</em> configured.</p>
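<p>One way to check this is to scan your schema for mismatches. The sketch below is only an illustration: the embedded schema is a made-up sample, the function names are hypothetical, and it simplifies by lumping index-time and query-time analyzers together and ignoring dynamic fields. It groups the qf fields by the <code>StopFilterFactory</code> settings of their field types and reports when there is more than one group.</p>

```python
# Sketch: flag dismax qf fields whose field types have different stopword settings.
import xml.etree.ElementTree as ET

# Made-up sample schema, for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.2">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_nostop" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="author" type="text_nostop" indexed="true" stored="true"/>
    <field name="subject" type="text" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def stopword_settings(schema_xml):
    """Map each field name to the set of (words, ignoreCase) stop filter settings of its type."""
    root = ET.fromstring(schema_xml)
    by_type = {}
    for field_type in root.iter("fieldType"):
        stops = set()
        for flt in field_type.iter("filter"):
            if flt.get("class", "").endswith("StopFilterFactory"):
                stops.add((flt.get("words"), flt.get("ignoreCase")))
        by_type[field_type.get("name")] = frozenset(stops)
    return {f.get("name"): by_type.get(f.get("type"), frozenset())
            for f in root.iter("field")}


def inconsistent_qf(schema_xml, qf_fields):
    """Return groups of qf fields that differ in stopword settings, or [] if all agree."""
    settings = stopword_settings(schema_xml)
    groups = {}
    for entry in qf_fields:
        name = entry.split("^")[0]  # drop a boost such as title^2
        groups.setdefault(settings.get(name, frozenset()), []).append(name)
    return list(groups.values()) if len(groups) > 1 else []


print(inconsistent_qf(SAMPLE_SCHEMA, ["title", "subject"]))   # []
print(inconsistent_qf(SAMPLE_SCHEMA, ["title^2", "author"]))  # [['title'], ['author']]
```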
<p></p>
<p>Just not using stopwords seems like the simplest solution to me. What’s the reason for stopwords in the first place? Generally performance: a very common word ends up with a huge result set whenever there’s a search clause on that word, which slows down Lucene/Solr. My Solr is not as performant as I’d like, it’s true, but there are <em>a whole bunch</em> of different things I really need to look at for performance (so many that it’s kind of overwhelming to consider, honestly). Since using stopwords would make my Solr configuration more confusing and error-prone, assuming without any profiling that the lack of stopwords is my most important bottleneck seems like a kind of “premature optimization”. So no stopwords for now.</p>
<p></p>
<p>Erik Hatcher suggested in an IRC chat that if very common words are a performance bottleneck, it might make more sense to investigate Solr’s (or Lucene’s?) “commongrams capability” rather than stopwords. I need to put that on my list to look into. I know little about it: I get the basic concept, but don’t know how it’s implemented in Solr/Lucene or how to set it up.</p>