Advanced Elasticsearch Highlighting: A High-Performance Highlighter That Makes Elasticsearch Fly

This article looks at the Lucene highlighting module used by both Elasticsearch and Solr, compares the characteristics of the highlighter (plain), fast-vector-highlighter, and postings-highlighter, and introduces a custom fast-highlighter that keeps storage overhead low while dramatically speeding up highlighting on large text fields.


	
       In many search scenarios, highlighting the matched terms in results noticeably improves the user experience. The popular enterprise search engines Elasticsearch and Solr both provide highlighting, and in both cases the feature comes from Lucene's highlight module. Lucene can highlight search terms in one or more fields and supports three highlighters: highlighter (plain), fast-vector-highlighter, and postings-highlighter. The plain highlighter is the default in Solr, and it is likewise the default in Elasticsearch.
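All three can be selected per field in the highlight block of a search request. A minimal sketch, assuming an index with a text field (the index and field names are illustrative):

GET /my_index/_search
{
  "query": { "match": { "text": "国美电器" } },
  "highlight": {
    "fields": {
      "text": { "type": "plain" }
    }
  }
}

Changing "type" to "fvh" or "postings" switches to the other two highlighters, provided the index stores the extra data they require, as described below.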

highlighter

     The highlighter method, also called plain highlighting, has both strengths and weaknesses; the weaknesses first. The plain highlighter re-analyzes text at query time: once ES has collected the matching doc IDs, it loads each field to be highlighted into memory, runs the field's analyzer over it to re-tokenize the text, then uses a similarity algorithm to score candidate fragments and returns the top n highest-scoring ones with highlight markup. Take the ansj analyzer as an example: the official figure is 600,000-800,000 characters per second, but real servers (whose clock speeds are usually modest) fall short of that, and in production ansj typically manages 400,000-500,000 characters per second. Now suppose users search large documents and each page shows 40 hits of about 20 KB each: that is roughly 800,000 characters to re-analyze per page, so even if similarity scoring and result sorting cost nothing, highlighting alone drags the query out to nearly two seconds, which is hard to tolerate.

   The plain highlighter's strength comes from the same real-time analysis: it needs little I/O and little storage (with a reasonably complete dictionary it uses about half the space of the fvh approach), trading CPU for I/O relief. On short fields, such as article titles, it is fast, and because it touches the disk rarely it keeps I/O pressure low, which helps overall system throughput.

fast-vector-highlighter
      To fix the plain highlighter's performance problem on large text fields, Lucene's highlight module offers a term-vector-based highlighter, fast-vector-highlighter (fvh). To use it, the index must be configured at indexing time to store term vectors with positions and offsets; a mapping sketch follows the steps below. At highlight time, fast-vector-highlighter works like this:
    1. Parse the highlight query and extract the set of query terms to highlight.
    2. Read the document field's term vectors from disk.
    3. Walk the term-vector set and pick out the vectors of terms that appear in the query expression.
    4. For each extracted term vector, read its frequency information and, from that, each position and offset.
    5. Use the similarity algorithm to select the top n highest-scoring highlight fragments.
    6. Read the field content (multi-valued fields are joined with spaces) and use the extracted term vectors to locate and cut out the highlight fragments directly. (Note: Lucene's stock highlighting has bugs here, in both the core and highlighter modules; I have previously written about how to fix them.)
     As this shows, fast-vector-highlighter skips real-time analysis but adds disk reads, so it too has its pros and cons.
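For reference, a minimal mapping sketch that stores term vectors with positions and offsets, which is what fvh needs at highlight time (the index, type, and field names mirror the test index used later in this article, but the mapping itself is my own illustration):

PUT /test_v1
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}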

Drawbacks:

(1) fast-vector-highlighter requires term vectors to be stored, and in a system with a rich vocabulary, storing term vectors roughly doubles the space used compared with not storing them.

(2) fast-vector-highlighter issues at least twice as many I/O operations as plain highlighting and reads at least twice as many bytes; the extra I/O requests reduce the search engine's concurrency.

Advantages:
(1) When real-time analysis is slower than random disk reads, fetching term vectors from disk gives fast-vector-highlighter a clear edge. For example, the ansj analyzer takes about two seconds to process a one-million-character document, while a typical enterprise hard disk at roughly 10,000 RPM can perform on the order of 160 seeks per second, so one seek plus a 20 KB read costs about 7-10 ms. Reading 2 MB from disk in 40 reads therefore takes roughly 300 ms, and repeat reads are served from the I/O cache and are faster still. Compared with the plain method, fvh has a clear advantage when field content is large.

       The default plain highlighter is compact on disk but slow on large fields; fvh is fast on large fields but hungry for space. Is there a compromise that neither occupies too much space nor analyzes large fields too slowly? There is: Lucene also provides the postings-highlighter. It likewise avoids re-analysis at query time, but unlike fvh it stores no separate term vectors at all; instead it records term offsets directly in the postings lists, so on medium and large fields it saves roughly 20-30% of the storage fvh needs. In practice, though, neither its strengths nor its weaknesses stand out, so highlighting small fields with the plain highlighter and large fields with fast-vector-highlighter covers most needs.
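For comparison, a sketch of the mapping that enables the postings highlighter; the offsets go into the postings lists via index_options rather than into term vectors (names again illustrative):

PUT /test_postings
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "index_options": "offsets"
        }
      }
    }
  }
}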

       So out of the box, Lucene's default plain highlighter is compact but too slow on large text, while fvh is fast but costs too much disk space and I/O; in production, neither system throughput nor storage lands at a satisfying level. To keep the storage footprint of the default highlighter while beating fast-vector-highlighter's speed, I wrote my own highlighter, modeled on the structure of Lucene's highlighters, and named it fast-highlighter.

 fast-highlighter consists of several parts:
 1. FastPlainHighlighter, the plugin entry point invoked by the ES environment, which handles the environment plumbing.
 2. TreeAnalysis, a high-performance analyzer that uses the terms extracted by the FieldQuery class as its dictionary.
 3. Classes that compute the highlight fragments and handle their encoding and decoding.
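Assuming the plugin registers the highlighter under the name fast-highlighter (the actual registration name depends on how the plugin wires itself into ES, so treat this as an assumption), selecting it looks like selecting any other highlighter type:

GET /test_v2/_search
{
  "query": { "match": { "text": "国美电器" } },
  "highlight": {
    "fields": {
      "text": { "type": "fast-highlighter" }
    }
  }
}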

 
 The tricky parts to implement:
 1. Phrase highlighting (a quoted phrase is split into several terms, and only terms whose positions line up consecutively may be highlighted).

 2. Optimal fragment selection (computing the top n fragments that match best or contain the most highlighted terms).

 3. Matching that ignores letter case, and matching that treats full-width and half-width characters as equivalent.
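Point 3 rests on a cheap normalization trick that also appears in the TreeAnalysis code further down: lowercase the text, and map full-width characters (U+FF01-U+FF5E) onto their half-width ASCII counterparts by subtracting 0xFEE0 (65248). A minimal standalone sketch of the idea:

public final class CharNormalizer
{
	// Full-width forms '！' (U+FF01) through '～' (U+FF5E) mirror ASCII
	// '!' through '~' at a fixed distance of 0xFEE0, so one subtraction
	// converts a full-width character to its half-width equivalent.
	public static char normalize(char ch)
	{
		if (ch > '\uFF00' && ch < '\uFF5F')
		{
			ch = (char) (ch - 0xFEE0);
		}
		return Character.toLowerCase(ch);
	}

	public static void main(String[] args)
	{
		String fullWidth = "ＥｌａｓｔｉｃＳｅａｒｃｈ";
		StringBuilder sb = new StringBuilder();
		for (char c : fullWidth.toCharArray())
		{
			sb.append(normalize(c));
		}
		System.out.println(sb); // prints: elasticsearch
	}
}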


 Testing the code.

Test setup:

   (1) An index of documents about 10 KB each (total index size 1.5 GB).

   (2) Search for articles containing the keyword "国美电器" and return 40 hits with highlighting.
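The test request was roughly of the following shape; the exact fragment settings are not given here, so this is a reconstruction (for the second run, point it at test_v2 and set the type to fast-highlighter):

GET /test_v1/_search
{
  "size": 40,
  "query": { "match": { "text": "国美电器" } },
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": {
      "text": { "type": "fvh" }
    }
  }
}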


Test results:

With fast-vector-highlighter, highlighting took 336 ms:

{
  "took": 336,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v1",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "且主要品类零售额增速均高于上年同期水平。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日公司宣布正式推出全球首家专业 VR 影院,地点位于国美旗下大中<b>电器</b>北京马甸店。<b>国美</b> VR 影院将打破售票入场形式,采用“时间售卖”的方式,正式对外营业后一小时将收费"
          ]
        }
      }



With fast-highlighter, highlighting took 132 ms:

{
  "took": 132,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v2",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "家用<b>电器</b>类零售额同比增长 1.6%,相比上年同期加快了 11.8 个百分点。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日,公司宣布正式推出全球首家专业 VR 影院,地点位于<b>国美</b>旗下大中<b>电器</b>北京马甸店。<b>国美</b>"
          ]
        }
      }
As the results show, this custom highlighter delivers more than twice the performance of the fvh highlighter. The core code of fast-highlighter follows.

package org.elasticsearch.search.highlight;

import com.google.common.collect.Maps;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.vectorhighlight.BoundaryScanner;
import org.apache.lucene.search.vectorhighlight.CustomFieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.SimpleBoundaryScanner;
import org.apache.lucene.search.vectorhighlight.FieldQuery.Phrase;
import org.apache.lucene.util.BytesRefHash;
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.search.fetch.FetchPhaseExecutionException;
import org.elasticsearch.search.fetch.FetchSubPhase;
import org.elasticsearch.search.internal.SearchContext;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * 
 * @author jkuang.nj
 *
 */
public class FastPlainHighlighter implements Highlighter
{
	private static final String CACHE_KEY = "highlight-fast";
	// NUL character used as a separator when encoding term gaps into the phrase-match buffer
	public static final char mark = 0;
	private static final SimpleBoundaryScanner DEFAULT_BOUNDARY_SCANNER = new SimpleBoundaryScanner();

	@Override
	public HighlightField highlight(HighlighterContext highlighterContext)
	{
		SearchContextHighlight.Field field = highlighterContext.field;
		SearchContext context = highlighterContext.context;
		FetchSubPhase.HitContext hitContext = highlighterContext.hitContext;
		FieldMapper mapper = highlighterContext.mapper;
		Encoder encoder = field.fieldOptions().encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT;

		if (!hitContext.cache().containsKey(CACHE_KEY))
		{
			hitContext.cache().put(CACHE_KEY, new HighlighterEntry());
		}

		HighlighterEntry cache = (HighlighterEntry) hitContext.cache().get(CACHE_KEY);
		try
		{
			FieldQuery fieldQuery;
			if (field.fieldOptions().requireFieldMatch())
			{
				if (cache.fieldMatchFieldQuery == null)
				{
					cache.fieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
							field.fieldOptions().requireFieldMatch());
				}
				fieldQuery = cache.fieldMatchFieldQuery;
			}
			else
			{
				if (cache.noFieldMatchFieldQuery == null)
				{
					cache.noFieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
							field.fieldOptions().requireFieldMatch());
				}
				fieldQuery = cache.noFieldMatchFieldQuery;

			}
			if (!cache.analysises.containsKey(field.field()))
			{
				cache.setPhrases(field.field(), fieldQuery.getPhrases(field.field()));
				cache.setWords(field.field(), fieldQuery.getTermSet(field.field()));
			}
			FastHighlighter entry = cache.mappers.get(mapper);
			if (entry == null)
			{

				BoundaryScanner boundaryScanner = DEFAULT_BOUNDARY_SCANNER;
				if (field.fieldOptions().boundaryMaxScan() != SimpleBoundaryScanner.DEFAULT_MAX_SCAN
						|| field.fieldOptions().boundaryChars() != SimpleBoundaryScanner.DEFAULT_BOUNDARY_CHARS)
				{
					boundaryScanner = new SimpleBoundaryScanner(field.fieldOptions().boundaryMaxScan(), field.fieldOptions().boundaryChars());
				}
				CustomFieldQuery.highlightFilters.set(field.fieldOptions().highlightFilter());
				entry = new FastHighlighter(encoder, boundaryScanner);
				cache.mappers.put(mapper, entry);
			}

			String[] fragments;
			int numberOfFragments = field.fieldOptions().numberOfFragments() == 0 ? 1 : field.fieldOptions().numberOfFragments();
			int fragmentCharSize = field.fieldOptions().numberOfFragments() == 0 ? 50 : field.fieldOptions().fragmentCharSize();
			List<Object> textsToHighlight = null;
			try
			{
				textsToHighlight = HighlightUtils.loadFieldValues(field, mapper, context, hitContext);
				StringBuilder buffer = new StringBuilder();
				for (Object textToHighlight : textsToHighlight)
				{
					String text = textToHighlight.toString();
					buffer.append(text).append(" ");
				}
				fragments = entry.getBestBestFragments(cache.analysises.get(field.field()), cache.phrases.get(field.field()), buffer,
						numberOfFragments, fragmentCharSize, field.fieldOptions().preTags(), field.fieldOptions().postTags());
			}
			catch (Exception e)
			{
				e.printStackTrace();
				if (ExceptionsHelper.unwrap(e, BytesRefHash.MaxBytesLengthExceededException.class) != null)
				{
					return null;
				}
				else
				{
					throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
				}
			}

			if (fragments != null && fragments.length > 0)
			{
				return new HighlightField(highlighterContext.fieldName, Text.convertFromStringArray(fragments));
			}

			int noMatchSize = highlighterContext.field.fieldOptions().noMatchSize();
			if (noMatchSize > 0 && textsToHighlight.size() > 0)
			{
				String fieldContents = textsToHighlight.get(0).toString();
				return new HighlightField(highlighterContext.fieldName,
						new Text[] { new Text(fieldContents.substring(0, Math.min(fragmentCharSize, fieldContents.length()))) });
			}

			return null;

		}
		catch (Exception e)
		{
			throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
		}
	}

	@Override
	public boolean canHighlight(FieldMapper fieldMapper)
	{
		return true;
	}

	private class HighlighterEntry
	{
		public FieldQuery noFieldMatchFieldQuery;
		public FieldQuery fieldMatchFieldQuery;
		public Map<String, Set<Phrase>> phrases = new HashMap<>();
		public Map<FieldMapper, FastHighlighter> mappers = Maps.newHashMap();
		public Map<String, TreeAnalysis> analysises = new HashMap<>();

		public void setPhrases(String field, Set<Phrase> phrases)
		{
			if(!this.phrases.containsKey(field)){
				this.phrases.put(field, phrases);
			}
		}

		public void setWords(String field, Set<String> words)
		{
			if (!analysises.containsKey(field))
			{
				TreeAnalysis analysis = new TreeAnalysis();
				if (words != null && words.size() > 0)
				{
					for (String word : words)
					{
						analysis.add(word);
					}
				}
				analysises.put(field, analysis);
			}

		}
	}

	static class FragmentScore implements Comparable<FragmentScore>
	{
		int point = 0;
		int distance = 0;
		List<Term> terms = new ArrayList<>();
		HashSet<String> set = new HashSet<>();
		StringBuffer buffer = new StringBuffer();

		public FragmentScore(int distance)
		{
			this.distance = distance;
		}

		public void updateScore(Set<Phrase> phrases)
		{
			for (Phrase phrase : phrases)
			{
				if (buffer.indexOf(phrase.toString()) >= 0)
				{
					this.point += 5 * phrase.list.size();
				}
			}
		}

		public boolean add(Term term)
		{
			if (terms.size() == 0 || term.pos - terms.get(0).pos <= distance)
			{
				if (terms.size() == 0)
				{
					buffer.append(term.word);
				}
				else
				{
					// encode the positional gap to the previous term so that
					// phrase matches can later be found in updateScore()
					int dis = term.pos - terms.get(terms.size() - 1).pos;
					buffer.append(mark).append(dis).append(mark);
					buffer.append(term.word);
					// bonus for a term directly adjacent to the previous one
					if (dis == 1)
					{
						this.point += 2;
					}
				}
				// bonus for each distinct highlighted word; note the original
				// guarded this with set.size() > 0 and compared against the
				// term just added, which left both bonuses unreachable
				if (!set.contains(term.word))
				{
					this.point += 2;
					set.add(term.word);
				}
				terms.add(term);
				this.point += term.length();
				return true;
			}
			return false;
		}

		@Override
		public int compareTo(FragmentScore o)
		{
			return -(this.point - o.point);
		}

	}



	public class FastHighlighter
	{
		BoundaryScanner boundaryScanner;
		Encoder encoder;

		public FastHighlighter(Encoder encoder, BoundaryScanner boundaryScanner)
		{
			this.encoder = encoder;
			this.boundaryScanner = boundaryScanner;
		}

		public String[] getBestBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, StringBuilder buffer, int maxNumFragments, int fragmentSize,
				String[] preTags, String[] postTags)
		{
			List<FragmentScore> fragmentScores;
			if (maxNumFragments <= 1)
			{
				fragmentScores = getBestFragments(analyzer, phrases, buffer.toString(), fragmentSize);
			}
			else
			{
				fragmentScores = getBestFragments(analyzer, buffer.toString(), maxNumFragments, fragmentSize);

			}
			return toString(buffer, fragmentSize, fragmentScores, preTags, postTags);
		}

		public String[] toString(StringBuilder buffer, int fragmentSize, List<FragmentScore> fragmentScores, String[] preTags, String[] postTags)
		{
			List<String> list = new ArrayList<>();
			for (FragmentScore score : fragmentScores)
			{
				List<Term> terms = score.terms;
				Term head = terms.get(0);
				Term tail = terms.get(terms.size() - 1);
				int start = boundaryScanner.findStartOffset(buffer, head.startoffset);
				int end = boundaryScanner.findEndOffset(buffer, tail.endoffset());
				if (fragmentScores.size() == 1 && buffer.length() <= fragmentSize)
				{
					start = 0;
					end = buffer.length();
				}
				else if (fragmentSize - (tail.endoffset() - head.startoffset) > (fragmentSize / 10))
				{
					int size = fragmentSize - (tail.endoffset() - head.startoffset);
					if (head.startoffset < (size * 3 / 10))
					{
						start = 0;
					}
					else
					{
						start = boundaryScanner.findStartOffset(buffer, head.startoffset);
					}
					if (buffer.length() - start < fragmentSize)
					{
						end = buffer.length();
					}
					else
					{
						end = boundaryScanner.findEndOffset(buffer, Math.max(start + fragmentSize, tail.endoffset()));
					}
				}
				StringBuffer result = new StringBuffer();
				for (int i = 0; i < terms.size(); i++)
				{
					Term term = terms.get(i);
					result.append(buffer.substring(start, term.startoffset));
					result.append(getTag(preTags, i));
					result.append(encoder.encodeText(buffer.substring(term.startoffset, term.endoffset())));
					result.append(getTag(postTags, i));
					start = term.endoffset();
				}
				result.append(buffer.substring(start, end));
				list.add(result.toString());
			}
			return list.toArray(new String[0]);
		}

		public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, String text, int fragmentSize)
		{
			if (analyzer == null)
			{
				return new ArrayList<>();
			}
			List<FragmentScore> fragments = new ArrayList<>();
			FragmentScore fragmentScore = null;
			List<Term> terms = analyzer.find(text);
			for (int i = 0, j = 0; i < terms.size(); i++)
			{
				FragmentScore fScore = new FragmentScore(fragmentSize);
				for (j = i; j < terms.size(); j++)
				{
					if (!fScore.add(terms.get(j)))
					{
						break;
					}
				}
				fScore.updateScore(phrases);
				if (fragmentScore == null || fragmentScore.compareTo(fScore) >= 0)
				{
					fragmentScore = fScore;
				}
				if (j >= terms.size())
				{
					break;
				}
			}
			if (fragmentScore != null)
			{
				fragments.add(fragmentScore);
			}

			return fragments;
		}

		public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, String text, int maxNumFragments, int fragmentSize)
		{
			if (analyzer == null)
			{
				return null;
			}
			List<Term> terms = analyzer.find(text);
			List<FragmentScore> fragments = new ArrayList<>();
			FragmentScore fScore = new FragmentScore(fragmentSize);
			for (int i = 0; i < terms.size(); i++)
			{
				if (!fScore.add(terms.get(i)))
				{
					fragments.add(fScore);
					fScore = new FragmentScore(fragmentSize);
					fScore.add(terms.get(i));
				}
			}
			fragments.add(fScore);
			Collections.sort(fragments);
			while (fragments.size() > maxNumFragments)
			{
				fragments.remove(fragments.size() - 1);
			}
			return fragments;
		}

		protected String getTag(String[] tags, int num)
		{
			int n = num % tags.length;
			return tags[n];
		}

	}

	public static class Term
	{
		String word;
		int startoffset,  pos;
		public Term(int startoffset, int pos, String word)
		{
			this.startoffset = startoffset;
			this.pos = pos;
			this.word = word;
		}

		public int endoffset()
		{
			return this.startoffset+word.length();
		}
		public int length()
		{
			return word.length();
		}
		
		public String toString()
		{
			return "start:" + startoffset +  " pos:" + pos+" word:"+word;
		}
	}

	public static class TreeAnalysis
	{
		private TNode root = new TNode((char) 0, false);
		// first-character filter: nodes[c] is true if some dictionary word starts with c
		boolean[] nodes = new boolean[64 * 1024];
		// bounds of the full-width character block (U+FF01..U+FF5E)
		static final char ch0 = '\uFF00';
		static final char ch1 = '\uFF5F';
		public List<Term> find(String str)
		{
			int start = 0;
			int length = str.length();
			str = str.toLowerCase();
			char[] values = str.toCharArray();
			List<Term> terms = new ArrayList<>();
			int sumpos = 0;
			while (start < length)
			{
				char ch = values[start];
				// convert full-width characters to their half-width equivalents
				ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
				if (!nodes[ch])
				{
					start++;
					continue;
				}
				else
				{
					int pos = root.find(values, start, -1);
					if (pos >= start)
					{
						terms.add(new Term(start, start - sumpos + terms.size(), str.substring(start, pos + 1)));
						sumpos += pos + 1 - start;
						start = pos + 1;
					}
					else
					{
						start++;
					}
				}
			}
			return terms;
		}

		public void add(String str)
		{
			if (str == null || str.length() == 0)
			{
				return;
			}
			str = str.toLowerCase();
			nodes[(int)str.charAt(0)] = true;
			root.insert(str, 0);
		}
	
		private static class TNode implements Comparable<TNode>
		{
			// whether this node terminates a dictionary word
			boolean mark;
			// the character stored at this node
			char value;
			// child nodes, kept sorted by character value
			TNode[] nodes;
			int nodesize;
			public TNode(char ch, boolean mark)
			{
				this.value = ch;
				this.mark = mark;
			}
			public int find(char[] chs, int nextPos, int leafoffset)
			{
				if (nextPos >= chs.length)
				{
					return -1;
				}
				int size = 0;
				char ch = chs[nextPos];
				// convert full-width characters to their half-width equivalents
				ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
				while (size < this.nodesize && nodes[size++].value < ch);
				int pos = nodes[size - 1].value == ch ? size - 1 : -1;
				// int pos = index(chs[nextPos]);
				if (pos >= 0)
				{
					if (nodes[pos].mark)
					{
						leafoffset = nextPos;
						if (nodes[pos].nodesize == 0)
						{
							return nextPos;
						}
					}
					int next = nodes[pos].find(chs, nextPos + 1, leafoffset);
					return next > leafoffset ? next : leafoffset;

				}
				else
				{
					return -1;
				}
			}

			/*public int index(char ch)
			{
				if (this.nodesize < 5)
				{
					int size = 0;
					while (size < this.nodesize && nodes[size++].value < ch)
						;
					return nodes[size - 1].value == ch ? size - 1 : -1;
				}
				else
				{
					return indexOf(nodes, this.nodesize, ch, Type._index);
				}
			}*/

			int indexOf(TNode[] nodes, int size, char node, Type type)
			{
				int fromIndex = 0;
				int toIndex = size - 1;
				while (fromIndex <= toIndex)
				{
					int mid = (fromIndex + toIndex) >> 1;
					int cmp = nodes[mid].compareTo(node);// this.comparator.compare(nodes[mid],
															// node);
					if (cmp < 0)
						fromIndex = mid + 1;
					else if (cmp > 0)
						toIndex = mid - 1;
					else
						return type == Type._insert ? -(mid + 1) : mid; // key
																		// found
				}
				switch (type)
				{
				case _insert:
					return fromIndex;
				case _index:
					return -(fromIndex + 1);
				default:
					return toIndex;
				}
			}

			public void insert(String str, int pos)
			{
				char ch = str.charAt(pos);
				boolean isleaf = pos == str.length() - 1;
				if (this.nodesize == 0)
				{
					nodes = new TNode[1];
					nodes[0] = new TNode(ch, isleaf);
					if (!isleaf)
					{
						nodes[0].insert(str, pos + 1);
					}
					this.nodesize++;
				}
				else
				{
					int _index = indexOf(nodes, nodesize, ch, Type._insert);
					if (_index >= 0)
					{
						int moved = this.nodesize - _index;
						if (this.nodesize == nodes.length)
						{
							nodes = Arrays.copyOf(nodes, nodes.length + 1);
						}
						if (moved > 0)
						{
							System.arraycopy(nodes, _index, nodes, _index + 1, moved);
						}
						nodes[_index] = new TNode(ch, isleaf);
						if (!isleaf)
						{
							nodes[_index].insert(str, pos + 1);
						}
						this.nodesize++;
					}
					else
					{
						if (isleaf)
						{
							nodes[0].mark = true;
						}
						else
						{
							nodes[-_index - 1].insert(str, pos + 1);
						}
					}
				}
			}

			@Override
			public int compareTo(TNode o)
			{
				if (this.value > o.value)
				{
					return 1;
				}
				else if (this.value < o.value)
				{
					return -1;
				}
				return 0;
			}

			public int compareTo(char o)
			{
				if (this.value > o)
				{
					return 1;
				}
				else if (this.value < o)
				{
					return -1;
				}
				return 0;
			}
			
			enum Type
			{
				_insert, _index
			}
		}
	}
}
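Before moving on to the second source file, here is a hypothetical sketch of how TreeAnalysis is driven; at query time the dictionary words actually come from FieldQuery, and the literals below are made up:

// Hypothetical driver: seed the trie with query terms, then scan
// document text for dictionary hits.
FastPlainHighlighter.TreeAnalysis analysis = new FastPlainHighlighter.TreeAnalysis();
analysis.add("国美");
analysis.add("电器");
for (FastPlainHighlighter.Term term : analysis.find("6月12日国美电器宣布更名"))
{
	// each Term carries its start offset, position, and matched word
	System.out.println(term);
}

The file below is a modified copy of Lucene's FieldQuery, extended with a Phrase class and a getPhrases method so whole phrases survive for the highlighter.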
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements.  See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License.  You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.search.vectorhighlight;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack.TermInfo;

/**
* FieldQuery breaks down query object into terms/phrases and keeps
* them in a QueryPhraseMap structure.
*/
public class FieldQuery {

 final boolean fieldMatch;

 // fieldMatch==true,  Map<fieldName,QueryPhraseMap>
 // fieldMatch==false, Map<null,QueryPhraseMap>
 Map<String, QueryPhraseMap> rootMaps = new HashMap<>();

 // fieldMatch==true,  Map<fieldName,setOfTermsInQueries>
 // fieldMatch==false, Map<null,setOfTermsInQueries>
 Map<String, Set<String>> termSetMap = new HashMap<>();

 // phrases extracted from the query, kept whole for fast-highlighter
 Map<String, Set<Phrase>> phraseMap = new HashMap<>();
 int termOrPhraseNumber; // used for colored tag support

 // The maximum number of different matching terms accumulated from any one MultiTermQuery
 private static final int MAX_MTQ_TERMS = 1024;

 public static class Phrase {
	  public List<Term> list = new ArrayList<>();
	  StringBuffer buffer = new StringBuffer();
	  public void add(String word,int position)
	  {
		  if(list.size()==0){
			  buffer.append(word);
		  }else{
			  int pos= position-list.get(list.size()-1).position;
			  char z = 0;
			  buffer.append(z).append(pos).append(z);
			  buffer.append(word);
		  }
		  list.add(new Term(position, word));
	  }
	  public String toString()
	  {
		  return buffer.toString();
	  }
	 public static class Term{

			public int position;
			 public String word;
		 public Term(int position, String word) {
			this.position = position;
			this.word = word;
		}
	  }
 }
 protected FieldQuery( Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch ) throws IOException {
   this.fieldMatch = fieldMatch;
   Set<Query> flatQueries = new LinkedHashSet<>();
   flatten( query, reader, flatQueries, 1f );
   saveTerms( flatQueries, reader );
   Collection<Query> expandQueries = expand( flatQueries );

   for( Query flatQuery : expandQueries ){
     QueryPhraseMap rootMap = getRootMap( flatQuery );
     rootMap.add( flatQuery, reader );
     float boost = 1f;
     while (flatQuery instanceof BoostQuery) {
       BoostQuery bq = (BoostQuery) flatQuery;
       flatQuery = bq.getQuery();
       boost *= bq.getBoost();
     }
     if( !phraseHighlight && flatQuery instanceof PhraseQuery ){
       PhraseQuery pq = (PhraseQuery)flatQuery;
       if( pq.getTerms().length > 1 ){
         for( Term term : pq.getTerms() )
           rootMap.addTerm( term, boost );
       }
     }
   }
 }
 /** For backwards compatibility you can initialize FieldQuery without
  * an IndexReader, which is only required to support MultiTermQuery
  */
 FieldQuery( Query query, boolean phraseHighlight, boolean fieldMatch ) throws IOException {
   this (query, null, phraseHighlight, fieldMatch);
 }

 void flatten( Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost ) throws IOException{
   while (true) {
     if (sourceQuery.getBoost() != 1f) {
       boost *= sourceQuery.getBoost();
       sourceQuery = sourceQuery.clone();
       sourceQuery.setBoost(1f);
     } else if (sourceQuery instanceof BoostQuery) {
       BoostQuery bq = (BoostQuery) sourceQuery;
       sourceQuery = bq.getQuery();
       boost *= bq.getBoost();
     } else {
       break;
     }
   }
   if( sourceQuery instanceof BooleanQuery ){
     BooleanQuery bq = (BooleanQuery)sourceQuery;
     for( BooleanClause clause : bq ) {
       if( !clause.isProhibited() ) {
         flatten( clause.getQuery(), reader, flatQueries, boost );
       }
     }
   } else if( sourceQuery instanceof DisjunctionMaxQuery ){
     DisjunctionMaxQuery dmq = (DisjunctionMaxQuery)sourceQuery;
     for( Query query : dmq ){
       flatten( query, reader, flatQueries, boost );
     }
   }
   else if( sourceQuery instanceof TermQuery ){
     if (boost != 1f) {
       sourceQuery = new BoostQuery(sourceQuery, boost);
     }
     if( !flatQueries.contains( sourceQuery ) )
       flatQueries.add( sourceQuery );
   }
   else if( sourceQuery instanceof PhraseQuery ){
     PhraseQuery pq = (PhraseQuery)sourceQuery;
     if( pq.getTerms().length == 1 )
       sourceQuery = new TermQuery( pq.getTerms()[0] );
     if (boost != 1f) {
       sourceQuery = new BoostQuery(sourceQuery, boost);
     }
     flatQueries.add(sourceQuery);
   } else if (sourceQuery instanceof ConstantScoreQuery) {
     final Query q = ((ConstantScoreQuery) sourceQuery).getQuery();
     if (q != null) {
       flatten( q, reader, flatQueries, boost);
     }
   } else if (sourceQuery instanceof FilteredQuery) {
     final Query q = ((FilteredQuery) sourceQuery).getQuery();
     if (q != null) {
       flatten( q, reader, flatQueries, boost);
     }
   } else if (sourceQuery instanceof CustomScoreQuery) {
     final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
     if (q != null) {
       flatten( q, reader, flatQueries, boost);
     }
   } else if (reader != null) {
     Query query = sourceQuery;
     Query rewritten;
     if (sourceQuery instanceof MultiTermQuery) {
       rewritten = new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS).rewrite(reader, (MultiTermQuery) query);
     } else {
       rewritten = query.rewrite(reader);
     }
     if (rewritten != query) {
      // only rewrite once and then flatten again - the rewritten query could have a special treatment
       // if this method is overwritten in a subclass.
       flatten(rewritten, reader, flatQueries, boost);
     } 
     // if the query is already rewritten we discard it
   }
   // else discard queries
 }
 /*
  * Create expandQueries from flatQueries.
  * 
  * expandQueries := flatQueries + overlapped phrase queries
  * 
  * ex1) flatQueries={a,b,c}
  *      => expandQueries={a,b,c}
  * ex2) flatQueries={a,"b c","c d"}
  *      => expandQueries={a,"b c","c d","b c d"}
  */
 Collection<Query> expand( Collection<Query> flatQueries ){
   Set<Query> expandQueries = new LinkedHashSet<>();
   for( Iterator<Query> i = flatQueries.iterator(); i.hasNext(); ){
     Query query = i.next();
     i.remove();
     expandQueries.add( query );
     float queryBoost = 1f;
     while (query instanceof BoostQuery) {
       BoostQuery bq = (BoostQuery) query;
       queryBoost *= bq.getBoost();
       query = bq.getQuery();
     }
     if( !( query instanceof PhraseQuery ) ) continue;
      for( Iterator<Query> j = flatQueries.iterator(); j.hasNext(); ){
       Query qj = j.next();
       float qjBoost = 1f;
       while (qj instanceof BoostQuery) {
         BoostQuery bq = (BoostQuery) qj;
         qjBoost *= bq.getBoost();
         qj = bq.getQuery();
       }
       if( !( qj instanceof PhraseQuery ) ) continue;
       checkOverlap( expandQueries, (PhraseQuery)query, queryBoost, (PhraseQuery)qj, qjBoost );
     }
   }
   return expandQueries;
 }

 /*
  * Check if PhraseQuery A and B have overlapped part.
  * 
  * ex1) A="a b", B="b c" => overlap; expandQueries={"a b c"}
  * ex2) A="b c", B="a b" => overlap; expandQueries={"a b c"}
  * ex3) A="a b", B="c d" => no overlap; expandQueries={}
  */
 private void checkOverlap( Collection<Query> expandQueries, PhraseQuery a, float aBoost, PhraseQuery b, float bBoost ){
   if( a.getSlop() != b.getSlop() ) return;
   Term[] ats = a.getTerms();
   Term[] bts = b.getTerms();
   if( fieldMatch && !ats[0].field().equals( bts[0].field() ) ) return;
   checkOverlap( expandQueries, ats, bts, a.getSlop(), aBoost);
   checkOverlap( expandQueries, bts, ats, b.getSlop(), bBoost );
 }

 /*
  * Check if src and dest have overlapped part and if it is, create PhraseQueries and add expandQueries.
  * 
  * ex1) src="a b", dest="c d"       => no overlap
  * ex2) src="a b", dest="a b c"     => no overlap
  * ex3) src="a b", dest="b c"       => overlap; expandQueries={"a b c"}
  * ex4) src="a b c", dest="b c d"   => overlap; expandQueries={"a b c d"}
  * ex5) src="a b c", dest="b c"     => no overlap
  * ex6) src="a b c", dest="b"       => no overlap
  * ex7) src="a a a a", dest="a a a" => overlap;
  *                                     expandQueries={"a a a a a","a a a a a a"}
  * ex8) src="a b c d", dest="b c"   => no overlap
  */
 private void checkOverlap( Collection<Query> expandQueries, Term[] src, Term[] dest, int slop, float boost ){
   // beginning from 1 (not 0) is safe because that the PhraseQuery has multiple terms
   // is guaranteed in flatten() method (if PhraseQuery has only one term, flatten()
   // converts PhraseQuery to TermQuery)
   for( int i = 1; i < src.length; i++ ){
     boolean overlap = true;
     for( int j = i; j < src.length; j++ ){
       if( ( j - i ) < dest.length && !src[j].text().equals( dest[j-i].text() ) ){
         overlap = false;
         break;
       }
     }
     if( overlap && src.length - i < dest.length ){
       PhraseQuery.Builder pqBuilder = new PhraseQuery.Builder();
       for( Term srcTerm : src )
         pqBuilder.add( srcTerm );
       for( int k = src.length - i; k < dest.length; k++ ){
         pqBuilder.add( new Term( src[0].field(), dest[k].text() ) );
       }
       pqBuilder.setSlop( slop );
       Query pq = pqBuilder.build();
        if (boost != 1f) {
          // carry the accumulated boost onto the expanded phrase query
          // (the original passed 1f here, which made the boost check pointless)
          pq = new BoostQuery(pq, boost);
        }
       if(!expandQueries.contains( pq ) )
         expandQueries.add( pq );
     }
   }
 }
 QueryPhraseMap getRootMap( Query query ){
   String key = getKey( query );
   QueryPhraseMap map = rootMaps.get( key );
   if( map == null ){
     map = new QueryPhraseMap( this );
     rootMaps.put( key, map );
   }
   return map;
 }
 /*
  * Return 'key' string. 'key' is the field name of the Query.
  * If not fieldMatch, 'key' will be null.
  */
 private String getKey( Query query ){
   if( !fieldMatch ) return null;
   while (query instanceof BoostQuery) {
     query = ((BoostQuery) query).getQuery();
   }
   if( query instanceof TermQuery )
     return ((TermQuery)query).getTerm().field();
   else if ( query instanceof PhraseQuery ){
     PhraseQuery pq = (PhraseQuery)query;
     Term[] terms = pq.getTerms();
     return terms[0].field();
   }
   else if (query instanceof MultiTermQuery) {
     return ((MultiTermQuery)query).getField();
   }
   else
     throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." );
 }

 /*
  * Save the set of terms in the queries to termSetMap.
  * 
  * ex1) q=name:john
  *      - fieldMatch==true
  *          termSetMap=Map<"name",Set<"john">>
  *      - fieldMatch==false
  *          termSetMap=Map<null,Set<"john">>
  *          
  * ex2) q=name:john title:manager
  *      - fieldMatch==true
  *          termSetMap=Map<"name",Set<"john">,
  *                         "title",Set<"manager">>
  *      - fieldMatch==false
  *          termSetMap=Map<null,Set<"john","manager">>
  *          
  * ex3) q=name:"john lennon"
  *      - fieldMatch==true
  *          termSetMap=Map<"name",Set<"john","lennon">>
  *      - fieldMatch==false
  *          termSetMap=Map<null,Set<"john","lennon">>
  */
 void saveTerms( Collection<Query> flatQueries, IndexReader reader ) throws IOException{
   for( Query query : flatQueries ){
     while (query instanceof BoostQuery) {
       query = ((BoostQuery) query).getQuery();
     }
      Set<Phrase> terms = getTerms( query );
      Set<String> termSet = getTermSet( query );
     if( query instanceof TermQuery ){
        termSet.add( ((TermQuery)query).getTerm().text() );
     }
     else if( query instanceof PhraseQuery ){
   	int[] positions=((PhraseQuery)query).getPositions();
   	Term[] terms2 =((PhraseQuery)query).getTerms();
   	Phrase phrase = new Phrase();
   	for (int i = 0; i < terms2.length; i++) {
   		phrase.add(terms2[i].text(), positions[i]);
			termSet.add( terms2[i].text() );
		}
   	if(terms2.length > 1){
   		terms.add(phrase);
   	}
     }
     else if (query instanceof MultiTermQuery && reader != null) {
       BooleanQuery mtqTerms = (BooleanQuery) query.rewrite(reader);
       for (BooleanClause clause : mtqTerms) {
         termSet.add (((TermQuery) clause.getQuery()).getTerm().text());
       }
     }
     else
       throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." );
   }
 }
 private Set<String> getTermSet( Query query ){
   String key = getKey( query );
   Set<String> set = termSetMap.get( key );
   if( set == null ){
     set = new HashSet<>();
     termSetMap.put( key, set );
   }
   return set;
 }
 // phrase lookup used by fast-highlighter
 private Set<Phrase> getTerms( Query query ){
   String key = getKey( query );
   Set<Phrase> set = phraseMap.get( key );
   if( set == null ){
     set = new HashSet<>();
     phraseMap.put( key, set );
   }
   return set;
 }
 public Set<String> getTermSet( String field ){
   return termSetMap.get( fieldMatch ? field : null );
 }
   return termSetMap.get( fieldMatch ? field : null );
 }

 /**
  * Returns the phrases collected for the given field; phrases are kept
  * whole rather than being re-tokenized.
  * @param field
  * @return
  */
 public Set<Phrase> getPhrases( String field ){
   return phraseMap.get( fieldMatch ? field : null );
 }
 /**
  * 
  * @return QueryPhraseMap
  */
 public QueryPhraseMap getFieldTermMap( String fieldName, String term ){
   QueryPhraseMap rootMap = getRootMap( fieldName );
   return rootMap == null ? null : rootMap.subMap.get( term );
 }

 /**
  * 
  * @return QueryPhraseMap
  */
 public QueryPhraseMap searchPhrase( String fieldName, final List<TermInfo> phraseCandidate ){
   QueryPhraseMap root = getRootMap( fieldName );
   if( root == null ) return null;
   return root.searchPhrase( phraseCandidate );
 }
 private QueryPhraseMap getRootMap( String fieldName ){
   return rootMaps.get( fieldMatch ? fieldName : null );
 }
 public int nextTermOrPhraseNumber(){
   return termOrPhraseNumber++;
 }
 /**
  * Internal structure of a query for highlighting: represents
  * a nested query structure
  */
 public static class QueryPhraseMap {

   boolean terminal;
   int slop;   // valid if terminal == true and phraseHighlight == true
   float boost;  // valid if terminal == true
   int termOrPhraseNumber;   // valid if terminal == true
   FieldQuery fieldQuery;
   Map<String, QueryPhraseMap> subMap = new HashMap<>();
   public QueryPhraseMap( FieldQuery fieldQuery ){
     this.fieldQuery = fieldQuery;
   }

   void addTerm( Term term, float boost ){
     QueryPhraseMap map = getOrNewMap( subMap, term.text() );
     map.markTerminal( boost );
   }
   private QueryPhraseMap getOrNewMap( Map<String, QueryPhraseMap> subMap, String term ){
     QueryPhraseMap map = subMap.get( term );
     if( map == null ){
       map = new QueryPhraseMap( fieldQuery );
       subMap.put( term, map );
     }
     return map;
   }

   void add( Query query, IndexReader reader ) {
     float boost = 1f;
     while (query instanceof BoostQuery) {
       BoostQuery bq = (BoostQuery) query;
       query = bq.getQuery();
       boost = bq.getBoost();
     }
     if( query instanceof TermQuery ){
       addTerm( ((TermQuery)query).getTerm(), boost );
     }
     else if( query instanceof PhraseQuery ){
       PhraseQuery pq = (PhraseQuery)query;
       Term[] terms = pq.getTerms();
       Map<String, QueryPhraseMap> map = subMap;
       QueryPhraseMap qpm = null;
       for( Term term : terms ){
         qpm = getOrNewMap( map, term.text() );
         map = qpm.subMap;
       }
       qpm.markTerminal( pq.getSlop(), boost );
     }
     else
       throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." );
   }
   public QueryPhraseMap getTermMap( String term ){
     return subMap.get( term );
   }
   private void markTerminal( float boost ){
     markTerminal( 0, boost );
   }
   private void markTerminal( int slop, float boost ){
     this.terminal = true;
     this.slop = slop;
     this.boost = boost;
     this.termOrPhraseNumber = fieldQuery.nextTermOrPhraseNumber();
   }
   public boolean isTerminal(){
     return terminal;
   }
   public int getSlop(){
     return slop;
   }
   public float getBoost(){
     return boost;
   }
   public int getTermOrPhraseNumber(){
     return termOrPhraseNumber;
   }
   public QueryPhraseMap searchPhrase( final List<TermInfo> phraseCandidate ){
     QueryPhraseMap currMap = this;
     for( TermInfo ti : phraseCandidate ){
       currMap = currMap.subMap.get( ti.getText() );
       if( currMap == null ) return null;
     }
     return currMap.isValidTermOrPhrase( phraseCandidate ) ? currMap : null;
   }
   public boolean isValidTermOrPhrase( final List<TermInfo> phraseCandidate ){
     // check terminal
     if( !terminal ) return false;

     // if the candidate is a term, it is valid
     if( phraseCandidate.size() == 1 ) return true;

     // else check whether the candidate is valid phrase
     // compare position-gaps between terms to slop
     int pos = phraseCandidate.get( 0 ).getPosition();
     for( int i = 1; i < phraseCandidate.size(); i++ ){
       int nextPos = phraseCandidate.get( i ).getPosition();
       if( Math.abs( nextPos - pos - 1 ) > slop ) return false;
       pos = nextPos;
     }
     return true;
   }
 }
}
