Solr: Clustering documents with Carrot2

License: CC 4.0 BY-SA


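This article describes how to configure the clustering component in Solr, select a specific clustering algorithm, and apply multilingual support to the clustering process to improve search efficiency and user experience.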

1. Configure clustering in solrconfig.xml

<searchComponent name="clustering"
                   enable="true"
                   class="solr.clustering.ClusteringComponent" >
    <lst name="engine">
      <str name="name">lingo</str>

      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <str name="carrot.resourcesDir">clustering/carrot2</str>
    </lst>

    <lst name="engine">
      <str name="name">stc</str>
      <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
    </lst>

    <lst name="engine">
      <str name="name">kmeans</str>
      <str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
    </lst>

  </searchComponent>
 <requestHandler name="/clustering"
                  startup="lazy"
                  enable="true"
                  class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="clustering">true</bool>
      <str name="clustering.engine">lingo</str>
      <bool name="clustering.results">true</bool>
      <!-- Field name with the logical "title" of each document (optional) -->
      <str name="carrot.title">content</str>
      <!-- Field name with the logical "URL" of each document (optional) -->
      <str name="carrot.url">id</str>
      <!-- Field name with the logical "content" of each document (optional) -->
      <str name="carrot.snippet">content</str>
      <!-- Apply the highlighter to the title/content and use the result for clustering. -->
      <bool name="carrot.produceSummary">true</bool>
      <!-- the maximum number of labels per cluster -->
      <!--<int name="carrot.numDescriptions">5</int>-->
      <!-- produce sub clusters -->
      <bool name="carrot.outputSubClusters">false</bool>

      <!-- Configure the remaining request handler parameters. -->
      <str name="defType">edismax</str>
      <str name="q.alt">*:*</str>
      <str name="rows">10</str>
      <str name="fl">*,score</str>
    </lst>
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>

2. Alter clustering/carrot2/lingo-attributes.xml (the file lives in the carrot.resourcesDir set above)
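
The post does not say which attributes to change. Since step 3 adds a Chinese tokenizer, one plausible change is the default clustering language. The sketch below follows the attribute-set layout of the lingo-attributes.xml shipped with the Solr 4.x example configuration; the choice of CHINESE_SIMPLIFIED and the cluster count of 20 are assumptions for illustration, not values taken from the original post.

```xml
<!-- Sketch of clustering/carrot2/lingo-attributes.xml (values are assumptions) -->
<attribute-sets default="attributes">
  <attribute-set id="attributes">
    <value-set>
      <label>attributes</label>
      <!-- Language to assume for the clustered documents -->
      <attribute key="MultilingualClustering.defaultLanguage">
        <value type="org.carrot2.core.LanguageCode" value="CHINESE_SIMPLIFIED"/>
      </attribute>
      <!-- Rough target for the number of clusters Lingo should produce -->
      <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
        <value type="java.lang.Integer" value="20"/>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
```

Keeping such settings in the attributes file rather than in the request handler defaults means they apply to every request that uses the lingo engine.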

3. Add the Chinese tokenizer jar to the classpath in solrconfig.xml

lucene-analyzers-smartcn-4.7.0.jar
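
One way to put the jar on the classpath is a `<lib>` directive in solrconfig.xml. This is a minimal sketch: the relative paths assume the directory layout of a stock Solr 4.x distribution and a core located three levels below the Solr root, so adjust them to wherever the jars actually sit in your installation.

```xml
<!-- Paths below are assumptions based on the stock Solr 4.x layout -->

<!-- Clustering contrib and its Carrot2 dependencies -->
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />

<!-- Chinese tokenizer (smartcn) used when clustering Chinese text -->
<lib dir="../../../contrib/analysis-extras/lucene-libs/" regex="lucene-analyzers-smartcn-\d.*\.jar" />
```

Matching the jar with a regex rather than a fixed file name keeps the directive valid after a Solr upgrade changes the version suffix.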

References

http://wiki.apache.org/solr/ClusteringComponent

http://www.cnblogs.com/tomcattd/archive/2013/08/20/3270143.html

http://carrot2.github.io/solr-integration-strategies/carrot2-3.6.3/index.html