Getting Started with Solr: An Introduction

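Original source: http://blog.chenlb.com/2009/05/apache-solr-quick-start-and-demo.html

A while ago I put together an introductory walkthrough of an Apache Solr application, and I am recording it on this blog so that newcomers can follow along. The example is searching forum posts.

1. Download Apache Solr 1.3 from http://apache.etoak.com/lucene/solr/1.3.0/apache-solr-1.3.0.zip and unzip it to, for example, E:\apache-solr-1.3.0.

2. Download Apache Tomcat 6.0.18 from http://labs.xiaonei.com/apache-mirror/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.zip and unzip it to, for example, E:\apache-tomcat-6.0.18.

3. Install Solr into Tomcat. Edit E:\apache-tomcat-6.0.18\conf\server.xml and add URIEncoding="UTF-8" to the connector on port 8080, so that the block reads: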

     <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" URIEncoding="UTF-8"/>
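
Why URIEncoding matters: Tomcat 6 decodes request URIs as ISO-8859-1 unless told otherwise, while browsers and most HTTP clients percent-encode query strings as UTF-8. The short Python sketch below is only an illustration (Python is not part of the original setup); it shows what happens to a Chinese search term under each decoding.

```python
from urllib.parse import quote, unquote

term = "西伯利亚"

# Clients normally percent-encode the UTF-8 bytes of the query term.
encoded = quote(term)

# Decoding with the same charset restores the original term.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# Decoding as ISO-8859-1 (Tomcat 6's default for URIs) produces mojibake.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(decoded_utf8 == term)
print(decoded_latin1 == term)
```

The encoded value is the same percent-encoded string that appears in the mmseg4j search URLs later in this post. Without URIEncoding="UTF-8", Solr would receive the garbled variant and the query would not match.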

Save the following to E:\apache-tomcat-6.0.18\conf\Catalina\localhost\solr.xml, creating the directory if it does not exist.

<Context docBase="E:/apache-solr-1.3.0/dist/apache-solr-1.3.0.war" reloadable="true" >
	<Environment name="solr/home" type="java.lang.String" value="E:/apache-solr-1.3.0/example/solr" override="true" />
</Context>

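For other ways to install Solr, see: solr install

4. The installation is now complete. Start Tomcat and open http://localhost:8080/solr/admin/ to see the admin interface.

5. Design the index structure for the forum post search application:

Field	Description
id	Post id
user	User name or UserId of the poster
title	Title
content	Body
timestamp	Time posted
text	Holds both the title and the content, so that both can be searched at once.

6. Tell Solr about the index structure above by overwriting E:\apache-solr-1.3.0\example\solr\conf\schema.xml with the following content (back up the file first so that you can refer to the official example later):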

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example" version="1.1">

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

    <!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
         is a more restricted form of the canonical representation of dateTime
         http://www.w3.org/TR/xmlschema-2/#dateTime
         The trailing "Z" designates UTC time and is mandatory.
         Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
         All other components are mandatory.

         Expressions can also be used to denote calculations that should be
         performed relative to "NOW" to determine the value, ie...

               NOW/HOUR
                  ... Round to the start of the current hour
               NOW-1DAY
                  ... Exactly 1 day prior to now
               NOW/DAY+6MONTHS+3DAYS
                  ... 6 months and 3 days in the future from the start of
                      the current day

         Consult the DateField javadocs for more information.
      -->
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>

 </types>

 <fields>
   <field name="id" type="sint" indexed="true" stored="true" required="true" />
   <field name="user" type="string" indexed="true" stored="true"/>
   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="content" type="text" indexed="true" stored="true" />
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
 </fields>

 <!-- Field to use to determine and enforce document uniqueness.
      Unless this field is marked with required="false", it will be a required field
   -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>text</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="AND"/>

  <!-- copyField commands copy one field to another at the time a document
        is added to the index.  It's used either to index the same field differently,
        or to add multiple fields to the same field for easier/faster searching.  -->
<!-- -->
   <copyField source="title" dest="text"/>
   <copyField source="content" dest="text"/>

</schema>
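
The date format described in the schema comment (UTC, trailing "Z", optional fractional seconds) can be produced programmatically when generating documents. A minimal Python sketch, assuming timezone-aware datetime values; the helper name to_solr_date is made up for this illustration:

```python
from datetime import datetime, timezone


def to_solr_date(dt: datetime) -> str:
    """Format an aware datetime in the UTC 'Z' form used by solr.DateField."""
    utc = dt.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    if utc.microsecond:
        # Optional fractional seconds, truncated to milliseconds.
        text += ".%03d" % (utc.microsecond // 1000)
    return text + "Z"


print(to_solr_date(datetime(2009, 2, 18, tzinfo=timezone.utc)))
print(to_solr_date(datetime(2009, 2, 18, 12, 33, 5, 123000, tzinfo=timezone.utc)))
```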

7. Restart Tomcat, then manually create two XML data files in E:\apache-solr-1.3.0\example\exampledocs, saved as demo-doc1.xml and demo-doc2.xml respectively:

<?xml version="1.0" encoding="UTF-8" ?>
<add>
<doc>
<field name="id">1</field>
<field name="user">chenlb</field>
<field name="title">solr 应用演讲</field>
<field name="content">这一小节是讲提交数据给服务器做索引，这里有一些数据，如：服务器，可以试查找它。</field>
</doc>
</add>

<?xml version="1.0" encoding="UTF-8" ?>

<add>
<doc>
<field name="id">2</field>
<field name="user">bory.chan</field>
<field name="title">搜索引擎</field>
<field name="content">搜索服务器那边有很多数据。</field>
<field name="timestamp">2009-02-18T00:00:00Z</field>
</doc>
<doc>
<field name="id">3</field>
<field name="user">other</field>
<field name="title">这是什么</field>
<field name="content">你喜欢什么运动？篮球？</field>
<field name="timestamp">2009-02-18T12:33:05.123Z</field>
</doc>
</add>
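
Files like the two above do not have to be written by hand. As a rough sketch, the same <add><doc> structure can be generated from plain dictionaries with Python's standard library; the function name build_add_xml and the sample data are assumptions made for this illustration:

```python
import xml.etree.ElementTree as ET


def build_add_xml(posts):
    """Build a Solr <add> message (UTF-8 bytes) from a list of field dicts."""
    add = ET.Element("add")
    for post in posts:
        doc = ET.SubElement(add, "doc")
        for name, value in post.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


posts = [
    {"id": 2, "user": "bory.chan", "title": "搜索引擎",
     "content": "搜索服务器那边有很多数据。",
     "timestamp": "2009-02-18T00:00:00Z"},
]
xml_bytes = build_add_xml(posts)
print(xml_bytes.decode("utf-8"))
```

Leaving out the timestamp field is fine here, because the schema gives it default="NOW".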

8. Submit the data for indexing. Change to E:\apache-solr-1.3.0\example\exampledocs and run:

E:\apache-solr-1.3.0\example\exampledocs>java -Durl=http://localhost:8080/solr/update -Dcommit=yes -jar post.jar demo-doc*.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/solr/update..
SimplePostTool: POSTing file demo-doc1.xml
SimplePostTool: POSTing file demo-doc2.xml
SimplePostTool: COMMITting Solr index changes..
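
As the log shows, post.jar POSTs each file to the update URL and then sends a commit. A hedged Python sketch of the same two steps follows; the function names are hypothetical, and post_and_commit needs a running Solr at the URL used in this post, so only the request construction is exercised here.

```python
import urllib.request

UPDATE_URL = "http://localhost:8080/solr/update"


def make_update_request(xml_bytes, url=UPDATE_URL):
    """Create a POST request carrying an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def post_and_commit(xml_bytes, url=UPDATE_URL):
    """Send the documents, then a <commit/> so they become searchable."""
    for body in (xml_bytes, b"<commit/>"):
        with urllib.request.urlopen(make_update_request(body, url)) as resp:
            resp.read()


request = make_update_request(b"<add></add>")
print(request.get_method(), request.full_url)
```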

9. View the search results:

All documents: http://localhost:8080/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">*:*</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="3" start="0">
<doc>
<str name="content">这一小节是讲提交数据给服务器做索引，这里有一些数据，如：服务器，可以试查找它。</str>
<int name="id">1</int>
<date name="timestamp">2009-05-27T04:07:54.89Z</date>
<str name="title">solr 应用演讲</str>
<str name="user">chenlb</str>
</doc>
<doc>
<str name="content">搜索服务器那边有很多数据。</str>
<int name="id">2</int>
<date name="timestamp">2009-02-18T00:00:00Z</date>
<str name="title">搜索引擎</str>
<str name="user">bory.chan</str>
</doc>
<doc>
<str name="content">你喜欢什么运动？篮球？</str>
<int name="id">3</int>
<date name="timestamp">2009-02-18T12:33:05.123Z</date>
<str name="title">这是什么</str>
<str name="user">other</str>
</doc>
</result>
</response>
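
Responses in this format are easy to consume from a client. A minimal sketch using Python's standard XML parser; the function name parse_select_response is an assumption, and the sample input is a trimmed copy of the response above:

```python
import xml.etree.ElementTree as ET


def parse_select_response(xml_bytes):
    """Return (numFound, docs) where each doc maps field name to text."""
    root = ET.fromstring(xml_bytes)
    result = root.find("result")
    docs = [
        {child.get("name"): child.text for child in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


sample = """<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="1" start="0">
<doc>
<int name="id">2</int>
<str name="title">搜索引擎</str>
<str name="user">bory.chan</str>
</doc>
</result>
</response>""".encode("utf-8")

num_found, docs = parse_select_response(sample)
print(num_found, docs)
```

Note that every value comes back as text; converting id to an integer or timestamp to a date is left to the caller.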

Posts by user bory.chan: http://localhost:8080/solr/select/?q=user%3Abory.chan&version=2.2&start=0&rows=10&indent=on

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">user:bory.chan</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="content">搜索服务器那边有很多数据。</str>
<int name="id">2</int>
<date name="timestamp">2009-02-18T00:00:00Z</date>
<str name="title">搜索引擎</str>
<str name="user">bory.chan</str>
</doc>
</result>
</response>

By time range: http://localhost:8080/solr/select/?q=timestamp%3A%5B%222009-02-18T00%3A00%3A00Z%22+TO+%222009-02-19T00%3A00%3A00Z%22%5D&version=2.2&start=0&rows=10&indent=on

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">16</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">timestamp:["2009-02-18T00:00:00Z" TO "2009-02-19T00:00:00Z"]</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<str name="content">搜索服务器那边有很多数据。</str>
<int name="id">2</int>
<date name="timestamp">2009-02-18T00:00:00Z</date>
<str name="title">搜索引擎</str>
<str name="user">bory.chan</str>
</doc>
<doc>
<str name="content">你喜欢什么运动？篮球？</str>
<int name="id">3</int>
<date name="timestamp">2009-02-18T12:33:05.123Z</date>
<str name="title">这是什么</str>
<str name="user">other</str>
</doc>
</result>
</response>
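
The query URLs in this step are just the query string percent-encoded plus a few fixed parameters. A small Python sketch that rebuilds them; build_select_url is a hypothetical helper, and the parameter values are the ones used in the URLs above:

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8080/solr/select/"


def build_select_url(query, start=0, rows=10):
    """Percent-encode a Solr query and append the standard parameters."""
    params = {
        "q": query,
        "version": "2.2",
        "start": start,
        "rows": rows,
        "indent": "on",
    }
    return SELECT_URL + "?" + urlencode(params)


print(build_select_url("user:bory.chan"))
print(build_select_url(
    'timestamp:["2009-02-18T00:00:00Z" TO "2009-02-19T00:00:00Z"]'))
```

Both calls reproduce the user and time-range URLs shown in this step.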

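For commonly used Solr query parameters, see: Solr query parameters explained

That completes the simple example. By default the index files are written to CWD/solr/data/index. To have them placed under solr.home/data instead, comment out dataDir in E:\apache-solr-1.3.0\example\solr\conf\solrconfig.xml, like this: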

<!--
<dataDir>${solr.data.dir:./solr/data}</dataDir>
-->
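
The value ${solr.data.dir:./solr/data} is a property placeholder with a default: if the solr.data.dir system property is set it is used, otherwise ./solr/data applies, which is relative to the working directory. That is why the index ends up under CWD until the element is commented out. The Python sketch below only illustrates the placeholder syntax; it is not Solr's actual implementation, and the function name resolve is made up.

```python
import re

# Matches ${name} and ${name:default}.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def resolve(value, properties):
    """Substitute ${name:default} placeholders from a properties dict."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return properties[name]
        if default is not None:
            return default
        raise KeyError(name)
    return _PLACEHOLDER.sub(replace, value)


print(resolve("${solr.data.dir:./solr/data}", {}))
print(resolve("${solr.data.dir:./solr/data}", {"solr.data.dir": "D:/index"}))
```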

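Note: the example above does not use a real Chinese word segmenter; it uses the official CJK tokenizer. There is a separate example using the mmseg4j Chinese word segmenter; see: Using the mmseg4j Chinese Word Segmenter in Solr: An Example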
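
To make the difference concrete: the CJK tokenizer does not look words up in a dictionary; it splits runs of Chinese characters into overlapping two-character tokens. The Python sketch below is a simplified approximation of that behaviour (it ignores Latin text, digits and punctuation handling), and the function name cjk_bigrams is an assumption.

```python
import re

# Runs of common CJK ideographs.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Split each run of Chinese characters into overlapping bigrams."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


print(cjk_bigrams("搜索引擎"))
```

A query is tokenized the same way, so a search matches when its bigrams occur in the document, whether or not they are real words.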

Running this example on Solr 1.4 causes a problem. The error is:

org.apache.solr.common.SolrException: QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField

To fix it, delete these two nodes from solrconfig.xml:

<!-- a search component that enables you to configure the top results for
a given query regardless of the normal lucene scoring.-->
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>

That is, remove the Elevation component.
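
If the edit has to be repeated across several installations, it can be scripted. The following Python sketch removes the two nodes from a solrconfig.xml string; the function name strip_elevation is hypothetical. Be aware that the standard-library parser drops XML comments when it rewrites the document, so hand-editing may be preferable for a file you want to keep readable.

```python
import xml.etree.ElementTree as ET


def strip_elevation(config_xml):
    """Return solrconfig XML without the elevator component and /elevate handler."""
    root = ET.fromstring(config_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            is_component = (child.tag == "searchComponent"
                            and child.get("class") == "solr.QueryElevationComponent")
            is_handler = (child.tag == "requestHandler"
                          and child.get("name") == "/elevate")
            if is_component or is_handler:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample_config = """<config>
<searchComponent name="elevator" class="solr.QueryElevationComponent"/>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy"/>
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>
</config>"""

cleaned = strip_elevation(sample_config)
print(cleaned)
```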

===============================

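Using the mmseg4j Chinese Word Segmenter in Solr: An Example

Original source: http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html

The very first version of mmseg4j could already be integrated with Solr easily. There is a brief description on Google Code, and the release post for the first version also has short usage notes: Chinese word segmentation with mmseg4j. To explain more clearly how to use mmseg4j Chinese word segmentation in Solr, I decided to write a dedicated post.

There are currently two versions of mmseg4j. Version 1.7 uses considerably more memory (about 50 MB per dictionary directory), so with the default JVM heap size it throws OutOfMemoryError. This example uses two dictionary directories, so instead of the latest release, 1.7.2, I use 1.6.2. Download mmseg4j-1.6.2 and the dictionaries, or just download the source package (which includes the dictionaries; to build from source, see: mmseg4j Chinese word segmenter 1.7.2 released), and put mmseg4j-all-1.6.2.jar into solr.home/lib.

In Solr, mmseg4j mainly supports two parameters: mode and dicPath. mode selects the segmentation mode (valid values: simple, complex, max-word; an invalid value falls back to max-word). dicPath is the dictionary directory and may be absolute or relative. A relative path is resolved against solr.home, so dic means the dictionary files are looked up in solr.home/dic. If dicPath is not specified, the dictionaries are looked up in CWD/data, that is, the data subdirectory of the directory the program is run from.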
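
The rules in the paragraph above can be summarised in a few lines. This Python sketch is an illustration of the described behaviour, not mmseg4j's code; the function names are made up, the mode names follow the configuration used in this post, and the Tomcat bin directory stands in for the working directory as an assumption.

```python
from pathlib import PureWindowsPath

VALID_MODES = ("simple", "complex", "max-word")


def effective_mode(mode):
    """Unknown or missing modes fall back to max-word."""
    return mode if mode in VALID_MODES else "max-word"


def dictionary_dir(dic_path, solr_home, cwd):
    """Resolve dicPath: absolute as-is, relative under solr.home, default CWD/data."""
    if dic_path is None:
        return PureWindowsPath(cwd) / "data"
    path = PureWindowsPath(dic_path)
    return path if path.is_absolute() else PureWindowsPath(solr_home) / path


solr_home = "E:/apache-solr-1.3.0/example/solr"
cwd = "E:/apache-tomcat-6.0.18/bin"  # assumed working directory
print(effective_mode("bogus"))
print(dictionary_dir("dic", solr_home, cwd).as_posix())
print(dictionary_dir(None, solr_home, cwd).as_posix())
```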

Now change the Solr configuration, mainly schema.xml. I added three field types:

<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

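Note: one in-memory dictionary structure is created per distinct dictionary directory, so the configuration above creates two instances. Be aware that version 1.7.2 runs out of memory with this setup.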
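
To check how many dictionary instances a schema will create, count the distinct dicPath values on its tokenizers. A small Python sketch; the function name distinct_dictionaries is an assumption, and the sample is an abbreviated copy of the three field types above wrapped in a <types> element:

```python
import xml.etree.ElementTree as ET


def distinct_dictionaries(types_xml):
    """Collect the distinct dicPath values used by tokenizer elements."""
    root = ET.fromstring(types_xml)
    return {tok.get("dicPath") for tok in root.iter("tokenizer") if tok.get("dicPath")}


sample_types = """<types>
<fieldType name="textComplex" class="solr.TextField">
<analyzer><tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/></analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField">
<analyzer><tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/></analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField">
<analyzer><tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/></analyzer>
</fieldType>
</types>"""

paths = distinct_dictionaries(sample_types)
print(len(paths), sorted(paths))
```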

Define a few fields:

<field name="simple" type="textSimple" indexed="true" stored="true"/>
<field name="complex" type="textComplex" indexed="true" stored="true"/>
<field name="text" type="textMaxWord" indexed="true" stored="true"/>

Then add the copyField entries (at the very end of the file will do):

<copyField source="text" dest="simple" />
<copyField source="text" dest="complex" />

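mmseg4j is now configured for use in Solr. Next, install Solr into Tomcat.

Solr 1.3 has been out for a while, so I use it for this example. Download solr-1.3.0 and unzip it to, for example, N:/OpenSource/apache-solr-1.3.0. For how to install Solr in Tomcat, see: Getting Started with Solr: An Introduction (the forum post search example), solr install, solr tomcat, solr on tomcat.

I use the TOMCAT_HOME/conf/Catalina/localhost/solr.xml installation method, pointing at n:/OpenSource/apache-solr-1.3.0/example/solr. Tomcat 6 may not have this directory; if so, create it by hand.

When Tomcat starts you can see mmseg4j log messages, and you can then inspect how mmseg4j segments text at http://localhost:8080/solr/admin/analysis.jsp. In the Field drop-down choose name, then enter complex in the corresponding input box. The segmentation result is shown in the figure below:

[Figure: debugging mmseg4j with the Solr analysis page]

Now that everything runs, add some documents to try it out. Create mmseg4j-solr-demo-doc.xml under n:/OpenSource/apache-solr-1.3.0/example/exampledocs: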

<add>
<doc>
<field name="id">1</field>
<field name="text">京华时报２００９年1月23日报道昨天，受一股来自中西伯利亚的强冷空气影响，本市出现大风降温天气，白天最高气温只有零下7摄氏度，同时伴有6到7级的偏北风。</field>
</doc>
<doc>
<field name="id">2</field>
<field name="text">昨日金正日抵达长春市，进行两天的长春市内电话系统考察。</field>
</doc>
<doc>
<field name="id">3</field>
<field name="text">陈教授正在研究生命起源，他的研究生正在打球。</field>
</doc>
<doc>
<field name="id">4</field>
<field name="text">中国人民银行是中华人民共和国的中央银行。</field>
</doc>
</add>

Then submit it to Solr by running post.jar from cmd:

N:\OpenSource\apache-solr-1.3.0\example\exampledocs>java -Durl=http://localhost:8080/solr/update -Dcommit=yes -jar post.jar mmseg4j-solr-demo-doc.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/solr/update..
SimplePostTool: POSTing file mmseg4j-solr-demo-doc.xml
SimplePostTool: COMMITting Solr index changes..

Note: mmseg4j-solr-demo-doc.xml must be encoded as UTF-8; otherwise the submitted text will be garbled.
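
A quick way to catch a wrongly encoded file before posting it is to try decoding its bytes as UTF-8. A minimal Python sketch; the helper name is_utf8 is made up, and GBK is used as the example of a common non-UTF-8 encoding for Chinese text:

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "西伯利亚"
print(is_utf8(text.encode("utf-8")))
print(is_utf8(text.encode("gbk")))
```

For a real file, read it in binary mode, for example open("mmseg4j-solr-demo-doc.xml", "rb").read(), and pass the bytes to the function.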

Check that the data is there: http://localhost:8080/solr/select/?q=*:* should return documents if everything is working.

Next, search for “西伯利亚” (Siberia).

simple: http://localhost:8080/solr/select?indent=on&q=simple:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="fl">id</str>
<str name="indent">on</str>
<str name="q">simple:西伯利亚</str>
<str name="hl.fl">simple,complex,text</str>
<str name="hl">on</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
<lst name="highlighting"/>
</response>
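
One plausible explanation for the empty result above: in the MMSEG algorithm, simple mode is forward maximum matching, which always takes the longest dictionary word starting at the current position. In the phrase 来自中西伯利亚 that can consume 中西 first, leaving 伯, 利 and 亚 as single characters, so the indexed text never contains the token 西伯利亚. The Python sketch below demonstrates the effect with a hypothetical three-word dictionary; it is not mmseg4j's code or its real dictionary.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_dictionary = {"来自", "中西", "西伯利亚"}  # hypothetical entries

indexed = forward_max_match("来自中西伯利亚", toy_dictionary)
query = forward_max_match("西伯利亚", toy_dictionary)
print(indexed)
print(query)
print(all(token in indexed for token in query))
```

With this toy dictionary the query token is not among the indexed tokens, which would give numFound="0".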

complex: http://localhost:8080/solr/select?indent=on&q=complex:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="fl">id</str>
<str name="indent">on</str>
<str name="q">complex:西伯利亚</str>
<str name="hl.fl">simple,complex,text</str>
<str name="hl">on</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
</doc>
</result>
<lst name="highlighting">
<lst name="1">
<arr name="complex">
<str>京华时报２００９年1月23日报道昨天，受一股来自中西伯利亚的强冷空气影响，本市出现大风降温天气，白天最高气温只有零下7摄氏度，同时伴有6到7级的偏北风。</str>
</arr>
</lst>
</lst>
</response>

text (which is actually max-word): http://localhost:8080/solr/select?indent=on&q=text:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">15</int>
<lst name="params">
<str name="fl">id</str>
<str name="indent">on</str>
<str name="q">text:西伯利亚</str>
<str name="hl.fl">simple,complex,text</str>
<str name="hl">on</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
</doc>
</result>
<lst name="highlighting">
<lst name="1">
<arr name="text">
<str>京华时报２００９年1月23日报道昨天，受一股来自中西伯利亚的强冷空气影响，本市出现大风降温天气，白天最高气温只有零下7摄氏度，同时伴有6到7级的偏北风。</str>
</arr>
</lst>
</lst>
</response>