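This document describes how to fetch data from a database and build an index from it.
Many search applications store the content to be indexed in a structured data store, such as a relational database. The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it. In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields. The example/example-DIH directory contains several collections that demonstrate many features of the Data Import Handler. To run the 'dih' example:
solr -e dih
For more information about DIH, see DataImportHandler.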
Concepts and Terminology
Terms
- Datasource
- Entity
- Processor
- Transformer
Configuration
Configuring solrconfig.xml
Register the Data Import Handler in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/my/DIHconfigfile.xml</str>
  </lst>
</requestHandler>
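The only required parameter is the config parameter, which specifies the location of the DIH configuration file. The DIH configuration file contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to generate the Solr documents that are posted to the index.
You can have multiple DIH configuration files. Each file requires its own definition in solrconfig.xml that specifies the path to that file.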
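As a rough sketch of the multiple-file case, the fragment below registers two handlers, each pointing at its own DIH configuration file. The handler names (/dataimport-db, /dataimport-rss) and the file paths are hypothetical placeholders chosen for illustration; only the handler class and the config parameter come from the registration shown earlier.

```xml
<!-- Sketch only: handler names and config paths are hypothetical placeholders. -->
<requestHandler name="/dataimport-db" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- DIH configuration file for the database import -->
    <str name="config">/path/to/my/db-DIHconfigfile.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-rss" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- A second, independent DIH configuration file -->
    <str name="config">/path/to/my/rss-DIHconfigfile.xml</str>
  </lst>
</requestHandler>
```

Giving each handler a distinct name keeps the imports independent, so each can be triggered and monitored at its own request path.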
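Configuring the DIH Configuration File
The annotated configuration file below is based on the 'db' collection in the dih example server. It is located at example/example-DIH/solr/db/conf/db-data-config.xml. It extracts data from the four tables of a simple product database. The comments in the code below give more information about the parameters and options.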
<dataConfig>
  <!-- The first element is the dataSource, in this case an HSQLDB database.
       The path to the JDBC driver and the JDBC URL and login credentials are all specified here.
       Other permissible attributes include whether or not to autocommit to Solr, the batchsize
       used in the JDBC connection, a 'readOnly' flag.
       The password attribute is optional if there is no password set for the DB.
  -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>
  <!--
       Alternately the password can be encrypted as follows. This is the value obtained as a result of the command
       openssl enc -aes-128-cbc -a -salt -in pwd.txt
       password="U2FsdGVkX18QMjY0yfCqlfBMvAB4d3XkwY96L7gfO2o="
       When the password is encrypted, you must provide an extra attribute
       encryptKeyFile="/location/of/encryptionkey"
       This file should be a text file with a single line containing the encrypt/decrypt password
  -->
  <!-- A 'document' element follows, containing multiple 'entity' elements.
       Note that 'entity' elements can be nested, and this allows the entity
       relationships in the sample database to be mirrored here, so that we can
       generate a denormalized Solr record which may include multiple features
       for one item, for instance -->
  <document>
    <!-- The possible attributes for the entity element are described below.
         Entity elements may contain one or more 'field' elements, which map
         the data source field names to Solr fields, and optionally specify
         per-field transformations -->
    <!-- this entity is the 'root' entity. -->
    <entity name="item" query="select * from item"
            deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name" />
      <!-- This entity is nested and reflects the one-to-many relationship between an item and its multiple features.
           Note the use of variables; ${item.ID} is the value of the column 'ID' for the current item
           ('item' referring to the entity name) -->
      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
        <field name="features" column="DESCRIPTION" />
      </entity>
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
        <entity name="category"
                query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
          <field column="description" name="cat" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
Data sources can still be specified in solrconfig.xml.
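A minimal sketch of what that might look like follows. It assumes the data source is declared as a nested list named datasource inside the handler's defaults section, which is the form described in the DataImportHandler documentation; check the documentation for your Solr version before relying on it. The driver, URL, and user are copied from the HSQLDB example above, and the config path is the same placeholder used in the handler registration.

```xml
<!-- Sketch only: assumes a nested "datasource" list inside the handler defaults. -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/my/DIHconfigfile.xml</str>
    <!-- Connection settings moved out of the DIH configuration file -->
    <lst name="datasource">
      <str name="driver">org.hsqldb.jdbcDriver</str>
      <str name="url">jdbc:hsqldb:./example-DIH/hsqldb/ex</str>
      <str name="user">sa</str>
      <str name="password">secret</str>
    </lst>
  </lst>
</requestHandler>
```

Keeping the connection details in solrconfig.xml lets the same DIH configuration file be reused against different databases.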
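Request Parameters
DIH Commands
Parameters for the full-import Command
Property Writer
Data Sources
ContentStream Data Source
FieldReader Data Source
File Data Source
JDBC Data Source
URL Data Source
Entity Processors
SQL Entity Processor
XPath Entity Processor
Mail Entity Processor
Importing New Emails Only
GMail Extensions
Tika Entity Processor
FileList Entity Processor
Line Entity Processor
Plain Text (PlainText) Entity Processor
Solr Entity Processor
Transformers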