Notes on Nutch & Solr

License: CC 4.0 BY-SA

Tags: #nutch #open source #search #hadoop

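This post covers the different versions of Nutch and Solr and how to set up their environments, including how they pair with components such as Hadoop and HBase. It also looks at several commonly used Chinese word segmentation plugins (jcseg, IK Analyzer, mmseg4j, and ansj) and how to install them.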

A new post dedicated to notes on Nutch & Solr.

Versions

Nutch versions

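Nutch is currently developed along two parallel lines, so 2.x is neither more advanced nor newer than 1.x.

1.x (currently the latest is 1.8; it pairs with Hadoop 1.2 by default and can also run with Hadoop 2.2.)
2.x (currently the latest is 2.2.1; it pairs with Hadoop 1.2 by default and cannot run with Hadoop 2.2. Gora 0.3 works with HBase 0.90.x and 0.92.x, and those HBase versions cannot run on Hadoop 2.2, whereas they can run on Hadoop 1.2.)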
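The dependency chain behind the Nutch 2.x restriction can be written down as a small lookup. The Python sketch below only encodes the version claims made in these notes (it does not check them against any release notes), and the dictionary and function names are made up for illustration.

```python
# Version claims copied from the notes above; they are not independently verified.
# Gora 0.3 (used by Nutch 2.2.1) -> HBase lines it can talk to.
GORA_HBASE = {"0.3": ["0.90.x", "0.92.x"]}

# HBase line -> Hadoop lines it can run on, according to the notes.
HBASE_HADOOP = {"0.90.x": ["1.2"], "0.92.x": ["1.2"]}


def hadoop_versions_for_gora(gora_version):
    """Return the Hadoop lines reachable through the Gora -> HBase chain."""
    supported = set()
    for hbase in GORA_HBASE.get(gora_version, []):
        supported.update(HBASE_HADOOP.get(hbase, []))
    return sorted(supported)


print(hadoop_versions_for_gora("0.3"))
```

Under these assumptions Hadoop 2.2 never appears in the result, which is the reason given above for keeping Nutch 2.2.1 on Hadoop 1.2.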

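Solr versions

Environment setup

Nutch setup

1.x (currently the latest is 1.8)
- Nutch 1.7, Hadoop 1.2.1, CentOS 6.5, JDK 1.7: deploying the Nutch crawler on a Hadoop cluster
- Nutch 1.7 standalone: official tutorial
2.x (currently the latest is 2.2.1)
- hadoop+hbase+Nutch2.1: installing and configuring Nutch (for Linux)
- Nutch 2.2+MySQL+Solr4.2: crawling and indexing website content
- Running Nutch in Eclipse
  - Official tutorial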

Solr setup

Solr itself

4.7
- Official tutorial
- Admin page: http://localhost:8983/solr/#/

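Chinese word segmentation

Segmentation plugins

jcseg
- jcseg is a Chinese word segmenter written in Java, built on the popular MMSeg algorithm.
- Latest version at the time of writing: jcseg 1.9.3, compatible with the latest lucene-4.x and solr-4.x releases.
- Uses MMSeg's four filtering rules; the segmentation accuracy is reported as 98.41%.
IK Analyzer
- Uses its own "forward iterative finest-granularity segmentation algorithm" and supports two modes: fine-grained and smart segmentation.
- Latest release: October 2012.
mmseg4j
- mmseg4j is a Chinese word segmenter that implements Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/ ). It also provides a Lucene analyzer and a Solr TokenizerFactory so that it is easy to use from Lucene and Solr.
- The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex adds four filtering rules. According to the official site, the word recognition accuracy reaches 98.41%. mmseg4j implements both methods.
- Latest release: version 1.9.1 (2013-07-13), compatible with Solr 4.3.1.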
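To make "forward maximum matching" concrete, here is a minimal Python sketch of the idea behind the Simple method: at each position, take the longest dictionary word that starts there. It is not mmseg4j's code; the function name, the `max_len` parameter, and the toy dictionary are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by always taking the longest dictionary word at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, purely illustrative.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))
```

With this toy dictionary the greedy choice picks 研究生 first and leaves 命 stranded as a single character. Ambiguities of this kind are what the extra filtering rules in the Complex method are meant to resolve.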
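ansj
- ansj segmentation is described as a true Java implementation of ict, with segmentation quality and speed both better than the open-source version of ict. Features: Chinese word segmentation, person-name recognition, part-of-speech tagging, and user-defined dictionaries.
- Under active development.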

Plugin installation

smartcn & IK

Python&Solr

Official introduction

Plain HTTP; see the official documentation.
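As an illustration of the plain-HTTP route, the sketch below queries Solr's `select` handler with only the Python standard library and reads the JSON response. The core name `collection1`, the example query, and the helper names are assumptions; adjust them to your own setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def select_url(base_url, core, query, rows=10):
    """Build a URL for the select handler that asks for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "{}/{}/select?{}".format(base_url.rstrip("/"), core, params)


def parse_docs(body):
    """Extract the list of matching documents from a Solr JSON response body."""
    return json.loads(body)["response"]["docs"]


def search(base_url, core, query, rows=10):
    """Run the query against a live Solr instance (requires a running server)."""
    with urlopen(select_url(base_url, core, query, rows)) as resp:
        return parse_docs(resp.read().decode("utf-8"))
```

With a local instance running, a call might look like `search("http://localhost:8983/solr", "collection1", "title:nutch")`.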
mysolr
mysolr was born to be a fast and easy-to-use client for Apache Solr's API, because existing Python clients didn't fulfill these conditions. Since version 0.5 mysolr supports Python 3, except for the concurrent search feature.
pysolr (a fairly simple API; this is the one I currently use.)
pysolr is a lightweight Python wrapper for Apache Solr. It provides an interface that queries the server and returns results based on the query.
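A minimal sketch of typical pysolr usage follows. `pysolr.Solr`, `add`, `commit`, and `search` are calls provided by the library; the Solr URL, the document fields, and the two helper names are assumptions for illustration. The import is kept inside `make_client` so the second helper can be tried with any object that offers the same three methods.

```python
def make_client(url):
    """Create a pysolr client; imported lazily so pysolr is only needed here."""
    import pysolr
    return pysolr.Solr(url, timeout=10)


def index_and_search(solr, docs, query):
    """Index the documents, commit, and return the ids of the matching hits."""
    solr.add(docs)
    solr.commit()
    # Iterating over the search results yields one dict per matching document.
    return [hit["id"] for hit in solr.search(query)]
```

Against a running server this could be used as `index_and_search(make_client("http://localhost:8983/solr/collection1"), [{"id": "doc-1", "title": "Nutch"}], "title:Nutch")`, assuming the core's schema defines `id` and `title`.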
Haystack (more complex)
Haystack provides modular search for Django. It features a unified, familiar API that allows you to plug in different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.) without having to modify your code.
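For reference, pointing Haystack at Solr is done in the Django settings module. The fragment below is a sketch of that setting; the URL is an assumption and should match your own Solr instance or core.

```python
# Fragment of a Django settings.py; the URL is an assumed local Solr address.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr",
    },
}
```

Swapping to another backend means changing `ENGINE` (and the backend-specific keys) while the search code stays the same, which is the "plug in different search backends" point made above.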
insol (looks good, but I have doubts about which Solr versions it supports; officially it claims compatibility with 1.4)
- REPL friendly shortcuts module to start working right away
- Solr queries as Python objects, so that others can use your code abstracted away from inner workings of Solr - this is a design similar to Django ORM with its Q and F objects
- fast and cache friendly - results as simple dicts, no builtin dict to object inflation code - either use the results as-is or provide your own inflation mechanism
- configuration module with live config reload to support connecting to multiple Solr instances or cores at run time
- flexible structure allowing you to customize the whole process of connecting to Solr instance and fetching documents without rewriting whole API
sunburnt
It's tested with Solr 1.4.1 and 3.1; previous versions were known to work with 1.3 and 1.4 as well.
solrpy