Announcing Couch Crawler, a CouchDB search engine/crawler

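The author introduces CouchCrawler, a search engine and crawler built on CouchDB and Lucene, intended as a customizable, modern search solution for a work intranet. Compared with Nutch, CouchCrawler offers a more flexible index update mechanism and a simpler architecture: a Python crawler fetches HTML and parses it with BeautifulSoup, and the UI is built dynamically against CouchDB.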

http://syntacticbayleaves.com/2010/01/17/announcing-couch-crawler-a-couchdb-search-enginecrawler/

Hi! So, for fun, I made couch-crawler, a search engine and crawler on top of the very excellent couchdb-lucene. I wanted to create a hackable search engine for my work intranet using modern tools. Lucene is great, but the Nutch search engine/crawler was kind of annoying to work with. I couldn’t figure out how to get it to update the search indexes without restarting the server, which sucks. Also, I just really, really like CouchDB.

There’s no real web tier: CouchDB hosts static JavaScript/HTML/CSS files, and the UI gets built up dynamically with AJAX calls to CouchDB. It’s kind of nice to be able to cut out a whole layer of glue code.

Templating is done with mustache.js, a JavaScript templating language that does a good job of being a dumb template language, making you define your presentation logic in JavaScript, where it should be.

On the indexing side of things, there’s a crawler written in Python that pulls down HTML, parses it with BeautifulSoup, extracts useful text content to be indexed, then follows links within the page to a specified max depth. It probably could be smarter and parallel-er, but I wanted to start with a simple design and iterate over it.
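To make the crawl loop concrete, here is a minimal sketch of a depth-limited crawler. It is not couch-crawler's actual code: to stay dependency-free it swaps in the standard library's `html.parser` for BeautifulSoup, and the names `PageParser`, `fetch`, and `crawl`, as well as the `url`/`title`/`contents` keys, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch):
    """Breadth-first crawl from start_url, following links up to max_depth."""
    seen = set()
    docs = []
    queue = [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = PageParser()
        parser.feed(html)
        docs.append({
            "url": url,
            "title": parser.title.strip(),
            "contents": " ".join(parser.text_parts),
        })
        if depth < max_depth:
            for href in parser.links:
                # Resolve relative links and drop #fragments.
                link, _ = urldefrag(urljoin(url, href))
                if link not in seen:
                    queue.append((link, depth + 1))
    return docs
```

Passing `fetch` as a parameter keeps the loop testable without a network; each returned dict could then be saved to CouchDB as one document.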

The couchdb-lucene indexer indexes the title, URL and contents, and saves the first 140 characters from the contents in the index to display with search results.
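The field split described above can be sketched in a few lines. Note that a couchdb-lucene index function is really JavaScript living in a CouchDB design document, so this Python version only mirrors the idea; `build_index_fields`, `SUMMARY_LENGTH`, and the `summary` field name are hypothetical, not taken from couch-crawler.

```python
SUMMARY_LENGTH = 140  # characters of contents kept for the result snippet


def build_index_fields(page):
    """Return the fields to index for one crawled page.

    `page` is assumed to be a dict with "url", "title", and "contents" keys.
    """
    # Collapse runs of whitespace so the snippet reads cleanly.
    contents = " ".join(page.get("contents", "").split())
    return {
        "title": page.get("title", ""),
        "url": page["url"],
        "contents": contents,
        # Stored alongside the index so results can show a preview.
        "summary": contents[:SUMMARY_LENGTH],
    }
```

Truncating at a fixed character count is the simplest choice; it can cut a word in half, which a smarter version might avoid by breaking at the last space.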

Ch-ch-check it out and let me know what you think.

P.S. If you use Homebrew for your OS X packaging needs, I have a fork of homebrew with a couchdb-lucene formula for easy installation.
