nutchbase=nutch+hbase

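Just as we were fretting over Nutch's architecture, the Nutch developers delivered nutchbase. My own simple tests show that, with minor modifications, it runs on Hadoop 0.20.1 and HBase 0.20.2.
Its advantage is obvious: a sound architecture.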

Here is what the developer says, quoted from JIRA:
[url]http://issues.apache.org/jira/browse/NUTCH-650[/url]


A) Why integrate with hbase?

All your data in a central location
No more segment/crawldb/linkdb merges.
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
A much simpler data model. If you want to update a small part of a single record, you currently have to write an MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language, may be easier for people to understand and use.
B) Design
Design is actually rather straightforward.

We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates.
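To make the read-through behaviour described above concrete, here is a minimal sketch in plain Java. It is not the nutchbase code: the class names RowPartSketch and ImmutableRowSketch are hypothetical, a plain Map stands in for HBase's RowResult, the inheritance between the two classes is my assumption, and pendingUpdates() only imitates what building a BatchUpdate would collect.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for RowPart: adds setters and remembers pending updates. */
class RowPartSketch extends ImmutableRowSketch {
    private final Map<String, String> updates = new HashMap<>();

    RowPartSketch(Map<String, String> stored) {
        super(stored);
    }

    @Override
    String get(String column) {
        // Prefer a pending update; otherwise fall back to the value read from the table.
        if (updates.containsKey(column)) {
            return updates.get(column);
        }
        return super.get(column);
    }

    void set(String column, String value) {
        updates.put(column, value);
    }

    void setStatus(String status) {
        set("status:", status);
    }

    /** Imitates creating a BatchUpdate: returns only the columns that were changed. */
    Map<String, String> pendingUpdates() {
        return new HashMap<>(updates);
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("status:", "unfetched");
        RowPartSketch row = new RowPartSketch(stored);
        row.setStatus("fetched");
        System.out.println(row.getStatus());
        System.out.println(row.pendingUpdates());
    }
}

/** Hypothetical stand-in for ImmutableRowPart: a read-only view with helper getters. */
class ImmutableRowSketch {
    // Column name -> value; a Map replaces RowResult for illustration only.
    private final Map<String, String> stored;

    ImmutableRowSketch(Map<String, String> stored) {
        this.stored = Collections.unmodifiableMap(new HashMap<>(stored));
    }

    String get(String column) {
        return stored.get(column);
    }

    String getStatus() {
        return get("status:");
    }
}
```

The point of keeping updates in a separate map is that a job can read and modify a row many times in memory and still send only the changed columns back to the table in one write.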
URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
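The row-key transformation can be sketched in a few lines of Java. This is my own illustration of the format shown in the example, not the TableUtil source: the class UrlReverseSketch and its method names are hypothetical, and the sketch assumes a well-formed URL with a plain host name (no IPv6 literals, user info or fragments).

```java
import java.net.URI;

/** Hypothetical helper illustrating the reversed-host row key format. */
class UrlReverseSketch {

    /** Turns scheme://a.b.c:port/path?query into c.b.a:scheme:port/path?query. */
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        StringBuilder key = new StringBuilder(reverseHost(uri.getHost()));
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    /** Inverse of reverseUrl for keys produced by it. */
    static String unreverseUrl(String key) {
        // The head (host:scheme[:port]) ends at the first '/' or '?'.
        int end = key.length();
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c == '/' || c == '?') {
                end = i;
                break;
            }
        }
        String[] head = key.substring(0, end).split(":");
        StringBuilder url = new StringBuilder(head[1]).append("://").append(reverseHost(head[0]));
        if (head.length > 2) {
            url.append(':').append(head[2]);
        }
        return url.append(key.substring(end)).toString();
    }

    /** Reverses the dot-separated labels of a host name; applying it twice is a no-op. */
    private static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder out = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            out.append(labels[i]);
            if (i > 0) {
                out.append('.');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String key = reverseUrl("http://bar.foo.com:8983/to/index.html?a=b");
        System.out.println(key);
        System.out.println(unreverseUrl(key));
    }
}
```

Because HBase keeps rows sorted by key, keys that share the prefix com.foo. end up next to each other, so a scan over one domain touches a contiguous range of rows.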
Jobs:

Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated, it marks the URL with a TMP_FETCH_MARK (marking a URL simply means creating a special metadata field). When FetcherHbase runs, it skips over anything without this special mark.
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker.
FetcherHbase: Very similar to the original Fetcher. Marks URLs that were fetched successfully. Skips over URLs not marked by GeneratorHbase.
ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
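
The marker hand-off between jobs can be simulated without HBase. The sketch below is only an illustration under my own assumptions: the class MarkerPipelineSketch is hypothetical, an in-memory map of maps stands in for the webtable, the "content:" column name and the placeholder value written into it are made up, and the generator simply marks the first topN rows instead of applying real scoring or max-per-host limits. Only the marker name TMP_FETCH_MARK comes from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory simulation of jobs handing rows to each other via markers. */
class MarkerPipelineSketch {
    static final String FETCH_MARK = "TMP_FETCH_MARK";

    /** Generator step: marks at most topN rows for fetching. */
    static void generate(Map<String, Map<String, String>> table, int topN) {
        int marked = 0;
        for (Map<String, String> row : table.values()) {
            if (marked >= topN) {
                break;
            }
            row.put(FETCH_MARK, "y");
            marked++;
        }
    }

    /** Fetcher step: skips every row that the generator did not mark. */
    static List<String> fetch(Map<String, Map<String, String>> table) {
        List<String> fetched = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : table.entrySet()) {
            Map<String, String> row = entry.getValue();
            if (!row.containsKey(FETCH_MARK)) {
                continue;
            }
            row.put("content:", "<placeholder content>"); // assumed column name
            fetched.add(entry.getKey());
        }
        return fetched;
    }

    /** Update step: clears the markers so the next cycle starts clean. */
    static void update(Map<String, Map<String, String>> table) {
        for (Map<String, String> row : table.values()) {
            row.remove(FETCH_MARK);
        }
    }

    /** Builds a small table keyed by reversed URLs, in insertion order. */
    static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        for (String key : new String[] {"com.foo.a:http/", "com.foo.b:http/", "org.bar:http/"}) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("status:", "unfetched");
            table.put(key, row);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = sampleTable();
        generate(table, 2);
        System.out.println("fetched: " + fetch(table));
        update(table);
        System.out.println("table after update: " + table);
    }
}
```

Since every job reads and writes the same table, the marker column is the only state that has to pass from one job to the next; nothing is copied into a separate segment or database directory.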