
Original: Yii2 question 20140204

What do beginPage and beginBody in main.php under layouts do?

2014-02-04 22:22:43 116

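Original: Upgrading from Yii 1.1 to Yii2 (translated from yii2/docs/guide/upgrade-from-v1.md)

[Yii2 discussion group 146409855] In this chapter, we list the major changes from Yii 1.1 to Yii 2.0. We hope this list makes it easier for you to upgrade from Yii 1.1 and to pick up Yii 2.0 quickly on top of what you already know about Yii. Namespaces: the most obvious change in Yii 2.0 is the use of namespaces. Almost every core class is namespaced, for example yii\web\Request. The "C" prefix is no longer used for class...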

2013-06-13 15:18:26 208

Original: What Is Text Mining?

Marti Hearst, SIMS, UC Berkeley, hearst@sims.berkeley.edu. October 17, 2003. I wrote this essay for people who are curious about the topic of text mining after having read the New York Times articl...

2012-12-26 23:12:59 196

Original: Method for extracting company names from text

A method for extracting company names from textual information uses a combination of heuristics, exception lists, and extensive corpus analysis. The method first locates company name suffixes (i.e., C...

2012-12-26 23:07:55 353

Original: org.archive.modules.extractor.Hop

The kind of "hop" from one URI to another. Each hop type can be represented by a single character; strings of these characters can appear in logs. E.g., "LLLX" means that a URI was t...

2012-12-20 21:41:18 103
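
The Hop excerpt above says that each hop type maps to a single character and that strings of those characters show up in logs. A minimal Python sketch of reading such a string follows. The character-to-name table is an assumption based on Heritrix's Hop enum as I understand it, and the function names `describe_hop_path` and `summarize_hop_path` are hypothetical helpers, not part of Heritrix; check the table against the Heritrix version you run.

```python
from itertools import groupby

# Assumed mapping, modeled on Heritrix's Hop enum; verify against your version.
HOP_NAMES = {
    "L": "navlink",      # ordinary navigation link
    "E": "embed",        # embedded resource
    "R": "refer",        # redirect
    "X": "speculative",  # speculatively extracted link
    "P": "prereq",       # prerequisite fetched first
    "I": "inferred",     # inferred URI
}


def describe_hop_path(path):
    """Expand a hop-path string into one label per hop character."""
    return [HOP_NAMES.get(ch, "unknown(%s)" % ch) for ch in path]


def summarize_hop_path(path):
    """Collapse runs of identical hops into (label, count) pairs."""
    return [
        (HOP_NAMES.get(ch, "unknown(%s)" % ch), len(list(run)))
        for ch, run in groupby(path)
    ]


if __name__ == "__main__":
    print(describe_hop_path("LLLX"))
    print(summarize_hop_path("LLLX"))
```

Characters missing from the table are reported as `unknown(...)` rather than raising, since log lines may contain hop types this sketch does not list.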

Original: org.archive.crawler.framework.ToeThread

1. controller.getFetchChain().process(curi, this);
  1.1 org.archive.crawler.prefetch.Preselector
  1.2 org.archive.crawler.prefetch.PreconditionEnforcer
  1.3 org.archive.modules.fetcher.FetchD...

2012-12-17 23:15:41 147

Original: org.archive.modules.deciderules.DecideRuleSequence

ToeThread.run()
ProcessorChain.process(CrawlURI curi, ChainStatusReceiver thread)
Processor.process(CrawlURI curi)
Scoper.isInScope(CrawlURI caUri)
// foreach getRules()
DecideResult r = rule.de...

2012-12-17 17:34:34 120

Original: Processor

When a URI is crawled, a ToeThread will execute a series of processors on it. The processors are split into 5 distinct chains that are executed in sequence:
  • Pre-fetch processing chain
  • Fetch pro...

2012-12-11 22:01:34 145

Original: crawler-beans.cxml

1. CrawlMetadata: including identification of crawler/operator
org.archive.modules.CrawlMetadata: Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs....

2012-12-11 14:06:41 154

Original: Mirroring HTML Files Only

you would like to save the crawled files in a file/directory format instead of saving them in WARC files. First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean...

2012-12-11 08:10:31 110

Original: hbase-writer

A Hadoop HBase WriterPool implementation for the Heritrix crawler

2012-12-10 23:59:27 144

Original: org.archive.crawler.restlet.JobResource

1. build: validateConfiguration()
2. launch: launch(), new Thread start, CrawlController.requestCrawlStart(), getFrontier().run();
3. pause: getCrawlController().requestCrawlPause()
4. unpause: getC...

2012-12-09 23:30:49 124

Original: org.archive.crawler.Heritrix

1. Ensure using Java 1.6+ before hitting a later cryptic error
2. Set some system properties early: ignoredSchemes, maxFormSize
3. Parsing command line options
4. DEFAULTS until changed by cmd-line op...

2012-12-09 22:26:35 201

Original: A Quick Guide to Running Your First Crawl Job

The Main Console page is displayed after you have installed Heritrix and logged into the WUI. Enter the name of the new job in the text box with the "Create new job with recommended starting confi...

2012-12-09 16:21:09 128

Original: How to install heritrix3

Use svn to check out the project from sourceforge.net at https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix3. Especially if you're customizing Heritrix (as seem...

2012-12-09 12:11:02 135

Scrapy default settings

BOT_NAME = 'scrapybot'
CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0
COMMANDS_MODULE = ''
CONCURRENT_ITEMS = 100
CONCURRENT_RE...

2012-12-04 22:49:33 143
