Please credit the source when reposting this article: http://blog.youkuaiyun.com/pwlazy
Command "inject": net.nutch.db.WebDBInjector
> "inject: inject new urls into the database"
> Usage: WebDBInjector <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]
WebDBInjector.main() accepts two input-type options. "-urlfile" parses a simple list of URLs with one URL per line. "-dmozfile" is for parsing DMOZ RDF files, which is useful for bootstrapping a whole-web database.
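To make the "-urlfile" format concrete, here is a minimal sketch that writes such a list without opening an editor and then sanity-checks it. The address is a placeholder, and the check (every non-blank line starts with a scheme such as http://) is only an assumption about what a well-formed line looks like, not a statement of the rules WebDBInjector itself applies.

```shell
# Write a URL list with one URL per line.
# The address below is a placeholder for illustration only.
url_file=spam_url.txt
printf '%s\n' 'http://www.example.com/' > "$url_file"

# Count non-blank lines that do not start with a scheme such as http://
# (an assumed sanity check, not Nutch's own validation).
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad_lines=$(grep -v '^[[:space:]]*$' "$url_file" | grep -cv '^[a-z][a-z0-9+.-]*://' || true)
echo "suspicious lines: $bad_lines"
```

A count of 0 suggests that every line holds exactly one URL, which is the shape the option description above asks for.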
Let's see how it works. Create a file with one URL, then run "bin/nutch inject":
$ vi spam_url.txt
$ bin/nutch inject spam -urlfile spam_url.txt
$ find spam -type f | xargs ls -l
-rw-r--r-- 1 kangas users 0 Oct 25 18:57 spam/dbreadlock
-rw-r--r-- 1 kangas users 0 Oct 25 18:57 spam/dbwritelock
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByMD5/data
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByMD5/index
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByURL/data
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByURL/index
-rw-r--r-- 1 kangas users 89 Oct 25 18:57 spam/webdb/pagesByMD5/data
-rw-r--r-- 1 kangas users 97 Oct 25 18:57 spam/webdb/pagesByMD5/index
-rw-r--r-- 1 kangas users 115 Oct 25 18:57 spam/webdb/pagesByURL/data
-rw-r--r-- 1 kangas users 58 Oct 25 18:57 spam/webdb/pagesByURL/index
-rw-r--r-- 1 kangas users 17 Oct 25 18:57 spam/webdb/stats
We can see that a new "stats" file was created, and the data/index files in the "pagesBy..." directories were modified.
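One way to confirm which files an inject run touches is to record file sizes before and after and compare them. The sketch below is only an illustration: the snapshot helper and the before.txt/after.txt names are hypothetical, and the inject step is skipped unless bin/nutch and the "spam" database directory already exist.

```shell
# Print "<size in bytes> <path>" for every regular file under a directory,
# sorted by path. Hypothetical helper, not part of Nutch.
snapshot() {
    find "$1" -type f | sort | while IFS= read -r f; do
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

db_dir=spam   # database directory used in the example above

# Only run the comparison when Nutch and the database are actually present.
if [ -x bin/nutch ] && [ -d "$db_dir" ]; then
    snapshot "$db_dir" > before.txt
    bin/nutch inject "$db_dir" -urlfile spam_url.txt
    snapshot "$db_dir" > after.txt
    # diff exits non-zero when the snapshots differ; that is the expected case.
    diff before.txt after.txt || true
fi
```

Files that appear only in after.txt are new, and files whose size changed between the two snapshots were rewritten. A file rewritten with the same size would not show up, so this is a rough check rather than a complete one.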
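This article describes how to use the inject module (WebDBInjector) of the Nutch crawler and how it works. The module inserts new URLs into the database and can read them from either a simple URL list file or a DMOZ RDF file.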