Suppose you would like to save the crawled files in a file/directory format instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This Processor will store files in a directory structure that matches the crawled URIs. The files will be stored in the crawl job's mirror directory.
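The bean change described above can be sketched as the following fragment of the job's crawler-beans.cxml. This is a minimal sketch, not a complete configuration: it assumes the job was created from a default Heritrix 3 profile in which a bean with the id warcWriter already exists and is referenced from the disposition chain, and it leaves every MirrorWriterProcessor property at its default.

```xml
<!-- Assumption: this replaces the existing warcWriter bean definition in
     crawler-beans.cxml. The id is kept unchanged so that any existing
     <ref bean="warcWriter"/> elsewhere in the file still resolves. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- No properties set: the processor's defaults are used, so output goes
       to the job's mirror directory as described in the text. -->
</bean>
```

With the seed http://foo.org/bar/, a fetched page such as http://foo.org/bar/index.html (a hypothetical URI, used only for illustration) would be expected to land at a path like mirror/foo.org/bar/index.html inside the job directory. The exact layout depends on the processor's settings and the Heritrix version.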
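Saving crawled files in a directory format instead of WARC files
This article describes how to configure a crawl job so that crawled files are saved in a directory format rather than in the traditional WARC file format. Using MirrorWriterProcessor ensures that files are stored under a specific directory according to the structure of the crawled URLs.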



