The spider code is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem  # assumes the project is named 'tutorial'

class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        # inspect_response(response)  # debugging hook from scrapy.shell
        items = []
        # print '----------------------------------------------------'
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        # print type(items), '++++++++++++++++++++++++++++++++'
        return items
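The spider assigns to the title, link, and desc fields of DmozItem, which is not shown above. A minimal sketch of the corresponding items.py (the field names are taken from the spider; the module layout is an assumption):

# items.py -- minimal sketch; fields match what the spider assigns
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()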
The pipeline code is as follows:
import json

class TutorialPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # print type(item), '&&&&&&&&&&&&&&&&&&&&&&&&&&&'
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
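Note that the pipeline only runs if it is enabled in the project settings. A minimal sketch of settings.py, assuming the project is named tutorial (the integer orders pipelines from 0 to 1000, lower values run first; older Scrapy versions used a plain list instead of a dict):

# settings.py -- minimal sketch; the project name 'tutorial' is an assumption
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,  # lower values run first
}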
The following behavior was observed:
The list of items returned by the spider's parse method is iterated over, and the pipeline's process_item method is called once per item; uncommenting the print statements above makes this visible.
Running scrapy crawl dmoz --nolog suppresses Scrapy's own log output, so only the print statements appear on the console.
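Each call to process_item then appends one JSON line to items.jl. A hypothetical example line (only the keys come from DmozItem; the values are made up, and each one is a list because extract() returns all matches):

{"desc": ["\n "], "link": ["http://www.example.com/"], "title": ["Example Book"]}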
The code comes from http://scrapy-chs.readthedocs.org/zh_CN/0.22/topics/item-pipeline.html