A few small fixes to "Building a Crawler System with Scrapy and MongoDB in Python" (《Python下用Scrapy和MongoDB构建爬虫系统》, http://python.jobbole.com/81320/).
1. Create the project, named air2: scrapy startproject air2
It crawls the stackoverflow.com questions page http://stackoverflow.com/questions?pagesize=50&sort=newest
2. Directory structure
├── scrapy.cfg
└── air2
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── air2_spider.py
        └── __init__.py
3. What changed: every Scrapy class referenced in these files is now qualified with the scrapy module's namespace, which prevents error output such as "Crawled 0 pages (at 0 pages/min)".
Two files are modified: air2/spiders/air2_spider.py and air2/items.py.
The code of air2/spiders/air2_spider.py is as follows:
import sys
sys.path.insert(0, '..')  # make items.py importable when the crawl is run from the spiders directory (see step 4)
import items
import scrapy
from scrapy import Spider


class Air2Spider(Spider):
    name = "air2"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        questions = sel.xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = items.Air2Item()
            title = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            url = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            print(title)
            print(url)
            item['title'] = title
            item['url'] = url
            yield item
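A small optional tweak that is not in the original post: the href values extracted above are relative paths (as the output in step 4 shows). If absolute links are preferred, they can be resolved against the page URL with the standard library's urljoin, e.g. item['url'] = urljoin(response.url, url) just before yield item. A standalone sketch, using the start URL and one href from this post:

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Resolve a relative question link against the page it was scraped from.
page = "http://stackoverflow.com/questions?pagesize=50&sort=newest"
href = "/questions/29971480/facebook-unity-multi-friend-selector-not-showing-pictures-on-developer-users-whe"
print(urljoin(page, href))
# http://stackoverflow.com/questions/29971480/facebook-unity-multi-friend-selector-not-showing-pictures-on-developer-users-whe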
The code of air2/items.py is as follows:
import scrapy


class Air2Item(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
4. Run
cd air2/air2/spiders/
scrapy crawl air2 -o items.json -t json
This produces an items.json file:
[{"url": "/questions/29971480/facebook-unity-multi-friend-selector-not-showing-pictures-on-developer-users-whe", "title": "Facebook Unity multi friend selector not showing pictures on developer users when doing FB.AppRequest"},
{"url": "/questions/29971477/reading-information-from-memory-array-into-excel-sheet-formula", "title": "Reading information from Memory array into excel sheet formula"},
...]
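This last bit is not a step from the original post, just a quick sanity check of the export, assuming it is run from the same air2/air2/spiders/ directory where items.json was written:

import json

# Load the file produced by "scrapy crawl air2 -o items.json -t json" above.
with open('items.json') as f:
    items = json.load(f)

print(len(items), 'questions exported')
print(items[0]['title'], '->', items[0]['url'])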