Scrapy spiders can process data feeds served as HTML, XML, JSON, CSV and similar formats. Taking XML as an example:
# In Scrapy releases before 1.0 this class lived in scrapy.contrib.spiders
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.techbrood.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'  # parse_node() is called once for every <item> node

    def parse_node(self, response, node):
        # self.logger replaces the scrapy.log module, which was removed in Scrapy 1.0
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, node.extract())

        item = TestItem()
        # extract() on an XPath result returns a list of all matching strings
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name/text()').extract()
        item['description'] = node.xpath('description/text()').extract()
        return item
Note that name/text() selects only the text inside the name node, so the XML tags are dropped and just the actual content is extracted.
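The difference between a node's markup and its text content can be seen in a small standalone sketch. It deliberately uses the standard library's xml.etree.ElementTree as a stand-in rather than Scrapy's selectors, so it runs without a Scrapy project. The sample feed and the helper names parse_feed and name_markup are hypothetical, made up for illustration. Unlike extract() in the spider above, findtext() returns a single string rather than a list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, shaped like the one the spider above expects.
SAMPLE_FEED = """<rss>
  <channel>
    <item id="1">
      <name>First</name>
      <description>First description</description>
    </item>
    <item id="2">
      <name>Second</name>
      <description>Second description</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> node, holding text content only."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):  # plays the role of itertag = 'item'
        items.append({
            "id": node.get("id"),                         # comparable to @id
            "name": node.findtext("name"),                # comparable to name/text()
            "description": node.findtext("description"),  # comparable to description/text()
        })
    return items


def name_markup(xml_text):
    """Return the serialized <name> element of the first item, tags included."""
    first = next(ET.fromstring(xml_text).iter("item"))
    return ET.tostring(first.find("name"), encoding="unicode").strip()


if __name__ == "__main__":
    print(name_markup(SAMPLE_FEED))  # the element with its tags
    print(parse_feed(SAMPLE_FEED))   # text content only
```

Selecting the name node without text() corresponds to name_markup here: the tags come back along with the content, which is rarely what an item field should store.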
Reference:
http://doc.scrapy.org/en/latest/topics/spiders.html?highlight=xml%20parse#scrapy.contrib.spiders.XMLFeedSpider.parse_node