Typing scrapy on its own prints the help and the list of available commands:
$ scrapy
Scrapy 1.3.3 - project: chinese
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
- bench: run a quick benchmark test
$ scrapy bench
- crawl: run a spider
# This spider lives in the project's spiders folder
import scrapy

class EduSpider(scrapy.Spider):
    name = "edu"
    allowed_domains = ["xx.xxx.com"]
    start_urls = ['http://xx.xxx.com/pinyi.html']

    def parse(self, response):
        # follow every detail link found in the table
        for url in response.xpath('//table[@id="table1"]//a[@class="fontbox"]/@href').extract():
            yield scrapy.Request('http://xx.xxx.com/' + url, callback=self.parse_item)

    def parse_item(self, response):
        pass  # detail-page callback (stub so the example is complete)
crawl must be run from inside a project:
$ scrapy crawl edu
# edu is the value of name in the spider class
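crawl can also export everything the spider yields via the -o option; the file extension picks the feed format (items.json here is just an example name):
$ scrapy crawl edu -o items.json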
- fetch: download a URL with the Scrapy downloader and print the response to standard output
$ scrapy fetch http://www.baidu.com
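Since the page body goes to standard output, a common pattern is to silence the log with the global --nolog option and redirect the body into a file (baidu.html is just an example name):
$ scrapy fetch --nolog http://www.baidu.com > baidu.html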
- view: open a URL in the browser as Scrapy sees it (handy for checking whether the page the downloader receives differs from what a normal browser renders)
$ scrapy view http://www.baidu.com
- version: print the Scrapy version
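$ scrapy version
Adding -v also reports the versions of the main dependencies (Twisted, lxml, Python, and so on):
$ scrapy version -v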
- list: list the spiders in the current project
$ scrapy list
- genspider: generate a new spider from a template (a very important command)
genspider takes several options:
1. --list, -l: list the available templates
$ scrapy genspider -l
# output:
Available templates:
basic
crawl
csvfeed
xmlfeed
So four templates are available: basic, crawl, csvfeed, and xmlfeed.
2. --edit, -e: open the new spider in an editor after creating it
3. --dump, -d: dump the generated spider to standard output instead of writing a file
4. --template, -t: specify which template to use (it must be one of the templates listed by -l)
$ scrapy genspider -t crawl baidu www.baidu.com
5. --force: overwrite the spider if one with that name already exists
$ scrapy genspider -t crawl --force baidu www.baidu.com
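For reference, the crawl template generates roughly the skeleton below in spiders/baidu.py; the allow pattern and the parse_item body are placeholders that you are expected to edit:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BaiduSpider(CrawlSpider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    rules = (
        # follow links matching the allow pattern and hand them to parse_item
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extraction logic goes here; the template leaves this essentially empty
        return {}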
- shell: interactive scraping console
Very useful for debugging XPath expressions and other calls; installing IPython beforehand is recommended, since Scrapy will use it for the shell automatically.
$ scrapy shell http://www.baidu.com
Many built-in objects are available inside the shell; response is the one used most often.
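A quick session looks like this (the XPath is only an illustration):
$ scrapy shell http://www.baidu.com
>>> response.status                                   # HTTP status of the fetched page
>>> response.xpath('//title/text()').extract_first()  # text of the <title> element
>>> view(response)                                    # shell shortcut: open the page in a browser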
- check: run spider contract tests
def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
@url: required; the URL the contract test fetches
@returns items 1 16: lower and upper bounds on the number of items the callback may return
@returns requests 0 0: lower and upper bounds on the number of requests it may return
@scrapes Title Author Year Price: fields that every returned item must contain
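To run the contracts, pass the spider's name to check (edu is the spider defined earlier):
$ scrapy check edu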