I recently discovered that Python's Scrapy library is remarkably handy for writing crawlers; with a bit of imagination you can scrape just about anything. My test environment for this post is Ubuntu 12.04.
Step one: install Scrapy:
pip install Scrapy
easy_install scrapy
Running either one of these commands installs Scrapy.
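To confirm the install worked, here is a quick optional check from the Python prompt; it simply imports the package and prints whichever version ended up installed:

# optional sanity check: import Scrapy and print its version string
import scrapy
print(scrapy.__version__)    # should print the installed version, e.g. a 0.12.x release on this setup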
Next, let's use it to put together a simple crawl.
Start with the bare command: scrapy
This prints the list of available subcommands; we will be using several of them together below:
wangyu@ubuntu:~$ scrapy
Scrapy 0.12.0.2546 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch          Fetch a URL using the Scrapy downloader
  runspider      Run a self-contained spider (without creating a project)
  settings       Get settings values
  shell          Interactive scraping console
  startproject   Create new project
  version        Print Scrapy version
  view           Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
Next, the startproject command generates a project folder in the current directory:
scrapy startproject wo
wangyu@ubuntu:~/wo$ tree
.
├── scrapy.cfg
└── wo
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
The tree shows what our wo folder currently contains.
Let's start by opening items.py in vim:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
That is the empty stub as generated; now let's edit it:
# Define here the models for your scraped items
# -*- coding: utf-8 -*-
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    # we define two fields: a title and a link
I added a few lines here: the coding declaration, a comment, and the two field definitions (title and link) inside the class. Next comes the spider itself. To build a spider you create a subclass of scrapy.spider.BaseSpider and define three mandatory attributes:
name: must be unique
start_urls: the URLs where the spider starts crawling
parse(): the spider's callback method; it is called with the Response object downloaded from each URL, and that response is its only argument
Save the spider below as a new .py file inside the wo/spiders/ directory (any filename works, for example wo/spiders/wo_spider.py; Scrapy finds spiders by their name attribute):
from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    # allowed_domains limits which domains the spider may crawl
    allowed_domains = ["jandan.net"]
    start_urls = [
        "http://jandan.net/new",
        "http://jandan.net/fml"
    ]

    def parse(self, response):
        # name the output file after the second-to-last "/"-separated piece
        # of the URL, then write the raw response body into it
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
We are keeping this example as simple as possible.
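One aside: this bare-bones spider never actually uses the WoItem we defined in items.py. Scrapy items behave like dictionaries whose keys are restricted to the declared fields, so filling one in would look like the following standalone sketch (the values are placeholders, and it assumes you run it from the project root so the wo package is importable):

# minimal sketch of the dict-like Item API; the values are placeholders
from wo.items import WoItem

item = WoItem()
item['title'] = 'some page title'
item['link'] = 'http://jandan.net/new'
print(item)    # shows the populated fields; assigning an undeclared key would raise KeyError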
Now run it: scrapy crawl wo
wangyu@ubuntu:~/wo$ scrapy crawl wo
2013-09-21 20:53:40+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 20:53:40+0800 [wo] INFO: Spider opened
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/fml> (referer: None)
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/new> (referer: None)
2013-09-21 20:53:40+0800 [wo] INFO: Closing spider (finished)
2013-09-21 20:53:40+0800 [wo] INFO: Spider closed (finished)
That is the run output. Back in the wo project folder you will find a new file named jandan.net (both start URLs produce the same name under the split("/")[-2] rule, so the second response overwrites the first), and opening it shows nothing but raw HTML; we have not really saved a readable copy of the page yet. Let's tweak the spider a little:

from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    allowed_domains = ["jandan.net", "dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        "http://jandan.net/ooxx"
    ]

    def parse(self, response):
        # the file is named after the segment just before the last "/" in the URL,
        # then the response body is written into that file
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
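To see exactly why the saved files end up with these names, here is a throwaway snippet (plain Python, no Scrapy involved) that applies the same split("/")[-2] rule to each start URL:

# show the filename each start URL produces under the parse() naming rule
urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    "http://jandan.net/ooxx",
]
for url in urls:
    print(url.split("/")[-2])    # -> Books, Resources, jandan.net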
Run it again:
wangyu@ubuntu:~/wo$ scrapy crawl wo
2013-09-21 21:21:16+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 21:21:16+0800 [wo] INFO: Spider opened
2013-09-21 21:21:16+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/ooxx> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2013-09-21 21:21:18+0800 [wo] INFO: Closing spider (finished)
2013-09-21 21:21:18+0800 [wo] INFO: Spider closed (finished)
Now you can see three new files in the project's root folder, named Books, Resources, and jandan.net. With that, our little crawler is done.
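If you want to go a step further than saving raw HTML, the natural follow-up is to fill the title and link fields from items.py inside parse(). Below is a hedged sketch using the HtmlXPathSelector API that shipped with this 0.12-era Scrapy; the spider name and the XPath expressions are illustrative guesses, not tuned to any particular site:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wo.items import WoItem

class WoItemSpider(BaseSpider):
    # hypothetical second spider; its name just has to differ from "wo"
    name = "wo_items"
    allowed_domains = ["jandan.net"]
    start_urls = ["http://jandan.net/new"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # illustrative XPath: collect the text and href of every link on the page
        for a in hxs.select('//a'):
            item = WoItem()
            item['title'] = a.select('text()').extract()
            item['link'] = a.select('@href').extract()
            items.append(item)
        return items

Items returned from parse() are handed to the item pipelines, which are still empty in this project (see the "Enabled item pipelines:" line in the log above), so for now they would simply be logged as scraped.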