I recently discovered that Python's Scrapy library is remarkably handy for writing crawlers; with a bit of imagination you can scrape just about anything. My test environment for this post is Ubuntu 12.04.
Step one: install Scrapy:
pip install Scrapy
easy_install scrapy
Running either one of these commands installs Scrapy.
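To confirm the install worked, here is a quick optional check from the Python prompt; it simply imports the package and prints whichever version ended up installed:

# optional sanity check: import Scrapy and print its version string
import scrapy
print(scrapy.__version__)    # should print the installed version, e.g. a 0.12.x release on this setup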
Next, let's use it to put together a simple crawl.
Start with the bare command: scrapy
This prints the list of available subcommands; we will be using several of them together below:
wangyu@ubuntu:~$ scrapy
Scrapy 0.12.0.2546 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch          Fetch a URL using the Scrapy downloader
  runspider      Run a self-contained spider (without creating a project)
  settings       Get settings values
  shell          Interactive scraping console
  startproject   Create new project
  version        Print Scrapy version
  view           Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
Next, the startproject command generates a project folder in the current directory:
scrapy startproject wo
wangyu@ubuntu:~/wo$ tree
.
├── scrapy.cfg
└── wo
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
The tree shows what our wo folder currently contains.
Let's start by opening items.py in vim:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
That is the empty stub as generated; now let's edit it:
# Define here the models for your scraped items
# -*- coding: utf-8 -*-
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class WoItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    # we define two fields: a title and a link
I added a few lines here: the coding declaration, a comment, and the two field definitions (title and link) inside the class. Next comes the spider itself. To build a spider you create a subclass of scrapy.spider.BaseSpider and define three mandatory attributes:
name: must be unique
start_urls: the URLs where the spider starts crawling
parse(): the spider's callback method; it is called with the Response object downloaded from each URL, and that response is its only argument
Save the spider below as a new .py file inside the wo/spiders/ directory (any filename works, for example wo/spiders/wo_spider.py; Scrapy finds spiders by their name attribute):
from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    # allowed_domains limits which domains the spider may crawl
    allowed_domains = ["jandan.net"]
    start_urls = [
        "http://jandan.net/new",
        "http://jandan.net/fml"
    ]

    def parse(self, response):
        # name the output file after the second-to-last "/"-separated piece
        # of the URL, then write the raw response body into it
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
We are keeping this example as simple as possible.
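One aside: this bare-bones spider never actually uses the WoItem we defined in items.py. Scrapy items behave like dictionaries whose keys are restricted to the declared fields, so filling one in would look like the following standalone sketch (the values are placeholders, and it assumes you run it from the project root so the wo package is importable):

# minimal sketch of the dict-like Item API; the values are placeholders
from wo.items import WoItem

item = WoItem()
item['title'] = 'some page title'
item['link'] = 'http://jandan.net/new'
print(item)    # shows the populated fields; assigning an undeclared key would raise KeyError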
Now run it: scrapy crawl wo
wangyu@ubuntu:~/wo$ scrapy crawl wo
2013-09-21 20:53:40+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 20:53:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 20:53:40+0800 [wo] INFO: Spider opened
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/fml> (referer: None)
2013-09-21 20:53:40+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/new> (referer: None)
2013-09-21 20:53:40+0800 [wo] INFO: Closing spider (finished)
2013-09-21 20:53:40+0800 [wo] INFO: Spider closed (finished)
That is the run output. Back in the wo project folder you will find a new file named jandan.net (both start URLs produce the same name under the split("/")[-2] rule, so the second response overwrites the first), and opening it shows nothing but raw HTML; we have not really saved a readable copy of the page yet. Let's tweak the spider a little:

from scrapy.spider import BaseSpider

class wo(BaseSpider):
    name = "wo"
    allowed_domains = ["jandan.net", "dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        "http://jandan.net/ooxx"
    ]

    def parse(self, response):
        # the file is named after the segment just before the last "/" in the URL,
        # then the response body is written into that file
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
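To see exactly why the saved files end up with these names, here is a throwaway snippet (plain Python, no Scrapy involved) that applies the same split("/")[-2] rule to each start URL:

# show the filename each start URL produces under the parse() naming rule
urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    "http://jandan.net/ooxx",
]
for url in urls:
    print(url.split("/")[-2])    # -> Books, Resources, jandan.net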
Run it again:
wangyu@ubuntu:~/wo$ scrapy crawl wo
2013-09-21 21:21:16+0800 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: wo)
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-21 21:21:16+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-21 21:21:16+0800 [wo] INFO: Spider opened
2013-09-21 21:21:16+0800 [wo] DEBUG: Crawled (200) <GET http://jandan.net/ooxx> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-09-21 21:21:18+0800 [wo] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2013-09-21 21:21:18+0800 [wo] INFO: Closing spider (finished)
2013-09-21 21:21:18+0800 [wo] INFO: Spider closed (finished)
Now you can see three new files in the project's root folder, named Books, Resources, and jandan.net. With that, our little crawler is done.
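If you want to go a step further than saving raw HTML, the natural follow-up is to fill the title and link fields from items.py inside parse(). Below is a hedged sketch using the HtmlXPathSelector API that shipped with this 0.12-era Scrapy; the spider name and the XPath expressions are illustrative guesses, not tuned to any particular site:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wo.items import WoItem

class WoItemSpider(BaseSpider):
    # hypothetical second spider; its name just has to differ from "wo"
    name = "wo_items"
    allowed_domains = ["jandan.net"]
    start_urls = ["http://jandan.net/new"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # illustrative XPath: collect the text and href of every link on the page
        for a in hxs.select('//a'):
            item = WoItem()
            item['title'] = a.select('text()').extract()
            item['link'] = a.select('@href').extract()
            items.append(item)
        return items

Items returned from parse() are handed to the item pipelines, which are still empty in this project (see the "Enabled item pipelines:" line in the log above), so for now they would simply be logged as scraped.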