Setting Up Python and Scrapy

This article walks through installing and configuring a Python environment and the Scrapy framework from scratch, then demonstrates Scrapy's basic usage with a simple example spider project.

Installation steps

1. Install Python 2.7.8, and remember to add Python's install path to the PATH environment variable;

2. Download get-pip.py, then run python get-pip.py to install setuptools and pip;

3. To install Twisted, first install zope.interface-4.1.1.win32-py2.7.exe, then install Twisted-14.0.2.win32-py2.7.msi;

4. In cmd, run pip install w3lib;

5. In cmd, run pip install lxml;

6. In cmd, run pip install pyopenssl;

7. In cmd, run pip install service_identity;

8. In cmd, run pip install scrapy.

If any dependencies turn out to be missing during the install, install them with pip, or find the corresponding tar/exe package and install it by hand. A quick sanity check for the whole toolchain follows below.
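
Once the steps above finish, a minimal sanity check (a sketch, assuming every package installed cleanly) is to import each dependency from the Python shell; any missing package fails immediately with an ImportError:

# Sanity check: import every dependency installed above.
# Each import raises ImportError if that package is missing.
import w3lib
import lxml
import OpenSSL          # provided by the pyopenssl package
import zope.interface
import twisted
import scrapy

print "Scrapy", scrapy.__version__, "on Twisted", twisted.version.short()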

To try a first demo, open cmd in a working folder and run scrapy startproject Test.

This generates the following project layout:

Test/
    scrapy.cfg
    Test/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
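
Here scrapy.cfg marks the project root and points the scrapy command at the project's settings module. The generated file looks roughly like this (exact contents vary by Scrapy version, so treat it as a sketch):

[settings]
default = Test.settings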

Under spiders/, create a new file dmoz_spider.py with the following code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                    # unique spider name, used by "scrapy crawl dmoz"
    allowed_domains = ["dmoz.org"]   # requests outside this domain are filtered out
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Called once per downloaded start URL; dump the raw page body
        # to a file named after the second-to-last URL path segment.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
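
For reference, the filename comes from the second-to-last "/"-separated segment of the URL (the trailing slash makes the last segment empty):

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print url.split("/")[-2]   # prints: Books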

Next, edit Test/items.py to declare the fields we want to extract:

import scrapy

class TestItem(scrapy.Item):
    # One Field per piece of data the spider will extract.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
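
With the fields declared, parse() can yield TestItem objects instead of dumping raw HTML. The sketch below follows the pattern from the old Scrapy tutorial; the //ul/li XPath is an assumption about the dmoz page layout, so adjust it to the real markup:

import scrapy
from Test.items import TestItem

class DmozItemSpider(scrapy.Spider):
    name = "dmoz_items"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # Assumed layout: each <li> under a <ul> holds one link entry.
        for sel in response.xpath('//ul/li'):
            item = TestItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item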


If you run into:

ImportError: Error loading object 'scrapy.core.downloader.handlers.s3.S3DownloadHandler': No module named win32api

download the pywin32 installer that matches your Python version and install it.
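
After reinstalling pywin32, the import Scrapy needs should succeed from a Python shell:

import win32api   # raises ImportError if pywin32 is still missing or mismatched
print "pywin32 OK"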


Now run scrapy crawl dmoz in cmd; the output should look like this:

C:\Python27\Code\tutorial\Test>scrapy crawl dmoz
2014-11-08 22:51:36+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: Test)
2014-11-08 22:51:36+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-11-08 22:51:36+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Test.spiders', 'SPIDER_MODULES': ['Test.spiders'], 'BOT_NAME': 'Test'}
2014-11-08 22:51:38+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled item pipelines:
2014-11-08 22:51:39+0800 [dmoz] INFO: Spider opened
2014-11-08 22:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-08 22:51:40+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-08 22:51:40+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-08 22:51:43+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-11-08 22:51:43+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-11-08 22:51:43+0800 [dmoz] INFO: Closing spider (finished)
2014-11-08 22:51:43+0800 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 516,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 16337,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 11, 8, 14, 51, 43, 150000),
         'log_count/DEBUG': 4,
         'log_count/INFO': 7,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 11, 8, 14, 51, 40, 13000)}
2014-11-08 22:51:43+0800 [dmoz] INFO: Spider closed (finished)
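
The crawl writes one file per start URL into the current directory, named by the logic in parse() (Books and Resources here). A quick way to confirm:

# Verify the two output files exist and show their sizes.
import os
for name in ("Books", "Resources"):
    print name, "->", os.path.getsize(name), "bytes"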
