Main steps
1. Install Python 2.7.8, and remember to add the Python install path to the PATH environment variable;
2. Download get-pip.py and run python get-pip.py to install setuptools and pip;
3. To install Twisted, first install zope.interface-4.1.1.win32-py2.7.exe, then install Twisted-14.0.2.win32-py2.7.msi;
4. In cmd, run pip install w3lib
5. In cmd, run pip install lxml
6. In cmd, run pip install pyopenssl
7. In cmd, run pip install service_identity
8. In cmd, run pip install scrapy
If the install complains about missing dependencies, install them with pip, or download the corresponding tar/exe package and install it manually. A quick sanity check is shown below.
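Once step 8 finishes, a quick way to confirm everything is wired up (assuming C:\Python27 and C:\Python27\Scripts are both on PATH) is to ask Scrapy for its version and try importing the core dependencies:

C:\>scrapy version
Scrapy 0.24.4

C:\>python -c "import scrapy, lxml, OpenSSL, twisted"

If the import command prints nothing, the dependencies are importable and the installation is usable.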
To try a first demo, open cmd in some folder and run scrapy startproject Test
You should see the following structure:
Test/
    scrapy.cfg            # project configuration file
    Test/                 # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py
            ...
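As a side note, Scrapy can also generate a spider skeleton for you. Assuming you first cd into the project directory, something like the following should create a stub under spiders/ (the name and domain here are simply the ones used below):

cd Test
scrapy genspider dmoz dmoz.org

The rest of this post writes the spider by hand instead, which makes every line easier to see.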
Under spiders/, create dmoz_spider.py with the following code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each downloaded page to a file named after the last path segment of its URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Then modify Test/items.py:
import scrapy

class TestItem(scrapy.Item):
    # Fields the spider will eventually fill in
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

If you hit ImportError: Error loading object 'scrapy.core.downloader.handlers.s3.S3DownloadHandler': No module named win32api, reinstall the version of pywin32 that matches your Python.
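The usual fix is the pywin32 .exe installer for 32-bit Python 2.7 from the pywin32 project page; as an alternative, the PyPI repackaging may also work (this is just a suggestion, not something the original setup used):

pip install pypiwin32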
Then, in cmd, run scrapy crawl dmoz. The output looks like this:
C:\Python27\Code\tutorial\Test>scrapy crawl dmoz
2014-11-08 22:51:36+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: Test)
2014-11-08 22:51:36+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-11-08 22:51:36+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Test.spiders', 'SPIDER_MODULES': ['Test.spiders'], 'BOT_NAME': 'Test'}
2014-11-08 22:51:38+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-08 22:51:39+0800 [scrapy] INFO: Enabled item pipelines:
2014-11-08 22:51:39+0800 [dmoz] INFO: Spider opened
2014-11-08 22:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-08 22:51:40+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-08 22:51:40+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-08 22:51:43+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-11-08 22:51:43+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-11-08 22:51:43+0800 [dmoz] INFO: Closing spider (finished)
2014-11-08 22:51:43+0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16337,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 11, 8, 14, 51, 43, 150000),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 11, 8, 14, 51, 40, 13000)}
2014-11-08 22:51:43+0800 [dmoz] INFO: Spider closed (finished)
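The two start pages were fetched with status 200, and the original parse() simply writes them out as files named Books and Resources in the directory the crawl was run from. From here the spider can be extended to fill in the TestItem defined earlier. The sketch below is only illustrative: the XPath expressions assume the dmoz category pages list their entries as <ul><li> elements containing a link and a description, so adjust them to the real markup:

import scrapy
from Test.items import TestItem

class DmozItemSpider(scrapy.Spider):
    name = "dmoz_items"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Walk the assumed <ul><li> entries and emit one TestItem per entry
        for sel in response.xpath('//ul/li'):
            item = TestItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Running scrapy crawl dmoz_items -o items.json would then export the scraped items to a JSON file through Scrapy's built-in feed exports.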
This post walked through installing and configuring a Python environment and the Scrapy framework from scratch on Windows, and used a simple example spider project to demonstrate Scrapy's basic usage.