A simple Scrapy spider

This post documents one crawl with the Scrapy framework, from creating the project to working through the technical problems that came up at runtime, ending with the page content successfully fetched and saved.


I had finished writing a spider demo and was ready to give it a spin, but ran into the following problems while running it.


Step 1

C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-22 23:44:26 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-22 23:44:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-22 23:44:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2017-08-22 23:44:27 [twisted] CRITICAL: Unhandled error in Deferred:

2017-08-22 23:44:27 [twisted] CRITICAL:
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'


As the traceback shows, the win32api module cannot be found.


Download and install pywin32 from https://jaist.dl.sourceforge.net/project/pywin32/pywin32/Build%20221/pywin32-221.win32-py3.6.exe
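If you would rather install it with pip instead of the .exe installer, the pywin32 bindings are also published on PyPI; this is an alternative route I did not take here, so treat the package name as an assumption about your environment:

    pip install pypiwin32

Afterwards, running import win32api in a Python shell is a quick way to confirm the module can now be imported.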

Step 2

Run the crawl again:

scrapy crawl tiebaSpider

C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-23 00:06:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-23 00:06:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-23 00:06:49 [scrapy.core.engine] INFO: Spider opened
2017-08-23 00:06:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-23 00:06:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-23 00:06:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/robots.txt> (referer: None)
2017-08-23 00:06:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)
2017-08-23 00:06:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)

Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne\spiders\tiebaSpider.py", line 12, in parse
    open(filename, 'w').write(response.body)
TypeError: write() argument must be str, not bytes
2017-08-23 00:06:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 00:06:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 51398,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 22, 16, 6, 59, 862077),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 8, 22, 16, 6, 49, 571488)}
2017-08-23 00:06:59 [scrapy.core.engine] INFO: Spider closed (finished)


The error message: TypeError: write() argument must be str, not bytes

response.body is bytes, while a file opened in text mode ('w') only accepts str, so change

open(filename, 'w').write(response.body)

to

open(filename, "wb+").write(response.body)
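For context, here is a minimal sketch of what the whole spider could look like after the fix. Only the open(filename, "wb+") line comes from the log above; the class name, allowed_domains, start URL and output filename are assumptions for illustration:

# mySpiderOne/spiders/tiebaSpider.py -- minimal sketch, details assumed
import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tiebaSpider'
    allowed_domains = ['tieba.baidu.com']
    # the request seen in the crawl log, assumed here to be the start URL
    start_urls = ['https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1']

    def parse(self, response):
        # response.body is bytes, so the file must be opened in binary mode
        filename = 'tieba.html'  # assumed output name
        open(filename, "wb+").write(response.body)

Using a with statement to close the file explicitly would be slightly tidier, but the one-liner matches the original code.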


Step 3

Run scrapy crawl tiebaSpider again:

C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-23 00:17:22 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-23 00:17:22 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-23 00:17:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-23 00:17:23 [scrapy.core.engine] INFO: Spider opened
2017-08-23 00:17:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-23 00:17:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-23 00:17:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/robots.txt> (referer: None)
2017-08-23 00:17:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)
2017-08-23 00:17:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 00:17:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 51440,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 22, 16, 17, 33, 246304),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 22, 16, 17, 23, 203730)}
2017-08-23 00:17:33 [scrapy.core.engine] INFO: Spider closed (finished)

This time the crawl finally succeeded, and the HTML file was downloaded to the local disk.

===========================================================
Note: the whole project was generated with the command

scrapy startproject mySpiderOne

and the spider itself was generated with

scrapy genspider tiebaSpider "xxx"

The only hand-written change was the small tweak to the file-writing line described above. A sketch of the generated layout follows.
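For reference, scrapy startproject typically produces roughly the following layout in this version (listed from memory, so minor differences are possible):

mySpiderOne/
    scrapy.cfg               # deploy configuration
    mySpiderOne/             # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py          # BOT_NAME, ROBOTSTXT_OBEY, etc.
        spiders/
            __init__.py
            tiebaSpider.py   # added later by scrapy genspider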

