I had written a crawler demo and was ready to give it a first try; while running it I hit the following problems.
Step 1
C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-22 23:44:26 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-22 23:44:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-22 23:44:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2017-08-22 23:44:27 [twisted] CRITICAL: Unhandled error in Deferred:
2017-08-22 23:44:27 [twisted] CRITICAL:
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'
As the traceback shows, the win32api module cannot be found: on Windows, Twisted's stdio support (twisted.internet._win32stdio) imports win32api, which is provided by pywin32.
Download and install pywin32: https://jaist.dl.sourceforge.net/project/pywin32/pywin32/Build%20221/pywin32-221.win32-py3.6.exe
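Alternatively, installing it with pip may also work; pywin32 is published on PyPI (on some older setups the pip package was named pypiwin32 instead):

pip install pywin32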
Step 2:
Run the crawl again:
scrapy crawl tiebaSpider
C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-23 00:06:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-23 00:06:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-23 00:06:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-23 00:06:49 [scrapy.core.engine] INFO: Spider opened
2017-08-23 00:06:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-23 00:06:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-23 00:06:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/robots.txt> (referer: None)
2017-08-23 00:06:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)
2017-08-23 00:06:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne\spiders\tiebaSpider.py", line 12, in parse
    open(filename, 'w').write(response.body)
TypeError: write() argument must be str, not bytes
2017-08-23 00:06:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 00:06:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 51398,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 22, 16, 6, 59, 862077),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 8, 22, 16, 6, 49, 571488)}
2017-08-23 00:06:59 [scrapy.core.engine] INFO: Spider closed (finished)
The error message: TypeError: write() argument must be str, not bytes. In Python 3, response.body is a bytes object, so it cannot be written to a file opened in text mode. Change
open(filename, 'w').write(response.body)
to
open(filename, "wb+").write(response.body)
Step 3
Run scrapy crawl tiebaSpider again:
C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
2017-08-23 00:17:22 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mySpiderOne)
2017-08-23 00:17:22 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mySpiderOne', 'NEWSPIDER_MODULE': 'mySpiderOne.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mySpiderOne.spiders']}
2017-08-23 00:17:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-23 00:17:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-23 00:17:23 [scrapy.core.engine] INFO: Spider opened
2017-08-23 00:17:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-23 00:17:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-23 00:17:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/robots.txt> (referer: None)
2017-08-23 00:17:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1> (referer: None)
2017-08-23 00:17:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 00:17:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 51440,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 22, 16, 17, 33, 246304),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 22, 16, 17, 23, 203730)}
2017-08-23 00:17:33 [scrapy.core.engine] INFO: Spider closed (finished)
This time it finally succeeded: the crawl finished without errors and the HTML file was saved to the local disk.
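A quick way to confirm the file was actually written (using the hypothetical filename from the sketch above):

with open('tieba.html', 'rb') as f:
    print(len(f.read()), 'bytes')  # should be in the same ballpark as downloader/response_bytes in the stats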
===========================================================
Note: the entire project was generated with the command
scrapy startproject mySpiderOne
and the spider itself was generated with
scrapy genspider tiebaSpider "xxx"
The only thing I modified was the part that writes the file, as described above.
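For context, in Scrapy 1.4 the startproject command produces roughly the following layout, and genspider then adds tiebaSpider.py under spiders/ (layout from memory, so treat it as approximate):

mySpiderOne/
    scrapy.cfg              # deploy configuration
    mySpiderOne/            # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py         # source of the ROBOTSTXT_OBEY = True seen in the logs
        spiders/
            __init__.py
            tiebaSpider.py  # created by scrapy genspider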