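Here are the pitfalls I ran into over the past few days of playing with web crawlers.
I'm a beginner, so I have been reading a book while studying other people's examples. I had planned to follow someone else's code and get something small working first, but instead I got stuck deep in a swamp of 403 errors. I'm using the Scrapy framework, and the exact error output is as follows: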
[root@Uu tutorial]# scrapy crawl dmoz -o torrents.jl
2018-08-23 22:49:26 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-23 22:49:26 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-23 22:49:26 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_URI': 'torrents.jl', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1