0. Crawlers
0.1 The two parts of a crawler:
1. Downloading web pages
- Make the fullest possible use of local bandwidth
- Schedule requests across different sites so as to lighten the load on their servers (see the settings sketch after this list)
- DNS lookups
- Follow crawling conventions (e.g. robots.txt)
2. Processing the downloaded pages
- Fetching dynamic content
- Spider traps
- Content deduplication
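In scrapy, several of these download-side concerns are simply project settings. A minimal politeness configuration might look like the sketch below; the concrete values are illustrative only, not recommendations from these notes.

# settings.py (sketch): throttle requests and honor robots.txt
ROBOTSTXT_OBEY = True                # respect each site's robots.txt
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per domain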
1. scrapy
1.1 Installing scrapy
pip install scrapy
pip install service_identity
If service_identity is not installed you will see a warning like this:
warning:
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
URLError: <urlopen error [errno 10051]
Solution (see https://github.com/scrapy/scrapy/issues/1054): disable the S3 download handler in settings.py:
DOWNLOAD_HANDLERS = {
    's3': None,
}
1.2 Twisted
scrapy uses Twisted, an asynchronous networking library, to handle network communication. The overall architecture is shown in the diagram below:
The green lines are the data flow. Starting from the initial URLs, the Scheduler hands each request to the Downloader to be fetched, and the downloaded page is passed to the Spider for parsing. The Spider produces two kinds of output: links that need further crawling (for example the "next page" link mentioned earlier), which are sent back to the Scheduler; and data that should be saved, which is sent to the Item Pipeline for post-processing (detailed parsing, filtering, storage, and so on). In addition, various middlewares can be plugged into the data-flow channels to perform whatever processing is needed.
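To make the two kinds of Spider output concrete, a parse() method typically yields both items and new Requests. A minimal sketch, assuming a reasonably recent scrapy; the URL, XPath expressions, and field names are placeholders:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.example.org/list']

    def parse(self, response):
        # Output type 1: data to save, which goes on to the Item Pipeline
        for row in response.xpath('//div[@class="item"]'):
            yield {'title': row.xpath('.//a/text()').extract_first()}
        # Output type 2: links to crawl next, handed back to the Scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)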
2. Your first scrapy crawler
#In the folder where you want to create the crawler (e.g. D:/xx/yy), Shift+right-click to open a command prompt there and run:
scrapy startproject projectname
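For reference, startproject generates a skeleton roughly like this (the exact files vary a little between scrapy versions):

projectname/
    scrapy.cfg            # deploy configuration
    projectname/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings (e.g. the DOWNLOAD_HANDLERS fix above)
        spiders/          # put your spiders here
            __init__.py

After writing a spider under spiders/, run it from the directory containing scrapy.cfg with:

scrapy crawl <spidername>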
3. Simulating a browser to render JavaScript
See the Stack Overflow threads "Click a Button in Scrapy" and "selenium with scrapy for dynamic page":
# Drive Firefox with selenium alongside scrapy (adapted from the threads above)
from scrapy import Spider, Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self, *args, **kwargs):
        super(northshoreSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # keep clicking the "next" button until it no longer exists
                next_btn = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next_btn.click()
            except NoSuchElementException:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
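Note that the spider above never hands the JavaScript-rendered HTML to scrapy itself; it only re-requests other URLs. One common workaround is to wrap driver.page_source in a scrapy Selector so the usual extraction code still applies. A sketch, with a placeholder XPath and field name:

from scrapy.selector import Selector

def extract_from_rendered(driver):
    # Parse the DOM that selenium rendered, instead of the raw downloaded HTML
    sel = Selector(text=driver.page_source)
    for title in sel.xpath('//h2/a/text()').extract():   # placeholder XPath
        yield {'title': title}

Inside parse(), the spider could then do: for item in extract_from_rendered(self.driver): yield item.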
4. Simulated login: the FormRequest module (not used in this project)
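For the record, a typical FormRequest login flow looks roughly like the sketch below, assuming a reasonably recent scrapy; the login URL, form field names, and failure check are placeholders, not taken from a real site:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://www.example.org/login']   # placeholder login page

    def parse(self, response):
        # Fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},   # placeholder field names
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'login failed' in response.body:   # placeholder site-specific check
            self.logger.error('Login failed')
            return
        # the session cookies are kept automatically; continue crawling protected pages here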
5. Crawling JD (京东) product reviews
start_urls = ["http://item.jd.com/1217499.html",]
#It seems that when fetching the review URL directly through the AJAX endpoint, you first have to visit the product page above; perhaps a Referer header is expected:
#Referer: http://item.jd.com/1217499.html
response.body.decode(response.encoding).encode('utf-8')
#This line is commonly used to convert the response to UTF-8: decode with the detected encoding, then re-encode.
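A sketch of sending the Referer header explicitly when requesting the comment AJAX endpoint; comment_url below is a placeholder, since the real JD endpoint is not recorded in these notes:

import scrapy

class JDCommentSpider(scrapy.Spider):
    name = 'jd_comments'
    start_urls = ['http://item.jd.com/1217499.html']

    def parse(self, response):
        # Visit the product page first, then call the comment API with a matching Referer
        comment_url = 'http://example.com/comments?productId=1217499'   # placeholder, not the real endpoint
        yield scrapy.Request(
            comment_url,
            headers={'Referer': 'http://item.jd.com/1217499.html'},
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        text = response.body.decode(response.encoding)   # decode with the detected encoding
        # ... parse the comment payload (JSON or HTML) from text here ...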
To be continued.
References:
- passing selenium response url to scrapy
- Downloader Middleware (下载器中间件)
- Scrapy Tutorial
- Scrapy at a glance (初窥Scrapy)
- On Stack Overflow, on scrapy and AJAX:
- Can scrapy be used to scrape dynamic content from websites that are using AJAX?
- Scrapy follow pagination AJAX Request - POST
- using scrapy to scrap asp.net website with javascript buttons and ajax requests