0. Crawlers
0.1 The two parts of a crawler:
1. Downloading web pages
- Make the fullest possible use of local bandwidth
- Schedule requests across different sites so as to lighten the load on their servers (see the settings sketch after this list)
- DNS lookups
- Follow crawling conventions (e.g. robots.txt)
2. Processing the downloaded pages
- Fetching dynamic content
- Spider traps
- Content deduplication
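In scrapy, several of these download-side concerns are simply project settings. A minimal politeness configuration might look like the sketch below; the concrete values are illustrative only, not recommendations from these notes.

# settings.py (sketch): throttle requests and honor robots.txt
ROBOTSTXT_OBEY = True                # respect each site's robots.txt
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per domain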
1. scrapy
1.1 Installing scrapy
pip install scrapy
pip install service_identity
If service_identity is not installed you will see a warning like this:
warning:
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
URLError: <urlopen error [errno 10051]
Solution (see https://github.com/scrapy/scrapy/issues/1054): disable the S3 download handler in settings.py:
DOWNLOAD_HANDLERS = {
    's3': None,
}
1.2 Twisted
scrapy uses Twisted, an asynchronous networking library, to handle network communication. The overall architecture is shown in the diagram below:
The green lines are the data flow. Starting from the initial URLs, the Scheduler hands each request to the Downloader to be fetched, and the downloaded page is passed to the Spider for parsing. The Spider produces two kinds of output: links that need further crawling (for example the "next page" link mentioned earlier), which are sent back to the Scheduler; and data that should be saved, which is sent to the Item Pipeline for post-processing (detailed parsing, filtering, storage, and so on). In addition, various middlewares can be plugged into the data-flow channels to perform whatever processing is needed.
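To make the two kinds of Spider output concrete, a parse() method typically yields both items and new Requests. A minimal sketch, assuming a reasonably recent scrapy; the URL, XPath expressions, and field names are placeholders:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.example.org/list']

    def parse(self, response):
        # Output type 1: data to save, which goes on to the Item Pipeline
        for row in response.xpath('//div[@class="item"]'):
            yield {'title': row.xpath('.//a/text()').extract_first()}
        # Output type 2: links to crawl next, handed back to the Scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)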
2. Your first scrapy crawler
#In the folder where you want to create the crawler (e.g. D:/xx/yy), Shift+right-click to open a command prompt there and run:
scrapy startproject projectname
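For reference, startproject generates a skeleton roughly like this (the exact files vary a little between scrapy versions):

projectname/
    scrapy.cfg            # deploy configuration
    projectname/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings (e.g. the DOWNLOAD_HANDLERS fix above)
        spiders/          # put your spiders here
            __init__.py

After writing a spider under spiders/, run it from the directory containing scrapy.cfg with:

scrapy crawl <spidername>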
3. Simulating a browser to render JavaScript
See the Stack Overflow threads "Click a Button in Scrapy" and "selenium with scrapy for dynamic page":
# Drive Firefox with selenium alongside scrapy (adapted from the threads above)
from scrapy import Spider, Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self, *args, **kwargs):
        super(northshoreSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # keep clicking the "next" button until it no longer exists
                next_btn = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next_btn.click()
            except NoSuchElementException:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
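Note that the spider above never hands the JavaScript-rendered HTML to scrapy itself; it only re-requests other URLs. One common workaround is to wrap driver.page_source in a scrapy Selector so the usual extraction code still applies. A sketch, with a placeholder XPath and field name:

from scrapy.selector import Selector

def extract_from_rendered(driver):
    # Parse the DOM that selenium rendered, instead of the raw downloaded HTML
    sel = Selector(text=driver.page_source)
    for title in sel.xpath('//h2/a/text()').extract():   # placeholder XPath
        yield {'title': title}

Inside parse(), the spider could then do: for item in extract_from_rendered(self.driver): yield item.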
4. Simulated login: the FormRequest module (not used in this project)
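For the record, a typical FormRequest login flow looks roughly like the sketch below, assuming a reasonably recent scrapy; the login URL, form field names, and failure check are placeholders, not taken from a real site:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://www.example.org/login']   # placeholder login page

    def parse(self, response):
        # Fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},   # placeholder field names
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'login failed' in response.body:   # placeholder site-specific check
            self.logger.error('Login failed')
            return
        # the session cookies are kept automatically; continue crawling protected pages here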
5. Crawling JD (京东) product reviews
start_urls = ["http://item.jd.com/1217499.html",]
#It seems that when fetching the review URL directly through the AJAX endpoint, you first have to visit the product page above; perhaps a Referer header is expected:
#Referer: http://item.jd.com/1217499.html
response.body.decode(response.encoding).encode('utf-8')
#This line is commonly used to convert the response to UTF-8: decode with the detected encoding, then re-encode.
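A sketch of sending the Referer header explicitly when requesting the comment AJAX endpoint; comment_url below is a placeholder, since the real JD endpoint is not recorded in these notes:

import scrapy

class JDCommentSpider(scrapy.Spider):
    name = 'jd_comments'
    start_urls = ['http://item.jd.com/1217499.html']

    def parse(self, response):
        # Visit the product page first, then call the comment API with a matching Referer
        comment_url = 'http://example.com/comments?productId=1217499'   # placeholder, not the real endpoint
        yield scrapy.Request(
            comment_url,
            headers={'Referer': 'http://item.jd.com/1217499.html'},
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        text = response.body.decode(response.encoding)   # decode with the detected encoding
        # ... parse the comment payload (JSON or HTML) from text here ...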
To be continued.
References:
- passing selenium response url to scrapy
- Downloader Middleware (下载器中间件)
- Scrapy Tutorial
- Scrapy at a glance (初窥Scrapy)
- On Stack Overflow, on scrapy and AJAX:
- Can scrapy be used to scrape dynamic content from websites that are using AJAX?
- Scrapy follow pagination AJAX Request - POST
- using scrapy to scrap asp.net website with javascript buttons and ajax requests