Web Crawlers

Latest recommended article published 2020-02-26 00:34:19

License: CC 4.0 BY-SA

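This article introduces the basic workflow of a web crawler: choosing URLs, fetching, parsing, and deduplication. It then covers HTTP requests in Python, including GET and POST requests and the role of the User-Agent and Referer request headers. It also shows how to handle HTTP responses: getting the status code, reading the response body, setting a timeout, and handling redirects and proxies. Finally, it covers cookie handling.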

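Basic workflow of a web crawler
1. Start with a carefully chosen set of seed URLs.
2. Put these URLs into the queue of URLs waiting to be crawled.
3. Take a URL from the to-crawl queue, resolve its DNS name to get the host IP, download the page at that URL, and store it in the downloaded-pages store. Then move the URL into the crawled-URL queue.
4. Parse the downloaded pages for other URLs and compare them against the URLs already crawled to remove duplicates. Put the deduplicated URLs into the to-crawl queue, which starts the next cycle.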
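The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions: it is written for Python 3, `FAKE_WEB` is a hypothetical in-memory stand-in for real downloads (so DNS resolution and networking are skipped), and links are extracted with a deliberately simple regular expression rather than a real HTML parser.

```python
import re
from collections import deque

# Hypothetical in-memory "web" (URL -> HTML) standing in for real downloads.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">a</a> '
                           '<a href="http://example.com/b">b</a>',
    "http://example.com/a": '<a href="http://example.com/b">b</a> '
                            '<a href="http://example.com/">home</a>',
    "http://example.com/b": "no links here",
}

LINK_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Return the page HTML, or an empty string for an unknown URL."""
    return FAKE_WEB.get(url, "")


def crawl(seed_urls, fetch_page=fetch, max_pages=100):
    to_crawl = deque()  # step 2: queue of URLs waiting to be crawled
    seen = set()        # every URL ever queued, used for deduplication
    for url in seed_urls:  # step 1: seed URLs
        if url not in seen:
            seen.add(url)
            to_crawl.append(url)

    pages = {}    # downloaded-pages store: URL -> HTML
    crawled = []  # URLs already crawled, in crawl order
    while to_crawl and len(pages) < max_pages:
        url = to_crawl.popleft()        # step 3: take a URL and download it
        html = fetch_page(url)
        pages[url] = html
        crawled.append(url)
        for link in LINK_RE.findall(html):  # step 4: extract and deduplicate
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled, pages


crawled, pages = crawl(["http://example.com/"])
print(crawled)
```

Keeping a single `seen` set for everything that was ever queued means a URL is never queued twice, whether it is still waiting or already crawled.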

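HTTP requests in Python
Using urllib2/urllib
A GET request with urllib2's urlopen function:
import urllib2
#urlopen returns a response object. It performs two steps, request and response, and is equivalent to the two commented lines below
response=urllib2.urlopen("http://www.zhifu.com")
#request=urllib2.Request('http://www.zhifu.com')
#response=urllib2.urlopen(request)
help(response)#show the help for the response object
print(response.getcode())#status code of the response
print(response.geturl())#URL that was actually retrieved
a=response.info()#returns an httplib.HTTPMessage object
help(a)#show the help for a
print('a.getencoding()',a.getencoding())#content transfer encoding, e.g. '7bit'
print(a['Date'])#a can be indexed like a dictionary
print('a.getplist()',a.getplist())#returns a list of the Content-Type parameters
print('a.parsetype()',a.parsetype())
print('a is',type(a))
#a is an httplib.HTTPMessage object and supports dict-style access a[key]. The keys are the HTTP response header names, for example: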
# Date: Fri, 14 Sep 2018 02:05:37 GMT
# Server: NOYB
# X-Frame-Options: SAMEORIGIN
# Last-Modified: Wed, 10 May 2017 22:23:20 GMT
# ETag: "64c-54f32eb686600"
# Accept-Ranges: bytes
# Content-Length: 1612
# X-XSS-Protection: 1; mode=block
# X-Content-Type-Options: nosniff
# Content-Security-Policy: script-src 'self'
# Cache-Control: no-cache
# Pragma: no-cache
# Keep-Alive: timeout=15, max=96
# Connection: Keep-Alive
# Content-Type: text/html; charset=UTF-8
print(response.info())
html=response.read()#read the response body
print(html)

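POST requests
Meaning of the User-Agent and Referer request headers

Referer: indicates where a request came from, for example Referer=http://www.google.com (for details see https://www.sojson.com/blog/58.html)
Uses:
   1. Hotlink protection.
   Take a small demo.
   Suppose a page on www.sojson.com contains a link to www.baidu.com. When the link is clicked, the headers of the resulting request include:
   Referer=http://www.sojson.com
   This can be used to prevent hotlinking. For example, say only my own site is allowed to load images from my image server, and my domain is www.sojson.com.
   The image server then reads the Referer of each request and checks whether it is my own domain www.sojson.com. If it is, the request is served; if not, it is blocked.
   That is all hotlink protection needs.
   2. Blocking malicious requests.
   For example, on the SOJSON site static requests end in *.html and dynamic requests end in *.shtml, so every *.shtml request can be required to carry a Referer from my own site:
   Referer=http://www.sojson.com
   What is an empty Referer, and when does it occur?
   An empty Referer means either that the Referer header has no content or that the HTTP request does not contain a Referer header at all.
   When does an HTTP request not contain a Referer header? By definition, Referer indicates which page a request was linked from, so a request that was not triggered by a link has no source to report.
   For example, typing a resource's URL directly into the browser's address bar produces a request without a Referer header, because the request comes "out of nowhere" rather than from a link on another page.
   So what is the difference between allowing and disallowing an empty Referer in a hotlink-protection setup?
   Allowing an empty Referer means allowing requests such as direct access from the browser, where the Referer is empty.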
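The Referer check described above can be sketched as a small server-side function. This is an illustration written for Python 3; the function name `referer_allowed` and the `allow_empty` switch are hypothetical names introduced here, and only the domain www.sojson.com comes from the text.

```python
from urllib.parse import urlparse

# The only site allowed to embed resources, taken from the example in the text.
ALLOWED_HOSTS = {"www.sojson.com"}


def referer_allowed(referer, allowed_hosts=ALLOWED_HOSTS, allow_empty=True):
    """Return True if a request carrying this Referer should be served."""
    if not referer:
        # Header missing or empty, e.g. the URL was typed into the address bar.
        return allow_empty
    host = urlparse(referer).hostname
    return host in allowed_hosts


print(referer_allowed("http://www.sojson.com/index.html"))  # own site
print(referer_allowed("http://www.baidu.com/"))             # foreign site
print(referer_allowed(None))                                # empty Referer
print(referer_allowed(None, allow_empty=False))             # empty Referer, strict mode
```

Because a client can send any Referer it likes, a check like this discourages casual hotlinking but is not an authentication mechanism.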

User-Agent: identifies the client making the request, such as a browser or a Python client

Setting request headers in Python:
Format of an HTTP request:
POST /search HTTP/1.1 #request line
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel #request headers start here
Referer: http://www.google.cn/
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; TheWorld)
Host: www.google.cn
Connection: Keep-Alive
Cookie: PREF=ID=80a06da87be9ae3c:U=f7167333e2c3b714:NW=1:TM=1261551909:LM=1261551917:S=ybYcq2wpfefs4V9g; #request headers end here

hl=zh-CN&source=hp&q=domety #request body

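urllib2.Request is a class. It is used to set the URL, headers, and body of a request and produces a request object.
#parameters of the initializer
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False):

url: the address to request
data: the request body, for example postdata={'username':'zhang','password':1234}. It must first be converted with data=urllib.urlencode(postdata)
headers: the request headers, set directly as headers={'User-Agent':user_agent,'Referer':'http://www.baidu'}

Pass the resulting request object to urllib2.urlopen to get the response:

response=urllib2.urlopen(request)
Headers and data can also be added with the Request object's add_header and add_data methods: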
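import urllib
import urllib2
url="http://xxxx"
user_agent="xx"
referer="xx"
postdata={"xx":"xx"}
data=urllib.urlencode(postdata)
req=urllib2.Request(url)#create the request object first
req.add_header("User-Agent",user_agent)
req.add_header("Referer",referer)
req.add_data(data)#a request that carries data is sent as POST
response=urllib2.urlopen(req)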
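For readers on Python 3, where `urllib2` became `urllib.request` and `Request.add_data` no longer exists, roughly the same request can be built as below. This is a sketch, not the article's own code: the URL, User-Agent string, and form fields are hypothetical placeholders, and the request is only constructed and inspected, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values, for illustration only.
url = "http://example.com/login"
user_agent = "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"
referer = "http://example.com/"
postdata = {"username": "zhang", "password": "1234"}

# In Python 3 the body must be bytes, so the urlencoded string is encoded.
data = urlencode(postdata).encode("ascii")

# data goes into the constructor; headers can be given here or added later.
req = Request(url, data=data, headers={"User-Agent": user_agent})
req.add_header("Referer", referer)

print(req.get_method())              # the method is POST because data is set
print(req.data)
print(req.get_header("User-agent"))  # Request stores header names capitalized
# urllib.request.urlopen(req) would send the request.
```

Inspecting the `Request` object before sending it is a convenient way to check which method, headers, and body will go on the wire.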

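Headers that need attention:
User-Agent: some servers check this value to decide whether the request comes from a browser.
Content-Type: when a REST interface is called, the server checks this value to decide how to parse the HTTP body.
Common values: application/xml (used for XML RPC calls such as RESTful/SOAP)
application/json (used for JSON RPC calls)
application/x-www-form-urlencoded (used when a browser submits a form)
Referer: servers sometimes check this header for hotlink protection.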
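To make the Content-Type values concrete, the sketch below sends the same payload once as a form body and once as a JSON body. It assumes Python 3 (`urllib.request` instead of `urllib2`); both URLs are hypothetical, and the requests are only built, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"q": "domety", "hl": "zh-CN"}

# Form submission: urlencoded body plus the matching Content-Type.
form_req = Request(
    "http://example.com/search",  # hypothetical endpoint
    data=urlencode(payload).encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# JSON call: the body is a JSON document, so Content-Type must say so.
json_req = Request(
    "http://example.com/api/search",  # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

for req in (form_req, json_req):
    print(req.get_header("Content-type"), req.data)
```

The body and the Content-Type header have to agree: a server that is told `application/json` will try to parse the body as JSON and reject a urlencoded string.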

Cookie handling
Reading cookie values
import urllib2
import cookielib
cookie=cookielib.CookieJar()#the CookieJar collects the cookies set by the server
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response=opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name+':'+item.value
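The cookie jar can also be tried without a network connection. The sketch below assumes Python 3, where `cookielib` is named `http.cookiejar`. `FakeResponse` is a hypothetical stand-in that only provides the `info()` method through which the jar reads the headers, and the two `Set-Cookie` values are invented for the demonstration.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; only info() is provided."""

    def __init__(self, headers):
        self._headers = headers

    def info(self):
        return self._headers


# Invented response headers. Assigning the same name twice appends a second
# header instead of overwriting the first.
headers = Message()
headers["Set-Cookie"] = "session=abc123; Path=/"
headers["Set-Cookie"] = "lang=zh-CN; Path=/"

jar = CookieJar()
request = Request("http://www.example.com/")
jar.extract_cookies(FakeResponse(headers), request)  # parse and store cookies

cookies = {item.name: item.value for item in jar}
for name, value in sorted(cookies.items()):
    print(name + ":" + value)
```

With a real connection, `urllib.request.HTTPCookieProcessor(jar)` passed to `build_opener` performs the same extraction automatically on every response.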

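Setting cookie values
import urllib2
opener=urllib2.build_opener()
opener.addheaders.append(("Cookie",'email='+"xxxx@163.com"))#adds a Cookie request header; the author reports that in testing the cookie did not appear to be set
req=urllib2.Request("http://www.zhihu.com/")
response=opener.open(req)
print(response.headers)#these are the response headers, so the Cookie request header is not shown here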

Setting a timeout

import urllib2
request=urllib2.Request('http://www.zhihu.com')
response=urllib2.urlopen(request,timeout=2)#give up after 2 seconds
html=response.read()
print(html)

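Getting the HTTP response code
import urllib2
try:
    response=urllib2.urlopen("http://www.goodle.com")
    print response.getcode()#status code of a successful response
except urllib2.HTTPError as e:
    if hasattr(e,'code'):
        print 'error code',e.code#print the HTTP error code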
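A slightly fuller version of this pattern, written for Python 3 (`urllib.error` instead of `urllib2`), is sketched below. The helper name `fetch_status` and the two fake openers are hypothetical; the fake openers exist only so that the error paths can be exercised without a network connection.

```python
import urllib.error
import urllib.request
from email.message import Message


def fetch_status(url, opener=urllib.request.urlopen, timeout=2):
    """Return (status_code, None), or (None, reason) if no response arrived."""
    try:
        response = opener(url, timeout=timeout)
        return response.getcode(), None
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return e.code, None
    except urllib.error.URLError as e:
        # No HTTP response at all: DNS failure, refused connection, and so on.
        return None, str(e.reason)


def fake_opener_404(url, timeout=None):
    """Hypothetical opener that behaves like a server returning 404."""
    raise urllib.error.HTTPError(url, 404, "Not Found", Message(), None)


def fake_opener_down(url, timeout=None):
    """Hypothetical opener that behaves like an unreachable host."""
    raise urllib.error.URLError("connection refused")


print(fetch_status("http://example.com/missing", opener=fake_opener_404))
print(fetch_status("http://example.com/", opener=fake_opener_down))
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the status code would be lost in the more general handler.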

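Redirects

By default urllib2 automatically follows redirects for HTTP 3XX status codes. To detect whether a redirect happened, compare the requested URL with the URL of the response:
import urllib2
response=urllib2.urlopen('http://zhihu.cn')
isRedirected=response.geturl()!='http://zhihu.cn'#True if the final URL differs from the requested one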

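If you do not want redirects to be followed automatically, define a custom HTTPRedirectHandler, as follows:
import urllib2
class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        #301: do nothing, so the redirect is not followed and urllib2 raises HTTPError
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        #302: follow the redirect through the parent class, then record the original status code and the final URL on the result
        result=urllib2.HTTPRedirectHandler.http_error_301(self, req, fp, code, msg, headers)
        result.status=code
        result.newurl=result.geturl()
        return result
opener=urllib2.build_opener(RedirectHandler)
opener.open('http://www.zhihu.cn')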
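In Python 3 one way to switch off redirect following for all 3XX codes is to override `redirect_request` so that it returns `None`. The sketch below uses `urllib.request` rather than `urllib2`, and the class name `NoRedirectHandler` is made up here. No request is sent; the code only builds the opener.

```python
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is created. The 3XX
        # response then reaches the caller as urllib.error.HTTPError, whose
        # .code and .headers["Location"] describe the redirect.
        return None


# Passing the subclass replaces the default HTTPRedirectHandler.
opener = urllib.request.build_opener(NoRedirectHandler)
print([type(h).__name__ for h in opener.handlers])
```

A caller would wrap `opener.open(url)` in `try`/`except urllib.error.HTTPError` and read the redirect target from the exception instead of the response.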

Proxy settings

import urllib2
proxy=urllib2.ProxyHandler({"http":"127.0.0.1:8087"})
opener=urllib2.build_opener(proxy)
urllib2.install_opener(opener)#installs a global opener, so every later urlopen call goes through this proxy
response=urllib2.urlopen('http://www.zhihu.com')
print response.read()

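If the proxy should be used only for specific requests rather than globally, skip install_opener and call the opener's open method directly:
import urllib2
proxy=urllib2.ProxyHandler({"http":"127.0.0.1:8087"})
opener=urllib2.build_opener(proxy)
response=opener.open('http://www.zhihu.com')#only requests made through this opener use the proxy
print response.read()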
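The same local-opener approach carries over to Python 3. The sketch below uses `urllib.request` instead of `urllib2`; the helper name `make_proxy_opener` is hypothetical, and the proxy address is the one from the text written as a full URL. Nothing is sent over the network.

```python
import urllib.request


def make_proxy_opener(proxies):
    """Build an opener that uses exactly the given proxy mapping."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


# Requests made through this opener go through the local proxy.
proxied = make_proxy_opener({"http": "http://127.0.0.1:8087"})

# An empty mapping disables proxies, including any taken from the environment.
direct = make_proxy_opener({})

handler = next(h for h in proxied.handlers
               if isinstance(h, urllib.request.ProxyHandler))
print(handler.proxies)
# proxied.open("http://www.zhihu.com") would send the request via the proxy.
```

Keeping one opener per proxy configuration avoids the global side effect of `install_opener`, which matters when a crawler rotates between several proxies.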