Python Crawler Development and Project Practice 3: A First Look at Crawlers


License: CC 4.0 BY-SA


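This article introduces the basic concepts and categories of web crawlers (general-purpose, focused, incremental, and deep web crawlers) and describes the characteristics and use cases of each. It then covers three ways to issue HTTP requests in Python: urllib2/urllib, httplib/urllib, and the Requests library.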

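3.1 Overview of Web Crawlers

Concept: by system architecture and implementation technique, web crawlers fall roughly into four types: general-purpose crawlers, focused crawlers, incremental crawlers, and deep web crawlers. Real crawler systems usually combine several of these techniques.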

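Search engines are general-purpose crawlers, but they have several limitations:

Their results contain many pages the user does not care about.

Limited server resources conflict with the effectively unlimited data on the web.

General-purpose search engines usually cannot handle data that is information-dense and structured, such as audio and video.

Keyword-based retrieval makes it hard to support queries expressed in terms of semantic information.

Focused crawlers, which fetch only the relevant web resources, emerged to address these problems.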

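Focused crawler: a program that automatically downloads web pages to prepare data for topic-oriented user queries.

Incremental crawler: updates pages that have already been downloaded and crawls only newly created pages. This reduces time and storage costs but increases algorithmic complexity and implementation difficulty.

Deep web crawler: web pages divide into surface pages (those a search engine can index) and deep pages (those behind a form). A deep web crawler targets the latter.

Use cases: BT search sites (https://www.cilisou.org/) and cloud-drive search sites (http://www.pansou.com/).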

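The basic workflow is as follows:

First, select a set of carefully chosen seed URLs.
Put these URLs into the to-crawl URL queue.
Take a URL from the to-crawl queue, resolve its DNS name to an IP address, download the page, store it, and move the URL into the crawled URL queue.
Analyze the URLs in the crawled queue: extract the URLs found in the downloaded pages, de-duplicate them against URLs already seen, add the new ones to the to-crawl queue, and start the next cycle.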
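
The workflow above can be sketched as a short loop. This is only an illustration written for Python 3: the FAKE_WEB dictionary, the crawl function, and the regular expression for links are all made up for the example, and the "download" step reads from memory instead of resolving DNS and fetching over the network.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would download these pages.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a> <a href="http://example.com/">home</a>',
    "http://example.com/b": '<a href="http://example.com/c">C</a>',
    "http://example.com/c": "",
}

LINK_RE = re.compile(r'href="([^"]+)"')  # naive link extraction, enough for the sketch


def crawl(seeds, fetch, max_pages=100):
    to_crawl = deque(seeds)   # to-crawl URL queue, initialised with the seed URLs
    seen = set(seeds)         # every URL ever queued, used for de-duplication
    crawled = []              # crawled URL queue, in fetch order
    pages = {}                # stored page content
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        html = fetch(url)     # stands in for "resolve DNS, download the page"
        if html is None:
            continue          # skip URLs that could not be fetched
        pages[url] = html
        crawled.append(url)
        for link in LINK_RE.findall(html):
            if link not in seen:      # only new URLs go back into the queue
                seen.add(link)
                to_crawl.append(link)
    return crawled, pages


crawled, pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(crawled)
```

Using a FIFO queue gives a breadth-first crawl; swapping the deque for a priority queue would be one way to move toward a focused crawler.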

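3.2 Implementing HTTP Requests in Python

There are three ways to implement HTTP requests in Python: urllib2/urllib, httplib/urllib, and Requests.

urllib2/urllib: two built-in Python 2 modules; urllib2 does most of the work and urllib assists. (In Python 3 their functionality lives in urllib.request, urllib.parse, and urllib.error.)

1. Implementing a complete request/response model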

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html

The request and the response can be split into two steps: first build the request, then obtain the response.

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request)
html = response.read()
print html

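POST: pass the URL-encoded form data as the data argument of urllib2.Request; the login example under item 2 below sends its form this way.

Sometimes the server refuses your request because it inspects the request headers. This is a common anti-crawler measure.

2. Handling request headers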

import urllib
import urllib2
url = 'http://www.xxxx.com/login'
user_agent = ''   # placeholder: put a browser User-Agent string here
referer = 'http://www.xxxx.com/'
postdata = {'username': 'qiye',
            'password': 'qiye_pass'}
# Add the header fields
headers = {'User-Agent': user_agent, 'Referer': referer}
# URL-encode the form; passing data makes this a POST request
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
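
For readers on Python 3, the same request can be built with urllib.request and urllib.parse. The sketch below only constructs the Request object and inspects it, so it runs without network access; the User-Agent value is an illustrative assumption, and the URL is the placeholder from the example above.

```python
import urllib.parse
import urllib.request

url = "http://www.xxxx.com/login"  # placeholder URL from the example above
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # illustrative value
    "Referer": "http://www.xxxx.com/",
}
postdata = {"username": "qiye", "password": "qiye_pass"}

# In Python 3 the request body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(postdata).encode("ascii")
req = urllib.request.Request(url, data=data, headers=headers)

print(req.get_method())     # the method switches to POST once data is supplied
print(req.data)
print(req.header_items())   # header names are stored capitalised, e.g. "User-agent"

# Sending it would look like: urllib.request.urlopen(req, timeout=5)
```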

3. Cookie handling: use a cookielib.CookieJar object to manage cookies

import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
	print item.name + ':' + item.value

Sample output:

SessionID_R3:4y3gT2mcOjBQEQ7RDiqDz6DfdauvG8C5j6jxFg8jIcJvE5ih4USzM0h8WRt1PZomR1C9755SGG5YIzDJZj7XVraQyomhEFA0v6pvBzV94V88uQqUyeDnsMj8MALBSKr

4. Setting a timeout

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request, timeout=2)
html = response.read()
print html

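5. Getting the HTTP status code

import urllib2
try:
    response = urllib2.urlopen('http://www.google.com')
    # Status code of a successful response
    print response.getcode()
except urllib2.HTTPError as e:
    # A failed request raises HTTPError, which carries the status code
    if hasattr(e, 'code'):
        print 'Error code:', e.code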
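6. Redirects: by default urllib2 automatically follows redirects for HTTP 3XX status codes

To find out whether a redirect happened, just check whether the Response URL and the Request URL are the same

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
# A redirect happened if the final URL differs from the requested one
isRedirected = response.geturl() != 'http://www.zhihu.com'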
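
A plain string comparison can flag a redirect when the two URLs differ only in harmless ways, such as a missing trailing slash or the letter case of the host. The helper below is a hypothetical sketch (Python 3, not part of urllib2) that compares lightly normalised URLs instead; which differences count as "harmless" is an assumption you may want to adjust.

```python
from urllib.parse import urlsplit


def _normalize(url):
    """Reduce a URL to the parts that matter for the comparison."""
    parts = urlsplit(url)
    path = parts.path or "/"          # treat an empty path like "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def is_redirected(request_url, final_url):
    """Return True when the final URL points somewhere else than the request URL."""
    return _normalize(request_url) != _normalize(final_url)


print(is_redirected("http://www.zhihu.com", "http://www.zhihu.com/"))
print(is_redirected("http://www.zhihu.com", "https://www.zhihu.com/"))
```

In practice the second argument would be the value returned by response.geturl().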

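7. Proxy settings: by default urllib2 uses the http_proxy environment variable to set the HTTP proxy, but we usually do not rely on that and instead set the proxy dynamically in the program with ProxyHandler.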

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy, )
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.zhihu.com/')
print response.read()

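install_opener() sets the global opener. If you want to use two different proxies, a better approach is to call the opener's open method directly instead of the global urlopen function.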

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy, )
response = opener.open('http://www.zhihu.com/')
print response.read()
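
To make the "two different proxies" case concrete, here is a sketch in Python 3 (where urllib2 became urllib.request) that builds one opener per proxy and leaves the global opener untouched. The second proxy address and the helper names opener_for and proxies_of are invented for illustration; nothing is sent over the network.

```python
import urllib.request


def opener_for(proxy_address):
    """Build an opener that routes http traffic through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_address})
    return urllib.request.build_opener(handler)


def proxies_of(opener):
    """Return the proxy mapping an opener was built with (for inspection only)."""
    for handler in opener.handlers:
        if isinstance(handler, urllib.request.ProxyHandler):
            return handler.proxies
    return {}


opener_a = opener_for("127.0.0.1:8087")  # address from the example above
opener_b = opener_for("127.0.0.1:8088")  # second address is made up

print(proxies_of(opener_a))
print(proxies_of(opener_b))

# Each request then picks its proxy by choosing an opener, for example:
# opener_a.open("http://www.zhihu.com/", timeout=5)
```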

httplib/urllib: httplib is a low-level module that exposes every step of making an HTTP request, but it offers relatively few features.
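
The step-by-step nature of httplib can be seen in a short sketch. It is written for Python 3, where the module is named http.client, and it talks to a tiny local server started inside the example so that no external site is needed; HelloHandler and fetch are names made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example needs no external site."""

    def do_GET(self):
        body = b"hello crawler"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path="/"):
    # Step 1: create the connection object
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        # Step 2: send the request line and headers
        conn.request("GET", path, headers={"User-Agent": "demo-crawler"})
        # Step 3: read the status line and response headers
        resp = conn.getresponse()
        # Step 4: read the body
        return resp.status, resp.reason, resp.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
status, reason, body = fetch("127.0.0.1", server.server_address[1])
server.shutdown()
server.server_close()
print(status, reason, body)
```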

Requests: a more user-friendly third-party module; install it with pip install requests

import requests
r = requests.get('http://www.baidu.com')
print r.content

2. Responses and encoding

import requests
r = requests.get('http://www.baidu.com')
print 'content-->' + r.content
print 'text-->' + r.text
print 'encoding-->' + r.encoding
r.encoding = 'utf-8'
print 'new text-->' + r.text
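
Why changing r.encoding changes r.text can be shown without any network access: r.content holds the raw bytes, and r.text is those bytes decoded with r.encoding. When a text response declares no charset, Requests falls back to ISO-8859-1, which garbles UTF-8 pages. The Python 3 sketch below imitates that with plain bytes; the sample string is an arbitrary choice.

```python
# What r.content would hold: undecoded bytes of a UTF-8 page
raw = "网络爬虫".encode("utf-8")

# Decoding with the wrong codec: every byte becomes one Latin-1 character
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec, as after r.encoding = 'utf-8'
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```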

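chardet (pip install chardet) is an excellent module for detecting the encoding of strings and files.

Assign the encoding detected by chardet directly to r.encoding, and r.text should then decode without garbled characters.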

import requests
import chardet
r = requests.get('http://www.baidu.com')
print chardet.detect(r.content)
r.encoding = chardet.detect(r.content)['encoding']
print r.text

Streaming mode

import requests
r = requests.get('http://www.baidu.com', stream=True)
print r.raw.read(10)
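
With stream=True, r.raw behaves like a file object, so the body can be consumed piece by piece instead of all at once. The Python 3 sketch below uses io.BytesIO as a stand-in for r.raw (an assumption made so the example runs offline), and iter_chunks is a hypothetical helper name.

```python
import io


def iter_chunks(fileobj, chunk_size=10):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for r.raw; a real response would be read from the socket
fake_raw = io.BytesIO(b"<!DOCTYPE html><html><head></head></html>")

first = fake_raw.read(10)           # same call as r.raw.read(10) above
rest = list(iter_chunks(fake_raw))  # continue from where the first read stopped

print(first)
print(rest)
```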

3. Handling request headers

import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
print r.content

4. Handling the status code and response headers

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
if r.status_code == requests.codes.ok:
    print r.status_code    # status code
    print r.headers        # response headers
    print r.headers.get('content-type')  # recommended: returns None if the header is missing
    print r.headers['content-type']      # not recommended: raises KeyError if the header is missing
else:
    r.raise_for_status()

5. Cookie handling

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
# Iterate over all cookie names and print their values
for cookie in r.cookies.keys():
    print cookie + ":" + r.cookies.get(cookie)

Sending custom cookie values

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye', age='10')
r = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print r.text

Requests provides the concept of a session, so we do not have to manage cookie values ourselves and can visit pages in sequence

# -*- coding: utf-8 -*-
import requests
loginUrl = "http://www.xxx.com/login"
s = requests.Session()
# First visit as a guest; the server assigns a cookie
r = s.get(loginUrl, allow_redirects=True)
datas = {'name': 'qiye', 'passwd': 'qiye'}
# POST to the login URL; guest privileges become member privileges
r = s.post(loginUrl, data=datas, allow_redirects=True)
print r.text

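This is a problem encountered in practice: if you skip the first step of visiting the login page and POST directly to the login URL, the system treats you as an illegitimate user. Visiting the login page assigns a cookie, and that cookie must be sent along with the POST request. Handling cookies through a Session object like this will come up often later.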
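
The effect described here can be reproduced offline. The sketch below is written for Python 3 and uses only the standard library: a toy local server hands out a cookie on GET and rejects any POST that lacks it, while an opener with a CookieJar plays the role of the Requests session. The server behaviour, the cookie name visitor, and the helper names are all assumptions made for the illustration.

```python
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Toy login page: GET hands out a cookie, POST requires it."""

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._reply(200, b"login form", cookie="visitor=guest; Path=/")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form
        if "visitor=guest" in self.headers.get("Cookie", ""):
            self._reply(200, b"welcome")
        else:
            self._reply(403, b"illegal user")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def make_session():
    """Opener that stores and resends cookies, with proxies disabled."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({}),
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )


def login(opener, url, visit_first):
    if visit_first:
        opener.open(url, timeout=5).read()  # guest visit: the server assigns the cookie
    form = urllib.parse.urlencode({"name": "qiye", "passwd": "qiye"}).encode("ascii")
    try:
        with opener.open(url, data=form, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/login" % server.server_address[1]

status_without_visit = login(make_session(), url, visit_first=False)
status_with_visit = login(make_session(), url, visit_first=True)

server.shutdown()
server.server_close()
print(status_without_visit, status_with_visit)
```

requests.Session does the same bookkeeping automatically, which is why the two-step sequence in the example above works.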

6. Redirects and history

To control redirects, just set the allow_redirects parameter; the redirect history is available through r.history

# -*- coding: utf-8 -*-
import requests
r = requests.get('http://github.com')   # redirected to https://github.com
print r.url
print r.status_code
print r.history

7. Timeout settings

requests.get('http://github.com', timeout=2)

8. Proxy settings

# -*- coding: utf-8 -*-
import requests
# Map each URL scheme to the proxy that should handle it
proxies = {
    "http": "http://0.10.10.01:3234",
    "https": "http://0.0.0.2:1020",
}
r = requests.get('http://github.com', proxies=proxies)

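Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables, but that is less common.

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax.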
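
One detail worth a sketch: if the user name or password contains characters such as @ or :, they need to be percent-encoded before being placed in the proxy URL. The helper below (Python 3) is hypothetical, and the credentials, host, and port are invented values.

```python
from urllib.parse import quote


def proxy_url(user, password, host, port, scheme="http"):
    """Build a proxy URL with percent-encoded Basic Auth credentials."""
    return "{}://{}:{}@{}:{}".format(
        scheme,
        quote(user, safe=""),       # safe="" also encodes "/" and ":"
        quote(password, safe=""),
        host,
        port,
    )


# Invented credentials and address, for illustration only
proxies = {
    "http": proxy_url("user", "p@ss:word", "10.10.1.10", 3128),
    "https": proxy_url("user", "p@ss:word", "10.10.1.10", 3128),
}
print(proxies["http"])
```

The resulting dictionary can be passed as the proxies argument of requests.get.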