Using Python Web Crawlers

house.zhang

License: CC 4.0 BY-SA

Category: Data Collection. Tags: crawler frameworks

Article link: https://blog.youkuaiyun.com/pop_xiaohao/article/details/51585771

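1. Overview. Python is exceptionally well suited to writing crawlers: a few lines of code are enough for a simple one. It also has several powerful libraries for the job, such as urllib, Beautiful Soup, and the Scrapy framework, all of which are easy to work with.

A simple crawler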

from urllib.request import urlopen

# Request the page and print the raw response body (bytes)
response = urlopen("http://www.sina.com")
print(response.read())

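With just a couple of lines of code you can fetch the full content of the page at that URL. A browser would render that content as a nicely formatted page, whereas the program above simply prints the raw data it received.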
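
Since the raw data comes back as bytes, a common next step is to decode it into text. The following is a minimal sketch, not part of the original program: the helper name fetch_text and the fallback to UTF-8 when the server declares no charset are both assumptions. It is demonstrated on a data: URL so that it runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str.

    fetch_text is a hypothetical helper; falling back to UTF-8 when the
    response declares no charset is an assumption of this sketch.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes, exactly as received
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of raising, since pages can mislabel their charset
    return raw.decode(charset, errors="replace")


# A data: URL keeps the demonstration offline
print(fetch_text("data:text/plain;charset=utf-8,hello%20crawler"))
```

The same call should work with an ordinary address such as "http://www.sina.com", provided the page's declared charset is one Python knows.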

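2. More complex crawlers

Web requests are usually more involved. Most of them need parameters, and some pages can only be accessed after logging in or solving a CAPTCHA. HTTP requests most commonly send their data with POST or GET. There are other methods too, such as PUT and DELETE, but this section concentrates on the first two.

POST:

urlopen(url, data, timeout): urlopen takes three main parameters. The first is the URL, the second (data) is the data to send with the request, and the third is the timeout in seconds. When a POST request needs parameters, pass them as the second argument. In Python 3 that argument must be bytes.

For example: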

from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import urlencode

value = {"username": "xxx@126.com", "password": "xxx"}
# In Python 3 the request body must be bytes, so encode the form string
data = urlencode(value).encode("utf-8")
url = "http://www.xxx.com/login"
request = Request(url, data)  # supplying data makes this a POST request
response = urlopen(request)
print(response.read())
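
To see what the POST request contains before it is sent, the Request object can be inspected directly. This is a small sketch for illustration: the helper name build_post_request is an assumption, and no network call is made.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, fields):
    """Build (but do not send) a form-encoded POST request.

    build_post_request is a hypothetical helper used only for illustration.
    """
    body = urlencode(fields).encode("utf-8")  # str -> bytes, as Python 3 requires
    return Request(url, data=body)


request = build_post_request(
    "http://www.xxx.com/login",
    {"username": "xxx@126.com", "password": "xxx"},
)
print(request.get_method())  # POST, because a body was supplied
print(request.data)          # b'username=xxx%40126.com&password=xxx'
```

Note that urlencode percent-encodes the "@" in the address, so the body is safe to transmit as form data.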

GET:

GET is simpler, but less secure than POST, because the parameters are appended directly to the URL where anyone can read them:

from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import urlencode

values = {}
values["name"] = "xxx@qq.com"
values["password"] = "xxx"
data = urlencode(values)  # build the query string from the parameters
url = "http://www.XXX.com/login"
geturl = url + "?" + data
request = Request(geturl)  # no data argument, so this is a GET request
response = urlopen(request)
print(response.read())
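
The way GET parameters end up in the URL can be checked without sending anything. In this sketch the helper build_get_url, the address www.example.com, and the parameter names q and page are all made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get_url(base_url, params):
    """Append URL-encoded query parameters to base_url (hypothetical helper)."""
    return base_url + "?" + urlencode(params)


# Hypothetical address and parameters, used only to show the encoding
get_url = build_get_url("http://www.example.com/search", {"q": "python crawler", "page": 2})
get_request = Request(get_url)

print(get_url)                   # http://www.example.com/search?q=python+crawler&page=2
print(get_request.get_method())  # GET, because no body was supplied
```

Everything the request carries is visible in the printed URL, which is why GET is a poor fit for passwords.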

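That covers fetching ordinary pages with urllib using simple POST and GET requests. Real crawling is rarely that simple, though. It usually needs further settings, such as request headers or other parameters.

3. More advanced crawlers

Many sites restrict crawlers: they can tell that a request comes from a program rather than from a browser being used normally. A crawler therefore sometimes has to present itself as a browser, which takes some additional settings,

such as HTTP headers. Common HTTP request headers that carry information about the user include From, User-Agent, Referer, Authorization, Client-IP, X-Forwarded-For, and Cookie.

a. Setting headers. The User-Agent header identifies who is making the request; here it says that the request comes from a browser: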

from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import urlencode

value = {"username": "xxxx@126.com", "password": "xxx"}
data = urlencode(value).encode("utf-8")  # the request body must be bytes
url = "https://passport.xxx.net/account/login"
user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"
headers = {"User-Agent": user_agent}
request = Request(url, data, headers)  # the third argument carries the headers
response = urlopen(request)
print(response.read())

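The code above adds a header: User-Agent names the user's browser, so setting it in code lets the program present itself as a browser.
Referer gives the URL of the page the user came from, that is, the page visited just before this one. Servers sometimes check whether the Referer points to their own site, as a protection against hotlinking.

It can be set like this: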

user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"
headers = {"User-Agent": user_agent, "Referer": "http://www.myweb.com"}
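
One way to confirm that the headers were attached is to read them back from the Request object. This is a sketch only: the helper name build_browser_like_request is an assumption, and nothing is sent over the network.

```python
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 "
    "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"
)


def build_browser_like_request(url, referer):
    """Build a request carrying browser-style headers (hypothetical helper)."""
    headers = {"User-Agent": USER_AGENT, "Referer": referer}
    return Request(url, headers=headers)


browser_request = build_browser_like_request(
    "https://passport.xxx.net/account/login", "http://www.myweb.com"
)
# urllib stores header names via str.capitalize(), so look up "User-agent"
print(browser_request.get_header("User-agent"))
print(browser_request.get_header("Referer"))
```

The lowercase "a" in "User-agent" matters only for the lookup; HTTP header names are case-insensitive on the wire.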

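Other headers

Content-Type: when a REST interface is called, the server checks this value to decide how to parse the content of the HTTP body.

application/xml: used for XML RPC calls, such as RESTful/SOAP calls

application/json: used for JSON RPC calls

application/x-www-form-urlencoded: used when a browser submits a web form

Note that when accessing a RESTful or SOAP service, a wrong Content-Type can cause the server to reject the request.

b. Setting a proxy

Why set a proxy? Servers often defend against crawlers by limiting how many requests a single IP address may make. By sending requests through proxies and switching IP addresses regularly, a crawler keeps the server from tying all of its requests to one address.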
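
As an illustration of matching Content-Type to the body, the sketch below builds a JSON request with urllib. The helper name build_json_request, the address api.example.com, and the payload are assumptions made for the example; when data is sent without an explicit Content-Type, urllib falls back to application/x-www-form-urlencoded, so a JSON body needs the header set by hand.

```python
import json
from urllib.request import Request


def build_json_request(url, payload):
    """Build a POST request whose body is JSON (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    # Tell the server how to parse the body; otherwise urllib labels it as form data
    return Request(url, data=body, headers={"Content-Type": "application/json"})


# Hypothetical endpoint and payload
json_request = build_json_request("http://api.example.com/items", {"name": "demo", "count": 3})
print(json_request.get_header("Content-type"))  # application/json
print(json_request.data)
```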

from urllib.request import Request
from urllib.request import ProxyHandler
from urllib.request import build_opener
from urllib.parse import urlencode

# Access the site through a proxy
value = {"username": "xxxxx@126.com", "password": "xxxxx"}
data = urlencode(value).encode("utf-8")  # the request body must be bytes
url = "https://passport.xxx.net/account/login"
request = Request(url, data)

enable_proxy = True
# ProxyHandler maps a URL scheme to a proxy; this mapping only covers http URLs,
# so add an "https" entry to route https requests through the proxy as well
proxy_handler = ProxyHandler({"http": "http://some-proxy.com:8080"})
# An empty mapping disables proxying altogether
null_proxy_handler = ProxyHandler({})
if enable_proxy:
    opener = build_opener(proxy_handler)
else:
    opener = build_opener(null_proxy_handler)

opener.open(request)
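
Switching IP addresses regularly could be sketched by cycling through a list of proxies and building a fresh opener for each one. The proxy addresses below are placeholders, and the helper name make_rotating_openers is an assumption; building the openers does not contact any proxy.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

# Placeholder proxy addresses; replace them with proxies you are allowed to use
PROXY_URLS = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]


def make_rotating_openers(proxy_urls):
    """Yield an endless sequence of openers, each bound to the next proxy in turn."""
    for proxy_url in cycle(proxy_urls):
        # Cover both schemes so https URLs go through the proxy too
        yield build_opener(ProxyHandler({"http": proxy_url, "https": proxy_url}))


openers = make_rotating_openers(PROXY_URLS)
first_opener = next(openers)   # uses proxy-a
second_opener = next(openers)  # uses proxy-b
```

Each request would then be sent with next(openers).open(request), so consecutive requests leave through different proxies.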

c. Setting a timeout

from urllib.request import urlopen

# Give up if the server has not responded within 10 seconds
response = urlopen("http://www.sina.com", timeout=10)
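
When the timeout expires, urlopen raises an exception rather than returning, so the call usually sits inside a try block. The sketch below is one possible shape: the helper name fetch_or_none and the choice to return None on failure are assumptions, not part of the original code.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch_or_none(url, timeout=10):
    """Return the response body as bytes, or None if the request fails or times out.

    fetch_or_none is a hypothetical helper; returning None is a design
    choice made for this sketch.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (URLError, socket.timeout):
        # Connection problems arrive as URLError; a stalled read raises socket.timeout
        return None
```

A call such as fetch_or_none("http://www.sina.com", timeout=10) then yields either the page bytes or None, which the caller can test before parsing.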

Exception handling:

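from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    urlopen("http://www.xxxx.com")
except HTTPError as e:
    # The server answered, but with an error status such as 404 or 500
    print(e.code)
except URLError as e:
    # No valid response was received (DNS failure, refused connection, ...)
    print(e.reason)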

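HTTPError is a subclass of URLError, so the except clause for URLError goes after the one for HTTPError. If it came first, it would catch every HTTPError as well. Putting the more specific exception first is a useful habit in general.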
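
The effect of the ordering can be checked without any network access by raising the two exception types by hand. In this sketch the helper name describe_error and the sample error values are made up for illustration.

```python
import io
from urllib.error import HTTPError, URLError


def describe_error(error):
    """Show which except clause handles a given error (hypothetical helper)."""
    try:
        raise error
    except HTTPError as e:  # the more specific class is tested first
        return "HTTP error %d" % e.code
    except URLError as e:   # catches every other URLError
        return "URL error: %s" % e.reason


# Hand-built sample errors; the URL, status code, and reason are made up
not_found = HTTPError("http://www.xxxx.com", 404, "Not Found", {}, io.BytesIO())
unreachable = URLError("connection refused")

print(issubclass(HTTPError, URLError))  # True
print(describe_error(not_found))        # HTTP error 404
print(describe_error(unreachable))      # URL error: connection refused
```

If the two except clauses were swapped, the first call would report a URL error instead, because the URLError clause would match the HTTPError before its own clause was reached.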