Python Web Scraping: Consolidated urllib Notes

Latest recommended article published on 2021-09-15 16:06:38

yihan.z

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/qq_33361618/article/details/80836096

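This article explains how to scrape web pages with Python's urllib library, covering POST requests, exception handling, browser emulation, and proxy IPs, with concrete examples for Sina, CSDN, and Taobao.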

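This article collects urllib programs for fetching POST pages, handling crawler exceptions, emulating a browser, and using an IP proxy. Four examples illustrate them: scraping a Sina personal page, Sina news, CSDN blogs, and Taobao images.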

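Program 1: Fetching a page via POST

Step 1: The key to scraping a page is understanding its structure and markup, and locating the content you need (usually identified by its tag, class, or style). For this program, that means finding the POST form and working out which parameters it requires.

Step 2: Import urllib's parse module (urllib.parse, used here to URL-encode the form data) and write the form parameters, using keys that match the name attributes of the form fields.

Step 3: Use urllib to request the page and submit the parameters.

Step 1 depends on inspecting the page yourself, so it is not covered in detail. The code below targets the Sina Weibo personal page. For privacy, the credentials passed here are placeholders and will not actually log in.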

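# Fetch a page by submitting a POST form
import urllib.parse  # URL-encoding helpers (urlencode)
import urllib.request

url = "http://login.sina.com.cn/signup/signin.php"
# Keys must match the name attributes of the form fields
login = urllib.parse.urlencode({
    "username": "your username",
    "password": "your password"
}).encode("utf-8")
req = urllib.request.Request(url, login)  # passing data makes this a POST request
# req.add_header("User-Agent", "...")  # optionally masquerade as a browser
data = urllib.request.urlopen(req).read()
text = data.decode("gb2312", "ignore")  # skip bytes that are not valid gb2312
# Save the response; the encoding is explicit so the write does not depend on the system default
with open("E:/Python/test/sinalogin.html", "w", encoding="gb2312") as fh:
    fh.write(text)
print(text)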
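
Steps 2 and 3 can be checked without touching the network. The following is a minimal sketch, reusing the placeholder credentials from the program above, that shows what urlencode produces and that attaching a body to a Request switches it to POST. The variable names `form` and `body` are illustrative only.

```python
import urllib.parse
import urllib.request

# Placeholder form fields; the keys stand in for the form's name attributes
form = {"username": "your username", "password": "your password"}

# urlencode builds "key=value&key=value" and escapes spaces as "+";
# urlopen needs bytes, hence the encode step
body = urllib.parse.urlencode(form).encode("utf-8")

# A Request that carries data is sent as POST instead of GET
req = urllib.request.Request("http://login.sina.com.cn/signup/signin.php", data=body)

print(body)
print(req.get_method())
```

Printing the encoded body before sending it is a quick way to compare it against the form data the browser's developer tools show for a real submission.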

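Program 2: Crawler exception handling

Python's usual exception handling construct is try...except. This program uses the error classes that ship with urllib. A URLError is typically raised for one of these reasons: the connection to the server failed, the remote URL does not exist, the local network is down, or the HTTPError subclass was triggered. Only HTTPError carries a status code, so the handler checks for each attribute with hasattr before printing it. The program: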

import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://blog.youkuaiyun.com")
except urllib.error.URLError as e:
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
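
As an alternative to the hasattr checks, the two error types can be told apart with isinstance, since HTTPError is a subclass of URLError. This is a rough sketch, not the author's method: the helper names `describe_error` and `fetch` are hypothetical, and the two error objects at the bottom are constructed by hand purely for illustration.

```python
import urllib.error
import urllib.request

def describe_error(err):
    """Turn a urllib error into a short message."""
    # HTTPError is checked first because it is also a URLError
    if isinstance(err, urllib.error.HTTPError):
        return "HTTP error %d: %s" % (err.code, err.reason)
    return "URL error: %s" % (err.reason,)

def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as err:
        print(describe_error(err))
        return None

# Hand-built errors, so the demonstration does not need a network connection
print(describe_error(urllib.error.URLError("connection refused")))
print(describe_error(urllib.error.HTTPError("http://example.invalid/", 404, "Not Found", None, None)))
```

Passing a timeout to urlopen keeps a crawler from hanging indefinitely on an unresponsive server; the value of 10 seconds here is an arbitrary choice.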

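Program 3: Emulating a browser

Many sites restrict crawlers, so the request has to carry extra parameters that make it look like it came from a browser. For most pages, sending a User-Agent header is enough. Sites with stronger anti-scraping measures may require several headers, or a proxy IP, before they will serve the page.

Using Firefox as an example: open the page you want to scrape, press F12 to open the developer tools, click the Network tab, and select any request in the list. The parameters needed to emulate the browser are all listed under Request Headers in the Headers panel.

The program: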

import re
import urllib.request

url = "http://blog.youkuaiyun.com/"
headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # install globally so urlopen and urlretrieve send this header
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
# Capture the href of every matching article link
pat = '<a strategy="wechat" href="(.*?)"'
result = re.compile(pat).findall(data)
for i, link in enumerate(result):
    file = "E:/Python/test/cdsn/" + str(i) + ".html"  # the target directory must already exist
    urllib.request.urlretrieve(link, filename=file)
    print("Page " + str(i + 1) + " downloaded successfully")
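
Installing a global opener affects every later urlopen call in the process. When only some requests should look like a browser, the header can instead be attached to an individual Request. The sketch below assumes the same Firefox User-Agent string as the program above; `build_request` and `fetch` are hypothetical helper names, and `fetch` is defined but not called so the example runs offline.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"

def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries its own User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, timeout=10):
    """Download a page with the browser-like header and decode it as UTF-8."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", "ignore")

req = build_request("http://blog.youkuaiyun.com/")
# Request normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

The global opener remains the simpler choice when urlretrieve is used, because urlretrieve accepts only a URL and picks up headers from the installed opener.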

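Program 4: Using an IP proxy

As mentioned above, some sites have strict anti-scraping measures, and an IP address that sends requests too frequently is very likely to be banned. In that situation a proxy IP becomes necessary.

Proxy IPs can be obtained for free online or purchased. The following program routes requests through a proxy IP: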

# Route requests through a proxy IP
import urllib.request

def use_proxy(url, headers, proxy_addr):
    # Send http:// requests through the given proxy address
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    opener.addheaders = [headers]  # set the request headers
    urllib.request.install_opener(opener)  # install the opener globally
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return data

if __name__ == '__main__':
    proxy_addr = "119.28.152.208:80"  # proxy source: http://www.xicidaili.com/
    url = "http://www.baidu.com"
    headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0")
    data = use_proxy(url, headers, proxy_addr)
    print(len(data))  # length of the decoded page
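
Free proxies tend to go offline quickly, so it can help to keep each proxy in its own opener rather than installing one globally; a dead proxy can then be swapped without affecting other requests. The following is a minimal sketch under that assumption. `make_proxy_opener` and `fetch_via_proxy` are hypothetical names, the proxy address is the one from the program above and may no longer work, and no request is actually sent.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"

def make_proxy_opener(proxy_addr, user_agent=USER_AGENT):
    """Build an opener that sends http:// requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    opener = urllib.request.build_opener(proxy)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener

def fetch_via_proxy(opener, url, timeout=10):
    """Fetch url through the opener's proxy without installing it globally."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "ignore")

opener = make_proxy_opener("119.28.152.208:80")
# Confirm which proxy mapping the opener will use
for handler in opener.handlers:
    if isinstance(handler, urllib.request.ProxyHandler):
        print(handler.proxies)
```

Note that the mapping only covers the "http" scheme; an https:// URL would need its own "https" entry in the dictionary passed to ProxyHandler.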

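Example 1: Scraping dress images from Taobao

This example involves two problems. The first is building the page URL and paging through results, which requires encoding the keyword: the quote function converts the Chinese keyword into the percent-encoded form that appears in the URL. The second is using a regular expression to locate the images and extract them. The code: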

import re
import urllib.parse
import urllib.request

keyname = "连衣裙"
key = urllib.parse.quote(keyname)  # percent-encode the Chinese keyword for use in the URL
headers = ("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
for i in range(1, 2):  # fetches a single page; widen the range to page further
    # The s parameter is the result offset, advancing by 60 per page
    url = "https://s.taobao.com/list?q="+key+"&cat=16&style=grid&seller_type=taobao&spm=a219r.lm874.1000187.1&bcoffset=12&s="+str(i*60)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    print(url)
    pat = 'pic_url":"//(.*?)"'  # capture the image address after the leading //
    imagelist = re.compile(pat).findall(data)
    for j, thisimg in enumerate(imagelist):
        thisimgurl = "http://" + thisimg
        # The separator keeps page number and image index unambiguous in the file name
        file = "E:/Python/test/img/" + str(i) + "_" + str(j) + ".jpg"
        urllib.request.urlretrieve(thisimgurl, filename=file)
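
The two ideas in this example, keyword encoding and regex extraction, can be tried offline on a small string. In the sketch below, `sample` is a made-up fragment shaped like the page data the pattern targets (the img.example.com addresses are not real Taobao output), and `extract_image_urls` is a hypothetical helper name.

```python
import re
import urllib.parse

def extract_image_urls(page_text):
    """Return full image URLs for every pic_url entry in page_text."""
    # Non-greedy match stops at the first closing quote after the leading //
    paths = re.findall(r'pic_url":"//(.*?)"', page_text)
    return ["http://" + path for path in paths]

# quote percent-encodes the UTF-8 bytes of each Chinese character
key = urllib.parse.quote("连衣裙")
print(key)

# Made-up sample data for illustration only
sample = '{"pic_url":"//img.example.com/a.jpg","title":"x"},{"pic_url":"//img.example.com/b.jpg"}'
print(extract_image_urls(sample))
```

Testing the pattern against a saved copy of the page before running the full download loop makes it easier to notice when the site's markup has changed and the regex no longer matches.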