Python Web Scraping: Simulating a Browser

Latest recommended article published on 2020-12-04 00:14:16

qq_42459926

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/qq_42459926/article/details/80707892

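This article shows one way to fetch web pages with Python's urllib library. Setting the User-Agent header makes a request look like it comes from a browser, which lowers the chance that the site rejects or blocks the crawler. The code examples show how to build request headers, send an HTTP request, and read the response data.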

import urllib.request
import random

url="http://www.badu.com"
'''
# Set a fairly complete set of request headers
headers={
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Content-Type":"text/html;charset=utf-8",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}
# Build a Request object that carries these headers
req=urllib.request.Request(url,headers=headers)
# Send the request
response=urllib.request.urlopen(req)
data=response.read().decode("utf-8")
print(data)
'''
# Rotating among several User-Agents lowers the risk of being blocked (it does not change the client IP)
agentsList=[
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
]
# Pick one User-Agent at random
agentStr=random.choice(agentsList)
req=urllib.request.Request(url)
# add_header adds the User-Agent directly to the request headers
req.add_header("User-Agent",agentStr)
response=urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
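
The script above can be wrapped into reusable functions. The following is only a minimal sketch, not the author's code: the names `AGENTS`, `build_request`, and `fetch`, the 10-second timeout, and the `http://www.example.com` URL are all assumptions added for illustration. The two User-Agent strings are taken from the listings above. The sketch also catches request errors and reads the charset from the response instead of assuming UTF-8.

```python
import random
import urllib.error
import urllib.request

# User-Agent strings copied from the article; the name AGENTS is hypothetical.
AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_request(url, agents=AGENTS):
    """Return a Request whose User-Agent is picked at random from agents."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random.choice(agents))
    return req


def fetch(url, agents=AGENTS, timeout=10):
    """Fetch url and return the decoded body, or None on an HTTP or URL error."""
    req = build_request(url, agents)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        print("Connection error:", err.reason)
    return None


if __name__ == "__main__":
    # Building the request does not touch the network.
    req = build_request("http://www.example.com")
    # urllib stores header names capitalized, so the lookup key is "User-agent".
    print(req.get_header("User-agent"))
```

Calling `fetch(url)` with the `url` defined at the top of the script would then stand in for the last five lines of the original listing. Keeping `build_request` separate makes it possible to check which User-Agent was chosen without sending anything.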