爬虫-基于GET请求

最新推荐文章于 2024-03-27 11:10:01 发布

lmw-xiaoxin

最新推荐文章于 2024-03-27 11:10:01 发布

阅读量570

点赞数

CC 4.0 BY-SA版权

分类专栏：爬虫

本文链接：https://blog.youkuaiyun.com/lmw1239225096/article/details/79144213

爬虫专栏收录该内容

4 篇文章

订阅专栏

本文介绍了网络爬虫的基础知识，重点关注GET请求在爬虫中的应用。通过学习，读者将掌握如何发送GET请求来获取网页数据，并了解其在爬虫技术中的重要性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一.基本请求

import requests
response=requests.get('http://dig.chouti.com/')
print(response.text)

二.带参数的GET请求-->params

#在请求头内将自己伪装成浏览器，否则百度不会正常返回页面内容
import requests
response=requests.get('https://www.baidu.com/s?wd=python&pn=1',
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.text)


#如果查询关键词是中文或者有其他特殊符号，则不得不进行url编码
from urllib.parse import urlencode
wd='egon老师'
encode_res=urlencode({'k':wd},encoding='utf-8')
keyword=encode_res.split('=')[1]
print(keyword)
# 然后拼接成url
url='https://www.baidu.com/s?wd=%s&pn=1' %keyword

response=requests.get(url,
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
res1=response.text

params参数的使用

#上述操作可以用requests模块的一个params参数搞定，本质还是调用urlencode
from urllib.parse import urlencode
wd='egon老师'
pn=1

response=requests.get('https://www.baidu.com/s',
                      params={
                          'wd':wd,
                          'pn':pn
                      },
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
res2=response.text

#验证结果，打开a.html与b.html页面内容一样
with open('a.html','w',encoding='utf-8') as f:
    f.write(res1) 
with open('b.html', 'w', encoding='utf-8') as f:
    f.write(res2)

三.带参数的GET请求-->headers

#通常我们在发送请求时都需要带上请求头，请求头是将自身伪装成浏览器的关键，常见的有用的请求头如下
Host
Referer #大型网站通常都会根据该参数判断请求的来源
User-Agent #客户端
Cookie #Cookie信息虽然包含在请求头里，但requests模块有单独的参数来处理他，headers={}内就不要放它了

#添加headers(浏览器会识别请求头,不加可能会被拒绝访问,比如访问https://www.zhihu.com/explore)
import requests
response=requests.get('https://www.zhihu.com/explore')
response.status_code #500


#自己定制headers
headers={
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',

}
respone=requests.get('https://www.zhihu.com/explore',
                     headers=headers)
print(respone.status_code) #200

四.带参数的GET请求-->Cookie

#登录github，然后从浏览器中获取cookies，以后就可以直接拿着cookie登录了，无需输入用户名密码
#用户名:egonlin 邮箱378533872@qq.com 密码lhf@123

import requests

Cookies={   'user_session':'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',
}

response=requests.get('https://github.com/settings/emails',
             cookies=Cookies) #github对请求头没有什么限制，我们无需定制user-agent，对于其他网站可能还需要定制


print('378533872@qq.com' in response.text) #True