Python Web Scraping MOOC Beginner Notes (1)

Practical Web Scraping Tips

Latest recommended article published 2024-03-08 09:34:16

License: CC 4.0 BY-SA

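This article walks through applied web-scraping examples covering text, images, and Baidu search results. Where a site blocked the scraper, changing the User-Agent header was enough to retrieve the target page. It also gives the concrete code for scraping a specific product detail page and an image.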

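Scraping text:

1) Check r.encoding first. If it does not match the page's real encoding (for example 'gbk' on some Chinese sites), Chinese text comes out garbled, so replace it:
r.encoding = r.apparent_encoding
or
r.encoding = 'gbk'
2) Keep the common 'utf-8' encoding in mind.
3) Some sites with anti-scraping checks only respond normally after the request changes its identity (the User-Agent header).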
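
The three tips above can be combined into one small helper. This is a minimal sketch, not the MOOC's own code: the name fetch_text, the HEADERS constant, and the 10-second timeout are assumptions made for illustration. The 'Mozilla/5.0' value is the one used in the examples of these notes.

```python
import requests

# Browser-like identity; the same value the examples in these notes use.
HEADERS = {"user-agent": "Mozilla/5.0"}


def fetch_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        # Replace the header-declared encoding with the one guessed from the body.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None
```

Calling fetch_text(url) with a page address returns the text, or None on any network or HTTP error, so the caller only has to handle one failure case.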

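Scraping images:

1) The body of the with block that uses the file object f must be indented, otherwise Python raises an error.
2) Check r.status_code; anything other than 200 means the request did not succeed.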

The concrete examples follow; most of the code comes from the MOOC.

1. Scraping the text of a JD RTX 3080 graphics card page

>>> import requests
>>> url = "https://item.jd.com/100015151410.html#crumb-wrap"
>>> r = requests.get(url)
>>> r.encoding
'UTF-8'
>>> r.text
# No normal result came back, which means the request hit an anti-scraping check
"<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100015151410.html'</script>"
>>> r.status_code
200
>>> r.apparent_encoding
'ascii'
>>> kv = {'user-agent':'Mozilla/5.0'}  # switch to a common browser identity, 'Mozilla/5.0'
>>> r = requests.get(url,headers = kv)
>>> r.status_code
200
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
# The normal result
'<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n    <!--yushou-->\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n    <title>【七彩虹iGame GeForce RTX 3080 Vulcan 10G】七彩虹（Colorful）火神iGame GeForce RTX 3080 Vulcan 10G 1710Mhz GDDR6X 电竞游戏电脑显卡【行情 报价 价格 评测】-京东</title>\n    <meta name="keywords" content="ColorfuliGame GeForce RTX 3080 Vulcan 10G,七彩虹iGame GeForce RTX 3080 Vulcan 10G,七彩虹iGame GeForce RTX 3080 Vulcan 10G报价,ColorfuliGame GeForce RTX 3080 Vulcan 10G报价"/>\n    <meta name="description" content="【七彩虹iGame GeForce RTX 3080 Vulcan 10G】京东JD.COM提供七彩虹iGame GeForce RTX 3080 Vulcan 10G正品行货，并包括ColorfuliGame GeForce RTX 3080 Vulcan 10G网购指南，以及七彩虹iGame GeForce RTX 3080 Vulcan 10G图片、iGame GeForce RTX 3080 Vulcan 10G参数、iGame GeForce RTX 3080 Vulcan 10G评论、iGame GeForce RTX 3080 Vulcan 10G心得、iGame GeForce RTX 3080 Vulcan 10G技巧等信息，网购七彩虹iGame GeForce RTX 3080 Vulcan 10G上京东,放心又轻松" />\n    <meta name="format-detection" content="telephone=no">\n    <meta http-equiv="mobile-agent" content="'
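
Note that the blocked request above also returned status code 200, so the status code alone does not show whether the real page came back. A rough check on the body can catch this case. The sketch below is a hypothetical heuristic: the function name, the marker strings, and the max_len threshold are assumptions derived only from the single blocked response shown above.

```python
def looks_like_login_redirect(html, max_len=500):
    """Return True if html is a short script that jumps to a login page.

    Hypothetical heuristic: the marker strings and max_len are assumptions
    based on the one blocked response shown in these notes.
    """
    text = html.strip()
    return len(text) < max_len and "window.location.href" in text and "login" in text


# The blocked body from the transcript above.
blocked = "<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100015151410.html'</script>"
# A stand-in for a real page: a normal HTML prefix padded to a realistic length.
normal = '<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n' + "x" * 1000

print(looks_like_login_redirect(blocked))  # True
print(looks_like_login_redirect(normal))   # False
```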

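2. Calling an IP lookup site's API and returning the query result

Not shown here. Many IP lookup sites charge for their APIs, and others are so full of ads that extracting the result is difficult.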

3. Scraping an image from a page

>>> import requests
>>> path = "D:/abc.jpg"  # where to save the image
>>> url = "https://i0.hdslb.com/bfs/album/f52de8194ac32c2cfe3ac5327f44410ba72fdfe1.jpg@1036w_1e_1c.jpg"  # address of the image to download
>>> r = requests.get(url)
>>> r.status_code
200
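>>> with open(path,'wb') as f:
...     f.write(r.content)  # note the indentation; the with block closes f automatically
...
132173
# The number is the count of bytes written, so the download succeeded; the image is now in the target folder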
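
The same steps can be wrapped in a reusable function that names the file after the last segment of the URL and skips files that already exist. This is a sketch under assumptions: filename_from_url, download_image, the folder parameter, and the timeout are illustrative names and values, not part of the original notes.

```python
import os
import posixpath
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def download_image(url, folder="."):
    """Save the image at url into folder and return the local path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename_from_url(url))
    if os.path.exists(path):
        return path  # already downloaded, nothing to do
    r = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=10)
    r.raise_for_status()  # stop here if the status code signals an error
    with open(path, "wb") as f:  # 'wb' because image data is binary
        f.write(r.content)
    return path
```

Using raise_for_status() instead of reading r.status_code by hand means a failed request never reaches the file-writing step.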

4. Scraping Baidu search results

>>> import requests
>>> kv = {'wd':'吉林大学'}  # search Baidu for 吉林大学 (Jilin University)
>>> r = requests.get("http://www.baidu.com/s",params = kv)  # params are appended to the URL as the query string
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=%E5%90%89%E6%9E%97%E5%A4%A7%E5%AD%A6'
>>> r.text
# The result follows; it is too long to show here
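
To see how params turns the keyword into the percent-encoded URL shown in r.request.url, the request can be prepared without being sent. A small sketch, assuming a helper named search_url (the name is illustrative); requests.Request(...).prepare() builds the final URL offline.

```python
from urllib.parse import parse_qs, urlsplit

import requests


def search_url(keyword, base="http://www.baidu.com/s"):
    """Build the search URL for keyword without sending the request."""
    prepared = requests.Request("GET", base, params={"wd": keyword}).prepare()
    return prepared.url


url = search_url("吉林大学")
print(url)  # http://www.baidu.com/s?wd=%E5%90%89%E6%9E%97%E5%A4%A7%E5%AD%A6

# Decoding the query string recovers the original keyword.
print(parse_qs(urlsplit(url).query))  # {'wd': ['吉林大学']}
```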