Web Scraping for Beginners, Part 1

Latest recommended article published on 2023-03-02 10:36:32

mr_xinL

Category: Web Scraping. Tags: python

Original link: https://blog.youkuaiyun.com/sunon_/article/details/90634253?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

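This post walks through using a Python crawler to scrape images from the Sina website: reading the page, identifying its encoding, matching image URLs with a regular expression, and downloading the images to local disk.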

Adapting sample code to scrape images from Sina

import urllib.request
import re
import chardet

# Open the page, read it, and decode it
page = urllib.request.urlopen('http://photo.sina.com.cn/')  # open the page
htmlCode = page.read()  # read the page source as raw bytes
# print(chardet.detect(htmlCode))  # print the detected encoding; chardet.detect() returns a dict where 'confidence' is the detection confidence and 'encoding' is the encoding name
# print(htmlCode.decode('utf-8'))  # print the page source; bytes.decode() decodes the bytes with the given encoding (UTF-8 by default) and returns a str

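# Save the page data to a file
with open(r'D:\MEITU\pageCode.txt', 'wb') as pageFile:  # open pageCode.txt for binary writing (raw string keeps the backslashes literal)
    pageFile.write(htmlCode)  # write the raw bytes; the with block closes the file automatically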

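# Use a regular expression to find the images
data = htmlCode.decode('utf-8')
reg = r'src="(.+?\.jpg)"'  # regular expression: capture the src value of each .jpg image
reg_img = re.compile(reg)  # compile the pattern once so it can be reused
imglist = reg_img.findall(data)  # collect every match
# for img in imglist:
#     print(img)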

# Download the images to the local disk
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, r'D:\MEITU\PIG\%s.jpg' % x)  # save into the target folder
    x += 1    # this raised HTTP Error 502: Bad Gateway; request headers need to be added
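
The note about HTTP Error 502 says request headers need to be added. A minimal sketch of one way to do that with `urllib.request.Request` follows. The `HEADERS` value and the helper names `build_request` and `download_image` are assumptions made for illustration and are not part of the original script; whether a browser-like `User-Agent` is enough to avoid the 502 on this particular site has not been verified.

```python
import shutil
import urllib.request

# Assumption: a browser-like User-Agent string; the exact value is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_request(url):
    """Wrap a URL in a Request object that carries the custom headers."""
    return urllib.request.Request(url, headers=HEADERS)


def download_image(url, dest_path, timeout=10):
    """Fetch url with the custom headers and write the response body to dest_path."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)  # stream the response into the file


# Possible usage in place of the urlretrieve call in the loop above:
#   for x, img in enumerate(imglist):
#       download_image(img, r'D:\MEITU\PIG\%s.jpg' % x)
```

`urlretrieve` only takes a URL string, so it cannot send custom headers on its own; passing a `Request` object to `urlopen` is one way around that.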

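# ----------------
# Copyright notice: this is an original article by 优快云 blogger "sunon_", released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
# Original link: https://blog.youkuaiyun.com/sunon_/article/details/90634253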
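
The regex step in the script captures whatever text appears in each `src` attribute. If a page uses protocol-relative (`//host/a.jpg`) or site-relative (`/img/a.jpg`) paths, those values cannot be passed to a download call directly. The sketch below shows one way to normalize them with `urllib.parse.urljoin`. It assumes such values may occur on the target page, which has not been checked here. The function name `extract_image_urls` and the slightly tighter pattern (`[^"]+?` instead of `.+?`, so a match cannot run past the closing quote) are illustrative choices, not the original author's code.

```python
import re
from urllib.parse import urljoin

# Same idea as the script's pattern, but [^"] keeps a match inside one attribute.
IMG_PATTERN = re.compile(r'src="([^"]+?\.jpg)"')


def extract_image_urls(html, base_url):
    """Return absolute .jpg URLs found in html, in page order, without duplicates."""
    raw_urls = IMG_PATTERN.findall(html)
    absolute = [urljoin(base_url, u) for u in raw_urls]  # resolve relative forms
    return list(dict.fromkeys(absolute))  # drop repeats while keeping order


sample = (
    '<img src="//n.sinaimg.cn/a.jpg">'
    '<img src="/img/b.jpg">'
    '<img src="//n.sinaimg.cn/a.jpg">'
    '<img src="d.png">'
)
urls = extract_image_urls(sample, "http://photo.sina.com.cn/")
```

In the script, the equivalent call would be `extract_image_urls(data, 'http://photo.sina.com.cn/')` in place of `reg_img.findall(data)`. The sample HTML above is made up for the example.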

Author: sunon_
Post: My First Python Crawler
Original link: https://blog.youkuaiyun.com/sunon_/article/details/90634253?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task