Notes on scraping images and text from the web with Python

Latest recommended article published on 2024-04-21 12:45:23

smithereens_photog

Category: Python. Tags: python, web scraping

Article link: https://blog.youkuaiyun.com/weixin_42588379/article/details/84974567

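This post shows one way to scrape web pages with Python, covering how to fetch images and how to fetch the text of a news article. Using urllib and BeautifulSoup, it walks through downloading images from ifeng.com and extracting a news story from sina.com.cn.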

Without further ado, here is the code.

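# -*- coding: utf-8 -*-
"""
Created on Fri Nov 23 10:44:47 2018

@author: fengl
"""
# Ex7_5.py
import os
import re
import urllib.request


def getHtml(url):
    # Fetch the page and decode the raw bytes as UTF-8.
    html = urllib.request.urlopen(url).read()
    return html.decode('utf-8')


def getImg(html):
    # Build a regular expression that pulls image URLs ending in "jpg"
    # out of the src="..." attributes in the page source.
    img_re = re.compile(r'(?<=src=")\S+?jpg')
    # Print the type of the html object (a str after decoding).
    print("the type of html is :", type(html))
    # Print the whole HTML source (verbose; only useful for debugging).
    print(html)
    # Return a list of image download addresses.
    return img_re.findall(html)


def getUrl(source):
    # Protocol-relative addresses ("//host/path") need a scheme prepended.
    if source.startswith("//"):
        return "http:" + source
    return source


# Call getHtml with the address of the page to scrape.
html = getHtml('http://www.ifeng.com')
# Call getImg to get the list of image addresses.
img_list = getImg(html)
# Make sure the output folder exists before downloading.
os.makedirs('img', exist_ok=True)
print("Downloading images......")

for i, img_url in enumerate(img_list):
    print(img_url)
    # Use urlretrieve to save each image as img/<index>.jpg.
    urllib.request.urlretrieve(getUrl(img_url), os.path.join('img', '%s.jpg' % i))

print("Image download finished......")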

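Note that the images are saved into a folder named img, relative to the working directory (normally the same level as the script file), so that folder must exist when the download starts. The script also only picks up images whose address ends in jpg.

To fetch png images instead, only a small change is needed: replace jpg with png in the code, both in the regular expression and in the saved file name.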
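Instead of editing the script each time the image type changes, the pattern can accept several extensions at once. The following is a minimal sketch, not the code from this post: the names `IMG_RE`, `find_images` and `extension_of` are made up for illustration, and the pattern is a stricter variant that also requires the closing quote of the src attribute.

```python
import re

# Hypothetical variant of the pattern above: accept .jpg, .jpeg and .png,
# and require the address to end right before the closing quote.
IMG_RE = re.compile(r'(?<=src=")[^"\s]+?\.(?:jpe?g|png)(?=")')


def find_images(html):
    """Return every src="..." value that ends in an accepted extension."""
    return IMG_RE.findall(html)


def extension_of(url):
    """Return the extension without the dot, so the saved file keeps its type."""
    return url.rsplit('.', 1)[-1]


# Made-up sample markup; a.example is a placeholder host.
sample = ('<img src="//a.example/x.jpg">'
          '<img src="//a.example/y.png">'
          '<img src="//a.example/z.gif">')

for i, url in enumerate(find_images(sample)):
    # Build a file name such as "0.jpg" or "1.png" from the index and extension.
    print('%s.%s' % (i, extension_of(url)), url)
```

With this approach the saved file name can reuse the real extension rather than a hard-coded one.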

That covers the code for scraping images.
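The getUrl helper in the script only handles addresses that start with "//". If a page also uses root-relative addresses such as /img/a.jpg, the standard library's `urllib.parse.urljoin` can resolve all of these forms against the page address. A small sketch under that assumption; `to_absolute`, `PAGE_URL` and the example hosts are hypothetical names, not part of the original script.

```python
from urllib.parse import urljoin

# The page the image addresses were scraped from (assumed base for resolution).
PAGE_URL = 'http://www.ifeng.com/'


def to_absolute(source, page_url=PAGE_URL):
    """Resolve a scraped src value against the page it came from."""
    return urljoin(page_url, source)


# Protocol-relative address: inherits the scheme of the page.
print(to_absolute('//p0.example.com/a.jpg'))
# Root-relative address: inherits scheme and host of the page.
print(to_absolute('/img/a.jpg'))
# Already absolute address: returned unchanged.
print(to_absolute('https://cdn.example.com/a.jpg'))
```

Passing the page address explicitly keeps the helper usable for pages served over https as well.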

Next is the code for scraping text.

# -*- coding: utf-8 -*-
"""
Created on Fri Nov 23 21:14:54 2018

@author: fengl
"""
# Ex7_8.py
import re
import urllib.request

from bs4 import BeautifulSoup


def getHtml(url):
    # Fetch the page and decode the raw bytes as UTF-8.
    html = urllib.request.urlopen(url).read()
    return html.decode('utf-8')


html = getHtml("https://news.sina.com.cn/c/xl/2018-12-01/doc-ihmutuec5171858.shtml")
bsObj = BeautifulSoup(html, "html.parser")
# Collect every <p> element on the page.
downloadList = bsObj.select('p')

# Keep only paragraphs whose markup is exactly: <p>, leading whitespace,
# a run of non-whitespace text, </p>.
text_re = re.compile(r'<p>(\s+?\S+?)</p>')
text_list = []
for txt in downloadList:
    # Serialize the tag back to markup, then pull out the paragraph text.
    text_list += text_re.findall(str(txt))
print(text_list)

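This code extracts the text inside p tags. Because of the regular expression, it keeps only p tags without attributes whose content is leading whitespace followed by text that contains no further whitespace; any other paragraph is skipped.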
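To see which paragraphs the pattern keeps, it helps to run it against a few hand-written fragments. This is only an illustrative sketch: the sample strings and the `match_paragraphs` helper are invented here, and only the regular expression is taken from the script.

```python
import re

# Same pattern as in the script above.
text_re = re.compile(r'<p>(\s+?\S+?)</p>')


def match_paragraphs(fragments):
    """Return the captured text of every fragment the pattern accepts."""
    found = []
    for fragment in fragments:
        found += text_re.findall(fragment)
    return found


# Made-up samples; \u3000 is the full-width space often used as an indent.
samples = [
    '<p>\u3000\u3000Paragraph-one.</p>',           # kept: indent + unbroken text
    '<p>No-indent.</p>',                           # skipped: no leading whitespace
    '<p class="note">\u3000\u3000Styled.</p>',     # skipped: tag has an attribute
    '<p>\u3000\u3000two words</p>',                # skipped: whitespace inside text
]

print(match_paragraphs(samples))
```

Note that the captured text still includes the leading indent. When every paragraph is wanted regardless of its shape, calling `get_text()` on each BeautifulSoup tag is a common alternative to the regular expression.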