Reading Excel content with xlrd in Python and extracting HTML tag attributes with BeautifulSoup

Licensed under CC 4.0 BY-SA

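This article describes how to use Python's xlrd library to read an Excel file whose cells contain HTML code, then parse that HTML with BeautifulSoup to extract the src address of every img tag. The approach suits sheets with a large number of rows.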

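Requirement: an Excel file has a single column with a little over 800 rows, and every cell holds HTML code. The goal is to pull out the src address of every img tag in that HTML.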

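Python has many libraries for reading Excel files. The simplest and most commonly used are xlrd and xlwt, which complement each other: xlrd reads and xlwt writes.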

I found a comparison table of Excel libraries online (the table image is not reproduced here).

The xlrd code for reading the Excel file is shown below. xlrd (version 2.0 and later) can only open .xls files; an .xlsx file raises an error.

import xlrd

wb = xlrd.open_workbook(filename="00excel/list.xls")
table = wb.sheet_by_index(0)           # first worksheet
nrows111 = table.nrows                 # number of rows
ncols111 = table.ncols                 # number of columns
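
Because xlrd rejects .xlsx files, one way to cope with both formats is to choose the reader from the file extension. The following is a minimal sketch and not part of the original code: the function names `pick_engine` and `read_first_column` are hypothetical, and using openpyxl for .xlsx files is an assumption (it has to be installed separately).

```python
from pathlib import Path


def pick_engine(path):
    """Return the name of the library assumed to handle this file type."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xls":
        return "xlrd"        # legacy binary format, the only one xlrd 2.x opens
    if suffix == ".xlsx":
        return "openpyxl"    # assumption: openpyxl is used for the newer format
    raise ValueError(f"unsupported Excel extension: {suffix!r}")


def read_first_column(path):
    """Return the first column of the first sheet as a list of strings."""
    engine = pick_engine(path)
    if engine == "xlrd":
        import xlrd  # imported lazily so only the needed library is required
        sheet = xlrd.open_workbook(path).sheet_by_index(0)
        return [str(value) for value in sheet.col_values(0)]

    import openpyxl
    workbook = openpyxl.load_workbook(path, read_only=True)
    try:
        sheet = workbook.worksheets[0]
        rows = sheet.iter_rows(min_col=1, max_col=1, values_only=True)
        # Skip empty cells, which openpyxl reports as None
        return [str(row[0]) for row in rows if row[0] is not None]
    finally:
        workbook.close()  # read-only workbooks keep the file handle open
```

Calling `read_first_column("00excel/list.xls")` would then return the HTML cells as a list, whichever of the two formats the file is saved in.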

Once the content has been read, pass it to BeautifulSoup to parse the HTML and get the src address of each img tag.

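from bs4 import BeautifulSoup as bs

# con is the HTML string built from the cells read from the sheet
soup = bs(con, 'html.parser')
for link in soup.find_all("img"):
    src = link.get("src")              # None when the tag has no src attribute
    if src:
        print(src)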
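
To make the extraction step reusable, it can be wrapped in a function. This is a sketch under assumptions: the function name `extract_img_srcs` and the sample HTML are made up for illustration. Passing `src=True` to `find_all` keeps only img tags that have a src attribute, so a tag without one is skipped rather than causing a KeyError when it is indexed with `["src"]`.

```python
from bs4 import BeautifulSoup


def extract_img_srcs(html):
    """Return the src value of every <img> tag in html, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True matches only tags that actually carry a src attribute
    return [img["src"] for img in soup.find_all("img", src=True)]


# Made-up sample: the middle tag has no src and is ignored
sample = '<p><img src="a.png"><img alt="no source"><img src="b.jpg"></p>'
print(extract_img_srcs(sample))  # ['a.png', 'b.jpg']
```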

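When installing packages with pip from inside mainland China, the server may refuse the connection or return a similar error. A domestic mirror can be used instead.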

1. Install xlrd and xlwt

pip install -i https://pypi.douban.com/simple xlrd xlwt

2. Install BeautifulSoup

pip install -i https://pypi.douban.com/simple bs4

The complete code is as follows:

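import xlrd
from bs4 import BeautifulSoup as bs

wb = xlrd.open_workbook(filename="00excel/list.xls")
table = wb.sheet_by_index(0)           # first worksheet
nrows111 = table.nrows                 # number of rows
ncols111 = table.ncols                 # number of columns

print("Number of rows:")
print(nrows111)
print("Number of columns:")
print(ncols111)

# Concatenate the HTML stored in the first column of every row
con = ""
for rowi in range(nrows111):
    # str() guards against non-text cells, e.g. numbers, which xlrd returns as float
    con = con + str(table.cell(rowi, 0).value)

# Parse the combined HTML and print the src of every img tag
soup = bs(con, 'html.parser')
for link in soup.find_all("img"):
    src = link.get("src")              # None when the tag has no src attribute
    if src:
        print(src)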
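
With 800-odd rows the same image address may appear more than once, and console output is awkward to reuse. A possible follow-up, which the code above does not do, is to drop duplicates and write the addresses to a text file. The sketch below uses only the standard library; the names `unique_in_order` and `save_lines` and the sample list are hypothetical.

```python
from pathlib import Path


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order (Python 3.7+), so the original order survives
    return list(dict.fromkeys(items))


def save_lines(lines, path):
    """Write one item per line to a UTF-8 text file."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


srcs = ["a.png", "b.jpg", "a.png"]   # stand-in for the extracted addresses
print(unique_in_order(srcs))         # ['a.png', 'b.jpg']
```

In the full script, the src values would be collected into a list instead of printed, then passed through `unique_in_order` and `save_lines`.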