1. Install the required package
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple html2text
2. Module overview
Official documentation: https://pypi.org/project/html2text/
I. Basic usage:
Convert an HTML file to text
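import html2text as ht
# Create a converter with the default settings
text_maker = ht.HTML2Text()
path = r"C:/Users/dcg/Desktop/html/1.html"
# Read the HTML file; the with block closes it automatically
with open(path, 'r', encoding='utf8') as html_file:
    html_page = html_file.read()
# Convert the HTML to Markdown-style plain text
text = text_maker.handle(html_page)
print(text)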
II. You can also set some options
1. First, see which options are available
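One way to look at the available options without leaving the interpreter is to inspect a converter instance. The sketch below is only an illustration: public_settings and print_html2text_options are hypothetical helper names, not part of html2text. It assumes the options are stored as plain instance attributes, which is how ignore_links and bypass_tables are set in this post. The dump also contains internal parser state, so treat the official documentation as the authoritative list.

```python
def public_settings(obj):
    """Return the public, non-callable instance attributes of obj, sorted by name."""
    return {
        name: value
        for name, value in sorted(vars(obj).items())
        if not name.startswith("_") and not callable(value)
    }


def print_html2text_options():
    """Print every public attribute of a freshly created HTML2Text converter."""
    # Imported here so public_settings stays usable without html2text installed.
    import html2text as ht

    for name, value in public_settings(ht.HTML2Text()).items():
        print(f"{name} = {value!r}")
```

Calling print_html2text_options() prints one "name = value" line per attribute, which makes it easy to spot settings such as ignore_links before changing them.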
2. Save the output as Markdown text
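# coding=gbk
import html2text as ht
import re
text_maker = ht.HTML2Text()
# Option settings
text_maker.ignore_links = True    # drop link targets from the output
text_maker.bypass_tables = False  # render tables as Markdown, not raw HTML
path = r"C:/Users/dcg/Desktop/html/1.html"
# Read the HTML file; the with block closes it automatically
with open(path, 'r', encoding='utf8') as html_file:
    html_page = html_file.read()
text = text_maker.handle(html_page)
# Remove every "* <digits>" fragment from the converted text
cleaned = re.sub(r'\* \d+', '', text)
# Write the cleaned Markdown to 1.md in the current working directory
with open("1.md", "w", encoding='utf8') as md_file:
    md_file.write(cleaned)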
Result: the converted Markdown is written to 1.md in the current working directory.
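The re.sub call in the script above is easy to misread, so here is a small stand-alone sketch of what the pattern r'\* \d+' matches. The function name strip_numbered_bullets and the sample string are made up for illustration. The pattern presumably targets list items that contain only a number, but it also matches any bullet whose text merely starts with digits.

```python
import re

# Same pattern as in the script: a literal "*", one space, then one or more digits.
PATTERN = re.compile(r"\* \d+")


def strip_numbered_bullets(markdown_text):
    """Remove every '* <digits>' fragment from the text."""
    return PATTERN.sub("", markdown_text)


sample = "  * 1\n  * 2\ncode\n* item\n"
# Only the bullets followed by digits are removed; "* item" is left alone.
print(repr(strip_numbered_bullets(sample)))
```

Note that a line such as "* 2024 report" would lose its bullet and number as well, leaving " report". If that matters for your pages, a stricter pattern that anchors on the whole line may be a better fit.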