Reference link
Use Python to clean dead websites out of your bookmarks: test your own bookmarks now (chen801090's blog, CSDN)
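Key points
- Read the HTML file exported from the browser's bookmarks line by line with readlines().
- Parse each line with a regular expression: an h3 element is a folder name, an a element is a bookmark.
- Bookmark names can contain HTML-escaped characters, so decode them with html.unescape().
- Preserve the hierarchy by integer-dividing the number of leading spaces on each line by 4 (// 4).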
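The points above can be sketched as a small self-contained parser. This is only an illustration under stated assumptions: the SAMPLE text, the parse_bookmarks helper, and the slightly stricter patterns (anchored on leading whitespace) are hypothetical and not taken from the original post, and the sketch assumes the export indents each nesting level by four spaces.

```python
import html
import re

# Hypothetical sample in the usual bookmark-export layout (not from the original post).
SAMPLE = """\
<DL><p>
    <DT><H3 ADD_DATE="1641573350">Dev &amp; Tools</H3>
    <DL><p>
        <DT><A HREF="https://www.python.org/" ADD_DATE="1641573350">Python &amp; docs</A>
    </DL><p>
</DL><p>
"""

# Group 1 is the leading indentation; the other groups are the name and, for links, the URL.
H3_PATTERN = re.compile(r'^(\s*)<DT><H3[^>]*>(.*?)</H3>')
A_PATTERN = re.compile(r'^(\s*)<DT><A[^>]*?HREF="(.*?)"[^>]*>(.*?)</A>')


def parse_bookmarks(lines):
    """Return (kind, level, name, url) tuples for folders and links."""
    entries = []
    for line in lines:
        folder = H3_PATTERN.search(line)
        if folder:
            level = len(folder.group(1)) // 4  # assumes 4 spaces per nesting level
            entries.append(("folder", level, html.unescape(folder.group(2)), None))
            continue
        link = A_PATTERN.search(line)
        if link:
            level = len(link.group(1)) // 4
            entries.append(("link", level, html.unescape(link.group(3)), link.group(2)))
    return entries


entries = parse_bookmarks(SAMPLE.splitlines())
for kind, level, name, url in entries:
    # Indent the output by nesting level so the folder hierarchy stays visible.
    print("  " * level + f"[{kind}] {name}" + (f" -> {url}" if url else ""))
```

Lines that match neither pattern, such as the DL wrappers, are simply skipped, so the hierarchy is recovered from indentation alone rather than from tag nesting.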
Code example
import re
import html

# Read the exported bookmarks file into a list of lines
with open("./Fav20220108-003550.html", encoding='UTF-8') as bookmarks_f:
    booklists = bookmarks_f.readlines()

# Group 1 captures the leading indentation; the remaining groups capture
# the folder name (h3) or the URL and title (a)
h3_pattern = r'(.*?)<DT><H3 .*?>(.*?)</H3>'
a_pattern = r'(.*?)<DT><A .*?HREF="(.*?)" .*?>(.*?)</A>'

while len(booklists) > 0:
    bookmark = booklists.pop(0)
    dir_details = re.search(h3_pattern, bookmark)
    if dir_details:
        # Nesting depth: four spaces of indentation per level
        dir_degree = len(dir_details.group(1))//4 + 1
        dir_name = html.unescape(dir_details.group(2))
        print(d