Python regular expressions: removing all <xml> tags except the excluded ones

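This article explores how to use regular expressions to selectively match and replace HTML tags and attributes, excluding specific tags such as <em> and <strong> while treating tags that carry attributes differently. It covers matching and replacement strategies for three scenarios: basic tag matching, whitelist matching of specific attributes, and matching <a> tags that have attributes other than href or title. It also points out a small error in one replacement result and asks for a solution.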

1. Match tags except <em> and <strong>
Match:
</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
(empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>

result:
inikkk
nihao
<em ff>nihao</em nihao>
nihao
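
Since the title targets Python, pattern 1 can be tried with `re.sub`. The following is a minimal sketch rather than a definitive implementation; the names `PATTERN_1` and `strip_tags_except_em_strong` are hypothetical and do not come from the original post.

```python
import re

# Pattern 1 from the section above. The \b after (?:em|strong) means only the
# exact tag names "em" and "strong" are protected, so <embed> is still matched.
PATTERN_1 = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags_except_em_strong(html: str) -> str:
    """Delete every tag except <em>/<strong> (hypothetical helper name)."""
    return PATTERN_1.sub("", html)


sample = "\n".join([
    "<b fdjfkjdk>inikkk</b fdjkfjk>",
    "<i fff>nihao</i nihao>",
    "<em ff>nihao</em nihao>",
    "<big ff>nihao</big nihao>",
])

print(strip_tags_except_em_strong(sample))
```

Run on the four sample lines, this prints the same four result lines listed above, with the `<em ff>` line left intact.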

2. Match tags except <em> and <strong>, plus any tag that contains attributes (including <em> and <strong>)
Match:
</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
(empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>

result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
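
The difference from pattern 1 is the lookahead `(?:em|strong)\s*>`, which protects a tag only when the name is followed directly by `>`. A small Python sketch of this behaviour, with the hypothetical names `PATTERN_2` and `strip_tags_keep_bare_em_strong`:

```python
import re

# Pattern 2 from the section above: <em>/<strong> are protected only when the
# tag name is immediately followed by optional whitespace and ">".
PATTERN_2 = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags_keep_bare_em_strong(html: str) -> str:
    """Delete all tags except attribute-free <em>/<strong> (hypothetical name)."""
    return PATTERN_2.sub("", html)


print(strip_tags_keep_bare_em_strong("<strong>nihao</strong>"))
print(strip_tags_keep_bare_em_strong("<em ff>nihao</em nihao>"))
print(strip_tags_keep_bare_em_strong('<em class="x">nihao</em>'))
```

One consequence worth keeping in mind: a well-formed closing tag carries no attributes, so for `<em class="x">nihao</em>` the opening tag is deleted while `</em>` is kept, leaving an unpaired closing tag in the output.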

3. Whitelist specific attributes
Match all tags except <a>, <em>, and <strong>, with two exceptions:
<em> and <strong> tags that carry attributes are matched, and any <a> tag that has attributes other than href or title is matched.
Match:
</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
(empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>
<a nihao>fdjkf</nihao>
<a href="2222">fdjkf</nihao>

Result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
fdjkf
<a href="2222">fdjkf   <=== A small error: the closing </a> is also removed. Can anyone suggest a solution? Thanks.
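
A possible explanation for the missing closing tag, offered as a sketch rather than a confirmed fix: the last two sample lines close with `</nihao>` instead of `</a>`, and `</nihao>` is not a whitelisted tag, so the pattern deletes it. The Python check below assumes the attribute group is meant to be the non-capturing `(?:href|title)`; the names `PATTERN_3` and `strip_tags_whitelist_a` are hypothetical.

```python
import re

# Pattern 3, written with the non-capturing group (?:href|title). This is an
# assumption about the intended pattern: "(:href" would require a literal colon.
PATTERN_3 = re.compile(
    r"""</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)"""
    r"""[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags_whitelist_a(html: str) -> str:
    """Delete tags except bare <em>/<strong> and <a> with only href/title."""
    return PATTERN_3.sub("", html)


# Closing tag written as </nihao>, as in the example: it is deleted.
print(strip_tags_whitelist_a('<a href="2222">fdjkf</nihao>'))

# Closing tag written as </a>: the lookahead protects it and it is kept.
print(strip_tags_whitelist_a('<a href="2222">fdjkf</a>'))

# An <a> with a non-whitelisted attribute loses its opening tag only.
print(strip_tags_whitelist_a('<a href="x" onclick="y">fdjkf</a>'))
```

With a proper `</a>` in the input, the second call returns the string unchanged. The third call shows a remaining limitation: when the opening `<a>` is deleted because of an extra attribute, the matching `</a>` is still kept, since a single regex substitution does not pair opening and closing tags. Handling that case would likely need a second pass or an HTML parser.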