Python: using regular expressions to remove every <xml> tag except <em> and <strong>

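This post looks at how to use regular expressions to match and replace HTML tags and attributes while sparing specific tags such as <em> and <strong>, and while treating tags that carry attributes differently. It covers three scenarios: basic tag matching, whitelisting specific attributes, and matching <a> tags whose attributes are anything other than href or title. It also points out a small error in one replacement result and asks for a solution.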


1. Match tags except <em> and <strong>
Match:
</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
NULL (i.e. replace with the empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>

result:
inikkk
nihao
<em ff>nihao</em nihao>
nihao
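
The substitution above can be tried in Python with re.sub. This is only a minimal sketch: the names TAG_EXCEPT_EM_STRONG and strip_tags are made up for illustration and do not come from the original post.

```python
import re

# Pattern from section 1: any tag whose name is not exactly "em" or "strong".
# The \b after the names means <embed> is still matched, while <em ff> is not.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags(html: str) -> str:
    """Delete every matched tag, i.e. replace it with the empty string."""
    return TAG_EXCEPT_EM_STRONG.sub("", html)


if __name__ == "__main__":
    for line in (
        "<b fdjfkjdk>inikkk</b fdjkfjk>",
        "<i fff>nihao</i nihao>",
        "<em ff>nihao</em nihao>",
        "<big ff>nihao</big nihao>",
    ):
        print(strip_tags(line))
```

Because the lookahead only checks the tag name, an <em> or <strong> tag survives here even when it carries attributes.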

2. Match all tags except bare <em> and <strong>; any tag that contains attributes is matched, including <em> and <strong>
Match:
</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
NULL (i.e. replace with the empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>

result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
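
The same idea for this second pattern, again as a rough Python sketch. The names TAG_EXCEPT_BARE_EM_STRONG and strip_tags_keep_bare are hypothetical. The only change from section 1 is the lookahead: it now requires the tag to close right after the name (optional whitespace, then >), so an <em> or <strong> with attributes no longer escapes the match.

```python
import re

# Pattern from section 2: only attribute-free <em>/<strong> (and their
# closing tags) are spared; everything else, including <em ff>, is matched.
TAG_EXCEPT_BARE_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags_keep_bare(html: str) -> str:
    """Delete every matched tag, keeping only bare <em> and <strong>."""
    return TAG_EXCEPT_BARE_EM_STRONG.sub("", html)


if __name__ == "__main__":
    for line in (
        "<em ff>nihao</em nihao>",
        "<strong>nihao</strong>",
    ):
        print(strip_tags_keep_bare(line))
```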

3. Whitelist specific attributes
Match all tags except <a>, <em>, and <strong>, with two exceptions:
any <em> or <strong> tag that carries attributes is still matched, and any <a> tag that has attributes other than href or title is matched as well.
Match:
</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace:
NULL (i.e. replace with the empty string)

e.g.
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>
<a nihao>fdjkf</nihao>
<a href="2222">fdjkf</nihao>

Result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
fdjkf
<a href="2222">fdjkf   <=== A small error: the closing </a> is also removed. Can anyone suggest a solution? Thanks.
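
One possible way to approach the closing-tag question, offered as a sketch rather than a definitive answer. A single pattern judges each tag in isolation, so it cannot know whether a closing </a> belongs to an opening <a> that was kept or dropped. (In the sample above the anchors are closed with </nihao>, which is not whitelisted and is therefore removed like any other tag.) The sketch below assumes it is acceptable to use a replacement callback with re.sub: it matches every tag, decides whether to keep it using the whitelist from this section (written with a non-capturing (?:href|title) group), and makes each </a> follow the decision taken for its opening <a>. The names ANY_TAG, KEEP, TAG_NAME and clean_tags are hypothetical.

```python
import re

# Any opening or closing tag, using the same quote-aware body as the post.
ANY_TAG = re.compile(r"""</?[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>""")

# Tags to keep: bare <em>/<strong>, and <a> carrying only href/title attributes.
KEEP = re.compile(
    r"""</?(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>"""
)

TAG_NAME = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)")


def clean_tags(html: str) -> str:
    # One entry per unclosed <a>: True if that opening tag was kept.
    open_anchors = []

    def replace(match):
        tag = match.group(0)
        name = TAG_NAME.match(tag).group(1)
        keep = KEEP.fullmatch(tag) is not None
        if name == "a":
            if not tag.startswith("</"):
                open_anchors.append(keep)
            elif open_anchors:
                # Mirror the decision made for the matching opening <a>.
                keep = open_anchors.pop()
            else:
                # A closing </a> with no opening tag is dropped.
                keep = False
        return tag if keep else ""

    return ANY_TAG.sub(replace, html)


if __name__ == "__main__":
    print(clean_tags('<a nihao>fdjkf</a>'))
    print(clean_tags('<a href="2222">fdjkf</a>'))
```

This keeps the regular expressions from the post but moves the pairing logic into ordinary code. For badly nested or otherwise malformed markup, a real HTML parser would likely be more robust than any regex-based approach.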
