HTML Agility Pack: remove unwanted tags without removing their content?

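This post looks at how to remove unnecessary tags from an HTML string while keeping specific tags such as `<strong>`, `<em>`, and `<u>`. The answers share several C# implementations, both queue-based and extension-method-based, for cleaning up HTML content with the Html Agility Pack.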


5 answers:

Answer 0 (score: 53)

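I wrote an algorithm based on Oded's suggestion. Here it is. It works like a charm.

It removes every tag except strong, em, and u, and leaves raw text nodes untouched. The content of a removed tag is moved up into its parent, so no text is lost.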

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new string[] { "strong", "em", "u" };

        // Start from the top-level elements and text nodes.
        var nodes = new Queue<HtmlNode>(document.DocumentNode.SelectNodes("./*|./text()"));
        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
            {
                var childNodes = node.SelectNodes("./*|./text()");

                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        // Queue the child for inspection and move it up next to its parent.
                        nodes.Enqueue(child);
                        parentNode.InsertBefore(child, node);
                    }
                }

                // Drop the unwanted tag itself; its children have already been moved.
                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }

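Answer 1 (score: 11)

How to recursively remove a list of unwanted HTML tags from an HTML string

I took @mathias's answer and improved his extension method so that you can supply the tags to strip as a List<string> (for example {"a","p","hr"}). I also changed the logic so that it works correctly on nested tags: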

    // Extension method: must be declared inside a static class.
    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
    {
        if (String.IsNullOrEmpty(html))
        {
            return html;
        }

        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");

        if (tryGetNodes == null || !tryGetNodes.Any())
        {
            return html;
        }

        var nodes = new Queue<HtmlNode>(tryGetNodes);

        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            var childNodes = node.SelectNodes("./*|./text()");

            // Always queue the children, so tags nested inside kept tags are visited too.
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    nodes.Enqueue(child);
                }
            }

            if (unwantedTags.Any(tag => tag == node.Name))
            {
                // Move the children up before removing the unwanted tag.
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        parentNode.InsertBefore(child, node);
                    }
                }

                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }

Answer 2 (score: 8)

Try the following; you may find it a bit neater than the other proposed solutions:

    // Extension methods: must be declared inside a static class.
    // Requires: using HtmlAgilityPack;
    public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
    {
        HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
        if (nodes == null)
            return 0;

        foreach (HtmlNode node in nodes)
            node.RemoveButKeepChildren();

        // Number of nodes that were unwrapped.
        return nodes.Count;
    }

    public static void RemoveButKeepChildren(this HtmlNode node)
    {
        // Move each child in front of the node, then remove the node itself.
        foreach (HtmlNode child in node.ChildNodes)
            node.ParentNode.InsertBefore(child, node);

        node.Remove();
    }

    public static bool TestYourSpecificExample()
    {
        string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        document.DocumentNode.RemoveNodesButKeepChildren("//div");
        document.DocumentNode.RemoveNodesButKeepChildren("//p");

        return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
    }

Answer 3 (score: 4)

Before removing the node, get its parent node and the parent's InnerText. Then remove the node and assign the InnerText back to the parent.

    // node is the HtmlNode to remove; doc is the HtmlDocument that owns it.
    var parent = node.ParentNode;
    var innerText = parent.InnerText;

    node.Remove();
    parent.AppendChild(doc.CreateTextNode(innerText));

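Answer 4 (score: 2)

If you don't want to use the Html Agility Pack and still want to remove unwanted HTML tags, you can do the following. Note that this strips every tag and returns plain text.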

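    // Requires: using System; using System.Text.RegularExpressions; using System.Web;
    public static string RemoveHtmlTags(string strHtml)
    {
        // Remove anything that looks like a tag, including tags that span several lines.
        string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);

        // Turn entities such as &amp; back into plain characters.
        strText = HttpUtility.HtmlDecode(strText);

        // Collapse runs of whitespace into a single space.
        strText = Regex.Replace(strText, @"\s+", " ");

        return strText;
    }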
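
The regex approach above strips every tag. For the case in the original question, where strong, em, and u should survive, a variant with a whitelist could look like the sketch below. This is not part of any answer: the class name `TagStripper`, the method name `StripTagsExcept`, and the `keep` parameter are made up for illustration. It uses only `System.Text.RegularExpressions`, and it assumes reasonably simple markup: a regex is not an HTML parser, so comments, script blocks, and attribute values containing `>` are not handled, and kept tags retain whatever attributes they had.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches an opening or closing tag and captures its name,
    // e.g. "<p class='x'>" captures "p" and "</em>" captures "em".
    private static readonly Regex TagPattern =
        new Regex(@"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", RegexOptions.Compiled);

    // Hypothetical helper: removes every tag whose name is not in 'keep',
    // leaving the text between the tags in place.
    public static string StripTagsExcept(string html, ISet<string> keep)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Keep the whole tag when its name is whitelisted, otherwise drop it.
        return TagPattern.Replace(
            html,
            m => keep.Contains(m.Groups[1].Value) ? m.Value : string.Empty);
    }
}
```

With a case-insensitive set such as `new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "strong", "em", "u" }`, calling `TagStripper.StripTagsExcept("<p>my <strong>bold</strong> and <span>plain</span> text</p>", keep)` returns `my <strong>bold</strong> and plain text`. For untrusted or messy input, the Html Agility Pack answers above are likely the safer choice.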