【Problem】
In C#, strip the tags from an HTML string and remove the HTML comments at the same time.
【Solution Process】
1. Reference:
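First try:
public string htmlRemoveTag(string html)
{
    // htmlDoc can never be null right after "new", so guard the input instead
    // (LoadHtml throws on a null string)
    if (string.IsNullOrEmpty(html))
    {
        return "";
    }
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);
    // concatenate the text of every top-level node
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}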
As a result, all the tags are removed.
However, the HTML comment: Frigidaire Mini Air Conditioner Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet. It quickly cools a room on hot days and quie......
is still left in the output.
2. Next, remove the comments as well.
Reference:
Then use:
public string htmlRemoveTag(string html)
{
    // htmlDoc can never be null right after "new", so guard the input instead
    if (string.IsNullOrEmpty(html))
    {
        return "";
    }
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);
    // 1. remove all comments
    // (1) get all comment nodes using XPath; SelectNodes returns null when nothing matches
    var comments = htmlDoc.DocumentNode.SelectNodes("//comment()");
    if (comments != null)
    {
        foreach (HtmlAgilityPack.HtmlNode comment in comments)
        {
            // (2) remove the comment node itself
            comment.ParentNode.RemoveChild(comment);
        }
    }
    // 2. get all content
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}
This achieves the goal. The result is the HTML content with no tags and no comments: Frigidaire Mini Air Conditioner Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet. It quickly cools a room on hot days and quiet operation keeps you cool without keeping you awake. This unit features mechanical rotary controls and top, full-width, 2-way air direction control. The antimicrobial mesh filter with side, slide-out access cleans the air removing harmful bacteria. Low voltage start-up conserves energy and saves you money ......
【Summary】
To strip the tags from HTML without keeping the comments, you can use:
using HtmlAgilityPack;
public string htmlRemoveTag(string html)
{
    // htmlDoc can never be null right after "new", so guard the input instead
    if (string.IsNullOrEmpty(html))
    {
        return "";
    }
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);
    // 1. remove all comments
    // (1) get all comment nodes using XPath; SelectNodes returns null when nothing matches
    HtmlNodeCollection comments = htmlDoc.DocumentNode.SelectNodes("//comment()");
    if (comments != null)
    {
        foreach (HtmlNode comment in comments)
        {
            // (2) remove the comment node itself
            comment.ParentNode.RemoveChild(comment);
        }
    }
    // 2. get all content
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}
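For a quick check of the approach, here is a minimal, self-contained sketch of how such a helper might be called. The class name HtmlTextDemo, the method name StripTagsAndComments, and the sample HTML string are made up for illustration and are not part of the original code. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlTextDemo
{
    // Hypothetical helper: same idea as htmlRemoveTag, written as a static method.
    public static string StripTagsAndComments(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
            {
                comment.ParentNode.RemoveChild(comment);
            }
        }

        // InnerText of the document node concatenates the text of all remaining nodes.
        return doc.DocumentNode.InnerText;
    }

    public static void Main()
    {
        // Made-up sample input containing one comment and nested tags.
        string sample = "<div><!-- hidden note --><p>Hello <b>world</b></p></div>";
        Console.WriteLine(StripTagsAndComments(sample));
    }
}
```

Note that InnerText does not decode HTML entities such as &amp;amp;. If decoded text is needed, passing the result through HtmlEntity.DeEntitize from the same library is one option.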