Parsing HTML with HTML Agility Pack and ScrapySharp

License: CC 4.0 BY-SA

Tags: HTML parsing, ScrapySharp, HtmlAgilityPack

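This article introduces two .NET libraries, HtmlAgilityPack 1.8.0 and ScrapySharp 2.6.2. HtmlAgilityPack parses HTML documents and supports XPath and XSLT. ScrapySharp builds on HtmlAgilityPack and adds extensions for selecting elements with CSS selectors. The snippets below show how to use both libraries for web scraping and data extraction.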

HtmlAgilityPack 1.8.0

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

PM> Install-Package HtmlAgilityPack -Version 1.8.0

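// html is a string that holds the page markup
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNode rootnode = doc.DocumentNode;
// SelectSingleNode returns null when no node matches the XPath expression
HtmlAgilityPack.HtmlNode row = rootnode.SelectSingleNode("//*[@id='content']/div[3]/div[1]");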
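
The snippet above assumes an existing html string. As a minimal self-contained sketch, the following console program loads a small piece of markup and queries it with XPath. The sample markup, the class name XPathDemo, and the XPath expressions are made up for illustration; only the HtmlAgilityPack calls (LoadHtml, SelectSingleNode, SelectNodes, GetAttributeValue) come from the library.

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Hypothetical sample markup; in practice this would be a downloaded page.
        string html =
            "<html><body>" +
            "<div id='content'>" +
            "<ul><li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>" +
            "</div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so check before use.
        HtmlNode content = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
        if (content == null)
        {
            Console.WriteLine("No element with id 'content' was found.");
            return;
        }

        // SelectNodes also returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = content.SelectNodes(".//a");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // GetAttributeValue falls back to the given default if the attribute is missing.
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(link.InnerText + " -> " + href);
            }
        }
    }
}
```

Matching on an id or class, as done here, tends to survive page layout changes better than a positional path such as div[3]/div[1].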

ScrapySharp 2.6.2

Scraping framework containing:
- a web client able to simulate a web browser
- an HtmlAgilityPack extension to select elements using CSS selectors (like jQuery)

PM> Install-Package ScrapySharp -Version 2.6.2
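
The web client mentioned in the package description could be used roughly as follows. This is a sketch based on the ScrapySharp project README as I recall it (ScrapingBrowser, NavigateToPage, and the Html property of WebPage), so the exact members may differ between versions. The URL is a placeholder and the class name BrowserDemo is hypothetical.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

class BrowserDemo
{
    static void Main()
    {
        // ScrapingBrowser is ScrapySharp's web client that simulates a browser.
        var browser = new ScrapingBrowser();

        // Placeholder URL (an assumption); replace it with the page to scrape.
        WebPage page = browser.NavigateToPage(new Uri("https://example.com/"));

        // page.Html is the parsed document root as an HtmlAgilityPack HtmlNode,
        // so the CssSelect extension can be applied to it directly.
        foreach (HtmlNode link in page.Html.CssSelect("a"))
        {
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```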

// requires: using ScrapySharp.Extensions;
// html is an HtmlAgilityPack.HtmlNode, for example doc.DocumentNode
html.CssSelect("div"); // all div elements
html.CssSelect("div.content"); // all div elements with CSS class 'content'
html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes
html.CssSelect("#postPaging"); // all HTML elements with the id postPaging
html.CssSelect("div#postPaging.testClass"); // div elements with the id postPaging and CSS class testClass
html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with CSS class 'content'
html.CssSelect("input[type=text].login"); // text inputs with CSS class login
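
To show where the html variable in the selector list comes from, here is a minimal sketch that parses a string with HtmlAgilityPack and then applies the CssSelect extension from ScrapySharp.Extensions. The sample markup and the class name CssSelectDemo are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class CssSelectDemo
{
    static void Main()
    {
        // Hypothetical sample markup: one p.para directly inside div.content,
        // and another one nested one level deeper.
        string markup =
            "<div class='content'>" +
            "<p class='para'>Direct child</p>" +
            "<div><p class='para'>Nested deeper</p></div>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);
        HtmlNode html = doc.DocumentNode;

        // The child combinator (>) restricts matches to direct children of div.content.
        IEnumerable<HtmlNode> directChildren = html.CssSelect("div.content > p.para");
        foreach (HtmlNode p in directChildren)
        {
            Console.WriteLine("child: " + p.InnerText);
        }

        // Without the combinator, the selector matches p.para at any depth.
        foreach (HtmlNode p in html.CssSelect("p.para"))
        {
            Console.WriteLine("any depth: " + p.InnerText);
        }
    }
}
```

CssSelect returns a sequence of HtmlNode objects, so the results can be combined with the HtmlAgilityPack members used earlier, such as InnerText and GetAttributeValue.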

For more on how to use CSS selectors, see the W3 page: CSS Selector Reference