HtmlAgilityPack 1.8.0
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't have to understand XPath or XSLT to use it). It is a .NET code library that lets you parse HTML files found "out on the web". The parser is very tolerant of malformed "real world" HTML. The object model is very similar to that of System.Xml, but for HTML documents (or streams).
PM> Install-Package HtmlAgilityPack -Version 1.8.0
using HtmlAgilityPack;

// html is a string that already holds the page markup
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode rootnode = doc.DocumentNode;
// SelectSingleNode returns null when the XPath matches nothing
HtmlNode row = rootnode.SelectSingleNode("//*[@id='content']/div[3]/div[1]");
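The snippet above assumes an `html` string already exists. As a minimal, self-contained sketch of the same XPath approach, the console program below parses a small piece of markup and lists its links. The sample markup and the class name `XPathDemo` are made up for illustration; only the HtmlAgilityPack calls (`LoadHtml`, `SelectSingleNode`, `SelectNodes`, `GetAttributeValue`) come from the library.

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Hypothetical sample markup, invented for this sketch.
        string html = "<html><body><div id='content'>"
                    + "<a href='/a'>First</a><a href='/b'>Second</a>"
                    + "</div></body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        HtmlNode content = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
        if (content == null)
        {
            Console.WriteLine("No element with id 'content' found.");
            return;
        }

        // SelectNodes also returns null (not an empty collection) on no match.
        HtmlNodeCollection links = content.SelectNodes(".//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // The second argument is the fallback when the attribute is missing.
                string href = link.GetAttributeValue("href", string.Empty);
                Console.WriteLine(link.InnerText + " -> " + href);
            }
        }
    }
}
```

The null checks matter in practice: a positional XPath such as `div[3]/div[1]` breaks easily when the page layout changes, and both select methods signal "no match" with null.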
ScrapySharp 2.6.2
A scraping framework containing:
- a web client able to simulate a web browser
- an HtmlAgilityPack extension to select elements using CSS selectors (like jQuery)
PM> Install-Package ScrapySharp -Version 2.6.2
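The web client mentioned above can be sketched roughly as follows. This is an illustration under assumptions rather than a reference usage: the URL is a placeholder, the class name `BrowserDemo` is invented, and the program needs network access to run. `ScrapingBrowser`, `NavigateToPage` and `WebPage.Html` are taken from ScrapySharp's `ScrapySharp.Network` namespace.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Network;

class BrowserDemo
{
    static void Main()
    {
        // ScrapingBrowser is ScrapySharp's browser-like web client.
        ScrapingBrowser browser = new ScrapingBrowser();

        // Placeholder URL (an assumption for this sketch); replace with the page to scrape.
        WebPage page = browser.NavigateToPage(new Uri("http://example.com/"));

        // page.Html is the root HtmlNode of the downloaded document;
        // it can be queried with XPath or with the CssSelect extension.
        HtmlNode html = page.Html;
        Console.WriteLine("Downloaded " + html.OuterHtml.Length + " characters of HTML.");
    }
}
```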
using ScrapySharp.Extensions;

// html is an HtmlNode, for example doc.DocumentNode
html.CssSelect("div"); // all div elements
html.CssSelect("div.content"); // all div elements with CSS class 'content'
html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes 'widget' and 'monthlist'
html.CssSelect("#postPaging"); // all HTML elements with the id 'postPaging'
html.CssSelect("div#postPaging.testClass"); // all div elements with the id 'postPaging' and CSS class 'testClass'
html.CssSelect("div.content > p.para"); // p elements with CSS class 'para' that are direct children of div elements with CSS class 'content'
html.CssSelect("input[type=text].login"); // text inputs with CSS class 'login'
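The selector calls above return the matching nodes but do not show what to do with them. The following is a minimal sketch of a complete program, assuming a small made-up markup string and an invented class name `CssSelectDemo`; `CssSelect` is the extension method from `ScrapySharp.Extensions` and yields `HtmlNode` objects that can be enumerated.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class CssSelectDemo
{
    static void Main()
    {
        // Hypothetical sample markup, invented for this sketch.
        string markup = "<div class='content'>"
                      + "<p class='para'>Hello</p><p>Other</p>"
                      + "</div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(markup);
        HtmlNode html = doc.DocumentNode;

        // Enumerate p.para elements that are direct children of div.content.
        foreach (HtmlNode p in html.CssSelect("div.content > p.para"))
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```

Because the result is an ordinary enumerable of `HtmlNode`, it combines with LINQ and with the HtmlAgilityPack members used earlier, such as `InnerText` and `GetAttributeValue`.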
For more on using CSS selectors, see the W3 page: CSS Selector Reference.
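This article introduces HtmlAgilityPack 1.8.0 and ScrapySharp 2.6.2, two powerful .NET libraries. HtmlAgilityPack parses HTML files and supports XPath and XSLT operations; ScrapySharp provides an extension built on HtmlAgilityPack that selects elements using CSS selectors. The article shows how to use both libraries for web scraping and data parsing.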