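What makes this question so interesting is that HTML looks and smells just like XML, and XML is far easier to work with programmatically because of its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question to any XML parser, it will balk at a variety of infractions. That said, the desired result can be achieved with a single line of PowerShell. This one returns the full text of the href: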
Select-NodeContent $doc.DocumentNode "//a/@href"
This one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
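The catch, however, is the overhead and setup needed to run that one line of code. You need to:
>Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
>Install PowerShell Community Extensions (PSCX) if you want to parse a live web page; it supplies the Get-HttpResource cmdlet used at the end of the code.
>Understand XPath well enough to construct a navigable path to your target node.
>Understand regular expressions well enough to extract a substring from your target node.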
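For the regular-expression requirement, the capture step can be tried on its own, without HtmlAgilityPack. The following is a minimal sketch: the href value is hypothetical and invented purely for illustration, and the pattern is the one from the second one-liner above.

```powershell
# Hypothetical href value, invented for illustration; not taken from the question.
$href  = 'http://example.com/backups/IP_PHONE_BACKUP-2012-01-01.zip'
$regex = 'IP_PHONE_BACKUP-(.*)\.zip'

# -match fills the automatic $matches table; index 1 holds the first capture group.
if ($href -match $regex) {
    $captured = $matches[1]
}
else {
    $captured = ''   # plays the same fallback role as $default in Select-NodeContent
}
$captured
```

With this input the captured group is the text between `IP_PHONE_BACKUP-` and `.zip`. Because `(.*)` is greedy, an href containing more than one `.zip` would capture up to the last one, so a tighter pattern may be preferable for messier input.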
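With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML either from a file or from the web, depending on your needs.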
Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName,"bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath
function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node, [string] $xpath, [string] $regex, [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        $xpath, $attribute = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        # Plain if/else is used here so the function does not depend on
        # the ?: (Invoke-Ternary) alias, which is only available with PSCX.
        if ($null -ne $resultNode) { $text = $resultNode.Attributes[$attribute].Value }
        else { $text = $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        if ($null -ne $resultNode) { $text = $resultNode.InnerText }
        else { $text = $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}
$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page