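1. Getting all elements with a given tag
Parser parser = new Parser();
parser.setURL(path);
// Alternatively, parse an HTML string: parser.setInputHTML(node.toHtml());
parser.setEncoding("UTF-8");
// Get all elements with a given tag name
NodeFilter filter = new TagNameFilter("A");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
// Alternatively, create an HtmlPage and read the elements from it
// (if the parser was already consumed above, call parser.reset() first)
HtmlPage page = new HtmlPage(parser);
parser.visitAllNodesWith(page);
NodeList list = page.getBody();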
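The tag-filter approach above can be tried on an in-memory HTML string rather than a URL. The following is a minimal sketch, not the only way to do it: the class name LinkExtractor and the method extractHrefs are hypothetical names chosen for illustration, and it assumes the htmlparser jar is on the classpath. It reads the raw href attribute instead of resolving the link, and wraps ParserException in an unchecked exception only to keep call sites short.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class LinkExtractor {

    /** Returns the raw href value of every <a> element found in the given HTML. */
    static List<String> extractHrefs(String html) {
        try {
            // Parse from a string instead of a URL.
            Parser parser = Parser.createParser(html, "UTF-8");
            // Collect every <a> tag, at any nesting depth.
            NodeList anchors = parser.extractAllNodesThatMatch(new TagNameFilter("A"));

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.size(); i++) {
                Node node = anchors.elementAt(i);
                if (node instanceof Tag) {
                    String href = ((Tag) node).getAttribute("href");
                    if (href != null) {
                        hrefs.add(href); // skip anchors without an href
                    }
                }
            }
            return hrefs;
        } catch (ParserException e) {
            throw new IllegalStateException("HTML could not be parsed", e);
        }
    }
}
```

A call such as LinkExtractor.extractHrefs(html) returns the href values in document order; anchors that only carry a name attribute are skipped.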
2. Common Node operations
String text = certainNode.toPlainTextString(); // plain text of the node (tags excluded)
NodeList children = certainNode.getChildren(); // child nodes (may be null if the node has none)
Node parent = certainNode.getParent(); // parent node
Node previous = certainNode.getPreviousSibling(); // previous sibling node
Node next = certainNode.getNextSibling(); // next sibling node
String html = certainNode.toHtml(); // HTML of the node (tags included)
boolean isLink = certainNode instanceof LinkTag; // type check; LinkTag is a built-in class in org.htmlparser.tags
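To see these navigation calls working together, here is a small sketch that looks up one tag and reports on its surroundings. The names NodeNavigation, firstTag, and summary are hypothetical and exist only for this example; it assumes an htmlparser version whose Node interface provides getPreviousSibling() and getNextSibling(), as used above.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, for illustration only.
final class NodeNavigation {

    /** Returns the first element with the given tag name, or null if there is none. */
    static Node firstTag(String html, String tagName) {
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList matches = parser.extractAllNodesThatMatch(new TagNameFilter(tagName));
            return matches.size() > 0 ? matches.elementAt(0) : null;
        } catch (ParserException e) {
            throw new IllegalStateException("HTML could not be parsed", e);
        }
    }

    /** Describes a node in terms of its text, parent, and siblings. */
    static String summary(Node node) {
        Node parent = node.getParent();
        Node next = node.getNextSibling();
        return "text=" + node.toPlainTextString()
                + ", parentIsDiv=" + (parent instanceof Div)
                + ", hasPrevious=" + (node.getPreviousSibling() != null)
                + ", nextText=" + (next == null ? null : next.toPlainTextString());
    }
}
```

Note that whitespace between tags is parsed as text nodes, so in real pages a sibling is often a TextNode rather than the next element.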
3. Common NodeList operations
int size = certainNodeList.size(); // number of nodes in the list
Node n = certainNodeList.elementAt(i); // node at index i
4. Iterating over a NodeList
SimpleNodeIterator iterator = nodeList.elements();
while (iterator.hasMoreNodes()) {
    Node node = iterator.nextNode();
    ...
}
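The iteration pattern above can be filled in with a concrete loop body. As a rough sketch, the hypothetical helper below (NodeListTexts.textsOf is a name made up for this example) collects the trimmed plain text of every element with a given tag name, again parsing from a string and assuming the htmlparser jar is available.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.SimpleNodeIterator;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class NodeListTexts {

    /** Returns the trimmed plain text of every element with the given tag name. */
    static List<String> textsOf(String html, String tagName) {
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList nodes = parser.extractAllNodesThatMatch(new TagNameFilter(tagName));

            List<String> texts = new ArrayList<>(nodes.size());
            // Walk the list with the iterator returned by elements().
            SimpleNodeIterator iterator = nodes.elements();
            while (iterator.hasMoreNodes()) {
                Node node = iterator.nextNode();
                texts.add(node.toPlainTextString().trim());
            }
            return texts;
        } catch (ParserException e) {
            throw new IllegalStateException("HTML could not be parsed", e);
        }
    }
}
```

An index-based loop over size() and elementAt(i) would work equally well; the iterator form simply mirrors the snippet above.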
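5. Ways to check the type of a node
1) node.getClass().toString().equals("class org.htmlparser.nodes.TextNode"); // compare the class name
2) node.toHtml().trim().toLowerCase().startsWith("<ul"); // inspect the start of the node's HTML
3) node instanceof HeadingTag; // instanceof check against a tag class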
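Of the three approaches, instanceof is usually the least fragile, since it does not depend on string formatting. The sketch below classifies the top-level nodes of a document that way. NodeKinds, kindOf, and topLevelKinds are hypothetical names used only here, and the sketch assumes the default node factory, which maps h1 to h6 to HeadingTag and ul/ol to BulletList.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.BulletList;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class NodeKinds {

    /** Classifies a node using instanceof checks. */
    static String kindOf(Node node) {
        if (node instanceof TextNode) return "text";
        if (node instanceof HeadingTag) return "heading";
        if (node instanceof BulletList) return "list";
        if (node instanceof Tag) return "tag:" + ((Tag) node).getTagName();
        return "other";
    }

    /** Classifies every top-level node of the given HTML, in document order. */
    static List<String> topLevelKinds(String html) {
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            // A null filter keeps every top-level node.
            NodeList nodes = parser.parse(null);
            List<String> kinds = new ArrayList<>();
            for (int i = 0; i < nodes.size(); i++) {
                kinds.add(kindOf(nodes.elementAt(i)));
            }
            return kinds;
        } catch (ParserException e) {
            throw new IllegalStateException("HTML could not be parsed", e);
        }
    }
}
```

The order of the checks matters: the specific classes are tested before the general Tag interface, because HeadingTag and BulletList are also tags.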
Appendix:
Tag classes in htmlparser (package org.htmlparser.tags):
Reference: http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/package-summary.html
AppletTag | Represents an <applet> tag.
BaseHrefTag | Represents a <base> tag.
BodyTag | A body tag.
Bullet | A bullet tag.
BulletList | A bullet list tag.
CompositeTag | The base class for tags that have an end tag.
DefinitionList | A definition list tag (dl).
DefinitionListBullet | A definition list bullet tag (either dd or dt).
Div | A div tag.
DoctypeTag | The HTML document declaration tag; identifies <!DOCTYPE> tags.
FormTag | Represents a form tag.
FrameSetTag | Identifies a frame set tag.
FrameTag | Identifies a frame tag.
HeadingTag | A heading (h1 - h6) tag.
HeadTag | A head tag.
Html | An html tag.
ImageTag | Identifies an image tag.
InputTag | An input tag in a form.
JspTag | Identifies JSP/ASP tags like <%...%>.
LabelTag | A label tag.
LinkTag | Identifies a link tag.
MetaTag | A meta tag.
ObjectTag | Represents an <object> tag.
OptionTag | An option tag within a form.
ParagraphTag | A paragraph (p) tag.
ProcessingInstructionTag | The XML processing instructions like <?xml ...
ScriptTag | A script tag.
SelectTag | A select tag within a form.
Span | A span tag.
StyleTag | Represents a <style> tag.
TableColumn | A table column tag.
TableHeader | A table header tag.
TableRow | A table row tag.
TableTag | A table tag.
TextareaTag | A text area tag within a form.
TitleTag | A title tag.