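1. Getting all elements with a given tag
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.HtmlPage;

Parser parser = new Parser();
parser.setURL(path);
//parser.setInputHTML(node.toHtml());
parser.setEncoding("UTF-8");
// Get all elements with a given tag name
NodeFilter filter = new TagNameFilter("A");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
// The pass above consumed the input, so reset the parser before parsing again
parser.reset();
// Create an HtmlPage visitor to get page-level elements
HtmlPage page = new HtmlPage(parser);
parser.visitAllNodesWith(page);
NodeList list = page.getBody();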
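2. Common Node operations
String text = certainNode.toPlainTextString(); // plain text content of the node (tags excluded)
NodeList children = certainNode.getChildren(); // child nodes (may be null for a node without children)
Node parent = certainNode.getParent(); // parent node
Node previous = certainNode.getPreviousSibling(); // previous sibling node
Node next = certainNode.getNextSibling(); // next sibling node
String html = certainNode.toHtml(); // HTML content of the node (tags included)
boolean isLink = certainNode instanceof LinkTag; // whether the node is a given tag type; LinkTag is built in, under org.htmlparser.tags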
3. Common NodeList operations
int size = certainNodeList.size(); // number of nodes in the list
Node n = certainNodeList.elementAt(i); // node at index i
4. Iterating over a NodeList
SimpleNodeIterator iterator = nodeList.elements();
while (iterator.hasMoreNodes()) {
    Node node = iterator.nextNode();
    ...
}
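5. Ways to check the type of a node
1) node.getClass().toString().equals("class org.htmlparser.nodes.TextNode"); // exact class match by name
2) node.toHtml().trim().toLowerCase().startsWith("<ul"); // check the leading tag in the node's HTML
3) node instanceof HeadingTag; // type check that also accepts subclasses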
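The three checks behave differently at the edges. The following is a minimal sketch of that, using only the JDK so it runs without htmlparser on the classpath. The classes Node, TextNode and SpecialTextNode are hypothetical stand-ins, not the real org.htmlparser types, and startsWithTag is an illustrative helper that is not part of the library.

```java
import java.util.Locale;

public class NodeTypeChecks {
    // Hypothetical stand-ins for parser node classes (NOT the real org.htmlparser types).
    public static class Node {}
    public static class TextNode extends Node {}
    public static class SpecialTextNode extends TextNode {}

    // Method 1 style: compare the runtime class name; subclasses do not match.
    public static boolean isExactlyTextNode(Node node) {
        return node.getClass().getName().equals(TextNode.class.getName());
    }

    // Method 3 style: instanceof also accepts subclasses.
    public static boolean isTextNode(Node node) {
        return node instanceof TextNode;
    }

    // Method 2 style: check the leading tag of an HTML string, but require a
    // delimiter after the tag name so that "<ul" does not match a longer name.
    public static boolean startsWithTag(String html, String tagName) {
        String s = html.trim().toLowerCase(Locale.ROOT);
        String prefix = "<" + tagName.toLowerCase(Locale.ROOT);
        if (!s.startsWith(prefix) || s.length() == prefix.length()) {
            return false;
        }
        char next = s.charAt(prefix.length());
        return next == '>' || next == '/' || Character.isWhitespace(next);
    }

    public static void main(String[] args) {
        Node sub = new SpecialTextNode();
        System.out.println(isExactlyTextNode(sub));                      // false
        System.out.println(isTextNode(sub));                             // true
        System.out.println(startsWithTag("  <UL class=\"nav\">", "ul")); // true
        System.out.println("<ulx>".startsWith("<ul"));                   // true
        System.out.println(startsWithTag("<ulx>", "ul"));                // false
    }
}
```

The class-name comparison suits cases where exactly one class should match. instanceof is usually the simpler choice when subclasses should count too. The string check is handy when only the HTML text is available, at the cost of being sensitive to how that text is written.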
Appendix:
htmlparser tags:
reference: http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/package-summary.html
| Class | Description |
| --- | --- |
| AppletTag | AppletTag represents an <Applet> tag. |
| BaseHrefTag | BaseHrefTag represents an <Base> tag. |
| BodyTag | A Body Tag. |
| Bullet | A bullet tag. |
| BulletList | A bullet list tag. |
| CompositeTag | The base class for tags that have an end tag. |
| DefinitionList | A definition list tag (dl). |
| DefinitionListBullet | A definition list bullet tag (either DD or DT). |
| Div | A div tag. |
| DoctypeTag | The HTML Document Declaration Tag can identify <!DOCTYPE> tags. |
| FormTag | Represents a FORM tag. |
| FrameSetTag | Identifies a frame set tag. |
| FrameTag | Identifies a frame tag. |
| HeadingTag | A heading (h1 - h6) tag. |
| HeadTag | A head tag. |
| Html | An html tag. |
| ImageTag | Identifies an image tag. |
| InputTag | An input tag in a form. |
| JspTag | The JSP/ASP tags like <%...%> can be identified by this class. |
| LabelTag | A label tag. |
| LinkTag | Identifies a link tag. |
| MetaTag | A Meta Tag. |
| ObjectTag | ObjectTag represents an <Object> tag. |
| OptionTag | An option tag within a form. |
| ParagraphTag | A paragraph (p) tag. |
| ProcessingInstructionTag | The XML processing instructions like <?xml ... |
| ScriptTag | A script tag. |
| SelectTag | A select tag within a form. |
| Span | A span tag. |
| StyleTag | A StyleTag represents a <style> tag. |
| TableColumn | A table column tag. |
| TableHeader | A table header tag. |
| TableRow | A table row tag. |
| TableTag | A table tag. |
| TextareaTag | A text area tag within a form. |
| TitleTag | A title tag. |



