htmlparser is an open-source project. It offers two main ways to parse a page.
The main class to work with is Parser.
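Parser parser = new Parser(new URL(url).openConnection()); // parse directly from a URLConnection
parser.setInputHTML(getHtml(str)); // or feed in HTML read through a stream, to avoid garbled characters
parser.reset(); // reset the parser so it can be run again with a different filter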
// Fetch the HTML text of a page
// Requires: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader,
//           java.net.HttpURLConnection, java.net.URL
public static String getHtml(String urlAddress) {
    BufferedReader buff = null;
    HttpURLConnection urlconn = null;
    String html = null;
    try {
        URL url = new URL(urlAddress);
        urlconn = (HttpURLConnection) url.openConnection();
        urlconn.setConnectTimeout(10000);
        urlconn.setReadTimeout(70000);
        // Set common request headers
        urlconn.setRequestProperty("Accept", "*/*");
        urlconn.setRequestProperty("Connection", "Keep-Alive");
        urlconn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
        urlconn.setRequestProperty("Content-Type", "text/html");
        // urlconn.setRequestProperty("Accept-Charset", "utf-8");
        // urlconn.setRequestProperty("contentType", "GBK");
        urlconn.setDoInput(true);
        urlconn.setRequestMethod("GET"); // a GET sends no body, so setDoOutput(true) is not needed
        // Open the connection
        urlconn.connect();
        // Decode the response bytes as GBK once, here; change the charset to match the page
        buff = new BufferedReader(new InputStreamReader(urlconn.getInputStream(), "GBK"));
        StringBuilder sb = new StringBuilder();
        String s;
        while (null != (s = buff.readLine())) { // read the page line by line
            // s is already decoded; re-encoding it with new String(s.getBytes(), "GBK") would garble it
            sb.append(s).append("\n");
        }
        html = sb.toString();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (null != buff) {
                buff.close();
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        if (null != urlconn) {
            urlconn.disconnect();
        }
    }
    return html; // null if the request failed
}
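The reason for reading through a stream is that the response bytes should be decoded exactly once, with the charset the page really uses. Below is a minimal stdlib-only sketch of that step. `CharsetReadSketch` and `readAll` are hypothetical names, and the sample uses UTF-8 only because every JVM is guaranteed to support it; for a GBK page one would presumably pass `Charset.forName("GBK")` instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadSketch {
    // Reads the whole stream, decoding bytes to characters once with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two CJK characters written as escapes so the source file stays ASCII.
        String page = "<p>\u4e2d\u6587</p>";
        byte[] raw = page.getBytes(StandardCharsets.UTF_8);
        // Correct charset: the text survives the round trip.
        String ok = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);
        // Wrong charset: the same bytes decode to different characters.
        String garbled = readAll(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        System.out.println(ok.trim().equals(page));      // true
        System.out.println(garbled.trim().equals(page)); // false
    }
}
```

The string handed to the parser is then already correct, so no further `getBytes()` conversion is needed.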
1. Filter classes
2. Visitor classes
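These notes only show the filter approach in code, so here is a rough sketch of the visitor idea: the traversal calls back into an object once per node, and that object decides what to keep. Everything in it (`VisitorSketch`, `SimpleNode`, `SimpleVisitor`, `walk`) is a simplified stand-in written for illustration, not the library's own types. In the library itself, the corresponding pieces are, as far as I know, `NodeVisitor` and `Parser.visitAllNodesWith`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitorSketch {
    // Simplified stand-in for a parsed node; NOT the library's Node type.
    static class SimpleNode {
        final String tagName; // null for a text node
        final String text;    // content of a text node, otherwise null
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String tagName, String text, SimpleNode... kids) {
            this.tagName = tagName;
            this.text = text;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Callback interface: one method per kind of node.
    interface SimpleVisitor {
        void visitTag(SimpleNode tag);
        void visitText(SimpleNode text);
    }

    // Depth-first walk that hands every node to the visitor.
    static void walk(SimpleNode node, SimpleVisitor visitor) {
        if (node.tagName != null) {
            visitor.visitTag(node);
        } else {
            visitor.visitText(node);
        }
        for (SimpleNode child : node.children) {
            walk(child, visitor);
        }
    }

    // Example use of a visitor: collect tag names in document order.
    static List<String> tagNames(SimpleNode root) {
        final List<String> names = new ArrayList<>();
        walk(root, new SimpleVisitor() {
            public void visitTag(SimpleNode tag) { names.add(tag.tagName); }
            public void visitText(SimpleNode text) { /* text is ignored here */ }
        });
        return names;
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("div", null,
                new SimpleNode("p", null, new SimpleNode(null, "hello")),
                new SimpleNode("a", null, new SimpleNode(null, "link")));
        System.out.println(tagNames(root)); // [div, p, a]
    }
}
```

A visitor suits cases where several kinds of node need handling in one pass; a filter is simpler when only a list of matching nodes is wanted.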
The Filter classes include:
Custom tag classes
PrototypicalNodeFactory pnfPrototypicalNodeFactory = new PrototypicalNodeFactory(); // create a node factory instance
pnfPrototypicalNodeFactory.registerTag(new IFrameTag()); // register the custom tag
parser.setNodeFactory(pnfPrototypicalNodeFactory); // set the parser's tag library
TagNameFilter
HasAttributeFilter(attribute, value)
The following parses by CSS class:
HasAttributeFilter attrFilter = new HasAttributeFilter("class", attr);
NodeList nodeList3 = parser.extractAllNodesThatMatch(attrFilter);
NodeIterator iterator1 = nodeList3.elements();
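To make the filter mechanism concrete, the sketch below models it with plain Java: a filter is an object with a single accept method, and extraction is a recursive walk that keeps every node the filter accepts. All names (`FilterSketch`, `Elem`, `ElemFilter`, `hasAttribute`, `tagName`, `extractAll`) are hypothetical stand-ins that only mirror the roles of `HasAttributeFilter`, `TagNameFilter` and `extractAllNodesThatMatch`; the library's actual matching rules may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterSketch {
    // Simplified stand-in for a tag node; NOT the library's TagNode.
    static class Elem {
        final String name;
        final Map<String, String> attrs = new HashMap<>();
        final List<Elem> children = new ArrayList<>();

        Elem(String name) { this.name = name; }
        Elem attr(String key, String value) { attrs.put(key, value); return this; }
        Elem add(Elem child) { children.add(child); return this; }
    }

    // A filter answers one question: does this node match?
    interface ElemFilter {
        boolean accept(Elem e);
    }

    // Matches elements whose attribute has exactly the given value.
    static ElemFilter hasAttribute(final String key, final String value) {
        return new ElemFilter() {
            public boolean accept(Elem e) { return value.equals(e.attrs.get(key)); }
        };
    }

    // Matches elements by tag name, ignoring case (a choice made for this sketch).
    static ElemFilter tagName(final String name) {
        return new ElemFilter() {
            public boolean accept(Elem e) { return name.equalsIgnoreCase(e.name); }
        };
    }

    // Collects every element the filter accepts, in document order.
    static List<Elem> extractAll(Elem root, ElemFilter filter) {
        List<Elem> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Elem e, ElemFilter filter, List<Elem> out) {
        if (filter.accept(e)) {
            out.add(e);
        }
        for (Elem child : e.children) {
            collect(child, filter, out);
        }
    }

    public static void main(String[] args) {
        Elem root = new Elem("div")
                .add(new Elem("p").attr("class", "title"))
                .add(new Elem("span").attr("class", "title"))
                .add(new Elem("p").attr("class", "body"));
        System.out.println(extractAll(root, hasAttribute("class", "title")).size()); // 2
        System.out.println(extractAll(root, tagName("P")).size());                   // 2
    }
}
```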
Node types:
TextNode
TagNode
CompositeTag (a composite node, i.e. a tag that contains child nodes)
Useful accessor methods:
getText()
toPlainTextString() // returns the plain text
toHtml() // returns the HTML
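The three accessors are easy to mix up, so here is a small model of how I understand them to differ for a tag such as a link. `NodeTextSketch` and `LinkLike` are hypothetical stand-ins, and the behaviour encoded here (in particular that `getText()` on a tag gives the text between `<` and `>`) is an assumption worth checking against the library's Javadoc.

```java
class NodeTextSketch {
    // Stand-in for a tag with one text child, e.g. <a href="x.html">Home</a>.
    // NOT the library's TagNode; it only imitates the three accessors.
    static class LinkLike {
        final String tagName;
        final String tagContents; // what sits between '<' and '>'
        final String innerText;

        LinkLike(String tagName, String attributes, String innerText) {
            this.tagName = tagName;
            this.tagContents = attributes.isEmpty() ? tagName : tagName + " " + attributes;
            this.innerText = innerText;
        }

        // Assumed: the raw contents of the opening tag, without angle brackets.
        String getText() { return tagContents; }

        // Assumed: only the visible text, with all markup removed.
        String toPlainTextString() { return innerText; }

        // Assumed: the full markup, opening tag through closing tag.
        String toHtml() { return "<" + tagContents + ">" + innerText + "</" + tagName + ">"; }
    }

    public static void main(String[] args) {
        LinkLike link = new LinkLike("a", "href=\"x.html\"", "Home");
        System.out.println(link.getText());           // a href="x.html"
        System.out.println(link.toPlainTextString()); // Home
        System.out.println(link.toHtml());            // <a href="x.html">Home</a>
    }
}
```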
export JAVA_HOME=/data/boss_linux/jdk1.6.0_24
export JAVA_BIN=/data/boss_linux/jdk1.6.0_24/bin
export PATH=$PATH:$JAVA_HOME/bin:/url/local/jad/jad_linux
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JRE_HOME=$JAVA_HOME/jre