I recently picked up another task: building a push-based RSS system. The key requirement is to crawl the full article body along with the feed and push it to users, so they can read the article directly.
The hard part of the project is extracting the article body. As everyone knows, web pages come in all shapes and sizes, so locating the body text reliably is not easy.
Let's start with a snippet that parses the RSS feed:
// ROME 1.0 (rome-1.0.jar, package com.sun.syndication.*) does the feed parsing
URL rssUrl = new URL(url);
URLConnection connection = rssUrl.openConnection();
// Google's feed servers reject the default "Java/..." user agent, so set our own
connection.setRequestProperty("User-Agent", "nbg");
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(connection));
// Use the feed's publish date if present, otherwise fall back to "now"
Date pubDate = feed.getPublishedDate();
if (pubDate != null) {
    urlBean.setTime(pubDate.getTime());
} else {
    urlBean.setTime(System.currentTimeMillis());
}
List<SyndEntry> list = feed.getEntries();
The parsing is done with the open-source rome-1.0.jar. As usual in Java, use a good open-source library where one exists and only write the parts that are specific to your project; I also don't recommend patching the library source (special cases aside), so that future upgrades stay painless.
This part is straightforward and easily gives you the feed entries as a List. Someone will surely ask about the line connection.setRequestProperty("User-Agent", "nbg"); — why add it? Because Google's RSS service checks the User-Agent. Almost any value works, just not the default one (something like "Java/..."); with the default, the server decides you are a bot and blocks you.
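For reference, the entries can be walked roughly like this with the ROME API; what matters most for us is each entry's link, since that is the page we will fetch with HtmlUnit later (the variable names here are illustrative only):
for (SyndEntry entry : list) {
    String title = entry.getTitle();   // entry headline
    String link = entry.getLink();     // URL of the full article page
    String summary = (entry.getDescription() != null)
            ? entry.getDescription().getValue()  // short teaser text, often truncated
            : "";
    // "link" is what we hand to HtmlUnit below to fetch the article body
}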
The title and description we get for each RSS entry are not what we ultimately want; we need the body text of the linked web page, so let's move on to the most important part.
For that we use HtmlUnit, a solid open-source Java library for parsing HTML. Using HtmlUnit is fairly involved in its own right, so I'll write a separate article about it when I have time.
Here I'll assume you can already obtain the HtmlPage object for the article.
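If you haven't used HtmlUnit before, obtaining the page looks roughly like this (a minimal sketch; depending on the HtmlUnit version, switches such as JavaScript or CSS handling live either directly on WebClient or on its options object, so adjust as needed):
WebClient webClient = new WebClient();
// "link" is the article URL taken from the feed entry above
HtmlPage page = webClient.getPage(link);
// ... run the helper functions below against "page" ...
webClient.closeAllWindows();  // older HtmlUnit; newer versions use close()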
Below are a few helper functions I wrote for locating the body-content element. This is only a first version and I'll keep improving it, so please go easy on the criticism.
/**
 * Automatically find the node that most likely contains the article body.
 *
 * @param page the parsed page
 * @return the candidate body node, or null if none was found
 */
public static DomNode getAutoNode(HtmlPage page) {
    if (page == null) {
        return null;
    }
    DomNode nodeMax = null;
    String tagName = "p";
    int allLength = page.asText().length();
    double cell = 0;
    // First heuristic: the body is usually the element that has several <p>
    // children and the highest average text length per child.
    NodeList list = page.getElementsByTagName(tagName);
    for (int i = 0; i < list.getLength(); i++) {
        DomNode domNode = (DomNode) list.item(i);
        DomNode parentNode = domNode.getParentNode();
        if (parentNode != null) {
            int length = parentNode.asText().length();
            NodeList nodeList = parentNode.getChildNodes();
            int childNum = nodeList.getLength();
            int k = 0;
            for (int j = 0; j < childNum; j++) {
                if (tagName.equals(nodeList.item(j).getNodeName())) {
                    k++;
                }
            }
            if (k > 2) {
                // average text length per child node of this candidate
                int avg = length / childNum;
                // the candidate must also hold a non-trivial share of the page text
                if (avg > cell && allLength / length < 30) {
                    cell = avg;
                    nodeMax = parentNode;
                }
            }
        }
    }
    if (nodeMax != null) {
        // Rebuild a clean copy that keeps only paragraphs, div text and images
        HtmlElement element = page.createElement(nodeMax.getNodeName());
        Iterable<HtmlElement> elements = nodeMax.getAllHtmlChildElements();
        for (HtmlElement htmlElement : elements) {
            String name = htmlElement.getNodeName();
            HtmlElement tagElement = page.createElement(tagName);
            if (tagName.equals(name)) {
                tagElement.setTextContent(htmlElement.asText());
                element.appendChild(tagElement);
            } else if ("div".equals(name)) {
                tagElement.setTextContent(htmlElement.getTextContent());
                element.appendChild(tagElement);
            } else if ("img".equals(name)) {
                element.appendChild(htmlElement.cloneNode(false));
            }
        }
        // Drop the helper "xpath" attribute from every child element
        Iterable<HtmlElement> iterable = element.getAllHtmlChildElements();
        for (HtmlElement htmlElement : iterable) {
            htmlElement.removeAttribute("xpath");
        }
        return element;
    } else {
        // Second heuristic: match the page title against elements in the body,
        // take the deepest match and walk back up until it holds enough text.
        String title = page.getTitleText().trim();
        DomNode node = page.getFirstByXPath("/html/body");
        if (node instanceof HtmlElement) {
            HtmlElement element = (HtmlElement) node;
            Iterable<HtmlElement> iterable = element.getAllHtmlChildElements();
            Map<String, HtmlElement> nodeMap = new ConcurrentHashMap<String, HtmlElement>();
            for (HtmlElement childNode : iterable) {
                String text = childNode.asText().trim();
                if (isEqualTitle(title, text)) {
                    String xpath = childNode.getAttribute("xpath");
                    nodeMap.put(xpath, childNode);
                }
            }
            // the longest xpath belongs to the deepest (most specific) match
            String keyStr = "";
            for (String key : nodeMap.keySet()) {
                if (key.length() > keyStr.length()) {
                    keyStr = key;
                }
            }
            if (keyStr.length() > 0) {
                HtmlElement childNode = nodeMap.get(keyStr);
                int localLength = childNode.asText().length();
                // walk up until the element carries a reasonable amount of text
                while (localLength < 100) {
                    logger.debug("6:| localLength:|" + localLength);
                    DomNode parent = childNode.getParentNode();
                    if (parent instanceof HtmlElement) {
                        childNode = (HtmlElement) parent;
                        localLength = childNode.asText().length();
                    } else {
                        // no element parent left: stop instead of looping forever
                        break;
                    }
                }
                return childNode;
            }
        }
    }
    return null;
}
/**
 * Check whether a piece of text matches the page title.
 *
 * @param title the page title
 * @param text  the candidate text
 * @return true if the text matches the title
 */
private static boolean isEqualTitle(String title, String text) {
    if (isNotEmpty(title) && isNotEmpty(text)) {
        text = text.trim();
        title = title.trim();
        // Strip a trailing "(number)" suffix, e.g. a comment or reply count
        if (text.endsWith(")")) {
            int a = text.lastIndexOf('(');
            int b = text.lastIndexOf(')');
            if (a > 0 && b > a) {
                String num = text.substring(a + 1, b);
                try {
                    Integer.parseInt(num);
                    text = text.substring(0, a);
                } catch (NumberFormatException e) {
                    // not a numeric suffix, keep the text unchanged
                }
            }
        }
        if (title.startsWith(text)) {
            return true;
        }
        // The title often contains the site name; compare against each space-separated part
        String[] tStr = title.split(" ");
        for (String string : tStr) {
            if (text.equalsIgnoreCase(string.trim())) {
                return true;
            }
        }
    }
    return false;
}
private static boolean isNotEmpty(String title) {
    return title != null && title.length() > 0;
}
public static String htmlToText(HtmlPage page, String text)
        throws SAXException, IOException {
    // "pqrs" is just a dummy container tag used to hold the parsed fragment
    final HtmlElement element = page.createElement("pqrs");
    HTMLParser.parseFragment(element, text);
    return element.asText();
}
public static void getContentList(HtmlPage page, DomNode domNode,
        List<ContentTag> contentList) {
    // the plain text of the body node goes first
    String text = domNode.asText();
    contentList.add(new ContentTag("text", text));
    // then collect the absolute URL of every image inside the body node
    Iterable<HtmlElement> children = domNode.getAllHtmlChildElements();
    for (HtmlElement htmlElement : children) {
        if (htmlElement instanceof HtmlImage) {
            HtmlImage image = (HtmlImage) htmlElement;
            String src = image.getSrcAttribute();
            try {
                src = getFullHref(page, src);
                contentList.add(new ContentTag("image", src));
                logger.info("load image src:{}", src);
            } catch (MalformedURLException e) {
                logger.info("error", e);
            }
        }
    }
}
private static String getFullHref(HtmlPage page, String url)
        throws MalformedURLException {
    // Resolve the (possibly relative) url against the page URL
    url = page.getFullyQualifiedUrl(url).toString();
    // Collapse "/../" segments; use the literal replace(), not replaceAll(),
    // because replaceAll() treats its argument as a regex where '.' matches any character
    while (url.contains("/../")) {
        url = url.replace("/../", "/");
    }
    return url;
}
The overall idea of the fallback path: find the page title, match it against the headline inside the page, locate the element that contains that headline, and then walk up the tree until you reach the part that holds most of the page's text — that part is the article body.
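Putting the pieces together, a call site could look roughly like this (a sketch only; ContentTag is this project's own small type/value holder, and error handling is omitted):
DomNode bodyNode = getAutoNode(page);
if (bodyNode != null) {
    List<ContentTag> contentList = new ArrayList<ContentTag>();
    // first entry is the plain body text, followed by one entry per image URL
    getContentList(page, bodyNode, contentList);
    // push contentList (plus the feed entry's title) to the user
}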