Convert HTML to plain text in Java - Stack Overflow

I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should appear not only for <br>; other tags that break the line in a browser should lead to a new line too.

Sample HTML pages for testing are:

Note that these are only random URLs.

I have tried various libraries (Jsoup, javax.swing, Apache utils) mentioned in the answers to this Stack Overflow question to convert HTML to plain text.

Example using JSoup:

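import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // assuming JUnit 4; the original does not show its imports

public class JSoupTest {

    @Test
    public void SimpleParse() {
        try {
            // Fetch and parse the page
            Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
            // text() returns whitespace-normalized text, so line breaks are not preserved
            System.out.print(doc.text());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}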

Example with HTMLEditorKit:

import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {

    StringBuffer s;

    public Html2Text() {}

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    // Called by the parser for each run of text; tags are ignored here
    @Override
    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
            URLConnection conn = url.openConnection();
            BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
            String inputLine;
            String finalContents = "";
            while ((inputLine = reader.readLine()) != null) {
                finalContents += "\n" + inputLine.replace("
            }

            // Save the downloaded page to a local file
            BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
            writer.write(finalContents);
            writer.close();

            // Parse the saved file and print the extracted text
            FileReader in = new FileReader("samples/testHtml.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
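To make the requirement concrete, here is a minimal sketch of how the ParserCallback approach above could be extended to emit new lines. It is only an illustration under assumptions: the class name Html2TextLines and the LINE_ENDING_TAGS set are hypothetical and not part of the original code, and the choice of which tags end a line is a guess that would need tuning against real pages. It uses only JDK classes and requires Java 9 or later for Set.of.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Html2TextLines extends HTMLEditorKit.ParserCallback {

    // Assumption: closing any of these tags should end the current line.
    private static final Set<HTML.Tag> LINE_ENDING_TAGS = Set.of(
            HTML.Tag.P, HTML.Tag.DIV, HTML.Tag.TR, HTML.Tag.LI,
            HTML.Tag.H1, HTML.Tag.H2, HTML.Tag.H3);

    private final StringBuilder text = new StringBuilder();

    // Collect every run of character data as-is.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    // Empty elements such as <br> are reported here.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            text.append('\n');
        }
    }

    // End tags of the chosen elements also produce a line break.
    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (LINE_ENDING_TAGS.contains(tag)) {
            text.append('\n');
        }
    }

    // Parse an HTML string and return the collected text.
    public static String convert(String html) throws IOException {
        Html2TextLines callback = new Html2TextLines();
        try (Reader in = new StringReader(html)) {
            // third argument true: ignore the charset directive
            new ParserDelegator().parse(in, callback, true);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(convert("<p>first<br>second</p><p>third</p>"));
    }
}
```

With this sketch, the <br> and each closing </p> in the sample string should each contribute a line break to the result. Consecutive block tags can produce runs of blank lines, so real pages would likely need some post-processing.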
