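While writing a packaging program today, I ran into an extremely odd problem involving
OutputStreamWriter.getEncoding(). The official documentation describes the method as follows: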
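Returns the name of the character encoding being used by this stream. If the encoding has an
historical name then that name is returned; otherwise the encoding's canonical name is returned.
If this instance was created with the OutputStreamWriter(OutputStream, String) constructor then
the returned name, being unique for the encoding, may differ from the name passed to the
constructor. This method may return null if the stream has been closed.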
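(Note the key phrases "historical name" and "may differ from the name passed to the constructor";
they are the crux of the problem.)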
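To make the documented behaviour concrete, here is a minimal sketch that prints what getEncoding() reports for a writer created with a given charset name. The class name GetEncodingDemo and the helper reportedEncoding are made up for illustration. On the Oracle/OpenJDK runtimes I am aware of, requesting "UTF-8" reports the historical name "UTF8"; other JVMs could in principle behave differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

class GetEncodingDemo {
    // Hypothetical helper: returns what getEncoding() reports for a writer
    // constructed with the given charset name.
    static String reportedEncoding(String requestedName) {
        try (OutputStreamWriter writer =
                 new OutputStreamWriter(new ByteArrayOutputStream(), requestedName)) {
            // Read the name while the stream is still open; after close()
            // the Javadoc says the method may return null.
            return writer.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Requested name on the left, reported name on the right.
        System.out.println("UTF-8      -> " + reportedEncoding("UTF-8"));
        System.out.println("ISO-8859-1 -> " + reportedEncoding("ISO-8859-1"));
    }
}
```

The point is simply that the string passed to the constructor and the string returned by getEncoding() are not guaranteed to match.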
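Let me also recommend WAX (http://www.ociweb.com/mark/programming/WAX.html), an open-source
project for generating XML efficiently. Generating an XML file of about 100 MB containing 5,000
records takes only about 20 seconds, which I find quite impressive. It was while using this
project that I hit the problem, described below: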
// The WAX.java source file contains the following code
if (writer instanceof OutputStreamWriter) {
    encoding = ((OutputStreamWriter) writer).getEncoding();
}
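I constructed the OutputStreamWriter with the OutputStreamWriter(OutputStream, String)
constructor and passed "UTF-8" as the character encoding, yet OutputStreamWriter.getEncoding()
returned "UTF8". As a result, the encoding in the XML declaration of the file generated by WAX
became UTF8, and other parsers could not parse the generated XML. I changed the code above as
follows: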
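// requires: import java.nio.charset.Charset;
if (writer instanceof OutputStreamWriter) {
    //encoding = ((OutputStreamWriter) writer).getEncoding();
    // modified by pengfeng: map the historical name back to the canonical charset name
    encoding = Charset.forName(((OutputStreamWriter) writer).getEncoding()).name();
}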
OK, problem solved!
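The fix can also be tried outside WAX. The following is a small standalone sketch, not WAX code: the class CanonicalEncodingDemo and its methods are hypothetical names, and the null check is my own addition because the Javadoc says getEncoding() may return null for a closed stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class CanonicalEncodingDemo {
    // Hypothetical helper: converts whatever getEncoding() reports
    // (possibly a historical name such as "UTF8") to the canonical charset name.
    static String canonicalEncoding(OutputStreamWriter writer) {
        String reported = writer.getEncoding();
        if (reported == null) {
            return null; // assumption: treat a closed stream as "unknown"
        }
        // Charset.forName accepts aliases; name() returns the canonical name.
        return Charset.forName(reported).name();
    }

    // Convenience wrapper for the demo: builds a writer and normalizes its encoding.
    static String canonicalEncodingFor(String requestedName) {
        try (OutputStreamWriter writer =
                 new OutputStreamWriter(new ByteArrayOutputStream(), requestedName)) {
            return canonicalEncoding(writer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalEncodingFor("UTF-8"));
        System.out.println(canonicalEncodingFor("UTF8"));
    }
}
```

Whichever spelling is passed to the constructor, the normalized result is the canonical name, which is the form that belongs in an XML declaration.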
So far I have only given the fix itself. It comes mainly from
http://jira.codehaus.org/browse/WSTX-146, quoted below:
The problem appears to be caused by the following code in WstxOutputFactory:
// we may still be able to figure out the encoding:
if (enc == null) {
    if (w instanceof OutputStreamWriter) {
        enc = ((OutputStreamWriter) w).getEncoding();
    }
}
According to the Javadoc for OutputStreamWriter.getEncoding(), "If the encoding has an historical
name then that name is returned; otherwise the encoding's canonical name is returned." The historical
name for UTF-8 is UTF8.
I believe the correct code would be:
// we may still be able to figure out the encoding:
if (enc == null) {
    if (w instanceof OutputStreamWriter) {
        enc = Charset.forName(((OutputStreamWriter) w).getEncoding()).name();
    }
}
One minor note: since Charset class was introduced in JDK 1.4, the fix as suggested can only be added
to Woodstox 4.0. For 3.2, if the fix is to be backported I could just add specific work-around for UTF-8,
since that is likely to be the most common case (additionally if others are identified they can be added
of course).
I will fix this shortly for trunk (4.0).
I think I'll also backport it to 3.2.x, since there's no API change and this should be a safe fix.
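The "specific work-around for UTF-8" mentioned in the quoted reply, for runtimes without the Charset class, might look roughly like the sketch below. This is only my guess at the idea, not the actual Woodstox code; the class LegacyEncodingNames and the method toXmlEncodingName are hypothetical names.

```java
class LegacyEncodingNames {
    // Hypothetical helper: maps the historical Java name "UTF8" to the
    // name expected in an XML declaration, without using java.nio.charset.
    static String toXmlEncodingName(String javaName) {
        if (javaName == null) {
            return null;
        }
        if (javaName.equalsIgnoreCase("UTF8")) {
            return "UTF-8";
        }
        // Other historical names could be added here as they are identified.
        return javaName;
    }

    public static void main(String[] args) {
        System.out.println(toXmlEncodingName("UTF8"));
        System.out.println(toXmlEncodingName("ISO-8859-1"));
    }
}
```

Unlike Charset.forName(...).name(), this only handles the names that are listed explicitly, so it trades completeness for compatibility with older JDKs.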