Using the getEncoding Method of the OutputStreamWriter Class

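While writing a packaging program today, I ran into an extremely strange problem involving OutputStreamWriter.getEncoding(). The official documentation describes the method as follows: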

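Returns the name of the character encoding being used by this stream.

If the encoding has an historical name then that name is returned; otherwise the encoding's canonical name is returned.

If this instance was created with the OutputStreamWriter(OutputStream, String) constructor then the returned name, being unique for the encoding, may differ from the name passed to the constructor. This method may return null if the stream has been closed.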

(The original post marks the key phrases in red; the "historical name" clause is the crux of the problem.)
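To see the documented behavior directly, here is a minimal sketch. The class name GetEncodingDemo and the helper reportedEncoding are made up for illustration; only OutputStreamWriter and its getEncoding() method come from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

class GetEncodingDemo {
    // Hypothetical helper: creates a writer with the requested encoding name
    // and returns whatever getEncoding() reports for it.
    static String reportedEncoding(String requestedName) {
        try {
            OutputStreamWriter writer =
                    new OutputStreamWriter(new ByteArrayOutputStream(), requestedName);
            return writer.getEncoding();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + requestedName, e);
        }
    }

    public static void main(String[] args) {
        // The name passed in and the name reported back need not be identical.
        System.out.println("requested UTF-8, reported " + reportedEncoding("UTF-8"));
    }
}
```

On the Oracle/OpenJDK runtimes this reports UTF8 (the historical name) rather than the UTF-8 that was passed in.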

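Here I would like to recommend WAX (http://www.ociweb.com/mark/programming/WAX.html), an open source project for generating XML efficiently. Generating an XML file of roughly 100 MB from 5,000 records takes only about 20 seconds, which I find quite fast. It was while using this project that I ran into the problem, described below: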

    // The WAX.java source file contains the following code
    if (writer instanceof OutputStreamWriter) {
        encoding = ((OutputStreamWriter) writer).getEncoding();
    }

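I construct the OutputStreamWriter object with the OutputStreamWriter(OutputStream, String) constructor and pass "UTF-8" as the character encoding, yet OutputStreamWriter.getEncoding() returns "UTF8". As a result, the encoding in the XML declaration of the file that WAX generates becomes UTF8, and other parsers cannot parse the generated XML file. I modified the code above as follows: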

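    // requires: import java.nio.charset.Charset;
    if (writer instanceof OutputStreamWriter) {
        //encoding = ((OutputStreamWriter) writer).getEncoding();
        //pengfeng modify: map the historical name (UTF8) back to the canonical name (UTF-8)
        encoding = Charset.forName(((OutputStreamWriter) writer).getEncoding()).name();
    }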

OK, problem solved!
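One caveat with this fix: the Javadoc quoted earlier says getEncoding() may return null once the stream is closed, and Charset.forName(null) throws an IllegalArgumentException. The following is a minimal sketch of a null-safe variant. The class CanonicalEncoding and its method of are hypothetical names, not part of WAX.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

class CanonicalEncoding {
    // Hypothetical helper: returns the canonical charset name used by the writer,
    // or null when it cannot be determined.
    static String of(Writer writer) {
        if (!(writer instanceof OutputStreamWriter)) {
            return null; // other Writer types do not expose an encoding
        }
        String reported = ((OutputStreamWriter) writer).getEncoding();
        if (reported == null) {
            return null; // the Javadoc allows null after the stream is closed
        }
        // forName accepts the historical name as an alias; name() is the canonical name.
        return Charset.forName(reported).name();
    }

    public static void main(String[] args) throws IOException {
        OutputStreamWriter writer =
                new OutputStreamWriter(new ByteArrayOutputStream(), "UTF-8");
        System.out.println("open writer:   " + of(writer));
        writer.close();
        System.out.println("closed writer: " + of(writer));
    }
}
```

A caller that gets null back can then fall back to whatever default encoding it would otherwise put in the XML declaration.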

The above only covers the fix itself. It is based mainly on http://jira.codehaus.org/browse/WSTX-146, quoted below:

The code in the attached unit test incorrectly creates a document with an encoding in the XML declaration of 'UTF8'. According to section 4.3.3 of the XML 1.0 spec, all XML parsers must support encoding 'UTF-8', but not 'UTF8'. So many parsers including Xerces, won't parse that document.

The problem appears to be caused by the following code in WstxOutputFactory:

    // we may still be able to figure out the encoding:
    if (enc == null) {
        if (w instanceof OutputStreamWriter) {
            enc = ((OutputStreamWriter) w).getEncoding();
        }
    }

According to the Javadoc for OutputStreamWriter.getEncoding(), "If the encoding has an historical name then that name is returned; otherwise the encoding's canonical name is returned." The historical name for UTF-8 is UTF8.

I believe the correct code would be:

    // we may still be able to figure out the encoding:
    if (enc == null) {
        if (w instanceof OutputStreamWriter) {
            enc = Charset.forName(((OutputStreamWriter) w).getEncoding()).name();
        }
    }

Tatu Saloranta added a comment - 18/Mar/08 02:09 PM

Thanks! This does indeed look like sub-optimal behavior. As a work-around, applications shouldn't rely on auto-detection of the encoding, but I definitely want to improve this auto-detection as well.

One minor note: since Charset class was introduced in JDK 1.4, the fix as suggested can only be added to Woodstox 4.0. For 3.2, if the fix is to be backported I could just add specific work-around for UTF-8, since that is likely to be the most common case (additionally if others are identified they can be added of course).

I will fix this shortly for trunk (4.0).
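The work-around mentioned in this comment, not relying on auto-detection, can be sketched with the standard javax.xml.stream API by passing the encoding name explicitly both when creating the writer and when writing the XML declaration. The class name ExplicitEncodingDemo and the element name root are made up for illustration, and the exact serialization depends on which StAX implementation is on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class ExplicitEncodingDemo {
    // The single encoding name used for both the byte stream and the declaration.
    static final String ENCODING = "UTF-8";

    // Writes a tiny document and returns it as text.
    static String writeDocument() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Hand the implementation an OutputStream plus the encoding name,
            // so it never has to guess the name from a Writer.
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, ENCODING);
            xml.writeStartDocument(ENCODING, "1.0"); // declaration gets the name we chose
            xml.writeStartElement("root");
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
            xml.close();
            return out.toString(ENCODING);
        } catch (XMLStreamException | UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeDocument());
    }
}
```

Because the application supplies the name itself, the declaration does not depend on what OutputStreamWriter.getEncoding() would have reported.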

Tatu Saloranta added a comment - 26/Mar/08 12:57 PM

Easy to fix, will call Charset.normalize() for the value (which already knows many legacy conversions).
I think I'll also backport it to 3.2.x, since there's no API change and this should be a safe fix.
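For runtimes without java.nio.charset (before JDK 1.4), the kind of special-case work-around discussed in the thread could look roughly like the sketch below. This is not Woodstox's actual code: the class LegacyEncodingNames, its normalize method, and the choice of which names to map are assumptions made for illustration, and it is not meant to reproduce the normalize call mentioned in the comment above.

```java
class LegacyEncodingNames {
    // Hypothetical helper: maps a few historical java.io encoding names to the
    // names XML parsers expect, without using java.nio.charset.Charset.
    static String normalize(String name) {
        if (name == null) {
            return null; // getEncoding() may return null for a closed stream
        }
        if (name.equalsIgnoreCase("UTF8")) {
            return "UTF-8";
        }
        if (name.equalsIgnoreCase("ISO8859_1")) {
            return "ISO-8859-1";
        }
        if (name.equalsIgnoreCase("ASCII")) {
            return "US-ASCII";
        }
        return name; // unknown names pass through unchanged
    }

    public static void main(String[] args) {
        System.out.println(normalize("UTF8"));
        System.out.println(normalize("ISO8859_1"));
        System.out.println(normalize("windows-1252"));
    }
}
```

A fixed table like this only covers the names it lists, which is why Charset.forName(...).name() is the more general fix where JDK 1.4 or later is available.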