Using the getEncoding Method of the OutputStreamWriter Class

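      While writing a packaging program today, I ran into an extremely odd problem involving OutputStreamWriter.getEncoding(). The official documentation describes the method as follows: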

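Returns the name of the character encoding being used by this stream.

          If the encoding has an historical name then that name is returned; otherwise the encoding's canonical name is returned.

          If this instance was created with the OutputStreamWriter(OutputStream, String) constructor then the returned name, being unique for the encoding, may differ from the name passed to the constructor. This method may return null if the stream has been closed.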

     (Pay attention to the wording about the historical name: that is the key to the problem.)
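The behaviour the Javadoc describes can be observed directly. The following is a minimal sketch, not part of the original program: the class name GetEncodingDemo and the helper reportedName are made up for illustration, and the names printed for charsets other than UTF-8 depend on the JDK in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

// Hypothetical demo class: prints what getEncoding() reports for a few charset names.
class GetEncodingDemo {

    // Creates a writer with the requested charset name and returns the name
    // that getEncoding() reports while the writer is still open.
    static String reportedName(String requested) {
        try (OutputStreamWriter w =
                 new OutputStreamWriter(new ByteArrayOutputStream(), requested)) {
            return w.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "US-ASCII"}) {
            // The reported name may be a historical name rather than the one passed in.
            System.out.println(name + " -> " + reportedName(name));
        }

        // Per the Javadoc, getEncoding() may return null once the stream is closed.
        OutputStreamWriter closed =
            new OutputStreamWriter(new ByteArrayOutputStream(), "UTF-8");
        closed.close();
        System.out.println("after close -> " + closed.getEncoding());
    }
}
```

On the Oracle/OpenJDK runtimes, the first line shows the historical name UTF8 for a writer created with "UTF-8".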

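   Along the way, I would like to recommend WAX, an efficient open-source project for generating XML (http://www.ociweb.com/mark/programming/WAX.html). Generating an XML file of about 100 MB from 5,000 records takes only about 20 seconds, which I find quite impressive. It was while using this project that I found the problem, described below: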

 

        // The WAX.java source file contains the following code
        if (writer instanceof OutputStreamWriter) {
            encoding = ((OutputStreamWriter) writer).getEncoding();
        }

 

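  I constructed the OutputStreamWriter with the OutputStreamWriter(OutputStream, String) constructor and passed "UTF-8" as the character encoding, yet OutputStreamWriter.getEncoding() returned "UTF8". As a result, the encoding in the XML declaration of the file generated by WAX became UTF8, and other parsers could not parse the generated XML file. I modified the code above as follows: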

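      // Requires: import java.nio.charset.Charset;
      if (writer instanceof OutputStreamWriter) {
            //encoding = ((OutputStreamWriter) writer).getEncoding();
            //pengfeng modify: map the historical name (UTF8) to the canonical name (UTF-8)
            encoding = Charset.forName(((OutputStreamWriter) writer).getEncoding()).name();
        }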

    OK, problem solved!
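The same idea can be wrapped in a small helper that also guards against the null that getEncoding() may return after the stream is closed. This is only a sketch: the class EncodingNames, the method canonicalEncoding, and its fallback parameter are hypothetical and are not part of WAX.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of WAX.
class EncodingNames {

    // Returns the canonical charset name for an OutputStreamWriter,
    // or the fallback when the encoding cannot be determined.
    static String canonicalEncoding(Writer writer, String fallback) {
        if (writer instanceof OutputStreamWriter) {
            String reported = ((OutputStreamWriter) writer).getEncoding();
            if (reported != null) { // null is possible once the stream is closed
                try {
                    // forName accepts aliases such as "UTF8"; name() is the canonical name.
                    return Charset.forName(reported).name();
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported name: keep what the writer reported.
                    return reported;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Writer w = new OutputStreamWriter(new ByteArrayOutputStream(), StandardCharsets.UTF_8);
        System.out.println(canonicalEncoding(w, "unknown"));                  // UTF-8
        System.out.println(canonicalEncoding(new StringWriter(), "unknown")); // unknown
    }
}
```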

    So far I have only given the fix. It mainly comes from http://jira.codehaus.org/browse/WSTX-146, quoted below:

 

   The code in the attached unit test incorrectly creates a document with an encoding in the XML
declaration of 'UTF8'. According to section 4.3.3 of the XML 1.0 spec, all XML parsers must support
encoding 'UTF-8', but not 'UTF8'. So many parsers including Xerces, won't parse that document.

The problem appears to be caused by the following code in WstxOutputFactory:

 

// we may still be able to figure out the encoding:
            if (enc == null) {
                if (w instanceof OutputStreamWriter) {
                    enc = ((OutputStreamWriter) w).getEncoding();
                }
            }

   According to the Javadoc for OutputStreamWriter.getEncoding(), "If the encoding has an historical name then that name is returned; otherwise the encoding's canonical name is returned." The historical name for UTF-8 is UTF8.

I believe the correct code would be:

// we may still be able to figure out the encoding:
            if (enc == null) {
                if (w instanceof OutputStreamWriter) {
                    enc = Charset.forName(((OutputStreamWriter) w).getEncoding()).name();
                }
            }

Tatu Saloranta added a comment - 18/Mar/08 02:09 PM
Thanks! This does indeeed look like sub-optimal behavior. As a work-around, applications shouldn't
rely on auto-detection of the encoding, but I definitely want to improve this auto-detection as well.

One minor note: since Charset class was introduced in JDK 1.4, the fix as suggested can only be added to Woodstox 4.0. For 3.2, if the fix is to be backported I could just add specific work-around for UTF-8, since that is likely to be the most common case (additionally if others are identified they can be added of course).

I will fix this shortly for trunk (4.0).

 

Tatu Saloranta added a comment - 26/Mar/08 12:57 PM
Easy to fix, will call Charset.normalize() for the value (which already knows many legacy conversions).
I think I'll also backport it to 3.2.x, since there's no API change and this should be a safe fix.
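
The work-around mentioned in the first comment, not relying on auto-detection of the encoding, amounts to keeping the charset in hand and using its canonical name for both the writer and the XML declaration. A minimal sketch of that idea in plain Java follows; the class ExplicitEncodingXml and the method writeDocument are hypothetical and use neither WAX nor Woodstox.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example: the caller chooses the charset once and reuses it.
class ExplicitEncodingXml {

    // Writes an XML declaration plus the given body, encoded with the given charset.
    static byte[] writeDocument(Charset charset, String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            // charset.name() is the canonical name, e.g. "UTF-8", never "UTF8".
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeDocument(StandardCharsets.UTF_8, "<root/>");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Because the declaration is built from the same Charset object that configures the writer, the two cannot disagree.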