Handle UTF8 file with BOM

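This article examines the role of the byte order mark (BOM) in UTF-8 files and its effect on programming languages such as Java. It explains how to create a UTF-8 file with a BOM, shows how to convert a UTF-8 file to ANSI, and discusses how to ignore the BOM when reading a file so that no stray character appears.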

From Wikipedia: the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

The common BOMs are:

Encoding      Representation (hexadecimal)   Representation (decimal)
UTF-8         EF BB BF                       239 187 191
UTF-16 (BE)   FE FF                          254 255
UTF-16 (LE)   FF FE                          255 254
UTF-32 (BE)   00 00 FE FF                    0 0 254 255
UTF-32 (LE)   FF FE 00 00                    255 254 0 0
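
The byte sequences listed above can be turned into a small detection routine. The sketch below is only an illustration: the class name BomSniffer and the returned labels are made up for this example, and it assumes the caller has already read the first few bytes of the file. UTF-32 (LE) has to be tested before UTF-16 (LE) because both marks begin with FF FE.

```java
final class BomSniffer {

    // Returns a label for the encoding implied by a leading BOM, or null if none matches.
    static String detect(byte[] head) {
        // UTF-32 (LE) must be tested before UTF-16 (LE): both start with FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // prints UTF-8
    }
}
```

A UTF-16 (LE) file whose first character is U+0000 would be reported as UTF-32 (LE) by this check, so the result is best treated as a hint rather than a guarantee.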

UTF-8 files are a special case: adding a BOM to them is not recommended because it can break tools that do not expect one, Java among them. Java assumes that UTF-8 data has no BOM, so if one is present it is not discarded and is treated as data.

To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8 (on recent Windows versions, choose "UTF-8 with BOM").

If you now examine the file content as binary, you see the BOM (EF BB BF) at the beginning.
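
If no hex viewer is at hand, the first bytes can be printed from Java itself. This is a minimal sketch, not part of the original how-to: the class and method names (HexHead, hexHead) are made up for illustration, and the path is the one used in the reading example of this article.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HexHead {

    // Returns up to `count` leading bytes of the file as space-separated hex pairs.
    static String hexHead(Path file, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = Files.newInputStream(file)) {
            int b;
            for (int i = 0; i < count && (b = in.read()) != -1; i++) {
                if (i > 0) sb.append(' ');
                sb.append(String.format("%02X", b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the test file; adjust to wherever it was saved.
        System.out.println(hexHead(Paths.get("c:/temp/utf8.txt"), 8));
    }
}
```

For a file saved with a BOM, the printed line should begin with EF BB BF, followed by the bytes of the text itself.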

Now read it with Java:

import java.io.*;

public class x {

  public static void main(String args[]) {
    // "UTF8" is an alias for the UTF-8 charset; its decoder does not strip a leading BOM.
    // try-with-resources closes the reader even if an exception is thrown.
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
            new FileInputStream("c:/temp/utf8.txt"), "UTF8"))) {
        // Print the file line by line; the first line still starts with U+FEFF.
        for (String s; (s = r.readLine()) != null;) {
            System.out.println(s);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
  }
}

The output starts with a stray character because the BOM is not discarded:

?helloworld

This behaviour is documented in two entries in the Java bug database. No fix is planned for now because it would break existing tools such as javadoc or XML parsers.

Apache Commons IO provides tools to handle this situation. Its BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.

Or you can do it manually. The next example converts a UTF-8 file to ANSI (here, Windows code page 1252). It checks the first line for a BOM and, if one is present, simply discards it.

import java.io.*;

public class UTF8ToAnsiUtils {

    // FEFF because this is the Unicode char represented by the UTF-8 byte order mark (EF BB BF).
    public static final String UTF8_BOM = "\uFEFF";

    public static void main(String args[]) {
        if (args.length != 2) {
            System.out.println("Usage : java UTF8ToAnsiUtils utf8file ansifile");
            System.exit(1);
        }

        // try-with-resources closes both streams even if an exception is thrown.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), "UTF8"));
             Writer w = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), "Cp1252"))) {
            boolean firstLine = true;
            for (String s; (s = r.readLine()) != null;) {
                if (firstLine) {
                    // Only the very first line can carry the BOM.
                    s = removeUTF8BOM(s);
                    firstLine = false;
                }
                // Characters that Cp1252 cannot represent are written as '?'.
                w.write(s + System.getProperty("line.separator"));
            }
        }
        catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Drops the leading U+FEFF character if the string starts with it.
    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }
}
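
The check in removeUTF8BOM runs after decoding, on the first line of text. The BOM can also be dropped one level lower, on the raw bytes, before the reader ever sees it. The following is a JDK-only sketch of that idea using java.io.PushbackInputStream. The names BomSkipper and skipUtf8Bom are hypothetical, and this is not how Apache Commons IO implements BOMInputStream; it only handles the UTF-8 mark.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class BomSkipper {

    // Wraps `in` so that a leading UTF-8 BOM (EF BB BF), if present, is consumed.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r == -1) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints hello
        }
    }
}
```

With this wrapper the reading loop needs no first-line special case, at the cost of one extra stream layer. A file without a BOM passes through unchanged because the bytes that were peeked at are pushed back.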
 
 
http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html