BOM

byte order mark (BOM)

Glossary 2010-03-04 16:36:04

Reference: http://en.wikipedia.org/wiki/Byte-order_mark

UTF-8
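// The three bytes of the UTF-8 encoded BOM (U+FEFF).
public static final byte[] UTF8_HEAD = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
// fos is assumed to be an already opened java.io.OutputStream, e.g. a FileOutputStream.
fos.write(UTF8_HEAD);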
    

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional; if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

 

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

 

 

Usage

In UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

  • If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
  • if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.
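The two byte sequences above can be reproduced with a short Python sketch. It uses the BOM-less `utf-16-be` and `utf-16-le` codecs from the standard library; the variable names are illustrative only.

```python
# Encode the BOM character U+FEFF in both UTF-16 byte orders.
BOM_CHAR = "\ufeff"

big_endian = BOM_CHAR.encode("utf-16-be")     # bytes 0xFE 0xFF
little_endian = BOM_CHAR.encode("utf-16-le")  # bytes 0xFF 0xFE

print(big_endian.hex())
print(little_endian.hex())
```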

The Unicode value U+FFFE is guaranteed never to be assigned as a Unicode character; this implies that in a Unicode context the 0xFF, 0xFE byte pattern can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it could not be a U+FFFE character expressed in big-endian byte order).

While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may nonetheless be encountered. A UTF-8 BOM is explicitly allowed by the Unicode standard[2], but is not recommended[3], as it only identifies a file as UTF-8 and does not state anything about byte order.[4] Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default. However, in Unix-like systems (which make heavy use of text files for file formats as well as for inter-process communication) this practice is not recommended, as it will interfere with correct processing of important codes such as the shebang at the start of an interpreted script.[5] It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
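The following Python sketch illustrates the effects described above on a made-up shell script. The sample script content is an assumption for illustration; `codecs.BOM_UTF8` and the `latin-1`, `utf-8` and `utf-8-sig` codecs are part of the standard library.

```python
import codecs

# A hypothetical script that an editor saved with a UTF-8 BOM in front.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# Viewed as ISO-8859-1, the three BOM bytes become three visible characters.
as_latin1 = script.decode("latin-1")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character,
# so the text no longer starts with the shebang.
as_utf8 = script.decode("utf-8")

# The "utf-8-sig" codec drops one leading BOM if it is present.
as_utf8_sig = script.decode("utf-8-sig")

print(repr(as_latin1[:3]))
print(as_utf8.startswith("#!"))
print(as_utf8_sig.startswith("#!"))
```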

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable. For the IANA registered charsets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE a "byte order mark" must not be used; an initial U+FEFF has to be interpreted as a (deprecated) "zero width no-break space", because the names of these charsets already determine the byte order. For the registered charsets UTF-16 and UTF-32, an initial U+FEFF indicates the byte order.
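Python's standard codecs happen to follow the same distinction, which makes for a compact illustration. This is a sketch of one implementation's behaviour, not a statement about every decoder; the sample bytes are made up.

```python
# FE FF followed by the letter "A" as one big-endian 16-bit code unit.
data = b"\xfe\xff\x00A"

# "utf-16" reads the leading BOM to pick the byte order, then discards it.
with_bom_codec = data.decode("utf-16")

# "utf-16-be" already fixes the byte order, so U+FEFF stays in the text.
fixed_order_codec = data.decode("utf-16-be")

print(repr(with_bom_codec))
print(repr(fixed_order_codec))
```

The same holds when encoding: `"A".encode("utf-16-be")` produces no BOM, since the codec name already states the byte order.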

If the BOM character appears in the middle of a data stream, it should, according to Unicode, be interpreted as a "zero-width non-breaking space" (an invisible character that prevents a line break at its position). Its deliberate use for this purpose is deprecated in Unicode 3.2, however, with the "Word Joiner" character, U+2060, strongly preferred. This allows U+FEFF to be used solely with the semantic of BOM.
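The two characters can be inspected with Python's `unicodedata` module. The helper `strip_leading_bom` below is a hypothetical name introduced for illustration; it removes U+FEFF only at the start of the text and leaves later occurrences alone, matching the distinction drawn above.

```python
import unicodedata

# Official character names for the two code points discussed above.
print(unicodedata.name("\ufeff"))
print(unicodedata.name("\u2060"))

def strip_leading_bom(text):
    """Drop a single U+FEFF at position 0; leave any later ones in place."""
    return text[1:] if text.startswith("\ufeff") else text

print(repr(strip_leading_bom("\ufeffa\ufeffb")))
```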

Unwanted BOMs

Some text editing software in a UTF-8 environment on MS Windows adds a BOM to the beginning of text files. If the page is displayed in Latin-1 (ISO-8859-1), the three bytes show up as ï»¿ (i with diaeresis, right-pointing double angle quotation mark, inverted question mark).

Adding a BOM to the beginning of a PHP file (.php) results in the BOM being output as part of the HTML, rather than being processed as PHP, since it comes before the opening <?php tag. Besides being a visual nuisance, it can cause the default HTTP headers to be sent, therefore preventing the sending of customized HTTP headers. This happens because whatever HTTP headers are in the queue are sent out as soon as the first text goes out for the page. The fix is to hunt down the affected PHP file(s) and remove the BOM bytes, for example with another editor that saves UTF-8 without a BOM.
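Locating and cleaning affected files can also be scripted. The sketch below is one possible approach, not a standard tool: the function names `strip_utf8_bom` and `strip_boms_under` and the default `*.php` pattern are assumptions made for illustration. It rewrites files in place, so it is best tried on a copy first.

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path):
    """Rewrite `path` without a leading UTF-8 BOM; return True if one was removed."""
    data = Path(path).read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    Path(path).write_bytes(data[len(codecs.BOM_UTF8):])
    return True

def strip_boms_under(root, pattern="*.php"):
    """Strip BOMs from matching files below `root`; return the changed paths."""
    return [p for p in sorted(Path(root).rglob(pattern)) if strip_utf8_bom(p)]
```

Calling `strip_boms_under("path/to/site")` (a placeholder path) returns the list of files that were modified, which doubles as a report of where the BOMs were.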

Representations of byte order marks by encoding

| Encoding | Representation (hexadecimal) | Representation (decimal) | Representation (ISO-8859-1) |
|---|---|---|---|
| UTF-8 | EF BB BF [t 1] | 239 187 191 | ï»¿ |
| UTF-16 (BE) | FE FF | 254 255 | þÿ |
| UTF-16 (LE) | FF FE | 255 254 | ÿþ |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | □□þÿ (□ is the ASCII null character) |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ□□ (□ is the ASCII null character) |
| UTF-7 | 2B 2F 76, and one of the following bytes: 38, 39, 2B, 2F [t 2] | 43 47 118, and one of the following bytes: 56, 57, 43, 47 | +/v, and one of the following characters: 8 9 + / |
| UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
| UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
| SCSU | 0E FE FF [t 3] | 14 254 255 | □þÿ (□ is the ASCII "shift out" character) |
| BOCU-1 | FB EE 28, optionally followed by FF [t 4] | 251 238 40, optionally followed by 255 | ûî(, optionally followed by ÿ |
| GB-18030 | 84 31 95 33 | 132 49 149 51 | □1■3 (□ and ■ are unmapped ISO-8859-1 characters) |
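A table like this lends itself to a simple BOM "sniffer". The Python sketch below covers only the UTF-8, UTF-16 and UTF-32 rows; the names `BOM_SIGNATURES` and `sniff_bom` are hypothetical, and the approach is a heuristic rather than a complete encoding detector.

```python
# (signature, label) pairs taken from the table. Longer signatures come first
# so that UTF-32 (LE) "FF FE 00 00" is not reported as UTF-16 (LE) "FF FE".
BOM_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]

def sniff_bom(data):
    """Return (label, bom_length) for a recognised leading BOM, else (None, 0)."""
    for signature, label in BOM_SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```

Note that the bytes FF FE 00 00 are inherently ambiguous: they could also be a UTF-16 (LE) BOM followed by a NUL character. Preferring the longer match is a design choice of this sketch, not something the table dictates.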
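### Technical concepts behind the Byte Order Mark (BOM)

The Byte Order Mark (BOM) is a special character used to indicate the byte order of a Unicode-encoded file. It usually appears at the start of the file and helps a parser identify which Unicode encoding form the file uses (such as UTF-8, UTF-16 or UTF-32). Although the BOM is transparent to some programs, it can cause compatibility problems when text is processed.

#### What the BOM does and how it relates to common encodings

In the Unicode standard the BOM is defined as the character U+FEFF. It appears in the different encodings as follows:

- **UTF-8**: the three-byte sequence `EF BB BF`.
- **UTF-16 (BE)**: the two-byte sequence `FE FF`.
- **UTF-16 (LE)**: the two-byte sequence `FF FE`[^4].

Python supports many encodings out of the box and lets the developer name a specific encoding as a string when reading or writing files that carry a BOM[^1].

#### How to handle text files that contain a BOM

When a file containing a BOM is encountered, it can be handled in the following ways.

##### Method 1: let Python remove the BOM automatically

Python has built-in support for detecting and skipping the BOM sequence. For example, this can be done by passing the encoding `'utf-8-sig'`:

```python
with open('file_with_bom.txt', mode='r', encoding='utf-8-sig') as file:
    content = file.read()
```

The snippet above looks for a BOM at the start of the file, removes it if present, and then reads the rest of the data stream as usual.

##### Method 2: remove the BOM manually

If finer control is needed, or if relying on a special codec is not desirable, the extra marker can be removed by hand. The following simple example shows how:

```python
import io

def remove_bom(file_path):
    bom_bytes = b'\xef\xbb\xbf'

    with open(file_path, 'rb') as f:
        # Verify that the file really starts with the UTF-8 BOM.
        if f.read(len(bom_bytes)) != bom_bytes:
            raise ValueError("File does not start with expected BOM.")

        # The file position is now just past the BOM, so decoding the
        # remainder yields the text without the marker.
        text = io.TextIOWrapper(f, encoding='utf-8', newline='')
        return text.readlines()

lines_without_bom = remove_bom('./example.txt')
print(lines_without_bom[:5])  # Print the first five lines after BOM removal.
```

This script first verifies that the target file really does start with the expected kind of BOM, and only then skips over it to work on the actual payload[^2].

#### Summary

Although the BOM can make cross-platform interchange more convenient, it also adds complexity and room for error. Understanding how it works and how to deal with it is therefore an important skill.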