C Reference Manual Reading Notes: 003 Multibyte and Wide Characters

Wide Characters and Multibyte Characters


License: CC 4.0 BY-SA

Tags:

#reference #c #character #constants #preprocessor #string

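This article introduces the wide characters and wide strings that Standard C added to accommodate non-English character sets, and explains how multibyte characters are used in the source and execution character sets. It also describes the restrictions that Standard C places on multibyte characters.

To accommodate non-English alphabets that may contain a large number of characters, Standard C introduces wide characters and wide strings. To represent wide characters and wide strings in the external, byte-oriented world, the concept of multibyte characters is introduced.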

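Wide Characters And Strings. A wide character is a binary representation of an element of an extended character set. It has the integer type wchar_t, which is declared in the header file stddef.h (and also in wchar.h). Standard C does not specify the encoding of the extended character set, other than requiring that the null wide character have the value zero and that WEOF exist. WEOF is a value of type wint_t, defined in wchar.h, that is distinct from every member of the extended character set; it is often implemented as -1 converted to wint_t, but the standard does not fix its value.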

Multibyte Characters. A multibyte character is the representation of a wide character in either the source or the execution character set (the two may use different encodings). A multibyte string is a normal C string whose bytes can be interpreted as a series of multibyte characters. The form of multibyte characters and the mapping between multibyte and wide characters are implementation-defined. This mapping is performed at compile time for wide-character and wide-string constants, and the standard library provides functions that perform it at run time. A multibyte encoding can be state-dependent or state-independent. In a state-dependent encoding, the interpretation of a byte sequence depends on a shift state established by earlier bytes in the string.

Standard C places some restrictions on multibyte characters:

(1). All characters from the standard character set must be present in the encoding.

(2). In the initial shift state, all single-byte characters from the standard character set retain their normal interpretation and do not affect the shift state.

(3). A byte containing all zeros is taken to be the null character regardless of shift state. No multibyte character can use a byte containing all zeros as its second or subsequent character.

Together, these rules ensure that multibyte sequences can be processed as normal C strings (e.g., they will not contain embedded null characters) and that a C string without special multibyte codes will have the expected interpretation as a multibyte sequence.

Source and execution use of multibyte characters. Multibyte characters may appear in comments, identifiers, preprocessor header names, string constants, and character constants. Multibyte characters in the physical representation of the source are recognized and translated to the source character set before any lexical analysis, preprocessing, or even splicing of continuation lines. During processing, characters appearing in string and character constants are translated to the execution character set before they are interpreted as multibyte sequences.