An Introduction to Common Web Encodings

Latest recommended article published 2024-05-28 10:15:00


License: CC 4.0 BY-SA


This article introduces the principles behind the encodings commonly used on the web.

Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. As of September 2012, the most recent version is Unicode 6.2.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, which has the same code value in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using two 16-bit units (4 × 8 bit) to handle each of the additional characters.

Unicode defines a codespace of 1,114,112 code points in the range 0 (hex) to 10FFFF (hex). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used; for code points outside the BMP, five or six digits are used, as required.

Code points in the range U+D800..U+DBFF (1,024 code points) are known as high-surrogate code points, and code points in the range U+DC00..U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate) followed by a low-surrogate code point (also known as a trailing surrogate) together form a surrogate pair used in UTF-16 to represent 1,048,576 code points outside the BMP. High and low surrogate code points are not valid by themselves. Thus the range of code points that are available for use as characters is U+0000..U+D7FF and U+E000..U+10FFFF (1,112,064 code points).
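The counts quoted above follow directly from the range endpoints. The snippet below is a small sketch that recomputes them; the names `u_plus` and `is_surrogate` are made up for this illustration and are not part of any standard API.

```python
# Sizes of the Unicode ranges, computed from their endpoints.
CODESPACE = 0x10FFFF - 0x0000 + 1              # every code point
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1          # leading surrogates
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1           # trailing surrogates
SURROGATE_PAIRS = HIGH_SURROGATES * LOW_SURROGATES
USABLE = CODESPACE - HIGH_SURROGATES - LOW_SURROGATES


def is_surrogate(cp: int) -> bool:
    """True if cp lies in U+D800..U+DFFF and is not valid on its own."""
    return 0xD800 <= cp <= 0xDFFF


def u_plus(cp: int) -> str:
    """Format a code point in U+ notation: at least four hex digits."""
    return f"U+{cp:04X}"


print(CODESPACE, SURROGATE_PAIRS, USABLE)
print(u_plus(0x41), u_plus(0x1F600), u_plus(0x10FFFF))
```

The `04X` format pads BMP code points to four digits and lets larger values grow to five or six digits, which matches the convention described above.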

Code point planes and blocks:

The Unicode codespace is divided into seventeen planes, numbered 0 to 16.

All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in Planes 1 through 16 (supplementary planes, or, informally, astral planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.
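A quick way to see this is to ask Python's built-in codecs how much space a character takes. This is only an illustrative sketch; `plane` and `encoded_sizes` are hypothetical helper names, and the sample characters are arbitrary.

```python
def plane(cp: int) -> int:
    """Plane number (0..16) of a code point: each plane holds 0x10000 values."""
    return cp >> 16


def encoded_sizes(ch: str):
    """Return (plane, UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-be" writes no byte order mark, so 2 bytes == 1 code unit.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return plane(ord(ch)), utf8_bytes, utf16_units


for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The first three characters are in the BMP and need one UTF-16 code unit each; the last one is in Plane 1 and needs a surrogate pair in UTF-16 and four bytes in UTF-8.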

UTF-8

UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

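The design of UTF-8 can be seen in this table of the scheme as originally proposed by Dave Prosser and subsequently modified by Ken Thompson (the x's are replaced by the bits of the code point):

Bits	First code point	Last code point	Bytes	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	0xxxxxxx
11	U+0080	U+07FF	2	110xxxxx	10xxxxxx
16	U+0800	U+FFFF	3	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000	U+1FFFFF	4	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
26	U+200000	U+3FFFFFF	5	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
31	U+4000000	U+7FFFFFFF	6	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx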

The salient features of this scheme are as follows:

1. One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0.

2. Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position.

3. The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence, so that the length of the sequence can be determined without examining the continuation bytes.

4. The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the lead byte, lower-order bits in succeeding continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.

The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
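The features listed above can be sketched as a hand-written encoder. This is a minimal illustration limited to the RFC 3629 range (up to four bytes), not a replacement for the built-in codec; the function names `utf8_encode` and `utf8_sequence_length` are invented for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (RFC 3629 range, surrogates rejected)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_sequence_length(lead: int) -> int:
    """Sequence length implied by the high-order bits of a leading byte.

    Only the bit pattern is inspected; bytes such as 0xC0 that can never
    start a valid sequence are not filtered out here.
    """
    if lead & 0x80 == 0x00:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    raise ValueError("continuation byte or obsolete 5/6-byte lead byte")


print(utf8_encode(0x20AC).hex())        # the euro sign, U+20AC
print(utf8_sequence_length(0xE2))
```

Choosing the shortest branch that fits the code point is what keeps the encoding minimal, as required by point 4.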

UTF-16

UTF-16 is described in the Unicode Standard, version 3.0. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value.

The rules for how characters are encoded in UTF-16:

1. Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number.

2. Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area).

3. Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.

Encoding UTF-16

Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.

1. If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2. Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.

3. Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.

4. Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
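The four encoding steps translate almost line for line into code. The sketch below follows them literally; `utf16_encode` is a name chosen for this example, and the function returns 16-bit code units as plain integers rather than bytes, so byte order does not come into play.

```python
from typing import List


def utf16_encode(u: int) -> List[int]:
    """Encode one character value as a list of one or two 16-bit code units."""
    if u < 0 or u > 0x10FFFF:
        raise ValueError("character value cannot be encoded in UTF-16")
    # Step 1: values below 0x10000 are stored as a single code unit.
    if u < 0x10000:
        return [u]
    # Step 2: U' fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: fill the 10 free bits of W1 and W2.
    w1 = 0xD800 | (u_prime >> 10)       # 10 high-order bits of U'
    w2 = 0xDC00 | (u_prime & 0x3FF)     # 10 low-order bits of U'
    return [w1, w2]


print([hex(w) for w in utf16_encode(0x1F600)])
```

For U+1F600, U' is 0xF600, whose high ten bits are 0x3D and low ten bits are 0x200, which gives the pair 0xD83D 0xDE00.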

Decoding UTF-16

Decoding of a single character from UTF-16 to an ISO 10646 character value proceeds as follows. Let W1 be the next 16-bit integer in the sequence of integers representing the text. Let W2 be the (eventual) next integer following W1.

1. If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value of W1. Terminate.

2. Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in error and no valid character can be obtained using W1. Terminate.

3. If there is no W2 (that is, the sequence ends with W1), or if W2 is not between 0xDC00 and 0xDFFF, the sequence is in error. Terminate.

4. Construct a 20-bit unsigned integer U', taking the 10 low-order bits of W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 low-order bits.

5. Add 0x10000 to U' to obtain the character value U. Terminate.
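The decoding steps can be sketched the same way. In this illustration the error cases raise `ValueError`; that choice, the name `utf16_decode_one`, and the `(value, next_position)` return shape are assumptions of the example, not something the procedure prescribes.

```python
from typing import Sequence, Tuple


def utf16_decode_one(units: Sequence[int], pos: int = 0) -> Tuple[int, int]:
    """Decode the character starting at units[pos].

    Returns (character value, index of the next unread code unit).
    """
    w1 = units[pos]
    # Step 1: not a surrogate, so the unit is the character itself.
    if w1 < 0xD800 or w1 > 0xDFFF:
        return w1, pos + 1
    # Step 2: a sequence must not start with a low surrogate.
    if not 0xD800 <= w1 <= 0xDBFF:
        raise ValueError("unexpected low surrogate")
    # Step 3: a high surrogate needs a low surrogate right after it.
    if pos + 1 >= len(units):
        raise ValueError("sequence ends after a high surrogate")
    w2 = units[pos + 1]
    if not 0xDC00 <= w2 <= 0xDFFF:
        raise ValueError("high surrogate not followed by a low surrogate")
    # Steps 4 and 5: rebuild the 20-bit U' and add the offset back.
    u_prime = ((w1 & 0x3FF) << 10) | (w2 & 0x3FF)
    return u_prime + 0x10000, pos + 2


print(utf16_decode_one([0xD83D, 0xDE00]))
```

Returning the next position lets a caller walk a whole sequence by calling the function in a loop until the index reaches the end.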

Byte order mark

The Unicode Standard and ISO 10646 define the character "ZERO WIDTH NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER MARK" (abbreviated "BOM").

The order is big-endian if the first two octets are 0xFE followed by 0xFF; if they are 0xFF followed by 0xFE, the order is little-endian.
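The check on the first two octets can be sketched as below. The helper name `detect_utf16_byte_order` is hypothetical, and the sketch only looks at the UTF-16 case described above; it does not try to tell UTF-16 apart from other encodings whose marks begin with the same bytes.

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the byte order signalled by a leading BOM, or None if absent."""
    if data[:2] == codecs.BOM_UTF16_BE:     # 0xFE 0xFF
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:     # 0xFF 0xFE
        return "little-endian"
    return None


print(detect_utf16_byte_order(b"\xfe\xff\x00A"))
print(detect_utf16_byte_order(b"\xff\xfeA\x00"))
print(detect_utf16_byte_order("A".encode("utf-16-le")))   # no BOM written
```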

UTF-7

1. Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally.

Direct characters:

Character	ASCII & Unicode value (decimal)
'	39
(	40
)	41
,	44
-	45
.	46
/	47
:	58
?	63

2. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.

Optional direct characters:

Character	ASCII & Unicode value (decimal)
!	33
"	34
#	35
$	36
%	37
&	38
*	42
;	59
<	60
=	61
>	62
@	64
[	91
]	93
^	94
_	95
`	96
{	123
|	124
}	125

3. Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail.

4. The plus sign (+) may be encoded as +-

5. Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64 (no pad character). The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the Base64. Note that if the first character after the shifted sequence is "-" then an extra "-" must be present to terminate the shifted sequence so that the actual "-" is not itself absorbed.

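The three tables below trace the UTF-16 code units 0041, 0042 and 0043 (the text "ABC") through modified Base64. The 48 bits are read as one continuous stream of 6-bit groups, so a group can straddle two code units; the result is AEEAQgBD.

First code unit (0041) plus the first 2 bits of the next one:

Hex	0	0	4	1	0 (first 2 bits)
Bits	0000	0000	0100	0001	00
6-bit groups	000000	000100	000100
Index	0	4	4
Base64	A	E	E

Rest of the second code unit (0042) plus the first 4 bits of the third:

Hex	0 (last 2 bits)	0	4	2	0
Bits	00	0000	0100	0010	0000
6-bit groups	000000	010000	100000
Index	0	16	32
Base64	A	Q	g

Remaining 12 bits of the third code unit (0043):

Hex	0	4	3
Bits	0000	0100	0011
6-bit groups	000001	000011
Index	1	3
Base64	B	D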

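BOM: in UTF-7 the byte order mark U+FEFF appears as +/v8, +/v9, +/v+ or +/v/, depending on the bits of the character that follows it.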
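Rule 5 can be sketched with the standard `base64` module: encode the text as big-endian UTF-16, Base64 it, drop the padding, and wrap the result in + and -. The helper `utf7_shifted` is a made-up name, and it always shifts its whole input, which a real UTF-7 encoder would not do for direct characters.

```python
import base64


def utf7_shifted(text: str) -> str:
    """Encode text as a single UTF-7 shifted sequence: '+' modified-Base64 '-'."""
    utf16 = text.encode("utf-16-be")            # surrogates for U+10000 and up
    b64 = base64.b64encode(utf16).decode("ascii")
    return "+" + b64.rstrip("=") + "-"          # modified Base64 has no padding


print(utf7_shifted("ABC"))
print(utf7_shifted("<"))
# Python ships a UTF-7 codec that can be used to check the result.
print(b"+ADw-".decode("utf-7"))
```

The sequence +ADw- for the less-than sign shows why UTF-7 matters on the web: markup characters can be written without the bytes that a filter looks for.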

GBK

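Characters are encoded in either one or two bytes. Values in the range 00–7F are single-byte and identical to ASCII; strictly speaking, this range holds 96 characters and 32 control codes.

In a two-byte code, the first byte is the lead byte. Overall, the lead byte ranges over 81–FE (that is, 80 and FF are excluded); the second byte falls in 40–7E for part of the code space and in 80–FE for the rest.

Specifically, the following byte ranges are defined: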
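The lead-byte and second-byte ranges given in the text (81–FE, then 40–7E or 80–FE) are enough to split a GBK byte string into characters. The function below is a sketch under that assumption only: `split_gbk` is a hypothetical name, and it checks byte ranges without verifying that each pair is actually assigned to a character.

```python
from typing import List


def split_gbk(data: bytes) -> List[bytes]:
    """Split GBK-encoded bytes into one- and two-byte character codes."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:                              # single byte, same as ASCII
            out.append(data[i:i + 1])
            i += 1
        elif 0x81 <= b <= 0xFE:                    # lead byte of a pair
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte code")
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFE):
                raise ValueError(f"invalid second byte {t:#04x}")
            out.append(data[i:i + 2])
            i += 2
        else:                                      # 0x80 and 0xFF are excluded
            raise ValueError(f"invalid lead byte {b:#04x}")
    return out


print(split_gbk("a\u4e2d".encode("gbk")))
```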

Base64

The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.

The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8-bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single digit in the base 64 alphabet.

Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string.

Special processing is performed if fewer than 24 bits are available at the end of the data being encoded. A full encoding quantum is always completed at the end of a quantity. When fewer than 24 input bits are available in an input group, zero bits are added (on the right) to form an integral number of 6-bit groups. Padding at the end of the data is performed using the '=' character. Since all base 64 input is an integral number of octets, only the following cases can arise:

1. the final quantum of encoding input is an integral multiple of 24 bits; here, the final unit of encoded output will be an integral multiple of 4 characters with no "=" padding,

2. the final quantum of encoding input is exactly 8 bits; here, the final unit of encoded output will be two characters followed by two "=" padding characters, or

3. the final quantum of encoding input is exactly 16 bits; here, the final unit of encoded output will be three characters followed by one "=" padding character.
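The three padding cases are easiest to see in a hand-written encoder. The sketch below assumes the standard alphabet (A–Z, a–z, 0–9, +, /); `b64_encode` is a name chosen for this example, and in practice the standard library's `base64.b64encode` does the same job.

```python
import base64
import string

# The 64 printable characters, indexed by the value of each 6-bit group.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_encode(data: bytes) -> str:
    """Base64-encode data, 24 bits (3 octets) at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Add zero bits on the right so a short final chunk still fills 24 bits.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 octet -> 2 characters, 2 octets -> 3, 3 octets -> 4; pad the rest.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)


for sample in (b"Man", b"Ma", b"M"):
    print(sample, b64_encode(sample), base64.b64encode(sample).decode("ascii"))
```

The three samples exercise cases 1, 3 and 2 in turn: no padding, one "=" and two "=".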

HTML

Some characters are reserved in HTML.

It is not possible to use the less than (<) or greater than (>) signs in your text, because the browser will mix them up with tags.

To actually display reserved characters, we must use character entities in the HTML source code.

A character entity looks like this:

&entity_name;

&#entity_number;

The advantage of using an entity name, instead of a number, is that the name is easier to remember. However, the disadvantage is that browsers may not support all entity names (the support for entity numbers is very good).
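Python's standard `html` module can be used to try both entity forms. This is just a short illustration; the sample strings are arbitrary.

```python
import html

# Escaping replaces reserved characters with named entities.
escaped = html.escape("1 < 2 & 3 > 2")
print(escaped)

# Unescaping accepts entity names as well as decimal and hex entity numbers.
print(html.unescape("&lt;"), html.unescape("&#60;"), html.unescape("&#x3C;"))
```

All three forms in the last line decode to the same less-than sign, which is why a numeric reference can stand in whenever a name is not supported.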