Unicode
In Swift the character type is Character, which is Unicode-based. A String is a sequence of Character values, so it is Unicode-based as well.
Unicode scalar
A Unicode scalar is a valid Unicode code point, identified by a unique 21-bit number (it occupies 32 bits in memory, of which only 21 are used). The Unicode scalars are:
- [U+0000, U+D7FF]
- [U+E000, U+10FFFF]
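Note: Unicode scalars exclude [U+D800, U+DFFF]. These are the surrogate code points, which UTF-16 uses in pairs to encode code points above U+FFFF; they never stand for a character on their own.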
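The two ranges above can be checked from code. A minimal sketch, assuming Swift 5: the failable initializer Unicode.Scalar(_: UInt32) returns nil for any value that is not a Unicode scalar. The helper name isUnicodeScalar is made up for this example.

```swift
// Hypothetical helper: true when the value is a valid Unicode scalar.
func isUnicodeScalar(_ value: UInt32) -> Bool
{
    // Unicode.Scalar(_: UInt32) is failable and returns nil for
    // surrogates and for values above U+10FFFF.
    return Unicode.Scalar(value) != nil
}

let candidates: [UInt32] = [0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000]
for value in candidates
{
    print("U+" + String(value, radix: 16, uppercase: true), isUnicodeScalar(value))
}
```

Only 0xD800, 0xDFFF and 0x110000 are rejected; every other value in the list falls inside one of the two scalar ranges.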
special character
- Escape sequences: \\, \n, \r, \", and so on
- Unicode scalar: \u{n}, where n is a 1-8 digit hexadecimal number equal to a valid Unicode code point, i.e. a Unicode scalar
func special_char()
{
    // escape sequences
    let c1: Character = "\""
    let c2: Character = "\\"
    print("c1 = \(c1), c2 = \(c2)")
    // Unicode scalars written as \u{n}
    let c3: Character = "\u{24}"
    let c4: Character = "\u{2665}"
    print("c3 = \(c3), c4 = \(c4)")
}
output:
c1 = ", c2 = \
c3 = $, c4 = ♥
extended grapheme cluster
An extended grapheme cluster is a sequence of one or more Unicode scalars that together form a single character.
func extended_grapheme_cluster()
{
    let eAcute: Character = "\u{E9}" // é
    let combinedEAcute: Character = "\u{65}\u{301}" // e followed by ́
    print("eAcute = \(eAcute), combinedEAcute = \(combinedEAcute), \(eAcute == combinedEAcute)")
    let precomposed: Character = "\u{D55C}" // 한
    let decomposed: Character = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ
    print("precomposed = \(precomposed), decomposed = \(decomposed), \(precomposed == decomposed)")
    let enclosedEAcute: Character = "\u{E9}\u{20DD}" // enclosedEAcute is é⃝
    print("enclosedEAcute = \(enclosedEAcute)")
}
output:
eAcute = é, combinedEAcute = é, true
precomposed = 한, decomposed = 한, true
enclosedEAcute = é⃝
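One consequence of extended grapheme clusters is that appending a scalar to a String does not always add a character. A small sketch, assuming Swift 5 (where String.count counts Characters); the variable names are only for illustration.

```swift
var word = "cafe"
let before = word.count                        // 4 characters
word += "\u{301}"                              // COMBINING ACUTE ACCENT joins the final "e"
let after = word.count                         // still 4 characters
let scalarCount = word.unicodeScalars.count    // but 5 Unicode scalars
print(word, before, after, scalarCount)
```

The character count stays at 4 because "e" and U+0301 form a single extended grapheme cluster, while the scalar view sees both pieces.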
Unicode encode
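When Unicode text is written to external storage or sent over a network, it has to be encoded. Unicode defines several encoding forms, each with its own code unit; every Unicode scalar is encoded as n code units (n >= 1):
- UTF-8: code unit = 8 bits, represented as UInt8; a scalar is encoded as 1 to 4 UInt8 values
- UTF-16: code unit = 16 bits, represented as UInt16; a scalar is encoded as 1 or 2 UInt16 values
- UTF-32: code unit = 32 bits, represented as UInt32; a scalar is always encoded as exactly 1 UInt32 value
Note: a Unicode scalar occupies 32 bits in memory but only uses 21 of them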
func unicode_encode()
{
    let str1: String = "\u{E9}"
    let str2: String = "\u{65}\u{301}"
    // String.count counts Characters (the older str.characters.count is no longer available)
    print("str1 len = \(str1.count), str2 len = \(str2.count), \(str1 == str2)")
    print("str1 encode")
    // UTF-8 code units
    for codeUnit in str1.utf8
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
    // UTF-16 code units
    for codeUnit in str1.utf16
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
    // Unicode scalars (equivalent to UTF-32 code units)
    for scalar in str1.unicodeScalars
    {
        print("\(scalar.value)", terminator: "|")
    }
    print("")
    print("str2 encode")
    for codeUnit in str2.utf8
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
    for codeUnit in str2.utf16
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
    for scalar in str2.unicodeScalars
    {
        print("\(scalar.value)", terminator: "|")
    }
    print("")
}
output:
str1 len = 1, str2 len = 1, true
str1 encode
195|169|
233|
233|
str2 encode
101|204|129|
101|769|
101|769|
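Summary:
- The same character is encoded as different byte streams under different encoding forms
- Because of extended grapheme clusters, the same character may be encoded as different byte streams even under the same encoding form (str1 and str2 above compare equal but differ in every encoding)
- An ASCII character keeps its ASCII value as its code unit in UTF-8, UTF-16 and UTF-32, but the storage differs: one UInt8, one UInt16 and one UInt32 respectively
Note: scalar.value returns the scalar's numeric value as a UInt32, of which only the low 21 bits are used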
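The number of code units per character depends on where its code point sits. A rough sketch, assuming Swift 5; codeUnitCounts is a hypothetical helper and the three sample characters are chosen only for illustration.

```swift
// Hypothetical helper: how many code units a string needs in each view.
func codeUnitCounts(_ s: String) -> (utf8: Int, utf16: Int, scalars: Int)
{
    return (s.utf8.count, s.utf16.count, s.unicodeScalars.count)
}

let ascii = codeUnitCounts("A")           // U+0041, ASCII
let heart = codeUnitCounts("\u{2665}")    // U+2665, inside the Basic Multilingual Plane
let dog = codeUnitCounts("\u{1F436}")     // U+1F436, above U+FFFF
print(ascii, heart, dog)

// For ASCII the code unit value is the same (65) in all three views.
print(Array("A".utf8), Array("A".utf16), "A".unicodeScalars.map { $0.value })
```

U+1F436 needs four UTF-8 code units and a surrogate pair (two code units) in UTF-16, yet it is still a single Unicode scalar.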
Character
Character is Swift's character type and is Unicode-based
Literals
A Character literal can take the following forms:
- Plain character: "a"
- Escape sequence: "\"", "\\", "\n", "\u{2665}"
- Extended grapheme cluster: "\u{65}\u{301}", "\u{1112}\u{1161}\u{11AB}"
func char()
{
    let c1: Character = "a"
    let c2: Character = "\""
    let c3: Character = "\u{2665}"
    let c4: Character = "\u{65}\u{301}"
    let c5: Character = "\u{1112}\u{1161}\u{11AB}"
    //let c6: Character = "\u{1112}\u{1161}\u{11AB}\u{2665}" // two characters, not a single extended grapheme cluster
    print("c1 = \(c1)")
    print("c2 = \(c2)")
    print("c3 = \(c3)")
    print("c4 = \(c4)")
    print("c5 = \(c5)")
}
output:
c1 = a
c2 = "
c3 = ♥
c4 = é
c5 = 한
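Summary:
- A Character is a single character enclosed in double quotes (not single quotes, unlike C). The quotes must enclose exactly one character (not zero, not several), which may be an escape sequence or an extended grapheme cluster
- Under type inference, a double-quoted single character is a String, not a Character
- The compiler checks whether the literal is a single extended grapheme cluster. If it is not, the literal can only be a String and is rejected where a Character is required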
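The type inference rule can be seen with type(of:). A short sketch, assuming Swift 5; the constant names are only for illustration.

```swift
let inferred = "a"                 // no annotation: inferred as String
let annotated: Character = "a"     // the annotation is what makes it a Character
let converted = Character("a")     // the initializer works as well

print(type(of: inferred))          // String
print(type(of: annotated))         // Character
print(converted == annotated, String(annotated) == inferred)
```

Going from Character back to String takes an explicit String(annotated).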
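No implicit type conversion
Character is a struct that only represents a character (Unicode-based). It does not support implicit conversion to or from numeric types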
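func implicit_convert()
{
    var c: Character = "a"
    var i: Int = 5
    //c = i // does not compile: no implicit Int -> Character conversion
    //i = c // does not compile: no implicit Character -> Int conversion
}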
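The conversion is still possible, but it has to be explicit and goes through Unicode.Scalar. A possible sketch, assuming Swift 5 (Character.asciiValue is not available in older versions); the constant names are only for illustration.

```swift
let c: Character = "a"

// Character -> number: read the scalar value, or asciiValue for ASCII characters.
let scalarValue: UInt32 = c.unicodeScalars.first!.value   // a Character always has at least one scalar
let asciiValue: UInt8? = c.asciiValue                     // nil when the character is not ASCII

// number -> Character: build a Unicode.Scalar first.
let fromByte = Character(Unicode.Scalar(UInt8(98)))       // the UInt8 initializer cannot fail
let fromCodePoint = Unicode.Scalar(UInt32(0x2665)).map { Character($0) }   // the UInt32 one is failable

print(scalarValue, asciiValue as Any, fromByte, fromCodePoint as Any)
```

The UInt32 path returns an optional because not every 32-bit value is a Unicode scalar, for example the surrogate range [U+D800, U+DFFF].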