Swift: Unicode and Character

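This article covers the Unicode concepts that matter in Swift: the range of Unicode scalars, escaped special characters, extended grapheme clusters, and the UTF-8, UTF-16 and UTF-32 encodings. It also covers Character literals, escape sequences and extended grapheme clusters as Character values, and the fact that Character does not support implicit type conversion.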


Unicode

Swift's character type is Character, which is Unicode-based. A String is a sequence of Character values, so it is Unicode-based as well.

Unicode scalar

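A Unicode scalar is a valid Unicode code point: a unique 21-bit number (stored in 32 bits in memory, of which only 21 are used). Unicode scalars cover:
  • [U+0000, U+D7FF]
  • [U+E000, U+10FFFF]
Note: Unicode scalars exclude [U+D800, U+DFFF]. These are the surrogate code points, which are reserved for UTF-16 surrogate pairs and never represent a character on their own.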
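The two ranges above can be checked directly, since the standard library's Unicode.Scalar has a failable initializer taking a UInt32. The following is a minimal sketch; the helper name isUnicodeScalar is hypothetical and not part of the original article.

```swift
// Hypothetical helper: true when `value` is a Unicode scalar value.
func isUnicodeScalar(_ value: UInt32) -> Bool {
    // Unicode.Scalar(_: UInt32) returns nil for surrogate code points
    // and for values above U+10FFFF.
    return Unicode.Scalar(value) != nil
}

print(isUnicodeScalar(0x0041))    // true: inside [U+0000, U+D7FF]
print(isUnicodeScalar(0xD800))    // false: surrogate code point
print(isUnicodeScalar(0xE000))    // true: start of the second range
print(isUnicodeScalar(0x10FFFF))  // true: last valid code point
print(isUnicodeScalar(0x110000))  // false: beyond the Unicode range
```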

special character

  • Escape sequences: \\, \n, \r, \", and so on
  • Unicode scalar: \u{n}, where n is a 1-8 digit hexadecimal number whose value is a valid Unicode code point, that is, a Unicode scalar
func special_char()
{
    let c1: Character = "\""
    let c2: Character = "\\"
    print("c1 = \(c1), c2 = \(c2)")
        
    let c3: Character = "\u{24}"
    let c4: Character = "\u{2665}"
    print("c3 = \(c3), c4 = \(c4)")
}
output:
c1 = ", c2 = \
c3 = $, c4 = ♥

extended grapheme cluster

An extended grapheme cluster is a sequence of one or more Unicode scalars that together form a single character as a reader perceives it.
func extended_grapheme_cluster()
{
    let eAcute: Character = "\u{E9}"                         // é
    let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by ́
    print("eAcute = \(eAcute), combinedEAcute = \(combinedEAcute), \(eAcute == combinedEAcute)")
        
    let precomposed: Character = "\u{D55C}"                  // 한
    let decomposed: Character = "\u{1112}\u{1161}\u{11AB}"   // ᄒ, ᅡ, ᆫ
    print("precomposed = \(precomposed), decomposed = \(decomposed), \(precomposed == decomposed)")
        
    let enclosedEAcute: Character = "\u{E9}\u{20DD}" // enclosedEAcute is é⃝
    print("enclosedEAcute = \(enclosedEAcute)")
}
output:
eAcute = é, combinedEAcute = é, true
precomposed = 한, decomposed = 한, true
enclosedEAcute = é⃝
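To make the "one character, several scalars" point concrete, here is a small sketch that lists the scalars behind the two forms of é. The helper scalarValues(of:) is a hypothetical name introduced for illustration.

```swift
let precomposedE: Character = "\u{E9}"         // one scalar
let decomposedE: Character = "\u{65}\u{301}"   // two scalars

// Hypothetical helper: the scalar values that make up one Character.
func scalarValues(of c: Character) -> [UInt32] {
    return String(c).unicodeScalars.map { $0.value }
}

// Character comparison uses canonical equivalence, so both forms are equal
// even though their scalar sequences differ.
print(precomposedE == decomposedE)         // true
print(scalarValues(of: precomposedE))      // [233]
print(scalarValues(of: decomposedE))       // [101, 769]
print(String(decomposedE).count)           // 1
```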

Unicode encoding

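When Unicode text is written to storage or sent over a network, it has to be encoded. Unicode defines several encoding forms; each one defines a code unit, and every Unicode scalar is encoded as n code units (n >= 1):
  • UTF-8: code unit = 8 bits, represented as UInt8; a scalar is encoded as 1 to 4 UInt8 values
  • UTF-16: code unit = 16 bits, represented as UInt16; a scalar is encoded as 1 or 2 UInt16 values
  • UTF-32: code unit = 32 bits, represented as UInt32; a scalar is encoded as exactly 1 UInt32 value
Note: a Unicode scalar occupies 32 bits in memory, of which only 21 are used.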
func unicode_encode()
{
    let str1: String = "\u{E9}"
    let str2: String = "\u{65}\u{301}"
    print("str1 len = \(str1.count), str2 len = \(str2.count), \(str1 == str2)")
        
    print("str1 encode")
    for codeUnit in str1.utf8
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
        
    for codeUnit in str1.utf16
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
        
    for scalar in str1.unicodeScalars
    {
        print("\(scalar.value)", terminator: "|")
    }
    print("")
        
    print("str2 encode")
    for codeUnit in str2.utf8
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
        
    for codeUnit in str2.utf16
    {
        print("\(codeUnit)", terminator: "|")
    }
    print("")
        
    for scalar in str2.unicodeScalars
    {
        print("\(scalar.value)", terminator: "|")
    }
    print("")
}
output:
str1 len = 1, str2 len = 1, true
str1 encode
195|169|
233|
233|
str2 encode
101|204|129|
101|769|
101|769|
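Summary:
  • The same character is encoded as different byte streams under different encoding forms
  • Because an extended grapheme cluster can be written as different scalar sequences (precomposed or decomposed), the same character can also produce different byte streams under the same encoding form
  • An ASCII character keeps its ASCII value when encoded as UTF-8, UTF-16 or UTF-32, but occupies a different amount of space: one UInt8, one UInt16 or one UInt32 respectively
Note: scalar.value returns the scalar's 21-bit value as a UInt32.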
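The examples above only use scalars inside the Basic Multilingual Plane. As a complementary sketch, the code below compares an ASCII character with U+1F436, which needs four UTF-8 code units and a UTF-16 surrogate pair. The helper codeUnitCounts is a hypothetical name, and the scalar count stands in for the UTF-32 code-unit count, since each scalar is exactly one UTF-32 code unit.

```swift
// Hypothetical helper: number of code units a string needs in each encoding form.
func codeUnitCounts(_ s: String) -> (utf8: Int, utf16: Int, utf32: Int) {
    // unicodeScalars.count equals the UTF-32 code-unit count.
    return (s.utf8.count, s.utf16.count, s.unicodeScalars.count)
}

let ascii = "A"          // U+0041
let dog = "\u{1F436}"    // U+1F436, outside the Basic Multilingual Plane

print(codeUnitCounts(ascii))
print(codeUnitCounts(dog))

print(Array(dog.utf8))                        // [240, 159, 144, 182]
print(Array(dog.utf16))                       // [55357, 56374]
print(dog.unicodeScalars.map { $0.value })    // [128054]
```

The two UTF-16 values are a surrogate pair taken from the [U+D800, U+DFFF] range that Unicode scalars exclude.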

Character

Character is Swift's character type and is Unicode-based.

Literals

A Character literal can take the following forms:
  • A plain character: "a"
  • An escape sequence: "\"", "\\", "\n", "\u{2665}"
  • An extended grapheme cluster: "\u{65}\u{301}", "\u{1112}\u{1161}\u{11AB}"
func char()
{
    let c1: Character = "a"
    
    let c2: Character = "\""
    let c3: Character = "\u{2665}"
    
    let c4: Character = "\u{65}\u{301}"
    let c5: Character = "\u{1112}\u{1161}\u{11AB}"
    //let c6: Character = "\u{1112}\u{1161}\u{11AB}\u{2665}" //not extended grapheme cluster
    
    print("c1 = \(c1)")
    print("c2 = \(c2)")
    print("c3 = \(c3)")
    print("c4 = \(c4)")
    print("c5 = \(c5)")
}
output:
c1 = a
c2 = "
c3 = ♥
c4 = é
c5 = 한
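Summary:
  • A Character is a single character written in double quotes (not single quotes, unlike C). The quotes must contain exactly one character (not zero, not several), which may be an escape sequence or an extended grapheme cluster
  • Under type inference, a single character in double quotes is a String, not a Character; a type annotation is needed to get a Character
  • The compiler checks whether the literal is a single extended grapheme cluster. If it is not, the literal can only be a String, and assigning it to a Character is a compile-time error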
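The inference rule in the summary can be observed with type(of:). This is a small sketch; the constant names are chosen for illustration only.

```swift
let inferred = "a"                // no annotation: inferred as String
let annotated: Character = "a"    // the annotation makes it a Character
let converted = Character("a")    // explicit initializer

print(type(of: inferred))     // String
print(type(of: annotated))    // Character
print(type(of: converted))    // Character

// These do not compile, because the literal is not exactly one character:
// let tooMany: Character = "ab"
// let empty: Character = ""
```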

No implicit type conversion

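Character is a struct that represents a character and nothing else. It is Unicode-based and does not convert implicitly to or from numeric types.
func implicit_convert()
{
    var c: Character = "a"
    var i: Int = 5

    // Neither assignment compiles: Swift has no implicit conversion
    // between Character and Int.
    //c = i
    //i = c
}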
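Conversions between Character and numbers therefore have to be written out explicitly. One possible way is sketched below, assuming Swift 5 or later (where Character.asciiValue is available). The helper names firstScalarValue(of:) and character(fromScalarValue:) are hypothetical and not from the original article.

```swift
// Hypothetical helper: numeric value of the first scalar in a Character.
func firstScalarValue(of c: Character) -> UInt32? {
    return String(c).unicodeScalars.first?.value
}

// Hypothetical helper: build a Character from a scalar value,
// returning nil when the value is not a valid Unicode scalar.
func character(fromScalarValue value: UInt32) -> Character? {
    return Unicode.Scalar(value).map { Character($0) }
}

let letter: Character = "a"

// Character -> number. asciiValue is nil for non-ASCII characters.
print(letter.asciiValue.map { Int($0) } ?? -1)    // 97
print(firstScalarValue(of: letter) ?? 0)          // 97

// Number -> Character.
print(character(fromScalarValue: 0x2665) ?? "?")  // ♥
print(character(fromScalarValue: 0xD800) == nil)  // true: surrogate, not a scalar
```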