字符编码完全指南：从ASCII到Unicode的全面解析

最新推荐文章于 2025-12-16 16:02:25 发布

原创最新推荐文章于 2025-12-16 16:02:25 发布 · 1k 阅读

16 ·

CC 4.0 BY-SA版权

文章标签：

#车载系统

软件开发基础专栏收录该内容

1 篇文章

订阅专栏

字符编码完全指南：从ASCII到Unicode的全面解析

字符编码基础概念

什么是字符编码？

字符编码是将字符映射到数字的规则系统。计算机只能处理数字，因此需要编码来将人类可读的字符转换为计算机可处理的数字。

编码的基本原理

字符 → 编码规则 → 数字 → 存储/传输 → 解码规则 → 字符

关键术语

字符集（Character Set）：定义哪些字符可以被表示
编码方案（Encoding Scheme）：定义如何将字符映射到数字
字节序（Byte Order）：多字节字符的存储顺序

主要编码类型详解

1. ASCII (American Standard Code for Information Interchange)

基本信息

发布时间：1963年
编码长度：7位（实际使用1字节）
字符数量：128个字符
字符范围：0-127

字符组成

0-31:   控制字符（不可打印）
32-126: 可打印字符
127:    删除字符

包含：
- 英文字母：A-Z (65-90), a-z (97-122)
- 数字：0-9 (48-57)
- 标点符号：!@#$%^&*()等
- 空格：32

适用场景

✅ 纯英文文本处理
✅ 配置文件
✅ 网络协议
✅ 嵌入式系统基础通信
❌ 多语言支持

实际应用示例

// ASCII字符串
const char* ascii_text = "Hello World!";
// 存储：48 65 6C 6C 6F 20 57 6F 72 6C 64 21

2. ISO-8859-1 (Latin-1)

基本信息

编码长度：8位（1字节）
字符数量：256个字符
字符范围：0-255

字符组成

0-127:   ASCII字符
128-255: 西欧扩展字符

适用场景

✅ 西欧语言（法语、德语、西班牙语等）
✅ 早期网页编码
✅ 数据库字段
❌ 中文、日文、韩文

实际应用示例

// Latin-1字符串
const char* latin1_text = "Café naïve";  // é = 0xE9, ï = 0xEF

3. UTF-8 (8-bit Unicode Transformation Format)

基本信息

发布时间：1993年
编码长度：变长（1-4字节）
字符数量：支持所有Unicode字符
兼容性：完全兼容ASCII

编码规则

1字节: 0xxxxxxx (ASCII字符)
2字节: 110xxxxx 10xxxxxx
3字节: 1110xxxx 10xxxxxx 10xxxxxx
4字节: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

字符示例

'A' → 0x41 (1字节)
'中' → 0xE4 0xB8 0xAD (3字节)
'��' → 0xF0 0x9F 0x8C 0x8D (4字节)

适用场景

✅ 现代Web应用
✅ 移动应用
✅ 数据库存储
✅ 网络传输
✅ 跨平台开发
✅ 国际化应用

实际应用示例

// UTF-8字符串
const char* utf8_text = "Hello 世界 🌍";
// 存储：48 65 6C 6C 6F 20 E4 B8 96 E7 95 8C 20 F0 9F 8C 8D

4. UTF-16 (16-bit Unicode Transformation Format)

基本信息

编码长度：变长（2-4字节）
字符数量：支持所有Unicode字符
字节序：需要处理大端/小端

编码规则

基本多文种平面 (BMP): 2字节 (U+0000-U+FFFF)
辅助平面: 4字节 (U+10000-U+10FFFF)

字符示例

'A' → 0x0041 (2字节)
'中' → 0x4E2D (2字节)
'🌍' → 0xD83C 0xDF0D (4字节，代理对)

适用场景

✅ Windows系统内部
✅ Java内部字符串
✅ .NET Framework
✅ 需要高效中文处理
❌ 网络传输（字节序问题）

实际应用示例

// Windows中的UTF-16
wchar_t* utf16_text = L"Hello 世界";
// 存储：00 48 00 65 00 6C 00 6C 00 6F 00 20 4E 2D 75 4C

5. UTF-32 (32-bit Unicode Transformation Format)

基本信息

编码长度：定长（4字节）
字符数量：支持所有Unicode字符
优点：处理简单
缺点：空间浪费

适用场景

✅ 内部字符处理
✅ 字符分析算法
✅ 需要快速随机访问
❌ 存储和传输

6. GB2312/GBK/GB18030 (中文编码)

GB2312

编码长度：2字节
字符数量：6763个汉字
发布时间：1980年

GBK

编码长度：变长（1-2字节）
字符数量：21003个汉字
发布时间：1995年

GB18030

编码长度：变长（1-4字节）
字符数量：支持所有Unicode字符
发布时间：2000年

适用场景

✅ 中国大陆系统
✅ 中文文档处理
✅ 传统中文软件
❌ 国际化应用

实际应用示例

// GB2312字符串
const char* gb2312_text = "你好世界";
// 存储：C4 E3 BA C3 CA C0 BD E7

7. Big5 (繁体中文编码)

基本信息

编码长度：2字节
字符数量：13053个汉字
发布时间：1984年

适用场景

✅ 台湾地区
✅ 香港地区
✅ 繁体中文系统
❌ 简体中文

8. Shift_JIS (日文编码)

基本信息

编码长度：变长（1-2字节）
字符数量：约8000个字符
发布时间：1978年

适用场景

✅ 日文系统
✅ 日文软件
✅ 传统日文应用

9. EUC-KR (韩文编码)

基本信息

编码长度：变长（1-2字节）
字符数量：约8000个字符

适用场景

✅ 韩文系统
✅ 韩文软件

编码对比分析

详细对比表

编码类型	英文字符	中文字符	日文字符	存储效率	处理复杂度	兼容性	现代支持
ASCII	1字节	不支持	不支持	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐
ISO-8859-1	1字节	不支持	不支持	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐
UTF-8	1字节	3字节	3字节	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
UTF-16	2字节	2字节	2字节	⭐⭐⭐	⭐⭐	⭐⭐	⭐⭐⭐⭐
UTF-32	4字节	4字节	4字节	⭐	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐
GB2312	1字节	2字节	不支持	⭐⭐⭐⭐	⭐⭐⭐	⭐	⭐⭐
Big5	1字节	2字节	不支持	⭐⭐⭐⭐	⭐⭐⭐	⭐	⭐⭐
Shift_JIS	1字节	不支持	2字节	⭐⭐⭐⭐	⭐⭐	⭐	⭐⭐

性能对比

存储空间对比（以"Hello世界"为例）

ASCII:     5字节 (Hello) + 无法表示中文
UTF-8:     5字节 (Hello) + 6字节 (世界) = 11字节
UTF-16:    10字节 (Hello) + 4字节 (世界) = 14字节
UTF-32:    20字节 (Hello) + 8字节 (世界) = 28字节
GB2312:    5字节 (Hello) + 4字节 (世界) = 9字节

处理速度对比

UTF-32: 最快（定长，直接索引）
UTF-16: 快（大部分字符2字节）
UTF-8:  中等（需要解析变长）
GB2312: 中等（需要解析变长）

实际应用场景

1. Web开发

前端开发

<!-- HTML文档声明 -->
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">  <!-- 推荐使用UTF-8 -->
    <title>多语言网站</title>
</head>

后端开发

// Node.js示例
const express = require('express');
const app = express();

// 设置编码
app.use(express.json({ charset: 'utf-8' }));

// 处理多语言
app.get('/api/text', (req, res) => {
    const text = "Hello 世界 🌍";
    res.setHeader('Content-Type', 'application/json; charset=utf-8');
    res.json({ message: text });
});

数据库设计

-- MySQL数据库
CREATE TABLE articles (
    id INT PRIMARY KEY,
    title VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

2. 移动应用开发

Android开发

// Java字符串默认使用UTF-16
String text = "Hello 世界";
byte[] utf8Bytes = text.getBytes("UTF-8");

// 网络传输使用UTF-8
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Content-Type", "application/json; charset=utf-8");

iOS开发

// Swift字符串使用UTF-8
let text = "Hello 世界"
let utf8Data = text.data(using: .utf8)

// 网络请求
var request = URLRequest(url: url)
request.setValue("application/json; charset=utf-8", forHTTPHeaderField: "Content-Type")

3. 嵌入式系统开发

C/C++嵌入式开发

// 嵌入式系统推荐使用UTF-8
class TextProcessor {
private:
    std::string m_text;
    
public:
    // 设置文本（UTF-8）
    void SetText(const char* text) {
        m_text = std::string(text);
    }
    
    // 获取文本长度（字符数，非字节数）
    size_t GetCharCount() {
        size_t count = 0;
        for (size_t i = 0; i < m_text.length(); ) {
            if ((m_text[i] & 0x80) == 0) {
                // ASCII字符
                i++;
            } else if ((m_text[i] & 0xE0) == 0xC0) {
                // 2字节UTF-8字符
                i += 2;
            } else if ((m_text[i] & 0xF0) == 0xE0) {
                // 3字节UTF-8字符
                i += 3;
            } else if ((m_text[i] & 0xF8) == 0xF0) {
                // 4字节UTF-8字符
                i += 4;
            } else {
                // 无效字符
                i++;
            }
            count++;
        }
        return count;
    }
};

协议通信

// 网络协议中的文本处理
class ProtocolHandler {
public:
    // 解析UTF-8文本
    bool ParseText(const uint8_t* data, size_t length, std::string& text) {
        // 验证UTF-8格式
        if (!IsValidUTF8(data, length)) {
            return false;
        }
        
        text.assign(reinterpret_cast<const char*>(data), length);
        return true;
    }
    
private:
    bool IsValidUTF8(const uint8_t* data, size_t length) {
        for (size_t i = 0; i < length; ) {
            if ((data[i] & 0x80) == 0) {
                // ASCII字符
                i++;
            } else if ((data[i] & 0xE0) == 0xC0) {
                // 2字节字符
                if (i + 1 >= length || (data[i + 1] & 0xC0) != 0x80) {
                    return false;
                }
                i += 2;
            } else if ((data[i] & 0xF0) == 0xE0) {
                // 3字节字符
                if (i + 2 >= length || 
                    (data[i + 1] & 0xC0) != 0x80 || 
                    (data[i + 2] & 0xC0) != 0x80) {
                    return false;
                }
                i += 3;
            } else if ((data[i] & 0xF8) == 0xF0) {
                // 4字节字符
                if (i + 3 >= length || 
                    (data[i + 1] & 0xC0) != 0x80 || 
                    (data[i + 2] & 0xC0) != 0x80 || 
                    (data[i + 3] & 0xC0) != 0x80) {
                    return false;
                }
                i += 4;
            } else {
                return false;
            }
        }
        return true;
    }
};

4. 桌面应用开发

Windows应用

// Windows API中的字符处理
#include <windows.h>
#include <string>

class WindowsTextHandler {
public:
    // UTF-8转UTF-16
    std::wstring UTF8ToUTF16(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        
        int size = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        if (size <= 0) return std::wstring();
        
        std::wstring utf16(size - 1, 0);
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], size);
        return utf16;
    }
    
    // UTF-16转UTF-8
    std::string UTF16ToUTF8(const std::wstring& utf16) {
        if (utf16.empty()) return std::string();
        
        int size = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, nullptr, 0, nullptr, nullptr);
        if (size <= 0) return std::string();
        
        std::string utf8(size - 1, 0);
        WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, &utf8[0], size, nullptr, nullptr);
        return utf8;
    }
};

Linux应用

// Linux中的字符处理
#include <locale>
#include <codecvt>

class LinuxTextHandler {
public:
    // UTF-8字符串处理
    void ProcessUTF8Text(const std::string& text) {
        // 设置locale
        std::locale::global(std::locale("en_US.UTF-8"));
        
        // 转换为宽字符
        std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
        std::wstring wide_text = converter.from_bytes(text);
        
        // 处理宽字符
        ProcessWideText(wide_text);
    }
    
private:
    void ProcessWideText(const std::wstring& text) {
        // 处理宽字符文本
        for (wchar_t c : text) {
            // 字符处理逻辑
        }
    }
};

5. 游戏开发

Unity游戏引擎

// Unity中的多语言支持
using UnityEngine;
using System.Text;

public class LocalizationManager : MonoBehaviour {
    private Dictionary<string, string> translations = new Dictionary<string, string>();
    
    public void LoadTranslations(string language) {
        // 从UTF-8编码的文件加载翻译
        string filePath = $"Translations/{language}.json";
        string jsonContent = System.IO.File.ReadAllText(filePath, Encoding.UTF8);
        
        // 解析JSON
        var translations = JsonUtility.FromJson<TranslationData>(jsonContent);
        // 处理翻译数据
    }
    
    public string GetText(string key) {
        return translations.ContainsKey(key) ? translations[key] : key;
    }
}

Unreal Engine

// Unreal Engine中的文本处理
#include "Internationalization/Text.h"
#include "Internationalization/Internationalization.h"

class UETextHandler {
public:
    // 创建本地化文本
    FText CreateLocalizedText(const FString& key) {
        return FInternationalization::Get().ForUseOnlyByLocMacroAndGraphNodeTextLiterals_CreateText(
            *key, TEXT("GameNamespace"), key
        );
    }
    
    // 处理多语言文本
    void ProcessMultilingualText(const FString& text) {
        // 确保文本使用UTF-8编码
        FString utf8Text = text;
        
        // 处理文本
        ProcessText(utf8Text);
    }
};

编码转换与处理

1. C++中的编码转换

使用标准库

#include <locale>
#include <codecvt>
#include <string>

class EncodingConverter {
public:
    // UTF-8转UTF-16
    static std::wstring UTF8ToUTF16(const std::string& utf8) {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        return converter.from_bytes(utf8);
    }
    
    // UTF-16转UTF-8
    static std::string UTF16ToUTF8(const std::wstring& utf16) {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        return converter.to_bytes(utf16);
    }
    
    // UTF-8转UTF-32
    static std::u32string UTF8ToUTF32(const std::string& utf8) {
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
        return converter.from_bytes(utf8);
    }
};

使用第三方库（ICU）

#include <unicode/ucnv.h>
#include <unicode/ustring.h>

class ICUConverter {
public:
    // 使用ICU进行编码转换
    static std::string Convert(const std::string& input, 
                              const char* fromEncoding, 
                              const char* toEncoding) {
        UErrorCode status = U_ZERO_ERROR;
        
        // 创建转换器
        UConverter* converter = ucnv_open(fromEncoding, &status);
        if (U_FAILURE(status)) {
            return "";
        }
        
        // 执行转换
        const char* source = input.c_str();
        int32_t sourceLength = input.length();
        
        // 计算目标长度
        int32_t targetLength = ucnv_toUChars(converter, nullptr, 0, 
                                           source, sourceLength, &status);
        if (status == U_BUFFER_OVERFLOW_ERROR) {
            status = U_ZERO_ERROR;
        }
        
        // 分配缓冲区
        UChar* target = new UChar[targetLength + 1];
        ucnv_toUChars(converter, target, targetLength + 1, 
                     source, sourceLength, &status);
        
        // 清理
        ucnv_close(converter);
        
        if (U_FAILURE(status)) {
            delete[] target;
            return "";
        }
        
        // 转换为UTF-8
        std::string result = UCharToUTF8(target, targetLength);
        delete[] target;
        
        return result;
    }
    
private:
    static std::string UCharToUTF8(const UChar* uchars, int32_t length) {
        UErrorCode status = U_ZERO_ERROR;
        
        // 计算UTF-8长度
        int32_t utf8Length = 0;
        u_strToUTF8(nullptr, 0, &utf8Length, uchars, length, &status);
        if (status == U_BUFFER_OVERFLOW_ERROR) {
            status = U_ZERO_ERROR;
        }
        
        // 转换
        char* utf8 = new char[utf8Length + 1];
        u_strToUTF8(utf8, utf8Length + 1, nullptr, uchars, length, &status);
        
        std::string result(utf8);
        delete[] utf8;
        
        return result;
    }
};

2. Python中的编码处理

# Python中的编码转换
import codecs

class PythonEncodingHandler:
    @staticmethod
    def convert_encoding(text, from_encoding, to_encoding):
        """转换文本编码"""
        try:
            # 解码
            decoded = text.decode(from_encoding)
            # 编码
            encoded = decoded.encode(to_encoding)
            return encoded
        except UnicodeDecodeError as e:
            print(f"解码错误: {e}")
            return None
        except UnicodeEncodeError as e:
            print(f"编码错误: {e}")
            return None
    
    @staticmethod
    def detect_encoding(data):
        """检测编码"""
        import chardet
        result = chardet.detect(data)
        return result['encoding'], result['confidence']
    
    @staticmethod
    def safe_decode(text, encodings=['utf-8', 'gb2312', 'big5', 'shift_jis']):
        """安全解码，尝试多种编码"""
        for encoding in encodings:
            try:
                return text.decode(encoding)
            except UnicodeDecodeError:
                continue
        return text.decode('utf-8', errors='replace')  # 使用替换模式

3. JavaScript中的编码处理

// JavaScript中的编码处理
class JavaScriptEncodingHandler {
    // UTF-8字符串处理
    static processUTF8String(str) {
        // JavaScript字符串内部使用UTF-16
        // 但可以处理UTF-8编码的数据
        
        // 从UTF-8字节数组创建字符串
        const utf8Bytes = new Uint8Array([0x48, 0x65, 0x6C, 0x6C, 0x6F]); // "Hello"
        const decoder = new TextDecoder('utf-8');
        const text = decoder.decode(utf8Bytes);
        
        return text;
    }
    
    // 字符串转UTF-8字节数组
    static stringToUTF8Bytes(str) {
        const encoder = new TextEncoder();
        return encoder.encode(str);
    }
    
    // Base64编码（常用于网络传输）
    static encodeBase64(str) {
        return btoa(unescape(encodeURIComponent(str)));
    }
    
    static decodeBase64(base64) {
        return decodeURIComponent(escape(atob(base64)));
    }
}

选择建议

1. 按应用类型选择

Web应用

推荐：UTF-8
原因：
- 现代Web标准
- 浏览器原生支持
- 搜索引擎友好
- 跨平台兼容

移动应用

Android: UTF-8 (网络传输) + UTF-16 (内部处理)
iOS: UTF-8 (网络传输) + UTF-8 (内部处理)
原因：
- 网络传输效率
- 平台兼容性
- 存储空间优化

桌面应用

Windows: UTF-8 (文件) + UTF-16 (API)
macOS: UTF-8 (全栈)
Linux: UTF-8 (全栈)
原因：
- 平台特性
- API兼容性
- 文件系统支持

嵌入式系统

推荐：UTF-8
原因：
- 存储空间有限
- 网络传输效率
- 处理简单
- 兼容性好

2. 按语言需求选择

纯英文应用

推荐：ASCII
备选：UTF-8
原因：
- 存储效率最高
- 处理最简单
- 兼容性最好

中文应用

中国大陆：GB2312/GBK (传统) 或 UTF-8 (现代)
台湾香港：Big5 (传统) 或 UTF-8 (现代)
原因：
- 传统系统兼容
- 现代标准支持

多语言应用

推荐：UTF-8
原因：
- 支持所有语言
- 现代标准
- 跨平台兼容

3. 按性能需求选择

高性能要求

内部处理：UTF-32 (定长，快速索引)
存储传输：UTF-8 (空间效率)
原因：
- 处理速度优先
- 存储效率平衡

存储空间优先

推荐：UTF-8
原因：
- 英文字符1字节
- 中文字符3字节
- 总体效率最高

处理简单优先

推荐：UTF-32
原因：
- 定长编码
- 直接索引
- 无需解析

常见问题与解决方案

1. 乱码问题

问题描述

原始文本：Hello 世界
显示结果：Hello ??? 或 Hello ä¸ç

解决方案

// 检测和修复编码问题
class EncodingFixer {
public:
    static std::string FixEncoding(const std::string& text) {
        // 尝试检测编码
        EncodingType detected = DetectEncoding(text);
        
        switch (detected) {
            case ENCODING_UTF8:
                return text;  // 已经是UTF-8
            case ENCODING_GB2312:
                return ConvertGB2312ToUTF8(text);
            case ENCODING_BIG5:
                return ConvertBig5ToUTF8(text);
            default:
                return ConvertToUTF8(text);  // 强制转换
        }
    }
    
private:
    enum EncodingType {
        ENCODING_UNKNOWN,
        ENCODING_UTF8,
        ENCODING_GB2312,
        ENCODING_BIG5
    };
    
    static EncodingType DetectEncoding(const std::string& text) {
        // 简单的编码检测逻辑
        bool hasHighBytes = false;
        for (char c : text) {
            if (c & 0x80) {
                hasHighBytes = true;
                break;
            }
        }
        
        if (!hasHighBytes) {
            return ENCODING_UTF8;  // 纯ASCII
        }
        
        // 更复杂的检测逻辑...
        return ENCODING_UNKNOWN;
    }
};

2. 字节序问题

问题描述

UTF-16在不同系统间传输时出现字节序问题

解决方案

// 处理字节序问题
class ByteOrderHandler {
public:
    // 检测字节序
    static bool IsBigEndian() {
        uint16_t test = 0x1234;
        return *reinterpret_cast<uint8_t*>(&test) == 0x12;
    }
    
    // 转换字节序
    static uint16_t SwapBytes(uint16_t value) {
        return (value << 8) | (value >> 8);
    }
    
    // 处理UTF-16 BOM
    static std::wstring ProcessUTF16WithBOM(const uint8_t* data, size_t length) {
        if (length < 2) return L"";
        
        bool isBigEndian = false;
        size_t start = 0;
        
        // 检查BOM
        if (length >= 2) {
            if (data[0] == 0xFF && data[1] == 0xFE) {
                isBigEndian = false;  // Little Endian
                start = 2;
            } else if (data[0] == 0xFE && data[1] == 0xFF) {
                isBigEndian = true;   // Big Endian
                start = 2;
            }
        }
        
        // 转换数据
        std::wstring result;
        for (size_t i = start; i < length; i += 2) {
            if (i + 1 < length) {
                uint16_t value = data[i] | (data[i + 1] << 8);
                if (isBigEndian) {
                    value = SwapBytes(value);
                }
                result += static_cast<wchar_t>(value);
            }
        }
        
        return result;
    }
};

3. 性能优化

问题描述

大量文本处理时性能问题

解决方案

// 高性能文本处理
class HighPerformanceTextProcessor {
private:
    // 预编译的正则表达式
    std::regex utf8Pattern;
    
    // 字符缓存
    std::unordered_map<std::string, size_t> charCountCache;
    
public:
    HighPerformanceTextProcessor() {
        // 预编译UTF-8模式
        utf8Pattern = std::regex(R"([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})");
    }
    
    // 快速字符计数
    size_t FastCharCount(const std::string& text) {
        // 检查缓存
        auto it = charCountCache.find(text);
        if (it != charCountCache.end()) {
            return it->second;
        }
        
        // 计算字符数
        size_t count = 0;
        for (size_t i = 0; i < text.length(); ) {
            if ((text[i] & 0x80) == 0) {
                i++;
            } else if ((text[i] & 0xE0) == 0xC0) {
                i += 2;
            } else if ((text[i] & 0xF0) == 0xE0) {
                i += 3;
            } else if ((text[i] & 0xF8) == 0xF0) {
                i += 4;
            } else {
                i++;
            }
            count++;
        }
        
        // 缓存结果
        charCountCache[text] = count;
        return count;
    }
    
    // 批量处理
    void BatchProcess(const std::vector<std::string>& texts) {
        // 并行处理
        std::vector<std::future<size_t>> futures;
        
        for (const auto& text : texts) {
            futures.push_back(std::async(std::launch::async, 
                [this](const std::string& t) { return FastCharCount(t); }, text));
        }
        
        // 收集结果
        for (auto& future : futures) {
            size_t count = future.get();
            // 处理结果...
        }
    }
};

4. 内存管理

问题描述

大量文本处理时内存泄漏

解决方案

// 安全的内存管理
class SafeTextHandler {
private:
    // 使用智能指针管理内存
    std::unique_ptr<char[]> buffer;
    size_t bufferSize;
    
public:
    SafeTextHandler() : bufferSize(0) {}
    
    // 安全分配内存
    bool AllocateBuffer(size_t size) {
        try {
            buffer = std::make_unique<char[]>(size);
            bufferSize = size;
            return true;
        } catch (const std::bad_alloc& e) {
            bufferSize = 0;
            return false;
        }
    }
    
    // 安全处理文本
    bool ProcessText(const std::string& input) {
        size_t requiredSize = input.length() + 1;
        
        if (requiredSize > bufferSize) {
            if (!AllocateBuffer(requiredSize)) {
                return false;
            }
        }
        
        // 安全拷贝
        std::strncpy(buffer.get(), input.c_str(), bufferSize - 1);
        buffer[bufferSize - 1] = '\0';
        
        // 处理文本...
        return true;
    }
    
    // 自动清理
    ~SafeTextHandler() {
        // 智能指针自动清理
    }
};