Character Sets and String Conversion Explained - CSDN Blog

Article link: https://blog.youkuaiyun.com/nemo2011/article/details/102690096

Some notes on character sets:

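1. std::wstring s(L"abc"); // The L prefix makes a wide string literal whose elements are wchar_t. A wchar_t is two bytes on Windows (UTF-16 code units) but four bytes on most Linux and macOS toolchains. The hexadecimal values below assume the two-byte case and show each code unit most-significant byte first.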

"A" = 41

"ABC" = 41 42 43

L"A" = 00 41

L"ABC" = 00 41 00 42 00 43
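The values above can be checked on a given platform with a small hex dump. This is a minimal sketch; the helper names `hex_bytes` and `hex_code_units` are made up for illustration and do not come from the original post. The wide version prints code unit values rather than raw memory, so its output does not depend on byte order.

```cpp
#include <cstdio>
#include <string>

// Hex dump of the bytes of a narrow string, e.g. "ABC" -> "41 42 43".
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}

// Hex dump of the code units of a wide string, at least four hex digits
// per unit, e.g. L"ABC" -> "0041 0042 0043".
std::string hex_code_units(const std::wstring& ws) {
    std::string out;
    char buf[16];
    for (wchar_t ch : ws) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%04lX", static_cast<unsigned long>(ch));
        out += buf;
    }
    return out;
}
```

Printing `sizeof(wchar_t)` alongside these dumps shows whether the platform uses two-byte or four-byte wide characters.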

Converting between string and wstring:

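#include <locale>
#include <codecvt>   // deprecated since C++17
#include <string>
// UTF-8 <-> UTF-16 converter. Suited to platforms where wchar_t is 16 bits (e.g. Windows);
// where wchar_t is 32 bits, std::codecvt_utf8<wchar_t> is the usual choice.
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
// wide_utf16_source_string: a std::wstring holding UTF-16 text
std::string narrow = converter.to_bytes(wide_utf16_source_string);
// narrow_utf8_source_string: a std::string holding UTF-8 text
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);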

#include <string>
#include <codecvt>
#include <locale>

std::string input_str = "this is a -string-, which is a sequence based on the -char- type.";
std::wstring input_wstr = L"this is a -wide- string, which is based on the -wchar_t- type.";

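// Conversion. codecvt_utf8<wchar_t> maps UTF-8 to UCS-2 when wchar_t is 16 bits
// and to UTF-32 when it is 32 bits; both calls throw std::range_error on invalid input.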
std::wstring str_turned_to_wstr = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(input_str);

std::string wstr_turned_to_str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(input_wstr);

Both snippets are very clear; they come from Stack Overflow.
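The two snippets differ only in the facet they use, and which one fits depends on the width of wchar_t. As a rough sketch, the choice can be made at compile time. The names `WideFacet`, `utf8_to_wide`, and `wide_to_utf8` are hypothetical helpers, not part of the original answers, and `std::wstring_convert` together with `<codecvt>` has been deprecated since C++17, so compilers may emit warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Pick the facet from the size of wchar_t:
// 16-bit wchar_t -> UTF-8 <-> UTF-16, otherwise UTF-8 <-> UTF-32.
using WideFacet = std::conditional<sizeof(wchar_t) == 2,
                                   std::codecvt_utf8_utf16<wchar_t>,
                                   std::codecvt_utf8<wchar_t>>::type;

// Decode UTF-8 bytes into a wide string; throws std::range_error on bad input.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<WideFacet> conv;
    return conv.from_bytes(utf8);
}

// Encode a wide string as UTF-8 bytes; throws std::range_error on bad input.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<WideFacet> conv;
    return conv.to_bytes(wide);
}
```

A converter object is created per call here for simplicity; code that converts many strings could keep one `std::wstring_convert` instance and reuse it.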

Removing Chinese punctuation and whitespace:


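    // Assumes <codecvt>, <cwctype>, <fstream>, <locale>, <string> and using namespace std;
    // infile and outfile are already-open ifstream/ofstream objects.
    // iswpunct/iswspace follow the current C locale: call setlocale(LC_ALL, "") with a
    // UTF-8 locale active first, otherwise full-width punctuation may not be recognized.
    // Define the converter object
    wstring_convert<codecvt_utf8<wchar_t>> conv;
    string s;
    // Read the file line by line; testing getline directly avoids the extra
    // empty iteration that while (!infile.eof()) runs after the last line
    while (getline(infile, s)){
        // Convert to a wide string
        wstring ws = conv.from_bytes(s);
        wstring nws;
        // Filter punctuation and whitespace out of each line
        for (wchar_t ch : ws){
            // Keep the character only if it is neither punctuation nor whitespace
            if (!iswpunct(ch) && !iswspace(ch)){
                nws.push_back(ch);
            }
        }
        // Convert the filtered text back to UTF-8 encoded bytes
        string ns = conv.to_bytes(nws);
        // Write it out (getline dropped the newline, so the lines are joined together)
        outfile << ns;
    }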
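Because `iswpunct` depends on the active locale, the loop above may leave full-width punctuation in place when the program runs in the default "C" locale. One possible workaround, sketched here under the assumption that a hand-picked list of code point ranges is acceptable, is to test those ranges explicitly. `is_cjk_punct` and `strip_punct_and_space` are hypothetical names, and the ranges are not exhaustive.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: true for a hand-picked, non-exhaustive set of
// general, CJK, and full-width punctuation code points.
bool is_cjk_punct(wchar_t ch) {
    const unsigned long c = static_cast<unsigned long>(ch);
    return (c >= 0x2010 && c <= 0x2027)   // dashes, curly quotes, ellipsis
        || (c >= 0x3000 && c <= 0x3003)   // ideographic space, comma, full stop
        || (c >= 0x3008 && c <= 0x3011)   // CJK angle, corner, lenticular brackets
        || (c >= 0x3014 && c <= 0x301F)   // more CJK brackets and quotation marks
        || (c >= 0xFF01 && c <= 0xFF0F)   // full-width ! through /
        || (c >= 0xFF1A && c <= 0xFF20)   // full-width : through @
        || (c >= 0xFF3B && c <= 0xFF40)   // full-width [ through `
        || (c >= 0xFF5B && c <= 0xFF65);  // full-width { through half-width middle dot
}

// Drop every character that the locale or the explicit ranges classify
// as punctuation or whitespace; keep everything else in order.
std::wstring strip_punct_and_space(const std::wstring& ws) {
    std::wstring out;
    for (wchar_t ch : ws) {
        if (std::iswpunct(ch) || std::iswspace(ch) || is_cjk_punct(ch)) continue;
        out.push_back(ch);
    }
    return out;
}
```

The function works on an already-decoded wide string, so it can replace the inner `for` loop of the file-processing code while the UTF-8 conversion stays as it is.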