Character Encoding Conversion with Boost's Locale


This post shows how to use Locale from the Boost C++ Libraries for character encoding conversion, covering the functions between(), to_utf(), from_utf(), and utf_to_utf().

Original post: http://viml.nchc.org.tw/blog/paper_info.php?CLASS_ID=1&SUB_ID=1&PAPER_ID=316

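This one will be fairly short: a quick look at how to use Locale (official documentation), one of the Boost C++ Libraries, to convert text between character encodings. Locale's main job is to provide localization for C++, along with some Unicode handling; here Heresy will only cover the encoding conversion part.

One thing to watch out for: Locale was only added to Boost in 1.48.0, so it is a fairly new library. Before using it, check that your Boost version is 1.48.0 or later. It is also one of the few Boost libraries that has to be built, and it can be used together with ICU (official site) or iconv (official site); see the official build instructions if you need that. (Note 1, Note 2)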

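The conversion features Locale provides are quite simple; the official documentation is in "Character Set Conversions".

All of the conversions are exposed as free functions in the boost::locale::conv namespace. To use them, include the boost/locale/encoding.hpp header.

The conversion functions it provides are: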

Converting between encodings: between()

This function converts a string from one encoding to another. Its interface is:
```
string between( string const &text,
                string const &to_encoding,
                string const &from_encoding,
                method_type how=default_method)
```
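It only works on string; wstring is not supported.

Usage is straightforward. The first parameter, text, is the string to convert; the second, to_encoding, is the encoding to convert to; and the third, from_encoding, says which encoding text is currently in (Note 3).

The last parameter, how, is an enumeration defined by Locale (method_type) that says what to do when a conversion fails. There are really only two options, skip and stop. The default is skip, which ignores any character that cannot be converted and carries on with the rest; stop aborts the conversion by throwing boost::locale::conv::conversion_error.

Here is a simple example that converts the string source from BIG5 to UTF-8: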
```
string source = "....";
string s = boost::locale::conv::between( source, "UTF-8", "BIG5" );
```

Converting to UTF: to_utf()

This function converts a string in a given encoding into a UTF string. Its interface is basically:

template<typename CharType>
basic_string<CharType> to_utf( string const &text,
                               string const &charset,
                               method_type how )

Besides string, it can also produce a wstring. The example below converts sSource, a BIG5-encoded string, first into a wstring and then into a string.

string sSource = "...";
wstring ws = boost::locale::conv::to_utf<wchar_t>( sSource, "BIG5" );
string  ss = boost::locale::conv::to_utf<char>( sSource, "BIG5" );

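Converting from UTF to another encoding: from_utf()

This function is the reverse of to_utf() above: it converts a UTF string (string or wstring) into a string in a given encoding. Its interface is: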

template<typename CharType>
string from_utf( basic_string<CharType> const &text,
                 string const &charset,
                 method_type how )

It accepts either a string or a wstring as input, but the output is always a string.

The example below converts sSource and wSource, both UTF strings, into BIG5-encoded strings.

string  sSource =  "...";
wstring wSource = L"...";
string  ss1 = boost::locale::conv::from_utf( wSource, "BIG5" );
string  ss2 = boost::locale::conv::from_utf( sSource, "BIG5" );

Converting between Unicode encodings: utf_to_utf()

This function converts between UTF strings stored in a string and UTF strings stored in a wstring. Its interface is:

template<typename CharOut,typename CharIn>
basic_string<CharOut> utf_to_utf( basic_string<CharIn> const &str,
                                  method_type how )

The example below converts sSource, a string, into a wstring, and wSource, a wstring, into a string:

string  sSource =  "...";
wstring wSource = L"...";
wstring wStr = boost::locale::conv::utf_to_utf<wchar_t>( sSource );
string  sStr = boost::locale::conv::utf_to_utf<char>( wSource );

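That's it for this post. As far as Heresy is concerned, the encoding conversion functions in Locale are simple and pleasant to use; compared with calling iconv directly, at least, they really are a lot simpler. If you need encoding conversion in your program, Heresy thinks Boost's Locale library is worth a try.

Locale actually has many other features, but since Heresy doesn't have much use for them yet, they will be left alone for now.

The part Heresy thinks may turn out to be more useful is "Messages Formatting (Translation)"; perhaps that will get a look when the need arises.

Notes: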

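1. On Windows, it should be possible to use Locale without ICU or iconv; on POSIX platforms, however, it seems that one of the two is required.

2. Heresy tried to integrate ICU on Windows, but it did not seem to go very well...

   One lesson learned, though: it seems you cannot simply point the Boost build at a downloaded ICU binary package; you have to download the source code and build ICU yourself. Even then, although Heresy did get Boost to detect ICU, the conversions still seemed to go through the Windows built-in code pages rather than ICU...

3. Encodings are basically specified as strings, for example "BIG5" for Traditional Chinese and "UTF-8" for Unicode; a std::locale (reference) can also be used as the parameter that specifies the encoding.

   Which encodings are supported depends on which conversion backend is in use. With iconv or ICU, it is whatever iconv or ICU supports. On Windows with MSVC and neither of them, Boost falls back to the Windows built-in code pages; the list can be found in the all_windows_encodings array in \libs\locale\src\encoding\wconv_codepage.ipp under the Boost directory.

4. All of the Locale conversion functions above also accept char* or wchar_t* as the input string.

5. FreeType's ft_encoding_unicode can work with UTF strings held in a wstring.

6. To write a wstring (a wide-character string) to standard I/O, use wcout rather than cout. To output Chinese text, you also need to set the locale through the stream's imbue() function. Here is an example: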
```
wstring wSource = L"中文";
wcout.imbue( std::locale( "cht" ) );
wcout << wSource << endl;
```