windows 宽窄字符串转换

最新推荐文章于 2024-03-05 10:39:18 发布

等风来不如迎风去

最新推荐文章于 2024-03-05 10:39:18 发布

阅读量1.8k

点赞数

CC 4.0 BY-SA版权

分类专栏： windows环境编程

本文链接：https://blog.youkuaiyun.com/commshare/article/details/106475576

windows环境编程专栏收录该内容

144 篇文章

订阅专栏

本文深入探讨Unicode、GBK和UTF-8的区别，解释字符集与编码的概念，剖析乱码产生的原因，提供编码转换实例，适合编程新手及资深开发者阅读。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

字符集的转换
字符编码的转换
以上两者是不同的概念，不可比较，分别处理

基础知识 Unicode,GBK和UTF-8

转载这位大神

看到题目,你也许会说,”又是这种月经帖,这问题我早弄清楚了”. 但如果有人问你,”Unicode,GBK和UTF-8有什么区别?”, 你能自信地给他一句简短清晰的回答吗? 如果不能的话, 那还是看一下这篇文章吧.

前言

其实这是个老生常谈的问题了,相信大家在第一次遇到Unicode编码问题时,都会在网上搜索一通, 找到几个解释,虽然有点杂乱,但还是感觉自己明白了些什么,然后就继续忙别的事情. 而我之所以就这个问题专门写一篇文章,原因是前两天在与公司一位有十几年工作经验的JAVA程序员对接 API时, 我问他返回的汉字是什么编码的, 而他回答说”直接返回unicode”. 一个如此有经验的老程序员对这种基本问题都不甚清楚, 因此我觉得还是有必要好好说一下这个问题的.

字符集

在介绍他们之间的区别时, 我们先讲下什么是Unicode.
简单来说,Unicode是一个字符集(character set), 和ASCII一样, 其作用是用一系列数字来表示字符(character), 这些数字有时也称为码点(code points).
在PC刚出来的时候,使用英文的几位先驱认为计算机需要表示的字符不多,26个英文字母加几个回车换行等特殊符号,总共一百个字符顶天了,于是就有了ASCII. ASCII码的大小为1个字节,定义了128个字符, 分别表示为0-127. 比如字符’A’的码点为65,回车符’\n’的码点为10, 如下所示:

>>> ord('A')
65
>>> ord('0')
48
>>> ord('\n')
10

当然, 后来人们发现, 世界上的字符远远不止128个, 因此就需要一个新的字符集能表示世上所有的字符, 包括一个英文字符,一个汉字字符,一个象形文字等. 这个字符集就是Unicode.
Unicode前向兼容了ASCII, 最多可以表示2^21(大概200万)个字符,已经足够囊括当今所有国家的文字, 如下所示:

>>> u'ソ'
u'\u30bd'
>>> u'龍'
u'\u9f8d'
>>> u'A'
u'A'

目前unicode字符集表示完所有字符后还有剩余, 这些暂时用不到的部分通常用占位符FFFD表示.

字符编码

有了字符集, 我们现在可以用任意数字来表示现实中的字符了. 但字符要保存在计算机中,必须要先经过编码. 有人问, 数字直接保存在内存里不就行了吗? 但是用多少个字节表示一个数字,以及每个字节的范围这都是需要预先约定的,这种约定就叫编码. 假如我们有四个数字,1,2,3,4要保存在计算机里, 如果约定了utf-8编码, 那么在内存中的表示则如下:

00000001 00000010 00000011 00000100

其他的编码规则有utf-16,gb2312,gbk等,具体的编码规则不在本文的范围内,想要深入了解的可以在网上查阅相关的文档.
因此,我们可以看到,如果不按照约定的规则来解码,就很有可能无法还原出原来的数据,也就是我们经常遇到的”乱码”. 下面以几个例子来简单说明:

>>> u'你好'
u'\u4f60\u597d'
>>> u'你好'.encode('utf8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> u'你好'.encode('gbk')
'\xc4\xe3\xba\xc3'
>>> u'你好'.encode('utf8').decode('gbk')
u'\u6d63\u72b2\u30bd'
>>> print u'你好'.encode('utf8').decode('gbk')
浣犲ソ

如上面的代码所示, “你好”两个汉字字符的unicode分别为4f60和597d, utf-8编码后占6个字节, 而gbk编码后占4个字节. 如果用utf8编码后错误地用gbk来解码, 就会得到3个unicode码点,分别表示字符浣,犲和ソ;而如果用gbk编码后错误地用utf8来解码, 则在解码第二个字符时无法凑够3个字节, 因此会得到未知的结果, 甚至会因为内存越界访问引起程序异常.

注: 本文的python代码示例是在***Linux Terminal下运行的, 因此默认为utf-8编码, 如果你是在Windows cmd里运行, 则通常默认GBK编码***, 因此乱码会在不同地方出现:)

乱码

知道字符编解码的用法之后,我们就可以解释一下常见的一些乱码由来了, 比如在Windows下,未初始化的栈会初始化为0xcc, 未初始化的堆内存会初始化为0xcd, 可以看到前者为’烫’的gbk编码,而后者正好为’屯’的gbk编码, 如下所示:

>>> u'烫'
u'\u70eb'
>>> u'烫'.encode('gbk')
'\xcc\xcc'
>>> u'屯'
u'\u5c6f'
>>> u'屯'.encode('gbk')
'\xcd\xcd'

前面也说过, unicode暂时没用到码点会用占位符FFFD来表示, 如果这个占位符被错误解析, 就会被当作有意义的内容了:

>>> u'\uFFFD'.encode('utf8')
'\xef\xbf\xbd'
>>> u'锟斤拷'.encode('gbk')
'\xef\xbf\xbd\xef\xbf\xbd'
>>> print (u'\uFFFD'.encode('utf8')*2).decode('gbk')
锟斤拷

可以看到,汉字”锟斤铐”(Unicode)的gbk编码分别为\xef\xbf, \xbd\xef和\xbf\xbd, 正好是unicode码FFFD的utf8编码 的叠加, 因此如果平时遇到多个utf8编码的Unicode占位符且不巧用了gbk的方式解码,那就会看到熟悉的锟斤铐了.

其他

在Windows的Notepad.exe中, 保存文件的格式可以看到有如下几种:

notepad

可刚刚不是说Unicode只是字符集吗, 为什么上面显示可以保存为Unicode”编码”? 好吧, 其实这是Windows在命名上一个操蛋的地方. 因为**Windows内部使用UTF-16小端(UTF-16LE)作为默认编码,**并且认为这就是Unicode的标准编码格式. 在Windows的世界中, 存在着ANSI字符串(在当前系统代码页中, 不可拓展),以及Unicode字符串(内部以UTF16-LE编码保存). 因此notepad里所说的 Unicode大端,其实就是UTF16-BE.

这其实也不怪Windows, 因为这是在Unicode出现的早期设计的, 那时我们还没意识到UCS-2的不足, 而且UTF-8还没有被发明出来. 这也是为什么Windows对UTF8的支持如此之差的原因之一吧.

后记

说了这么多, 现在让我们回到一开始的问题, 如果有人问你”Unicode,GBK和UTF-8有什么区别?”, 我想你应该知道该怎么回答了吧: Unicode是一种字符集, 而GBK和UTF-8都是编码, 因此Unicode和后两者不是一类事物, 是无法进行对比的.

转换实例

以下只适用于windows 开发

  static std::string str2UTF8(std::string &str)
  {
	int nwLen = ::MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, NULL, 0);
	wchar_t * pwBuf = new wchar_t[nwLen + 1];
	ZeroMemory(pwBuf, nwLen * 2 + 2);
	::MultiByteToWideChar(CP_ACP, 0, str.c_str(), str.length(), pwBuf, nwLen);
	int nLen = ::WideCharToMultiByte(CP_UTF8, 0, pwBuf, -1, NULL, NULL, NULL, NULL);
	char * pBuf = new char[nLen + 1];
	ZeroMemory(pBuf, nLen + 1);
	::WideCharToMultiByte(CP_UTF8, 0, pwBuf, nwLen, pBuf, nLen, NULL, NULL);
	std::string retStr(pBuf);
	delete[]pwBuf;
	delete[]pBuf;
	pwBuf = NULL;
	pBuf= NULL;
	return retStr;
  }

static std::string str2UTF8(std::string &str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
   std::wstring wext = Str2Wstr(str);
   //LOGN(_CLS_) << "before ext " << msg.ext_;
   std::string ext = utf8_conv.to_bytes(wext);
   return std::move(ext);
}

wstr to utf8

static std::string utf8_encode(const std::wstring &wstr)
{
  int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
  std::string strTo(size_needed, 0);
  WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
  return strTo;
}

ansi2unicode

// Convert an ANSI string to a wide Unicode String
static std::wstring ansi2unicode(const std::string &str)
{
  int size_needed = MultiByteToWideChar(CP_ACP, 0, &str[0], (int)str.size(), NULL, 0);
  std::wstring wstrTo(size_needed, 0);
  MultiByteToWideChar(CP_ACP, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
  return wstrTo;
}

str2 wstr

  static std::wstring Str2Wstr(std::string &str)
  {
	if (str.length() == 0)
	  return L"";

	std::wstring wstr;
	wstr.assign(str.begin(), str.end());
	return wstr;
  }

wstr 2 str

static std::string w2s(std::wstring wstr)
{
	std::string result;
	//获取缓冲区大小，并申请空间，缓冲区大小事按字节计算的  
	int len = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), wstr.size(), NULL, 0, NULL, NULL);
	char* buffer = new char[len + 1];
	//宽字节编码转换成多字节编码  
	WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), wstr.size(), buffer, len, NULL, NULL);
	buffer[len] = '\0';
	//删除缓冲区并返回值  
	result.append(buffer);
	delete[] buffer;
	return result;
}

	std::wstring toWideString( const char* pStr , int len,unsigned int CodePage /*=CP_ACP*/ )
	{
		std::wstring buf ;

		if(pStr == NULL)
		{
			assert(NULL);
			return buf;
		}

		if (len < 0 && len != -1)
		{
			assert(NULL);
			return buf;
		}

		// figure out how many wide characters we are going to get 
		int nChars = MultiByteToWideChar( CodePage , 0 , pStr , len , NULL , 0 ) ; 
		if ( len == -1 )
			-- nChars ; 
		if ( nChars == 0 )
			return L"" ;

		// convert the narrow string to a wide string 
		// nb: slightly naughty to write directly into the string like this
		buf.resize( nChars ) ; 
		MultiByteToWideChar( CodePage , 0 , pStr , len , 
			const_cast<wchar_t*>(buf.c_str()) , nChars ) ; 

		return buf ;
	}

	std::string toNarrowString( const wchar_t* pStr , int len, unsigned int CodePage )
	{
		std::string buf ;

		if(pStr == NULL)
		{
			assert(NULL);
			return buf;
		}

		if (len < 0 && len != -1)
		{
			assert(NULL);
			return buf;
		}

		// figure out how many narrow characters we are going to get 
		int nChars = WideCharToMultiByte( CodePage , 0 , 
			pStr , len , NULL , 0 , NULL , NULL ) ; 
		if ( len == -1 )
			-- nChars ; 
		if ( nChars == 0 )
			return "" ;

		// convert the wide string to a narrow string
		// nb: slightly naughty to write directly into the string like this
		buf.resize( nChars ) ;
		WideCharToMultiByte( CodePage , 0 , pStr , len , 
			const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ; 

		return buf ; 
	}

std::string String::ToMultiByte(const std::wstring& strWide, UINT uCodePage /* = CP_ACP */)
{
	std::string strMultiByte;
	if (strWide.empty())
	{
		return strMultiByte;
	}

	int nRequiredChar = WideCharToMultiByte(uCodePage, 0, strWide.c_str(), strWide.size(), NULL, 0, NULL, NULL);
	if (nRequiredChar > 0)
	{
		strMultiByte.resize(nRequiredChar/*+1*/);
		WideCharToMultiByte(uCodePage , 0, strWide.c_str(), strWide.size(), const_cast<char*>(strMultiByte.c_str()), nRequiredChar, NULL, NULL);
	}

	return strMultiByte; 
}

std::wstring String::ToWideChar(const std::string& strMultiByte, UINT uCodePage /* = CP_ACP */)
{
	std::wstring strWide;
	if (strMultiByte.empty())
	{
		return strWide;
	}

	int nRequiredChar = MultiByteToWideChar(uCodePage, 0, strMultiByte.c_str(), strMultiByte.size(), NULL, 0);

	if (nRequiredChar > 0)
	{
		strWide.resize(nRequiredChar/*+1*/); 
		MultiByteToWideChar(uCodePage, 0, strMultiByte.c_str(), strMultiByte.size(), const_cast<wchar_t*>(strWide.c_str()), nRequiredChar); 
	}

	return strWide;
}