An Example of Modifying a Unicode File

License: CC 4.0 BY-SA

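This article describes how to modify a Unicode-encoded text file from a Visual C++ project built with the multi-byte character set. It explains why the file is opened in binary mode, walks through how the file content is updated in each case, and provides sample code.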


Description:

The source file is a Unicode-encoded text file that contains the following section:

[应用版本号] "1.0.0"
[主题名称] "202"
[主题图标路径] "themeIcons\\202.gif"

........

Depending on the parameter passed in, for example vmax, it has to be changed to:

[应用版本号] "1.0.0"
[主题名称] "vmax"
[主题图标路径] "themeIcons\\vmax.gif"

.......
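The edit described above boils down to replacing the text between the first pair of double quotes on a line. A minimal sketch of that step in portable C++ follows; the helper name replaceQuoted is hypothetical and is not part of the article's MFC code, which builds the new line with CString::Find and CString::Left instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replace the text between the first pair of double
// quotes in `line` with `value`. The line is returned unchanged when it
// does not contain two double quotes.
std::string replaceQuoted(const std::string& line, const std::string& value) {
    const std::string::size_type open = line.find('"');
    if (open == std::string::npos) return line;
    const std::string::size_type close = line.find('"', open + 1);
    if (close == std::string::npos) return line;
    // Keep everything up to and including the opening quote, insert the new
    // value, then keep the closing quote and whatever follows it.
    return line.substr(0, open + 1) + value + line.substr(close);
}
```

The new line generally differs in length from the old one, which is why the rest of the article has to deal with shifting the remaining file content.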

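I am using VC, and the project's character set is "Multi-Byte Character Set".

First, open the file in binary mode. A few words on how binary mode differs from text mode:

The difference shows up mainly in certain APIs, which treat the carriage return-linefeed pair (\r\n) specially depending on how the file was opened. Note that these are two characters, 0x0D and 0x0A in ASCII.

In MSDN, the Remarks section for CStdioFile says: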

Stream files are buffered and can be opened in either text mode (the default) or binary mode.

Text mode provides special processing for carriage return–linefeed pairs. When you write a newline character (0x0A) to a text-mode CStdioFile object, the byte pair (0x0D, 0x0A) is sent to the file. When you read, the byte pair (0x0D, 0x0A) is translated to a single 0x0A byte.

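In other words, when a file is opened in text mode, every newline character '\n' (0x0A) written to it is converted to "\r\n" (0x0D 0x0A). Also note that WriteString does not write the string's terminating '\0' to the file.

A file opened in binary mode undergoes no such conversion.

Next, how to modify a Unicode file under the multi-byte character set.

Files are usually modified by overwriting in place. Depending on how the byte length of the new content compares with that of the old content, there are three cases (assume pos is the position to modify):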
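To see why this matters for a Unicode (UTF-16LE) file, the sketch below simulates the text-mode write translation on raw bytes. Both function names are hypothetical, and simulateTextModeWrite is only a model of the behavior quoted from MSDN, not a call into the C runtime. In UTF-16LE the pair "\r\n" occupies four bytes (0D 00 0A 00), and inserting an extra 0x0D before the 0x0A byte leaves an odd number of bytes, so every following 16-bit unit is misaligned.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode an ASCII-only string as UTF-16LE: each character becomes its own
// byte value followed by 0x00.
std::vector<std::uint8_t> asciiToUtf16le(const std::string& s) {
    std::vector<std::uint8_t> out;
    for (unsigned char c : s) {
        out.push_back(c);
        out.push_back(0x00);
    }
    return out;
}

// Model of a text-mode write: every 0x0A byte in the stream gets a 0x0D
// byte inserted in front of it, regardless of what the bytes mean.
std::vector<std::uint8_t> simulateTextModeWrite(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::uint8_t b : in) {
        if (b == 0x0A) out.push_back(0x0D);
        out.push_back(b);
    }
    return out;
}
```

Passing the four bytes of a UTF-16LE "\r\n" through simulateTextModeWrite yields five bytes, which is the kind of corruption that opening the file in binary mode avoids.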

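1. Same length: simply overwrite at pos.

2. n bytes longer: move everything from pos onward back by n bytes, then write from pos so that the new content fits exactly.

3. n bytes shorter: move everything from pos+n onward forward by n bytes, write from pos, and finally delete the last n bytes of the file, for example with the SetLength method.

Here is some sample code: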
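The three cases can be sketched on an in-memory byte buffer that stands in for the file. This is an illustration under assumptions, not the article's MFC code: overwriteRange is a hypothetical name, the buffer replaces the CFile calls, and the tail is moved from the end of the old range rather than from pos, which produces the same final bytes. All lengths are in bytes, so for a UTF-16 file each ordinary character counts as two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replace the oldLen bytes at offset pos in `file`
// with the bytes of `repl`. Assumes pos + oldLen <= file.size().
void overwriteRange(std::vector<char>& file, std::size_t pos,
                    std::size_t oldLen, const std::string& repl) {
    const std::size_t newLen = repl.size();
    const std::size_t tailStart = pos + oldLen;  // first byte after the old content

    if (newLen > oldLen) {
        // Case 2: grow by n bytes and move the tail back (toward the end).
        const std::size_t n = newLen - oldLen;
        const std::size_t oldSize = file.size();
        file.resize(oldSize + n);
        std::copy_backward(file.begin() + tailStart, file.begin() + oldSize,
                           file.end());
    } else if (newLen < oldLen) {
        // Case 3: move the tail forward by n bytes, then cut the last n bytes
        // (the counterpart of SetLength on a real file).
        const std::size_t n = oldLen - newLen;
        std::copy(file.begin() + tailStart, file.end(),
                  file.begin() + pos + newLen);
        file.resize(file.size() - n);
    }
    // Case 1, and the final step of cases 2 and 3: overwrite at pos.
    std::copy(repl.begin(), repl.end(), file.begin() + pos);
}
```

On a real file the same moves are done with Seek, Read and Write, and the buffer used for the move has to be released afterwards.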

void CTh_FileManage::UpdateCustomDemo(CString& path, CString& vte_name)
{
    CStdioFileEx file;
    CFileException ex;
    if (!file.Open((LPCTSTR)path, CFile::modeReadWrite | CFile::shareDenyNone | CFile::typeBinary, &ex))
    {
        TCHAR szError[1024];
        ex.GetErrorMessage(szError, 1024);
        AfxMessageBox(szError);
        return; // the file was never opened, so there is nothing to close
    }

    CString cstr_c;
    // skip two lines; posb ends up at the start of the third line
    ULONGLONG posb = file.GetPosition();
    file.ReadString(cstr_c); // this is a null-terminated string
    posb = file.GetPosition();
    file.ReadString(cstr_c);
    posb = file.GetPosition();
    file.ReadString(cstr_c);

    // get the string before '"'
    int pb = cstr_c.Find('"');
    int pe = cstr_c.Find('"', pb + 1);
    int orgL = pe - pb - 1;          // length of the old value between the quotes
    int vL = vte_name.GetLength();   // length of the new value
    CString comment = cstr_c.Left(pb);
    comment += '"';
    comment += vte_name;
    comment += '"';
    // comment += "\r\n"; // a file opened in binary mode does not convert "\n" to "\r\n" on write, unlike a file opened in text mode
    Updatelineinfile(file, orgL, vL, posb, comment);

    // fourth line is path
    posb = file.GetPosition(); // the position is just before '\r'
    posb += 4;                 // the length of "\r\n" in unicode
    file.Seek(posb, CFile::begin);
    file.ReadString(cstr_c);
    comment = cstr_c.Left(cstr_c.Find('"'));
    comment += '"';
    comment += "themeIcons\\\\";
    comment += vte_name;
    comment += ".gif";
    comment += '"';
    orgL = cstr_c.GetLength();
    vL = comment.GetLength();
    Updatelineinfile(file, orgL, vL, posb, comment);

    file.Flush();
    file.Close();
}

void CTh_FileManage::Updatelineinfile(CStdioFileEx& file, int orgL, int vL, ULONGLONG posb, CString& comment)
{
    if (orgL == vL) // same length: overwrite in place
    {
        // write new vte_name to file
        file.Seek(posb, CFile::begin);
        file.WriteString(comment); // replaces the original line
        file.Flush();
    }
    else if (orgL < vL)
    {
        // move everything from posb onward back by (vL - orgL) characters,
        // then write from posb so that the new line fits exactly
        int off = vL - orgL;
        off <<= 1; // characters to bytes: two bytes per character in the file
        UINT len = (UINT)(file.GetLength() - posb);
        char* buff = new char[len];
        file.Seek(posb, CFile::begin);
        UINT res = file.Read(buff, len);
        file.Seek(posb + off, CFile::begin);
        file.Write(buff, res);
        file.Flush();
        delete [] buff; // the original listing leaked this buffer
        file.Seek(posb, CFile::begin);
        file.WriteString(comment);
    }
    else
    {
        int off = orgL - vL;
        off <<= 1;
        // move everything from posb + off onward forward by off bytes
        ULONGLONG pos = posb + off;
        UINT len = (UINT)(file.GetLength() - pos); // len is in bytes; this must be kept clear
        char* buf = new char[len];
        file.Seek(pos, CFile::begin);
        UINT res = file.Read(buf, len);
        file.Seek(posb, CFile::begin);
        file.Write(buf, res);
        file.Flush();
        delete [] buf; // the original listing leaked this buffer
        file.Seek(posb, CFile::begin);
        file.WriteString(comment);
        ULONGLONG posc = file.GetPosition();
        // SetLength cuts off the surplus bytes: a shorter line replaced a longer
        // one, so the last off bytes of the file are duplicates. The call changes
        // the file pointer, so the current position is saved first.
        file.SetLength(file.GetLength() - off);
        file.Seek(posc, CFile::begin);
        file.Flush();
    }
}

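The simplest way to move the file content starting at posb back by off bytes:

Compute the number of remaining bytes: int len = file.GetLength() - posb;

Then read the remaining bytes: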

char* buff = new char[len];
file.Seek(posb, CFile::begin);
int res = file.Read(buff, len);

Next, write the remaining content starting at posb+off:

file.Seek(posb + off, CFile::begin);
file.Write(buff, res);
file.Flush();

Then write the content we want to add, starting at posb:

file.Seek(posb, CFile::begin);
file.WriteString(comment);

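That covers the modification itself. What remains is the CStdioFileEx class, which implements WriteString and ReadString, so that ReadString reads a line from the Unicode file as a multibyte-encoded string and WriteString writes a multibyte string to the file in Unicode encoding.

The two functions are listed here: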

BOOL CStdioFileEx::ReadString(CString& rString)
{
    const int nMAX_LINE_CHARS = 4096;
    BOOL bReadData;
    LPTSTR lpsz;
    int nLen = 0; //, nMultiByteBufferLength = 0, nChars = 0;
    CString sTemp;
    wchar_t* pszUnicodeString = NULL;
    char* pszMultiByteString = NULL;

    // If at position 0, discard byte-order mark before reading
    if (!m_pStream || (GetPosition() == 0 && m_bIsUnicodeText))
    {
        wchar_t cDummy;
        // Read(&cDummy, sizeof(_TCHAR));
        Read(&cDummy, sizeof(wchar_t));
    }

    // If compiled for Unicode
#ifdef _UNICODE
    // Do standard stuff -- both ANSI and Unicode cases seem to work OK
    bReadData = CStdioFile::ReadString(rString);
#else
    if (!m_bIsUnicodeText)
    {
        // Do standard stuff -- read ANSI in ANSI
        bReadData = CStdioFile::ReadString(rString);
    }
    else
    {
        pszUnicodeString = new wchar_t[nMAX_LINE_CHARS];
        pszMultiByteString = new char[nMAX_LINE_CHARS];

        // Read as Unicode, convert to ANSI
        if (fgetws(pszUnicodeString, nMAX_LINE_CHARS, m_pStream) == NULL)
        {
            bReadData = FALSE;
        }
        else
        {
            bReadData = TRUE;
            if (GetMultiByteStringFromUnicodeString(pszUnicodeString, pszMultiByteString, nMAX_LINE_CHARS))
            {
                rString = (CString)pszMultiByteString;
            }
        }

        // Free both buffers on every path (the original listing freed them
        // only after a successful read, and with delete instead of delete [])
        delete [] pszUnicodeString;
        delete [] pszMultiByteString;
    }
#endif

    // Then remove end-of-line character if in Unicode text mode
    if (bReadData)
    {
        // Copied from FileTxt.cpp but adapted to Unicode and then adapted for end-of-line being just '\r'.
        nLen = rString.GetLength();
        if (nLen > 1 && rString.Mid(nLen - 2) == sNEWLINE)
        {
            rString.GetBufferSetLength(nLen - 2);
        }
        else
        {
            lpsz = rString.GetBuffer(0);
            if (nLen != 0 && (lpsz[nLen - 1] == _T('\r') || lpsz[nLen - 1] == _T('\n')))
            {
                rString.GetBufferSetLength(nLen - 1);
            }
        }
    }
    return bReadData;
}

void CStdioFileEx::WriteString(LPCTSTR lpsz)
{
    // If writing Unicode and at the start of the file, need to write byte mark
    if (m_nFlags & CStdioFileEx::modeWriteUnicode)
    {
        // If at position 0, write byte-order mark before writing anything else
        if (!m_pStream || GetPosition() == 0)
        {
            wchar_t cBOM = (wchar_t)nUNICODE_BOM;
            CFile::Write(&cBOM, sizeof(wchar_t));
        }
    }

    // If compiled in Unicode...
#ifdef _UNICODE
    // If writing Unicode, no conversion needed
    if (m_nFlags & CStdioFileEx::modeWriteUnicode)
    {
        // Write in byte mode
        CFile::Write(lpsz, lstrlen(lpsz) * sizeof(wchar_t));
    }
    // Else if we don't want to write Unicode, need to convert
    else
    {
        int nChars = lstrlen(lpsz) + 1; // Why plus 1? Because yes
        int nBufferSize = nChars * sizeof(char);
        wchar_t* pszUnicodeString = new wchar_t[nChars];
        char* pszMultiByteString = new char[nChars];

        // Copy string to Unicode buffer
        lstrcpy(pszUnicodeString, lpsz);

        // Get multibyte string
        if (GetMultiByteStringFromUnicodeString(pszUnicodeString, pszMultiByteString, nBufferSize, GetACP()))
        {
            // Do standard write
            CFile::Write((const void*)pszMultiByteString, lstrlen(lpsz));
        }

        delete [] pszUnicodeString;
        delete [] pszMultiByteString;
    }
    // Else if *not* compiled in Unicode
#else
    // If writing Unicode, need to convert
    if (m_nFlags & CStdioFileEx::modeWriteUnicode)
    {
        int nChars = lstrlen(lpsz);
        int nBufferSize = nChars * sizeof(wchar_t);
        wchar_t* pszUnicodeString = new wchar_t[nChars + 1];
        char* pszMultiByteString = new char[nChars + 1];

        // Copy string to multibyte buffer
        lstrcpy(pszMultiByteString, lpsz);

        if (GetUnicodeStringFromMultiByteString(pszMultiByteString, pszUnicodeString, nBufferSize, GetACP()))
        {
            // Write in byte mode until null, not including the null
            wchar_t* p = pszUnicodeString;
            while (*p)
            {
                CFile::Write(p, 2);
                p++;
            }
            //CFile::Write(pszUnicodeString, lstrlen(lpsz) * sizeof(wchar_t));
        }
        else
        {
            ASSERT(false);
        }

        delete [] pszUnicodeString;
        delete [] pszMultiByteString;
    }
    // Else if we don't want to write Unicode, no conversion needed
    else
    {
        // Do standard stuff
        CStdioFile::WriteString(lpsz);
    }
#endif
}
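The reading side can be illustrated without MFC. The sketch below is a simplified stand-in for what ReadString does on a Unicode file, assuming the file is UTF-16LE: it skips the byte-order mark at offset 0, collects 16-bit units up to the line feed, and strips the line ending. The name readUtf16leLine and the byte-vector input are hypothetical, and the conversion to a multibyte string is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: read one line of UTF-16LE text from `bytes`, starting
// at byte offset `pos` and advancing it past the line. A byte-order mark
// (FF FE) at offset 0 is skipped, and a trailing "\r\n", "\n" or "\r" is
// stripped. Returns false when no complete 16-bit unit is left.
bool readUtf16leLine(const std::vector<std::uint8_t>& bytes, std::size_t& pos,
                     std::u16string& line) {
    line.clear();
    if (pos == 0 && bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        pos = 2;  // discard the byte-order mark
    }
    if (pos + 1 >= bytes.size()) return false;

    while (pos + 1 < bytes.size()) {
        // Low byte first, then high byte (little-endian).
        const char16_t unit =
            static_cast<char16_t>(bytes[pos] | (bytes[pos + 1] << 8));
        pos += 2;
        line.push_back(unit);
        if (unit == u'\n') break;
    }
    if (!line.empty() && line.back() == u'\n') line.pop_back();
    if (!line.empty() && line.back() == u'\r') line.pop_back();
    return true;
}
```

After a call, pos sits at the first byte of the next line, which is the role GetPosition plays in the MFC code.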

The CStdioFileEx source was taken from http://blog.youkuaiyun.com/Augusdi/archive/2009/10/15/4677520.aspx

I made some changes for my own application.

One fragment deserves special attention:

wchar_t* p = pszUnicodeString;
   while(*p){
    CFile::Write(p, 2);
    p++;
   }
   //CFile::Write(pszUnicodeString, lstrlen(lpsz) * sizeof(wchar_t));

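The original author used the commented-out line CFile::Write(pszUnicodeString, lstrlen(lpsz) * sizeof(wchar_t));

This contains a small mistake: the character count is obtained with int nChars = lstrlen(lpsz);, which is not appropriate in a multibyte build. There lstrlen counts bytes, and a Chinese character already occupies 2 bytes.

If the text contains Chinese characters, CFile::Write(pszUnicodeString, lstrlen(lpsz) * sizeof(wchar_t)); writes a few surplus null bytes at the end into the file. Readers can try this themselves by passing in a string that contains Chinese characters.

So I write only two bytes at a time and stop at 0x00 0x00. The drawback is that this issues I/O requests over and over, which wastes a lot of resources. A better approach is to use _mbslen instead of lstrlen; that function returns the number of multibyte characters. See MSDN for details.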
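The size of the overrun can be worked out with a small portable sketch. It assumes a double-byte code page such as GBK, treats any byte of 0x81 or above as a lead byte followed by one trail byte (a simplification of what _mbslen does), and assumes every character converts to a single 16-bit unit. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count characters in a double-byte (DBCS) string. Simplifying assumption:
// a byte >= 0x81 is a lead byte and is followed by exactly one trail byte.
std::size_t countDbcsChars(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        i += (b >= 0x81 && i + 1 < s.size()) ? 2 : 1;
    }
    return count;
}

// Bytes written past the converted text when the write length is computed
// as (byte count) * 2 instead of (character count) * 2.
std::size_t extraBytesWritten(const std::string& s) {
    return (s.size() - countDbcsChars(s)) * 2;
}
```

For example, the GBK bytes 61 62 D6 D0 CE C4 ("ab" followed by two Chinese characters) are six bytes but four characters, so a write sized from the byte count emits 12 bytes where 8 are needed.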

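That is all for now. I hardly ever use MFC and am only practicing because I recently needed to build a small tool. Please forgive any flaws in the code.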