[Reading Notes] An Introduction to Unicode (Chapter 2)

Latest recommended article published on 2023-12-13 09:48:25

imwkj

Category: Reading notes. Tags: reading, character, windows, encoding, microsoft, function

Article link: https://blog.youkuaiyun.com/imwkj/article/details/1490647

Chapter 2 – An Introduction to Unicode

·Unicode is an extension of the ASCII character encoding.

·Strict ASCII is a 7-bit code, usually stored in one 8-bit byte per character; Unicode, as described in the book, uses a full 16 bits per character.

·This allows Unicode to represent all the letters, ideographs, and other symbols used in the world's written languages that are likely to appear in computer communication.

·Unicode is intended initially to supplement ASCII and, with any luck, eventually replace it.

·The C programming language as formalized by ANSI inherently supports Unicode through its support of wide characters.

A brief history of character sets

Character set	Period introduced	Features
Morse code (telegraph)	between 1838 and 1854	·Each letter in the alphabet corresponds to a series of short and long pulses (dots and dashes) ·No distinction between uppercase and lowercase letters, but numbers and punctuation marks have their own codes
Braille	between 1821 and 1824	·essentially a 6-bit code ·encodes letters, common letter combinations, common words, and punctuation
Telex codes	standardized in 1931	·5-bit codes that include letter shifts and figure shifts
BCDIC		·Binary-Coded Decimal Interchange Code
8-bit EBCDIC	1960s	·8-bit extension of BCDIC
ASCII	origins in the late 1950s, finalized in 1967	·a total of 128 codes ·The 26 letter codes are contiguous ·The codes for the 10 digits are easily derived from the value of the digits
ANSI character set	1985
Double-Byte Character Sets (DBCS)		·used for languages whose character sets do not fit in 256 codes ·relies on the code-page concept ·the first 128 codes are ASCII, but some characters take 1 byte and others take 2, so strings are awkward to process ·insufficient and awkward
Unicode		·16 bits allow the representation of 65,536 characters ·sufficient for all the characters and ideographs (according to the book) ·the first 128 codes match ASCII ·no ambiguity, because there is only one character set

Wide character set in C

ANSI C also supports multibyte character sets, and wide characters aren't necessarily Unicode.

The char Data Type

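The char data type occupies one byte. Definitions look like this:

char c = 'A';                  // 1 byte

char *p = "Hello, World!";     // the string occupies 14 bytes (13 characters plus the terminating '\0'); the pointer p is stored separately

char a[] = "Hello, World!";    // sizeof(a) is 14 bytes, including the terminating '\0'

char a[10];                    // sizeof(a) is 10 bytes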
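
The byte counts above can be checked with sizeof and strlen. The following is only a minimal sketch; the helper names hello_array_size and hello_length are hypothetical and exist just for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: size in bytes of an array initialized from
   "Hello, World!", including the terminating '\0'. */
size_t hello_array_size(void)
{
    char a[] = "Hello, World!";
    return sizeof(a);            /* 13 characters + 1 terminator */
}

/* Hypothetical helper: number of characters before the terminator. */
size_t hello_length(void)
{
    const char *p = "Hello, World!";
    return strlen(p);            /* the terminator is not counted */
}
```

sizeof counts the terminating '\0' while strlen does not, which is why the two results differ by one.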

Wide characters

The wide character type in C is wchar_t, which is defined in <wchar.h>. In Microsoft C the definition looks like this:

typedef unsigned short wchar_t;

We can use the following statements to define wide characters (the sizes assume a 16-bit wchar_t, as above).

wchar_t c = L'A';                 // 2 bytes; wchar_t c = 'A'; stores the same value

wchar_t *p = L"Hello, World!";    // the string occupies 28 bytes (14 wide characters, including the terminator)

wchar_t a[] = L"Hello, World!";   // sizeof(a) is 28 bytes, including the terminating 0
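
The sizes above depend on the width of wchar_t. It is 16 bits in Microsoft C, but many Unix compilers use a 32-bit wchar_t. A sketch that avoids hard-coding the width (the helper names are hypothetical) might look like this:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: size in bytes of the wide array, terminator included. */
size_t wide_hello_array_size(void)
{
    wchar_t a[] = L"Hello, World!";
    return sizeof(a);                  /* 14 * sizeof(wchar_t) */
}

/* Hypothetical helper: number of wchar_t elements, terminator included. */
size_t wide_hello_count(void)
{
    wchar_t a[] = L"Hello, World!";
    return sizeof(a) / sizeof(a[0]);   /* element count, not byte count */
}
```

With a 16-bit wchar_t the first function returns 28, matching the notes; dividing by sizeof(a[0]) gives the element count regardless of the width.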

Wide character functions library

The functions for the ordinary char type do not work on wide strings. For example:

char *pc = "Hello!";

wchar_t *pw = L"Hello!";

int iLength = strlen(pc);

iLength = strlen(pw) is a type mismatch rather than a syntax error: strlen() is declared as strlen(const char *), while pw is a wchar_t * (here an unsigned short *). A C compiler typically reports a warning about the incompatible pointer type; a C++ compiler reports an error.

The form of string stored in memory:

The 6 characters of the character string "Hello!" have the 16-bit values:

0x0048 0x0065 0x006C 0x006C 0x006F 0x0021

and are stored in memory by (little-endian) Intel processors like this:

48 00 65 00 6C 00 6C 00 6F 00 21 00

If iLength = strlen(pw) is compiled anyway, iLength is set to 1, because strlen stops at the first zero byte, which is the second byte of the first character.
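
To see this without relying on a particular wchar_t width or byte order, the byte sequence listed above can be written out by hand and passed to strlen. This is only an illustrative sketch; the array and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The bytes of the wide string "Hello!" exactly as listed above
   (16-bit characters, low byte first), followed by a 16-bit terminator. */
static const unsigned char hello_bytes[] = {
    0x48, 0x00, 0x65, 0x00, 0x6C, 0x00,
    0x6C, 0x00, 0x6F, 0x00, 0x21, 0x00,
    0x00, 0x00
};

/* Hypothetical helper: what strlen sees when given those bytes. */
size_t narrow_length_of_wide_bytes(void)
{
    /* strlen stops at the first zero byte, right after 0x48 ('H'). */
    return strlen((const char *)hello_bytes);
}
```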

Wide character functions in C

Each single-byte character function has a counterpart for the wchar_t data type. These functions are declared both in <wchar.h> and in the header file where the normal function is declared.

1byte char data type functions	wide char data type functions
strlen( const char*)	wcslen( const wchar_t*)
printf( const char*, …)	wprintf( const wchar_t*, …)
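
As a small sketch of the pairing in the table, the two functions below (hypothetical names) measure the same text once as a char string and once as a wide string. Both report the number of characters, not the number of bytes.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: length of the char string, via strlen. */
size_t narrow_len(void)
{
    const char *pc = "Hello!";
    return strlen(pc);
}

/* Hypothetical helper: length of the wide string, via wcslen. */
size_t wide_len(void)
{
    const wchar_t *pw = L"Hello!";
    return wcslen(pw);    /* counts wchar_t elements, not bytes */
}
```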

Maintain a single source code

·The obvious approach is to keep two versions of the source code: one compiled for ASCII (char) strings and the other compiled for wide-character strings.

·A better approach is to keep a single version of the source and include the <TCHAR.H> header file, which ships with Microsoft Visual C++ and is not part of the ANSI C standard.

How to use TCHAR.H?

There are some very useful definitions in TCHAR.H :

#ifdef _UNICODE
    typedef wchar_t TCHAR ;
    #define __T(x)  L##x
    #define _tcslen wcslen
#else
    typedef char TCHAR ;
    #define __T(x)  x
    #define _tcslen strlen
#endif      /* _UNICODE */

#define _T(x)    __T(x)
#define _TEXT(x) __T(x)

So we can call _tcslen whether the string is made of char or of wide characters. The preprocessor replaces it with wcslen or strlen automatically. To use the wide-character functions, we only need to pass the option "-D _UNICODE" to the compiler.

We can write declarations like this:

TCHAR *pstr = _TEXT("Hello, World!");
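
TCHAR.H is specific to Microsoft's compiler, so the sketch below imitates the same mechanism with stand-in names (MY_TCHAR, MY_TEXT, my_tcslen and generic_hello_length are all hypothetical and not part of any header). It is meant only to show how one source file can compile either way.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for the TCHAR.H definitions; all names here are hypothetical. */
#ifdef _UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT_IMPL(x) L##x      /* paste an L prefix onto the literal */
#define my_tcslen wcslen
#else
typedef char MY_TCHAR;
#define MY_TEXT_IMPL(x) x         /* leave the literal unchanged */
#define my_tcslen strlen
#endif

/* Second macro level, mirroring how _TEXT forwards to __T. */
#define MY_TEXT(x) MY_TEXT_IMPL(x)

/* Compiles to strlen on a char string or wcslen on a wide string. */
size_t generic_hello_length(void)
{
    const MY_TCHAR *pstr = MY_TEXT("Hello, World!");
    return my_tcslen(pstr);
}
```

Either way the function returns the same character count; only the element type and the underlying library call change when _UNICODE is defined.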

Wide Characters and Windows

Windows NT supports both the ASCII character set and Unicode, so its functions can accept both 8-bit and 16-bit character strings.

Windows 98 has much less Unicode support than Windows NT. Only a few Windows 98 function calls support wide-character strings.

Windows Header File Types

A Windows program includes the header file WINDOWS.H. This file includes a number of other header files, including WINDEF.H, which has many of the basic type definitions used in Windows and which itself includes WINNT.H. WINNT.H handles the basic Unicode support.

WINNT.H defines some new data types and useful macros. These definitions let you mix ASCII and Unicode character strings in the same program, or write a single program that can be compiled for either ASCII or Unicode:

typedef char CHAR ;

typedef wchar_t WCHAR ;     // wc

typedef CHAR * PCHAR, * LPCH, * PCH, * NPSTR, * LPSTR, * PSTR ;

typedef CONST CHAR * LPCCH, * PCCH, * LPCSTR, * PCSTR ;

typedef WCHAR * PWCHAR, * LPWCH, * PWCH, * NWPSTR, * LPWSTR, * PWSTR ;

typedef CONST WCHAR * LPCWCH, * PCWCH, * LPCWSTR, * PCWSTR ;

#ifdef  UNICODE

typedef WCHAR TCHAR, * PTCHAR ;

typedef LPWSTR LPTCH, PTCH, PTSTR, LPTSTR ;

typedef LPCWSTR LPCTSTR ;

#define __TEXT(quote) L##quote

#else

typedef char TCHAR, * PTCHAR ;

typedef LPSTR LPTCH, PTCH, PTSTR, LPTSTR ;

typedef LPCSTR LPCTSTR ;

#define __TEXT(quote) quote

#endif

#define TEXT(quote) __TEXT(quote)

8-bit character variables and strings	use CHAR, PCHAR (or one of the others)
explicit 16-bit character variables and strings	use WCHAR, PWCHAR, and put an L before the quotation marks
8-bit or 16-bit depending on whether the UNICODE identifier is defined	use TCHAR, PTCHAR, and the TEXT macro

Windows' String Functions

Microsoft C includes wide-character and generic versions of all C run-time library functions that require character string arguments. Windows also provides its own set of string functions:

iLength = lstrlen (pString) ;

pString = lstrcpy (pString1, pString2) ;

pString = lstrcpyn (pString1, pString2, iCount) ;

pString = lstrcat (pString1, pString2) ;

iComp = lstrcmp (pString1, pString2) ;

iComp = lstrcmpi (pString1, pString2) ;

These work much the same as their C library equivalents. They accept wide-character strings if the UNICODE identifier is defined and regular strings if not.

Using printf in Windows

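The printf() function cannot be used in a Windows program, because a Windows GUI program has no standard output to print to. The related functions still work.

Use fprintf() to write formatted output to files.

Use sprintf() to format a string in a buffer, which we can then pass to MessageBox().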

char szBuffer [100] ;
sprintf (szBuffer, "The sum of %i and %i is %i", 5, 3, 5+3) ;
puts (szBuffer) ;

The sprintf function itself can be written in terms of vsprintf:

int sprintf (char * szBuffer, const char * szFormat, ...)
{
     int     iReturn ;
     va_list pArgs ;

     va_start (pArgs, szFormat) ;
     iReturn = vsprintf (szBuffer, szFormat, pArgs) ;
     va_end (pArgs) ;
     return iReturn ;
}

The va_start macro sets pArgs to point to the variable on the stack right above the szFormat argument on the stack.
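
The same pattern works for any function that takes a variable number of arguments. The sketch below is a hypothetical wrapper named my_format. It swaps vsprintf for the C99 function vsnprintf so that the buffer size can be passed in; that swap is a choice made for this illustration, not something taken from the book.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: formats into szBuffer, writing at most cbBuffer
   bytes including the terminating '\0'. Returns what vsnprintf returns. */
int my_format(char *szBuffer, size_t cbBuffer, const char *szFormat, ...)
{
    int     iReturn;
    va_list pArgs;

    va_start(pArgs, szFormat);   /* start at the argument after szFormat */
    iReturn = vsnprintf(szBuffer, cbBuffer, szFormat, pArgs);
    va_end(pArgs);               /* always pair va_start with va_end */

    return iReturn;
}
```

A call such as my_format(szBuffer, sizeof(szBuffer), "The sum of %i and %i is %i", 5, 3, 5+3) fills the buffer with the same text as the sprintf example earlier.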

	ASCII	Wide-Character	Generic
Variable Number of Arguments
Standard Version	sprintf	swprintf	_stprintf
Max-Length Version	_snprintf	_snwprintf	_sntprintf
Windows Version	wsprintfA	wsprintfW	wsprintf
Pointer to Array of Arguments
Standard Version	vsprintf	vswprintf	_vstprintf
Max-Length Version	_vsnprintf	_vsnwprintf	_vsntprintf
Windows Version	wvsprintfA	wvsprintfW	wvsprintf
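
As a sketch of the wide-character column, the hypothetical wrapper my_wformat below uses the standard C form of vswprintf, which takes a maximum count. This assumes a C99-conforming library; older Microsoft versions of swprintf and vswprintf did not take a count argument.

```c
#include <stdarg.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper: wide-character formatting with a length limit.
   cchBuffer is a count of wchar_t elements, not a count of bytes. */
int my_wformat(wchar_t *szBuffer, size_t cchBuffer,
               const wchar_t *szFormat, ...)
{
    int     iReturn;
    va_list pArgs;

    va_start(pArgs, szFormat);
    iReturn = vswprintf(szBuffer, cchBuffer, szFormat, pArgs);
    va_end(pArgs);

    return iReturn;   /* wide characters written, or negative on failure */
}
```

Because the limit is counted in characters, a caller with an array would pass sizeof(buffer) / sizeof(buffer[0]) rather than sizeof(buffer).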

A Formatting Message Box

SCRNSIZE.C

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int CDECL MessageBoxPrintf (TCHAR * szCaption, TCHAR * szFormat, ...)
{
     TCHAR   szBuffer [1024] ;
     va_list pArgList ;

          // The va_start macro (defined in STDARG.H) is usually equivalent to:
          // pArgList = (char *) &szFormat + sizeof (szFormat) ;

     va_start (pArgList, szFormat) ;

          // The last argument to wvsprintf points to the arguments

     _vsntprintf (szBuffer, sizeof (szBuffer) / sizeof (TCHAR),
                  szFormat, pArgList) ;

          // The va_end macro just zeroes out pArgList for no good reason

     va_end (pArgList) ;

     return MessageBox (NULL, szBuffer, szCaption, 0) ;
}

int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance,
                    PSTR szCmdLine, int iCmdShow)
{
     int cxScreen, cyScreen ;

     cxScreen = GetSystemMetrics (SM_CXSCREEN) ;
     cyScreen = GetSystemMetrics (SM_CYSCREEN) ;

     MessageBoxPrintf (TEXT ("ScrnSize"),
                       TEXT ("The screen is %i pixels wide by %i pixels high."),
                       cxScreen, cyScreen) ;

     return 0 ;
}