
Unicode-enabling Microsoft C/C++ Source Code

2007-07-28 14:23
Initial Steps for Unicode-enabling Microsoft C/C++ Source
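· Define _UNICODE (used by the C run-time headers) and UNICODE (used by the Windows API headers); undefine _MBCS if defined.
· Convert literal strings to use the L prefix or the _T macro.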
· Convert string functions to use Wide or TCHAR versions.
· Clarify string lengths in APIs as byte or character counts. For character-based display or printing (as opposed to GUI, which is pixel-based), use column counts, not byte or character counts.
· Replace character pointer arithmetic with GetNext style, as characters may consist of more than one Unicode code unit.
· Watch buffer sizes and buffer overflows: changing encodings may require either larger buffers or limits on string lengths. If character size changes from 1 byte to as many as 4 bytes, and string length was formerly 20 characters and 20 bytes, either expand the string buffer(s) from 20 to 80 bytes or limit the string to 5 characters (and therefore 20 bytes). Note that maximum buffer expansion may be constrained (for example, to 65 KB). Reducing string length to a fixed number of characters may break existing applications. Limiting strings to a fixed byte length (for example, allowing any string that fits into 20 bytes) is dangerous: simple operations such as uppercasing a string may cause it to grow and exceed the byte length.
· Replace functions that accept or return arguments of a single character with functions that use strings instead. International operations on a single character may return more than one code point. For example, upper('ß') returns "SS".
· Use wmain instead of main. The environment is then available through _wenviron instead of _environ.
int wmain( int argc, wchar_t *argv[ ], wchar_t *envp[ ] )
· MFC Unicode applications use wWinMain as the entry point.
In the Output page of the Linker folder in the project's Property Pages dialog box, set the Entry Point symbol to wWinMainCRTStartup.
· Consider fonts. Identify the fonts that will render each language or script used.
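The "GetNext style" advice above can be sketched in portable C++. This is only an illustration under stated assumptions: the helper names next_code_point and count_code_points are hypothetical, and char16_t is used so the sketch compiles anywhere (on Windows, wchar_t holds the same UTF-16 code units). It shows why stepping by one code unit is not the same as stepping by one character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the code point that starts at index i of a
// UTF-16 string and advance i past it (by one or two code units).
// Precondition: i < s.size(). An unpaired surrogate is returned as-is and
// consumes a single code unit.
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    char16_t lead = s[i++];
    if (lead >= 0xD800 && lead <= 0xDBFF && i < s.size()) {
        char16_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;  // the character used two code units
            return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                           + (char32_t(trail) - 0xDC00);
        }
    }
    return lead;
}

// Hypothetical helper: count characters (code points), not code units.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) {
        next_code_point(s, i);
    }
    return n;
}
```

A string holding one character outside the Basic Multilingual Plane between two ASCII letters has four code units but three characters, so code that does p++ to reach "the next character" would land in the middle of a surrogate pair.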
File I/O, Database, Transfer Protocol Considerations
· Consider whether to read/write UTF-8 or UTF-16 in files, databases, and for data exchange.
· Consider Endian-ness in UTF-16 files.
Read/Write Big-Endian on networks. Use Big-Endian if you don't produce a BOM.
Endian-ness of files will depend on the file format and/or the architecture of the source or target machine.
When reading files encoded in UTF-16 or UTF-32, be prepared to swap-bytes to convert endian-ness.
Also consider streams and transfer protocols and the encoding used in each.
· Label files or protocols for data exchange with the correct character encoding. E.g. set HTTP, HTML, XML to UTF-8 or UTF-16.
· Consider the Unicode BOM (Byte Order Mark) and whether it should be written with data. Remove it when reading data.
· Consider encoding conversion of legacy data and files, import and export, and transfer protocols. (MultiByteToWideChar, WideCharToMultiByte, mbtowc, wctomb, wcstombs, mbstowcs)
· Consider writing to the Clipboard:
use CF_TEXT format and write native character encoding (ANSI) text, and
use CF_UNICODETEXT format and write Unicode text.
· Database applications should consider data types (NCHAR, NVARCHAR) and changes to schemas, triggers, stored procedures, and queries, as well as data storage growth, indexes, and performance.
Note that the Unicode schema changes will have different impacts and concerns on different vendors' databases. If database portability is a requirement, the features and behaviors of each database need to be taken into account.
(I know this item is seriously understated. To be expanded sometime in the future.)
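The endian-ness and BOM items above can be sketched as a small reader. This is a minimal illustration, not a definitive implementation: decode_utf16_bytes is a hypothetical name, the input is assumed to be already loaded into memory, and the big-endian default for BOM-less data follows the advice given in the list above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn raw UTF-16 bytes into 16-bit code units.
// A leading BOM selects the byte order and is removed from the result;
// without a BOM, big-endian is assumed. A trailing odd byte is ignored.
std::u16string decode_utf16_bytes(const std::vector<std::uint8_t>& bytes) {
    bool big_endian = true;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = true;
            pos = 2;
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            big_endian = false;
            pos = 2;
        }
    }
    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        // Pick the high and low byte according to the byte order.
        std::uint16_t hi = big_endian ? bytes[pos] : bytes[pos + 1];
        std::uint16_t lo = big_endian ? bytes[pos + 1] : bytes[pos];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Assembling each code unit from its two bytes, rather than casting the buffer to a 16-bit pointer, gives the same result on little-endian and big-endian machines.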
Stream I/O
Streams are difficult in Microsoft C++. You may run into 3 types of problems:
1. Unicode filenames are not supported. The workaround is to open the file with _wfopen and, if needed, use the returned FILE * in subsequent stream I/O (the stream constructor that takes a FILE * is a Microsoft extension).
std::ifstream stm(_wfopen(pFilename, L"r"));
2. Stream I/O will convert Unicode data from/to the native (ANSI) code page on read/write, not UTF-8 or UTF-16. However, the stream class can be modified to read/write UTF-8: you can implement a facet to convert between Unicode and UTF-8.
codecvt <wchar_t, char, mbstate_t>
3. To read/write UTF-16 with stream I/O, use binary opens and binary I/O. To set binary I/O:
_setmode( _fileno( stdin ), _O_BINARY );

Also see the Microsoft run-time library reference: "Unicode Stream I/O in Text and Binary Modes".
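For item 2 above, the heart of any Unicode-to-UTF-8 conversion is the encoding of a single code point. The following is a sketch of that step only, not a codecvt facet; append_utf8 is a hypothetical name, and a real facet would additionally have to handle partial input and the decoding direction.

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to out.
// Returns false (and appends nothing) for surrogate values and for values
// above U+10FFFF, which are not valid Unicode scalar values.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

The 1-to-4-byte growth visible here is the reason the buffer-size guidance earlier in this article matters when text moves from a single-byte code page to UTF-8.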
Note: There aren't TCHAR equivalents for cout/wcout, cin/wcin, etc. You may want to make your own preprocessor definition for "tout", if you are compiling code both ways.
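One possible shape for such a definition is sketched below. The names tout, tcin, tstring and TOUT_TEXT are assumptions made for this example and are not supplied by any Microsoft header; in a real project that includes tchar.h, the _T macro would normally take the place of TOUT_TEXT.

```cpp
#include <iostream>
#include <string>

// Hypothetical names: tout, tcin, tstring and TOUT_TEXT are defined here for
// illustration only. They switch on _UNICODE, like the TCHAR mappings do.
#ifdef _UNICODE
#define tout std::wcout
#define tcin std::wcin
#define TOUT_TEXT(s) L##s
typedef std::wstring tstring;
#else
#define tout std::cout
#define tcin std::cin
#define TOUT_TEXT(s) s
typedef std::string tstring;
#endif

// Writes a greeting to whichever stream tout selects at compile time.
inline void print_greeting(const tstring& name) {
    tout << TOUT_TEXT("Hello, ") << name << TOUT_TEXT("\n");
}
```

Code written against tout and tstring then compiles both ways without further changes, for example print_greeting(TOUT_TEXT("world")).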
Internationalization, Advanced Unicode, Platform and Other Considerations
· Consider using locale-based routines and further internationalization.
· For Windows 95, 98 and ME, consider using the Microsoft MSLU (Microsoft Layer for Unicode)
· Consider string compares and sorting, Unicode Collation Algorithm
· Consider Unicode Normalization
· Consider Character Folding
· Reconsider doing this on your own. Bring in an experienced Unicode consultant, and deploy your existing resources on the tasks they do best. (Hey, an I18nGuy's gotta earn a living...)
Unicode BOM Encoding Values
Encoding Form | BOM Encoding
UTF-8 | EF BB BF
UTF-16 (big-endian) | FE FF
UTF-16 (little-endian) | FF FE
UTF-16BE, UTF-32BE (big-endian) | No BOM!
UTF-16LE, UTF-32LE (little-endian) | No BOM!
UTF-32 (big-endian) | 00 00 FE FF
UTF-32 (little-endian) | FF FE 00 00
SCSU (compression) | 0E FE FF
The Byte Order Mark (BOM) is the Unicode character U+FEFF. (It can also represent a Zero Width No-Break Space.) The byte-swapped value U+FFFE is a noncharacter and should never appear in a Unicode character stream. Therefore the BOM can be used as the first character of a file (or more generally a string) as an indicator of endian-ness. With UTF-16, if the first 16-bit code unit is read as 0xFEFF, then the text has the same endian-ness as the machine reading it. If it is read as 0xFFFE, then the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.
Note, however, that not all files start with a BOM. In fact, the Unicode Standard says that UTF-16 or UTF-32 text that does not begin with a BOM, and whose byte order is not specified by a higher-level protocol, should be interpreted as big-endian.
The character U+FEFF also serves as an encoding signature for the Unicode Encoding Forms. The table shows the encoding of U+FEFF in each of the Unicode encoding forms. Note that by definition, text labeled as UTF-16BE, UTF-32BE, UTF-32LE or UTF-16LE should not have a BOM. The endian-ness is indicated in the label.
For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.
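The signature table above translates fairly directly into a detection routine. This is a sketch under stated assumptions: BomKind and detect_bom are hypothetical names, and the caller is assumed to have read the first few bytes of the data into a vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical result type for the sketch.
enum class BomKind { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE, Scsu };

// Hypothetical helper: classify the signature at the start of a byte buffer.
BomKind detect_bom(const std::vector<std::uint8_t>& b) {
    auto starts_with = [&b](std::initializer_list<std::uint8_t> sig) {
        if (b.size() < sig.size()) return false;
        return std::equal(sig.begin(), sig.end(), b.begin());
    };
    // Test the 4-byte signatures first: FF FE 00 00 also begins with FF FE.
    // (UTF-16LE text whose first character after the BOM is U+0000 looks the
    // same; this sketch resolves that ambiguity in favour of UTF-32.)
    if (starts_with({0x00, 0x00, 0xFE, 0xFF})) return BomKind::Utf32BE;
    if (starts_with({0xFF, 0xFE, 0x00, 0x00})) return BomKind::Utf32LE;
    if (starts_with({0xEF, 0xBB, 0xBF})) return BomKind::Utf8;
    if (starts_with({0x0E, 0xFE, 0xFF})) return BomKind::Scsu;
    if (starts_with({0xFE, 0xFF})) return BomKind::Utf16BE;
    if (starts_with({0xFF, 0xFE})) return BomKind::Utf16LE;
    return BomKind::None;
}
```

A result of BomKind::None does not identify the encoding; it only means the data carries no signature, so a label or a default has to decide.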
Data Types
ANSI | Wide | TCHAR
char | wchar_t | _TCHAR
_finddata_t | _wfinddata_t | _tfinddata_t
__finddata64_t | __wfinddata64_t | _tfinddata64_t
_finddatai64_t | _wfinddatai64_t | _tfinddatai64_t
int | wint_t | _TINT
signed char | wchar_t | _TSCHAR
unsigned char | wchar_t | _TUCHAR
char | wchar_t | _TXCHAR
(none) | L | _T or _TEXT
LPSTR (char *) | LPWSTR (wchar_t *) | LPTSTR (_TCHAR *)
LPCSTR (const char *) | LPCWSTR (const wchar_t *) | LPCTSTR (const _TCHAR *)
LPOLESTR (For OLE) | LPWSTR | LPTSTR
Constant and Global Variables
ANSI | Wide | TCHAR
EOF | WEOF | _TEOF
_environ | _wenviron | _tenviron
_pgmptr | _wpgmptr | _tpgmptr
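The three columns in the tables above come from preprocessor mappings: one neutral name resolves to either the ANSI or the Wide name at compile time. The sketch below imitates that mechanism with deliberately hypothetical names (MY_UNICODE, my_tchar, MY_T, my_tcslen, my_tcscmp) so that it compiles on any platform and does not clash with tchar.h; it is an illustration of the idea, not a copy of the Microsoft headers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for _UNICODE, _TCHAR, _T, _tcslen and _tcscmp.
// Define MY_UNICODE before this point to select the wide versions.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(x) L##x
#define my_tcslen std::wcslen
#define my_tcscmp std::wcscmp
#else
typedef char my_tchar;
#define MY_T(x) x
#define my_tcslen std::strlen
#define my_tcscmp std::strcmp
#endif

// Generic code written once against the neutral names.
inline std::size_t greeting_length() {
    const my_tchar* greeting = MY_T("hello");
    return my_tcslen(greeting);  // length in characters (code units)
}

// Byte count needed to store the greeting including its terminator.
inline std::size_t greeting_bytes() {
    return (greeting_length() + 1) * sizeof(my_tchar);
}
```

The second function shows why lengths need to be labelled as character or byte counts: the character count is the same in both builds, while the byte count scales with sizeof(my_tchar).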
Platform SDK String Functions
Many Windows API functions compile into ANSI or Wide forms, depending on whether the symbol UNICODE is defined. Modules that operate on both ANSI and Wide characters need to be aware of this and call the A or W form explicitly. Otherwise, code that uses the Character Data Type-independent name requires no changes; just compile with the symbol UNICODE defined.
The following list is by no means all of the Character Data Type-dependent APIs, just some character- and string-related ones. Look in WinNLS.h for some code page and locale related APIs.
ANSI | Wide | Character Data Type-Independent Name
CharLowerA | CharLowerW | CharLower
CharLowerBuffA | CharLowerBuffW | CharLowerBuff
CharNextA | CharNextW | CharNext
CharNextExA | CharNextExW | CharNextEx
CharPrevA | CharPrevW | CharPrev
CharPrevExA | CharPrevExW | CharPrevEx
CharToOemA | CharToOemW | CharToOem
CharToOemBuffA | CharToOemBuffW | CharToOemBuff
CharUpperA | CharUpperW | CharUpper
CharUpperBuffA | CharUpperBuffW | CharUpperBuff
CompareStringA | CompareStringW | CompareString
FoldStringA | FoldStringW | FoldString
GetStringTypeA | GetStringTypeW | GetStringType
GetStringTypeExA | GetStringTypeExW | GetStringTypeEx
IsCharAlphaA | IsCharAlphaW | IsCharAlpha
IsCharAlphaNumericA | IsCharAlphaNumericW | IsCharAlphaNumeric
IsCharLowerA | IsCharLowerW | IsCharLower
IsCharUpperA | IsCharUpperW | IsCharUpper
LoadStringA | LoadStringW | LoadString
lstrcatA | lstrcatW | lstrcat
lstrcmpA | lstrcmpW | lstrcmp
lstrcmpiA | lstrcmpiW | lstrcmpi
lstrcpyA | lstrcpyW | lstrcpy
lstrcpynA | lstrcpynW | lstrcpyn
lstrlenA | lstrlenW | lstrlen
OemToCharA | OemToCharW | OemToChar
OemToCharBuffA | OemToCharBuffW | OemToCharBuff
wsprintfA | wsprintfW | wsprintf
wvsprintfA | wvsprintfW | wvsprintf
TCHAR String Functions
Functions sorted by ANSI name, for ease of converting to Unicode.
ANSI | Wide | TCHAR
_access | _waccess | _taccess
_atoi64 | _wtoi64 | _tstoi64
_atoi64 | _wtoi64 | _ttoi64
_cgets | _cgetws | _cgetts
_chdir | _wchdir | _tchdir
_chmod | _wchmod | _tchmod
_cprintf | _cwprintf | _tcprintf
_cputs | _cputws | _cputts
_creat | _wcreat | _tcreat
_cscanf | _cwscanf | _tcscanf
_ctime64 | _wctime64 | _tctime64
_execl | _wexecl | _texecl
_execle | _wexecle | _texecle
_execlp | _wexeclp | _texeclp
_execlpe | _wexeclpe | _texeclpe
_execv | _wexecv | _texecv
_execve | _wexecve | _texecve
_execvp | _wexecvp | _texecvp
_execvpe | _wexecvpe | _texecvpe
_fdopen | _wfdopen | _tfdopen
_fgetchar | _fgetwchar | _fgettchar
_findfirst | _wfindfirst | _tfindfirst
_findnext64 | _wfindnext64 | _tfindnext64
_findnext | _wfindnext | _tfindnext
_findnexti64 | _wfindnexti64 | _tfindnexti64
_fputchar | _fputwchar | _fputtchar
_fsopen | _wfsopen | _tfsopen
_fullpath | _wfullpath | _tfullpath
_getch | _getwch | _gettch
_getche | _getwche | _gettche
_getcwd | _wgetcwd | _tgetcwd
_getdcwd | _wgetdcwd | _tgetdcwd
_ltoa | _ltow | _ltot
_makepath | _wmakepath | _tmakepath
_mkdir | _wmkdir | _tmkdir
_mktemp | _wmktemp | _tmktemp
_open | _wopen | _topen
_popen | _wpopen | _tpopen
_putch | _putwch | _puttch
_putenv | _wputenv | _tputenv
_rmdir | _wrmdir | _trmdir
_scprintf | _scwprintf | _sctprintf
_searchenv | _wsearchenv | _tsearchenv
_snprintf | _snwprintf | _sntprintf
_snscanf | _snwscanf | _sntscanf
_sopen | _wsopen | _tsopen
_spawnl | _wspawnl | _tspawnl
_spawnle | _wspawnle | _tspawnle
_spawnlp | _wspawnlp | _tspawnlp
_spawnlpe | _wspawnlpe | _tspawnlpe
_spawnv | _wspawnv | _tspawnv
_spawnve | _wspawnve | _tspawnve
_spawnvp | _wspawnvp | _tspawnvp
_spawnvpe | _wspawnvpe | _tspawnvpe
_splitpath | _wsplitpath | _tsplitpath
_stat64 | _wstat64 | _tstat64
_stat | _wstat | _tstat
_stati64 | _wstati64 | _tstati64
_strdate | _wstrdate | _tstrdate
_strdec | _wcsdec | _tcsdec
_strdup | _wcsdup | _tcsdup
_stricmp | _wcsicmp | _tcsicmp
_stricoll | _wcsicoll | _tcsicoll
_strinc | _wcsinc | _tcsinc
_strlwr | _wcslwr | _tcslwr
_strncnt | _wcsncnt | _tcsnbcnt
_strncnt | _wcsncnt | _tcsnccnt
_strncoll | _wcsncoll | _tcsnccoll
_strnextc | _wcsnextc | _tcsnextc
_strnicmp | _wcsnicmp | _tcsncicmp
_strnicmp | _wcsnicmp | _tcsnicmp
_strnicoll | _wcsnicoll | _tcsncicoll
_strnicoll | _wcsnicoll | _tcsnicoll
_strninc | _wcsninc | _tcsninc
_strnset | _wcsnset | _tcsncset
_strnset | _wcsnset | _tcsnset
_strrev | _wcsrev | _tcsrev
_strset | _wcsset | _tcsset
_strspnp | _wcsspnp | _tcsspnp
_strtime | _wstrtime | _tstrtime
_strtoi64 | _wcstoi64 | _tcstoi64
_strtoui64 | _wcstoui64 | _tcstoui64
_strupr | _wcsupr | _tcsupr
_tempnam | _wtempnam | _ttempnam
_ui64toa | _ui64tow | _ui64tot
_ultoa | _ultow | _ultot
_ungetch | _ungetwch | _ungettch
_unlink | _wunlink | _tunlink
_utime64 | _wutime64 | _tutime64
_utime | _wutime | _tutime
_vscprintf | _vscwprintf | _vsctprintf
_vsnprintf | _vsnwprintf | _vsntprintf
asctime | _wasctime | _tasctime
atof | _wtof | _tstof
atoi | _wtoi | _tstoi
atoi | _wtoi | _ttoi
atol | _wtol | _tstol
atol | _wtol | _ttol
character compare | Maps to macro or inline function | _tccmp
character copy | Maps to macro or inline function | _tccpy
character length | Maps to macro or inline function | _tclen
ctime | _wctime | _tctime
fgetc | fgetwc | _fgettc
fgets | fgetws | _fgetts
fopen | _wfopen | _tfopen
fprintf | fwprintf | _ftprintf
fputc | fputwc | _fputtc
fputs | fputws | _fputts
freopen | _wfreopen | _tfreopen
fscanf | fwscanf | _ftscanf
getc | getwc | _gettc
getchar | getwchar | _gettchar
getenv | _wgetenv | _tgetenv
gets | _getws | _getts
isalnum | iswalnum | _istalnum
isalpha | iswalpha | _istalpha
isascii | iswascii | _istascii
iscntrl | iswcntrl | _istcntrl
isdigit | iswdigit | _istdigit
isgraph | iswgraph | _istgraph
islead (Always FALSE) | (Always FALSE) | _istlead
isleadbyte (Always FALSE) | isleadbyte (Always FALSE) | _istleadbyte
islegal (Always TRUE) | (Always TRUE) | _istlegal
islower | iswlower | _istlower
isprint | iswprint | _istprint
ispunct | iswpunct | _istpunct
isspace | iswspace | _istspace
isupper | iswupper | _istupper
isxdigit | iswxdigit | _istxdigit
main | wmain | _tmain
perror | _wperror | _tperror
printf | wprintf | _tprintf
putc | putwc | _puttc
putchar | putwchar | _puttchar
puts | _putws | _putts
remove | _wremove | _tremove
rename | _wrename | _trename
scanf | wscanf | _tscanf
setlocale | _wsetlocale | _tsetlocale
sprintf | swprintf | _stprintf
sscanf | swscanf | _stscanf
strcat | wcscat | _tcscat
strchr | wcschr | _tcschr
strcmp | wcscmp | _tcscmp
strcoll | wcscoll | _tcscoll
strcpy | wcscpy | _tcscpy
strcspn | wcscspn | _tcscspn
strerror | _wcserror | _tcserror
strftime | wcsftime | _tcsftime
strlen | wcslen | _tcsclen
strlen | wcslen | _tcslen
strncat | wcsncat | _tcsncat
strncat | wcsncat | _tcsnccat
strncmp | wcsncmp | _tcsnccmp
strncmp | wcsncmp | _tcsncmp
strncpy | wcsncpy | _tcsnccpy
strncpy | wcsncpy | _tcsncpy
strpbrk | wcspbrk | _tcspbrk
strrchr | wcsrchr | _tcsrchr
strspn | wcsspn | _tcsspn
strstr | wcsstr | _tcsstr
strtod | wcstod | _tcstod
strtok | wcstok | _tcstok
strtol | wcstol | _tcstol
strtoul | wcstoul | _tcstoul
strxfrm | wcsxfrm | _tcsxfrm
system | _wsystem | _tsystem
tmpnam | _wtmpnam | _ttmpnam
tolower | towlower | _totlower
toupper | towupper | _totupper
ungetc | ungetwc | _ungettc
vfprintf | vfwprintf | _vftprintf
vprintf | vwprintf | _vtprintf
vsprintf | vswprintf | _vstprintf
WinMain | wWinMain | _tWinMain