
Delphi in a Unicode World Part II: New RTL Features and Classes to Support Unicode

2013-02-04 11:48


By: Nick Hodges
Abstract: This article will cover the new features of the Tiburon Runtime Library that will help handle Unicode strings.
Source: http://delphi.about.com/gi/o.htm?zi=1/XJ&zTi=1&sdn=delphi&cdn=compute&tm=3051&f=11&su=p284.13.342.ip_p504.6.342.ip_&tt=2&bt=1&bts=0&st=31&zu=http%3A//edn.embarcadero.com/article/38498


Introduction

In Part I, we saw how Unicode support is a huge benefit for Delphi developers, enabling them to work with every character set in the
Unicode universe. We saw the basics of the UnicodeString type and how it will be used in Delphi.

In Part II, we’ll look at some of the new features of the Delphi Runtime Library that support Unicode and general string handling.


TCharacter Class

The Tiburon RTL includes a new class called TCharacter, which is found in the Character unit. It is a sealed class that consists entirely of static class functions.
Developers should not create instances of TCharacter, but rather merely call its static class methods directly. Those class functions do a number of things, including:

Convert characters to upper or lower case
Determine whether a given character is of a certain type, i.e. is the character a letter, a number, a punctuation mark, etc.

TCharacter uses the standards set forth by the Unicode Consortium.

Developers can use the TCharacter class to do many things previously done with sets of chars. For instance, this code:

[code]uses
  Character;

begin
  if MyChar in ['a'..'z', 'A'..'Z'] then
  begin
    ...
  end;
end;
[/code]
can be easily replaced with

[code]uses
  Character;

begin
  if TCharacter.IsLetter(MyChar) then
  begin
    ...
  end;
end;
[/code]
The Character unit also contains a number of standalone functions that wrap up the functionality of each class function from TCharacter, so if you prefer a simple
function call, the above can be written as:

[code]uses
  Character;

begin
  if IsLetter(MyChar) then
  begin
    ...
  end;
end;
[/code]
Thus the TCharacter class can be used to do almost any manipulation or checking of characters that you might care to do.

In addition, TCharacter contains class methods to determine if a given character is a high or low surrogate of a surrogate pair.
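
As a minimal sketch of the surrogate checks, the console program below builds a string from a code point outside the Basic Multilingual Plane and inspects its two elements. The program name and the chosen code point (U+1D11E) are only illustrative; the sketch assumes the IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, and ConvertFromUtf32 class functions of TCharacter.

[code]program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

var
  S: string;
begin
  // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
  // Plane, so UTF-16 stores it as two Char elements: a surrogate pair.
  S := TCharacter.ConvertFromUtf32($1D11E);

  Writeln('Elements in string: ', Length(S));
  Writeln('First is high surrogate: ', TCharacter.IsHighSurrogate(S[1]));
  Writeln('Second is low surrogate: ', TCharacter.IsLowSurrogate(S[2]));
  Writeln('Forms a surrogate pair: ', TCharacter.IsSurrogatePair(S[1], S[2]));
end.
[/code]
Checks like these matter when code walks a string element by element, because a single visible character may occupy two Char positions.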


TEncoding Class

The Tiburon RTL also includes a new class called TEncoding. Its purpose is to define a specific type of character encoding so that you can tell the VCL what type of encoding you want used in specific situations.

For instance, you may have a TStringList instance that contains text that you want to write out to a file. Previously, you would have written:

[code]begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt');
  ...
end;
[/code]
and the file would have been written out using the default ANSI encoding. That code will still work fine: it will write out the file using ANSI string encoding as it always has. But now that Delphi supports Unicode string data, developers may want to write
out string data using a specific encoding. Thus, SaveToFile (as well as LoadFromFile) now takes an optional second parameter that defines the encoding to be used:

[code]begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);
  ...
end;
[/code]
Execute the above code and the file will be written out as a Unicode (UTF-16) encoded text file.
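
Reading the file back works the same way. The following is a rough sketch of a round trip through UTF-8; the program name and the file name RoundTrip.txt are made up for illustration, and the file is written to the current directory.

[code]program EncodingRoundTrip;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  Lines: TStringList;
  Original: string;
begin
  // Two Cyrillic letters, written as character codes so that the source
  // file itself does not need to be saved in a Unicode encoding.
  Original := #$0414#$0430;

  Lines := TStringList.Create;
  try
    Lines.Add(Original);
    // Write the text as UTF-8 ...
    Lines.SaveToFile('RoundTrip.txt', TEncoding.UTF8);
    Lines.Clear;
    // ... and read it back, naming the same encoding explicitly.
    Lines.LoadFromFile('RoundTrip.txt', TEncoding.UTF8);
    Writeln('Text survived the round trip: ', Lines[0] = Original);
  finally
    Lines.Free;
  end;
end.
[/code]
Naming the same encoding on both sides avoids depending on the active ANSI code page of the machine that happens to read the file.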

TEncoding will also convert a given set of bytes from one encoding to another, retrieve information about the bytes and/or characters in a given string or array of characters, convert any string into an array
of bytes (TBytes), and provide other functionality that you may need with regard to the specific encoding of a given string or array of chars.
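
A short sketch of these conversions follows. It assumes the GetBytes, GetByteCount, GetString, and Convert methods of TEncoding; the program and variable names are only illustrative.

[code]program EncodingBytes;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  Utf8Bytes, Utf16Bytes: TBytes;
begin
  S := 'caf'#$00E9; // "cafe" with an acute accent on the final letter

  // String -> bytes in a chosen encoding (no BOM is added here).
  Utf8Bytes := TEncoding.UTF8.GetBytes(S);
  Writeln('UTF-8 byte count:  ', Length(Utf8Bytes));

  // Bytes in one encoding -> bytes in another.
  Utf16Bytes := TEncoding.Convert(TEncoding.UTF8, TEncoding.Unicode, Utf8Bytes);
  Writeln('UTF-16 byte count: ', Length(Utf16Bytes));

  // Ask for the size without producing the bytes.
  Writeln('GetByteCount (UTF-8): ', TEncoding.UTF8.GetByteCount(S));

  // Bytes -> string again.
  Writeln('Round trip intact: ', TEncoding.Unicode.GetString(Utf16Bytes) = S);
end.
[/code]
The accented letter needs two bytes in UTF-8, so the two byte counts differ (5 versus 8) even though both arrays describe the same four characters.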

The TEncoding class includes the following class properties that give you singleton access to a TEncoding instance of the given encoding:

[code]class property ASCII: TEncoding read GetASCII;
class property BigEndianUnicode: TEncoding read GetBigEndianUnicode;
class property Default: TEncoding read GetDefault;
class property Unicode: TEncoding read GetUnicode;
class property UTF7: TEncoding read GetUTF7;
class property UTF8: TEncoding read GetUTF8;
[/code]
The Default property refers to the active ANSI code page. The Unicode property refers to UTF-16 (little-endian).

TEncoding also includes the class function

[code]class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;
[/code]
that will return an instance of TEncoding that has the affinity for the code page passed in the parameter.

In addition, it includes the following function:

[code]function GetPreamble: TBytes;
[/code]
which will return the correct byte order mark (BOM) for the given encoding.
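
To make GetEncoding and GetPreamble concrete, here is a small sketch that prints the preamble bytes of a few encodings. ShowPreamble is a hypothetical helper written for this example. The sketch also assumes that an instance obtained from GetEncoding is owned by the caller and must be freed, unlike the singleton instances behind the class properties.

[code]program PreambleDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the BOM bytes of an encoding in hex.
procedure ShowPreamble(const ALabel: string; Encoding: TEncoding);
var
  Bom: TBytes;
  B: Byte;
begin
  Bom := Encoding.GetPreamble;
  Write(ALabel, ':');
  for B in Bom do
    Write(' ', IntToHex(B, 2));
  Writeln;
end;

var
  Cyrillic: TEncoding;
begin
  ShowPreamble('UTF-8', TEncoding.UTF8);                 // EF BB BF
  ShowPreamble('UTF-16 LE', TEncoding.Unicode);          // FF FE
  ShowPreamble('UTF-16 BE', TEncoding.BigEndianUnicode); // FE FF

  // Assumption: the caller owns the instance returned by GetEncoding.
  Cyrillic := TEncoding.GetEncoding(1251);
  try
    ShowPreamble('CP 1251', Cyrillic); // code page encodings have no BOM
  finally
    Cyrillic.Free;
  end;
end.
[/code]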

TEncoding is also interface compatible with the .NET class called Encoding.


TStringBuilder

The RTL now includes a class called TStringBuilder. Its purpose is revealed in its name: it is a class designed to “build up” strings. TStringBuilder contains
numerous overloaded functions for adding, replacing, and inserting content into a given string. The string builder class makes it easy to create single strings out of a variety of different data types. All of the Append, Insert,
and Replace functions return an instance of TStringBuilder, so they can easily be chained together to create a single string.

For example, you might choose to use a TStringBuilder in place of a complicated Format statement and write the following code:

[code]procedure TForm86.Button2Click(Sender: TObject);
var
  MyStringBuilder: TStringBuilder;
  Price: Double;
begin
  MyStringBuilder := TStringBuilder.Create('');
  try
    Price := 1.49;
    Label1.Caption := MyStringBuilder.Append('The apples are $').Append(Price)
      .Append(' a pound.').ToString;
  finally
    MyStringBuilder.Free;
  end;
end;
[/code]
TStringBuilder is also interface compatible with the .NET class called StringBuilder.


Declaring New String Types

Tiburon’s compiler enables you to declare your own string type with an affinity for a given code page. Many code pages are available. (MSDN has a
nice rundown of available code pages.) For instance, if you require a string type with an affinity for ANSI-Cyrillic, you can declare:

[code]type
  // The code page for ANSI-Cyrillic is 1251
  CyrillicString = type AnsiString(1251);
[/code]
The new string type will then be a string with an affinity for the Cyrillic code page.


Additional RTL Support for Unicode

The RTL adds a number of routines that support the use of Unicode strings.


StringElementSize

StringElementSize returns the size, in bytes, of one element (code unit) in a given string. Consider the following code:

[code]procedure TForm88.Button3Click(Sender: TObject);
var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('The ElementSize for an AnsiString is: '
    + IntToStr(StringElementSize(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('The ElementSize for an UnicodeString is: '
    + IntToStr(StringElementSize(U)));
end;
[/code]
The result of the code above will be:

[code]The ElementSize for an AnsiString is: 1
The ElementSize for an UnicodeString is: 2
[/code]


StringCodePage

StringCodePage will return the Word value that corresponds to the codepage for a given string.

Consider the following code:

[code]procedure TForm88.Button2Click(Sender: TObject);
type
  // The code page for ANSI-Cyrillic is 1251
  CyrillicString = type AnsiString(1251);
var
  A: AnsiString;
  U: UnicodeString;
  U8: UTF8String;
  C: CyrillicString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('AnsiString Codepage: ' + IntToStr(StringCodePage(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('UnicodeString Codepage: ' + IntToStr(StringCodePage(U)));
  U8 := 'This is a UTF8string';
  Memo1.Lines.Add('UTF8string Codepage: ' + IntToStr(StringCodePage(U8)));
  C := 'This is a CyrillicString';
  Memo1.Lines.Add('CyrillicString Codepage: ' + IntToStr(StringCodePage(C)));
end;
[/code]
On a system whose active ANSI code page is 1252, the above code will result in the following output:

[code]AnsiString Codepage: 1252
UnicodeString Codepage: 1200
UTF8string Codepage: 65001
CyrillicString Codepage: 1251
[/code]


Other RTL Features for Unicode

There are a number of other routines for converting strings from one encoding to another, including:

UnicodeStringToUCS4String
UCS4StringToUnicodeString
UnicodeToUtf8
Utf8ToUnicode

In addition, the RTL declares a type called RawByteString, which is a string type with no encoding affiliated with it:

[code]RawByteString = type AnsiString($FFFF);
[/code]
The purpose of the RawByteString type is to enable the passing of string data of any code page without doing any code page conversions. This is most useful for routines that do not care about a specific encoding,
such as byte-oriented string searches. Normally, this would mean that parameters of routines that process strings without regard for the string's code page should be of type RawByteString. Declaring variables of
type RawByteString should rarely, if ever, be done, as this can lead to undefined behavior and potential data loss.
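
As an illustration of that parameter convention, the sketch below declares a byte-oriented routine that takes a RawByteString. CountByte is a hypothetical helper invented for this example, not an RTL routine. Note that only the parameter is a RawByteString; the variable passed in keeps its own, specific string type.

[code]program RawByteDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts how often a byte value occurs in the
// payload, regardless of the code page the string carries.
function CountByte(const S: RawByteString; Value: Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) = Value then
      Inc(Result);
end;

var
  U8: UTF8String;
begin
  U8 := 'a,b,c';
  // The UTF8String is passed as-is; no code page conversion takes place.
  Writeln('Commas found: ', CountByte(U8, Ord(','))); // Commas found: 2
end.
[/code]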

In general, string types are assignment compatible with each other.

For instance:

MyUnicodeString := MyAnsiString;

will perform as expected – it will take the contents of the AnsiString and place them into a UnicodeString. You should in general be able to assign one string type
to another, and the compiler will do the work needed to make the conversions, if possible.

Some conversions, however, can result in data loss, and one must watch out for this when moving from a string type that includes Unicode data to one that does not. For instance, you can assign a UnicodeString to
an AnsiString, but if the UnicodeString contains characters that have no mapping in the active ANSI code page at runtime, those characters will be lost in the conversion.
Consider the following code:

[code]procedure TForm88.Button4Click(Sender: TObject);
var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'This is a UnicodeString';
  A := U;
  Memo1.Lines.Add(A);
  U := 'Добро пожаловать в мир Юникода с использованием Дельфи 2009!!';
  A := U;
  Memo1.Lines.Add(A);
end;
[/code]
The output of the above when the current OS code page is 1252 is:

[code]This is a UnicodeString
????? ?????????? ? ??? ??????? ? ?????????????? ?????? 2009!!
[/code]
As you can see, because Cyrillic characters have no mapping in Windows-1252, information was lost when assigning this UnicodeString to an AnsiString. The
characters that could not be represented in the code page of the AnsiString were each replaced by a question mark, which turned the result into gibberish.


SetCodePage

SetCodePage, declared in the System.pas unit as

[code]procedure SetCodePage(var S: AnsiString; CodePage: Word; Convert: Boolean);
[/code]
is a new RTL routine that sets a new code page for a given AnsiString. The optional Convert parameter determines whether the payload of the string itself should be
converted to the given code page. If the Convert parameter is False, then only the code page recorded for the string is altered. If the Convert parameter
is True, then the payload of the passed string will be converted to the given code page.

SetCodePage should be used sparingly and with great care. Note that if Convert is set to False and the new code page doesn't actually match the existing payload,
then unpredictable results can occur. Also, if the existing data in the string is converted and the new code page doesn't have a representation for a given original character, data loss can occur.


Getting TBytes from Strings

The RTL also includes a set of overloaded routines for extracting an array of bytes from a string. As we’ll see in Part III, it is recommended that you use TBytes rather than a string as a data buffer. The RTL makes this easy by providing overloaded
versions of BytesOf() that take the different string types as a parameter.
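
A brief sketch of BytesOf() with two string types follows. It also calls WideBytesOf, which the text above does not mention; it is assumed here to be the SysUtils routine that returns the raw UTF-16 bytes of a UnicodeString, whereas BytesOf converts a UnicodeString to the active ANSI code page first.

[code]program BytesOfDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A: AnsiString;
  U: UnicodeString;
  Buffer: TBytes;
begin
  A := 'abc';
  U := 'abc';

  // One byte per element for the AnsiString payload.
  Buffer := BytesOf(A);
  Writeln('BytesOf(AnsiString):     ', Length(Buffer));

  // The UnicodeString overload yields ANSI bytes, so plain ASCII text
  // again gives one byte per character.
  Buffer := BytesOf(U);
  Writeln('BytesOf(UnicodeString):  ', Length(Buffer));

  // Assumed routine, not named in the text: raw UTF-16 payload,
  // two bytes per element.
  Buffer := WideBytesOf(U);
  Writeln('WideBytesOf(UnicodeString): ', Length(Buffer));
end.
[/code]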


Conclusion

Tiburon’s Runtime Library is now completely capable of supporting the new UnicodeString. It includes new classes and routines for handling, processing, and converting Unicode strings, for managing codepages, and for ensuring an easy migration from earlier
versions.

In Part III, we’ll cover the specific code constructs that you’ll need to look out for in ensuring that your code is Unicode ready.