[perl]Wide character in print
2011-07-19 21:12
binmode DATA, ":utf8";
----------------------------------------
Unicode-processing issues in Perl and how to cope with them
Perl 5.8+ has comprehensive support for Unicode and a wide range of text encodings. Still, many people experience problems when processing multi-language text. Here I explain the most common problems and offer solutions. An older version of this article is available. It is not as well structured, but it provides some additional Unicode-related details for perl 5.6.1. You can read this piece and dive into all the technical details and idiosyncrasies of perl and Unicode. Or you can hire me to fix your code.
A bunch of perldoc manpages outline and explain Perl’s Unicode support: perluniintro, perlunicode, the Encode module, the binmode() function. And the list is not complete. The major problem with this documentation is its volume. Most programmers don’t have to read it all, because to start working with Unicode you just need to know some basic facts and rules.
I have experienced several kinds of trouble with Unicode in Perl, in several projects. The two main problems I’ve seen are:
UTF-8 data getting double-encoded or other-encoding data getting mangled
“Wide character in print” warning
These two problems are closely related and often solved by similar moves.
Reading or at least browsing through the related manpages is still a good way to understand and solve your Unicode problems. If you don’t have time for that now, read on.
The problem showcase: the example
Imagine two simple variables with Unicode text in them, and you print those variables to standard output. What could be easier?

    #!/usr/bin/perl

    my $ustring1 = "Hello \x{263A}!\n";
    my $ustring2 = <DATA>;

    print "$ustring1$ustring2";
    __DATA__
    Hello ☺!

Both variables here contain the same data: the string "Hello " followed by the Unicode character WHITE SMILING FACE (U+263A), an exclamation mark and a newline character. The __DATA__ part ($ustring2) is UTF-8 encoded.
But when we print it, the first one comes out fine and the second one comes out garbled. This is because Perl knows that the first string is a Unicode string, which it stores internally in UTF-8. But it doesn’t know the encoding of the second. When it builds a bigger string for printing, it treats every byte of the second string as a separate character and re-encodes it into UTF-8, wrongly.
In addition, it prints a warning:

    Wide character in print at unitest1.pl line 6, <DATA> line 1.

We’ll look at it later, after we fix our output.
You could apparently fix things by avoiding concatenation:

    #!/usr/bin/perl

    my $ustring1 = "Hello \x{263A}!\n";
    my $ustring2 = <DATA>;

    print $ustring1, $ustring2;
    __DATA__
    Hello ☺!

But this is not a solution. Sometimes you simply can’t avoid concatenation; it is such a basic operation. In addition, it is error-prone and not future-proof.
Why the problem happens
First, some basic facts. There is a distinction between bytes and characters. Characters are Unicode characters. One character may be represented by several bytes when stored, printed or sent over a network. That depends on the particular encoding used. UTF-8 is just one of the ways to represent Unicode data.
Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. “On” state of the flag tells perl to treat the value as a string of Unicode characters.
If you take a string with utf8 flag off and concatenate it with a string that has utf8 flag on, perl converts the first one to Unicode.
This may sound okay and obvious. But then you think: How? Perl will need to know the encoding of the string data before converting it. And perl will try to guess it. And this is the usual source of problems.
The rule perl follows when guessing is documented: unless told otherwise, it treats each byte as one ISO-8859-1 (Latin-1) character. But my firm suggestion is: never let perl do that. Otherwise, there is a BIG chance that you’ll get double-encoded UTF-8 strings, or otherwise mangled data.
The solution: always make data encoding explicit, both for your input and output.
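The upgrade described above can be observed directly. What follows is a minimal sketch, assuming perl 5.8.1 or later (for utf8::is_utf8()) and the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The three UTF-8 bytes of U+263A, as they arrive from an undecoded file.
my $bytes = "\xE2\x98\xBA";
# A character string holding the single character U+263A.
my $chars = "\x{263A}";

# The utf8 flag is off for the byte string and on for the character string.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;

# Concatenation upgrades the byte string: every byte becomes one character,
# so the result holds 1 + 3 characters instead of the 2 we wanted.
my $joined        = $chars . $bytes;
my $joined_length = length $joined;

# Encoding the result for output yields 9 bytes instead of 6:
# the second smiley has been double-encoded.
my $output_length = length( Encode::encode('UTF-8', $joined) );

print STDERR "flags: $bytes_flag/$chars_flag, chars: $joined_length, bytes: $output_length\n";
```

The character count of 4 is the visible symptom: the three bytes were each taken for a Latin-1 character rather than for one UTF-8 sequence.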
Solution #1: Convert string to Unicode
One solution could be to tell perl that $ustring2 contains Unicode data in UTF-8 encoding. There are a couple of ways to do that; the orthodox way is through Encode’s decode_utf8() function:

    #!/usr/bin/perl
    use Encode;

    my $ustring1 = "Hello \x{263A}!\n";
    my $ustring2 = <DATA>;
    $ustring2 = decode_utf8( $ustring2 );

    print "$ustring1$ustring2";
    __DATA__
    Hello ☺!

In this simple case both ways would do the job, but they may get quite tedious if your inputs are plentiful. And it still prints the “Wide character” warning.
But this is what you should always do for the international data you get from other modules, for instance from databases.
You should not forget, though, that not every sequence of bytes is valid UTF-8. So the decode_utf8() operation may fail. See the Encode perldoc for error handling details.
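One way to handle such failures is to ask Encode to die on malformed input and to trap that with eval. This is only a sketch: decode_strict() is a hypothetical helper name, and it uses the three-argument Encode::decode() with the Encode::FB_CROAK check value rather than decode_utf8().

```perl
use strict;
use warnings;
use Encode ();

# decode_strict() is a hypothetical helper, not part of Encode.
# It returns the decoded character string, or undef for invalid UTF-8.
sub decode_strict {
    # Work on a private copy: decode() may modify its input when a
    # CHECK argument such as FB_CROAK is given.
    my ($bytes) = @_;
    my $chars = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $chars;
}

my $good = decode_strict("Hello \xE2\x98\xBA!");   # valid UTF-8
my $bad  = decode_strict("Hello \xFF!");           # 0xFF never occurs in UTF-8

print STDERR "good input decoded\n" if defined $good;
print STDERR "bad input rejected\n" unless defined $bad;
```

Returning undef keeps the decision about what to do with bad input (skip it, log it, fall back to another encoding) with the caller.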
Another way to make perl accept the UTF-8 data as such is the pack "U0C*", unpack "C*" hack.
If you get data in another encoding (not UTF-8), convert it to Unicode explicitly. Again, use the Encode module, this time its decode() function:

    require Encode;
    my $ustring = Encode::decode( 'iso-8859-1', $input );
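To see what that call does to a concrete value, here is a small sketch. The sample input "caf\xE9" (the word "café" as ISO-8859-1 bytes) is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

# "café" as ISO-8859-1 bytes: the e-acute is the single byte 0xE9.
my $input = "caf\xE9";

# Bytes in, characters out: 0xE9 becomes the character U+00E9.
my $ustring = Encode::decode('iso-8859-1', $input);

# Characters in, bytes out: in UTF-8 the same character takes two bytes.
my $utf8_bytes = Encode::encode('UTF-8', $ustring);

printf STDERR "%d characters, %d bytes as UTF-8\n",
    length($ustring), length($utf8_bytes);
```

Once the data has been decoded, the original encoding no longer matters: the program works with characters and chooses an output encoding separately.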
Another example: UTF-8 data from CGI
In ACIS we produce HTML pages in UTF-8. We expect the HTML form input to be UTF-8 as well. To manipulate it, we tell perl about the encoding:

    require Encode;
    require CGI;

    my $query = CGI->new;
    my $form_input = {};

    foreach my $name ( $query->param ) {
        my @val = $query->param( $name );
        foreach ( @val ) { $_ = Encode::decode_utf8( $_ ); }
        $name = Encode::decode_utf8( $name );
        if ( scalar @val == 1 ) {
            $form_input->{$name} = $val[0];
        } else {
            $form_input->{$name} = \@val;   # save value as an array ref
        }
    }

This builds a ready- and safe-to-use hash of input parameters.
Solution #2: Specify IO encoding layers for your filehandles
In Perl 5.8 a filehandle can have an encoding specified for it. Perl will then convert all input from the file automatically into its internal Unicode encoding and mark the values read from it with the utf8 flag. Equally, perl can convert output to a specific encoding for a filehandle. Additionally, perl checks that the data you output is valid for the filehandle’s encoding.
So, if you read data from a file or another input stream, and you expect UTF-8 data there, warn perl:

    if ( open( FILE, "<:utf8", $fname ) ) {
        . . .
    }

or, in the case of our simple test,

    #!/usr/bin/perl

    my $ustring1 = "Hello \x{263A}!\n";
    binmode DATA, ":utf8";
    my $ustring2 = <DATA>;

    print "$ustring1$ustring2";
    __DATA__
    Hello ☺!

This should print two equal lines. The “Wide character” warning stays until the output filehandle gets an encoding layer as well.
Similarly, if you open a file as:

    open FILE, "<:encoding(iso-8859-7)", $filename;

its content will be assumed to be in iso-8859-7 encoding. Perl will use that to interpret the file’s data correctly, i.e. to convert it to its internal UTF-8.
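The effect of an input layer can be checked without touching the disk. The sketch below assumes in-memory filehandles (open on a scalar reference) as stand-ins for real files, and it uses the strict :encoding(UTF-8) spelling instead of :utf8; the sample data is made up.

```perl
use strict;
use warnings;

# In-memory "files" stand in for real ones so the sketch is self-contained.
my $utf8_file  = "Hello \xE2\x98\xBA!\n";   # UTF-8 bytes
my $greek_file = "\xE1\xE2\xE3\n";          # alpha, beta, gamma in ISO-8859-7

# Read through a UTF-8 layer: the three smiley bytes become one character.
open(my $in1, '<:encoding(UTF-8)', \$utf8_file) or die "open failed: $!";
my $line1 = <$in1>;
close $in1;

# Read through an ISO-8859-7 layer: each byte becomes one Greek character.
open(my $in2, '<:encoding(iso-8859-7)', \$greek_file) or die "open failed: $!";
my $line2 = <$in2>;
close $in2;

my $line1_length = length $line1;   # counted in characters, not bytes
my $first_greek  = ord $line2;      # code point of the first character

printf STDERR "line 1: %d characters, line 2 starts with U+%04X\n",
    $line1_length, $first_greek;
```

Both lines end up as ordinary character strings, so they can be concatenated with each other or with literals such as "\x{263A}" without any guessing on perl's part.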
Solution #3: Global Unicode setting in Perl
And there is yet another way to approach your coding/encoding problems: command perl to treat all your program’s input and output as UTF-8 by default. -C is a perl switch which lets you do that. Just put -CS on the perl command line.
Alternatively, use the PERL_UNICODE environment variable. It has to be set in the environment where you execute perl, for instance:

    god@world:~$ PERL_UNICODE=S perl script.pl

This would command perl to assume UTF-8 on the standard input, output and error filehandles by default. The S letter covers only those three; other letters, described in perlrun, extend the default to filehandles your script opens. (Unfortunately and contrary to my expectations, this does not have an impact on the special DATA filehandle. So this is not a solution to our problem showcase script.)
You can also specify UTF-8-ness for just your stdin or just stdout or just stderr. Read the section on -C in perlrun for full details.
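A script can check which of these settings it was started with by looking at the ${^UNICODE} variable, which perlvar describes as reflecting the -C switch. The sketch below assumes the bit values listed in perlrun (I = 1, O = 2, E = 4, so S = 7); the %std_utf8 hash is just an illustrative name.

```perl
use strict;
use warnings;

# ${^UNICODE} holds the numeric form of the -C / PERL_UNICODE setting.
# Assumed bit values, per perlrun: I = 1 (STDIN), O = 2 (STDOUT), E = 4 (STDERR).
my $flags = ${^UNICODE};

my %std_utf8 = (
    STDIN  => ( $flags & 1 ) ? 1 : 0,
    STDOUT => ( $flags & 2 ) ? 1 : 0,
    STDERR => ( $flags & 4 ) ? 1 : 0,
);

# Report what was requested for each standard stream.
print STDERR "UTF-8 requested for $_: $std_utf8{$_}\n" for sort keys %std_utf8;
```

Run once as plain perl and once with -CS or PERL_UNICODE=S to compare the reports.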
Wide character in print warning
The warning happens when you output a Unicode string (one containing a character above U+00FF) to a non-Unicode filehandle. What is a “non-Unicode filehandle”, you ask? That’s one with no Unicode-compatible IO layer on it (see the Solution #2 section above).
The right way to fix this is to specify the output encoding explicitly, with the binmode() function or in your open() call. For example, open your file this way:

    open FILE, ">:utf8", $filename;

To print UTF-8 to standard output (or standard error), as in our case, we do:

    #!/usr/bin/perl

    my $ustring1 = "Hello \x{263A}!\n";
    binmode DATA, ":utf8";
    my $ustring2 = <DATA>;

    binmode STDOUT, ":utf8";
    print "$ustring1$ustring2";
    __DATA__
    Hello ☺!

The wrong way to avoid the warning is to turn off the utf8 flag on your to-be-printed data. Then the characters will turn into bytes, and perl will push them to a bytes filehandle smoothly. But you don’t need that, really.
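The difference between a handle with and without an output layer can be shown side by side. This is a sketch under assumptions: in-memory filehandles stand in for STDOUT or a real file, and a $SIG{__WARN__} handler collects the warnings so they can be counted.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $text = "Hello \x{263A}!\n";

# 1. No layer on the handle: perl complains about the wide character.
my $raw = '';
open(my $plain, '>', \$raw) or die "open failed: $!";
print {$plain} $text;
close $plain;
my $warned_without_layer = @warnings ? 1 : 0;

# 2. An explicit encoding layer: the text is encoded, no warning is issued.
@warnings = ();
my $encoded = '';
open(my $utf8, '>:encoding(UTF-8)', \$encoded) or die "open failed: $!";
print {$utf8} $text;
close $utf8;
my $warned_with_layer = @warnings ? 1 : 0;

print STDERR "warned without layer: $warned_without_layer, with layer: $warned_with_layer\n";
```

With the layer in place, $encoded holds the UTF-8 bytes of the text, which is what a file or terminal expecting UTF-8 should receive.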
On the other hand, if you open a file as:

    open FILE, ">:encoding(iso-8859-7)", $filename;

the stuff you print will be output in iso-8859-7 encoding, transcoded automatically. ISO-8859-7 is not a Unicode encoding and covers only a small set of characters, so you won’t be able to output characters outside that set without a warning.
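Whether a given string fits into such a charset can be tested before printing. The sketch below does not use the IO layer itself; it calls Encode::encode() with Encode::FB_CROAK, and try_greek() is a hypothetical helper name introduced for illustration.

```perl
use strict;
use warnings;
use Encode ();

# try_greek() is a hypothetical helper: it returns the ISO-8859-7 bytes,
# or undef if some character has no place in that charset.
sub try_greek {
    # Work on a private copy: encode() may modify its input when a
    # CHECK argument such as FB_CROAK is given.
    my ($chars) = @_;
    my $bytes = eval { Encode::encode('iso-8859-7', $chars, Encode::FB_CROAK) };
    return $bytes;
}

my $alpha  = try_greek("\x{3B1}");    # GREEK SMALL LETTER ALPHA: fits
my $smiley = try_greek("\x{263A}");   # WHITE SMILING FACE: does not fit

print STDERR "alpha fits\n" if defined $alpha;
print STDERR "smiley does not fit\n" unless defined $smiley;
```

A check like this makes the data loss explicit instead of leaving it to a warning at print time.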
The right strategy: summary
If you can, use a Unicode encoding (such as UTF-8) to store and process your data. Always make sure perl knows which encoding your data comes in and goes out in. Make sure all your Unicode-containing scalars have the utf8 flag on. Then you can safely concatenate strings. Then you can use Unicode-related regular expressions, which give you great powers for international (multi-language) text processing.
To achieve that, you may need to know all the ways data gets into your program. As soon as you get some input, mark it as Unicode or convert it to Unicode and sleep well.
Sometimes data comes into your program already in Unicode and you shouldn’t worry. For instance, XML parsers return you string values with the utf8 flag “on”. (Unless you do something weird, like getting it in original form from the parser, which you shouldn’t do anyway.) In the above example we explicitly include a Unicode character in a string ($ustring1) and perl knows its encoding.
But when you read data from input streams, from a database or from environment variables (like parameters in CGI), you need to tell perl about its encoding.
Use PERL_UNICODE environment variable to force UTF-8 IO layers on your input and/or output filehandles.
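The whole strategy fits into three steps: decode on the way in, work with characters, encode on the way out. The following is a minimal sketch of that pattern; swap_smiley() is a hypothetical function, and the replacement of U+263A by U+263B stands in for whatever processing a real program does.

```perl
use strict;
use warnings;
use Encode ();

# swap_smiley() is a hypothetical example of "decode, process, encode".
sub swap_smiley {
    my ($in_bytes) = @_;

    # 1. Input: UTF-8 bytes become a character string.
    my $text = Encode::decode('UTF-8', $in_bytes);

    # 2. Processing: work on characters, here replacing
    #    WHITE SMILING FACE with BLACK SMILING FACE.
    $text =~ s/\x{263A}/\x{263B}/g;

    # 3. Output: the character string becomes UTF-8 bytes again.
    return Encode::encode('UTF-8', $text);
}

my $result = swap_smiley("Hello \xE2\x98\xBA!");
print STDERR "result has ", length($result), " bytes\n";
```

In a program that uses IO layers, steps 1 and 3 happen inside the filehandles and only step 2 remains in your own code.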
Reposted from: http://ahinea.com/en/tech/perl-unicode-struggle.html