The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

2014-02-17 12:52

On character encoding:

http://blog.golang.org/strings

http://www.joelonsoftware.com/articles/Unicode.html

Someone has already translated it, and the translation is quite good, so I am reposting it here: http://hi.baidu.com/liqin/item/2a44e97d45a6f6376f29f670

BlueSky

蓝天

2009-11-08 01:26


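I translated a Joel on Software article

I translated a Joel on Software article, and only halfway through did I realize I had picked the wrong one: it is far too long. It took me three painstaking evenings to finish. The result still feels rough overall. This is my first translation of a long English article, so it needs more rereading and polishing, and where a literal translation would not work I fell back on my own understanding. If any expert manages to read the whole thing, corrections are welcome. I am never doing something this silly again. Translating a full article was a huge test of my ability, with a fair amount of agonizing over individual words, but it will not help me much later on: when I come back to this article, I probably will not read it from start to finish. Besides, reading it from start to finish every time would be a waste of time. Our teachers taught us not to reinvent the wheel. So, in the great spirit of software engineering, from now on let's just write summaries.

This article mainly covers the history of characters and their encodings. I do not actually know which character encoding this post itself uses, but fortunately it displays correctly. Character encoding matters: if you want the web pages you write to display properly all over the world, doing this homework is necessary, because not every computer supports Chinese encodings. Well, enough said; take your time reading. If you are bold enough to read all of this...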

============================ Translation follows ======================

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Unicode和字符集 程序员必知必会

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?

还记得吗? 那个神秘的Content-Type标签? 地球人都知道, 就是在HTML里面必须输入的一个东西, 这个东

西你可能从来都不知道是干嘛的

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?

还记得吗?你那个保加利亚的老友给你发过一封主题为"???? ?????? ??? ????"的邮件?

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.

我一度感到惊愕, 因为我发现太多的程序员对于字符集, 编码, Unicode, 这些东西知之甚浅。两年前, 

FogBUGZ的一名beta测试员问, 他测试的程序能不能处理日文。日文? 难道他们有日文的邮件么? 我完全

懵了。直到我深入了解我们用于解析MIME email的商用ActiveX控件时我才知道, 这个控件在处理字符集

的时候完全错误, 最终我们不得不重写代码, 撤销掉错误的字符转换程序, 将其改写正确。接着, 我又研

究了另一个商用类库, 发现它依然在字符编码实现上完全错误. 我向developer反映了这个问题, 他觉得

他们"并不能做什么." 如同很多其他的程序员一样, 它更倾向于忘了这茬.

But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

但是这种问题不会消失。我发现知名的web开发工具PHP也同样忽略掉了字符编码的问题, 没有顾虑的使用

了8 bit字符, 让开发良好的国际化web应用变得几乎不可能, 我觉得我已经不能忍了(Enough is Enough)

So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

所以我今天发表声明: 如果你是一个2003年的程序员(这篇文章够老的。。。), 假如你不清楚字符, 字符

集, 编码, Unicode的知识, 如果被我抓到, 我会让你去潜艇上剥6个月的洋葱, 你以为我不敢?

And one more thing:

还有一点很重要

IT'S NOT THAT HARD.

其实这些玩意也不是那么的难

In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.

在这篇文章里, 我会教你们任何一个程序员都清楚应该知道的东西. 所有的这些关于"plain text = 

ascii = 8 bit 字符"的思想都是统统错误滴, 如果你们仍然用这种思想去写程序, 就如同外科医生不相

信世界上有细菌一样. 所以, 在读完这篇文章之前, 请不要再写一行代码.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I'm really just trying to set a minimum bar here so that everyone can understand what's going on and can write code that has a hope of working with text in any language other than the subset of English that doesn't include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it's character sets.

在我正式开始之前, 我应该在警告一下, 如果你是那种了解国际化(I18n)的稀缺品种, 你会发现这整篇文

章实在太简单了. 但是我的目的只是设一个最低的门槛, 让所有人都能明白是怎么回事, 并且让他们的代

码有希望出现任何语言的文字, 而不是只会出现不带重音单词英文的一个子集. 我也应该警告你们, 字符

的处理在构建国际化软件中只是微不足道的一环, 但是我一次也写不完那么多东西, 所以这次我们只谈字

符集.

A Historical Perspective

历史的视角

The easiest way to understand this stuff is to go chronologically.

理解这些东西的最简单方法就是翻黄历

You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.

你可能会想我会讨论那种陈旧的字符集, 如EBCDIC. 拜托, EBCDIC跟你的生活一点关系也没有吧. 我们不

会翻到那么旧的一页的.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

让我们回到近代, 在Unix刚被发明, K&R还在编写C语言的时候, 世界是如此的单纯. EBCDIC已经淡出了人

们的生活, 取而代之的, 也是唯一的一种字符集, 是经典的无重音的英文字母, 我们也有一种针对它们的

编码, 叫做ASCII, 这种编码可以使用32到127之间的数字表示所有的字符. 例如: 空格(Space)是32, 字

母"A"是65. 这些字符只用7 bit就可以完全储存. 那个时代多数电脑都使用8 bit字长, 所以你不仅可以

存储所有的ASCI字符, 还有1个bit的富余, 如果你够邪恶的话, 你可以使用那富余的1 bit做你想做的坏

事: WordStar(某早期字处理软件)傻傻的用最高的bit表示单词的最后一个字母, 大家都嘲笑他只懂英语. 

32以下的字符码被称为不可打印字符, 它们可以用来诅咒...哈,开玩笑. 这些字符是特殊的控制字符, 

像7能让电脑beep一声, 12能让打印机中的纸自动换页.
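The ASCII layout described above is easy to check for yourself. Here is a minimal sketch in Python (our choice of language, not the article's); the variable names are only illustrative.

```python
# ASCII facts from the paragraph above, checked with Python's built-ins.
space_code = ord(" ")          # the article says space is 32
a_code = ord("A")              # and "A" is 65

# Printable ASCII runs from 32 to 126; code 127 (DEL) is a control code.
printable = [chr(code) for code in range(32, 127)]
fits_in_7_bits = all(ord(ch) < 2 ** 7 for ch in printable)

bell = "\a"                    # control code 7: makes the terminal beep
form_feed = "\f"               # control code 12: ejects the printer page

print(space_code, a_code, fits_in_7_bits, ord(bell), ord(form_feed))
```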

And all was good, assuming you were an English speaker.

假设你是个说英文的, 一切都很美好

Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact, as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

由于机器字长是8个bit, 有人就想了, "天哪, 我们可以用128-255的字符码做我们自己想做的事."但是很

麻烦的是, 很多人都想利用这些字符码, 但是他们各自有各的方法去使用128-255之间的字符. IBM-PC就

搞了个OEM字符集, 提供了一些欧洲语言中带重音的字符和一些用于制表的字符...如水平线, 竖直线, 右

边向下弯的水平线, 等等. 人们可以用这些制表符在屏幕上画出漂亮的表格和线(ASCII Arts...), 这些

图案至今还能在装有8088计算机的干洗店里看到. 实际上, 当人们开始从美国以外的国家购买PC, 这些

OEM字符集就开始天马行空了, 它们都使用了高位的128个字符, 但是目的各不相同. 例如, 有些PC上字符

码130会显示字母é, 但是在以色列的PC上, 同样的字符码显示的是希伯来字母 Gimel(ג打不出来), 所以

当美国人想发封简历(résumé)给以色列人时, résumé就变成了rגsumגs. 还有很多其他的情况, 例如俄文

也采用了很多别的方法, 所以你不可能无所顾虑的交换俄文的文档. 
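The résumé mishap can be reproduced today, because Python still ships codecs for the old DOS code pages. This is only a sketch, and it assumes that "cp437" stands in for the original IBM PC OEM set and "cp862" for the Hebrew one.

```python
# Write "résumés" on a US PC (code page 437) and read it in Israel (code page 862).
text = "résumés"
raw = text.encode("cp437")      # bytes as stored by the US machine
e_acute_byte = raw[1]           # the byte used for é

garbled = raw.decode("cp862")   # same bytes, Hebrew code page
print(e_acute_byte, garbled)
```

The bytes never change in transit; only the table used to interpret them does.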

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

终于, 这种完全自由, 其实是完全混乱的OEM被ANSI标准给规范了. 在ANSI标准中, 所有人都对128以下的

字符达成一致, 这些字符跟ASCII基本相同, 但是128以上的字符, 却有很多种处理方法, 这取决于你居住

在哪里. 这些不同的字符系统被称为码页(Code pages). 比如说, 在以色列, DOS系统使用862的码页, 而

希腊用户则使用737. 128以下的字符, 都相同, 128以上的字符, 基本不相同, 所有有趣的字符都在128以

上. MS-DOS的国际版本有好几打这样的码页去处理从英文到冰岛文的文字, 而且, 这些码页中, 还有一些

被成为"多语言"的码页, 可以同时处理例如西班牙语和加利西亚语! 很神奇吧! 然而, 希伯来语和希腊语

是完全不可能共同处理的, 除非你去写一段专用的程序, 使用bit图来显示所有的文字, 因为希伯来语和

希腊语需要不同的码页去解析高位的字符编码.
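To make the "same below 128, different above" point concrete, the sketch below decodes identical bytes with the two code pages named in the text, using Python's "cp862" and "cp737" codecs. The sample byte values are arbitrary picks for illustration.

```python
# Bytes 32..126 mean the same thing in both code pages...
low_bytes = bytes(range(32, 127))
same_below_128 = low_bytes.decode("cp862") == low_bytes.decode("cp737")

# ...but bytes from 128 up are interpreted differently.
high_bytes = bytes([0x80, 0x81, 0x82])
as_hebrew = high_bytes.decode("cp862")
as_greek = high_bytes.decode("cp737")

print(same_below_128, as_hebrew, as_greek)
```

Since one byte can only select one of the two tables at a time, a plain 8-bit text file cannot mix both scripts.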

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.

于此同时, 在亚洲, 更令人抓狂的事情出现了, 亚洲文字的字符有上千种, 完全不可能用8个bit来表示. 

这种情况一度被一个挺糟糕的系统DBCS所解决, 这个系统使用了"双字长字符集", 这个字符集中, 有些字

符用单字节存储, 而另外的则用双字节. 这个字符集导致的后果是, 在一个字符串中, 前移操作很容易, 

但是向后移动却变得不可能. 因此程序员都不鼓励使用s++和s--来使字符串前后移动, 而是采用了一些特

殊的方法, 例如Windows的AnsiNext和AnsiPrev, 这些方法会正确处理这种特殊的情况.
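Shift JIS is one well-known double-byte scheme, and we use it here as an assumed example (the article does not name a specific DBCS). The sketch shows why stepping backwards through such a string is hard: the second byte of a two-byte character can look exactly like an ordinary ASCII character.

```python
# One ASCII letter, one katakana character, one ASCII letter.
text = "aソb"
raw = text.encode("shift_jis")

char_count = len(text)   # 3 characters
byte_count = len(raw)    # but 4 bytes: ソ takes two

# The trail byte of ソ is 0x5C, which is also the ASCII code for a backslash.
trail_byte = raw[2]
looks_like_backslash = trail_byte == ord("\\")

print(char_count, byte_count, hex(trail_byte), looks_like_backslash)
```

Standing on that 0x5C byte, you cannot tell whether it is a backslash or the second half of a character without rescanning from the start of the string, which is roughly the job helpers like AnsiPrev had to do.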

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

但是, 很多人依然把一个字节当成是一个字符, 如果他们从不把字符串发送到另一台电脑上, 或者从不说

超过一种的语言, 字符永远都是8 bit的, 在某种意义下, 这也是可行的. 但是很显然, 当Internet出现

之后, 字符串在电脑之间的转移变得稀松平常, 字符集的混乱轰然来袭. 可是很幸运的, Unicode被发明

出来了.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

Unicode很强大, 它力图将这个星球上任何一种正常的以及不正常(如Klingon)的字符系统通通用一个字符

集来表示. 有的人曲解了Unicode, 以为它就是一种16 bit的字符编码, 所有的字符都用16 bit来表示, 

所以总共会有65536种不同的字符. 其实这样的理解是不正确的. 而这也是Unicode神秘的一面, 如果你的

确是那样理解的, 也别太自责了.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

事实上, Unicode采用了一种不同的思想去解读字符, 我们必须去理解Unicode的思想, 否则余下的东西都

该看不懂了.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

到现在为止, 我们已经假设一个字符(或字母)可以用一些bit来表示, 并且可以存储在磁盘或内存中:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

而在Unicode中, 一个字符对应的是一种称为码点的东西, 码点至今还是一个理论的概念. 关于码点在内

存或磁盘中如何表示, 又可以讲一个故事了.

In Unicode, the letter A is a platonic ideal. It's just floating in heaven:

在Unicode中, 字母A是一个基本的字母. 它是一种基础的表达方式:

A

This platonic A is different from B, and different from a, but the same as an A set in bold, in italics, or in any other typeface. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already.

这个基本的A与B不相同, 与a也不相同, 但是与不同字体的A相同. Unicode中, Times New Roman字体的A

与Helvetica字体的A是同一个东西, 但是与小写的"a"是不同的, 这种思想看起来没有什么争议, 但是有

些语言却会在字母的判定上产生分歧. 比如, 德语的字符ß究竟是一个真实的字符, 还是ss的一种艺术写

法? 如果字母的形状在单词的末尾发生了改变, 是否就是不一样的字符了呢? 在希伯来语中, 这种情况存

在, 但是阿拉伯语中却不存在. 不管怎样, Unicode组织的精英们已经在近半年内解决了这些问题, 他们

采用了大量的政治讨论, 但是今天的你却不用再去担心政治问题. 他们都已经被搞定了.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

任何一个字母表中, 所有的基本字符, Unicode组织都为其分配了一个魔数, 形如: U+0639. 这个魔数就

被称为码点. U+代表的是Unicode, 后面的数字是16进制的. U+0639是阿拉伯字母Ain. 英文字母A是

U+0041. 你可以使用Windows 2000/XP自带的字符映射工具或访问Unicode的网站找到这些字符的码点.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

Unicode对可表示字符的数目没有限制, 实际上, 这些字符的数目早就超过了65536, 所以, 不是所有的字

符都可以塞到2个字节中, 而这就是Unicode的神奇之处.

OK, so say we have a string:

Ok, 让我们来看一个字符串:

Hello

which, in Unicode, corresponds to these five code points:

在Unicode中, 这个字符串对应5个码点:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.

不就是一堆码点吗? 确实, 就是一堆数字而已. 但是我们还没有讲到如何在内存中存储这些码点, 或者如

何在Email中来表示它们.
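If you want to see code points without opening charmap, a few lines of Python will do. This is just an illustrative sketch using the built-in ord and chr plus the standard unicodedata module.

```python
import unicodedata

# Map each letter of "Hello" to its code point in U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in "Hello"]
print(" ".join(code_points))

# Going the other way: U+0639 is the Arabic letter Ain mentioned above.
ain = chr(0x0639)
ain_name = unicodedata.name(ain)
print(ain, ain_name)
```

Note that nothing here says how many bytes anything takes; these are still just numbers.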

Encodings

编码

That's where encodings come in.

这里, 编码的概念出现了.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

最早的Unicode编码, 也是两字节神话的祖先, 就是……就用这两字节去储存每个字符. 所以Hello就成了

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

对吗? 让我们慢慢前进一些! 它们也可以是这样的啊:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

额, 从技术的角度来说, 是的, 我的确相信Unicode可以这样编码. 实际上, 早期的实现者要求将Unicode

同时以high-endian和low-endian的字节序存储, 以适应不同的CPU对不同的字节序编码处理的速度. 看, 

已经有两种截然不同的存储Unicode方法了. 所以人们需要强迫适应一种古怪的习惯, 在每个Unicode字符

串之前储存一个标志FE FF; 这个标志被称为Unicode字节序标志, 如果需要交换高低字节, 这个标志会变

成FF FE, 这样当人们阅读这个字符串的时候就会知道他们需要将高低字节交换. 其实, 也不是所有的

Unicode字符串在开头都会有这样的一个字节序标志.
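Both byte orders, and the byte order mark that tells them apart, can be seen with Python's UTF-16 codecs. A small sketch follows; it assumes Python 3.8 or later for the separator argument of bytes.hex.

```python
import codecs

big = "Hello".encode("utf-16-be")      # high byte first
little = "Hello".encode("utf-16-le")   # low byte first
big_hex = big.hex(" ").upper()
little_hex = little.hex(" ").upper()
print(big_hex)
print(little_hex)

# The byte order mark is FE FF, or FF FE when the bytes are swapped.
bom_be = codecs.BOM_UTF16_BE
bom_le = codecs.BOM_UTF16_LE

# The generic "utf-16" decoder reads the mark and picks the right byte order.
from_big = (bom_be + big).decode("utf-16")
from_little = (bom_le + little).decode("utf-16")
print(from_big, from_little)
```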

For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who's going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

字节序标志风光了一段时间, 但是有的程序员开始抱怨了. "看看这一堆0", 他们说, 由于他们是美国人, 

他们只会看英文的文本, 这里面几乎不会用到高于U+00FF的码点. 同时, 他们也是自由的加州嬉皮士, 他

们很复古. 如果他们是德州人, 就不会去在意两倍字节的花费. 但是这些加州的老土不能忍受花两倍的空

间去存储字符串, 毕竟, 已经有这么多可恶的文档使用了各种各样的ANSI和DBCS字符集, 谁来把他们都转

了啊? 基于这个原因, 大多数的人都忘记了Unicode很多年, 与此同时, 字符集的混乱还在继续.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes (the current definition of UTF-8, RFC 3629, caps this at 4 bytes).

之后, 辉煌的UTF-8编码出现了. UTF-8是另一种储存Unicode码点的编码系统, 它使用8位字长, 去存储那

些U+魔数. UTF-8中, 0-127的码点用单字节来存储. 只有128以上的才使用多字节, 字节数可以为2, 3, 

直到6.

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).

UTF-8对英文几乎没有副作用, 因为用UTF-8编码的英文字母与ASCII编码完全一致, 所以美国人几乎感觉

不到变化. 只有世界其他地方的人需要重新适应. 对于Hello, Unicode: U+0048 U+0065 U+006C U+006C 

U+006F, 将被储存为48 65 6C 6C 6F, 这就与ASCII, ANSI, 以及世界上所有的OEM字符集相一致了. 现在

, 如果你想使用带重音的字母或者希腊字母或者Klingon字母, 就需要使用好几个字节去存储一个码点, 

但是美国人不会注意到这点. (UTF-8有一个良好的特点, 它取消了陈旧的字符串处理编码, 例如使用全0

字节表示终结符来防止字符串的删节)
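The variable width of UTF-8 is easy to observe. The sketch below encodes a handful of sample characters (our own picks, not the article's) and also checks the two properties mentioned above: ASCII text is unchanged, and no stray zero bytes appear. Under the current standard, RFC 3629, the longest sequence is 4 bytes.

```python
samples = ["A", "é", "\u0639", "€", "\U0001F600"]   # Latin, accented, Arabic Ain, euro sign, emoji

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

# English text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "Hello".encode("utf-8") == "Hello".encode("ascii")

# UTF-8 never produces a zero byte for ordinary text, unlike two-byte storage.
zero_in_utf8 = b"\x00" in "Hello".encode("utf-8")
zero_in_utf16 = b"\x00" in "Hello".encode("utf-16-be")
print(ascii_same, zero_in_utf8, zero_in_utf16)
```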

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

目前为止, 我已经讲述了3种Unicode编码方法. 传统的2字节存储方法被称为UCS-2或UTF-16, 你也需要区

分high-endian的UCS-2或low-endian的UCS-2. 当然还有流行的UTF-8规范, 如果你正好有一份英文的文献

或者是系统, UTF-8能处理的很出色, 你完全不用去关心ASCII以外的任何东西.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.

实际上, 还有很多其它的方法对Unicode进行编码. 有一种叫UTF-7的编码, 跟UTF-8很像, 但是它保证最

高位的一个bit始终为0, 这样当你遇到一些很苛刻的email系统的过滤规则, 如限制字符位数不能超过7 

bit时, 还是能保证Unicode信息能够被无损的传输. 还有UCS-4, 所有的码点都用4字节存储, 这样就保证

了任何一个码点都能够使用相同数量的字节来存储, 但是, 天哪, 即使是德克萨斯人也不能忍受那么多的

空间浪费.

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

现在我们已经建立了这样的思想: 基本的字母由一些Unicode码点来表示, 这些码点也可以用任何一种学

校教的旧编码模式来进行编码. 例如, 你可以将Unicode字符串Hello(U+0048 U+0065 U+006C U+006C 

U+006F)编码成ASCII, 或者是以前的希腊OEM编码, 或者是希伯来语的ANSI编码, 以及其它的上百种已发

明的编码, 只需注意一点: 有一些字符不一定能被显示出来! 如果在你采用的编码中没有与当前的

Unicode码点对应的编码, 通常你会得到一个问号: ?或者, 如果你够NB, 会得到一个方框. 你究竟得到的

是什么呢?

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

传统的编码方式有上百种, 但是它们却只能将某些码点正确存储, 另外的都会变成问号. 一些通用的英文

编码方法有Windows-1251, ISO-8859-1, aka Latin-1. 然而用这些编码方法去处理俄文或希伯来文, 你

通常会得到一大堆的问号. UTF 7, 8, 16还有32都能很好的处理所有的码点.
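Here is roughly what "you get a bunch of question marks" looks like in practice. The sketch uses Python's errors="replace" option, which is one common way (an assumption about your tooling, not the only behavior) that libraries handle characters an encoding cannot represent.

```python
russian = "Привет"

# Latin-1 has no Cyrillic letters, so every character is replaced by "?".
lossy = russian.encode("latin-1", errors="replace")
print(lossy)

# Without a replacement policy, Python refuses to encode at all.
try:
    russian.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The UTF family can round-trip any code point.
utf_names = ("utf-7", "utf-8", "utf-16", "utf-32")
round_trips = all(russian.encode(name).decode(name) == russian for name in utf_names)
print(strict_failed, round_trips)
```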

The Single Most Important Fact About Encodings

以下就是关于编码最重要的一点

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

如果你完全忘记了我刚才说的东西, 请你一定要记住最重要的点. 一个字符串不知道其编码, 是完全没有

意义的. 所以你就别像个鸵鸟一样把头埋在沙子里, 骗自己"plain" text就是ASCII.

There Ain't No Such Thing As Plain Text.

其实根本没有所谓的plain text

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

一个字符串, 不管是在内存里, 在文件里, 还是在email信息里, 你都要知道它的编码是什么, 不然你就

不能正确解析或者是显示给用户.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

几乎所有的愚蠢现象, 例如"我的网站看起来就像是在扯淡", 或者"当我发送带重音的字母时, 她就读不

了我的邮件", 都来源于那些天真的程序员, 他们不懂得这个浅显的知识, 如果你不告诉我一个字符串是

用UTF-8编码的, 还是ASCII编码的, 或者是ISO8859-1, 或者是Windows 1252, 那我就不能正确的显示它, 

甚至都不知道这个字符串是在哪里结尾的. 世界上有上百种编码, 当码点在127以上时, 什么都可能发生.
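The "gibberish" problem takes only a few lines to reproduce. In this sketch the same bytes are decoded three ways; the word "café" and the choice of code pages are ours, picked only for illustration.

```python
# The writer saved the text as UTF-8...
raw = "café".encode("utf-8")

# ...and three readers each assume a different encoding.
as_utf8 = raw.decode("utf-8")       # correct guess
as_cp1252 = raw.decode("cp1252")    # Western European guess
as_cp1251 = raw.decode("cp1251")    # Cyrillic guess

print(as_utf8, as_cp1252, as_cp1251)
```

Only the first reader sees the original word. The bytes themselves carry no hint about which reading is right, which is why the encoding has to travel with the string.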

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

那我们该如何去提供字符编码的信息呢? 当然, 这都是有标准的. 在Email消息中, 通常在最开始会有这

么一个头部

Content-Type: text/plain; charset="UTF-8"
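On the receiving side, a mail program reads that header to decide how to decode the body. A minimal sketch using Python's standard email package follows; the message text is a made-up example.

```python
from email import message_from_string

# A tiny made-up message: one header, a blank line, then the body.
raw_message = 'Content-Type: text/plain; charset="UTF-8"\n\nhello\n'

msg = message_from_string(raw_message)
content_type = msg.get_content_type()    # media type without parameters
charset = msg.get_content_charset()      # charset parameter, lower-cased

print(content_type, charset)
```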

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.

对于网页来说, 最早的做法是web服务器可以与web页一起返回一个类似的Content-Type http 头部 -- 不

是在HTML内, 而是在HTML页发送之前的响应里. 

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

这种做法会引起问题. 假如现在你有一个大型的web服务器, 上面有许多的站点和上百个网页, 这些站点

和网页由来自不同语言背景的人维护, 这些人各自用自己觉得合适的编码去处理FrontPage上的网页源码. 

web服务器自己当然不知道哪个文件用的是哪种编码, 所以它当然不可能返回Content-Type头部.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

所以, 将Content-Type放置在HTML文件内部会更方便, 可以使用一种特殊的tag. 当然这会引起细心的人

的疑问...我们怎么可能知道了编码类型, 才能继续读HTML文件本身?(编码类型自己都放在HTML文件里面) 

可是幸运的是, 几乎所有常用的编码对于32到127之间的字符处理方法是一致的, 所以如果你不去用那些

有趣的字符, 还是可以正常显示以下的内容的:

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

这里面的meta tag必须紧接<head>部分, 因为web浏览器读到这个tag的时候, 就会停止解析网页的过程, 

然后重新开始采用指定的编码来解析.
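The "read the start as ASCII, find the tag, then start over" trick can be sketched in a few lines. This is a deliberately simplified illustration, not how real browsers do it (their rules are far more elaborate); sniff_meta_charset is a hypothetical helper name, and the 1024-byte window and the fallback default are assumptions.

```python
import re

def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Look for a charset declaration in the first bytes of an HTML page."""
    # Bytes 32..127 mean the same thing in almost every encoding,
    # so the start of the page can be read as ASCII safely enough.
    head = raw[:1024].decode("ascii", errors="replace")
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                      head, re.IGNORECASE)
    return match.group(1).lower() if match else default

# A made-up page saved in a Cyrillic code page.
html = ('<html>\n<head>\n'
        '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">\n'
        '<title>Привет</title>')
page_bytes = html.encode("cp1251")

encoding = sniff_meta_charset(page_bytes)
page_text = page_bytes.decode(encoding)   # "start over" with the declared encoding
print(encoding, "Привет" in page_text)
```

If the tag came after the title instead of before it, the Cyrillic bytes would already have been seen before the encoding was known, which is one reason to put the tag first.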

What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.

It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don't.

可是如果web浏览器找不到Content-Type, 不管是在http头部还是在meat tag里, 该怎么办呢? Internet 

Explorer做了件有趣的事情: 它去猜测编码类型, 依据的是字节在常见的文本中出现的频率, 再结合不同

的语言在常见编码下的情形, 推测出使用的语言还有编码. 由于以前的8 bit码页倾向于将国际字符放在

128和255之间的不同的范围内, 而且确实所有的语言都有一个字符的使用频率统计, 这种猜测的方法还是

有一定概率能成功的. 的确是很诡异, 但是它成功的概率还蛮高, 以至于天真的网页编写者可以不知道他

们需要提供一个Content-Type头部, 也看不出他们的网页有什么毛病, 直到有一天, 他们用他们的母语写

了一些不符合字符频率分布方法的内容, Internet Explorer的猜测法推断出他们用的是韩文并且显示出

来. 这说明了什么呢? 我想, Postel法则"保守的给予, 自由的接受"对于工程理论来说, 是相当荒谬的. 

最后, 可怜的读者打开了这个网页, 这个本该是保加利亚语但是变成了韩语(而且也不是连贯的韩语)的网

页, 他该做什么? 他会打开View|Encoding菜单, 试图在一大群编码里找到一个, 让网页能够正常显示. 

如果他早知道该多好, 当然大多数人都不会知道

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

在本公司发布的站点管理软件, CityDesk的最新版本中, 我们决定采用UCS-2作为Unicode的内部编码, 这

种编码被应用在Visual Basic, COM, 以及Windows NT/2000/XP中, 用于表示本地的字符串. 在C++代码中

, 我们将字符串声明为wchar_t, 而不是char, 并且使用wcs开头的函数, 而不是str开头的函数. 为了在C

语言中创建常量的UCS-2字符串, 你必须在前面加个L, 例如L"Hello".

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That's the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

当CityDesk正式发布的时候, 它将编码转换成了UTF-8, 因为这种编码被浏览器广泛支持. 这也是所有的

29个语言版本的Joel on Software网站的编码, 我至今还没有听说有谁浏览我的网页遇到问题.

This article is getting rather long, and I can't possibly cover everything there is to know about character encodings and Unicode, but I hope that if you've read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.

这篇文章已经很长了, 我不可能覆盖到关于字符编码和Unicode的方方面面, 但是我想, 如果你一直坚持

读到这里, 就已经知道的够多了, 可以回去编码了, 记得要吃抗生素治病, 别用蛊虫或者是符咒了. 接下

来, 就交给你们自己了~