HTML 4.01 Specification: HTML Document Representation (1)

2011-06-24 14:28

5 HTML Document Representation

Contents

The Document Character Set

Character encodings

Choosing an encoding

Notes on specific encodings

Specifying the character encoding

Character references

Numeric character references

Character entity references

Undisplayable characters

In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.
The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.
Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.

5.1 The Document Character Set

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

Code positions: A set of integer references to characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
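The relationship between characters and code positions can be sketched in a few lines of Python. This is only an illustration: Python's built-in ord() and chr() work on Unicode code points, which coincide with ASCII for the values 65 to 67, and the variable names are invented for this example.

```python
# Code positions are plain integers; chr() maps a position to its
# character and ord() maps a character back to its position.
ascii_positions = [65, 66, 67]
letters = [chr(p) for p in ascii_positions]
print(letters)  # ['A', 'B', 'C']

# The same idea extends beyond ASCII, e.g. the Chinese character
# meaning "water" mentioned in the text (U+6C34).
water = "\u6c34"
water_position = ord(water)
print(water_position)  # 27700, i.e. hexadecimal 6C34
```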
The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.
The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.
The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
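As a rough illustration of why the encoding must be known, the Python sketch below decodes the same four bytes under two different charset labels and gets two different character sequences. The byte string and variable names are invented for this example; the codec names are the ones Python's standard library accepts.

```python
# The same byte stream, as it might arrive over the network.
raw = b"caf\xe9"

# Interpreted as ISO-8859-1, byte 0xE9 is LATIN SMALL LETTER E WITH ACUTE.
as_latin1 = raw.decode("iso-8859-1")

# Interpreted as ISO-8859-5, the same byte is CYRILLIC SMALL LETTER SHCHA.
as_cyrillic = raw.decode("iso-8859-5")

print(as_latin1)    # café
print(as_cyrillic)  # cafщ
```

Without a correct label, a user agent has no reliable way to tell which of the two readings the author intended.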

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
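The point about byte counts can be checked with a short Python sketch. It assumes that Python's "utf-32-be" codec is an acceptable stand-in for UCS-4 for these three characters, and it uses UTF-8 as an example of a variable-length encoding; the sample characters are chosen only for illustration.

```python
samples = ["A", "\u00e9", "\u6c34"]  # Latin A, e with acute, Chinese "water"
byte_counts = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))      # variable length
    ucs4_len = len(ch.encode("utf-32-be"))  # fixed four bytes (stand-in for UCS-4)
    byte_counts[ch] = (utf8_len, ucs4_len)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UCS-4 {ucs4_len} bytes")

# A one-byte-per-character encoding such as ISO-8859-1 cannot
# represent the Chinese character at all.
try:
    "\u6c34".encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False
```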

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.
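A minimal sketch of this fallback in Python, assuming the authoring tool has chosen ISO-8859-1: the standard "xmlcharrefreplace" error handler substitutes a numeric character reference for any character the encoding cannot represent. The sample text is invented for the example.

```python
import html

text = "water: \u6c34"  # the last character is outside ISO-8859-1

# Characters the encoding cannot represent become numeric character
# references that name their code position in the document character set.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'water: &#27700;'

# Decoding the bytes and resolving the reference recovers the original text.
round_trip = html.unescape(encoded.decode("iso-8859-1"))
print(round_trip == text)  # True
```

Note that the reference uses 27700, the code position of the character in ISO 10646, regardless of which byte encoding carries the document.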
Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.
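Transcoding can be sketched roughly as follows. The helper name transcode is hypothetical, and the header handling is deliberately simplified: it takes the first charset listed in Accept-Charset that Python recognizes and ignores q-values and the "*" wildcard, which a real server would have to honor.

```python
import codecs

def transcode(body, source_charset, accept_charset):
    """Re-encode body into the first listed charset that Python knows.

    Hypothetical helper: q-values and "*" are ignored in this sketch.
    Returns the new bytes and the charset label to send with them.
    """
    text = body.decode(source_charset)
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip()
        if not name or name == "*":
            continue
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # charset unknown to this Python installation
        # Characters outside the target encoding become character references.
        return text.encode(name, errors="xmlcharrefreplace"), name
    return body, source_charset  # nothing usable: serve unchanged

original = "caf\u00e9 \u6c34".encode("utf-8")
new_body, label = transcode(original, "utf-8", "iso-8859-1, utf-8;q=0.5")
print(label)     # iso-8859-1
print(new_body)  # b'caf\xe9 &#27700;'
```

The result illustrates the last sentence of the paragraph: the served encoding does not cover the whole document character set, yet the document remains representable.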
Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
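Case-insensitive matching of encoding names can be seen in Python's standard codecs registry, used here only as one example of an implementation that normalizes the label before lookup.

```python
import codecs

names = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]

# codecs.lookup() normalizes the label, so all three spellings
# resolve to one codec with a single canonical name.
canonical = {codecs.lookup(name).name for name in names}
print(len(canonical))  # 1
```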
This specification does not mandate which character encodings a user agent must support.
Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize (or they must behave as if they did).

Notes on specific encodings

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.
Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
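The byte order check described above might look like the following Python sketch. The function name decode_utf16 is hypothetical; the BOM constants come from the standard codecs module. When no BOM is present, the sketch assumes network byte order, following the recommendation in the preceding paragraph.

```python
import codecs

def decode_utf16(data):
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE: byte-reversed
        return data[2:].decode("utf-16-le")
    # No BOM: assume big-endian (network byte order).
    return data.decode("utf-16-be")

big = b"\xfe\xff\x00H\x00i"
little = b"\xff\xfeH\x00i\x00"
print(decode_utf16(big))     # Hi
print(decode_utf16(little))  # Hi
```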
The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1), should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.