What's the difference between UTF-8 and Unicode?
2015-06-08 21:10
If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these
concepts as well as we should. If you feel you belong to this group, read on for an ultra-short introduction to character sets and encodings.
Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:
UTF-8 is an encoding - Unicode is a character set
A character set is a list of characters with unique numbers (these numbers are usually referred to as "code points"). For example, in the Unicode character set, the number for A is 65, conventionally written in hexadecimal as U+0041.
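As a quick check of the character-set idea, here is a minimal Python sketch that looks up a code point and converts it back. Python is used only for illustration (the original answer contains no code), and `code_point` is a hypothetical helper name.

```python
# Map a character to its Unicode code point and back.
# ord() and chr() are Python built-ins; no encoding is involved at this stage.

def code_point(ch: str) -> int:
    """Return the Unicode code point of a single character."""
    return ord(ch)

a = code_point("A")
print(a)             # 65 (decimal)
print(f"U+{a:04X}")  # U+0041 (the conventional hexadecimal notation)
print(chr(a))        # A (number back to character)
```

Nothing here says how the number is stored; that is the job of an encoding.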
An encoding, on the other hand, is an algorithm that translates a list of numbers into binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
    00000001 00000010 00000011 00000100
Our data is now translated into binary and can now be saved to disk.
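The translation above can be reproduced with a short Python sketch. The helper name `utf8_bits` is hypothetical, and the second call uses code point 0x4F60 (the character "你") as an assumed extra example to show that UTF-8 spends more than one byte on larger numbers.

```python
def utf8_bits(code_points):
    """Encode code points as UTF-8 and return each byte as 8 binary digits."""
    text = "".join(chr(cp) for cp in code_points)
    return [f"{byte:08b}" for byte in text.encode("utf-8")]

# Small code points take exactly one byte each.
print(" ".join(utf8_bits([1, 2, 3, 4])))
# 00000001 00000010 00000011 00000100

# A larger code point is spread over several bytes.
print(" ".join(utf8_bits([0x4F60])))
# 11100100 10111101 10100000
```

So the encoding is not simply "write each number as one byte"; that shortcut only holds for code points below 128.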
All together now
Say an application reads the following from the disk:
    01101000 01100101 01101100 01101100 01101111
The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:
    104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
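The two steps just described (binary to numbers, then numbers to characters) can be sketched in Python as follows. `decode_utf8_bits` is a hypothetical helper. Python's `bytes.decode` performs both steps at once, so the code points are recovered afterwards with `ord` purely to make the intermediate numbers visible.

```python
def decode_utf8_bits(bit_groups: str):
    """Turn space-separated binary groups into (code points, text) via UTF-8."""
    # Binary digits -> raw bytes, as they would be read from disk.
    data = bytes(int(group, 2) for group in bit_groups.split())
    # UTF-8 decoding: bytes -> Unicode string.
    text = data.decode("utf-8")
    # Expose the code points the decoder produced along the way.
    code_points = [ord(ch) for ch in text]
    return code_points, text

code_points, text = decode_utf8_bits(
    "01101000 01100101 01101100 01101100 01101111"
)
print(code_points)  # [104, 101, 108, 108, 111]
print(text)         # hello
```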
Conclusion
So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently give a short and precise answer: UTF-8 and Unicode cannot be compared directly, because they are different kinds of things. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
Reposted from: http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8
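Java uses the Unicode standard character set
The Java language uses the Unicode standard character set. A single Java char is 16 bits wide, so it can hold 65,536 distinct values (0 to 65535); characters beyond that range are represented as a pair of char values. The first 128 characters of the Unicode table are exactly the ASCII table. Every letter of every nation's "alphabet" is a character in the Unicode table; for example, the Chinese character "你" is character number 20320 (U+4F60) in the Unicode table.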
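Both claims in this paragraph (the first 128 code points coincide with ASCII, and "你" is number 20320) can be checked with a small sketch. It is written in Python rather than Java only to keep the example short, and `matches_ascii` is a hypothetical helper name.

```python
def matches_ascii(limit: int = 128) -> bool:
    """True if ASCII value n and Unicode code point n name the same character for all n < limit."""
    from_ascii = bytes(range(limit)).decode("ascii")
    from_unicode = "".join(chr(n) for n in range(limit))
    return from_ascii == from_unicode

print(matches_ascii())   # True

ni = "\u4f60"            # the character 你, written as an escape
print(ord(ni))           # 20320
print(f"U+{ord(ni):04X}")  # U+4F60
```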
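What Java calls a letter covers the "alphabets" of every language in the world. The letters Java can use therefore include not only the usual Latin letters a, b, c and so on, but also Chinese characters, Japanese katakana and hiragana, Korean, and the scripts of many other languages.
Wikipedia:
The Unicode version in practical use at present is UCS-2, which uses a 16-bit code space. That is, each character (char) occupies 2 bytes, so in theory at most 2^16 (65536) characters can be represented, which basically meets the needs of all languages. In practice, the current version of Unicode does not use all of this 16-bit space; a large part is reserved for special uses or future extension.
This passage describes UCS-2 only. The Unicode code space itself runs from U+0000 to U+10FFFF, and UTF-16 represents code points above U+FFFF with two 16-bit code units (a surrogate pair).
Java exposes char and String values as UTF-16. UTF-16 grew out of UCS-2 and uses 16-bit code units, which is why the primitive type char in Java is 16 bits wide, with a range of Unicode 0 to Unicode 2^16-1.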
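To see what a 16-bit code unit means in practice, the following Python sketch splits a string into its UTF-16 code units, which is the quantity a Java char holds. `utf16_code_units` is a hypothetical helper, and U+1F600 is an assumed example of a character outside the 16-bit range; it does not come from the original text.

```python
def utf16_code_units(text: str):
    """Return the 16-bit UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+4F60 fits in one 16-bit unit, so it would occupy one Java char.
print([f"{u:04X}" for u in utf16_code_units("\u4f60")])
# ['4F60']

# U+1F600 lies above U+FFFF and needs a surrogate pair, i.e. two Java chars.
print([f"{u:04X}" for u in utf16_code_units("\U0001F600")])
# ['D83D', 'DE00']
```

This is why the length of a Java String, which counts char values, can be larger than the number of characters a reader sees.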