Android Cyrillic Encoding Support: Can We Really Detect a Native Encoding?
2013-07-12 17:24
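The requirement is to support the Cyrillic script on Android. The text arrives in a native charset that encodes each character in a single byte. Can we really detect such a native Cyrillic encoding accurately, and then convert it?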
FYI: "Cyrillic" in this post means Windows-1251, a.k.a. CP1251.
Can we really detect a native character encoding?
Let's look at two code pages: one is Windows-1251, the other is ISO-8859-1.
Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages. It is the most widely used for encoding the Bulgarian, Serbian and Macedonian languages[citation needed].
In modern applications, Unicode is a preferred character set. Windows-1251 and KOI8-R (or its Ukrainian variant KOI8-U) are much more commonly used than ISO 8859-5[citation needed]. In the future, both may eventually give way to Unicode.
Windows-1251
ISO-8859-1 A.K.A latin-1
ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.
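Now consider whether there is a way to tell these two native code pages apart. Both are single-byte encodings, and both are fully ASCII-compatible in the lower half (0x00~0x7F).
The structural difference is in the 0x80~0x9F range: Windows-1251 assigns printable characters there (Cyrillic letters and punctuation), while ISO-8859-1 has no printable characters in that range (it holds only C1 control codes). Note that Windows-1252 is often treated as the same thing as Latin-1, but unlike ISO-8859-1 it does assign printable characters to most of 0x80~0x9F. In 0xA0~0xFF both code pages define every byte, only with different characters.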
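The byte ranges mentioned above can be checked directly. The following is a minimal sketch, not code from the project: it assumes a standard JDK where the `windows-1251` and `windows-1252` charsets are available, and the class and method names (`CodePageRanges`, `countPrintable`, and so on) are made up for illustration. It uses Java's own decoders as the "code page tables" and counts how many byte values in a range map to a defined, non-control character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodePageRanges {
    /** Decodes one raw byte value (0..255) with the given single-byte charset. */
    static char decode(int value, Charset cs) {
        return new String(new byte[] {(byte) value}, cs).charAt(0);
    }

    /** True if the byte maps to a defined, non-control character in the charset. */
    static boolean isPrintable(int value, Charset cs) {
        char c = decode(value, cs);
        // The replacement character is what String substitutes for an unassigned byte.
        return c != '\uFFFD' && !Character.isISOControl(c);
    }

    /** Counts the printable byte values in the inclusive range [from, to]. */
    static int countPrintable(int from, int to, Charset cs) {
        int n = 0;
        for (int b = from; b <= to; b++) {
            if (isPrintable(b, cs)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            Charset.forName("windows-1251"),
            StandardCharsets.ISO_8859_1,
            Charset.forName("windows-1252"),
        };
        for (Charset cs : charsets) {
            System.out.printf("%-13s 0x80~0x9F: %2d printable, 0xA0~0xFF: %2d printable%n",
                    cs.name(), countPrintable(0x80, 0x9F, cs), countPrintable(0xA0, 0xFF, cs));
        }
    }
}
```

For 0x80~0x9F this should report 31 printable bytes for Windows-1251 (0x98 is unassigned), 0 for ISO-8859-1, and 27 for Windows-1252. All three report 96 for 0xA0~0xFF, which is the range where Russian letters actually live in Windows-1251.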
So how can we actually tell them apart?
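In our project I used the following approach: for each single-byte, non-ASCII character, take its raw value (for example 0x88), look it up in both code page tables, and decide from the hit rate whether the text is 1251 or 1252 (Latin-1). As you might expect, this method is very unreliable: the two single-byte code tables differ far too little in which byte values they define, so the confidence computed from the hit rate has no real value.
Conclusion:
We should be told what kind of native encoding we are facing, rather than detecting or guessing it.
This is the question I asked about this problem on StackOverflow:
http://stackoverflow.com/questions/17544426/how-to-detect-windows-1251-encoded-characters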
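To illustrate why a hit-rate confidence carries so little information, here is a minimal sketch of the table-lookup idea. It is not the project's actual code: `HitRateGuess` and `hitRate` are hypothetical names, and Java's built-in decoders stand in for the code page tables (a byte counts as a "hit" if it decodes to a defined, non-control character).

```java
import java.nio.charset.Charset;

final class HitRateGuess {
    /** Fraction of non-ASCII bytes that map to a defined, non-control character. */
    static double hitRate(byte[] data, Charset cs) {
        int nonAscii = 0;
        int hits = 0;
        for (byte raw : data) {
            if ((raw & 0xFF) < 0x80) continue;  // ASCII is identical in both code pages
            nonAscii++;
            char c = new String(new byte[] {raw}, cs).charAt(0);
            if (c != '\uFFFD' && !Character.isISOControl(c)) hits++;
        }
        // With no non-ASCII bytes there is nothing to distinguish, so report a full score.
        return nonAscii == 0 ? 1.0 : (double) hits / nonAscii;
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset cp1252 = Charset.forName("windows-1252");

        // Russian "hello" (Privet), written with escapes and encoded as Windows-1251.
        byte[] russian = "\u041F\u0440\u0438\u0432\u0435\u0442".getBytes(cp1251);
        // French "cafe" with an acute accent, encoded as Windows-1252.
        byte[] french = "caf\u00E9".getBytes(cp1252);

        System.out.println("russian as 1251: " + hitRate(russian, cp1251));
        System.out.println("russian as 1252: " + hitRate(russian, cp1252));
        System.out.println("french  as 1251: " + hitRate(french, cp1251));
        System.out.println("french  as 1252: " + hitRate(french, cp1252));
    }
}
```

Both samples score 1.0 under both code pages, because every byte they contain is assigned in each table. The score cannot separate the two encodings for ordinary text, which is consistent with the conclusion above.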