(Unicode) Converting Between UTF-8 and UTF-16
2016-12-06 17:52
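I. The Origin of Unicode
1. Computers only understand strings of binary digits such as 0101. Long runs of 0s and 1s are hard for people to read, so for convenience we usually write them in octal, decimal, or hexadecimal; all of these notations are equivalent. Computers do not understand text, images, audio, or video either, so to represent such information it has to be converted into numbers according to agreed rules.
For example, one of the earliest such rule sets was the ASCII character set (American Standard Code for Information Interchange), which uses 7 bits per character and defines 128 characters in total. Since the byte (8 bits) is the usual basic unit, a character stored in one byte always had its first bit set to 0, and the remaining seven bits carried the actual content. Later, IBM extended this scheme to use all 8 bits per character, for a total of 256 characters: when the first bit is 0 the byte still denotes one of the original common characters, and when it is 1 it denotes one of the additional characters.
2. The English letters plus punctuation and similar symbols number fewer than 256, so one byte is enough for them. Other scripts need far more: Chinese alone has tens of thousands of characters. So various other character sets appeared. Exchanging data between different character sets then became a problem, because one character set might use a certain number for the character A while another uses a different number for the same character.
To keep pace with globalization and let different languages interoperate, something beyond ASCII was needed. Two organizations, the Unicode Consortium and ISO, set out to define a single standard in which every character corresponds to exactly one number. ISO named its standard UCS (Universal Character Set), where UCS-2 roughly corresponds to UTF-16 and UCS-4 to UTF-32; the Unicode Consortium simply named its standard Unicode.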
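II. The Origin of UTF-8 and UTF-16
1. The first version of Unicode involved two steps. The first step defines a specification that assigns every character a unique number: Unicode used the numbers 0 to 65535 (2^16 values in total) to represent all characters, and the 128 numbers from 0 to 127 denote exactly the same characters as ASCII. The second step is deciding how to turn a character's number (0 to 65535) into the bits stored in the computer. Storage raises the question of how many bytes each character occupies, and different answers give different storage formats. These are the UTFs (Unicode Transformation Formats): UTF-8 and UTF-16.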
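III. The Difference Between UTF-8 and UTF-16
1. UTF-16: in this first version, the number of every character is stored in two bytes. For text that is all English letters (where one byte per character would do), this is somewhat wasteful.
2. UTF-8: the space used to store a character's number is variable. For the numbers 0 to 65535, a character takes one, two, or three bytes.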
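IV. Pros and Cons of UTF-8 and UTF-16
1. If the text is entirely English, or English mixed with other scripts but mostly English, UTF-8 saves a lot of space compared with UTF-16.
2. If the text is entirely Chinese or similar characters, or a mix that is mostly Chinese, UTF-16 saves space: such characters take two bytes in UTF-16 but three in UTF-8. There is also the question of fault tolerance. In UTF-8 every byte carries marker bits that say whether it starts a character or continues one, so if a byte is corrupted or lost in transit, a decoder can resynchronize at the next lead byte and usually only the affected character is damaged. UTF-16 has no such markers: a corrupted byte likewise damages only one character, but a single dropped or inserted byte shifts the two-byte alignment and garbles everything after it.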
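The size trade-off can be checked with a few lines of C. The following is only a minimal sketch; the helper names `utf8_len`, `utf16_len`, and `total_len` are made up for this illustration and do not appear in the listings later in the article.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8 (1 to 4). */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed in UTF-16: 2, or 4 when a surrogate pair is required. */
static size_t utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Sum the encoded size of a sequence of code points
 * using either of the two helpers above. */
static size_t total_len(const uint32_t *cps, size_t n,
                        size_t (*len_of)(uint32_t))
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += len_of(cps[i]);
    return total;
}
```

For the three ASCII letters "UTF" this gives 3 bytes in UTF-8 against 6 in UTF-16, while for the three Chinese characters U+7A0B, U+5E8F, U+5458 it gives 9 bytes in UTF-8 against 6 in UTF-16.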
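V. A Unicode Example
1. For example, the Chinese character "汉" has the Unicode value 6C49 (in hexadecimal; in decimal it is 27721).
2. Representing "汉" in UTF-16 is simple: it is 01101100 01001001 (16 bits, two bytes). A program that knows the data is UTF-16 parses each pair of bytes as one unit.
3. Representing "汉" in UTF-8 is more involved, because a program reads the data one byte at a time and uses the marker bits at the start of each byte to decide whether one, two, or three bytes form a unit. The rules are:
0xxxxxxx: a byte in this form, starting with 0, is a unit by itself, exactly as in ASCII;
110xxxxx 10xxxxxx: in this form, two bytes make one unit;
1110xxxx 10xxxxxx 10xxxxxx: in this form, three bytes make one unit.
4. UTF-16 needs no marker bits, so two bytes can represent 2^16 = 65536 characters.
5. UTF-8 spends some bits on markers, so one byte can represent only 2^7 = 128 characters, two bytes only 2^11 = 2048 characters, and three bytes 2^16 = 65536 characters.
6. Because the code 27721 of "汉" is greater than 2047, two bytes are not enough, so the form 1110xxxx 10xxxxxx 10xxxxxx is used. The 16 binary digits of 27721 (0110 1100 0100 1001) fill the x positions from left to right, most significant bit first, giving 11100110 10110001 10001001, that is, the bytes E6 B1 89. UTF-8 always fills the bits in this order.
7. The terms Big-Endian and Little-Endian apply to encodings whose unit is wider than one byte, such as UTF-16: they describe the order in which the bytes of each unit are stored. Big-Endian stores the high byte first (6C 49 for "汉"), and Little-Endian stores the low byte first (49 6C). UTF-8 is a sequence of single bytes, so it has no byte-order variants.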
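The bit patterns in this section can be written as a short C sketch. It handles only the 0 to 65535 range discussed here and does not reject surrogate values; `utf8_encode_bmp` and `utf16_store` are hypothetical helper names chosen for this example, not functions from the listings below.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point in the range U+0000..U+FFFF as UTF-8.
 * Writes 1 to 3 bytes into out and returns how many were written. */
static size_t utf8_encode_bmp(uint16_t cp, uint8_t out[3])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}

/* Store one 16-bit UTF-16 code unit as two bytes,
 * high byte first (big-endian) or low byte first (little-endian). */
static void utf16_store(uint16_t unit, int big_endian, uint8_t out[2])
{
    out[0] = (uint8_t)(big_endian ? unit >> 8 : unit & 0xFF);
    out[1] = (uint8_t)(big_endian ? unit & 0xFF : unit >> 8);
}
```

Calling `utf8_encode_bmp(0x6C49, buf)` returns 3 and fills `buf` with E6 B1 89. `utf16_store(0x6C49, 1, buf)` yields 6C 49, and with `big_endian` set to 0 it yields 49 6C.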
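VI. The Second Version of Unicode
The 65536 values of the first version are not very many. They are more than enough for the commonly used characters, but not once many rarer characters are added. So a second version arrived in 1996, which extended the range of character numbers beyond 16 bits (today the range ends at 0x10FFFF). This gives us UTF-8, UTF-16, and UTF-32, which work on the same principles as before. UTF-32 stores every character in 32 bits, that is, 4 bytes. UTF-8 and UTF-16 are variable-length: UTF-8 uses one to four bytes per character (the original design allowed up to six), while UTF-16 uses either two bytes or four bytes, the latter as a surrogate pair.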
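The four-byte UTF-16 form works by splitting the number into a pair of 16-bit units. Below is a minimal sketch of that split and its inverse; the function names are invented for this illustration, and the arithmetic is the same as the halfBase, halfShift, and halfMask arithmetic in utf.c below.

```c
#include <stdint.h>

/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if cp is outside that range. */
static int utf16_to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits    */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 0;
}

/* Recombine a valid high/low surrogate pair into the code point. */
static uint32_t utf16_from_surrogates(uint16_t high, uint16_t low)
{
    return (((uint32_t)high - 0xD800) << 10)
         + ((uint32_t)low - 0xDC00)
         + 0x10000;
}
```

For example, U+1F600 splits into the pair D83D DE00, and recombining that pair gives 0x1F600 again.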
VII. Code
utf.c
/* ************************************************************************
 * Filename: utf.c
 * Description:
 * Version: 1.0
 * Created: 2016-10-21 09:50:05
 * Revision: none
 * Compiler: gcc
 * Author: YOUR NAME (),
 * Company:
 * ************************************************************************/
#include <stdio.h>
#include <string.h>
#include "utf.h"
static boolean isLegalUTF8(const UTF8 *source, int length)
{
    UTF8 a;
    const UTF8 *srcptr = NULL;
    if (NULL == source){
        printf("ERR, isLegalUTF8: source=%p\n", source);
        return FALSE;
    }
    srcptr = source+length;
    switch (length) {
    default:
        printf("ERR, isLegalUTF8 1: length=%d\n", length);
        return FALSE;
    /* Everything else falls through when "TRUE"... */
    case 4:
        if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
            printf("ERR, isLegalUTF8 2: length=%d, a=%x\n", length, a);
            return FALSE;
        }
    case 3:
        if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
            printf("ERR, isLegalUTF8 3: length=%d, a=%x\n", length, a);
            return FALSE;
        }
    case 2:
        if ((a = (*--srcptr)) > 0xBF){
            printf("ERR, isLegalUTF8 4: length=%d, a=%x\n", length, a);
            return FALSE;
        }
        switch (*source)
        {
        /* no fall-through in this inner switch */
        case 0xE0:
            if (a < 0xA0){
                printf("ERR, isLegalUTF8 1: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xED:
            if (a > 0x9F){
                printf("ERR, isLegalUTF8 2: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xF0:
            if (a < 0x90){
                printf("ERR, isLegalUTF8 3: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xF4:
            if (a > 0x8F){
                printf("ERR, isLegalUTF8 4: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        default:
            if (a < 0x80){
                printf("ERR, isLegalUTF8 5: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
        }
    case 1:
        if (*source >= 0x80 && *source < 0xC2){
            printf("ERR, isLegalUTF8: source=%x\n", *source);
            return FALSE;
        }
    }
    if (*source > 0xF4)
        return FALSE;
    return TRUE;
}
ConversionResult Utf8_To_Utf16 (const UTF8* sourceStart, UTF16* targetStart, size_t outLen , ConversionFlags flags)
{
    ConversionResult result = conversionOK;
    const UTF8* source = sourceStart;
    UTF16* target = targetStart;
    UTF16* targetEnd = targetStart + outLen/2;
    const UTF8* sourceEnd = NULL;
    if ((NULL == source) || (NULL == targetStart)){
        printf("ERR, Utf8_To_Utf16: source=%p, targetStart=%p\n", source, targetStart);
        return conversionFailed;
    }
    sourceEnd = strlen((const char*)sourceStart) + sourceStart;
    while (*source){
        UTF32 ch = 0;
        unsigned short extraBytesToRead = trailingBytesForUTF8[*source];
        if (source + extraBytesToRead >= sourceEnd){
            printf("ERR, Utf8_To_Utf16----sourceExhausted: source=%p, extraBytesToRead=%d, sourceEnd=%p\n", source, extraBytesToRead, sourceEnd);
            result = sourceExhausted;
            break;
        }
        /* Do this check whether lenient or strict */
        if (! isLegalUTF8(source, extraBytesToRead+1)){
            printf("ERR, Utf8_To_Utf16----isLegalUTF8 return FALSE: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
            result = sourceIllegal;
            break;
        }
        /*
         * The cases all fall through. See "Note A" below.
         */
        switch (extraBytesToRead) {
        case 5: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
        case 4: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
        case 3: ch += *source++; ch <<= 6;
        case 2: ch += *source++; ch <<= 6;
        case 1: ch += *source++; ch <<= 6;
        case 0: ch += *source++;
        }
        ch -= offsetsFromUTF8[extraBytesToRead];
        if (target >= targetEnd) {
            source -= (extraBytesToRead+1); /* Back up source pointer! */
            printf("ERR, Utf8_To_Utf16----target >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
            result = targetExhausted;
            break;
        }
        if (ch <= UNI_MAX_BMP){
            /* Target is a character <= 0xFFFF */
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END){
                if (flags == strictConversion){
                    source -= (extraBytesToRead+1); /* return to the illegal value itself */
                    printf("ERR, Utf8_To_Utf16----ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                    result = sourceIllegal;
                    break;
                } else {
                    *target++ = UNI_REPLACEMENT_CHAR;
                }
            } else{
                *target++ = (UTF16)ch; /* normal case */
            }
        }else if (ch > UNI_MAX_UTF16){
            if (flags == strictConversion) {
                result = sourceIllegal;
                source -= (extraBytesToRead+1); /* return to the start */
                printf("ERR, Utf8_To_Utf16----ch > UNI_MAX_UTF16: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                break; /* Bail out; shouldn't continue */
            } else {
                *target++ = UNI_REPLACEMENT_CHAR;
            }
        } else {
            /* target is a character in range 0xFFFF - 0x10FFFF. */
            if (target + 1 >= targetEnd) {
                source -= (extraBytesToRead+1); /* Back up source pointer! */
                printf("ERR, Utf8_To_Utf16----target + 1 >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                result = targetExhausted; break;
            }
            ch -= halfBase;
            *target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
            *target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
        }
    }
    return result;
}
int Utf16_To_Utf8 (const UTF16* sourceStart, UTF8* targetStart, size_t outLen , ConversionFlags flags)
{
    int result = 0;
    const UTF16* source = sourceStart;
    UTF8* target = targetStart;
    UTF8* targetEnd = targetStart + outLen;
    if ((NULL == source) || (NULL == targetStart)){
        printf("ERR, Utf16_To_Utf8: source=%p, targetStart=%p\n", source, targetStart);
        return conversionFailed;
    }
    while ( *source ) {
        UTF32 ch;
        unsigned short bytesToWrite = 0;
        const UTF32 byteMask = 0xBF;
        const UTF32 byteMark = 0x80;
        const UTF16* oldSource = source; /* In case we have to back up because of target overflow. */
        ch = *source++;
        /* If we have a surrogate pair, convert to UTF32 first. */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
            /* If the 16 bits following the high surrogate are in the source buffer... */
            if ( *source ){
                UTF32 ch2 = *source;
                /* If it's a low surrogate, convert to UTF32. */
                if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
                    ch = ((ch - UNI_SUR_HIGH_START) << halfShift) + (ch2 - UNI_SUR_LOW_START) + halfBase;
                    ++source;
                }else if (flags == strictConversion) { /* it's an unpaired high surrogate */
                    --source; /* return to the illegal value itself */
                    result = sourceIllegal;
                    break;
                }
            } else { /* We don't have the 16 bits following the high surrogate. */
                --source; /* return to the high surrogate */
                result = sourceExhausted;
                break;
            }
        } else if (flags == strictConversion) {
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END){
                --source; /* return to the illegal value itself */
                result = sourceIllegal;
                break;
            }
        }
        /* Figure out how many bytes the result will require */
        if(ch < (UTF32)0x80){
            bytesToWrite = 1;
        } else if (ch < (UTF32)0x800) {
            bytesToWrite = 2;
        } else if (ch < (UTF32)0x10000) {
            bytesToWrite = 3;
        } else if (ch < (UTF32)0x110000){
            bytesToWrite = 4;
        } else {
            bytesToWrite = 3;
            ch = UNI_REPLACEMENT_CHAR;
        }
        target += bytesToWrite;
        if (target > targetEnd) {
            source = oldSource; /* Back up source pointer! */
            target -= bytesToWrite; result = targetExhausted; break;
        }
        switch (bytesToWrite) { /* note: everything falls through. */
        case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 1: *--target = (UTF8)(ch | firstByteMark[bytesToWrite]);
        }
        target += bytesToWrite;
    }
    return result;
}
int main(void)
{
    int i = 0;
    UTF8 buf8[256] = "";
    UTF16 buf16[256] = {0};

    /* UTF-8 -> UTF-16. UTF8 is unsigned char, so strcpy needs a cast. */
    strcpy((char *)buf8, "程序员");
    Utf8_To_Utf16(buf8, buf16, sizeof(buf16), strictConversion);
    printf("\nUTF-8 => UTF-16 = ");
    while (buf16[i])
    {
        printf("%#x ", buf16[i]);
        i++;
    }

    /* UTF-16 -> UTF-8, starting from the code units of the same text. */
    memset(buf8, 0, sizeof(buf8));
    memset(buf16, 0, sizeof(buf16));
    buf16[0] = 0x7a0b;
    buf16[1] = 0x5e8f;
    buf16[2] = 0x5458;
    Utf16_To_Utf8(buf16, buf8, sizeof(buf8), strictConversion);
    printf("\nUTF-16 => UTF-8 = %s\n\n", (const char *)buf8);
    return 0;
}
utf.h
/* ************************************************************************
 * Filename: utf.h
 * Description:
 * Version: 1.0
 * Created: 2016-10-21 09:50:47
 * Revision: none
 * Compiler: gcc
 * Author: YOUR NAME (),
 * Company:
 * ************************************************************************/
#ifndef __UTF_H__
#define __UTF_H__

#define FALSE 0
#define TRUE 1
#define halfShift 10
#define UNI_SUR_HIGH_START (UTF32)0xD800
#define UNI_SUR_HIGH_END (UTF32)0xDBFF
#define UNI_SUR_LOW_START (UTF32)0xDC00
#define UNI_SUR_LOW_END (UTF32)0xDFFF
/* Some fundamental constants */
#define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
#define UNI_MAX_BMP (UTF32)0x0000FFFF
#define UNI_MAX_UTF16 (UTF32)0x0010FFFF
#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF

typedef unsigned char boolean;
typedef unsigned int CharType ;
typedef unsigned char UTF8;
typedef unsigned short UTF16;
typedef unsigned int UTF32;

static const UTF32 halfMask = 0x3FFUL;
static const UTF32 halfBase = 0x0010000UL;
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
static const UTF32 offsetsFromUTF8[6] = { 0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL, 0xFA082080UL, 0x82082080UL };
static const char trailingBytesForUTF8[256] =
{
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

typedef enum
{
    strictConversion = 0,
    lenientConversion
} ConversionFlags;

typedef enum
{
    conversionOK, /* conversion successful */
    sourceExhausted, /* partial character in source, but hit end */
    targetExhausted, /* insuff. room in target for conversion */
    sourceIllegal, /* source sequence is illegal/malformed */
    conversionFailed
} ConversionResult;

#endif
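With utf.c saved as UTF-8, running the program should print:
UTF-8 => UTF-16 = 0x7a0b 0x5e8f 0x5458
UTF-16 => UTF-8 = 程序员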
Conversion methods used by Android's ril.cpp (strndup16to8, strdup8to16, and readString16Inplace):
const char16_t* Parcel::readString16Inplace(size_t* outLen) const
{
    int32_t size = readInt32();
    // watch for potential int overflow from size+1
    if (size >= 0 && size < INT32_MAX) {
        *outLen = size;
        const char16_t* str = (const char16_t*)readInplace((size+1)*sizeof(char16_t));
        if (str != NULL) {
            return str;
        }
    }
    *outLen = 0;
    return NULL;
}
const void* Parcel::readInplace(size_t len) const
{
    if ((mDataPos+PAD_SIZE(len)) >= mDataPos && (mDataPos+PAD_SIZE(len)) <= mDataSize) {
        const void* data = mData+mDataPos;
        mDataPos += PAD_SIZE(len);
        LOGV("readInplace Setting data pos of %p to %d\n", this, mDataPos);
        return data;
    }
    return NULL;
}
Here is the code:
#include <cutils/jstring.h>
#include <assert.h>
#include <stdlib.h>
/**
 * Given a UTF-16 string, compute the length of the corresponding UTF-8
 * string in bytes.
 */
extern size_t strnlen16to8(const char16_t* utf16Str, size_t len)
{
    size_t utf8Len = 0;
    while (len--) {
        unsigned int uic = *utf16Str++;
        if (uic > 0x07ff)
            utf8Len += 3;
        else if (uic > 0x7f || uic == 0)
            utf8Len += 2;
        else
            utf8Len++;
    }
    return utf8Len;
}
/**
 * Convert a Java-Style UTF-16 string + length to a JNI-Style UTF-8 string.
 *
 * This basically means: embedded \0's in the UTF-16 string are encoded
 * as "0xc0 0x80"
 *
 * Make sure you allocate "utf8Str" with the result of strlen16to8() + 1,
 * not just "len".
 *
 * Please note, a terminated \0 is always added, so your result will always
 * be "strlen16to8() + 1" bytes long.
 */
extern char* strncpy16to8(char* utf8Str, const char16_t* utf16Str, size_t len)
{
    char* utf8cur = utf8Str;
    while (len--) {
        unsigned int uic = *utf16Str++;
        if (uic > 0x07ff) {
            *utf8cur++ = (uic >> 12) | 0xe0;
            *utf8cur++ = ((uic >> 6) & 0x3f) | 0x80;
            *utf8cur++ = (uic & 0x3f) | 0x80;
        } else if (uic > 0x7f || uic == 0) {
            *utf8cur++ = (uic >> 6) | 0xc0;
            *utf8cur++ = (uic & 0x3f) | 0x80;
        } else {
            *utf8cur++ = uic;
            if (uic == 0) {
                break;
            }
        }
    }
    *utf8cur = '\0';
    return utf8Str;
}
/**
 * Convert a UTF-16 string to UTF-8.
 *
 * Make sure you allocate "dest" with the result of strblen16to8(),
 * not just "strlen16()".
 */
char * strndup16to8 (const char16_t* s, size_t n)
{
    char *ret;
    if (s == NULL) {
        return NULL;
    }
    ret = malloc(strnlen16to8(s, n) + 1);
    if (ret == NULL) {
        /* allocation failed: do not write through a NULL pointer */
        return NULL;
    }
    strncpy16to8 (ret, s, n);
    return ret;
}
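The listing above encodes each 16-bit unit on its own, which has two consequences worth spelling out: an embedded U+0000 becomes the two bytes C0 80 (so the output never contains a bare zero byte before the terminator), and each half of a surrogate pair becomes its own three-byte sequence instead of the pair becoming one four-byte sequence. The sketch below restates the per-unit rule with plain `uint16_t` and `uint8_t`; `encode_unit_modified_utf8` is a hypothetical name, not part of the Android sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UTF-16 code unit using the same three branches as
 * strncpy16to8: three bytes above U+07FF, two bytes for U+0080..U+07FF
 * and for U+0000, one byte otherwise. Returns the byte count. */
static size_t encode_unit_modified_utf8(uint16_t unit, uint8_t *out)
{
    if (unit > 0x07FF) {
        out[0] = (uint8_t)(0xE0 | (unit >> 12));
        out[1] = (uint8_t)(0x80 | ((unit >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (unit & 0x3F));
        return 3;
    }
    if (unit > 0x7F || unit == 0) {
        out[0] = (uint8_t)(0xC0 | (unit >> 6));
        out[1] = (uint8_t)(0x80 | (unit & 0x3F));
        return 2;
    }
    out[0] = (uint8_t)unit;
    return 1;
}
```

With this rule the unit 0x0000 encodes as C0 80, and the high surrogate 0xD83D encodes as ED A0 BD. Strict UTF-8 decoders, such as isLegalUTF8 in utf.c above, reject both sequences, so this output is meant for consumers that expect the JNI-style format.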
#include <cutils/jstring.h>
#include <assert.h>
#include <stdlib.h>
#include <limits.h>
/* See http://www.unicode.org/reports/tr22/ for discussion
 * on invalid sequences
 */
#define UTF16_REPLACEMENT_CHAR 0xfffd
/* Clever trick from Dianne that returns 1-4 depending on leading bit sequence*/
#define UTF8_SEQ_LENGTH(ch) (((0xe5000000 >> ((ch >> 3) & 0x1e)) & 3) + 1)
/* note: macro expands to multiple lines */
#define UTF8_SHIFT_AND_MASK(unicode, byte) \
    (unicode)<<=6; (unicode) |= (0x3f & (byte));
#define UNICODE_UPPER_LIMIT 0x10fffd
/**
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strdup8to16 (const char* s, size_t *out_len)
{
    char16_t *ret;
    size_t len;
    if (s == NULL) return NULL;
    len = strlen8to16(s);
    // fail on overflow
    if (len && SIZE_MAX/len < sizeof(char16_t))
        return NULL;
    // no plus-one here. UTF-16 strings are not null terminated
    ret = (char16_t *) malloc (sizeof(char16_t) * len);
    return strcpy8to16 (ret, s, out_len);
}
/**
 * Like "strlen", but for strings encoded with Java's modified UTF-8.
 *
 * The value returned is the number of UTF-16 characters required
 * to represent this string.
 */
extern size_t strlen8to16 (const char* utf8Str)
{
    size_t len = 0;
    int ic;
    int expected = 0;
    while ((ic = *utf8Str++) != '\0') {
        /* bytes that start 0? or 11 are lead bytes and count as characters.*/
        /* bytes that start 10 are extension bytes and are not counted */
        if ((ic & 0xc0) == 0x80) {
            /* count the 0x80 extension bytes. if we have more than
             * expected, then start counting them because strcpy8to16
             * will insert UTF16_REPLACEMENT_CHAR's
             */
            expected--;
            if (expected < 0) {
                len++;
            }
        } else {
            len++;
            expected = UTF8_SEQ_LENGTH(ic) - 1;
            /* this will result in a surrogate pair */
            if (expected == 3) {
                len++;
            }
        }
    }
    return len;
}
/*
 * Retrieve the next UTF-32 character from a UTF-8 string.
 *
 * Stops at inner \0's
 *
 * Returns UTF16_REPLACEMENT_CHAR if an invalid sequence is encountered
 *
 * Advances "*pUtf8Ptr" to the start of the next character.
 */
static inline uint32_t getUtf32FromUtf8(const char** pUtf8Ptr)
{
    uint32_t ret;
    int seq_len;
    int i;
    /* Mask for leader byte for lengths 1, 2, 3, and 4 respectively*/
    static const char leaderMask[4] = {0xff, 0x1f, 0x0f, 0x07};
    /* Bytes that start with bits "10" are not leading characters. */
    if (((**pUtf8Ptr) & 0xc0) == 0x80) {
        (*pUtf8Ptr)++;
        return UTF16_REPLACEMENT_CHAR;
    }
    /* note we tolerate invalid leader 11111xxx here */
    seq_len = UTF8_SEQ_LENGTH(**pUtf8Ptr);
    ret = (**pUtf8Ptr) & leaderMask [seq_len - 1];
    if (**pUtf8Ptr == '\0') return ret;
    (*pUtf8Ptr)++;
    for (i = 1; i < seq_len ; i++, (*pUtf8Ptr)++) {
        if ((**pUtf8Ptr) == '\0') return UTF16_REPLACEMENT_CHAR;
        if (((**pUtf8Ptr) & 0xc0) != 0x80) return UTF16_REPLACEMENT_CHAR;
        UTF8_SHIFT_AND_MASK(ret, **pUtf8Ptr);
    }
    return ret;
}
/**
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strcpy8to16 (char16_t *utf16Str, const char*utf8Str,
                               size_t *out_len)
{
    char16_t *dest = utf16Str;
    while (*utf8Str != '\0') {
        uint32_t ret;
        ret = getUtf32FromUtf8(&utf8Str);
        if (ret <= 0xffff) {
            *dest++ = (char16_t) ret;
        } else if (ret <= UNICODE_UPPER_LIMIT) {
            /* Create surrogate pairs */
            /* See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Method_for_code_points_in_Plane_1.2C_Plane_2 */
            *dest++ = 0xd800 | ((ret - 0x10000) >> 10);
            *dest++ = 0xdc00 | ((ret - 0x10000) & 0x3ff);
        } else {
            *dest++ = UTF16_REPLACEMENT_CHAR;
        }
    }
    *out_len = dest - utf16Str;
    return utf16Str;
}
/**
 * length is the number of characters in the UTF-8 string.
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strcpylen8to16 (char16_t *utf16Str, const char*utf8Str,
                                  int length, size_t *out_len)
{
    /* TODO: Share more of this code with the method above. Only 2 lines changed. */
    char16_t *dest = utf16Str;
    const char *end = utf8Str + length; /* This line */
    while (utf8Str < end) { /* and this line changed. */
        uint32_t ret;
        ret = getUtf32FromUtf8(&utf8Str);
        if (ret <= 0xffff) {
            *dest++ = (char16_t) ret;
        } else if (ret <= UNICODE_UPPER_LIMIT) {
            /* Create surrogate pairs */
            /* See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Method_for_code_points_in_Plane_1.2C_Plane_2 */
            *dest++ = 0xd800 | ((ret - 0x10000) >> 10);
            *dest++ = 0xdc00 | ((ret - 0x10000) & 0x3ff);
        } else {
            *dest++ = UTF16_REPLACEMENT_CHAR;
        }
    }
    *out_len = dest - utf16Str;
    return utf16Str;
}
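The UTF8_SEQ_LENGTH macro in the listing above is terse, so here is one way to read it, written as a function over an unsigned byte. This is an interpretation of the expression rather than anything stated in the Android sources, and `utf8_seq_length` is a name made up for the sketch. The constant 0xe5000000 acts as a packed table of sixteen 2-bit entries indexed by the high nibble of the byte, and each entry holds the sequence length minus one.

```c
#include <stdint.h>

/* Same arithmetic as UTF8_SEQ_LENGTH, spelled out step by step. */
static int utf8_seq_length(uint8_t lead)
{
    /* (lead >> 3) & 0x1e equals the high nibble times 2, which is the
     * bit offset of that nibble's 2-bit entry in the packed table. */
    unsigned shift = (lead >> 3) & 0x1e;

    /* Entries for nibbles 0x0..0xB are 0 (length 1), 0xC and 0xD are 1
     * (length 2), 0xE is 2 (length 3), and 0xF is 3 (length 4). */
    return (int)((0xe5000000u >> shift) & 3) + 1;
}
```

Note that continuation bytes (10xxxxxx) also map to 1 here, which is why getUtf32FromUtf8 tests for them separately before it uses the macro.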
1、我们知道计算机其实只认识0101这样的字符串,当然了让我们看这样的01串会比较头晕,所以为了描述简单一般都用八进制、十进制、十六进制表示。
实际上都是等价的。其它像文字图片音视频等计算机也是不认识的,为了让计算机能表示这些信息就必须转换成一些数字,必须按照一些规则转换。
比如:刚开始的时候就有ASCII字符集(American Standard Code for Information Interchange, 美国信息交换标准码)它使用7 bits来表示一个字符,
总共表示128个字符,我们一般都是用字节(byte:即8个01串)来作为基本单位。当时一个字节来表示字符时第一个bit总是0,剩下的七个字节就来表示实际内容。后来IBM公司在此基础上进行了扩展,用8bit来表示一个字符,总共可以表示256个字符。也就是当第一个bit是0时仍表示之前那些常用的字符,当为1时就表示其他补充的字符。
2、英文字母再加一些其他标点字符之类的也不会超过256个,一个字节表示足够了。但其他一些文字不止这么多 ,像汉字就上万个,
于是又出现了其他各种字符集。这样不同的字符集交换数据时就有问题了,可能你用某个数字表示字符A,但另外的字符集又是用另外一个数字表示A。
为了适应全球化的发展,便于不同语言之间的兼容交互,而ASCII不再能胜任此任务了。所以就出现了Unicode和ISO这样的组织来统一制定一个标准,任何一个字符只对应一个确定的数字。ISO取的名字叫UCS(Universal Character Set)(ucs-2对应utf-16,ucs-4对应utf-32),Unicode取的名字就叫unicode了。
二、UTF-8和UTF-16的由来
1、Unicode第一个版本涉及到两个步骤:首先定义一个规范,给所有的字符指定一个唯一对应的数字,Unicode是用0至65535(2的16次方)之间的数字来表示所有字符,其中0至127这128个数字表示的字符仍然跟ASCII完全一样;第二怎么把字符对应的数字(0至65535)转化成01串保保存在计算机中。在保存时就涉及到了在计算机中占多少字节空间,就有不同的保存方式,于是出现了UTF(unicode transformation format):UTF-8和UTF-16。
三、UTF-8和UTF-16的区别
1、UTF-16:是任何字符对应的数字都用两个字节来保存,但如果都是英文字母(一个字节能表示一个字符)这样做有点浪费。
2、UTF-8:是任何字符对应的数字保存时所占的空间是可变的,可能用一个、两个或三个字节表示一个字符。
四、UTF-8和UTF-16的优劣
1、如果全部英文或英文与其他文字混合(英文占绝大部分),用UTF-8就比UTF-16节省了很多空间。
2、而如果全部是中文这样类似的字符或者混合字符(中文占绝大多数),UTF-16就可以节省很多空间,另外还有个容错问题(比如:UTF-8需要判断每个字节中的开头标志信息,所以如果一当某个字节在传送过程中出错了,就会导致后面的字节也会解析出错;而UTF-16不会判断开头标志,即使错也只会错一个字符,所以容错能力强)。
五、Unicode举例说明
1、例如:中文字"汉"对应的unicode是6C49(这是用十六进制表示,用十进制表示是27721);
2、UTF-16表示"汉":比较简单,就是01101100 01001001(共16 bit,两个字节),程序解析的时候知道是UTF-16就把两个字节当成一个单元来解析。
3、UTF-8表示"汉":比较复杂,因为程序是一个字节一个字节的来读取,然后再根据字节中开头的bit标志来识别是该把一个、两个或三个字节做为一个单元来处理。规则如下:
0xxxxxxx:如果是这样的格式,也就是以0开头就表示把一个字节做为一个单元,就跟ASCII完全一样;
110xxxxx 10xxxxxx:如果是这样的格式,则把两个字节当一个单元;
1110xxxx 10xxxxxx 10xxxxxx:如果是这样的格式,则把三个字节当一个单元。
4、由于UTF-16不需要用其它字符来做标志,所以两字节也就是2的16次能表示65536个字符;
5、而UTF-8由于里面有额外的标志信息,所有一个字节只能表示2的7次方128个字符,两个字节只能表示2的11次方2048个字符,而三个字节能表示2的16次方,65536个字符。
6、由于"汉"的编码27721大于2048了所有两个字节还不够,所以用1110xxxx 10xxxxxx 10xxxxxx这种格式,把27721对应的二进制从左到右填充XXX符号(实际上不一定从左到右,也可以从右到左)。
7、由于填充方式的不一样,于是就出现了Big-Endian、Little-Endian的术语。Big-Endian就是从左到右,Little-Endian是从右到左。
六、Unicode第二个版本
第一个版本的65536显然不算太多的数字,用它来表示常用的字符是没一点问题足够了,但如果加上很多特殊的也就不够了。于是从1996年有了第二个版本,用四个字节表示所有字符,这样就出现了UTF-8、UTF16、UTF-32,原理和之前是完全一样的,UTF-32就是把所有的字符都用32bit也就是4个字节来表示。然后UTF-8、UTF-16就视情况而定了。UTF-8可以选择1至8个字节中的任一个来表示,而UTF-16只能是选两字节或四字节。
七、代码
utf.c
[cpp]
view plain
copy
/* ************************************************************************
* Filename: utf.c
* Description:
* Version: 1.0
* Created: 2016年10月21日 09时50分05秒
* Revision: none
* Compiler: gcc
* Author: YOUR NAME (),
* Company:
* ************************************************************************/
#include <stdio.h>
#include <string.h>
#include "utf.h"
static boolean isLegalUTF8(const UTF8 *source, int length)
{
UTF8 a;
const UTF8 *srcptr = NULL;
if (NULL == source){
printf("ERR, isLegalUTF8: source=%p\n", source);
return FALSE;
}
srcptr = source+length;
switch (length) {
default:
printf("ERR, isLegalUTF8 1: length=%d\n", length);
return FALSE;
/* Everything else falls through when "TRUE"... */
case 4:
if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
printf("ERR, isLegalUTF8 2: length=%d, a=%x\n", length, a);
return FALSE;
}
case 3:
if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
printf("ERR, isLegalUTF8 3: length=%d, a=%x\n", length, a);
return FALSE;
}
case 2:
if ((a = (*--srcptr)) > 0xBF){
printf("ERR, isLegalUTF8 4: length=%d, a=%x\n", length, a);
return FALSE;
}
switch (*source)
{
/* no fall-through in this inner switch */
case 0xE0:
if (a < 0xA0){
printf("ERR, isLegalUTF8 1: source=%x, a=%x\n", *source, a);
return FALSE;
}
break;
case 0xED:
if (a > 0x9F){
printf("ERR, isLegalUTF8 2: source=%x, a=%x\n", *source, a);
return FALSE;
}
break;
case 0xF0:
if (a < 0x90){
printf("ERR, isLegalUTF8 3: source=%x, a=%x\n", *source, a);
return FALSE;
}
break;
case 0xF4:
if (a > 0x8F){
printf("ERR, isLegalUTF8 4: source=%x, a=%x\n", *source, a);
return FALSE;
}
break;
default:
if (a < 0x80){
printf("ERR, isLegalUTF8 5: source=%x, a=%x\n", *source, a);
return FALSE;
}
}
case 1:
if (*source >= 0x80 && *source < 0xC2){
printf("ERR, isLegalUTF8: source=%x\n", *source);
return FALSE;
}
}
if (*source > 0xF4)
return FALSE;
return TRUE;
}
ConversionResult Utf8_To_Utf16 (const UTF8* sourceStart, UTF16* targetStart, size_t outLen , ConversionFlags flags)
{
ConversionResult result = conversionOK;
const UTF8* source = sourceStart;
UTF16* target = targetStart;
UTF16* targetEnd = targetStart + outLen/2;
const UTF8* sourceEnd = NULL;
if ((NULL == source) || (NULL == targetStart)){
printf("ERR, Utf8_To_Utf16: source=%p, targetStart=%p\n", source, targetStart);
return conversionFailed;
}
sourceEnd = strlen((const char*)sourceStart) + sourceStart;
while (*source){
UTF32 ch = 0;
unsigned short extraBytesToRead = trailingBytesForUTF8[*source];
if (source + extraBytesToRead >= sourceEnd){
printf("ERR, Utf8_To_Utf16----sourceExhausted: source=%p, extraBytesToRead=%d, sourceEnd=%p\n", source, extraBytesToRead, sourceEnd);
result = sourceExhausted;
break;
}
/* Do this check whether lenient or strict */
if (! isLegalUTF8(source, extraBytesToRead+1)){
printf("ERR, Utf8_To_Utf16----isLegalUTF8 return FALSE: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
result = sourceIllegal;
break;
}
/*
* The cases all fall through. See "Note A" below.
*/
switch (extraBytesToRead) {
case 5: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
case 4: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
case 3: ch += *source++; ch <<= 6;
case 2: ch += *source++; ch <<= 6;
case 1: ch += *source++; ch <<= 6;
case 0: ch += *source++;
}
ch -= offsetsFromUTF8[extraBytesToRead];
if (target >= targetEnd) {
source -= (extraBytesToRead+1); /* Back up source pointer! */
printf("ERR, Utf8_To_Utf16----target >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
result = targetExhausted;
break;
}
if (ch <= UNI_MAX_BMP){
/* Target is a character <= 0xFFFF */
/* UTF-16 surrogate values are illegal in UTF-32 */
if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END){
if (flags == strictConversion){
source -= (extraBytesToRead+1); /* return to the illegal value itself */
printf("ERR, Utf8_To_Utf16----ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
result = sourceIllegal;
break;
} else {
*target++ = UNI_REPLACEMENT_CHAR;
}
} else{
*target++ = (UTF16)ch; /* normal case */
}
}else if (ch > UNI_MAX_UTF16){
if (flags == strictConversion) {
result = sourceIllegal;
source -= (extraBytesToRead+1); /* return to the start */
printf("ERR, Utf8_To_Utf16----ch > UNI_MAX_UTF16: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
break; /* Bail out; shouldn't continue */
} else {
*target++ = UNI_REPLACEMENT_CHAR;
}
} else {
/* target is a character in range 0xFFFF - 0x10FFFF. */
if (target + 1 >= targetEnd) {
source -= (extraBytesToRead+1); /* Back up source pointer! */
printf("ERR, Utf8_To_Utf16----target + 1 >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
result = targetExhausted; break;
}
ch -= halfBase;
*target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
*target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
}
}
return result;
}
int Utf16_To_Utf8 (const UTF16* sourceStart, UTF8* targetStart, size_t outLen , ConversionFlags flags)
{
int result = 0;
const UTF16* source = sourceStart;
UTF8* target = targetStart;
UTF8* targetEnd = targetStart + outLen;
if ((NULL == source) || (NULL == targetStart)){
printf("ERR, Utf16_To_Utf8: source=%p, targetStart=%p\n", source, targetStart);
return conversionFailed;
}
while ( *source ) {
UTF32 ch;
unsigned short bytesToWrite = 0;
const UTF32 byteMask = 0xBF;
const UTF32 byteMark = 0x80;
const UTF16* oldSource = source; /* In case we have to back up because of target overflow. */
ch = *source++;
/* If we have a surrogate pair, convert to UTF32 first. */
if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
/* If the 16 bits following the high surrogate are in the source buffer... */
if ( *source ){
UTF32 ch2 = *source;
/* If it's a low surrogate, convert to UTF32. */
if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
ch = ((ch - UNI_SUR_HIGH_START) << halfShift) + (ch2 - UNI_SUR_LOW_START) + halfBase;
++source;
}else if (flags == strictConversion) { /* it's an unpaired high surrogate */
--source; /* return to the illegal value itself */
result = sourceIllegal;
break;
}
} else { /* We don't have the 16 bits following the high surrogate. */
--source; /* return to the high surrogate */
result = sourceExhausted;
break;
}
} else if (flags == strictConversion) {
/* UTF-16 surrogate values are illegal in UTF-32 */
if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END){
--source; /* return to the illegal value itself */
result = sourceIllegal;
break;
}
}
/* Figure out how many bytes the result will require */
if(ch < (UTF32)0x80){
bytesToWrite = 1;
} else if (ch < (UTF32)0x800) {
bytesToWrite = 2;
} else if (ch < (UTF32)0x10000) {
bytesToWrite = 3;
} else if (ch < (UTF32)0x110000){
bytesToWrite = 4;
} else {
bytesToWrite = 3;
ch = UNI_REPLACEMENT_CHAR;
}
target += bytesToWrite;
if (target > targetEnd) {
source = oldSource; /* Back up source pointer! */
target -= bytesToWrite; result = targetExhausted; break;
}
switch (bytesToWrite) { /* note: everything falls through. */
case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
case 1: *--target = (UTF8)(ch | firstByteMark[bytesToWrite]);
}
target += bytesToWrite;
}
return result;
}
int main(int argc, char *argv[])
{
int i=0;
UTF8 buf8[256]="";
UTF16 buf16[256]={0};
strcpy(buf8,"程序员");
Utf8_To_Utf16(buf8,buf16,sizeof(buf16),strictConversion);
printf("\nUTF-8 => UTF-16 = ");
while(buf16[i])
{
printf("%#x ",buf16[i]);
i++;
}
memset(buf8,0,sizeof(buf8));
memset(buf16,0,sizeof(buf16));
buf16[0]=0x7a0b;
buf16[1]=0x5e8f;
buf16[2]=0x5458;
Utf16_To_Utf8 (buf16, buf8, sizeof(buf8) , strictConversion);
printf("\nUTF-16 => UTF-8 = %s\n\n",buf8);
return 0;
}
utf.h
[cpp]
view plain
copy
/* ************************************************************************
* Filename: utf.h
* Description:
* Version: 1.0
* Created: 2016年10月21日 09时50分47秒
* Revision: none
* Compiler: gcc
* Author: YOUR NAME (),
* Company:
* ************************************************************************/
#ifndef __UTF_H__
#define __UTF_H__
#define FALSE 0
#define TRUE 1
#define halfShift 10
#define UNI_SUR_HIGH_START (UTF32)0xD800
#define UNI_SUR_HIGH_END (UTF32)0xDBFF
#define UNI_SUR_LOW_START (UTF32)0xDC00
#define UNI_SUR_LOW_END (UTF32)0xDFFF
/* Some fundamental constants */
#define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
#define UNI_MAX_BMP (UTF32)0x0000FFFF
#define UNI_MAX_UTF16 (UTF32)0x0010FFFF
#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF
typedef unsigned char boolean;
typedef unsigned int CharType ;
typedef unsigned char UTF8;
typedef unsigned short UTF16;
typedef unsigned int UTF32;
static const UTF32 halfMask = 0x3FFUL;
static const UTF32 halfBase = 0x0010000UL;
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
static const UTF32 offsetsFromUTF8[6] = { 0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL, 0xFA082080UL, 0x82082080UL };
static const char trailingBytesForUTF8[256] =
{
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
typedef enum
{
strictConversion = 0,
lenientConversion
} ConversionFlags;
typedef enum
{
conversionOK, /* conversion successful */
sourceExhausted, /* partial character in source, but hit end */
targetExhausted, /* insuff. room in target for conversion */
sourceIllegal, /* source sequence is illegal/malformed */
conversionFailed
} ConversionResult;
#endif
运行结果如下:
安卓 ril.cpp 中的转换方法:
ril.cpp中的strndup16to8和strdup8to16和readString16Inplace
[html]view plain
copy
const char16_t* Parcel::readString16Inplace(size_t* outLen) const
{
int32_t size = readInt32();
// watch for potential int overflow from size+1
if (size >= 0 && size < INT32_MAX) {
*outLen = size;
const char16_t* str = (const char16_t*)readInplace((size+1)*sizeof(char16_t));
if (str != NULL) {
return str;
}
}
*outLen = 0;
return NULL;
}
const void* Parcel::readInplace(size_t len) const
{
    if ((mDataPos+PAD_SIZE(len)) >= mDataPos && (mDataPos+PAD_SIZE(len)) <= mDataSize) {
        const void* data = mData+mDataPos;
        mDataPos += PAD_SIZE(len);
        LOGV("readInplace Setting data pos of %p to %d\n", this, mDataPos);
        return data;
    }
    return NULL;
}
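To make the bounds check in readInplace easier to follow, here is a minimal standalone sketch in plain C. It assumes that PAD_SIZE rounds a length up to the next multiple of 4 bytes (the macro itself is not shown in this post, so treat that definition as an assumption), and the `Reader` struct and `read_inplace` function are hypothetical stand-ins for the Parcel members, not Android APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: lengths are padded up to a multiple of 4 bytes. */
#define PAD_SIZE(s) (((s) + 3) & ~(size_t)3)

/* Hypothetical stand-in for the Parcel's buffer and read cursor. */
typedef struct {
    const unsigned char *data; /* plays the role of mData */
    size_t size;               /* plays the role of mDataSize */
    size_t pos;                /* plays the role of mDataPos */
} Reader;

/* Return a pointer to the next len bytes and advance the cursor by the
 * padded length, or return NULL if the padded read would run past the end. */
static const void *read_inplace(Reader *r, size_t len)
{
    size_t padded = PAD_SIZE(len);
    assert(r->pos <= r->size);
    if (padded < len) {
        return NULL;                 /* rounding up wrapped around */
    }
    if (padded > r->size - r->pos) {
        return NULL;                 /* not enough data left */
    }
    const void *p = r->data + r->pos;
    r->pos += padded;
    return p;
}
```

Under this assumption, a 2-character UTF-16 string plus its terminator asks for (2+1)*2 = 6 bytes, and the cursor advances by 8. This is why readString16Inplace requests `(size+1)*sizeof(char16_t)` bytes rather than just `size` characters.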
Here is the code:
#include <cutils/jstring.h>
#include <assert.h>
#include <stdlib.h>
/**
 * Given a UTF-16 string, compute the length of the corresponding UTF-8
 * string in bytes.
 */
extern size_t strnlen16to8(const char16_t* utf16Str, size_t len)
{
    size_t utf8Len = 0;
    while (len--) {
        unsigned int uic = *utf16Str++;
        if (uic > 0x07ff)
            utf8Len += 3;
        else if (uic > 0x7f || uic == 0)
            utf8Len += 2;
        else
            utf8Len++;
    }
    return utf8Len;
}
/**
 * Convert a Java-Style UTF-16 string + length to a JNI-Style UTF-8 string.
 *
 * This basically means: embedded \0's in the UTF-16 string are encoded
 * as "0xc0 0x80"
 *
 * Make sure you allocate "utf8Str" with the result of strnlen16to8() + 1,
 * not just "len".
 *
 * Please note, a terminating \0 is always added, so your result will always
 * be "strnlen16to8() + 1" bytes long.
 */
extern char* strncpy16to8(char* utf8Str, const char16_t* utf16Str, size_t len)
{
    char* utf8cur = utf8Str;
    while (len--) {
        unsigned int uic = *utf16Str++;
        if (uic > 0x07ff) {
            /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            *utf8cur++ = (uic >> 12) | 0xe0;
            *utf8cur++ = ((uic >> 6) & 0x3f) | 0x80;
            *utf8cur++ = (uic & 0x3f) | 0x80;
        } else if (uic > 0x7f || uic == 0) {
            /* two bytes: 110xxxxx 10xxxxxx (also used for an embedded \0) */
            *utf8cur++ = (uic >> 6) | 0xc0;
            *utf8cur++ = (uic & 0x3f) | 0x80;
        } else {
            /* one byte: 0xxxxxxx */
            *utf8cur++ = uic;
            /* unreachable: uic == 0 is already handled by the branch above */
            if (uic == 0) {
                break;
            }
        }
    }
    *utf8cur = '\0';
    return utf8Str;
}
/**
 * Convert a UTF-16 string to a newly allocated, \0-terminated UTF-8 string.
 *
 * The buffer is sized with strnlen16to8() + 1, not just "n".
 * The caller owns the result and must free() it.
 */
char * strndup16to8 (const char16_t* s, size_t n)
{
    char *ret;
    if (s == NULL) {
        return NULL;
    }
    ret = malloc(strnlen16to8(s, n) + 1);
    if (ret == NULL) {
        return NULL; /* allocation failed */
    }
    strncpy16to8 (ret, s, n);
    return ret;
}
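As a check against the "汉" example worked through earlier, the sketch below encodes a single UTF-16 code unit with the same three-way branch as strncpy16to8. It is a standalone illustration, not part of the Android sources: `encode_unit` is a hypothetical helper name, and it depends only on the C standard library rather than on cutils/jstring.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UTF-16 code unit using the same branches as strncpy16to8.
 * Writes 1 to 3 bytes into out and returns how many were written. */
static size_t encode_unit(uint16_t unit, unsigned char out[3])
{
    unsigned int uic = unit;
    assert(out != NULL);
    if (uic > 0x07ff) {
        out[0] = (unsigned char)((uic >> 12) | 0xe0);         /* 1110xxxx */
        out[1] = (unsigned char)(((uic >> 6) & 0x3f) | 0x80); /* 10xxxxxx */
        out[2] = (unsigned char)((uic & 0x3f) | 0x80);        /* 10xxxxxx */
        return 3;
    }
    if (uic > 0x7f || uic == 0) {
        out[0] = (unsigned char)((uic >> 6) | 0xc0);          /* 110xxxxx */
        out[1] = (unsigned char)((uic & 0x3f) | 0x80);        /* 10xxxxxx */
        return 2;
    }
    out[0] = (unsigned char)uic;                              /* 0xxxxxxx */
    return 1;
}
```

Calling `encode_unit(0x6C49, buf)` fills `buf` with E6 B1 89, the three-byte UTF-8 form of "汉", while `encode_unit(0, buf)` yields C0 80, the "modified UTF-8" form of an embedded \0. Note that this per-unit approach, like strncpy16to8, treats each half of a surrogate pair as its own 3-byte sequence instead of combining the pair into one 4-byte sequence.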
#include <cutils/jstring.h>
#include <assert.h>
#include <stdlib.h>
#include <limits.h>
/* See http://www.unicode.org/reports/tr22/ for discussion
* on invalid sequences
*/
#define UTF16_REPLACEMENT_CHAR 0xfffd
/* Clever trick from Dianne that returns 1-4 depending on leading bit sequence*/
#define UTF8_SEQ_LENGTH(ch) (((0xe5000000 >> ((ch >> 3) & 0x1e)) & 3) + 1)
/* note: macro expands to two statements; use it only inside a braced block */
#define UTF8_SHIFT_AND_MASK(unicode, byte) \
    (unicode)<<=6; (unicode) |= (0x3f & (byte));
#define UNICODE_UPPER_LIMIT 0x10fffd
/**
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strdup8to16 (const char* s, size_t *out_len)
{
    char16_t *ret;
    size_t len;
    if (s == NULL) return NULL;
    len = strlen8to16(s);
    // fail on overflow
    if (len && SIZE_MAX/len < sizeof(char16_t))
        return NULL;
    // no plus-one here. UTF-16 strings are not null terminated
    ret = (char16_t *) malloc (sizeof(char16_t) * len);
    return strcpy8to16 (ret, s, out_len);
}
/**
 * Like "strlen", but for strings encoded with Java's modified UTF-8.
 *
 * The value returned is the number of UTF-16 characters required
 * to represent this string.
 */
extern size_t strlen8to16 (const char* utf8Str)
{
    size_t len = 0;
    int ic;
    int expected = 0;
    while ((ic = *utf8Str++) != '\0') {
        /* bytes that start 0? or 11 are lead bytes and count as characters.*/
        /* bytes that start 10 are extension bytes and are not counted */
        if ((ic & 0xc0) == 0x80) {
            /* count the 0x80 extension bytes. if we have more than
             * expected, then start counting them because strcpy8to16
             * will insert UTF16_REPLACEMENT_CHAR's
             */
            expected--;
            if (expected < 0) {
                len++;
            }
        } else {
            len++;
            expected = UTF8_SEQ_LENGTH(ic) - 1;
            /* this will result in a surrogate pair */
            if (expected == 3) {
                len++;
            }
        }
    }
    return len;
}
/*
 * Retrieve the next UTF-32 character from a UTF-8 string.
 *
 * Stops at inner \0's
 *
 * Returns UTF16_REPLACEMENT_CHAR if an invalid sequence is encountered
 *
 * Advances "*pUtf8Ptr" to the start of the next character.
 */
static inline uint32_t getUtf32FromUtf8(const char** pUtf8Ptr)
{
    uint32_t ret;
    int seq_len;
    int i;
    /* Mask for leader byte for lengths 1, 2, 3, and 4 respectively*/
    static const char leaderMask[4] = {0xff, 0x1f, 0x0f, 0x07};
    /* Bytes that start with bits "10" are not leading characters. */
    if (((**pUtf8Ptr) & 0xc0) == 0x80) {
        (*pUtf8Ptr)++;
        return UTF16_REPLACEMENT_CHAR;
    }
    /* note we tolerate invalid leader 11111xxx here */
    seq_len = UTF8_SEQ_LENGTH(**pUtf8Ptr);
    ret = (**pUtf8Ptr) & leaderMask [seq_len - 1];
    if (**pUtf8Ptr == '\0') return ret;
    (*pUtf8Ptr)++;
    for (i = 1; i < seq_len ; i++, (*pUtf8Ptr)++) {
        if ((**pUtf8Ptr) == '\0') return UTF16_REPLACEMENT_CHAR;
        if (((**pUtf8Ptr) & 0xc0) != 0x80) return UTF16_REPLACEMENT_CHAR;
        UTF8_SHIFT_AND_MASK(ret, **pUtf8Ptr);
    }
    return ret;
}
/**
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strcpy8to16 (char16_t *utf16Str, const char*utf8Str,
                               size_t *out_len)
{
    char16_t *dest = utf16Str;
    while (*utf8Str != '\0') {
        uint32_t ret;
        ret = getUtf32FromUtf8(&utf8Str);
        if (ret <= 0xffff) {
            *dest++ = (char16_t) ret;
        } else if (ret <= UNICODE_UPPER_LIMIT) {
            /* Create surrogate pairs */
            /* See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Method_for_code_points_in_Plane_1.2C_Plane_2 */
            *dest++ = 0xd800 | ((ret - 0x10000) >> 10);
            *dest++ = 0xdc00 | ((ret - 0x10000) & 0x3ff);
        } else {
            *dest++ = UTF16_REPLACEMENT_CHAR;
        }
    }
    *out_len = dest - utf16Str;
    return utf16Str;
}
/**
 * length is the number of bytes in the UTF-8 string.
 * out_len is an out parameter (which may not be null) containing the
 * length of the UTF-16 string (which may contain embedded \0's)
 */
extern char16_t * strcpylen8to16 (char16_t *utf16Str, const char*utf8Str,
                                  int length, size_t *out_len)
{
    /* TODO: Share more of this code with the method above. Only 2 lines changed. */
    char16_t *dest = utf16Str;
    const char *end = utf8Str + length; /* This line */
    while (utf8Str < end) { /* and this line changed. */
        uint32_t ret;
        ret = getUtf32FromUtf8(&utf8Str);
        if (ret <= 0xffff) {
            *dest++ = (char16_t) ret;
        } else if (ret <= UNICODE_UPPER_LIMIT) {
            /* Create surrogate pairs */
            /* See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Method_for_code_points_in_Plane_1.2C_Plane_2 */
            *dest++ = 0xd800 | ((ret - 0x10000) >> 10);
            *dest++ = 0xdc00 | ((ret - 0x10000) & 0x3ff);
        } else {
            *dest++ = UTF16_REPLACEMENT_CHAR;
        }
    }
    *out_len = dest - utf16Str;
    return utf16Str;
}
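Two pieces of the listing above are easy to misread, so here is a small standalone sketch of both: the UTF8_SEQ_LENGTH bit trick and the surrogate-pair arithmetic. The macro body is copied from the listing (with extra parentheses around `ch`); `utf8_seq_length` and `to_surrogates` are hypothetical helper names introduced only for this illustration, and the sketch assumes a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as in the listing. The constant 0xe5000000 acts as a
 * packed table of sixteen 2-bit entries indexed by the top 4 bits of the
 * lead byte: entries 0x0-0xB hold 0, 0xC-0xD hold 1, 0xE holds 2 and
 * 0xF holds 3. Adding 1 gives the sequence length. */
#define UTF8_SEQ_LENGTH(ch) (((0xe5000000 >> (((ch) >> 3) & 0x1e)) & 3) + 1)

/* Length in bytes (1-4) of the UTF-8 sequence introduced by lead. */
static int utf8_seq_length(unsigned char lead)
{
    return (int)UTF8_SEQ_LENGTH(lead);
}

/* Split a code point above U+FFFF into a UTF-16 surrogate pair,
 * using the same arithmetic as strcpy8to16. */
static void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10ffff);
    uint32_t v = cp - 0x10000;              /* 20 significant bits remain */
    *high = (uint16_t)(0xd800 | (v >> 10)); /* top 10 bits */
    *low = (uint16_t)(0xdc00 | (v & 0x3ff)); /* bottom 10 bits */
}
```

For example, U+1F600 becomes the pair D83D DE00, and the lead byte 0xE6 (the first byte of "汉") maps to a length of 3. A continuation byte such as 0x89 also maps to 1, which is why getUtf32FromUtf8 tests for the `10xxxxxx` pattern before consulting the macro.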