
Lua: counting UTF-8 Chinese characters and extracting substrings

2017-06-18 18:19

Contents

UTF-8 character rules

Size of a UTF-8 Chinese character (number of bytes)

Length of a UTF-8 string

Getting a substring of a UTF-8 string

Test example

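1: UTF-8 character rules

The first byte of a UTF-8 character tells you how many bytes that character occupies.

A single UTF-8 character is stored in 1, 2, 3, or 4 bytes.

If the first bit of the first byte is 0, the character is a single-byte character and occupies 1 byte.

If the first byte starts with 110, the character is a two-byte character and occupies 2 bytes.

If the first byte starts with 1110, the character is a three-byte character and occupies 3 bytes.

If the first byte starts with 11110, the character is a four-byte character and occupies 4 bytes.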
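To make the prefix rules above concrete, here is a minimal sketch that prints the first byte of a few sample characters in binary. The helper name `tobinary` is made up for this illustration, and the sample characters are written as decimal byte escapes so the result does not depend on how the source file is saved.

```lua
-- Hypothetical helper: render a byte value (0-255) as an 8-digit binary string.
local function tobinary(byte)
    local bits = {}
    for i = 8, 1, -1 do
        bits[i] = tostring(byte % 2)   -- lowest bit goes in the last slot
        byte = math.floor(byte / 2)
    end
    return table.concat(bits)
end

-- Sample characters given as UTF-8 byte escapes.
local samples = {
    "A",                 -- U+0041, 1 byte
    "\195\169",          -- U+00E9, 2 bytes
    "\229\165\189",      -- U+597D, 3 bytes
    "\240\159\152\128",  -- U+1F600, 4 bytes
}

for _, s in ipairs(samples) do
    local first = string.byte(s, 1)
    print(#s, first, tobinary(first))
end
-- 1    65     01000001
-- 2    195    11000011
-- 3    229    11100101
-- 4    240    11110000
```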

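[b]1.1 One byte[/b]

0xxxxxxx - 1 byte

The first bit is 0 and the remaining 7 bits can be anything, so the maximum value is: 01111111 -> 127

[b]1.2 Two bytes[/b]

110xxxxx - 192, 2 bytes

The first 3 bits are fixed at 110, so the minimum value is: 11000000 -> 192

[b]1.3 Three bytes[/b]

1110xxxx - 224, 3 bytes

The first 4 bits are fixed at 1110, so the minimum value is: 11100000 -> 224

[b]1.4 Four bytes[/b]

11110xxx - 240, 4 bytes

The first 5 bits are fixed at 11110, so the minimum value is: 11110000 -> 240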
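The boundary values can be checked directly by converting the binary patterns to decimal with `tonumber` and base 2. This is only a sanity check of the arithmetic; the constant names are made up for the illustration.

```lua
-- Convert each boundary bit pattern to its decimal value.
local ONE_BYTE_MAX   = tonumber("01111111", 2)  -- largest 1-byte lead value
local TWO_BYTE_MIN   = tonumber("11000000", 2)  -- smallest 2-byte lead value
local THREE_BYTE_MIN = tonumber("11100000", 2)  -- smallest 3-byte lead value
local FOUR_BYTE_MIN  = tonumber("11110000", 2)  -- smallest 4-byte lead value

print(ONE_BYTE_MAX, TWO_BYTE_MIN, THREE_BYTE_MIN, FOUR_BYTE_MIN)
-- 127    192    224    240
```

Note that the three-byte boundary is 224 (128 + 64 + 32).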

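2: Size of a UTF-8 Chinese character (number of bytes)

-- char is the numeric value of the character's first byte (from string.byte).
-- The comparisons use >= so that the boundary values 192, 224 and 240
-- themselves count as lead bytes.
local function chsize(char)
    if not char then
        return 0
    elseif char >= 240 then
        return 4
    elseif char >= 224 then
        return 3
    elseif char >= 192 then
        return 2
    else
        return 1
    end
end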
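The function above reports 1 for any byte below 192, which includes continuation bytes (10xxxxxx, values 128 to 191) that can never start a character. As a sketch of a stricter variant, assuming you want to detect when an index has landed in the middle of a character, the hypothetical `leadsize` below returns nil for bytes that cannot be a lead byte. It is not part of the original article, and it does not reject every byte that strict UTF-8 forbids (for example 192, 193, or 245 to 247).

```lua
-- Hypothetical stricter variant: returns the sequence length for a lead byte,
-- or nil when the byte cannot start a UTF-8 character.
local function leadsize(byte)
    if not byte then
        return nil
    elseif byte < 128 then
        return 1      -- 0xxxxxxx
    elseif byte < 192 then
        return nil    -- 10xxxxxx: continuation byte, not a lead byte
    elseif byte < 224 then
        return 2      -- 110xxxxx
    elseif byte < 240 then
        return 3      -- 1110xxxx
    elseif byte < 248 then
        return 4      -- 11110xxx
    else
        return nil    -- 11111xxx: never valid in UTF-8
    end
end

print(leadsize(65))   -- 1
print(leadsize(229))  -- 3
print(leadsize(165))  -- nil
```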

3: Length of a UTF-8 string

-- Counts characters by jumping from lead byte to lead byte.
local function utf8len(str)
    local len = 0
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        len = len + 1
    end
    return len
end
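A different way to get the same count, shown here only as an alternative sketch, is to count every byte that is not a continuation byte (128 to 191), because each character contributes exactly one such byte. The name `utf8len_gsub` is made up for this example, and the string literals assume the source file is saved as UTF-8.

```lua
-- Alternative sketch: count the bytes that are NOT continuation bytes.
-- string.gsub returns the number of matches as its second result.
local function utf8len_gsub(str)
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

local s = "好好学习1天天向上"
print(#s)               -- 25 (bytes: 8 characters of 3 bytes each, plus "1")
print(utf8len_gsub(s))  -- 9  (characters)
```

This version assumes well-formed input. On malformed data the two approaches can disagree.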

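4: Getting a substring of a UTF-8 string

-- startChar: 1-based index of the first character to return
-- numChars:  number of characters to return
local function utf8sub(str, startChar, numChars)
    -- Skip startChar - 1 characters to find the starting byte index.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance over numChars characters, or stop at the end of the string.
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars - 1
    end

    return str:sub(startIndex, currentIndex - 1)
end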
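On Lua 5.3 and later, the standard `utf8` library provides `utf8.len` and `utf8.offset`, so the byte walking can be delegated to it. The following is a minimal sketch under that assumption. The wrapper name `utf8sub_builtin` is made up for this example, and the code is guarded so it does nothing on older interpreters that lack the library.

```lua
-- Sketch for Lua 5.3+: substring by character index using utf8.offset.
-- utf8.offset(s, n) returns the byte position where the n-th character starts,
-- #s + 1 when n is one past the last character, and nil beyond that.
local function utf8sub_builtin(str, startChar, numChars)
    local startIndex = utf8.offset(str, startChar)
    if not startIndex then
        return ""  -- start lies beyond the end of the string
    end
    local endIndex = utf8.offset(str, startChar + numChars)
    if endIndex then
        return str:sub(startIndex, endIndex - 1)
    end
    return str:sub(startIndex)  -- fewer than numChars characters remain
end

if utf8 then  -- the utf8 library is absent in Lua 5.1/5.2
    local s = "好好学习1天天向上"  -- assumes this file is saved as UTF-8
    print(utf8.len(s))               -- 9
    print(utf8sub_builtin(s, 5, 2))  -- 1天
end
```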

5: Test example

print(utf8len("好好学习1天天向上"))      ---> 9
print(utf8sub("好好学习1天天向上",5,2))  ---> 1天

Finally

For an introduction to UTF-8, see the Wikipedia entry: https://en.wikipedia.org/wiki/UTF-8
