Chinese characters occupy multiple bytes: when truncating a string to a given byte length, what do you do with 1/3 of a character?

Latest recommended article published 2023-11-10 21:00:00


License: CC 4.0 BY-SA

Tags:

#function #substring #string #encoding #utf8

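A function that truncates a string by bytes

Write a function that truncates a string: the input is a string and a byte count, and the output is the string cut to that many bytes. A Chinese character must never be cut in half. For example, "我ABC" with 4 should yield "我AB", and "我ABC汉DEF" with 6 should yield "我ABC", not "我ABC" plus half of "汉". (These expected results correspond to an encoding in which a Chinese character takes two bytes, such as GBK.)

Analysis

substring(beginIndex, endIndex) cannot be used directly, because its indexes count characters (Java char values), whereas the problem asks for bytes. Its documentation reads: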

Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex.
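To make the character/byte mismatch concrete, here is a minimal sketch. The class name SubstringVsBytes and the helper utf8Length are made up for this illustration. The literal is written with a Unicode escape (\u6211 is 我) so that the source compiles the same way whatever encoding the file itself is saved in.

```java
import java.nio.charset.StandardCharsets;

public class SubstringVsBytes {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u6211ABC";              // U+6211 followed by "ABC"
        String firstTwo = s.substring(0, 2); // two chars: U+6211 and 'A'

        System.out.println(firstTwo.length());    // 2 chars
        System.out.println(utf8Length(firstTwo)); // 4 bytes: 3 for U+6211 plus 1 for 'A'
    }
}
```

Asking substring for two characters returns four UTF-8 bytes here, so a character index cannot stand in for a byte budget.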

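UTF-8 and GBK

UTF-8 is a variable-length, multi-byte encoding designed to cover the world's characters. It uses 8 bits (one byte) for English letters and 24 bits (three bytes) for common Chinese characters.

GBK is an extension of the national standard GB2312 and remains compatible with it. GBK encodes Chinese characters in two bytes, while ASCII characters such as English letters still take a single byte.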

String x = "我";
System.out.println(x.getBytes("utf-8").length);
System.out.println(x.getBytes("GBK").length);
/**
 * Output:
 * 3
 * 2
 */
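The same measurement can be repeated for an English letter next to a Chinese character. The sketch below is only illustrative: the class name BytesPerChar and the helper byteLength are hypothetical, and it assumes the running JDK provides the GBK charset (full JDK distributions normally do). \u6211 is 我.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesPerChar {
    // Bytes needed to encode s in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String[] samples = {"A", "\u6211"};   // an ASCII letter and U+6211

        for (String s : samples) {
            System.out.println("UTF-8: " + byteLength(s, StandardCharsets.UTF_8)
                    + ", GBK: " + byteLength(s, gbk));
        }
        // Prints:
        // UTF-8: 1, GBK: 1
        // UTF-8: 3, GBK: 2
    }
}
```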

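String s = "我ABC汗";
System.out.println(new String(s.getBytes("GBK"), "GBK"));
// prints "我ABC汗"

System.out.println(new String(s.getBytes(), "GBK"));
// garbled: 鎴慉BC姹�

System.out.println(new String(s.getBytes(), "utf8"));
// prints "我ABC汗"

System.out.println(new String(s.getBytes("utf8"), "utf8"));
// prints "我ABC汗"

System.out.println(new String(s.getBytes(), "ascii"));
// prints ���ABC���

As the output shows, getBytes() with no argument encoded the string as UTF-8 here (it uses the platform default charset). Decoding those bytes as ASCII leaves the English letters intact, but each Chinese character turns into three replacement characters, one per UTF-8 byte.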
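Putting these observations together, one possible way to truncate by bytes is to walk the string one character at a time, add up the encoded size of each, and stop before the byte budget would be exceeded. The following is a minimal sketch, not necessarily the solution the original exercise has in mind. The names ByteTruncate and truncateByBytes are made up for this illustration, and it assumes a stateless charset such as UTF-8 or GBK (and that the JDK provides GBK). \u6211 is 我 and \u6C49 is 汉.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteTruncate {
    // Returns the longest prefix of s whose encoding in cs fits in maxBytes,
    // without ever cutting a character in the middle.
    static String truncateByBytes(String s, int maxBytes, Charset cs) {
        int used = 0; // bytes consumed so far
        int i = 0;    // index of the next char to consider
        while (i < s.length()) {
            // Step by code point so surrogate pairs stay together.
            int charCount = Character.charCount(s.codePointAt(i));
            int size = s.substring(i, i + charCount).getBytes(cs).length;
            if (used + size > maxBytes) {
                break; // the next character would not fit completely
            }
            used += size;
            i += charCount;
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        // U+6211 takes 2 bytes in GBK, so 4 bytes hold U+6211 plus "AB".
        System.out.println(truncateByBytes("\u6211ABC", 4, gbk));
        // 6 bytes hold U+6211 plus "ABC" (5 bytes); U+6C49 would need 7.
        System.out.println(truncateByBytes("\u6211ABC\u6C49DEF", 6, gbk));
        // In UTF-8 the same character takes 3 bytes, so 4 bytes hold U+6211 plus "A".
        System.out.println(truncateByBytes("\u6211ABC", 4, StandardCharsets.UTF_8));
    }
}
```

Encoding each character separately keeps the sketch short and independent of any particular charset's byte layout, at the cost of some extra allocation. A version specialised for UTF-8 could instead inspect the lead byte of each sequence.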