Strings and Character Sets

License: CC 4.0 BY-SA


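Our project's CICS code has to convert messages between character encodings, so I tested how the conversion works internally. I was short on time and have not understood all of it yet; these are rough notes that I will tidy up later, and I hope they prompt better explanations from others.

Let's start with a piece of code: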

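    // Needs: import java.io.UnsupportedEncodingException; import java.util.Arrays;
    public static int getByteLeng(String s, final String encoding) throws UnsupportedEncodingException {
        System.out.println("--------------" + encoding + "[" + s + "]--------------");
        byte[] b = s.getBytes(encoding);
        System.out.println(Arrays.toString(b)); // bytes print as signed values, -128..127
        System.out.print("codePoint:");
        // Advance by code point, not by char, so a surrogate pair is printed once
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            System.out.print(s.codePointAt(i) + " | ");
        }
        System.out.println();
        return b.length;
    }

The test: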

    public static void main(String args[]) throws UnsupportedEncodingException {
//        String str = "一二三四五六七八九十二";
        String str = "一二三四1234";
        System.out.println("Unicode length=" + CICSUtil.getByteLeng(str, "Unicode"));
        System.out.println("utf-16 length=" + CICSUtil.getByteLeng(str, "utf-16"));
        System.out.println("utf-8 length=" + CICSUtil.getByteLeng(str, "utf-8"));
        System.out.println("GBK length=" + CICSUtil.getByteLeng(str, "GBK"));
        System.out.println("gb2312 length=" + CICSUtil.getByteLeng(str, "gb2312"));
    }

Output:

--------------Unicode[一二三四1234]--------------
[-1, -2, 0, 78, -116, 78, 9, 78, -37, 86, 49, 0, 50, 0, 51, 0, 52, 0]
codePoint:19968 | 20108 | 19977 | 22235 | 49 | 50 | 51 | 52 |
Unicode length=18
--------------utf-16[一二三四1234]--------------
[-2, -1, 78, 0, 78, -116, 78, 9, 86, -37, 0, 49, 0, 50, 0, 51, 0, 52]
codePoint:19968 | 20108 | 19977 | 22235 | 49 | 50 | 51 | 52 |
utf-16 length=18
--------------utf-8[一二三四1234]--------------
[-28, -72, -128, -28, -70, -116, -28, -72, -119, -27, -101, -101, 49, 50, 51, 52]
codePoint:19968 | 20108 | 19977 | 22235 | 49 | 50 | 51 | 52 |
utf-8 length=16
--------------GBK[一二三四1234]--------------
[-46, -69, -74, -2, -56, -3, -53, -60, 49, 50, 51, 52]
codePoint:19968 | 20108 | 19977 | 22235 | 49 | 50 | 51 | 52 |
GBK length=12
--------------gb2312[一二三四1234]--------------
[-46, -69, -74, -2, -56, -3, -53, -60, 49, 50, 51, 52]
codePoint:19968 | 20108 | 19977 | 22235 | 49 | 50 | 51 | 52 |
gb2312 length=12
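The output shows that the same string becomes quite different byte sequences depending on the encoding. `Unicode` and `utf-16` are the same encoding here, UTF-16, and differ only in byte order. The first two bytes are a byte order mark (BOM): `-1, -2` (0xFF 0xFE) announces little-endian, `-2, -1` (0xFE 0xFF) announces big-endian. After the BOM, every character in this test, Chinese or ASCII, takes two bytes. An ASCII character only needs one of them, so the other is 0: in this run `Unicode` writes the zero second (little-endian) and `utf-16` writes it first (big-endian). The conversion is otherwise identical.

Strictly speaking UTF-16 is not a fixed-length encoding: characters outside the Basic Multilingual Plane take four bytes (a surrogate pair), but none appear in this test. Also, the byte order produced by the name `Unicode` depends on the JDK; as far as I know, newer JDKs treat it as an alias for UTF-16 and write big-endian, so naming `UTF-16BE` or `UTF-16LE` explicitly is safer.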
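
To make the byte-order point concrete, here is a minimal sketch that encodes a short string with the three standard UTF-16 charsets. The class name `ByteOrderDemo` and the helper `dump` are made up for this illustration; the charsets come from `java.nio.charset.StandardCharsets` (Java 7 and later).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOrderDemo {
    /** Hypothetical helper: the encoded bytes of s, printed as signed values. */
    static String dump(String s, Charset cs) {
        return Arrays.toString(s.getBytes(cs));
    }

    public static void main(String[] args) {
        String s = "\u4E00" + "1"; // U+4E00 followed by the digit 1
        // UTF_16 writes a big-endian BOM (0xFE 0xFF) and then big-endian data
        System.out.println(dump(s, StandardCharsets.UTF_16));   // [-2, -1, 78, 0, 0, 49]
        // The BE/LE variants fix the byte order and write no BOM
        System.out.println(dump(s, StandardCharsets.UTF_16BE)); // [78, 0, 0, 49]
        System.out.println(dump(s, StandardCharsets.UTF_16LE)); // [0, 78, 49, 0]
    }
}
```

With an explicit byte order there is no BOM to skip, and the result no longer depends on how a particular JDK interprets the name `Unicode`.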

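UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. It treats Chinese and ASCII characters differently: each Chinese character in this test is encoded as 3 bytes, while an ASCII character is encoded as a single byte equal to its ASCII code.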
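
The three bytes are not arbitrary: UTF-8 spreads the bits of the character's Unicode number over a fixed pattern. The sketch below does this by hand for characters up to U+FFFF and compares the result with the JDK. The class `Utf8ByHand` and its `encode` method are illustrative names, and the method deliberately ignores surrogates and 4-byte sequences.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    /** Encodes one code point in U+0000..U+FFFF (surrogates not handled) as UTF-8. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx: ASCII is stored as-is
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F)) };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: where the Chinese characters above fall
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(0x4E00)));                              // [-28, -72, -128]
        System.out.println(Arrays.toString("\u4E00".getBytes(StandardCharsets.UTF_8)));   // [-28, -72, -128]
        System.out.println(Arrays.toString(encode('1')));                                 // [49]
    }
}
```

The first result matches the first three bytes of the utf-8 line in the output above.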

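GBK and GB2312 give identical results here. GBK is a superset of GB2312, so every character that exists in GB2312 has the same bytes in both. Both are variable-length as well: a Chinese character is encoded as 2 bytes and an ASCII character as its one-byte ASCII code.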
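
The two only diverge for characters that GB2312 lacks. A small sketch, assuming the JDK in use ships both charsets (a full JDK normally does); the class name `GbkVsGb2312` and the choice of the traditional character 國 (U+570B) as the test case are mine.

```java
import java.nio.charset.Charset;

class GbkVsGb2312 {
    /** True if the named charset has a byte sequence for c. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char simplified = '\u4E00';  // present in both charsets
        char traditional = '\u570B'; // traditional form, added by GBK
        System.out.println(canEncode("GB2312", simplified));  // true
        System.out.println(canEncode("GBK", simplified));     // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true
    }
}
```

When a character cannot be encoded, `String.getBytes` silently substitutes a replacement byte, so checking with `canEncode` first is a cheap safeguard when the target is GB2312.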

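Java 5 introduced code points, which are printed above as well. A code point is simply the character's number in Unicode: 一 is 19968, that is U+4E00, which is exactly the byte pair 78, 0 (0x4E, 0x00) in the utf-16 output. UTF-8 derives its bytes from that number through a fixed bit layout, whereas GBK uses a lookup table with no arithmetic relation to Unicode. I have not dug into this any further, so corrections are welcome.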
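
One place where chars and code points differ is outside the Basic Multilingual Plane, where a single character occupies two Java chars. A short sketch (the class `CodePointDemo` and the sample string are made up for illustration):

```java
class CodePointDemo {
    /** Returns the code points of s, treating a surrogate pair as one unit. */
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 1 char in the BMP, 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u4E00\uD83D\uDE00"; // U+4E00 followed by U+1F600
        System.out.println(s.length());  // 3 chars, but only 2 characters
        for (int cp : codePoints(s)) {
            // prints "19968 = U+4E00" and then "128512 = U+1F600"
            System.out.println(cp + " = U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```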

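Once the encodings are clear, processing is straightforward. The most common task is cutting a string at a byte offset. One approach is to convert the string to a byte[] in some encoding, cut, and then decode with the same encoding. This works with any encoding as long as the cut never lands in the middle of a character; how to detect that differs from one encoding to the next, but the goal is the same.

If you want n chars, there is nothing to think about: the substring function does the job. In communication code, however, the lower layers usually pass byte streams, so working on byte[] is hard to avoid.

Here is an example that works on GBK bytes: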

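    /**
     * Extracts a substring by byte offsets into the GBK encoding of the string.
     * A double-byte character that straddles either offset is dropped, never split.
     * @param origin the string to cut
     * @param sindex start byte offset (inclusive)
     * @param eindex end byte offset (exclusive)
     * @return the characters lying entirely within [sindex, eindex)
     * @throws UnsupportedEncodingException if GBK is not available
     */
    public static String substring(String origin, int sindex, int eindex) throws UnsupportedEncodingException {
        final String encoding = "GBK";
        if (origin == null || origin.isEmpty())
            return "";
        byte[] all = origin.getBytes(encoding);
        int end = Math.min(eindex, all.length); // never read past the encoded bytes
        int from = -1; // first character boundary at or after sindex
        int to = 0;    // last character boundary at or before end
        for (int i = 0; i <= end; ) {
            if (from < 0 && i >= sindex)
                from = i;
            to = i;
            if (i == all.length)
                break;
            // A GBK lead byte is 0x81..0xFE; anything else is a single-byte character.
            // Counting negative bytes is not enough: a trail byte can be 0x40..0x7E.
            boolean lead = (all[i] & 0xFF) >= 0x81 && i + 1 < all.length;
            i += lead ? 2 : 1;
        }
        if (from < 0 || from >= to)
            return "";
        return new String(all, from, to - from, encoding);
    }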

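Going through UTF-16 gives the same effect. Here `length` is counted the GBK way: 1 for a character below U+0100 and 2 for anything else.

    public static String Substring(String s, int length) throws UnsupportedEncodingException {
        // Name the byte order explicitly: "Unicode" adds a BOM and its byte order
        // is not guaranteed (in the output above it is little-endian).
        byte[] bytes = s.getBytes("UTF-16BE");
        int n = 0; // width counted so far
        int i = 0;
        for (; i < bytes.length && n < length; i++) {
            if (i % 2 == 1) {
                n++; // low byte: always counts
            } else if (bytes[i] != 0) {
                n++; // non-zero high byte: a wide character counts twice
            }
        }
        // Stopped between the two bytes of a wide character: drop that character
        if (i % 2 == 1) {
            i = i - 1;
        }
        return new String(bytes, 0, i, "UTF-16BE");
    }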
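
The same idea carries over to UTF-8, where the boundary test is different again: every continuation byte has the bit pattern 10xxxxxx, so a cut can be moved back until it no longer points at one. A minimal sketch, with the class `Utf8Truncate` and method `truncate` being hypothetical names rather than part of the project code:

```java
import java.nio.charset.StandardCharsets;

class Utf8Truncate {
    /** Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes. */
    static String truncate(String s, int maxBytes) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        if (maxBytes >= b.length) {
            return s; // already fits
        }
        int end = Math.max(maxBytes, 0);
        // b[end] is the first byte that would be cut off. If it is a continuation
        // byte (10xxxxxx), the cut is inside a character: step back to its start.
        while (end > 0 && (b[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(b, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u4E00\u4E8C\u4E09\u56DB1234"; // the test string used above
        System.out.println(truncate(s, 4).length());  // 1: only the first character fits
        System.out.println(truncate(s, 13).length()); // 5: four characters plus "1"
    }
}
```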