Java Source Code Analysis: String

Article link: https://blog.youkuaiyun.com/hbuexiaonai/article/details/94589395

String is a commonly used class that is at once mysterious and powerful. Today we take a look inside.

public final class String
implements java.io.Serializable, Comparable<String>, CharSequence {

First, String is declared final, which means it cannot be subclassed. It implements three interfaces: Serializable, Comparable<String>, and CharSequence.

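java.io.Serializable is a marker interface with no methods; implementing it marks the class as serializable. Serialization is the process of converting an object into a sequence of bytes (that is, turning in-memory objects into a byte stream, most commonly written to a file), so that an object instance can be persisted or sent over a network through a socket.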
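As a rough illustration of what serializability buys, the sketch below writes a String to an in-memory byte array and reads it back. The class and method names (StringSerializationDemo, toBytes, fromBytes) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class StringSerializationDemo {
    // Serializes the string into a byte array held in memory.
    static byte[] toBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a String from bytes produced by toBytes.
    static String fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (String) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String copy = fromBytes(toBytes("hello"));
        System.out.println(copy.equals("hello")); // the content survives the round trip
    }
}
```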

Comparable

This interface has a single method, compareTo(T o), which is used to order two instances.
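A small sketch of compareTo in use: Arrays.sort on a String[] relies on the natural ordering that compareTo defines. CompareToDemo and sortedCopy are names invented for this example.

```java
import java.util.Arrays;

class CompareToDemo {
    // Returns a sorted copy; Arrays.sort uses String.compareTo for the ordering.
    static String[] sortedCopy(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Negative result: "apple" sorts before "banana".
        System.out.println("apple".compareTo("banana") < 0);
        System.out.println(Arrays.toString(sortedCopy("pear", "apple", "fig")));
    }
}
```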

CharSequence

This interface represents a read-only sequence of chars. Its methods include length(), charAt(int index), and subSequence(int start, int end). It is worth noting that StringBuffer and StringBuilder implement it as well.
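Because String, StringBuilder, and StringBuffer all implement CharSequence, a method written against the interface accepts any of them. A minimal sketch, with a hypothetical helper named count:

```java
class CharSequenceDemo {
    // Counts how often target occurs, using only the CharSequence API.
    static int count(CharSequence seq, char target) {
        int n = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) == target) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("banana", 'a'));
        System.out.println(count(new StringBuilder("banana"), 'a'));
        System.out.println(count(new StringBuffer("banana"), 'n'));
    }
}
```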

private final char value[];

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /** use serialVersionUID from JDK 1.0.2 for interoperability */
    private static final long serialVersionUID = -6849794470754667710L;

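Three fields are defined: value[] holds the string's characters, hash caches the hash code, and serialVersionUID is the serialization version ID. Only the first two are instance fields; serialVersionUID is a static constant.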

String then provides as many as 16 constructors, all of them different ways of initializing the value, so they are not listed here one by one.

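Next are some basic, commonly used methods: length() returns the length, isEmpty() tests for emptiness, and charAt(index) returns the char at index. There are also the following less common methods: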

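// Returns the Unicode code point at index. The return type is int because a code point
// can exceed the 16-bit char range (the result is not limited to ASCII).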
public int codePointAt(int index) {
        if ((index < 0) || (index >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointAtImpl(value, index, value.length);
    }
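// Returns the code point that ends just before index (the char at index - 1, combined
// with the char at index - 2 when the two form a surrogate pair)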
public int codePointBefore(int index) {
        int i = index - 1;
        if ((i < 0) || (i >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointBeforeImpl(value, index, 0);
    }
// Returns the number of Unicode code points in [beginIndex, endIndex); a surrogate pair counts as one
public int codePointCount(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > value.length || beginIndex > endIndex) {
            throw new IndexOutOfBoundsException();
        }
        return Character.codePointCountImpl(value, beginIndex, endIndex - beginIndex);
    }
// Returns the char index reached by moving codePointOffset code points away from index
public int offsetByCodePoints(int index, int codePointOffset) {
        if (index < 0 || index > value.length) {
            throw new IndexOutOfBoundsException();
        }
        return Character.offsetByCodePointsImpl(value, 0, value.length,
                index, codePointOffset);
    }
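These four methods only differ from their char-based counterparts when the string contains characters outside the Basic Multilingual Plane. The sketch below uses a sample string of my own choosing: "a", then U+1F600 (stored as the surrogate pair D83D DE00), then "b".

```java
class CodePointDemo {
    // 4 chars in memory, but only 3 Unicode code points.
    static final String SAMPLE = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                               // counts chars
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));    // the full code point at the pair
        System.out.println(Integer.toHexString(SAMPLE.codePointBefore(3))); // same pair, read backwards
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));     // counts code points
        System.out.println(SAMPLE.offsetByCodePoints(0, 2));               // char index after 2 code points
    }
}
```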

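The next method copies the string's characters into the dst array, starting at dstBegin. Unlike the methods above, it performs no explicit bounds checks, which looks odd at first. That does not mean an out-of-range call goes unnoticed, though: System.arraycopy validates the ranges itself and throws an IndexOutOfBoundsException. The method is also package-private, so only code inside java.lang can call it.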

void getChars(char dst[], int dstBegin) {
        System.arraycopy(value, 0, dst, dstBegin, value.length);
    }
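Since this overload is package-private, it cannot be called from user code. The sketch below therefore reproduces its single line on plain char arrays to show that System.arraycopy rejects a destination that is too small. ArrayCopyBoundsDemo and copyThrows are names invented for this example.

```java
class ArrayCopyBoundsDemo {
    // Mirrors the body of getChars(char[], int) and reports whether the copy was rejected.
    static boolean copyThrows(char[] src, char[] dst, int dstBegin) {
        try {
            System.arraycopy(src, 0, dst, dstBegin, src.length);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        char[] src = "abc".toCharArray();
        System.out.println(copyThrows(src, new char[5], 1)); // fits: positions 1..3
        System.out.println(copyThrows(src, new char[5], 3)); // would need positions 3..5
        System.out.println(copyThrows(src, new char[2], 0)); // destination too short
    }
}
```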

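The public overload goes back to explicit checks (perhaps it was written by someone else). It copies the characters from srcBegin (inclusive) to srcEnd (exclusive) into dst, starting at dstBegin.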

public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
        if (srcBegin < 0) {
            throw new StringIndexOutOfBoundsException(srcBegin);
        }
        if (srcEnd > value.length) {
            throw new StringIndexOutOfBoundsException(srcEnd);
        }
        if (srcBegin > srcEnd) {
            throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
        }
        System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }
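A short usage sketch for the public overload, covering one valid copy and the three rejected argument combinations. The helper names copyRange and rejects are hypothetical.

```java
class GetCharsDemo {
    // Copies src[srcBegin, srcEnd) into a fresh array and returns it as a String.
    static String copyRange(String src, int srcBegin, int srcEnd) {
        char[] dst = new char[srcEnd - srcBegin];
        src.getChars(srcBegin, srcEnd, dst, 0);
        return new String(dst);
    }

    // Reports whether getChars rejects the given source range.
    static boolean rejects(String src, int srcBegin, int srcEnd) {
        try {
            src.getChars(srcBegin, srcEnd, new char[src.length()], 0);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(copyRange("hello world", 6, 11));
        System.out.println(rejects("abc", -1, 2)); // srcBegin < 0
        System.out.println(rejects("abc", 0, 4));  // srcEnd > length
        System.out.println(rejects("abc", 2, 1));  // srcBegin > srcEnd
    }
}
```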

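The following methods all turn the string into bytes stored in a byte array. The first overload is deprecated: it simply casts each char to byte and so keeps only the low 8 bits. The other three encode the characters with a named charset, a Charset object, or the platform default charset.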

public void getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin) {
        if (srcBegin < 0) {
            throw new StringIndexOutOfBoundsException(srcBegin);
        }
        if (srcEnd > value.length) {
            throw new StringIndexOutOfBoundsException(srcEnd);
        }
        if (srcBegin > srcEnd) {
            throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
        }
        Objects.requireNonNull(dst);

        int j = dstBegin;
        int n = srcEnd;
        int i = srcBegin;
        char[] val = value;   /* avoid getfield opcode */

        while (i < n) {
            dst[j++] = (byte)val[i++];
        }
    }

public byte[] getBytes(String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null) throw new NullPointerException();
        return StringCoding.encode(charsetName, value, 0, value.length);
    }

public byte[] getBytes(Charset charset) {
        if (charset == null) throw new NullPointerException();
        return StringCoding.encode(charset, value, 0, value.length);
    }

public byte[] getBytes() {
        return StringCoding.encode(value, 0, value.length);
    }
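The number of bytes produced depends on the charset, which is easy to see with a non-ASCII character. In this sketch the sample characters and the helper names (encodedLength, lowByte) are my own; lowByte imitates the cast used by the deprecated overload instead of calling it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // What the deprecated getBytes(int, int, byte[], int) stores for one char: the low 8 bits only.
    static byte lowByte(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d"; // a single CJK character
        System.out.println(encodedLength(cjk, StandardCharsets.UTF_8));
        System.out.println(encodedLength(cjk, StandardCharsets.UTF_16BE));
        System.out.println(encodedLength("abc", StandardCharsets.UTF_8));
        System.out.println(Integer.toHexString(lowByte('\u4e2d'))); // high byte 0x4e is lost
    }
}
```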

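The equals method first checks whether the two references point to the same object and returns true if they do. Otherwise it checks whether the argument is a String (String is final, so there are no subclasses to consider), then compares the lengths, and finally compares the characters one by one. Only the runtime type of the argument matters, not its declared type: given String str = "abc"; Object dst = "abc";, str.equals(dst) returns true.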

public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = value.length;
            if (n == anotherString.value.length) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = 0;
                while (n-- != 0) {
                    if (v1[i] != v2[i])
                        return false;
                    i++;
                }
                return true;
            }
        }
        return false;
    }
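The difference between reference comparison and equals can be sketched as follows. EqualsDemo and sameContent are hypothetical names; new String("abc") is used only to force a distinct object with the same content.

```java
class EqualsDemo {
    // Thin wrapper so the declared type of the argument is Object, as in the text.
    static boolean sameContent(String a, Object b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "abc";
        String fresh = new String("abc"); // a different object holding the same chars
        Object asObject = "abc";

        System.out.println(literal == fresh);                                // different references
        System.out.println(sameContent(literal, fresh));                     // same content
        System.out.println(sameContent(literal, asObject));                  // declared type does not matter
        System.out.println(sameContent(literal, new StringBuilder("abc")));  // not a String
    }
}
```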

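There are other comparison methods as well: contentEquals and nonSyncContentEquals compare content against other character sequences, equalsIgnoreCase compares content while ignoring case, compareTo orders strings lexicographically, and regionMatches tests whether n characters of this string and of another string, each starting at a given offset, are equal.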
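A few illustrative calls to the public comparison methods (nonSyncContentEquals is private and cannot be called directly). The sample strings are my own.

```java
class ComparisonDemo {
    // Content comparison against any CharSequence, not just String.
    static boolean sameText(String a, CharSequence b) {
        return a.contentEquals(b);
    }

    public static void main(String[] args) {
        System.out.println(sameText("abc", new StringBuilder("abc")));
        System.out.println("Hello".equalsIgnoreCase("hELLO"));
        System.out.println("apple".compareTo("apricot") < 0);
        // Compare 4 chars: "Hello" from offset 1 ("ello") with "yellow" from offset 1 ("ello").
        System.out.println("Hello".regionMatches(1, "yellow", 1, 4));
        // Same idea, ignoring case: "Hel" against "HEL".
        System.out.println("Hello".regionMatches(true, 0, "HELP", 0, 3));
    }
}
```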

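Next is hashCode. Leaving the algorithm itself aside, the code shows that the hash is computed lazily on the first call and cached in the hash field, so later calls return the cached value. The one exception is a non-empty string whose hash happens to be 0: because 0 doubles as the "not computed yet" marker, its hash is recomputed on every call.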

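Why 31 as the multiplier? Multiplication by 31 can be written as 31 * i == (i << 5) - i, and many JVMs apply this optimization automatically. 31 also fits in 5 bits and is an odd prime. int multiplication in Java overflows silently; with an even multiplier, an overflow would shift information out of the high bits and lose it, which an odd multiplier avoids. Being small, 31 keeps the arithmetic cheap, and in practice it spreads hash values reasonably well, so it was chosen as the multiplier.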

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
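Both claims can be checked directly: a hand-written loop reproduces hashCode(), and the shift form equals the multiplication even when the int overflows. manualHash and shiftForm are names made up for this sketch.

```java
class HashDemo {
    // Same recurrence as String.hashCode: h = 31 * h + c for each char.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // The shift-and-subtract form of multiplying by 31.
    static int shiftForm(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("abc") == "abc".hashCode());
        System.out.println(31 * 7 == shiftForm(7));
        // The identity also holds under overflow, since both sides wrap around the same way.
        System.out.println(31 * Integer.MAX_VALUE == shiftForm(Integer.MAX_VALUE));
    }
}
```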

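Then come the indexOf methods for locating characters and substrings, the substring methods for extracting part of a string, concat for concatenation, and the replace methods for substituting characters.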
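These methods are often combined. As a small sketch, the hypothetical helper between below uses indexOf to find two delimiters and substring to cut out the text between them; it returns null when either delimiter is missing.

```java
class BasicOpsDemo {
    // Returns the text between the first `open` and the next `close`, or null if not found.
    static String between(String s, char open, char close) {
        int start = s.indexOf(open);
        if (start < 0) {
            return null;
        }
        int end = s.indexOf(close, start + 1);
        if (end < 0) {
            return null;
        }
        return s.substring(start + 1, end);
    }

    public static void main(String[] args) {
        System.out.println(between("key=[value]", '[', ']'));
        System.out.println("foo".concat("bar"));
        System.out.println("a-b-c".replace('-', '+'));
    }
}
```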

Next comes the highlight: the regex-based split method.

public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

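First, the conditions:

1.((regex.value.length == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) // regex has length 1 and is not one of the listed regex metacharacters.

2.(regex.length() == 2 && regex.charAt(0) == '\\' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 && ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0)) // regex has length 2, the first char is a backslash, and the second char is neither an ASCII digit nor an ASCII letter. Each ((ch-lo)|(hi-ch)) < 0 test is true exactly when ch lies outside [lo, hi].

3.(ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE)) // ch is not a UTF-16 surrogate code unit, i.e. it lies outside U+D800..U+DFFF.

The three combine as (1 || 2) && 3. When the combined condition holds, the fast path runs: a while loop finds each split point with indexOf and cuts out the pieces with substring.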
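The condition can be restated with public API only, which makes it easier to experiment with. This is a sketch of the same logic, not the JDK's code: outsideRange and takesFastPath are hypothetical names, and Character.isSurrogate stands in for the explicit surrogate range comparison.

```java
class FastPathDemo {
    // True exactly when ch is outside [lo, hi]: if either difference is negative,
    // the sign bit survives the bitwise OR.
    static boolean outsideRange(char ch, char lo, char hi) {
        return ((ch - lo) | (hi - ch)) < 0;
    }

    // Re-statement of the (1 || 2) && 3 condition from split.
    static boolean takesFastPath(String regex) {
        char ch = 0;
        boolean singleLiteral = regex.length() == 1
                && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1;
        boolean escapedLiteral = !singleLiteral
                && regex.length() == 2
                && regex.charAt(0) == '\\'
                && outsideRange(ch = regex.charAt(1), '0', '9')
                && outsideRange(ch, 'a', 'z')
                && outsideRange(ch, 'A', 'Z');
        return (singleLiteral || escapedLiteral) && !Character.isSurrogate(ch);
    }

    public static void main(String[] args) {
        System.out.println(takesFastPath(","));   // plain one-char delimiter
        System.out.println(takesFastPath("."));   // metacharacter, needs the regex engine
        System.out.println(takesFastPath("\\.")); // escaped metacharacter, a literal dot
        System.out.println(takesFastPath("\\d")); // character class, needs the regex engine
    }
}
```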

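Finally, after all this work, any regex that does not meet these conditions still falls through to the lower-level implementation. In other words, every genuine regular expression is handled by the line return Pattern.compile(regex).split(this, limit);, which compiles the regex with Pattern and lets it split the string into an array.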
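The effect of limit and of metacharacters is easiest to see on concrete inputs. The sample strings below are my own; show is a hypothetical helper that wraps Arrays.toString.

```java
import java.util.Arrays;

class SplitDemo {
    // Renders the pieces so that empty strings remain visible.
    static String show(String[] parts) {
        return Arrays.toString(parts);
    }

    public static void main(String[] args) {
        System.out.println(show("a,b,,".split(",")));     // limit 0: trailing empty strings dropped
        System.out.println(show("a,b,,".split(",", -1))); // negative limit: trailing empties kept
        System.out.println(show("a,b,,".split(",", 2)));  // at most 2 pieces
        System.out.println(show("abc".split(",")));       // no match: the whole string
        System.out.println("a.b".split(".").length);      // "." matches every char, nothing is left
        System.out.println(show("a.b".split("\\.")));     // escaped dot is a literal
    }
}
```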

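Then come join for joining strings, toUpperCase and toLowerCase for case conversion, toString, toCharArray for conversion to a char array, format for formatting, the various valueOf conversion methods, and so on.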

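Last is trim(), which removes leading and trailing characters whose code is less than or equal to that of the space character (' ', U+0020), such as spaces, tabs, and newlines. The algorithm finds the first and last positions holding a character greater than a space and returns the substring between them, or this if there is nothing to trim.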

public String trim() {
        int len = value.length;
        int st = 0;
        char[] val = value;    /* avoid getfield opcode */

        while ((st < len) && (val[st] <= ' ')) {
            st++;
        }
        while ((st < len) && (val[len - 1] <= ' ')) {
            len--;
        }
        return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
    }
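The same two-pointer loop can be written against the public API to make its behavior easy to test. trimSketch is a hypothetical re-implementation for illustration; note that a full-width space (U+3000) is greater than ' ' and is therefore not removed.

```java
class TrimDemo {
    // Mirrors the loop in trim(): advance start and pull back end past chars <= ' '.
    static String trimSketch(String s) {
        int end = s.length();
        int start = 0;
        while (start < end && s.charAt(start) <= ' ') {
            start++;
        }
        while (start < end && s.charAt(end - 1) <= ' ') {
            end--;
        }
        return (start > 0 || end < s.length()) ? s.substring(start, end) : s;
    }

    public static void main(String[] args) {
        System.out.println("[" + trimSketch(" \t hello \n") + "]");
        System.out.println(trimSketch(" \t hello \n").equals(" \t hello \n".trim()));
        System.out.println("\u3000x\u3000".trim().length()); // full-width spaces are kept
    }
}
```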