Why did String store its data in a final char[] value through JDK 8, but switch to byte[] in JDK 9? (A detailed explanation)

Original article · last modified 2023-08-17 17:12:41

License: CC 4.0 BY-SA

Tags: #java #programming-language #String #StringTable

First published 2023-08-17 14:23:35

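This article discusses a memory optimization in Java's String class: JEP 254 replaces the UTF-16 char array with a byte[] plus an encoding flag, which saves space and reduces GC activity while staying backward compatible. It also covers the JEP's motivation, testing plan, and potential risks.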

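1. Immutability of String

String: a character string, written between a pair of double quotes, e.g. String s1 = "丁总"; or String s2 = new String("丁总");
String is declared final, so it cannot be subclassed.
String implements Serializable, so strings can be serialized. It also implements Comparable, so strings can be ordered.
Through JDK 8, String internally defined final char[] value to hold the character data. JDK 9 changed this field to byte[].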
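
One way to check this field change on a given runtime is reflection. The following is a minimal sketch, assuming an OpenJDK-style String whose private field is named value (an implementation detail, not part of the public API). The class name StringValueFieldType and the helper valueFieldType are made up for this illustration.

```java
import java.lang.reflect.Field;

public class StringValueFieldType {

    // Returns the declared type of String's private "value" field.
    // Only the Field metadata is read, so setAccessible(true) is not needed.
    // Assumption: the runtime's String class has a field named "value".
    static Class<?> valueFieldType() {
        try {
            Field value = String.class.getDeclaredField("value");
            return value.getType();
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException("This runtime's String has no field named value", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("String.value type = " + valueFieldType().getSimpleName());
    }
}
```

On JDK 9 or later this should report byte[], and on JDK 8 it should report char[].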

JEP 254: Compact Strings

Author: Brent Christian
Owner: Xueming Shen
Type: Feature
Scope: Implementation
Status: Closed / Delivered
Release: 9
Component: core-libs / java.lang
Discussion: core dash libs dash dev at openjdk dot java dot net
Effort: L
Duration: XL
Relates to:
    JEP 192: String Deduplication in G1
    8144691: JEP 254: Compact Strings: endiannes mismatch in Java source code and intrinsic
    JEP 250: Store Interned Strings in CDS Archives
    JEP 280: Indify String Concatenation
Reviewed by: Aleksey Shipilev, Brian Goetz, Charlie Hunt
Endorsed by: Brian Goetz
Created: 2014/08/04 21:54
Updated: 2022/04/11 23:06
Issue: 8054307

Summary

Adopt a more space-efficient internal representation for strings.

Goals

Improve the space efficiency of the String class and related classes while maintaining performance in most scenarios and preserving full compatibility for all related Java and native interfaces.

Non-Goals

It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings. A subsequent JEP may explore that approach.

Motivation

The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.

Description

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.

The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.

For further detail, see:

State of String Density Performance
String Density Impact on SPECjbb2005 on SPARC

Alternatives

We tried a "compressed strings" feature in JDK 6 update releases, enabled by an -XX flag. When enabled, String.value was changed to an Object reference and would point either to a byte array, for strings containing only 7-bit US-ASCII characters, or else a char array. This implementation was not open-sourced, so it was difficult to maintain and keep in sync with the mainline JDK source. It has since been removed.

Testing

Thorough compatibility and regression testing will be essential for a change to such a fundamental part of the platform.

We will also need to confirm that we have fulfilled the performance goals of this project. Analysis of memory savings will need to be done. Performance testing should be done using a broad range of workloads, ranging from focused microbenchmarks to large-scale server workloads.

We will encourage the entire Java community to perform early testing with this change in order to identify any remaining issues.

Risks and Assumptions

Optimizing character storage for memory may well come with a trade-off in terms of run-time performance. We expect that this will be offset by reduced GC activity and that we will be able to maintain the throughput of typical server benchmarks. If not, we will investigate optimizations that can strike an acceptable balance between memory saving and run-time performance.

Other recent projects have already reduced the heap space used by strings, in particular JEP 192: String Deduplication in G1. Even with duplicates eliminated, the remaining string data can be made to consume less space if encoded more efficiently. We are assuming that this project will still provide a benefit commensurate with the effort required.

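2. Motivation

Before JDK 9, String kept its characters in a char array, using two bytes (16 bits) per character. Data collected from many different applications shows that strings are a major component of heap usage and that most String objects contain only Latin-1 characters. Those characters need only one byte each, so half of the space in the internal char arrays of such String objects goes unused.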
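
The "half of the space" figure can be checked with simple arithmetic. The sketch below encodes the same Latin-1 text with the standard ISO_8859_1 and UTF_16BE charsets and compares the byte counts. It measures only the character payload, not array headers or object overhead, and the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class Latin1VersusUtf16 {

    // Bytes needed when every character is stored in one byte (Latin-1).
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed when every char is stored in two bytes (UTF-16, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println("Latin-1 bytes: " + latin1Bytes(s)); // 5
        System.out.println("UTF-16 bytes:  " + utf16Bytes(s));  // 10
    }
}
```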

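3. Description

JEP 254 changes the internal representation of String from a UTF-16 char array to a byte array plus an encoding-flag field. Depending on its contents, a string is stored either as ISO-8859-1/Latin-1 (one byte per character) or as UTF-16 (two bytes per character), and the flag records which encoding is in use.
String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer use the same representation, as do the HotSpot VM's intrinsic string operations.
This is purely an implementation change: no existing public interface changes, and no new public APIs or other interfaces are added.
The prototyping work confirmed the expected reduction in memory footprint, a substantial reduction in GC activity, and minor performance regressions in some corner cases.
For further detail, see: State of String Density Performance, and String Density Impact on SPECjbb2005 on SPARC.
Conclusion: String no longer stores its data in a char[]. It now uses a byte[] plus an encoding flag, which halves the character storage of Latin-1-only strings.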
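
To make the "byte array plus encoding flag" idea concrete, here is a simplified sketch. It is not the JDK's implementation: the class CompactText and its methods are hypothetical, and the big-endian byte order used for the UTF-16 case is simply a choice made for this example.

```java
public class CompactText {

    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    private final byte[] value;   // the character data
    private final byte coder;     // records which encoding value uses

    CompactText(String s) {
        // Latin-1 is usable only if every char fits in a single byte.
        boolean latin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            coder = LATIN1;
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                value[i] = (byte) s.charAt(i);
            }
        } else {
            coder = UTF16;
            value = new byte[s.length() * 2];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8); // high byte first (sketch's choice)
                value[2 * i + 1] = (byte) c;    // low byte
            }
        }
    }

    // Number of chars: the byte count, halved when two bytes are used per char.
    int length() {
        return value.length >> coder;
    }

    char charAt(int i) {
        if (coder == LATIN1) {
            return (char) (value[i] & 0xFF);
        }
        return (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF));
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    public static void main(String[] args) {
        CompactText latin = new CompactText("hello");
        CompactText wide = new CompactText("\u4e01\u603b"); // the two characters of "丁总"
        System.out.println(latin.coder() + " " + latin.storedBytes());
        System.out.println(wide.coder() + " " + wide.storedBytes());
    }
}
```

In this sketch "hello" takes 5 bytes with the LATIN1 flag, while the two-character Chinese string takes 4 bytes with the UTF16 flag. A single character outside Latin-1 switches the whole string to two bytes per character.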

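4. Basic characteristics of String

String represents an immutable sequence of characters. In short: immutability.

When a string variable is reassigned, it is pointed at a different memory region holding the new value. The original value is not overwritten.
When an existing string is concatenated with another, the result is likewise placed in a new memory region. The original value is not modified.
When replace() is called to change a character or substring, the result again goes into a new memory region. The original value is not modified.

When a string is assigned from a literal (as opposed to new), the string value lives in the string constant pool.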

package string;

import org.junit.Test;

public class StringTest1 {

    // Identical literals refer to the same object in the string constant pool.
    @Test
    public void test1() {
        String s1 = "abc";
        String s2 = "abc";

        System.out.println(s1 == s2); // true
        System.out.println(s1);       // abc
        System.out.println(s2);       // abc
    }

    // Reassigning s1 points it at a different object; s2 is unaffected.
    @Test
    public void test2() {
        String s1 = "abc";
        String s2 = "abc";
        s1 = "hello";

        System.out.println(s1 == s2); // false
        System.out.println(s1);       // hello
        System.out.println(s2);       // abc
    }

    // Concatenation produces a new String; the shared "abc" is not modified.
    @Test
    public void test3() {
        String s1 = "abc";
        String s2 = "abc";
        s2 += "def";
        System.out.println(s2); // abcdef
        System.out.println(s1); // abc
    }

    // replace() returns a new String; s1 keeps its original value.
    @Test
    public void test4() {
        String s1 = "abc";
        String s2 = s1.replace('a', 'm');
        System.out.println(s1); // abc
        System.out.println(s2); // mbc
    }
}
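
The tests above compare only literals. For contrast, the short sketch below shows how a string created with new differs from a literal with the same contents. The class name LiteralVersusNew and the helper sameReference are illustrative.

```java
public class LiteralVersusNew {

    // True only when both variables refer to the very same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "abc";             // taken from the string constant pool
        String created = new String("abc"); // a distinct object on the heap

        System.out.println(sameReference(literal, created));          // false
        System.out.println(literal.equals(created));                  // true
        System.out.println(sameReference(literal, created.intern())); // true
    }
}
```

intern() returns the pooled instance with the same contents, which is why the last comparison is true.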

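5. Interview question

package string;

public class StringExer {

    String str = new String("good");
    char[] ch = {'t', 'e', 's', 't'};

    // Java passes references by value: the parameters are copies of the caller's references.
    public void change(String str, char[] ch) {
        str = "test ok"; // rebinds only the local parameter; the field still refers to "good"
        ch[0] = 'b';     // writes through the shared reference, so the caller's array changes
    }

    public static void main(String[] args) {
        StringExer ex = new StringExer();
        ex.change(ex.str, ex.ch);
        System.out.println(ex.str); // good
        System.out.println(ex.ch);  // best
    }
}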

Output:

good
best