Why Is a Simple Map Implemented in C++ Slower Than in Java? An In-Depth Look at JIT and AOT Compilation Strategies

lxyzcm

Last modified: 2024-12-18 18:21:46

Tags: programming languages, java, c++, data structures and algorithms

First published: 2024-12-18 18:19:55

Article link: https://blog.youkuaiyun.com/weixin_61470881/article/details/144566309

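Research goal

To explore the performance differences between just-in-time (JIT) and ahead-of-time (AOT) compilation strategies, and to understand the strengths and weaknesses of each. To be clear, the purpose of this study is not to prove that one language is slower or worse than another.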

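Overview of results

In our tests we observed the following:

HotSpot JVM 23 with JIT compilation (using JVMCI and C2) performed best.
The other three configurations were all comparatively slow:
- The C++ version compiled with Clang 18
- GraalVM 23 compiled with native-image
- HotSpot JVM 23 with the -Xcomp flag (JVMCI and C2)
We want to understand the reasons behind this, and to find possible ways of improving the C++ version so that it matches the JIT-compiled Java result.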

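Benchmark design

Our benchmark compares the exact same simple hash table (map) implementation in Java and C++. We made sure the two implementations correspond at the source level, line by line. Note that we are not comparing the standard library hash tables (java.util.HashMap and std::unordered_map), because their source implementations are not equivalent to begin with.

Test parameters:

Hash table capacity: 20,000 buckets
Number of objects inserted: 2,000,000
Test flow: insert once, clear the map, then insert again

This design is meant to exercise the hash table's internal object pool:

On the first insertion, objects are allocated on the heap
On the second insertion, objects are reused from the internal object pool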

Performance results

HotSpot JVM (using the Graal JVMCI JIT)

PUT => Avg: 371 ns | Min: 28 ns | 99.9% = [avg: 367 ns, max: 1.743 micros]
PUT => Avg: 613 ns | Min: 27 ns | 99.9% = [avg: 606 ns, max: 2.184 micros]
GET => Avg: 615 ns | Min: 14 ns | 99.9% = [avg: 607 ns, max: 2.549 micros]
DEL => Avg: 662 ns | Min: 18 ns | 99.9% = [avg: 658 ns, max: 2.538 micros]

HotSpot JVM (using the C2 JIT)

PUT => Avg: 342 ns | Min: 29 ns | 99.9% = [avg: 338 ns, max: 1.661 micros]
PUT => Avg: 596 ns | Min: 28 ns | 99.9% = [avg: 589 ns, max: 2.161 micros]
GET => Avg: 599 ns | Min: 20 ns | 99.9% = [avg: 592 ns, max: 2.275 micros]
DEL => Avg: 826 ns | Min: 23 ns | 99.9% = [avg: 817 ns, max: 3.420 micros]

C++ LLVM (clang)

PUT => Avg: 726 ns | Min: 30 ns | 99.9% = [avg: 720 ns, max: 4.097 micros]
PUT => Avg: 857 ns | Min: 18 ns | 99.9% = [avg: 848 ns, max: 2.933 micros]
GET => Avg: 874 ns | Min: 18 ns | 99.9% = [avg: 865 ns, max: 3.010 micros]
DEL => Avg: 875 ns | Min: 19 ns | 99.9% = [avg: 871 ns, max: 2.810 micros]

GraalVM (native-image)

PUT => Avg: 190 ns | Min: 21 ns | 99.9% = [avg: 183 ns, max: 814 ns]
PUT => Avg: 659 ns | Min: 23 ns | 99.9% = [avg: 656 ns, max: 2.762 micros]
GET => Avg: 399 ns | Min: 21 ns | 99.9% = [avg: 396 ns, max: 2.124 micros]
DEL => Avg: 323 ns | Min: 27 ns | 99.9% = [avg: 321 ns, max: 1.850 micros]
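To help read the "99.9% = [avg, max]" columns: judging by the Bench class listed later in this article, each figure summarizes the fastest 99.9% of all samples, taken from a sorted histogram of elapsed times. The following is a minimal sketch of that calculation, not the article's code. The names summarize and PercentileSummary are made up for illustration, and unlike the original it clamps the cut-off to at least one sample instead of skipping the line.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>

// Summary of the fastest `perc` fraction of all samples (hypothetical type).
struct PercentileSummary {
    double avgNanos;        // mean of the samples up to the cut-off
    std::int64_t maxNanos;  // slowest sample that is still inside the cut-off
};

// `histogram` maps an elapsed time in nanoseconds to how often it occurred.
// std::map keeps its keys sorted, so iteration runs from fastest to slowest.
// `sampleCount` must equal the sum of all counts in the histogram.
inline PercentileSummary summarize(const std::map<std::int64_t, std::int64_t>& histogram,
                                   std::int64_t sampleCount, double perc) {
    assert(!histogram.empty() && sampleCount > 0 && perc > 0.0 && perc <= 1.0);

    // Number of samples that fall inside the percentile, clamped to [1, sampleCount].
    std::int64_t target = std::llround(perc * static_cast<double>(sampleCount));
    target = std::max<std::int64_t>(1, std::min(target, sampleCount));

    std::int64_t seen = 0;
    std::int64_t sum = 0;
    std::int64_t maxNanos = 0;
    for (const auto& bucket : histogram) {
        // Take the whole bucket, or only the part that still fits under target.
        const std::int64_t take = std::min(bucket.second, target - seen);
        seen += take;
        sum += take * bucket.first;
        maxNanos = bucket.first;
        if (seen == target) break;
    }
    return {static_cast<double>(sum) / static_cast<double>(seen), maxNanos};
}
```

With four samples {10, 10, 20, 100} ns and perc = 0.75, the cut-off covers the three fastest samples, so the outlier of 100 ns affects neither the average nor the maximum.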

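Performance analysis and optimization suggestions

Key findings

The C++ version is still slower than Java on the second PUT. This is especially notable because at that point every object should come from the object pool, with no new memory allocation involved.
From analyzing the machine code, we found:
- The main loop in both versions is a linear search of a linked list
- Java uses 32-bit compressed references, while C++ uses native pointers
- Even when compiled to 32-bit code with clang++ -O3 ... -m32, the C++ version is still slower
- Java objects are at least 24 bytes, not just 12 bytes (2 references + one int)
The performance difference may be related to the following factors:
- Memory allocation patterns
- Cache misses
- Java executes more machine instructions inside the loop
- Java has to shift a 32-bit value left by 3 bits to convert it into a pointer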
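To make the "shift left by 3" step and the 24-byte figure concrete, here is a minimal sketch in C++. It is a simplified model, not HotSpot's actual implementation. The function names, the explicit heapBase parameter and the 12-byte object header are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Objects are 8-byte aligned, so the low 3 bits of any offset from the heap
// base are zero and a 32-bit reference can address 2^32 * 8 bytes = 32 GiB.
constexpr unsigned kAlignmentShift = 3;

// Hypothetical encoder: store the offset from the heap base, divided by 8.
inline std::uint32_t compress(std::uint64_t heapBase, std::uint64_t address) {
    assert(address >= heapBase && (address - heapBase) % 8 == 0);
    assert(((address - heapBase) >> kAlignmentShift) <= UINT32_MAX);
    return static_cast<std::uint32_t>((address - heapBase) >> kAlignmentShift);
}

// The decode step paid when following a reference: one shift and one add.
inline std::uint64_t decompress(std::uint64_t heapBase, std::uint32_t ref) {
    return heapBase + (static_cast<std::uint64_t>(ref) << kAlignmentShift);
}

constexpr std::size_t alignUp8(std::size_t n) { return (n + 7) & ~std::size_t{7}; }

// Size arithmetic under the stated ASSUMPTION of a 12-byte object header.
constexpr std::size_t kAssumedHeaderBytes = 12;
constexpr std::size_t kFieldBytes = 4 + 4 + 4;  // 2 compressed references + 1 int
constexpr std::size_t kAssumedEntryBytes = alignUp8(kAssumedHeaderBytes + kFieldBytes);
static_assert(kAssumedEntryBytes == 24, "12-byte header + 12 bytes of fields");
```

Under these assumptions the fields alone take 12 bytes, and the header brings the node to 24 bytes, which matches the figure quoted above.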

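Optimization directions

Custom allocator: as mentioned in the comments, the main improvement for C++ may lie in using a custom allocator for the hash table's internal Entry objects. Allocating objects on the heap in Java may be faster than C++'s new operator, and a custom allocator can close this gap.
Memory layout optimization: consider improving the data structure's memory layout to make better use of the cache.
Compiler optimization: explore more Clang optimization options, especially those targeting pointer operations and memory access patterns.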
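One possible shape for such a custom allocator is a bump (arena) allocator that carves Entry nodes out of large contiguous chunks. The sketch below is only an illustration. The Entry layout, the EntryArena name and the chunk size of 4096 are assumptions, since the article does not show the map's real node type.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical node type; the real Entry of the benchmarked map is not shown.
struct Entry {
    int key;
    void* value;
    Entry* next;
};

// Bump allocator: hands out Entry slots from large contiguous chunks, so
// nodes allocated one after another also sit next to each other in memory.
class EntryArena {
public:
    explicit EntryArena(std::size_t entriesPerChunk = 4096)
        : entriesPerChunk_(entriesPerChunk), used_(entriesPerChunk) {
        assert(entriesPerChunk > 0);
    }

    Entry* allocate(int key, void* value, Entry* next) {
        if (used_ == entriesPerChunk_) {  // current chunk exhausted (or none yet)
            chunks_.emplace_back(new Entry[entriesPerChunk_]);
            used_ = 0;
        }
        Entry* e = &chunks_.back()[used_++];
        e->key = key;
        e->value = value;
        e->next = next;
        return e;
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::size_t entriesPerChunk_;
    std::size_t used_;                              // slots used in the last chunk
    std::vector<std::unique_ptr<Entry[]>> chunks_;  // owns all nodes; freed in bulk
};
```

Individual nodes are never freed here; everything is released when the arena is destroyed. That fits a map that already recycles nodes through its own free list. Whether this actually closes the gap with the JVM would have to be measured.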

Test environment

Operating system: Ubuntu 18.04.6 LTS
Processor: Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz
Architecture: x86_64

Compiler versions:
- clang++ 18.1.0
- Java 23.0.1 (Oracle GraalVM)
- native-image 23.0.1

Build commands

Compiling the C++ code

rm -f target/cpp/int_map_benchmark target/cpp/int_map.o target/cpp/bench.o target/cpp/int_map_benchmark.o
mkdir -p target/cpp
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map.cpp -o ./target/cpp/int_map.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/bench.cpp -o ./target/cpp/bench.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map_benchmark.cpp -o ./target/cpp/int_map_benchmark.o
clang++ -Ofast -march=native -flto -std=c++17 -o ./target/cpp/int_map_benchmark ./target/cpp/int_map.o ./target/cpp/bench.o ./target/cpp/int_map_benchmark.o

Run command

#!/bin/bash
WARMUP=${1:-0}
MEASUREMENTS=${2:-2000000}
CAPACITY=${3:-20000}
./target/cpp/int_map_benchmark $WARMUP $MEASUREMENTS $CAPACITY

Complete code

bench.hpp

#ifndef BENCH_HPP
#define BENCH_HPP

#include <chrono>
#include <iostream>
#include <limits>
#include <iomanip>
#include <string>
#include <cmath>
#include <map>
#include <sstream>
#include <utility>

class Bench {
public:
    Bench(int warmupCount = 0);
    ~Bench();

    void mark();
    void measure();
    bool measure(long long);
    void reset();
    void reset(bool);
    void printResults() const;
    void printResults(bool) const;
    bool isWarmingUp() const;
    int getIterations() const;
    int getMeasurements() const;
    double getAverage() const;

private:
    int warmupCount;
    int measurementCount;
    long long sum;
    long long minTime;
    long long maxTime;
    int size;
    std::map<long long, long long>* results;
    std::chrono::steady_clock::time_point startTime;
    
    static std::string formatWithCommas(long long value);
    static std::pair<double, std::string> formatTime(double nanos);
    static std::string formatPercentage(double perc);
    static double roundToDecimals(double d, int decimals);
    void printPercentiles() const;
    void addPercentile(double perc) const;
    double avg() const;
};

#endif // BENCH_HPP

bench.cpp

#include "bench.hpp"
using namespace std;

Bench::Bench(int warmupCount)
    : warmupCount(warmupCount),
      measurementCount(0),
      sum(0),
      minTime(numeric_limits<long long>::max()),
      maxTime(numeric_limits<long long>::min()),
      size(0) {

        results = new map<long long, long long>();

}

Bench::~Bench() {
    delete results;
}

void Bench::mark() {
    startTime = chrono::steady_clock::now();
}

void Bench::measure() {
    auto endTime = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<chrono::nanoseconds>(endTime - startTime).count();
    measure(elapsed);
}

bool Bench::measure(long long elapsed) {

    bool isToMeasure = ++measurementCount > warmupCount;

    if (isToMeasure) {
        sum += elapsed;
        if (elapsed < minTime) minTime = elapsed;
        if (elapsed > maxTime) maxTime = elapsed;

        // Increment the frequency of this elapsed time
        auto it = results->find(elapsed);
        if (it == results->end()) {
            results->insert({elapsed, 1});
        } else {
            it->second++;
        }
        size++;
    }
    
    return isToMeasure;
}

int Bench::getIterations() const {
    return measurementCount;
}

int Bench::getMeasurements() const {
    return size;
}

void Bench::reset() {
    reset(false);
}

void Bench::reset(bool repeatWarmup) {
    measurementCount = 0;
    sum = 0;
    if (!repeatWarmup) warmupCount = 0;
    minTime = numeric_limits<long long>::max();
    maxTime = numeric_limits<long long>::min();
    results->clear();
    size = 0;
}

bool Bench::isWarmingUp() const {
    // Still warming up while fewer than warmupCount measurements have been taken
    return measurementCount < warmupCount;
}

double Bench::avg() const {
    const int effectiveCount = measurementCount - warmupCount;
    if (effectiveCount <= 0) {
        return 0;
    }
    const double avg = static_cast<double>(sum) / effectiveCount;
    const double rounded = round(avg * 100.0) / 100.0;
    return rounded;
}

double Bench::getAverage() const {
    return avg();
}    

void Bench::printResults() const {
    printResults(true);
}

void Bench::printResults(bool includePercentiles) const {

    int effectiveCount = measurementCount - warmupCount;

    string effCountStr = formatWithCommas(effectiveCount);
    string warmupStr = formatWithCommas(warmupCount);
    string totalStr = formatWithCommas(measurementCount);

    cout << "Measurements: " << effCountStr
         << " | Warm-Up: " << warmupStr
         << " | Iterations: " << totalStr << endl;
         
    if (effectiveCount > 0) {

        auto [avgVal, avgUnit] = formatTime(avg());
        auto [minVal, minUnit] = formatTime(static_cast<double>(minTime));
        auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTime));
    
        cout << fixed << setprecision(3);
        cout << "Avg Time: " << avgVal << " " << avgUnit << " | "
             << "Min Time: " << minVal << " " << minUnit << " | "
             << "Max Time: " << maxVal << " " << maxUnit << endl;
    
        if (includePercentiles) printPercentiles();
    }
    
    cout << endl;
}

string Bench::formatWithCommas(long long value) {
    string numStr = to_string(value);
    int insertPosition = static_cast<int>(numStr.length()) - 3;
    while (insertPosition > 0) {
        numStr.insert(insertPosition, ",");
        insertPosition -= 3;
    }
    return numStr;
}

pair<double, string> Bench::formatTime(double nanos) {
    if (nanos >= 1'000'000'000.0) {
        double seconds = nanos / 1'000'000'000.0;
        return {roundToDecimals(seconds, 3), seconds > 1 ? "seconds" : "second"};
    } else if (nanos >= 1'000'000.0) {
        double millis = nanos / 1'000'000.0;
        return {roundToDecimals(millis, 3), millis > 1 ? "millis" : "milli"};
    } else if (nanos >= 1000.0) {
        double micros = nanos / 1000.0;
        return {roundToDecimals(micros, 3), micros > 1 ? "micros" : "micro"};
    } else {
        double ns = nanos;
        return {roundToDecimals(ns, 3), ns > 1 ? "nanos" : "nano"};
    }
}

double Bench::roundToDecimals(double d, int decimals) {
    double pow10 = pow(10.0, decimals);
    return round(d * pow10) / pow10;
}

void Bench::printPercentiles() const {

    if (size == 0) return;

    double percentiles[] = {0.75, 0.90, 0.99, 0.999, 0.9999, 0.99999};

    for (double p : percentiles) {
        addPercentile(p);
    }
}

string Bench::formatPercentage(double perc) {
    double p = perc * 100.0;

    ostringstream oss;
    oss << fixed << setprecision(6) << p;

    string s = oss.str();
    // remove trailing zeros
    while (s.back() == '0') {
        s.pop_back();
    }

    // if the last character is now a '.', remove it
    if (s.back() == '.') {
        s.pop_back();
    }

    // Append the '%' sign
    s += "%";

    return s;
}

void Bench::addPercentile(double perc) const {

    if (results->empty()) return;

    long long target = static_cast<long long>(llround(perc * size));
    if (target == 0) return;
    if (target > size) target = size;

    // Iterate through the map to find the top element at position target
    long long iTop = 0;
    long long sumTop = 0;
    long long maxTop = -1;

    for (auto &entry : *results) {
        long long time = entry.first;
        long long count = entry.second;

        for (long long i = 0; i < count; i++) {
            iTop++;
            sumTop += time;
            if (iTop == target) {
                maxTop = time;
                goto FOUND;
            }
        }
    }

FOUND:;

    double avgTop = static_cast<double>(sumTop) / iTop;
    auto [avgVal, avgUnit] = formatTime(avgTop);
    auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTop));

    cout << fixed << setprecision(3);
    cout << formatPercentage(perc) << " = [avg: " << avgVal << " " << avgUnit
         << ", max: " << maxVal << " " << maxUnit << "]\n";
}

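In-depth performance analysis

Code design characteristics

Pointer usage
- The C++ version uses pointers heavily to mimic Java semantics
- The stored "values" are all pointers to the same dummy object
- put takes the object by const reference (a pointer at the assembly level)
- get returns a non-const pointer
Hash table implementation
- An array of buckets, each holding a linked list
- The load factor is 100 (2,000,000 entries in 20,000 buckets), which effectively turns the test into a linked-list benchmark
- Each operation takes long enough to be measured with Linux's clock_gettime
Memory management
- A free list of unbounded size
- Nodes are deleted only in the destructor
- remove() moves a node from the bucket's linked list to the head of the free list
- put takes the node at the head of the free list first and calls new only when the free list is empty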
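The design described above can be sketched roughly as follows. This is not the article's int_map.cpp, which is not listed here. The class name IntMap, the modulo hash, the pointer-based put signature and the heapAllocations counter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the benchmarked map: int keys, pointer values.
template <typename V>
class IntMap {
    struct Entry { int key; V* value; Entry* next; };

public:
    explicit IntMap(std::size_t capacity) : buckets_(capacity) {
        assert(capacity > 0);
    }
    IntMap(const IntMap&) = delete;
    IntMap& operator=(const IntMap&) = delete;

    ~IntMap() {  // the only place where nodes are actually deleted
        clear();
        while (freeHead_ != nullptr) {
            Entry* next = freeHead_->next;
            delete freeHead_;
            freeHead_ = next;
        }
    }

    // Inserts or updates; returns the previous value, or nullptr if absent.
    V* put(int key, V* value) {
        Entry*& head = buckets_[indexOf(key)];
        for (Entry* e = head; e != nullptr; e = e->next) {  // linear chain search
            if (e->key == key) {
                V* old = e->value;
                e->value = value;
                return old;
            }
        }
        Entry* e = freeHead_;
        if (e != nullptr) {
            freeHead_ = e->next;  // reuse a pooled node
        } else {
            e = new Entry;        // pool is empty: allocate on the heap
            ++heapAllocations_;
        }
        e->key = key;
        e->value = value;
        e->next = head;
        head = e;
        ++size_;
        return nullptr;
    }

    V* get(int key) const {
        for (Entry* e = buckets_[indexOf(key)]; e != nullptr; e = e->next) {
            if (e->key == key) return e->value;
        }
        return nullptr;
    }

    // Unlinks the node and pushes it onto the head of the free list.
    V* remove(int key) {
        Entry** link = &buckets_[indexOf(key)];
        for (Entry* e = *link; e != nullptr; link = &e->next, e = *link) {
            if (e->key == key) {
                *link = e->next;
                V* old = e->value;
                release(e);
                --size_;
                return old;
            }
        }
        return nullptr;
    }

    void clear() {  // moves every node to the free list; nothing is deleted
        for (Entry*& head : buckets_) {
            while (head != nullptr) {
                Entry* next = head->next;
                release(head);
                head = next;
            }
        }
        size_ = 0;
    }

    std::size_t size() const { return size_; }
    std::size_t heapAllocations() const { return heapAllocations_; }

private:
    std::size_t indexOf(int key) const { return static_cast<unsigned>(key) % buckets_.size(); }
    void release(Entry* e) { e->next = freeHead_; freeHead_ = e; }

    std::vector<Entry*> buckets_;
    Entry* freeHead_ = nullptr;
    std::size_t size_ = 0;
    std::size_t heapAllocations_ = 0;
};
```

In this sketch, inserting the same keys again after clear() leaves heapAllocations() unchanged, which is the behavior the second PUT pass of the benchmark relies on.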

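Analysis of the performance difference

Memory allocation patterns
- Java's GC and memory allocation may be more efficient than C++'s new
- The second round of put operations is still slower, which shows that memory allocation is not the whole story
Cache effects
- Cache misses during linked-list traversal may be the main bottleneck
- Java's object layout may provide better cache locality in some cases
Pointer operation overhead
- Java uses 32-bit compressed references
- C++ uses native pointers (64 bits on a 64-bit system)
- Even when compiled as 32-bit, the C++ version is still slower
Compiler optimization
- The JIT compiler may optimize certain patterns better
- Runtime optimization may produce better code than static compilation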

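Optimization suggestions

Custom memory allocator
- Implement a dedicated allocator for Entry objects
- Consider a memory pool or arena allocation strategy
- Optimize memory layout to improve cache efficiency
Data structure optimization
- Consider open addressing instead of linked lists
- Tune the load factor to reduce collisions
- Implement a more compact memory layout
Compiler options
- Explore more Clang optimization flags
- Consider profile-guided optimization
- Test different inlining strategies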
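As an illustration of the open-addressing suggestion, here is a minimal linear-probing sketch. It is a different data structure from the benchmarked map, not a drop-in replacement. The name FlatIntMap and the modulo hash are assumptions, and deletion is left out because it needs tombstones or backward shifting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical open-addressing variant: keys and values live directly in one
// contiguous array, so a lookup scans adjacent slots instead of chasing pointers.
template <typename V>
class FlatIntMap {
    struct Slot {
        int key;
        V* value;
        bool used;
    };

public:
    // capacity must be larger than the number of entries that will be stored
    explicit FlatIntMap(std::size_t capacity) : slots_(capacity, Slot{0, nullptr, false}) {
        assert(capacity > 0);
    }

    // Inserts or updates; returns the previous value, or nullptr if absent.
    V* put(int key, V* value) {
        const std::size_t i = find(key);
        if (slots_[i].used) {
            V* old = slots_[i].value;
            slots_[i].value = value;
            return old;
        }
        // Always keep at least one empty slot so that probing terminates.
        assert(size_ + 1 < slots_.size());
        slots_[i] = Slot{key, value, true};
        ++size_;
        return nullptr;
    }

    V* get(int key) const {
        const Slot& s = slots_[find(key)];
        return s.used ? s.value : nullptr;
    }

    std::size_t size() const { return size_; }

private:
    // Index of the slot holding key, or of the first empty slot on its probe path.
    std::size_t find(int key) const {
        std::size_t i = static_cast<unsigned>(key) % slots_.size();
        while (slots_[i].used && slots_[i].key != key) {
            i = (i + 1 == slots_.size()) ? 0 : i + 1;  // wrap around
        }
        return i;
    }

    std::vector<Slot> slots_;
    std::size_t size_ = 0;
};
```

Open addressing only works with a load factor below 1, so the benchmark's setting of 2,000,000 entries in 20,000 buckets could not be kept. The capacity would have to exceed the number of entries.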

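Compiler flags explained

-Ofast: enables the most aggressive optimization level (everything in -O3 plus optimizations, such as fast-math, that may violate strict standards compliance)
-march=native: optimizes for the local CPU architecture
-flto: enables link-time optimization
-std=c++17: uses the C++17 standard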