Direct ByteBuffer relative vs. absolute read performance

performance java jvm microbenchmark jmh

2022-09-04 02:07:16

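While testing the read performance of a direct java.nio.ByteBuffer, I noticed that absolute reads are on average about 2x faster than relative reads. When I compare the source code of the relative and absolute reads, the code is almost identical, except that the relative read maintains an internal counter (the buffer position). Why do I see such a large difference in speed?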
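For readers less familiar with the two access styles, here is a minimal sketch of the difference in behaviour (not a benchmark). The class and method names are made up for illustration; only the java.nio.ByteBuffer calls are real API.

```java
import java.nio.ByteBuffer;

public class RelativeVsAbsoluteDemo {

    /** Relative read: consumes 8 bytes and advances the buffer's position. */
    public static int positionAfterRelativeRead(ByteBuffer buf) {
        buf.getLong();
        return buf.position();
    }

    /** Absolute read: reads at the given index and leaves the position untouched. */
    public static int positionAfterAbsoluteRead(ByteBuffer buf, int index) {
        buf.getLong(index);
        return buf.position();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        buf.putLong(1L).putLong(2L);
        buf.rewind(); // position back to 0

        System.out.println(positionAfterAbsoluteRead(buf, 8)); // prints 0
        System.out.println(positionAfterRelativeRead(buf));    // prints 8
    }
}
```

The relative read therefore has a side effect on the buffer object, while the absolute read does not.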

Here is the source code of my JMH benchmark:

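import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DirectByteBufferReadBenchmark {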

    private static final int OBJ_SIZE = 8 + 4 + 1;
    private static final int NUM_ELEM = 10_000_000;

    @State(Scope.Benchmark)
    public static class Data {

        private ByteBuffer directByteBuffer;

        @Setup
        public void setup() {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * NUM_ELEM);
            for (int i = 0; i < NUM_ELEM; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }



    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadAbsolute(Data d) throws InterruptedException {
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            int index = OBJ_SIZE * i;
            val += d.directByteBuffer.getLong(index);
            d.directByteBuffer.getInt(index + 8);
            d.directByteBuffer.get(index + 12);
        }
        return val;
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadRelative(Data d) throws InterruptedException {
        d.directByteBuffer.rewind();

        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            val += d.directByteBuffer.getLong();
            d.directByteBuffer.getInt();
            d.directByteBuffer.get();
        }

        return val;
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(DirectByteBufferReadBenchmark.class.getSimpleName())
            .warmupIterations(5)
            .measurementIterations(5)
            .forks(3)
            .threads(1)
            .build();

        new Runner(opt).run();
    }
}

These are the results of my benchmark run:

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  88.605 ± 9.276  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15  42.904 ± 3.018  ops/s

The test was run on a MacBook Pro (2.2 GHz Intel Core i7, 16 GB DDR3) with JDK 1.8.0_73.

Update

I ran the same test with JDK 9-ea b134. Both tests show a speed increase of about 10%, but the gap between the two remains similar.

# JMH 1.13 (released 45 days ago)
# VM version: JDK 9-ea, VM 9-ea+134
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/bin/java
# VM options: <none>


Benchmark                                        Mode  Cnt    Score    Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  102.170 ± 10.199  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15   45.988 ±  3.896  ops/s

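Answer 1

JDK 8 does indeed generate worse code for a loop with relative ByteBuffer access.

JMH has a built-in perfasm profiler that prints the generated assembly code for the hottest regions. I used it to compare the compiled testReadAbsolute and testReadRelative, and here are the main differences: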

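1. The relative getLong / getInt / get update the position field of the ByteBuffer. The VM does not optimize these updates away: there are 3 memory writes on each loop iteration.
2. The position range check is not eliminated: the conditional branches on each loop iteration remain in the compiled code.
3. Because the redundant field updates and range checks make the loop body longer, the VM unrolls only 2 iterations of the loop. The compiled version of the loop with absolute access has 16 iterations unrolled.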
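The first two points can be pictured with a deliberately simplified toy model. This is not the JDK source: PositionModel and its fields are hypothetical, and a long[] stands in for off-heap memory. It only shows why a relative read carries an extra field write and a check against mutable state, both of which the JIT must prove redundant before it can remove them.

```java
import java.nio.BufferUnderflowException;

public class PositionModel {
    private final long[] data; // stands in for the buffer's memory
    private final int limit;
    private int position;      // mutable heap field, like the buffer position

    public PositionModel(long[] data) {
        this.data = data;
        this.limit = data.length;
    }

    /** Absolute read: one bounds check on the argument, no state change. */
    public long get(int index) {
        if (index < 0 || index >= limit) {
            throw new IndexOutOfBoundsException();
        }
        return data[index];
    }

    /** Relative read: checks the position field, then writes it back. */
    public long get() {
        if (position >= limit) {
            throw new BufferUnderflowException();
        }
        return data[position++]; // memory write on every call
    }

    public int position() {
        return position;
    }

    public static void main(String[] args) {
        PositionModel m = new PositionModel(new long[] {10L, 20L, 30L});
        System.out.println(m.get(2) + " " + m.position()); // prints "30 0"
        System.out.println(m.get() + " " + m.position());  // prints "10 1"
    }
}
```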

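testReadAbsolute is compiled very well: the main loop just reads 16 longs, sums them up, and jumps to the next iteration if index < 10_000_000 - 16. The state of directByteBuffer is not updated. However, the JVM is not that smart for testReadRelative: it seems it cannot optimize field access of an object that comes from outside the method.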

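A lot of work went into optimizing ByteBuffer in JDK 9. I ran the same test on JDK 9-ea b134 and verified that testReadRelative has no redundant memory writes or range checks. It now runs almost as fast as testReadAbsolute.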

// JDK 1.8.0_92, VM 25.92-b14

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  99,727 ± 0,542  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10  47,126 ± 0,289  ops/s

// JDK 9-ea, VM 9-ea+134

Benchmark                                        Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  109,369 ± 0,403  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10   97,140 ± 0,572  ops/s

Update

To help the JIT compiler with optimization, I introduced a local variable

ByteBuffer directByteBuffer = d.directByteBuffer

in both benchmarks. Otherwise, the extra level of indirection does not allow the compiler to eliminate the ByteBuffer.position field update.
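As a sketch of what that change looks like for the relative-read loop, here is a JMH-free version. The Data holder mirrors the benchmark's state class, while LocalVariableRead, fill and readRelative are hypothetical names used only for this illustration. Whether it changes the generated code depends on the JVM version, so treat it as something to measure rather than a guaranteed speedup.

```java
import java.nio.ByteBuffer;

public class LocalVariableRead {

    static final int OBJ_SIZE = 8 + 4 + 1;

    /** Hypothetical holder mirroring the benchmark's Data state class. */
    public static class Data {
        ByteBuffer directByteBuffer;
    }

    /** Fills a direct buffer with the same (long, int, byte) records as the benchmark. */
    public static Data fill(int numElem) {
        Data d = new Data();
        d.directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * numElem);
        for (int i = 0; i < numElem; i++) {
            d.directByteBuffer.putLong(i);
            d.directByteBuffer.putInt(i);
            d.directByteBuffer.put((byte) (i & 1));
        }
        return d;
    }

    public static long readRelative(Data d, int numElem) {
        // Load the field once into a local, so the loop works on a
        // reference the compiler can see does not change.
        ByteBuffer directByteBuffer = d.directByteBuffer;
        directByteBuffer.rewind();

        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            val += directByteBuffer.getLong();
            directByteBuffer.getInt();
            directByteBuffer.get();
        }
        return val;
    }

    public static void main(String[] args) {
        Data d = fill(1000);
        System.out.println(readRelative(d, 1000)); // prints 499500 (sum of 0..999)
    }
}
```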

Answer 2