Program exceeds theoretical memory transfer rate

performance memory java benchmarking hardware

2022-09-02 20:40:12

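I have a laptop with an Intel Core 2 Duo 2.4 GHz CPU and 2x4 GB DDR3 modules at 1066 MHz.

I expect this memory to operate at 1067 MiB/sec and, since there are two channels, at a maximum of 2134 MiB/sec (provided the OS memory dispatcher allows it).

I made a small Java application to test that: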

import java.util.Arrays;
import java.util.Random;

// Class name chosen for illustration; the original snippet showed only the members.
public class MemoryTest {

    private static final int size = 256 * 1024 * 1024; // 256 MiB
    private static final byte[] storage = new byte[size];

    private static final int s = 1024; // 1 KiB
    private static final int duration = 10; // 10 sec

    public static void main(String[] args) {
        // Write test: copy a 1 KiB buffer to random offsets in storage.
        long start = System.currentTimeMillis();
        Random rnd = new Random();
        byte[] buf1 = new byte[s];
        rnd.nextBytes(buf1);
        long count = 0;
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(buf1, 0, storage, (int) begin, s);
            ++count;
        }
        double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        double speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

        // Read test: copy 1 KiB from random offsets in storage into a buffer.
        byte[] buf2 = new byte[s];
        count = 0;
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(storage, (int) begin, buf2, 0, s);
            Arrays.fill(buf2, (byte) 0);
            ++count;
        }
        totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
    }
}

I expected the result to be below 2134 MiB/sec, but I got the following:

17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)

The speed comes out at almost 3 GiB/sec. How is that possible?

DDR3 module photo

Answer 1

There are several things at work here.

First of all: the formula for the memory transfer rate of DDR3 is

memory clock rate
× 4  (for bus clock multiplier)
× 2  (for data rate)
× 64 (number of bits transferred)
/ 8  (number of bits/byte)
=    memory clock rate × 64 (in MB/s)

For DDR3-1066 (which is clocked at 133⅓ MHz), we obtain a theoretical memory bandwidth of 8533⅓ MB/s or 8138.02083333... MiB/s for single channel, and 17066⅔ MB/s or 16276.0416666... MiB/s for dual channel.
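The arithmetic above can be reproduced with a few lines of Java. This is only a sketch of the formula as stated; the class and method names are made up for illustration, and the result is a theoretical peak, not something a benchmark will reach.

```java
final class DdrBandwidth {
    static final double BYTES_PER_MIB = 1024.0 * 1024.0;

    // Theoretical peak in MB/s (10^6 bytes/s) for a memory clock given in MHz.
    static double peakMBps(double memoryClockMHz, int channels) {
        // x4 bus clock multiplier, x2 double data rate, 64 bits per transfer, 8 bits per byte
        return memoryClockMHz * 4 * 2 * 64 / 8 * channels;
    }

    // Converts MB/s (decimal) to MiB/s (binary).
    static double toMiBps(double mbPerSec) {
        return mbPerSec * 1_000_000 / BYTES_PER_MIB;
    }

    public static void main(String[] args) {
        double clockMHz = 400.0 / 3.0; // DDR3-1066: 133 1/3 MHz
        for (int channels = 1; channels <= 2; channels++) {
            double mb = peakMBps(clockMHz, channels);
            System.out.printf("%d channel(s): %.2f MB/s = %.2f MiB/s%n",
                    channels, mb, toMiBps(mb));
        }
    }
}
```

Passing 200 MHz instead gives the corresponding figures for DDR3-1600.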

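Second: transferring one big chunk of data is faster than transferring many small chunks.

Third: your test ignores caching effects that can occur.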

Fourth: if you take time measurements, you should use System.nanoTime(). This method is more precise.
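As a minimal sketch of that last point, elapsed time can be taken as the difference of two System.nanoTime() readings around the code under test. The helper name timeNanos and the 64 MiB buffer size are assumptions made for this example only.

```java
final class NanoTimer {
    // Runs the task once and returns the elapsed time in nanoseconds.
    // System.nanoTime() is meant for measuring intervals, not for reading the time of day.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] src = new byte[64 * 1024 * 1024]; // illustrative size: 64 MiB
        byte[] dest = new byte[src.length];
        long nanos = timeNanos(() -> System.arraycopy(src, 0, dest, 0, src.length));
        double seconds = nanos / 1_000_000_000d;
        System.out.println("Copied " + src.length + " bytes in " + seconds + " s");
    }
}
```

A single run like this is still exposed to JIT warm-up and caching, so the number it prints should be treated as a rough indication only.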

Here is a rewritten version of your test program¹.

import java.util.Random;

public class Main {

  public static void main(String... args) {
    final int SIZE = 1024 * 1024 * 1024;
    final int RUNS = 8;
    final int THREADS = 8;
    final int TSIZE = SIZE / THREADS;
    assert (TSIZE * THREADS == SIZE) : "THREADS must divide SIZE!";
    byte[] src = new byte[SIZE];
    byte[] dest = new byte[SIZE];
    Random r = new Random();
    long timeNano = 0;

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < RUNS; ++i) {
      System.out.print("Initializing src... ");
      for (int idx = 0; idx < SIZE; ++idx) {
        src[idx] = ((byte) r.nextInt(256));
      }
      System.out.println("done!");
      System.out.print("Starting test... ");
      for (int idx = 0; idx < THREADS; ++idx) {
        final int from = TSIZE * idx;
        threads[idx]
            = new Thread(() -> {
          System.arraycopy(src, from, dest, from, TSIZE);
        });
      }
      long start = System.nanoTime();
      for (int idx = 0; idx < THREADS; ++idx) {
        threads[idx].start();
      }
      for (int idx = 0; idx < THREADS; ++idx) {
        try {
          threads[idx].join();
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
      }
      timeNano += System.nanoTime() - start;
      System.out.println("done!");
    }
    double timeSecs = timeNano / 1_000_000_000d;

    System.out.println("Transferred " + (long) SIZE * RUNS
        + " bytes in " + timeSecs + " seconds.");

    System.out.println("-> "
        + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
        + " GiB/s");
  }
}

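This way, as much "other computation" as possible is eliminated, and (almost) only the memory copy rate via System.arraycopy(...) is measured. This algorithm may still have issues with regard to caching.

For my system (dual channel DDR3-1600), I get something around 6 GiB/s, whereas the theoretical limit is around 25 GiB/s (including dual channel).

As Nick Mertin pointed out, the JVM introduces some overhead. Therefore, it is to be expected that you cannot reach the theoretical limit.

¹ Side note: to run the program, the JVM must be given more heap space (for example via the -Xmx option). In my case, 4096 MB was sufficient.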

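Answer 2

Your testing method is ill-designed in many respects, and so is your interpretation of the RAM rating.

Let's start with the rating. Since the introduction of SDRAM, marketing has named modules after the bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case, and in practice it cannot be sustained.

The parameters omitted by that label are the actual access time (aka latency) and the total cycle time (aka precharge time). These can be figured out by looking at the "timing" specifications (the 2-3-3 stuff). Look up an article that explains these in detail. Also, the CPU normally does not transfer single bytes but entire cache lines (e.g. 8 entries of 8 bytes each = 64 bytes).

Your test code is ill-designed because you are doing random access with relatively tiny blocks that are not aligned to the actual data boundaries. This random access also incurs frequent page misses in the MMU (read up on what the TLB is and does). So you are measuring a wild mixture of different system aspects.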
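To make the alignment point concrete, here is a small sketch that counts how many cache lines a block overlaps. It assumes a 64-byte cache line, as in the example above, and it assumes that offset 0 sits on a cache line boundary, which the JVM does not guarantee for array data. The class and method names are hypothetical.

```java
final class CacheLineMath {
    // Number of cache lines overlapped by the byte range [offset, offset + length).
    static long linesTouched(long offset, long length, int lineSize) {
        if (length <= 0) {
            return 0;
        }
        long first = offset / lineSize;
        long last = (offset + length - 1) / lineSize;
        return last - first + 1;
    }

    public static void main(String[] args) {
        int lineSize = 64; // assumed cache line size in bytes
        // A 1 KiB block starting on a line boundary overlaps 16 lines.
        System.out.println(linesTouched(0, 1024, lineSize));
        // The same block starting at an unaligned offset overlaps 17 lines.
        System.out.println(linesTouched(37, 1024, lineSize));
    }
}
```

With random start offsets, most 1 KiB blocks fall into the unaligned case, so each copy touches one more cache line than its size alone suggests.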