Why is my Java lambda with a dummy assignment much faster than without it?

2022-09-03 02:47:58

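I know that drawing conclusions from Java microbenchmarks is extremely fraught, but I'm seeing something that looks strange, and I'd like an explanation for it.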

Note that I'm not using the JMH framework for this. I'm aware of it, but I didn't want to go to that length here.

I'll provide the entire code sample, but in short, when I test the performance of these two methods

private FooPrime[] testStreamToArray(ArrayList<Foo> fooList) {
    return (FooPrime[]) fooList.stream().
                map(it -> {
                    return new FooPrime().gamma(it.getAlpha() + it.getBeta());
                }).
                toArray(FooPrime[]::new);
}

private FooPrime[] testStreamToArray2(ArrayList<Foo> fooList) {
    return (FooPrime[]) fooList.stream().
                map(it -> {
                    int stuff = it.getAlpha().length();
                    return new FooPrime().gamma(it.getAlpha() + it.getBeta());
                }).
                toArray(FooPrime[]::new);
}

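I get very surprising results. In the larger code sample I measure four different methods, and the first three are very close in performance: each takes about 50k ns per iteration. The second code sample above, however, consistently runs in just under half that time. That's right: it isn't slower, it's considerably faster.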

The last run shows numbers like this:

manualcopy:54575 ns
toarray:53617 ns
streamtoarray:52990 ns
streamtoarray2:24217 ns

Every run produces numbers similar to these.

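Now I'll provide the entire class and the base class. Note that I do have a "warm-up" phase, in which I execute the methods under test a few thousand times before starting the timings. Also note that although this runs "testStreamToArray2" last, I also tried moving that block to the first test, and the numbers came out about the same. The commented-out lines are there to convince me that the methods are actually doing something (the timings stay about the same with those lines uncommented).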

package timings;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListToArrayOfPrimesTiming {

    public static void main(String[] args) {
        ListToArrayOfPrimesTiming tests = new ListToArrayOfPrimesTiming(args);
        tests.go();
    }

    public ListToArrayOfPrimesTiming(String[] args) { }

    private void go() {

        final ArrayList<Foo> fooList    = new ArrayList<>();

        for (int ctr = 0; ctr < 1000; ++ ctr) {
            fooList.add(new Foo().alpha("a" + ctr).beta("b" + ctr));
        }

        for (int ctr = 0; ctr < 20000; ++ ctr) {
            testManualCopy(fooList);
            testToArray(fooList);
            testStreamToArray(fooList);
            testStreamToArray2(fooList);
        }

        int iters   = 100000;

//      Set<Integer> lengths    = new HashSet<>();
//      Set<FooPrime>   distinctFooPrimes   = new HashSet<>();
//      lengths.clear();
//      distinctFooPrimes.clear();

        new TimingContainer(iters, "manualcopy", new TimingTest() {
            @Override
            public void run() {
                FooPrime[] fooPrimeArray = testManualCopy(fooList);
//              lengths.add(fooPrimeArray.length);
//              distinctFooPrimes.add(fooPrimeArray[0]);
            }
        }).run();

//      System.out.println("lengths[" + lengths + "]");
//      lengths.clear();
//      System.out.println("distinctFooPrimes[" + distinctFooPrimes + "]");
//      distinctFooPrimes.clear();

        new TimingContainer(iters, "toarray", new TimingTest() {
            @Override
            public void run() {
                FooPrime[] fooPrimeArray = testToArray(fooList);
//              lengths.add(fooPrimeArray.length);
//              distinctFooPrimes.add(fooPrimeArray[0]);
            }
        }).run();

//      System.out.println("lengths[" + lengths + "]");
//      lengths.clear();
//      System.out.println("distinctFooPrimes[" + distinctFooPrimes + "]");
//      distinctFooPrimes.clear();

        new TimingContainer(iters, "streamtoarray", new TimingTest() {
            @Override
            public void run() {
                FooPrime[] fooPrimeArray = testStreamToArray(fooList);
//              lengths.add(fooPrimeArray.length);
//              distinctFooPrimes.add(fooPrimeArray[0]);
            }
        }).run();

//      System.out.println("lengths[" + lengths + "]");
//      lengths.clear();
//      System.out.println("distinctFooPrimes[" + distinctFooPrimes + "]");
//      distinctFooPrimes.clear();

        new TimingContainer(iters, "streamtoarray2", new TimingTest() {
            @Override
            public void run() {
                FooPrime[] fooPrimeArray = testStreamToArray2(fooList);
//              lengths.add(fooPrimeArray.length);
//              distinctFooPrimes.add(fooPrimeArray[0]);
            }
        }).run();

//      System.out.println("lengths[" + lengths + "]");
//      lengths.clear();
//      System.out.println("distinctFooPrimes[" + distinctFooPrimes + "]");
//      distinctFooPrimes.clear();
    }

    private FooPrime[] testManualCopy(ArrayList<Foo> fooList) {
        FooPrime[] fooPrimeArray    = new FooPrime[fooList.size()];
        int index = -1;
        for (Foo foo: fooList) {
            ++ index;
            fooPrimeArray[index]    = new FooPrime().gamma(foo.getAlpha() + foo.getBeta());
        }
        return fooPrimeArray;
    }

    private FooPrime[] testToArray(ArrayList<Foo> fooList) {
        List<FooPrime>  fooPrimeList    = new ArrayList<>();
        for (Foo foo: fooList) {
            fooPrimeList.add(new FooPrime().gamma(foo.getAlpha() + foo.getBeta()));
        }
        return fooPrimeList.toArray(new FooPrime[fooList.size()]);
    }

    private FooPrime[] testStreamToArray(ArrayList<Foo> fooList) {
        return (FooPrime[]) fooList.stream().
                    map(it -> {
                        return new FooPrime().gamma(it.getAlpha() + it.getBeta());
                    }).
                    toArray(FooPrime[]::new);
    }

    private FooPrime[] testStreamToArray2(ArrayList<Foo> fooList) {
        return (FooPrime[]) fooList.stream().
                    map(it -> {
                        int stuff = it.getAlpha().length();
                        return new FooPrime().gamma(it.getAlpha() + it.getBeta());
                    }).
                    toArray(FooPrime[]::new);
    }

    public static FooPrime fooToFooPrime(Foo foo) {
        return new FooPrime().gamma(foo.getAlpha() + foo.getBeta());
    }

    public static class Foo {
        private String alpha;
        private String beta;

        public String getAlpha() { return alpha; }
        public String getBeta() { return beta; }

        public void setAlpha(String alpha) { this.alpha = alpha; }
        public void setBeta(String beta) { this.beta = beta; }

        public Foo alpha(String alpha) { this.alpha = alpha; return this; }
        public Foo beta(String beta) { this.beta = beta; return this; }
    }

    public static class FooPrime {
        private String gamma;

        public String getGamma() { return gamma; }

        public void setGamma(String gamma) { this.gamma = gamma; }

        public FooPrime gamma(String gamma) { this.gamma = gamma; return this; }

        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + ((gamma == null) ? 0 : gamma.hashCode());
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null)
                return false;
            if (getClass() != obj.getClass())
                return false;
            FooPrime other = (FooPrime) obj;
            if (gamma == null) {
                if (other.gamma != null)
                    return false;
            } else if (!gamma.equals(other.gamma))
                return false;
            return true;
        }

        @Override
        public String toString() {
            return "FooPrime [gamma=" + gamma + "]";
        }
    }
}

The base class:

package timings;

public class TimingContainer {
    private int         iterations;
    private String      label;
    private TimingTest  timingTest;

    public TimingContainer(int iterations, String label, TimingTest timingTest) {
        this.iterations = iterations;
        this.label      = label;
        this.timingTest = timingTest;
    }

    public void run() {
        long startTime  = System.nanoTime();
        for (int ctr = 0; ctr < iterations; ++ ctr) {
            timingTest.randomize();
            timingTest.run();
        }
        long    endTime = System.nanoTime();
        long    totalns = (endTime - startTime);
        System.out.println(label + ":" + (totalns / iterations) + " ns");
    }
}
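
TimingTest itself is not shown above. Below is a minimal sketch of what it might look like, assuming an abstract class whose randomize() hook does nothing by default. That shape is only inferred from how TimingContainer and the anonymous subclasses use it; it is not taken from the original code. TimingTestDemo is a made-up driver added just to exercise the sketch.

```java
// Hypothetical reconstruction: the question never shows TimingTest.
// Assumed shape: an abstract class, because the anonymous subclasses above
// override only run() while TimingContainer also calls randomize().
// To use it with the classes above, add "package timings;" and make it public.
abstract class TimingTest {
    /** Hook called before every timed iteration; a no-op unless overridden. */
    public void randomize() { }

    /** The code under test. */
    public abstract void run();
}

// Made-up driver that calls the two methods in the same order as
// TimingContainer.run() does, without any timing.
final class TimingTestDemo {
    /** Invokes randomize() then run() n times and returns how often run() was called. */
    static int drive(int n) {
        int[] calls = {0};
        TimingTest test = new TimingTest() {
            @Override
            public void run() { calls[0]++; }
        };
        for (int i = 0; i < n; i++) {
            test.randomize();
            test.run();
        }
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println(drive(3)); // prints 3
    }
}
```

Note that randomize() runs inside the timed loop, so any non-trivial override would be counted in the reported per-iteration time.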

Answer 1

(Revised answer.)

Benchmarking in Java is hard. Still, let's throw JMH at it... I have ported your benchmark to JMH (see http://github.com/lemire/microbenchmarks).

These are the relevant methods...

    public FooPrime[] basicstream(BenchmarkState s) {
            return (FooPrime[]) s.fooList.stream().map(it -> {
                    return new FooPrime().gamma(it.getAlpha() + it.getBeta());
            }).toArray(FooPrime[]::new);
    }

    public FooPrime[] tweakedbasicstream(BenchmarkState s) {
            return (FooPrime[]) s.fooList.stream().map(it -> {
                    int stuff = it.getAlpha().length();
                    return new FooPrime().gamma(it.getAlpha() + it.getBeta());
            }).toArray(FooPrime[]::new);
    }

And here are the results of my run...

git clone https://github.com/lemire/microbenchmarks.git
cd microbenchmarks
mvn clean install
java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.mysteries.MysteriousLambda
Benchmark                                      Mode  Samples      Score    Error  Units
m.l.m.m.MysteriousLambda.basicstream           avgt        5  17013.784 ± 46.536  ns/op
m.l.m.m.MysteriousLambda.tweakedbasicstream    avgt        5  16240.451 ± 67.884  ns/op

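So the two functions do not seem to run at exactly the same average speed; the difference is well outside the reported error bounds. And that is while using JMH, which is a rather good benchmarking framework.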
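
To put a number on that difference, here is a quick calculation over the JMH scores reported above and the hand-timed figures from the question. The class and method names are made up for this illustration; only the four input numbers come from the post.

```java
import java.util.Locale;

// Illustrative helper, not part of the benchmark repository.
final class RelativeSpeedup {
    /** Fraction of time saved by `faster` relative to `baseline` (0.10 means 10% less time). */
    static double reduction(double baseline, double faster) {
        return (baseline - faster) / baseline;
    }

    /** The same value formatted as a percentage with one decimal place. */
    static String percent(double baseline, double faster) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * reduction(baseline, faster));
    }

    public static void main(String[] args) {
        // JMH: basicstream vs. tweakedbasicstream, in ns/op
        System.out.println(percent(17013.784, 16240.451)); // prints 4.5%
        // Hand-rolled timing from the question: streamtoarray vs. streamtoarray2, in ns
        System.out.println(percent(52990, 24217));         // prints 54.3%
    }
}
```

Under JMH the tweaked lambda is about 4.5% faster, which is far from the better-than-2x gap seen in the hand-rolled timings.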

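At first I thought your two pieces of code were logically equivalent, but they are not. The seemingly useless call to the length method forces the code to throw an exception when the returned String object is null.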

So it is actually closer to the following piece of code...

    @Benchmark
    public FooPrime[] nullbasicstream(BenchmarkState s) {
            return (FooPrime[]) s.fooList.stream().map(it -> {
                    if( it.getAlpha() == null) throw new NullPointerException();
                    return new FooPrime().gamma(it.getAlpha() + it.getBeta());
            }).toArray(FooPrime[]::new);
    }
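
The behavioral difference between the two lambdas is easy to confirm in isolation. The sketch below strips the lambda bodies down to plain static methods; NullAlphaDemo and its method names are made up for illustration and do not appear in the benchmark.

```java
// Illustrative only: shows how the two lambda bodies differ when alpha is null.
final class NullAlphaDemo {
    /** Body of the original lambda: string concatenation tolerates a null operand. */
    static String original(String alpha, String beta) {
        return alpha + beta;
    }

    /** Body of the tweaked lambda: length() dereferences alpha first. */
    static String tweaked(String alpha, String beta) {
        int stuff = alpha.length(); // throws NullPointerException if alpha is null
        return alpha + beta;
    }

    /** Returns true if tweaked() throws NullPointerException for these arguments. */
    static boolean tweakedThrows(String alpha, String beta) {
        try {
            tweaked(alpha, beta);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(original(null, "b0"));      // prints nullb0
        System.out.println(tweakedThrows(null, "b0")); // prints true
        System.out.println(tweakedThrows("a0", "b0")); // prints false
    }
}
```

The benchmark data never contains a null alpha, so the exception is never actually thrown there; the point is only that the compiler must preserve the possibility.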

That version, nullbasicstream, is even faster than your tweaked function...

Benchmark                                      Mode  Samples      Score    Error  Units
m.l.m.m.MysteriousLambda.basicstream           avgt        5  17013.784 ± 46.536  ns/op
m.l.m.m.MysteriousLambda.nullbasicstream       avgt        5  15983.762 ± 92.593  ns/op
m.l.m.m.MysteriousLambda.tweakedbasicstream    avgt        5  16240.451 ± 67.884  ns/op

Why might that be?

Let's sidestep Java 8 stream programming and write the function the dumb old way, with and without the null check:

    @Benchmark
    public FooPrime[] basicsum(BenchmarkState s) {
            int howmany = s.fooList.size();
            FooPrime[] answer = new FooPrime[s.fooList.size()];
            for(int k = 0; k < howmany ; ++k ) {
                    Foo x = s.fooList.get(k);
                    answer[k] = new FooPrime(x.getAlpha() + x.getBeta());
            }
            return answer;
    }

    @Benchmark
    public FooPrime[] basicsumnull(BenchmarkState s) {
            int howmany = s.fooList.size();
            FooPrime[] answer = new FooPrime[s.fooList.size()];
            for(int k = 0; k < howmany ; ++k ) {
                    Foo x = s.fooList.get(k);
                    if(x.getAlpha() == null) throw new NullPointerException();
                    answer[k] = new FooPrime(x.getAlpha() + x.getBeta());
            }
            return answer;
    }

And this is how we get the best performance...

 m.l.m.m.MysteriousLambda.basicstream                        avgt        5  17019.730 ±  61.982  ns/op
 m.l.m.m.MysteriousLambda.nullbasicstream                    avgt        5  16019.332 ±  62.831  ns/op
 m.l.m.m.MysteriousLambda.basicsum                           avgt        5  15635.474 ± 119.890  ns/op
 m.l.m.m.MysteriousLambda.basicsumnull                       avgt        5  14342.016 ± 109.958  ns/op

But the benefit of the null check remains.

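OK. Let's benchmark just the string concatenation, with nothing else around it (no custom classes). We'll have both the plain sum and the sum preceded by a null check: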

    @Benchmark
    public void stringsum(BenchmarkState s) {
            for(int k = 0; k < s.N; ++k) s.list3[k] = s.list1[k] + s.list2[k];
    }


    @Benchmark
    public void stringsum_withexcept(BenchmarkState s) {
            for(int k = 0; k < s.N; ++k) {
                    if(s.list1[k] == null) throw new NullPointerException();
                    s.list3[k] = s.list1[k] + s.list2[k];
            }
    }

Here we find that the null check slows us down...

    m.l.m.m.StringMerge.stringsum               avgt        5  27011.111 ±  4.077  ns/op
    m.l.m.m.StringMerge.stringsum_withexcept    avgt        5  28387.825 ± 82.523  ns/op

Answer 2

Building on @DanielLemire's answer, I have an idea that might get us a bit further (not a definitive explanation, but too long for a comment). In

int stuff = it.getAlpha().length();
return new FooPrime().gamma(it.getAlpha() + it.getBeta());

the relevant part is

if (it.getAlpha() == null) throw new NullPointerException();
String s = it.getAlpha() + it.getBeta();

where I introduced s for the result of the concatenation. Rewriting it a bit, we get

String a = it.getAlpha();
if (a == null) throw new NullPointerException();
String b = it.getBeta();
String s = (a == null ? "null" : a) + (b == null ? "null" : b);

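The first check makes the second a == null check redundant. In Java 8, javac translates string concatenation using StringBuilder. This is good enough for the interpreter, and the pattern is recognized by the JIT compiler, which also recognizes the redundant check. There is a lot of special casing for the most commonly used patterns, and not all cases get optimized equally well. I wouldn't be surprised if this was the reason.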
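
As a rough illustration of that translation, the sketch below places the source-level concatenation next to an explicit StringBuilder chain and the hand-expanded null handling from the rewrite above. ConcatDesugarDemo and its method names are hypothetical, and the builder form only approximates what a Java 8 javac emits; the exact bytecode depends on the compiler version.

```java
// Illustrative only: three ways of writing the same concatenation.
final class ConcatDesugarDemo {
    /** What the source code says. */
    static String plus(String a, String b) {
        return a + b;
    }

    /** Approximately what a Java 8 javac generates for a + b. */
    static String builder(String a, String b) {
        // StringBuilder.append(String) appends the four characters "null"
        // when its argument is null, which is where the implicit check lives.
        return new StringBuilder().append(a).append(b).toString();
    }

    /** The null handling spelled out by hand, as in the rewrite above. */
    static String explicit(String a, String b) {
        return (a == null ? "null" : a) + (b == null ? "null" : b);
    }

    public static void main(String[] args) {
        System.out.println(plus(null, "b1"));     // prints nullb1
        System.out.println(builder(null, "b1"));  // prints nullb1
        System.out.println(explicit(null, "b1")); // prints nullb1
    }
}
```

All three produce the same string for every combination of null and non-null arguments. Once a preceding check has established that a is not null, the a == null branch in the expanded form is dead code that a JIT compiler is free to drop.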

Another possible reason is that the NPE-throwing code might lead to something like

if (a == null) goto AWAY;
String s = a + (b == null ? "null" : b);

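where the resulting machine code is substantially shorter, because the handling of the null case is moved off to some special path. In fact, all the null check requires is dereferencing the pointer, which happens anyway when the contents of a are copied into s. When a is null, the virtual memory system generates a SIGSEGV, which gets handled somewhere on the exceptional path. On the fast path there is nothing. The loop body is shorter and may get optimized better (e.g., more loop unrolling).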

