通过简单地添加方法参数(更精简的jit代码)无法解释的10%+性能提升原因
(注:正确答案必须超越复制)。
经过数百万次调用,quicksort1 绝对比 quicksort2 快,后者除了这 1 个额外的 arg 之外,还具有相同的代码。
剧透:我还发现jit代码比它胖了224个字节,即使它实际上应该更简单(就像字节代码大小告诉的那样;请参阅下面的最新更新)。
即使尝试使用一些微基准线束(JMH)来消除这种影响,性能差异仍然存在。
我在问:为什么生成的原生代码有如此大的差异,它在做什么?
通过向方法添加参数,它可以使其更快...!我知道gc / jit / warmup /等效果。您可以按原样运行代码,也可以使用更大/更小的迭代计数运行代码。实际上,您甚至应该注释掉一个,然后另一个perf测试,并在不同的jvm实例中运行每个,只是为了证明它不是彼此之间的干扰。
字节码没有显示出太大的差异,除了sleft/sright的明显静止之外,还有一个奇怪的“iload 4”而不是“iload_3”(和istore 4/istore_3)
这到底是怎么回事?iload_3/istore_3真的比 iload 4/istore 4 慢吗?而且速度慢得多,即使添加的getstatic调用仍然没有使它变慢?我可以猜测静态字段未使用,因此jit可能会跳过它。
无论如何,我这边没有歧义,因为它总是可重复的,我正在寻找解释为什么javac / jit做了他们所做的事情,以及为什么性能受到如此大的影响。这些是相同的递归算法,具有相同的数据,相同的内存变动等...如果我愿意,我无法进行更孤立的更改,以显示显着的可复制运行时差异。
环境:
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
(also tried and reproduced on java9)
on a 4 core i5 laptop 8GB ram.
windows 10 with the meltdown/specter patch.
使用 -verbose:gc -XX:+PrintCompilation,没有 gc,jit 编译在 C2(第 4 层)中稳定下来。
当 n=20000 时:
main]: qs1: 1561.3336199999999 ms (res=null)
main]: qs2: 1749.748416 ms (res=null)
main]: qs1: 1422.0767509999998 ms (res=null)
main]: qs2: 1700.4858689999999 ms (res=null)
main]: qs1: 1519.5280269999998 ms (res=null)
main]: qs2: 1786.2206899999999 ms (res=null)
main]: qs1: 1450.0802979999999 ms (res=null)
main]: qs2: 1675.223256 ms (res=null)
main]: qs1: 1452.373318 ms (res=null)
main]: qs2: 1634.581156 ms (res=null)
顺便说一句,漂亮的java9似乎使两者都更快,但仍然有10-15%的折扣。
[0.039s][info][gc] Using G1
main]: qs1: 1287.062819 ms (res=null)
main]: qs2: 1451.041391 ms (res=null)
main]: qs1: 1240.800305 ms (res=null)
main]: qs2: 1391.2404299999998 ms (res=null)
main]: qs1: 1257.1707159999999 ms (res=null)
main]: qs2: 1433.84716 ms (res=null)
main]: qs1: 1233.7582109999998 ms (res=null)
main]: qs2: 1394.7195849999998 ms (res=null)
main]: qs1: 1250.885867 ms (res=null)
main]: qs2: 1413.88437 ms (res=null)
main]: qs1: 1261.5805739999998 ms (res=null)
main]: qs2: 1458.974334 ms (res=null)
main]: qs1: 1237.039902 ms (res=null)
main]: qs2: 1394.823751 ms (res=null)
main]: qs1: 1255.14672 ms (res=null)
main]: qs2: 1400.693295 ms (res=null)
main]: qs1: 1293.009808 ms (res=null)
main]: qs2: 1432.430952 ms (res=null)
main]: qs1: 1262.839628 ms (res=null)
main]: qs2: 1421.376644 ms (res=null)
代码(包括测试):
(请不要注意这个快速排序有多糟糕;它在问题旁边)。
import java.util.Random;
import java.util.concurrent.Callable;
public class QuicksortTrimmed {
static void p(Object msg) {
System.out.println(Thread.currentThread().getName()+"]: "+msg);
}
static void perf(int N, String msg, Callable c) throws Exception {
Object res = null;
long t = System.nanoTime();
for(int i=0; i<N; i++) {
res = c.call();
}
Double d = 1e-6*(System.nanoTime() - t);
p(msg+": "+d+" ms (res="+res+")");
}
static String und = "__________";//10
static {
und += und;//20
und += und;//40
und += und;//80
und += und;//160
}
static String sleft = "//////////";//10
static {
sleft += sleft;//20
sleft += sleft;//40
sleft += sleft;//80
sleft += sleft;//160
}
static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
static {
sright += sright;//20
sright += sright;//40
sright += sright;//80
sright += sright;//160
}
//============
public static void main(String[] args) throws Exception {
int N = 20000;
int n = 1000;
int bound = 10000;
Random r = new Random(1);
for(int i=0; i<5; i++) {
testperf(N, r, n, bound);
//System.gc();
}
}
static void testperf(int N, Random r, int n, int bound) throws Exception {
final int[] orig = r.ints(n, 0, bound).toArray();
final int[] a = orig.clone();
perf(N, "qs1", () -> {
System.arraycopy(orig, 0, a, 0, orig.length);
quicksort1(a, 0, a.length-1, und);
return null;
});
perf(N, "qs2", () -> {
System.arraycopy(orig, 0, a, 0, orig.length);
quicksort2(a, 0, a.length-1);
return null;
});
System.out.println();
}
private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
int len = _to - _from + 1;
if(len==2) {
if(a[_from] > a[_to]) {
int tmp = a[_from];
a[_from] = a[_to];
a[_to] = tmp;
}
} else { //len>2
int mid = _from+len/2;
final int pivot = a[mid];
a[mid] = a[_to];
a[_to] = pivot; //the pivot value is the 1st high value
int i = _from;
int j = _to;
while(i < j) {
if(a[i] < pivot)
i++;
else if(i < --j) { //j is the index of the leftmost high value
int tmp = a[i];
a[i] = a[j]; //THIS IS HARMFUL: maybe a[j] was a high value too.
a[j] = tmp;
}
}
//swap pivot in _to with other high value in j
int tmp = a[j];
a[j] = a[_to];
a[_to] = tmp;
if(_from < j-1)
quicksort1(a, _from, j-1, sleft);
if(j+1 < _to)
quicksort1(a, j+1, _to, sright);
}
}
private static void quicksort2(int[] a, final int _from, final int _to) {
int len = _to - _from + 1;
if(len==2) {
if(a[_from] > a[_to]) {
int tmp = a[_from];
a[_from] = a[_to];
a[_to] = tmp;
}
} else { //len>2
int mid = _from+len/2;
final int pivot = a[mid];
a[mid] = a[_to];
a[_to] = pivot; //the pivot value is the 1st high value
int i = _from;
int j = _to;
while(i < j) {
if(a[i] < pivot)
i++;
else if(i < --j) { //j is the index of the leftmost high value
int tmp = a[i];
a[i] = a[j]; //THIS IS HARMFUL: maybe a[j] was a high value too.
a[j] = tmp;
}
}
//swap pivot in _to with other high value in j
int tmp = a[j];
a[j] = a[_to];
a[_to] = tmp;
if(_from < j-1)
quicksort2(a, _from, j-1);
if(j+1 < _to)
quicksort2(a, j+1, _to);
}
}
}
更新:
我做了JMH测试,它仍然证明quicksort1比quicksort2更快。
# Run complete. Total time: 00:13:38
Benchmark Mode Cnt Score Error Units
MyBenchmark.testQuickSort1 thrpt 200 14861.437 ± 86.707 ops/s
MyBenchmark.testQuickSort2 thrpt 200 13097.209 ± 46.178 ops/s
以下是JMH基准测试:
package org.sample;
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
public class MyBenchmark {
static String und = "__________";//10
static {
und += und;//20
und += und;//40
und += und;//80
und += und;//160
}
static String sleft = "//////////";//10
static {
sleft += sleft;//20
sleft += sleft;//40
sleft += sleft;//80
sleft += sleft;//160
}
static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
static {
sright += sright;//20
sright += sright;//40
sright += sright;//80
sright += sright;//160
}
static final int n = 1000;
static final int bound = 10000;
static Random r = new Random(1);
static final int[] orig = r.ints(n, 0, bound).toArray();
@State(Scope.Thread)
public static class ThrState {
int[] a;
@Setup(Level.Invocation)
public void setup() {
a = orig.clone();
}
}
//============
@Benchmark
public void testQuickSort1(Blackhole bh, ThrState state) {
int[] a = state.a;
quicksort1(a, 0, a.length-1, und);
bh.consume(a);
}
@Benchmark
public void testQuickSort2(Blackhole bh, ThrState state) {
int[] a = state.a;
quicksort2(a, 0, a.length-1);
bh.consume(a);
}
private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
int len = _to - _from + 1;
if(len==2) {
if(a[_from] > a[_to]) {
int tmp = a[_from];
a[_from] = a[_to];
a[_to] = tmp;
}
} else { //len>2
int mid = _from+len/2;
final int pivot = a[mid];
a[mid] = a[_to];
a[_to] = pivot; //the pivot value is the 1st high value
int i = _from;
int j = _to;
while(i < j) {
if(a[i] < pivot)
i++;
else if(i < --j) { //j is the index of the leftmost high value
int tmp = a[i];
a[i] = a[j]; //THIS IS HARMFUL: maybe a[j] was a high value too.
a[j] = tmp;
}
}
//swap pivot in _to with other high value in j
int tmp = a[j];
a[j] = a[_to];
a[_to] = tmp;
if(_from < j-1)
quicksort1(a, _from, j-1, sleft);
if(j+1 < _to)
quicksort1(a, j+1, _to, sright);
}
}
private static void quicksort2(int[] a, final int _from, final int _to) {
int len = _to - _from + 1;
if(len==2) {
if(a[_from] > a[_to]) {
int tmp = a[_from];
a[_from] = a[_to];
a[_to] = tmp;
}
} else { //len>2
int mid = _from+len/2;
final int pivot = a[mid];
a[mid] = a[_to];
a[_to] = pivot; //the pivot value is the 1st high value
int i = _from;
int j = _to;
while(i < j) {
if(a[i] < pivot)
i++;
else if(i < --j) { //j is the index of the leftmost high value
int tmp = a[i];
a[i] = a[j]; //THIS IS HARMFUL: maybe a[j] was a high value too.
a[j] = tmp;
}
}
//swap pivot in _to with other high value in j
int tmp = a[j];
a[j] = a[_to];
a[_to] = tmp;
if(_from < j-1)
quicksort2(a, _from, j-1);
if(j+1 < _to)
quicksort2(a, j+1, _to);
}
}
}
更新:
此时,我运行并捕获了 jitwatch 的 jit 日志(我使用了 1.3.0 标记,并从 https://github.com/AdoptOpenJDK/jitwatch/tree/1.3.0)
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:"./gc.log"
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintCompilation
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:+PrintInlining
jitwatch没有明显的“建议”,只是对于quicksort1和quicksort2来说,rgular太大而无法内联或太深。
一个重要的发现是字节码和本机代码的区别:
使用额外的方法参数 (quicksort1):字节代码 = 187 字节 本机代码 = 1872 字节
没有额外的方法参数(quicksort2):字节码 = 178 字节(小 9 个字节)本机代码 = 2096 个字节(大 224 个字节!!!)
这是一个强有力的证据,表明 jit 代码在 quicksort2 中更胖、更慢。
所以问题仍然存在:C2 jit编译器在想什么?当我添加一个方法参数和2个静态引用来加载和传递时,什么规则使它创建更快的本机代码?
我终于掌握了汇编代码,但正如我所期望的那样,几乎不可能区分和理解正在发生的事情。我遵循了从 https://stackoverflow.com/a/24524285/2023577 找到的最新指示。我有7MB的xml日志文件(压缩到675kB),你可以得到它,并在 https://wetransfer.com/downloads/65fe0e94ab409d57cba1b95459064dd420180427150905/612dc9 看到它7天(过期〜2018年5月4日),以防万一你能理解它(当然在jitwatch!
添加的字符串参数导致更紧凑的程序集代码。问题(仍然没有答案)是为什么?汇编代码中有什么不同?在较慢的代码中未使用的规则或优化是什么?