Java 8u40 Math.round() 非常慢
我有一个用Java 8编写的相当简单的业余爱好项目,它在其操作模式之一中广泛使用重复的Math.round()调用。例如,一种这样的模式通过 ExecutorService 生成 4 个线程,并将 48 个可运行的任务排入队列,每个任务运行类似于以下代码块的内容 2^31 次:
int3 = Math.round(float1 + float2);
int3 = Math.round(float1 * float2);
int3 = Math.round(float1 / float2);
事实并非如此(涉及数组和嵌套循环),但你明白了。无论如何,在Java 8u40之前,类似于上述代码的代码可以在大约13秒内在AMD A10-7700k上完成约1030亿个指令块的完整运行。使用Java 8u40,做同样的事情大约需要260秒。没有对代码进行任何更改,什么都没有,只是Java更新。
有没有人注意到Math.round()变慢了,特别是当它被重复使用时?这几乎就像JVM之前正在做某种优化,它不再做了。也许它在8u40之前使用SIMD,现在不是?
编辑:我已经完成了对MVCE的第二次尝试。您可以在此处下载第一次尝试:
https://www.dropbox.com/s/rm2ftcv8y6ye1bi/MathRoundMVCE.zip?dl=0
第二次尝试如下。我的第一次尝试已经从这篇文章中删除了,因为它被认为太长了,并且容易受到JVM的死代码删除优化(显然在8u40中发生得更少)。
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class MathRoundMVCE
{
static long grandtotal = 0;
static long sumtotal = 0;
static float[] float4 = new float[128];
static float[] float5 = new float[128];
static int[] int6 = new int[128];
static int[] int7 = new int[128];
static int[] int8 = new int[128];
static long[] longarray = new long[480];
final static int mil = 1000000;
public static void main(String[] args)
{
initmainarrays();
OmniCode omni = new OmniCode();
grandtotal = omni.runloops() / mil;
System.out.println("Total sum of operations is " + sumtotal);
System.out.println("Total execution time is " + grandtotal + " milliseconds");
}
public static long siftarray(long[] larray)
{
long topnum = 0;
long tempnum = 0;
for (short i = 0; i < larray.length; i++)
{
tempnum = larray[i];
if (tempnum > 0)
{
topnum += tempnum;
}
}
topnum = topnum / Runtime.getRuntime().availableProcessors();
return topnum;
}
public static void initmainarrays()
{
int k = 0;
do
{
float4[k] = (float)(Math.random() * 12) + 1f;
float5[k] = (float)(Math.random() * 12) + 1f;
int6[k] = 0;
k++;
}
while (k < 128);
}
}
class OmniCode extends Thread
{
volatile long totaltime = 0;
final int standard = 16777216;
final int warmup = 200000;
byte threads = 0;
public long runloops()
{
this.setPriority(MIN_PRIORITY);
threads = (byte)Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(threads);
for (short j = 0; j < 48; j++)
{
executor.execute(new RoundFloatToIntAlternate(warmup, (byte)j));
}
executor.shutdown();
while (!executor.isTerminated())
{
try
{
Thread.sleep(100);
}
catch (InterruptedException e)
{
//Do nothing
}
}
executor = Executors.newFixedThreadPool(threads);
for (short j = 0; j < 48; j++)
{
executor.execute(new RoundFloatToIntAlternate(standard, (byte)j));
}
executor.shutdown();
while (!executor.isTerminated())
{
try
{
Thread.sleep(100);
}
catch (InterruptedException e)
{
//Do nothing
}
}
totaltime = MathRoundMVCE.siftarray(MathRoundMVCE.longarray);
executor = null;
Runtime.getRuntime().gc();
return totaltime;
}
}
class RoundFloatToIntAlternate extends Thread
{
int i = 0;
int j = 0;
int int3 = 0;
int iterations = 0;
byte thread = 0;
public RoundFloatToIntAlternate(int cycles, byte threadnumber)
{
iterations = cycles;
thread = threadnumber;
}
public void run()
{
this.setPriority(9);
MathRoundMVCE.longarray[this.thread] = 0;
mainloop();
blankloop();
}
public void blankloop()
{
j = 0;
long timer = 0;
long totaltimer = 0;
do
{
timer = System.nanoTime();
i = 0;
do
{
i++;
}
while (i < 128);
totaltimer += System.nanoTime() - timer;
j++;
}
while (j < iterations);
MathRoundMVCE.longarray[this.thread] -= totaltimer;
}
public void mainloop()
{
j = 0;
long timer = 0;
long totaltimer = 0;
long localsum = 0;
int[] int6 = new int[128];
int[] int7 = new int[128];
int[] int8 = new int[128];
do
{
timer = System.nanoTime();
i = 0;
do
{
int6[i] = Math.round(MathRoundMVCE.float4[i] + MathRoundMVCE.float5[i]);
int7[i] = Math.round(MathRoundMVCE.float4[i] * MathRoundMVCE.float5[i]);
int8[i] = Math.round(MathRoundMVCE.float4[i] / MathRoundMVCE.float5[i]);
i++;
}
while (i < 128);
totaltimer += System.nanoTime() - timer;
for(short z = 0; z < 128; z++)
{
localsum += int6[z] + int7[z] + int8[z];
}
j++;
}
while (j < iterations);
MathRoundMVCE.longarray[this.thread] += totaltimer;
MathRoundMVCE.sumtotal = localsum;
}
}
长话短说,这段代码在8u25中的执行速度与在8u40中的执行速度大致相同。如您所见,我现在将所有计算结果记录到数组中,然后将循环的定时部分之外的这些数组求和为局部变量,然后将其写入外部循环末尾的静态变量。
低于 8u25:总执行时间为 261545 毫秒
低于 8u40:总执行时间为 266890 毫秒
测试条件与以前相同。因此,看起来 8u25 和 8u31 正在执行 8u40 停止执行的死代码删除,从而导致代码在 8u40 中“变慢”。这并不能解释每一件奇怪的小事,但这似乎是其中的大部分。作为额外的奖励,这里提供的建议和答案给了我灵感,以改善我的爱好项目的其他部分,对此我非常感激。谢谢大家!