HotSpot, JIT, AOT and Warm-Up

The HotSpot JVM takes some time to profile a running Java application for hot spots in the code and then optimizes by compiling (to assembly) and inlining (when possible) these hot spot methods. That’s great because the JIT (just-in-time) compiler can surgically and aggressively optimize the parts of your application that matter the most instead of taking the AOT (ahead-of-time) approach of compiling and trying to optimize the whole thing beforehand. For example, method inlining is an aggressive form of optimization that usually requires runtime profiling information since inlining everything is impractical/impossible.
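As a concrete illustration (hypothetical code, not from CoralSequencer), the minimal sketch below defines a small hot method that HotSpot will typically compile and inline once it has been invoked enough times. Running it with the standard diagnostic flags shown in the comment lets you watch the JIT pick the method up; the exact output format is JVM-specific.

```java
// Run with:
//   java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining HotMethodDemo
// to observe HotSpot compiling main() and inlining square() into it.
public class HotMethodDemo {

    // Small and monomorphic: a prime candidate for inlining into its caller.
    private static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        long sum = 0;
        // Enough iterations to cross the JIT compilation thresholds.
        for (int i = 0; i < 1_000_000; i++) {
            sum += square(i % 100);
        }
        System.out.println(sum); // prints 3283500000
    }
}
```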

In this article we explore the JVM options -Xcomp -XX:-TieredCompilation, which compile every method right before its first invocation. The drawback is that without any profiling information this compilation can only be conservative. For example, even though some basic method inlining is still performed, a more aggressive inlining approach cannot happen without runtime profiling. The advantage is that your application will be able to perform at a native/assembly level right when it starts (even if not with the most optimized code) without having to wait until the HotSpot JVM has gathered enough profiling information to compile and optimize the hot methods.
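To see the warm-up effect that -Xcomp -XX:-TieredCompilation is meant to address, here is a minimal, self-contained sketch (hypothetical code, not the CoralSequencer benchmark) that times the same method on its first call and again after many invocations. Under the default JIT configuration the first call usually runs interpreted and is much slower; under -Xcomp -XX:-TieredCompilation it runs compiled code from the start.

```java
public class WarmupDemo {

    // A stand-in for the critical path: simple enough to be compiled quickly.
    private static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) acc += i * 31L;
        return acc;
    }

    public static void main(String[] args) {
        long sink = 0; // consume results so the JIT cannot eliminate the work

        long start = System.nanoTime();
        sink += work(10_000); // first call: interpreted under the default JIT
        long cold = System.nanoTime() - start;

        for (int i = 0; i < 20_000; i++) {
            sink += work(10_000); // warm up: let HotSpot profile, compile and optimize
        }

        start = System.nanoTime();
        sink += work(10_000); // now typically running optimized compiled code
        long warm = System.nanoTime() - start;

        System.out.println("cold: " + cold + " ns | warm: " + warm + " ns | sink: " + sink);
    }
}
```

Running the same class once with no flags and once with -Xcomp -XX:-TieredCompilation and comparing the cold numbers gives a rough feel for the trade-off discussed in this article.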

We also explore Azul Zing ReadyNow, which allows the profiling information to be saved from a previous run and re-applied on startup to improve the warm-up time.

Finally we conclude by talking a bit about Project Leyden from Oracle.

CoralSequencer with -Xcomp -XX:-TieredCompilation

Below we explore the difference that -Xcomp -XX:-TieredCompilation makes in the latency numbers of the CoralSequencer benchmark.
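The actual CoralSequencer benchmark harness is not shown in this article. The sketch below (with hypothetical names) illustrates how latency samples taken with System.nanoTime() can be summarized into the kind of percentile report used in the results that follow, where each percentile line shows the average and maximum over the best p% of the samples.

```java
import java.util.Arrays;
import java.util.Locale;

public class LatencyReport {

    // Summarize nanosecond samples: for the given percentile, report the
    // average and the maximum of the best p% of the samples.
    static String report(long[] nanos, double percentile) {
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int count = Math.max(1, (int) (sorted.length * percentile));
        long sum = 0, max = Long.MIN_VALUE;
        for (int i = 0; i < count; i++) {
            sum += sorted[i];
            max = Math.max(max, sorted[i]);
        }
        return String.format(Locale.US, "%.1f%% = [avg: %d nanos, max: %d nanos]",
                percentile * 100, sum / count, max);
    }

    public static void main(String[] args) {
        long[] samples = { 3600, 3700, 3900, 4000, 4100, 4200, 4400, 5000, 22000, 137000 };
        System.out.println(report(samples, 0.90));
        // prints: 90.0% = [avg: 6100 nanos, max: 22000 nanos]
    }
}
```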

Benchmark Environment

$ java -version
java version "21.0.1" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing)

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS \n \l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz

Regular JIT with warm-up

Iterations: 1,000 | Avg Time: 4.372 micros | Min Time: 3.634 micros | Max Time: 137.246 micros | 75% = [avg: 3.958 micros, max: 4.391 micros] | 90% = [avg: 4.035 micros, max: 4.474 micros] | 99% = [avg: 4.127 micros, max: 6.152 micros] | 99.9% = [avg: 4.239 micros, max: 22.849 micros] | 99.99% = [avg: 4.372 micros, max: 137.246 micros] | 99.999% = [avg: 4.372 micros, max: 137.246 micros]

Regular JIT without warm-up

Iterations: 1,000 | Avg Time: 58.026 micros | Min Time: 26.206 micros | Max Time: 2.809 millis | 75% = [avg: 36.13 micros, max: 52.503 micros] | 90% = [avg: 40.976 micros, max: 80.534 micros] | 99% = [avg: 46.905 micros, max: 324.93 micros] | 99.9% = [avg: 55.272 micros, max: 2.23 millis] | 99.99% = [avg: 58.026 micros, max: 2.809 millis] | 99.999% = [avg: 58.026 micros, max: 2.809 millis]

-Xcomp -XX:-TieredCompilation with warm-up

Iterations: 1,000 | Avg Time: 6.803 micros | Min Time: 5.741 micros | Max Time: 97.289 micros | 75% = [avg: 6.443 micros, max: 6.737 micros] | 90% = [avg: 6.499 micros, max: 6.88 micros] | 99% = [avg: 6.6 micros, max: 10.323 micros] | 99.9% = [avg: 6.712 micros, max: 24.872 micros] | 99.99% = [avg: 6.802 micros, max: 97.289 micros] | 99.999% = [avg: 6.802 micros, max: 97.289 micros]

-Xcomp -XX:-TieredCompilation without warm-up

Iterations: 1,000 | Avg Time: 7.005 micros | Min Time: 6.029 micros | Max Time: 126.315 micros | 75% = [avg: 6.545 micros, max: 6.994 micros] | 90% = [avg: 6.625 micros, max: 7.084 micros] | 99% = [avg: 6.737 micros, max: 11.505 micros] | 99.9% = [avg: 6.885 micros, max: 47.461 micros] | 99.99% = [avg: 7.005 micros, max: 126.315 micros] | 99.999% = [avg: 7.005 micros, max: 126.315 micros]

As you can see from the latency numbers above, by using -Xcomp -XX:-TieredCompilation we can mitigate the warm-up time by paying a price in peak performance (an average of 7.005 micros versus an average of 4.372 micros). This emphasizes the value of runtime information (i.e. profiling) for the critical-path optimizations performed by the HotSpot JIT compiler. Without profiling, there is only so much an AOT compiler can do, and the most aggressive optimizations may not be applicable beforehand. Of course this conclusion cannot be generalized to every application, as it will depend heavily on the characteristics and particularities of the source code and its critical path.

CoralSequencer with Azul Zing ReadyNow

We performed three training runs of our application to record three generations of the ReadyNow profile log as instructed by the ReadyNow guide. On each training run we performed 3 million iterations of the critical path. The size of the final profile log was 4.5 megabytes.

Benchmark Environment

$ java -version
openjdk version "21.0.4" 2024-09-27 LTS
OpenJDK Runtime Environment Zing24.09.0.0+5 (build 21.0.4+4-LTS)
Zing 64-Bit Tiered VM Zing24.09.0.0+5 (build 21.0.4-zing_24.09.0.0-b5-release-linux-X86_64, mixed mode)

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS \n \l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz

Regular Zing JIT with warm-up

Iterations: 1,000 | Avg Time: 4.175 micros | Min Time: 3.317 micros | Max Time: 90.359 micros | 75% = [avg: 3.841 micros, max: 4.095 micros] | 90% = [avg: 3.923 micros, max: 4.418 micros] | 99% = [avg: 3.997 micros, max: 5.768 micros] | 99.9% = [avg: 4.088 micros, max: 20.067 micros] | 99.99% = [avg: 4.175 micros, max: 90.359 micros] | 99.999% = [avg: 4.175 micros, max: 90.359 micros]

Regular Zing JIT without warm-up

Iterations: 1,000 | Avg Time: 55.314 micros | Min Time: 21.089 micros | Max Time: 5.628 millis | 75% = [avg: 33.077 micros, max: 53.887 micros] | 90% = [avg: 38.055 micros, max: 75.808 micros] | 99% = [avg: 43.08 micros, max: 133.613 micros] | 99.9% = [avg: 49.735 micros, max: 2.563 millis] | 99.99% = [avg: 55.314 micros, max: 5.628 millis] | 99.999% = [avg: 55.314 micros, max: 5.628 millis]

Zing ReadyNow with warm-up

Iterations: 1,000 | Avg Time: 4.273 micros | Min Time: 3.396 micros | Max Time: 94.648 micros | 75% = [avg: 3.905 micros, max: 4.126 micros] | 90% = [avg: 3.987 micros, max: 4.501 micros] | 99% = [avg: 4.066 micros, max: 6.042 micros] | 99.9% = [avg: 4.182 micros, max: 21.433 micros] | 99.99% = [avg: 4.272 micros, max: 94.648 micros] | 99.999% = [avg: 4.272 micros, max: 94.648 micros]

Zing ReadyNow without warm-up

Iterations: 1,000 | Avg Time: 29.47 micros | Min Time: 18.966 micros | Max Time: 279.449 micros | 75% = [avg: 24.666 micros, max: 32.724 micros] | 90% = [avg: 26.66 micros, max: 42.978 micros] | 99% = [avg: 28.679 micros, max: 71.379 micros] | 99.9% = [avg: 29.219 micros, max: 141.065 micros] | 99.99% = [avg: 29.469 micros, max: 279.449 micros] | 99.999% = [avg: 29.469 micros, max: 279.449 micros]

As you can see from the latency numbers above, by using ReadyNow we obtained close to a 50% improvement in performance without warm-up (an average of 29.470 micros versus an average of 55.314 micros). We were also able to limit the outliers above the 99.9th percentile (a max of 141.065 micros versus a max of 2.563 millis) before warming up. After warming up, the results with ReadyNow were similar to the results without ReadyNow (an average of 4.273 micros versus 4.175 micros, and a min of 3.396 micros versus 3.317 micros).

Project Leyden from Oracle

At the time of this article (Nov/2024) Project Leyden from Oracle is fairly new (May/2022), but it has been making great progress in the Java warm-up area. Our opinion, based on our own experiments with CoralSequencer and GraalVM, is that AOT is not as fast as JIT (after the code has warmed up), so real-time and previously recorded (archived) profiling information becomes crucial for achieving maximum AOT + JIT performance together with minimum time-to-peak (i.e. quick warm-up). It is important to emphasize that this AOT vs. JIT conclusion applies to CoralSequencer in particular and cannot be generalized to every application, as it will depend heavily on the characteristics of the source code and its critical path. It is also important to clarify that by fast we mean the ability to achieve the lowest possible latency for the critical path; we are not referring to throughput, JVM start-up time or application start-up time.

That said, we are particularly excited about the following JEPs:

  • JEP draft 8325147: Ahead-of-Time Method Profiling => Method profiles from training runs are stored in the CDS archive, thereby enabling the JIT to begin compiling earlier during warmup. As a result, Java applications can reach peak performance faster. This feature is enabled by the VM flags -XX:+RecordTraining and -XX:+ReplayTraining.
  • JEP draft 8335368: Ahead-of-Time Code Compilation => Methods that are frequently used during the training run can be compiled and stored along with the CDS archive. As a result, as soon as the application starts up in the production run, its methods can be natively executed. This feature is enabled by the VM flags -XX:+StoreCachedCode, -XX:+LoadCachedCode, and -XX:CachedCodeFile.

We are currently working on testing CoralSequencer with the early-access builds of Project Leyden and we’ll report our findings soon.