Friday, May 26, 2023

A Stroll Down Memory Lane

Navigating the intricacies of observing Java Heap usage

While working on the final touches for the 'Live Heap Profiling' feature, I found myself needing to access the current Java heap usage from native code. I expected this to be straightforward, but the actual code turned into an intricate series of workarounds, or as I may euphemistically call it, a hot mess of hacks.

Unfortunately, there isn't a standardized method for querying the Java heap usage from the native code via either JVMTI or JNI proper.

Introducing Live Heap Profiling

Let's start with an overview of what I aim to accomplish with the 'Live Heap Profiling' feature.

The goal of this feature is to produce a sample of all live objects on the heap (live refers to objects not yet garbage collected). At a conceptual level, this is relatively simple, especially when using the JVMTI Allocation Sampler (introduced in JDK 11) to guide the sampling process. The process becomes complex, however, when there is a need to 'reconstruct' the live heap size from this data.

Users nowadays expect comprehensive information. They aren't satisfied with just raw samples; they desire to understand how these samples relate to the total byte count. Addressing this demand introduces a multitude of challenges.

The JVMTI Allocation Sampler has been a critical tool in our standard 'Allocation Profiling' feature, helping us estimate the number of bytes each sample covers. We took inspiration from Go runtime's implementation of a similar feature, a logical choice considering the JVMTI Allocation Sampler is a ported version of Go's allocation sampler.

However, the 'upscaling' method we use for allocation samples can't be applied to the live heap samples as is. Garbage collection (GC) performs its own subsampling by collecting some of the instances referenced by the allocation samples, and this subsampling has no statistical properties that I know of. Yet, the proportions between the upscaling factors of the preserved allocation samples should give us a decent approximation of the 'live' heap used by those samples. This solution seems fairly straightforward, except it requires access to the Java heap usage, ideally right after a GC cycle has finished.

Integrating Java Heap Usage

Typically, we obtain the Java heap usage by using JMX and executing ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().
This will yield an instance containing values for initial, committed, used, and maximum heap size. Although this assumes JMX is available and initialized, that's generally a safe assumption nowadays.
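As a minimal sketch of the JMX route described above, the snippet below reads the heap usage through the platform MemoryMXBean. The class name HeapUsageProbe is made up for illustration; only the java.lang.management calls are standard API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper name; the JMX calls themselves are standard JDK API.
final class HeapUsageProbe {
    /** Reads the current heap usage through the platform MemoryMXBean. */
    static MemoryUsage currentHeapUsage() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage usage = currentHeapUsage();
        // getInit() and getMax() may return -1 when the value is undefined.
        System.out.printf("init=%d used=%d committed=%d max=%d%n",
                usage.getInit(), usage.getUsed(), usage.getCommitted(), usage.getMax());
    }
}
```

The 'used' value is the one that would be handed over to the native profiler.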

Next, we need to make this value accessible to the profiler. As the profiler is a native library, we would need to use JNI to transfer the value over. This could be feasible if we don't need to execute this action frequently. So far, so good.

However, this only provides us with an 'undead' size value - the size of the heap containing both live objects and those not yet processed by the GC, essentially 'undead'. While not optimal, it's a decent starting point.

Next Up, The Live Heap Size

Once we've set up the code to capture the heap usage and report it to the profiler, we can upscale the live heap samples to more accurately reflect the actual live heap. This provides users with a clearer picture of what's going on.
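One way to picture the upscaling step is a proportional split of the reported heap usage across the surviving samples. The sketch below is only an illustration under that assumption: the class name, the method name and the idea of passing the allocation-time weights as a plain array are hypothetical, not the profiler's actual code.

```java
import java.util.Arrays;

// Hypothetical illustration: distribute the observed heap usage across live
// samples in proportion to the weights they carried at allocation time.
final class LiveHeapUpscaler {
    static long[] upscale(double[] sampleWeights, long heapUsedBytes) {
        double total = Arrays.stream(sampleWeights).sum();
        long[] estimated = new long[sampleWeights.length];
        if (total <= 0) {
            return estimated; // no surviving samples, nothing to attribute
        }
        for (int i = 0; i < sampleWeights.length; i++) {
            estimated[i] = Math.round(heapUsedBytes * (sampleWeights[i] / total));
        }
        return estimated;
    }

    public static void main(String[] args) {
        // Three surviving samples with relative weights 1:3:4 and 800 bytes of used heap.
        System.out.println(Arrays.toString(upscale(new double[] {1, 3, 4}, 800)));
        // prints [100, 300, 400]
    }
}
```

Only the ratios between the weights matter here, which is why the GC-induced subsampling does not need to be modeled explicitly.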

But we can still push for better - after all, we're talking about 'Live Heap Profiling' here.

In an ideal world, another field in the memory usage object returned by the JMX call would provide us with the used heap size right after the most recent GC cycle. This could give us an accurate estimate of the live heap size as seen by the GC itself. But, alas, this field doesn't exist.

A knee-jerk solution would be to subscribe to JMX notifications on all reported GarbageCollectorMXBean instances and capture the heap usage right after each GC cycle. Although theoretically possible, JMX notifications aren't guaranteed to be delivered. Moreover, they tend to generate a lot of superfluous objects that keep the GC busy and could distort the profiling results. Adding to the pile of issues, the notifications are delayed, and the usage does not accurately reflect the 'clean' state post-GC cycle.
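For reference, the notification-based approach would look roughly like the sketch below. It relies on the com.sun.management.GarbageCollectionNotificationInfo API, which is HotSpot-specific; the class and method names (GcUsageListener, install, heapUsed) are made up for this example. The drawbacks listed above still apply, so treat it as an illustration of the idea rather than a recommendation.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongConsumer;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

final class GcUsageListener {
    /** Names of the memory pools that belong to the Java heap. */
    static Set<String> heapPoolNames() {
        Set<String> names = new HashSet<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    /** Sums 'used' over the heap pools only, ignoring metaspace, code cache etc. */
    static long heapUsed(Map<String, MemoryUsage> usageAfterGc, Set<String> heapPools) {
        long total = 0;
        for (Map.Entry<String, MemoryUsage> e : usageAfterGc.entrySet()) {
            if (heapPools.contains(e.getKey())) {
                total += e.getValue().getUsed();
            }
        }
        return total;
    }

    /** Subscribes to all GC beans that emit notifications; returns the number of subscriptions. */
    static int install(LongConsumer onHeapUsedAfterGc) {
        Set<String> heapPools = heapPoolNames();
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            onHeapUsedAfterGc.accept(heapUsed(info.getGcInfo().getMemoryUsageAfterGc(), heapPools));
        };
        int subscribed = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc instanceof NotificationEmitter) {
                ((NotificationEmitter) gc).addNotificationListener(listener, null, null);
                subscribed++;
            }
        }
        return subscribed;
    }

    public static void main(String[] args) throws InterruptedException {
        install(used -> System.out.println("heap used after GC: " + used + " bytes"));
        System.gc();
        Thread.sleep(500); // notifications arrive asynchronously, if at all
    }
}
```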

Luckily, there's another mechanism for observing GC activity in JVMTI. One can register a callback which executes each time GC completes its cycle. And, by a fortunate coincidence, we're already using this callback in our profiler to update the information about the 'liveness' of the allocation samples.

So, the GC notification mechanism is effective and lightweight (unless you unintentionally wreak havoc with Weak Global References in that callback, causing GC to go haywire). The only missing piece of the puzzle is to capture the heap usage value once we know a GC cycle has just finished.

However, this turned out to be a Herculean task. After combing through the JVMTI and JNI documentation, I'm convinced that there is a glaring oversight in the APIs, preventing native code from querying the Java heap usage. This discovery led me to explore alternative solutions.

Could We Call Back to Java?

Considering we can get the memory usage via JMX, perhaps we could initiate a series of Java upcalls to retrieve the heap usage.

But this is a flawed concept. Initiating Java upcalls from a GC callback is ill-advised: the upcalls are resource-intensive (parameter marshalling, crossing the JVM boundary and exception handling are all not exactly free), and transitioning back to Java from a GC callback is potentially problematic.

Perhaps Almost Calling Back to Java?

Interestingly, JMX delegates to a native method to gather the heap usage information. Thus, the information is accessible from the native code; there's just no clear-cut way to retrieve it from outside the JVM itself.

Did you know it's possible to intercept the binding of Java native methods to their native counterparts? You can do this from JVMTI, and this interception provides you with the address of the native function to bind to. Pretty neat, right?

We can exploit this interception to establish a function pointer to the native method that generates the MemoryUsage object for JMX. Once we have the function pointer, we can call the function, use JNI to break down the returned object into its primitive components, and utilize that information. This eliminates the need for Java upcalls and is relatively safe to execute even from the GC callback.

So, this seems to be a viable option. It does, however, have its shortcomings. The profiler must be initialized before the MemoryMXBean.getHeapMemoryUsage() method is first invoked; otherwise, the native method would have already been bound. Also, the profiler must force the binding by calling that method upon initialization; otherwise, the function pointer won't be captured unless some other code calls MemoryMXBean.getHeapMemoryUsage().

These are some potential drawbacks, but perhaps we can manage them.

Yet, could we do even better?

Welcome, VMStructs

While examining the JDK sources, I realized that the CollectedHeap class provides most of the information I need. But how do I access this internal class, and particularly, a usable instance of it?

Interestingly, there's a VMStructs class (defined in vmStructs.hpp and vmStructs.cpp files), originally created for "Sun Solaris Studio" (similar to the renowned AsyncGetCallTrace). This class has a well-known memory layout, described in the vmStructs.hpp file, which can be used to access many intriguing JVM internals.

The async-profiler, which our Java profiler heavily relies on, is already using VMStructs. Without it, most of the wizardry the async-profiler performs wouldn't be feasible.

You can get the address of the CollectedHeap instance from VMStructs, identified as the _collectedHeap field definition (not the C++ class field). In reality, the address points to a CollectedHeap subclass instance (e.g., G1CollectedHeap), but this is an implementation detail. The key point is that we can reach the CollectedHeap instance to obtain the heap memory usage.

Once we have the heap instance, we need to be able to call its methods. However, there are no exported header files we could use, so we will extend what the async-profiler does and try resolving symbols from the libjvm library. This process is manual and requires identifying the C++ mangled symbols corresponding to the source file methods we're trying to invoke. This task is made even more tedious by the fact that the symbols can and do vary between JDK versions and vendors.

After some investigation, I identified the following symbols to use:

_ZN13CollectedHeap12memory_usageEv for JDK 17+
_ZN13CollectedHeap19create_heap_summaryEv for JDK 11
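To make the mangled names less opaque: in the Itanium C++ ABI mangling used by GCC and Clang, `_ZN ... E` wraps a nested name made of length-prefixed components, and the trailing `v` stands for an empty parameter list. The toy decoder below handles only this simple shape and is meant as a reading aid, not a general demangler; the class name SimpleDemangler is made up.

```java
// Toy decoder for symbols of the form _ZN<len><name><len><name>...Ev.
// Anything else (templates, qualifiers, real parameter lists) is rejected.
final class SimpleDemangler {
    static String demangle(String symbol) {
        if (!symbol.startsWith("_ZN")) {
            throw new IllegalArgumentException("unsupported symbol: " + symbol);
        }
        int pos = 3;
        StringBuilder name = new StringBuilder();
        while (symbol.charAt(pos) != 'E') {
            // Each component is a decimal length followed by that many characters.
            int len = 0;
            while (Character.isDigit(symbol.charAt(pos))) {
                len = len * 10 + (symbol.charAt(pos++) - '0');
            }
            if (len == 0) {
                throw new IllegalArgumentException("unsupported symbol: " + symbol);
            }
            if (name.length() > 0) {
                name.append("::");
            }
            name.append(symbol, pos, pos + len);
            pos += len;
        }
        if (!"v".equals(symbol.substring(pos + 1))) {
            throw new IllegalArgumentException("only no-argument functions are supported: " + symbol);
        }
        return name + "()";
    }

    public static void main(String[] args) {
        System.out.println(demangle("_ZN13CollectedHeap12memory_usageEv"));
        // prints CollectedHeap::memory_usage()
        System.out.println(demangle("_ZN13CollectedHeap19create_heap_summaryEv"));
        // prints CollectedHeap::create_heap_summary()
    }
}
```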

I didn't attempt to find the symbol for JDK 8 because live heap profiling isn't supported on JDK 8, but the process would be the same if you ever need this functionality for that particular Java version.

With these function pointers and the CollectedHeap instance, we can actually make the calls to get the heap usage. Naturally, these calls will return JDK version-specific objects, whose layouts would need to be copied from the JDK sources, unless you're willing to manually interpret the memory bytes as field values.

And there you have it! Well, it wasn't exactly a walk in the park. It took me a while to piece together all the components I needed for this approach to work. However, it is operational and quite lightweight - as long as you're running on a HotSpot VM.

Unfortunately, other VMs are not supported as they lack the necessary symbols, VMStructs, or both.

What More Can We Extract From CollectedHeap?

While examining the CollectedHeap class to determine which methods should be used to retrieve heap usage, I noticed two very intriguing fields in JDK 17:

capacity_at_last_gc
used_at_last_gc

These fields contain historical GC information updated with each GC cycle.

This seems to be exactly what I was looking for - data on live heap size at the time of the last GC cycle! Regrettably, these fields are available on the CollectedHeap class only for JDK 17+ - but it's better than nothing.
In JDK 11 the same information exists, but only in some of the CollectedHeap subclasses.
Making use of that would require some form of runtime introspection to determine which CollectedHeap subclass instance is stored in the _collectedHeap field, and that's a challenge I haven't tackled yet.

Bringing It All Together

Recall when I referred to the final implementation of this functionality as a hot mess of hacks?

Here's the reason - to support the range of JDK versions and vendors, we need to ensure compatibility with all three modes: native-binding interception to get the JMX MemoryUsage equivalent, natively getting heap usage in the GC callback, and using the historical GC information from CollectedHeap when it's available. We need to try and capture the data for all these modes and then dispatch to the supported mode just in time.

In summary, the best performance will be available for JDK 17+ with HotSpot. Acceptable performance will be achieved with JDK 11 (again, with HotSpot), and any other setup will at least mostly work.

While the solution is not without its downsides and might appear daunting, it works well in practice. Now, we have live heap profiling at our disposal, providing detailed insight into the runtime heap and effectively enhancing the capabilities of our Java profiler once the changes are properly tested and merged in.

Monday, April 12, 2021

Dead or Alive (A short tale of Java heap live set)

While working on my now closed (as not done) PR to provide new JFR events containing an approximation of the heap live set size, I happened to learn a bit about how liveness is tracked in various GC implementations and how usable (or not) the heap live set size estimation is.

When talking about liveness and heap live set I mean the object graph that GC deems reachable and alive. Logically, anything that is not reachable is considered dead.

In this post I will try to summarize my findings in a concise way, mostly as a reminder for me when I need to come back to this topic some day, but also in hopes that the others might find it useful as well.

TL;DR

If you are not interested in all the gory details you can skip this section.

Usually, the GC implementations are not tracking the liveness information explicitly and it becomes available only shortly after a major GC cycle and becomes rapidly outdated.

But it turns out that, using knowledge of the inner workings of various GCs, it is possible to create a somewhat representative estimate even after each minor GC cycle.

As mentioned above, the way to calculate the estimate is specific to each GC implementation.

Epsilon GC

Ok. This one is extremely trivial. Anything on the heap is considered live for all that this GC implementation knows. Epsilon GC is not performing any garbage collection and therefore has no way of determining the liveness of objects stored on the heap.

Serial GC

Here we can utilize the fact that mark-copy (minor GC) phase is naturally compacting so the number of bytes after copy is 'live' and that the mark-sweep (major GC) implementation keeps an internal info about objects being 'dead' but excluded from the compaction effort.

The ‘dead wood’, as those objects are referred to, consists of unreachable instances which should be removed and whose space should be compacted, but doing so would have an unreasonable cost compared to the memory released by such an operation.

All this means that we can use the ‘dead-wood’ size to derive the old-gen live set size as used bytes minus the cumulative size of the 'dead-wood'.

Parallel GC

For Parallel GC the live set size estimate can be calculated as the sum of used bytes in all regions after the last GC cycle.

This seems to be a safe bet because this collector is always compacting and the number of used bytes corresponds to the actual live set size.

G1 GC

G1 GC is already keeping liveness information per region.

During concurrent mark processing phase we can utilize the region liveness info and just sum up the liveness information from all the relevant regions - by adding the live size summation into G1UpdateRemSetTrackingBeforeRebuild::do_heap_region() method.

For mixed and full GC cycles we can use the number of used bytes as reported at the end of G1CollectedHeap::gc_epilogue() method.

The downside of G1 GC is that concurrent mark is not happening very frequently (only if IHOP > 45% by default) and full GC cycles are basically a failure mode, meaning that the liveness information can go stale quite often.

Shenandoah GC

Shenandoah, as well as G1, is keeping the per-region liveness info.

This can be easily exploited by hooking into ShenandoahHeuristics::choose_collection_set() method which already does full region iteration after each mark phase where the per-region liveness would be just summed up.

ZGC

Actually, ZGC can readily provide the information about the live set size at the end of the most recent mark phase. The `ZStatHeap` class is already holding the liveness info - it just needs to be made public.

Conclusion

The 'traditional' GC implementations like Serial GC or Parallel GC are trying really hard to avoid computing the liveness information. Usually, the only time they will guarantee up-to-date and complete liveness information is right after a full GC has happened and the VM is still in a safepoint. Once the VM leaves the safepoint that information becomes stale almost immediately. If one decided to do a full-heap inspection to calculate the live set size at an arbitrary point in time, it would have to be done in another JVM safepoint and it could easily take several hundred milliseconds, compromising the application responsiveness.

G1 GC keeps the liveness information per region - but that information will usually be updated very infrequently as it is done during a marking run. And that means that in order to get up-to-date liveness info it is necessary to perform a full-heap inspection with all the previously mentioned drawbacks.

The good news is that both ZGC and Shenandoah are keeping fairly up-to-date liveness info which can be used almost immediately. In their current implementations (at the time of writing this article) the liveness information is not exposed publicly, but there are no technical reasons why it could not be done - either via custom JFR events and/or GC specific MX Beans.

Sunday, January 24, 2021

Improved JFR allocation profiling in JDK 16

JFR (JDK Flight Recorder) has been providing support for on-the-fly allocation profiling for a while with the help of ObjectAllocationInNewTLAB and ObjectAllocationOutsideTLAB events.

Short excursion - what is TLAB?

TLAB stands for Thread Local Allocation Buffer and it is a region inside Eden, which is exclusively assigned to a thread. Each thread has its own TLAB. Thanks to that, as long as objects are allocated in TLABs, there is no need for any type of synchronization.
Allocation inside a TLAB is a simple pointer bump (that’s why it’s sometimes called pointer bump allocation).

For more detailed info I recommend reading the following articles [https://dzone.com/articles/thread-local-allocation-buffers, https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation] - or simply search the web for 'java TLAB' and follow one of many blogs, articles and StackOverflow entries available out there.

ObjectAllocationInNewTLAB event is emitted each time the TLAB is filled up and contains the information about the instance type, the instance size and the full stack trace where the allocation happened.

ObjectAllocationOutsideTLAB event is fired when the size of the instance to be allocated is bigger than the TLAB size. Also for this event JFR provides the instance type, size and the stacktrace where the allocation was initiated.
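The interplay between the pointer bump and the two events can be pictured with a toy model. The sketch below is not HotSpot code: it ignores refill heuristics, alignment and multi-threading, and the class ToyTlab with its Outcome values is invented for illustration only.

```java
// Toy model of TLAB allocation for a single thread.
final class ToyTlab {
    enum Outcome {
        FAST_PATH,     // pointer bump inside the current TLAB, no event
        NEW_TLAB,      // current TLAB is full; corresponds to ObjectAllocationInNewTLAB
        OUTSIDE_TLAB   // instance larger than a TLAB; corresponds to ObjectAllocationOutsideTLAB
    }

    private final long tlabSize;
    private long top; // offset of the next free byte in the current TLAB

    ToyTlab(long tlabSize) {
        this.tlabSize = tlabSize;
    }

    Outcome allocate(long size) {
        if (size > tlabSize) {
            return Outcome.OUTSIDE_TLAB; // allocated directly in the shared heap
        }
        if (top + size <= tlabSize) {
            top += size; // the pointer bump
            return Outcome.FAST_PATH;
        }
        top = size; // retire the old TLAB and place the instance in a fresh one
        return Outcome.NEW_TLAB;
    }

    public static void main(String[] args) {
        ToyTlab tlab = new ToyTlab(100);
        for (long size : new long[] {40, 40, 40, 200}) {
            System.out.println(size + " bytes -> " + tlab.allocate(size));
        }
    }
}
```

In this model only the NEW_TLAB and OUTSIDE_TLAB outcomes would produce an event, which is what makes the approach comparatively cheap.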

TLAB based allocation sampling

By utilising the properties of these two events, namely the fact that they are emitted only when TLAB is filled up or the instance size is bigger than TLAB we can relatively cheaply obtain a heap allocation profile which can help us to identify allocation hot-spots.

Expectations vs. reality

At DataDog Continuous Profiler we have been using the described technique to provide continuous, always-on heap allocation profiling with great success. That is, until we started receiving reports of significant performance degradation from our internal teams. The performance was restored the moment they disabled the TLAB related events and, with them, our allocation profiling.

Well, it turns out that some applications can generate a huuuuge amount of TLAB events simply due to the enormous number of allocations. And while collecting stacktraces in JFR is pretty swift, once the number of collections crosses a certain threshold the cost becomes clearly visible. The performance problems were made more prominent by a bug in the stacktrace hashcode calculation, which has since been fixed by my DataDog colleague Gabriel Reid and backported to all relevant JDK versions (JDK 8u282, 11.0.10 and 15.0.2, to mention a few).

In addition to this performance regression the TLAB events contributed to a very significant increase in the recording size - again causing problems for our continuous profiler which needs to deal with hundreds of thousands such profiles daily. And to solve the size problem we would need a new way of collecting allocation samples which would guarantee a maximum number of samples per recording (to have a predictable recording size) while still providing statistically accurate picture.

Rate limited allocation profiling

This is an incremental improvement on top of the TLAB based allocation profiling. It is using the same data source (TLAB filled up, outside of TLAB allocation) but is applying an adaptive throttling mechanism to guarantee the maximum number of samples per recording (or time unit, more generally speaking).

The samples are emitted as ObjectAllocationSample events with throttled emission rate. The emission rate is customisable in JFC templates and defaults to 150 samples per second for the 'default' JFR profile and 300 samples per second for the 'profile' JFR profile.
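Besides JFC templates, the same setting can be applied programmatically through the jdk.jfr API. The sketch below assumes a JDK 16+ runtime (on older versions the event type does not exist and the resulting list is simply empty); the class name, the temp-file handling and the synthetic workload are illustrative choices, not part of JFR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

final class ThrottledAllocationRecording {
    static final String EVENT = "jdk.ObjectAllocationSample";

    /** Runs the workload under a recording with the given throttle, e.g. "150/s". */
    static List<RecordedEvent> record(Runnable workload, String throttle) {
        try {
            Path file = Files.createTempFile("alloc-sample", ".jfr");
            try (Recording recording = new Recording()) {
                recording.enable(EVENT).with("throttle", throttle);
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
                List<RecordedEvent> samples = new ArrayList<>();
                for (RecordedEvent event : RecordingFile.readAllEvents(file)) {
                    if (EVENT.equals(event.getEventType().getName())) {
                        samples.add(event);
                    }
                }
                return samples;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<RecordedEvent> samples = record(() -> {
            long sink = 0;
            for (int i = 0; i < 2_000_000; i++) {
                sink += new byte[64].length; // synthetic allocation pressure
            }
            if (sink == 0) throw new AssertionError();
        }, "150/s");
        System.out.println("allocation samples recorded: " + samples.size());
    }
}
```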

The generic throttling option has been built into JFR but is currently in use only by the ObjectAllocationSample event.

Throttling sampler - implementation details

The main objective is to reduce the number of events emitted per time unit while maintaining statistical relevancy. Typically, this could be done by e.g. reservoir sampling, where a fixed-size set of random elements is maintained for the duration of the specified time unit.
Unfortunately, this simple approach would not work for JFR events - each event is associated with its stacktrace at the moment when the event is committed. Committing the event, in turn, writes the event into the global buffer and effectively publishes it. Because of the inability to split the event stack collection and publication a slightly more involved algorithm had to be devised.

JFR Adaptive Sampler

JFR adaptive sampler provides a generic support for controlling emission rate of JFR events. It is an 'online' sampler - meaning that the decision whether an event is to be sampled or not is done immediately without requiring any additional data structures like it is in the case of reservoir sampling. At the same time the sampler dynamically adjusts the effective sampling rate in order not to cross the maximum emission rate within the given time range (eg. per second) but still provide a good number of samples in 'quieter' periods.

Conceptually, the adaptive sampler is based on PID controller theory, although it is missing the derivative part, making it more similar to a PI controller. Also, due to the additional limits imposed by the fact that the sampler is going to be used in latency sensitive parts of the JVM, the implementation had to be embellished with several heuristics not found in the standard discrete implementation.

1 Quantization

The adaptive sampler work is quantised into a series of discrete time windows. An effective sampling rate (represented as a sample probability 0 <= P_w <= 1) is assigned to each time window and that rate will stay constant for the duration of that particular time window. That means that the probability by which an event is picked to sample is constant within a time window.

2 Dynamic sample probability

In order to adapt to the changes in the incoming data the effective sample probability is recalculated after each time window using an estimated number of incoming elements (population) in the next window (based on the historical trend), the target number of samples per window (derived from the user supplied target rate) and 'sample budget'.

The formula to recompute the sample probability for window 'w' is:

P_w = (S_target + S_budget) / N_{w,k}

P_w - effective sample probability for window 'w'

N_{w,k} - estimated population size, calculated as an exponentially weighted moving average over the k previous windows

S_target - target number of samples per window = user supplied target rate (samples per second) * window duration in seconds

S_budget - sample budget = sum_{i=w-x}^{w-1} (S_target - S_i), where S_i is the number of samples taken in window 'i'; the budget is calculated per fixed one second interval and 'x' is the number of windows since the beginning of that interval

3 Sample budget

As already mentioned in the previous paragraph the adaptive sampler is employing the concept of 'sample budget'. This budget serves as a 'shock absorber' for situations when the prediction is diametrically different from the reality. It is exploiting the relaxation in terms of obeying the sampling rate only for time intervals longer than X windows - meaning that the actual effective sampling rates observed in each separate time window within this interval can be significantly higher or lower than the requested sampling rate.
Hence the 'budget' term - windows producing fewer samples than the requested number will 'store' unused samples which can later be used by other windows to accommodate occasional bursts.
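The three ingredients (windows, recomputed probability, budget) can be put together in a small simulation. This is a simplified sketch and not the JFR implementation: the smoothing factor alpha, the initial probability of 1.0 and all names are assumptions made for the example. The budget here accumulates unused samples (target minus samples taken), in line with the 'store unused samples' description, and is reset at the start of every one second interval.

```java
import java.util.Random;

final class AdaptiveSamplerSketch {
    private final double targetPerWindow; // S_target
    private final int windowsPerSecond;   // the budget is reset after this many windows
    private final double alpha;           // EWMA smoothing factor (assumed parameter)
    private final Random random;

    private double estimatedPopulation;   // N_w,k
    private boolean hasEstimate;
    private double budget;                // S_budget
    private double probability = 1.0;     // P_w; arbitrary choice for the very first window
    private int windowsInInterval;
    private long seen;
    private long sampled;

    AdaptiveSamplerSketch(double targetRatePerSecond, int windowMillis, double alpha, long seed) {
        this.targetPerWindow = targetRatePerSecond * windowMillis / 1000.0;
        this.windowsPerSecond = Math.max(1, 1000 / windowMillis);
        this.alpha = alpha;
        this.random = new Random(seed);
    }

    /** P_w = (S_target + S_budget) / N_w,k, clamped to [0, 1]. */
    static double nextProbability(double target, double budget, double estimatedPopulation) {
        if (estimatedPopulation <= 0) {
            return 1.0; // nothing observed yet, so there is nothing to throttle
        }
        return Math.max(0.0, Math.min(1.0, (target + budget) / estimatedPopulation));
    }

    /** Online decision: made immediately, with the probability fixed for the window. */
    boolean sample() {
        seen++;
        if (random.nextDouble() < probability) {
            sampled++;
            return true;
        }
        return false;
    }

    /** Called at the end of each time window to recompute the probability. */
    void closeWindow() {
        estimatedPopulation = hasEstimate
                ? alpha * seen + (1 - alpha) * estimatedPopulation
                : seen;
        hasEstimate = true;
        budget += targetPerWindow - sampled; // unused samples are stored, overshoot is owed
        if (++windowsInInterval == windowsPerSecond) {
            windowsInInterval = 0;
            budget = 0; // a new one second interval starts with a clean budget
        }
        probability = nextProbability(targetPerWindow, budget, estimatedPopulation);
        seen = 0;
        sampled = 0;
    }

    double probability() {
        return probability;
    }

    public static void main(String[] args) {
        // 150 samples/s target, 20 ms windows, 1000 incoming events per window.
        AdaptiveSamplerSketch sampler = new AdaptiveSamplerSketch(150, 20, 0.5, 42L);
        for (int window = 0; window < 100; window++) {
            for (int i = 0; i < 1000; i++) {
                sampler.sample();
            }
            sampler.closeWindow();
        }
        System.out.println("probability after 100 windows: " + sampler.probability());
    }
}
```

Because the first window in this sketch samples everything, it immediately overdraws the budget and the following windows are throttled to zero until the interval ends; a production sampler would need a gentler start.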

Conclusion

The introduction of a throughput management mechanism in JFR allows getting fine details about the application behavior without the risk of being overwhelmed by the sheer number of JFR events.

The results of our preliminary tests of the setups previously completely unable to run with the allocation profiling events turned on are very exciting - JFR with event emission rate controller is able to provide a clear statistical picture of the allocation activity while keeping the recording size at a very manageable level thanks to the limit imposed on the number of captured TLAB events.

Also, let's ponder the fact that the event emission rate control (throttling) is not TLAB event specific and can be used for other event types as well if there is such demand. Although I am quite sure there will be no more throttled events in JDK 16, which is in its stabilisation phase as of the writing of this blog, it might be a good time to take a look at the potential candidates for JDK 17.

Sunday, December 6, 2020

[BTrace: update from trenches] - Unattended execution

A bit of history

BTrace origins go back more than 10 years and it shows that the modus operandi at that time was to have one JVM at a time and do all the experiments and debugging on that single JVM.
The standard workflow for BTrace dynamic attach was decided to be:

Identify the JVM to attach to
Attach to that JVM and deploy the probe(s)
Stay connected until you are satisfied with the results

While probably quite ok for the one JVM situation, it becomes quite a hindrance when trying to operate in ad-hoc mode for a bunch of JVMs running on several hosts at once (yes, k8s, talking about you). The lack of unattended execution basically makes BTrace unusable in modern environments - if dynamic attach is required.

Time to fix

So, after a period of procrastination I finally decided to add an unattended mode of execution to BTrace. It was both an easy and a hard task at the same time - the binary I/O protocol BTrace is using is very easy to extend but the underlying management of client 'sessions' had to be refactored slightly to allow disconnecting and reconnecting to a session without killing it. But nothing that could not have been done during a particularly grey and rainy COVID lockdown weekend.

So, here we go with a few improvements to the BTrace client which should make it much easier to use BTrace in fire&forget mode. I think this mode will become more and more popular with the JFR support, when one can easily define a dynamic JFR event type, deploy a probe to generate that event and leave the probe running, turning the JFR recording on and off as necessary.

'Detach client' client command
In addition to 'Exit' option in the BTrace CLI it is now possible to detach from the running session.
Upon detaching a unique probe ID is displayed which can be used to later reconnect to that probe.

'List probes' client command
This command will list any probes which were deployed and then left detached by their clients.

List probes from command line
Use btrace -lp <pid> to list the probes in detached mode in a particular JVM.

Reconnect to a detached probe
Use btrace -r <probe id> <pid> to reconnect to a detached probe and start receiving probe data.
The detached probes are maintaining a circular buffer for the latest data so you can get a bit of history after reconnecting as well.

Attach a probe and disconnect immediately
Useful shortcut for scripting BTrace deployments when the probe is deployed and the client disconnects immediately.
Use btrace <btrace options> -x <pid> <probe file> to run in this mode.
Upon disconnecting the probe ID is printed out so it can be e.g. processed and stored.

New possibilities

Exposing the listing of detached probes and the reconnect as command line switches opens the door to easy scripting. For example, one can write a quick one-liner to attach to a named probe:
./bin/btrace -r $(./bin/btrace -lp anagrams.jar | fgrep AllMethods1 | cut -f2 -d' ') anagrams.jar

The unattended execution support was checked in and development build binaries are available at https://github.com/btraceio/btrace/actions/runs/394037357

Saturday, November 21, 2020

[BTrace: update from trenches] - Experimental support for emitting JFR events, take two

In the previous post I have introduced the prototype of JFR support in BTrace.

The first attempt, however, was riddled with serious shortcomings - the events had to be defined externally and then added to boot classpath.
In addition to that the BTrace verifier had to be made more permeable to allow calls to JFR APIs - this caused the verifier complexity to increase significantly, opening potential holes to be exploited.

Fortunately, there is a very cool API directly in JDK which allows creating JFR event types dynamically, therefore removing the requirement to have the events defined beforehand and added to boot classpath. As an added benefit the refactoring allowed the use of the standard BTraceUtils accessor class for operating on JFR events, thus removing all the custom 'holes' in the BTrace verifier.
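The post does not name the API, so the following is an assumption: the JDK ships jdk.jfr.EventFactory, which creates event types at runtime, and the sketch below shows its general shape. The event name, the field names and the class DynamicEventSketch are hypothetical, and this is not the BTrace implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import jdk.jfr.AnnotationElement;
import jdk.jfr.Event;
import jdk.jfr.EventFactory;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.ValueDescriptor;

final class DynamicEventSketch {
    /** Defines a new event type at runtime, without a precompiled Event subclass. */
    static EventFactory createFactory(String eventName) {
        List<ValueDescriptor> fields = new ArrayList<>();
        fields.add(new ValueDescriptor(String.class, "message",
                Collections.singletonList(new AnnotationElement(Label.class, "Message"))));
        fields.add(new ValueDescriptor(int.class, "count"));

        List<AnnotationElement> annotations = new ArrayList<>();
        annotations.add(new AnnotationElement(Name.class, eventName));
        annotations.add(new AnnotationElement(Label.class, "Dynamic Example"));
        return EventFactory.create(annotations, fields);
    }

    /** Fields are addressed by their index in the descriptor list. */
    static void emit(EventFactory factory, String message, int count) {
        Event event = factory.newEvent();
        event.set(0, message);
        event.set(1, count);
        event.commit(); // recorded only if a running recording has the event enabled
    }

    public static void main(String[] args) {
        EventFactory factory = createFactory("example.DynamicProbeEvent");
        emit(factory, "hello from a dynamically defined event", 1);
        System.out.println("registered event type: " + factory.getEventType().getName());
    }
}
```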

The implementation is available on GitHub in the jfr_events branch. It is fairly complete (at least for the use cases I was able to come up with) but, as usual, user input is more than welcome.

The code example showing the intended usage follows.

Fig.1: Code Example

Sunday, September 6, 2020

[BTrace: update from trenches] - Experimental support for emitting JFR events

Java Flight Recorder (JFR) is an amazing piece of technology allowing collection of a huge amount of very detailed data points from the running application and (also) Java runtime.
It has been widely available since Java 11 and recently it has been backported to JDK 8 update 265 - thus covering all Java versions currently available (intentionally disregarding JDK 7 which I really hope will gracefully fade away very soon).

Having a standardized, low impact way to collect data from a running Java application is something BTrace can hugely benefit from.
Among other things this will allow seamless integration with tools already using JFR as their native format (JDK Mission Control or, recently, also VisualVM) creating synergy between 'free-form' instrumentation and well established perf analysis tools.

Although the implementation might seem quite trivial at first look, it quickly becomes more involved because the BTrace safety guards need to be modified and extended to allow easy cooperation with JFR events while not compromising the security guarantees. Also, it is imperative that working with JFR events does not require introducing and learning any new concepts - everything should be expressible via annotations and plain Java code.

After pondering all the requirements for a while I came up with the idea to split periodic and non-periodic event usage. This allows a neat registration of periodic event handlers while keeping the simple event handling code, well, simple.

Here are the proposed annotations:

@JfrPeriodicEventHandler

used to define the JFR related code which will be run at the beginning/end of a recording chunk or at a given time interval
the periodic event is passed in the handler as a method parameter by BTrace (the handler method must have exactly one argument which must be a jdk.jfr.Event subclass)
all safety rules known from e.g. @OnMethod handlers still apply, except for operations on the event instance

@JfrBlock

delimits the code block which allows creating new event instances and executing operation on those instances
the annotation may specify the list of event types which will be registered by BTrace (events may also be auto-registered so the usage depends on the actual event types)

Fig.1: Code Example

The custom events are to be developed using the standard JFR APIs and annotations and the resulting classes are to be packed in a jar file which will then be added to the bootClassPath BTrace agent argument. The events need to be added to the application bootstrap classpath in order for all possible instrumentations to have access to them (e.g. instrumented java.util classes etc.).
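For readers who have not written a JFR event before, a custom event built with the standard API looks roughly like this. The event name, its fields and the wrapper class are hypothetical examples; only the jdk.jfr types and annotations are real. The BTrace-specific annotations proposed above are deliberately not shown here.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

final class CustomEventSketch {
    // A plain JFR event: a static class extending jdk.jfr.Event with annotated fields.
    @Name("example.ProbeHit")
    @Label("Probe Hit")
    @Category("BTrace Example")
    static class ProbeHitEvent extends Event {
        @Label("Method")
        String method;

        @Label("Elapsed Time (ns)")
        long elapsedNanos;
    }

    static ProbeHitEvent emit(String method, long elapsedNanos) {
        ProbeHitEvent event = new ProbeHitEvent();
        event.method = method;
        event.elapsedNanos = elapsedNanos;
        event.commit(); // effectively a no-op unless a recording has enabled the event
        return event;
    }

    public static void main(String[] args) {
        ProbeHitEvent event = emit("java.util.ArrayList.add", 1_250L);
        System.out.println("committed event for " + event.method);
    }
}
```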

This is still an experimental prototype and I am looking for early testers to validate my assumptions in the wild. You can get the binaries at bintray or build BTrace from source using jfr_events branch.

Looking for feedback about the proposed notation and the expected usability.

Enjoy!

Monday, March 16, 2020

Performance impact of JFR on JDK8u

Recently JFR was integrated into JDK8u repository meaning that JFR will be available in the next public Java 8 update which happens to be JDK8u262 in July 2020 - yay!

Before the final integration some concerns were voiced regarding a performance degradation in JFR enabled builds - even when a JFR recording was not started.

This really didn't correspond to my experience so I decided to give it a quick spin with a standard benchmark suite. For the licensing terms and the ease of getting such a suite I picked SPECjvm2008 - even though rather outdated it still can provide meaningful numbers.

Setup

dedicated c5.2xlarge AWS instance running Ubuntu 18.04
jdk8 builds (with and without JFR) downloaded from https://builds.shipilev.net/
SPECjvm2008 downloaded from https://www.spec.org/jvm2008/

Run

The benchmarks were run with the '-i 7' argument to force 7 iterations for each particular case. This should reduce the test jitter. Unfortunately, it also seems to make these results 'non-compliant', but that should be fine for this quick check.

The runs (with and without JFR) were done in sequence to remove all interference and the host was fully dedicated to benchmarks.

Results

Results are, well, unsurprising. There is no statistically significant difference between the same JDK8 build with and without JFR and no JFR recording started. The overall composite scores are for all purposes equal.

In the following table you can find a more detailed breakdown of the benchmark runs - the 'diff' column shows the performance difference between the run with and without JFR where regression is indicated by a negative number.

Benchmark name	base (ops/m)	jfr (ops/m)	diff (ops/m)
compiler	619.37	633.77	14.4
compress	339.33	340.85	1.52
crypto	586.48	595.06	8.58
derby	713.85	732.22	18.37
mpegaudio	206.23	207.7	1.47
scimark.large	107.43	109.81	2.38
scimark.small	492.93	482.03	-10.9
serial	238.14	246.33	8.19
startup	50.73	51.19	0.46
sunflow	131.18	132.29	1.11
xml	811.26	814.39	3.13

composite	290.33	293.76	3.43

Addendum

In parallel to this quick SPECjvm2008 run, we at DataDog also ran a bunch of more exhaustive benchmarks which happen to be internal and therefore not reproducible in public. But they all confirm the initial hunch that there would be no performance regression whatsoever for JFR enabled JDK8u - meaning that you can go and enjoy JFR on JDK8u!