Monday, April 12, 2021

Dead or Alive (A short tale of Java heap live set)



While working on my now closed (as not done) PR to provide new JFR events containing an approximation of the heap live set size, I happened to learn a bit about how liveness is tracked in various GC implementations and how usable (or not) the heap live set size estimation is.


When talking about liveness and the heap live set I mean the object graph that the GC deems reachable and therefore alive. Logically, anything that is not reachable is considered dead.


In this post I will try to summarize my findings in a concise way, mostly as a reminder for myself when I need to come back to this topic some day, but also in the hope that others might find it useful as well.



TL;DR

If you are not interested in all the gory details, this section sums things up and you can safely skip the rest.


Usually, GC implementations do not track liveness information explicitly; it becomes available only briefly after a major GC cycle and rapidly goes stale.

But it turns out that, using knowledge of the inner workings of the various GCs, it is possible to create a somewhat representative estimate even after each minor GC cycle.

As mentioned above, the way to calculate the estimate is specific to each GC implementation.

Epsilon GC

Ok. This one is extremely trivial. Epsilon GC does not perform any garbage collection and therefore has no way of determining the liveness of objects stored on the heap - as far as this implementation knows, anything on the heap is live.

Serial GC

Here we can utilize two facts: the mark-copy (minor GC) phase is naturally compacting, so the number of bytes after the copy is 'live'; and the mark-sweep (major GC) implementation keeps internal information about objects that are 'dead' but excluded from the compaction effort.

The 'dead wood', as those objects are referred to, consists of unreachable instances which should be removed and whose space should be compacted, but doing so would have an unreasonable cost compared to the memory released by such an operation.

All this means that we can derive the old-gen live set size as the number of used bytes minus the cumulative size of the 'dead wood'.
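The arithmetic is simple enough to sketch. The following is an illustrative model only - the class, method names and numbers are made up for the example, not actual HotSpot code:

```java
// Sketch: deriving the Serial GC old-gen live set estimate from the
// 'dead wood' bookkeeping. All names and numbers are illustrative.
public class SerialLiveSetEstimate {
    // live set = used bytes minus the cumulative size of the 'dead wood'
    static long liveSetEstimate(long usedBytes, long[] deadWoodSizes) {
        long deadWood = 0;
        for (long size : deadWoodSizes) {
            deadWood += size; // objects excluded from the compaction effort
        }
        return usedBytes - deadWood;
    }

    public static void main(String[] args) {
        long used = 512L * 1024 * 1024;                   // 512 MiB used in old gen
        long[] deadWood = {4L << 20, 2L << 20, 1L << 20}; // 7 MiB of dead wood
        System.out.println(liveSetEstimate(used, deadWood));
    }
}
```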

Parallel GC

For Parallel GC the live set size estimate can be calculated as the sum of used bytes in all regions after the last GC cycle.

This seems to be a safe bet because this collector is always compacting and the number of used bytes corresponds to the actual live set size.

G1 GC

G1 GC is already keeping liveness information per region.


During the concurrent mark phase we can utilize the region liveness info and simply sum it up over all the relevant regions - e.g. by adding the live size summation to the G1UpdateRemSetTrackingBeforeRebuild::do_heap_region() method.


For mixed and full GC cycles we can use the number of used bytes as reported at the end of G1CollectedHeap::gc_epilogue() method.


The downside of G1 GC is that concurrent mark does not happen very frequently (only when the old generation occupancy crosses the IHOP threshold, 45% by default) and full GC cycles are basically a failure mode, meaning that the liveness information can go stale quite often.

Shenandoah GC

Shenandoah, like G1, keeps per-region liveness info.

This can be easily exploited by hooking into the ShenandoahHeuristics::choose_collection_set() method, which already performs a full region iteration after each mark phase, and simply summing up the per-region liveness there.
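The summation itself is trivial; modeled in Java rather than the actual C++ region iteration (the type and method names below are hypothetical stand-ins):

```java
import java.util.List;

// Illustrative model of the per-region liveness summation described above.
// In HotSpot this would happen inside the C++ region iteration; here each
// region is reduced to its recorded number of live bytes.
class RegionLiveness {
    static long liveSetEstimate(List<Long> regionLiveBytes) {
        long sum = 0;
        for (long live : regionLiveBytes) {
            sum += live; // live bytes recorded for the region during marking
        }
        return sum;
    }
}
```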

ZGC

ZGC is readily able to provide the information about the live set size at the end of the most recent mark phase. The `ZStatHeap` class already holds the liveness info - it just needs to be made public.


Conclusion

The 'traditional' GC implementations like Serial GC or Parallel GC try really hard to avoid computing liveness information. Usually, the only time they guarantee up-to-date and complete liveness information is right after a full GC has happened and the VM is still in a safepoint. Once the VM leaves the safepoint that information becomes stale almost immediately. If one decided to do a full-heap inspection to calculate the live set size at an arbitrary point in time, it would have to be done in another JVM safepoint and could easily take several hundred milliseconds, compromising the application's responsiveness.

G1 GC keeps the liveness information per region - but that information is usually updated very infrequently, as it happens during a marking run. That means that in order to get up-to-date liveness info it is necessary to perform a full-heap inspection, with all the previously mentioned drawbacks.

The good news is that both ZGC and Shenandoah keep fairly up-to-date liveness info which can be used almost immediately. In their current implementations (at the time of writing this article) the liveness information is not exposed publicly, but there are no technical reasons why it could not be done - either via custom JFR events and/or GC-specific MX Beans.




Sunday, January 24, 2021

Improved JFR allocation profiling in JDK 16

JFR (JDK Flight Recorder) has been providing support for on-the-fly allocation profiling for a while with the help of the ObjectAllocationInNewTLAB and ObjectAllocationOutsideTLAB events.

Short excursion - what is TLAB?

TLAB stands for Thread Local Allocation Buffer and it is a region inside Eden which is exclusively assigned to a thread. Each thread has its own TLAB. Thanks to that, as long as objects are allocated in TLABs, there is no need for any type of synchronization.
Allocation inside a TLAB is a simple pointer bump (that's why it's sometimes called pointer bump allocation).
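A minimal sketch of what pointer bump allocation amounts to - the class and field names are illustrative, not actual HotSpot code:

```java
// Minimal model of bump-pointer allocation inside a TLAB. Because the
// buffer belongs to a single thread, no synchronization is needed.
class Tlab {
    private long top;        // next free address within the buffer
    private final long end;  // exclusive upper bound of the buffer

    Tlab(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    /** Returns the address of the new object, or -1 if it does not fit. */
    long allocate(long objectSize) {
        if (top + objectSize > end) {
            return -1; // TLAB exhausted: retire it / take the slow path
        }
        long addr = top;
        top += objectSize; // the 'pointer bump'
        return addr;
    }
}
```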

For more detailed info I recommend reading the following articles [https://dzone.com/articles/thread-local-allocation-buffers, https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation] - or simply search the web for 'java TLAB' and follow one of many blogs, articles and StackOverflow entries available out there.

ObjectAllocationInNewTLAB event is emitted each time the TLAB is filled up and contains the information about the instance type, the instance size and the full stack trace where the allocation happened.

ObjectAllocationOutsideTLAB event is fired when the size of the instance to be allocated is bigger than the TLAB size. Also for this event JFR provides the instance type, size and the stacktrace where the allocation was initiated.

TLAB based allocation sampling

By utilising the properties of these two events - namely, the fact that they are emitted only when a TLAB is filled up or when an instance is bigger than the TLAB - we can relatively cheaply obtain a heap allocation profile which can help us identify allocation hot spots.

Expectations vs. reality

At the Datadog Continuous Profiler we had been using the described technique to provide continuous, always-on heap allocation profiling with great success - until we started receiving reports of significant performance degradation from our internal teams. The performance was restored the moment they disabled the TLAB related events and, with them, our allocation profiling.

Well, it turns out that some applications can generate a huuuuge number of TLAB events simply due to the enormous number of allocations. And while collecting stacktraces in JFR is pretty swift, once the number of collections crosses a certain threshold the cost becomes distinctly visible. The performance problems were made more prominent by a bug in the stacktrace hashcode calculation, which has since been fixed by my Datadog colleague Gabriel Reid and backported to all relevant JDK versions (JDK 8u282, 11.0.10 and 15.0.2, to mention a few).

In addition to this performance regression, the TLAB events contributed to a very significant increase in the recording size - again causing problems for our continuous profiler, which needs to deal with hundreds of thousands of such profiles daily. To solve the size problem we needed a new way of collecting allocation samples which would guarantee a maximum number of samples per recording (to have a predictable recording size) while still providing a statistically accurate picture.

Rate limited allocation profiling

This is an incremental improvement on top of the TLAB based allocation profiling. It uses the same data sources (TLAB filled up, outside-of-TLAB allocation) but applies an adaptive throttling mechanism to guarantee a maximum number of samples per recording (or, more generally, per time unit).
 
The samples are emitted as ObjectAllocationSample events with throttled emission rate. The emission rate is customisable in JFC templates and defaults to 150 samples per second for the 'default' JFR profile and 300 samples per second for the 'profile' JFR profile.
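In a JFC template the cap appears as a throttle setting on the event; a fragment along these lines (mirroring what the default template shipped with JDK 16 looks like) configures the rate:

```xml
<event name="jdk.ObjectAllocationSample">
  <setting name="enabled">true</setting>
  <setting name="throttle">150/s</setting>
</event>
```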
 
The generic throttling option has been built into JFR but is currently in use only by the ObjectAllocationSample event.
 

Throttling sampler - implementation details

The main objective is to reduce the number of events emitted per time unit while maintaining statistical relevancy. Typically, this could be done by e.g. reservoir sampling, where a fixed-size set of random elements is maintained for the duration of the specified time unit.
Unfortunately, this simple approach would not work for JFR events - each event is associated with its stacktrace at the moment the event is committed. Committing the event, in turn, writes it into the global buffer and effectively publishes it. Because of this inability to split stack collection and publication, a slightly more involved algorithm had to be devised.

JFR Adaptive Sampler

The JFR adaptive sampler provides generic support for controlling the emission rate of JFR events. It is an 'online' sampler - the decision whether an event is to be sampled is made immediately, without requiring any additional data structures as in the case of reservoir sampling. At the same time the sampler dynamically adjusts the effective sampling rate so as not to exceed the maximum emission rate within the given time range (e.g. per second) while still providing a good number of samples in 'quieter' periods.

Conceptually, the adaptive sampler is based on PID controller theory, although it is missing the derivative term, making it more similar to a PI controller. Also, because the sampler is used in latency sensitive parts of the JVM, the implementation had to be embellished with several heuristics not found in the standard discrete implementation.

1 Quantization

The adaptive sampler's work is quantised into a series of discrete time windows. An effective sampling rate (represented as a sample probability 0 <= Pw <= 1) is assigned to each time window and stays constant for the duration of that particular window. That means that the probability with which an event is picked as a sample is constant within a time window.

2 Dynamic sample probability

In order to adapt to changes in the incoming data, the effective sample probability is recalculated after each time window using: an estimate of the number of incoming elements (the population) in the next window, based on the historical trend; the target number of samples per window, derived from the user supplied target rate; and the 'sample budget'.
 
The formula to recompute the sample probability for window 'w' is:

Pw = (Starget + Sbudget) / Nw,k

Pw      - effective sample probability for window 'w'
Nw,k    - estimated population size of window 'w', calculated as an exponentially weighted moving average over the k previous windows
Starget - target number of samples per window = user supplied target rate × window duration in seconds
Sbudget - sample budget = sum of (Starget - Si) over the windows w-x to w-1; the budget is calculated per fixed one second interval and 'x' is the number of windows elapsed since the beginning of that interval

3 Sample budget

As mentioned in the previous paragraph, the adaptive sampler employs the concept of a 'sample budget'. This budget serves as a 'shock absorber' for situations when the prediction is wildly different from reality. It exploits the relaxation that the requested sampling rate has to be obeyed only over intervals spanning several windows - meaning that the actual effective sampling rates observed in each separate time window within such an interval can be significantly higher or lower than the requested rate.
Hence the 'budget' term - windows producing fewer samples than the requested number 'store' the unused samples, which can later be used by other windows to accommodate occasional bursts.
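The per-window update can be sketched as follows. This is a deliberately simplified model of the mechanism described above (constants, names and the clamping are my own choices, not the actual JDK implementation - e.g. the real budget bookkeeping is tied to fixed one second intervals):

```java
// Simplified model of the adaptive sampler's probability update:
// Pw = (Starget + Sbudget) / Nw,k, with the population estimated
// as an exponentially weighted moving average of past windows.
class AdaptiveSampler {
    private final double targetPerWindow;  // Starget
    private final double ewmaAlpha;        // weight of the newest observation
    private double populationEstimate;     // Nw,k
    private double budget;                 // Sbudget: unused samples so far
    private double probability = 1.0;      // Pw

    AdaptiveSampler(double targetPerWindow, double ewmaAlpha) {
        this.targetPerWindow = targetPerWindow;
        this.ewmaAlpha = ewmaAlpha;
        this.populationEstimate = targetPerWindow; // seed the estimate
    }

    /** Called when a time window closes. */
    void rollWindow(long observedPopulation, long samplesTaken) {
        // under-producing windows store budget; bursts may spend it
        budget = Math.max(0, budget + targetPerWindow - samplesTaken);
        // EWMA over previous windows predicts the next population
        populationEstimate = ewmaAlpha * observedPopulation
                + (1 - ewmaAlpha) * populationEstimate;
        probability = Math.min(1.0,
                (targetPerWindow + budget) / Math.max(1.0, populationEstimate));
    }

    double probability() { return probability; }
}
```

With, say, a target of 30 samples per window, a window that sees 1000 candidate events drives the probability down to roughly 30/1000, while quiet windows let it drift back toward 1.0.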

Conclusion

The introduction of a throughput management mechanism in JFR allows getting fine details about the application behavior without the risk of being overwhelmed by the sheer number of JFR events.
 
The results of our preliminary tests, on setups previously completely unable to run with the allocation profiling events turned on, are very exciting - JFR with the event emission rate controller is able to provide a clear statistical picture of the allocation activity while keeping the recording size at a very manageable level, thanks to the limit imposed on the number of captured TLAB events.

Also, it is worth noting that the event emission rate control (throttling) is not TLAB event specific and can be used for other event types as well if there is such demand. Although I am quite sure there will be no more throttled events in JDK 16, which is in its stabilisation phase as of the writing of this blog, it might be a good time to take a look at potential candidates for JDK 17.