JFR (JDK Flight Recorder) has supported on-the-fly allocation profiling for a while through the ObjectAllocationInNewTLAB and ObjectAllocationOutsideTLAB events.
Short excursion - what is TLAB?
TLAB stands for Thread Local Allocation Buffer and it is a region inside Eden, which is exclusively assigned to a thread. Each thread has its own TLAB. Thanks to that, as long as objects are allocated in TLABs, there is no need for any type of synchronization.
Allocation inside a TLAB is a simple pointer bump (that’s why it’s sometimes called pointer bump allocation).
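To make the pointer bump concrete, here is a minimal sketch of the idea. It is a toy model, not HotSpot code: the class name `TlabSketch`, the use of plain `long` values as addresses, and the 8-byte alignment are assumptions made for illustration.

```java
// Toy model of a thread-local allocation buffer (hypothetical, not HotSpot code).
final class TlabSketch {
    private final long end; // first address past the buffer
    private long top;       // next free address; only the owning thread touches it

    TlabSketch(long start, long sizeBytes) {
        this.top = start;
        this.end = start + sizeBytes;
    }

    /** Returns the address of the new object, or -1 if it does not fit. */
    long allocate(long sizeBytes) {
        long aligned = (sizeBytes + 7) & ~7L; // assumed 8-byte object alignment
        if (aligned > end - top) {
            return -1; // a real JVM would now retire this TLAB or allocate outside of it
        }
        long address = top;
        top += aligned; // the "pointer bump": no locks, no atomic operations
        return address;
    }
}
```

Because `top` is only ever touched by the owning thread, the fast path is a comparison and an addition.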
For more detailed info I recommend reading the following articles [https://dzone.com/articles/thread-local-allocation-buffers, https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation] - or simply search the web for 'java TLAB' and follow one of many blogs, articles and StackOverflow entries available out there.
The ObjectAllocationInNewTLAB event is emitted each time a TLAB is filled up and contains information about the instance type, the instance size and the full stack trace where the allocation happened.
The ObjectAllocationOutsideTLAB event is fired when the instance to be allocated is bigger than the TLAB size. For this event, too, JFR provides the instance type, the size and the stack trace where the allocation was initiated.
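As an illustration, both events can be enabled and read back with the standard `jdk.jfr` API. This is a sketch under a few assumptions: the class name `TlabEventDump`, the helper `recordAllocations` and the output format are made up for the example, and how many events you see depends on the JVM, the GC and the TLAB sizing.

```java
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class TlabEventDump {
    static final String IN_NEW_TLAB = "jdk.ObjectAllocationInNewTLAB";
    static final String OUTSIDE_TLAB = "jdk.ObjectAllocationOutsideTLAB";

    // Writing to a static field keeps the JIT from optimizing the allocations away.
    static volatile Object sink;

    /** Runs the workload under a recording and returns one line per TLAB event. */
    static List<String> recordAllocations(Runnable workload) {
        try {
            Path file = Files.createTempFile("tlab-events", ".jfr");
            try (Recording recording = new Recording()) {
                recording.enable(IN_NEW_TLAB).withStackTrace();
                recording.enable(OUTSIDE_TLAB).withStackTrace();
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
            }
            List<String> lines = new ArrayList<>();
            for (RecordedEvent event : RecordingFile.readAllEvents(file)) {
                String name = event.getEventType().getName();
                if (!name.equals(IN_NEW_TLAB) && !name.equals(OUTSIDE_TLAB)) {
                    continue;
                }
                RecordedClass type = event.getClass("objectClass");
                long size = event.getLong("allocationSize");
                int frames = event.getStackTrace() == null
                        ? 0 : event.getStackTrace().getFrames().size();
                lines.add(name + " " + (type == null ? "?" : type.getName())
                        + " " + size + " bytes, " + frames + " frames");
            }
            Files.deleteIfExists(file);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `recordAllocations` with a workload that allocates many small arrays plus a few very large ones should produce lines for both event types.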
TLAB based allocation sampling
By utilising the properties of these two events, namely the fact that they are emitted only when a TLAB is filled up or the instance size is bigger than the TLAB, we can relatively cheaply obtain a heap allocation profile which helps us identify allocation hot-spots.
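Turning such samples into a profile is mostly a matter of aggregation. The sketch below groups samples by allocation site and ranks the sites by estimated bytes. The names `AllocationProfile`, `Sample` and `hotSpots` are hypothetical, and so is the weighting: the example assumes the caller picks a per-sample weight (for instance the TLAB size for in-TLAB samples and the allocation size for outside-TLAB samples), which is one common estimate rather than something prescribed by JFR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AllocationProfile {
    /** One sampled allocation; 'weightBytes' is the caller's estimate (an assumption). */
    static final class Sample {
        final String site; // e.g. the top frame of the stack trace
        final long weightBytes;

        Sample(String site, long weightBytes) {
            this.site = site;
            this.weightBytes = weightBytes;
        }
    }

    /** Returns the 'limit' allocation sites with the highest estimated byte totals. */
    static List<Map.Entry<String, Long>> hotSpots(List<Sample> samples, int limit) {
        Map<String, Long> bytesBySite = new HashMap<>();
        for (Sample s : samples) {
            bytesBySite.merge(s.site, s.weightBytes, Long::sum);
        }
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(bytesBySite.entrySet());
        ranked.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```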
Expectations vs. reality
At DataDog Continuous Profiler we have been using the described technique to provide continuous, always-on heap allocation profiling with great success, until we started receiving reports of significant performance degradation from our internal teams. The performance was restored the moment they disabled the TLAB-related events and, with them, our allocation profiling.
Well, it turns out that some applications can generate a huuuuge amount of TLAB events simply due to the enormous number of allocations. And while collecting stack traces in JFR is pretty swift, once the number of collections crosses a certain threshold the overhead becomes clearly visible. The performance problems were made more prominent by a bug in the stack trace hashcode calculation, which has since been fixed by my DataDog colleague Gabriel Reid and backported to all relevant JDK versions (JDK 8u282, 11.0.10 and 15.0.2, to mention a few).
In addition to this performance regression, the TLAB events contributed to a very significant increase in the recording size, again causing problems for our continuous profiler, which needs to deal with hundreds of thousands of such profiles daily. To solve the size problem we would need a new way of collecting allocation samples, one which would guarantee a maximum number of samples per recording (to have a predictable recording size) while still providing a statistically accurate picture.
Rate limited allocation profiling
Throttling sampler - implementation details
The main objective is to reduce the number of events emitted per time unit while maintaining statistical relevancy. Typically, this could be done by e.g. reservoir sampling, where a fixed-size set of randomly chosen elements is maintained for the duration of the specified time unit.
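For reference, classic reservoir sampling (often called Algorithm R) can be sketched in a few lines. The class name `Reservoir` and its methods are made up for this example; the point to notice is that an element already in the reservoir can still be replaced later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class Reservoir<T> {
    private final int capacity;
    private final Random random;
    private final List<T> items;
    private long seen; // number of elements offered so far

    Reservoir(int capacity, Random random) {
        this.capacity = capacity;
        this.random = random;
        this.items = new ArrayList<>(capacity);
    }

    void offer(T item) {
        seen++;
        if (items.size() < capacity) {
            items.add(item); // the first 'capacity' elements are always kept
            return;
        }
        // Pick a slot uniformly from [0, seen); the new element survives
        // with probability capacity / seen and evicts an older element.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < capacity) {
            items.set((int) slot, item);
        }
    }

    /** Returns a copy of the current sample. */
    List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

The final sample is only known once the time unit is over, because any kept element may still be evicted by a later one.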
Unfortunately, this simple approach would not work for JFR events: each event is associated with its stack trace at the moment when the event is committed. Committing the event, in turn, writes it into the global buffer and effectively publishes it, so an event cannot be kept as a candidate and discarded later. Because stack collection and publication of an event cannot be split, a slightly more involved algorithm had to be devised.
JFR Adaptive Sampler
The JFR adaptive sampler provides generic support for controlling the emission rate of JFR events. It is an 'online' sampler, meaning that the decision whether an event is to be sampled or not is made immediately, without requiring any additional data structures as is the case with reservoir sampling. At the same time the sampler dynamically adjusts the effective sampling rate in order not to cross the maximum emission rate within the given time range (e.g. per second) while still providing a good number of samples in 'quieter' periods.
Conceptually, the adaptive sampler is based on PID controller theory, although it is missing the derivative term, which makes it more similar to a PI controller. Also, due to the additional limits imposed by the fact that the sampler is going to be used in latency-sensitive parts of the JVM, the implementation had to be embellished with several heuristics not found in the standard discrete implementation.
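The general shape of such an online, window-based sampler can be sketched as follows. This is a simplified illustration and not the HotSpot implementation: the class name `AdaptiveSamplerSketch`, the explicit `rotateWindow` call (instead of a clock), the smoothing factor `alpha` and the use of an exponentially weighted moving average as a stand-in for the controller's smoothing are all assumptions of this example.

```java
import java.util.Random;

final class AdaptiveSamplerSketch {
    private final int targetPerWindow; // requested number of samples per window
    private final double alpha;        // smoothing factor in (0, 1], hypothetical
    private final Random random;
    private boolean primed;            // false until the first window has closed
    private double avgPopulation;      // smoothed number of events seen per window
    private double probability = 1.0;  // sampling probability for the current window
    private long seen;
    private long sampled;

    AdaptiveSamplerSketch(int targetPerWindow, double alpha, Random random) {
        this.targetPerWindow = targetPerWindow;
        this.alpha = alpha;
        this.random = random;
    }

    /** Online decision: made immediately, with no buffering of candidates. */
    boolean sample() {
        seen++;
        if (random.nextDouble() < probability) {
            sampled++;
            return true;
        }
        return false;
    }

    /** Closes the window, re-tunes the probability, returns samples emitted. */
    long rotateWindow() {
        avgPopulation = primed ? alpha * seen + (1 - alpha) * avgPopulation : seen;
        primed = true;
        // Aim for 'targetPerWindow' samples if the next window looks like the average.
        probability = avgPopulation <= targetPerWindow
                ? 1.0 : targetPerWindow / avgPopulation;
        long emitted = sampled;
        seen = 0;
        sampled = 0;
        return emitted;
    }

    double probability() {
        return probability;
    }
}
```

Note that this sketch has no history in its first window and therefore samples everything there; a production sampler needs additional safeguards for that case and for sudden bursts.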
1 Quantization
2 Dynamic sample probability
Pw - effective sample probability
3 Sample budget
As already mentioned in the previous paragraph, the adaptive sampler employs the concept of a 'sample budget'. This budget serves as a 'shock absorber' for situations when the prediction is diametrically different from the reality. It exploits a relaxation of the rate requirement: the sampling rate has to be obeyed only over time intervals longer than X windows, meaning that the actual effective sampling rates observed in each separate time window within this interval can be significantly higher or lower than the requested sampling rate.
Hence the 'budget' term - windows producing fewer samples than the requested number will 'store' unused samples which can later be used by other windows to accommodate occasional bursts.
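The carry-over of unused samples can be sketched like this. It is a minimal illustration, not the JFR code: the class name `SampleBudget`, the method names and the rule that at most `carryWindows` windows' worth of samples may be stored (standing in for the 'X windows' above) are assumptions of this example.

```java
final class SampleBudget {
    private final int samplesPerWindow; // requested samples per window
    private final int carryWindows;     // hypothetical stand-in for "X windows"
    private long allowedThisWindow;
    private long usedThisWindow;

    SampleBudget(int samplesPerWindow, int carryWindows) {
        this.samplesPerWindow = samplesPerWindow;
        this.carryWindows = carryWindows;
        this.allowedThisWindow = samplesPerWindow;
    }

    /** Returns true if the current window may still emit a sample. */
    boolean tryAcquire() {
        if (usedThisWindow < allowedThisWindow) {
            usedThisWindow++;
            return true;
        }
        return false;
    }

    /** Closes the window and carries unused samples over, up to a cap. */
    void rotateWindow() {
        long unused = allowedThisWindow - usedThisWindow;
        long maxCarry = (long) samplesPerWindow * carryWindows;
        long carried = Math.min(unused, maxCarry);
        allowedThisWindow = samplesPerWindow + carried;
        usedThisWindow = 0;
    }
}
```

With 10 samples per window and a cap of 2 windows, a window that used only 4 samples lets the next one emit up to 16, while a long idle period can never accumulate more than 30 samples for a single burst.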