Sunday, July 28, 2013

What drives Full GC duration?

It's obvious that the more memory is allocated, the longer a Full GC pause will be. However, a GC also has to track references, so I want to find out whether primitive fields (like int, long) increase Full GC duration as well. Additionally, I'd like to find out whether the structure of an object graph has an impact on GC duration.

The Test

I am creating a number of instances of the following classes:
    static class GCWrapper {
        GCWrapper(Object referenced) {
            this.referenced = referenced;
        }
        Object referenced;
    }
    static class GCWrapperMiddle {
        int a,b,c,d,e,f,g,h,i,j,k,l;
        GCWrapperMiddle(Object referenced) {
            this.referenced = referenced;
        }
        Object referenced;
    } 
    static class GCWrapperFat {
        int a,b,c,d,e,f,g,h,i,j,k,l;
        long aa,ab,ac,ad,ae,af,ag,ah,ai,jj,ak,al;
        GCWrapperFat(Object referenced) {
            this.referenced = referenced;
        }
        Object referenced;
    }
They all hold exactly one reference, but "Middle" and "Fat" additionally contain primitive fields, which increases their memory consumption considerably.
After creation I chain those objects into a linked list. For the locality tests, the reference is instead pointed at a random other "GCWrapper" object to create "far" references.
Additionally, I use the class below to test the effect of "many references", i.e. a highly interlinked object graph:
    static class GCWrapperLinked {
        GCWrapperLinked(Object referenced) {
            this.r1 = referenced;
        }
        Object r1,r2,r3,r4,r5;
    }
After instantiation, the test performs 5 GCs using System.gc() and takes the last measured value (~3 calls are required until the times stabilize).
Tests are done on an Intel i7 mobile processor (4 cores, 8 threads) with a somewhat outdated JDK 7u21, as I am currently on vacation with a pretty bad internet connection ...
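A minimal sketch of that measurement loop (my own reconstruction, assuming simple wall-clock timing around System.gc(); the original harness may differ):
    // hedged sketch: trigger a few Full GCs and keep the last measured pause
    // (the allocated object graph must stay reachable, e.g. via a static field)
    static Object liveSet;                               // holds the allocated graph

    static long measureFullGC() {
        long last = 0;
        for (int i = 0; i < 5; i++) {
            long tim = System.currentTimeMillis();
            System.gc();                                 // requests a Full GC on HotSpot
            last = System.currentTimeMillis() - tim;
            System.out.println("Full GC " + i + ": " + last + " ms");
        }
        return last;                                     // take the last (stabilized) value
    }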

What has more impact, Memory Size or Number of Objects (-References)?


The test was run with all collectors and a varying number of objects on the heap. Objects were allocated into a linked list like
        GCWrapperMiddle res = new GCWrapperMiddle(null);
        for ( int i = 0; i < len; i++ ) {
            res = new GCWrapperMiddle(res);
        }
        return res;

The result shows a clear linear dependency between the number of objects and Full GC duration.
Please don't conclude from these results that the default GC (ParOldGC) is the slowest, because
a) subsequent calls to System.gc() without mutation in between may enable collectors to cut corners
b) they might choose to use fewer/more threads during collection

I test all collectors just to ensure that the observed effects are present for all of them.
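For reference, the collectors available on JDK 7 can be selected with the usual VM flags. The command lines below are hypothetical (the class name FullGCTest is made up; the post does not show the exact invocations):
        # hypothetical invocations, showing how each collector is selected on JDK 7
        java -XX:+UseParallelOldGC   FullGCTest   # parallel collector ('ParOldGC', the default)
        java -XX:+UseConcMarkSweepGC FullGCTest   # CMS
        java -XX:+UseG1GC            FullGCTest   # G1
        java -XX:+UseSerialGC        FullGCTest   # serial collector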

Now let's compare the effect of object size on Full GC duration. As described above, I pumped up the object size by adding int and long fields. Those fields cannot contain references to other objects (so there is no need to scan them) but still consume space. The test creates 6 million objects of each of the classes shown above. Heap consumption is:

  • 163 MB for 'small'
  • 427 MB for 'middle'
  • 1003 MB for 'fat'
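(These figures can be approximated with a trivial measurement like the following; this is my own sketch, not necessarily how the numbers above were produced:)
    // hedged sketch: approximate the live heap after forcing a collection
    static long usedHeapAfterGC() {
        System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();   // bytes currently in use
    }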
Now that's interesting:
Object size has only a minor impact on Full GC duration (the differences observed can be attributed mostly to CPU cache misses, I assume). The main driver of Full GC duration is the number of objects (and object references).


Does GC duration depend on the number of objects or the number of object references?


To test this, I create an object graph where each GCWrapper object points to another randomly chosen GCWrapper object:
        GCWrapper res[] = new GCWrapper[len];
        for ( int i = 0; i < len; i++ ) {
            res[i] = new GCWrapper(null);
        }
        for ( int i = 0; i < len; i++ ) {
            res[i].referenced = res[((int) (Math.random() * len))];
        }
        return res;
and ('5 refs')
        GCWrapperLinked res[] = new GCWrapperLinked[len];
        for (int i = 0; i < res.length; i++) {
            res[i] = new GCWrapperLinked(null);
        }
        for (int i = 0; i < res.length; i++) {
            res[i].r1 = res[((int) (Math.random() * len))];
            res[i].r2 = res[((int) (Math.random() * len))];
            res[i].r3 = res[((int) (Math.random() * len))];
            res[i].r4 = res[((int) (Math.random() * len))];
            res[i].r5 = res[((int) (Math.random() * len))];
        }
        return res;


Oops, now that's surprising. It's strange that the default GC (ParOldGC) actually gets faster for objects containing *more* references to (random) other objects! This is not a test error and is reproducible.
You probably know that today's CPUs have multi-level caches and that access to main memory is extremely expensive compared to a cache hit, so locality (memory-wise) is pretty important. Obviously this also hits the Full GC when it traverses the live set of objects.
(see http://www.akkadia.org/drepper/cpumemory.pdf for more information you'd probably want to read about caches ;-) ).

In contrast to the CMS collector, ParOldGC ('DefaultGC') is able to compact the heap during a Full GC. It seems it does this in a way that improves the randomly linked object graph by moving objects closer to the objects referencing them (that's my speculative explanation, others are welcome ..).

This observation leads to the following test:
a comparison of the Full GC time for 6 million GCWrapper objects, once set up as a linked list (linked in allocation order) and once with randomly chosen referenced objects.

        GCWrapper res = new GCWrapper(null);
        for ( int i = 0; i < len; i++ ) {
            res = new GCWrapper(res);
        }
        return res;
vs
        GCWrapper res[] = new GCWrapper[len];
        for ( int i = 0; i < len; i++ ) {
            res[i] = new GCWrapper(null);
        }
        for ( int i = 0; i < len; i++ ) {
            res[i].referenced = res[((int) (Math.random() * len))];
        }
        return res;
Boom! The impact of locality is massive. This will also have a strong impact on the performance of a Java application accessing those data structures.
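The same effect can be demonstrated outside the GC with a bit of pointer chasing. A sketch of my own (the method and the bounded hop count are made up, not part of the original test):
    // hedged sketch: follow 'referenced' pointers for a fixed number of hops.
    // On the allocation-ordered list neighbouring objects tend to sit close in
    // memory; on the randomly linked graph nearly every hop is a cache miss.
    static long timePointerChasing(GCWrapper start, int hops) {
        long tim = System.currentTimeMillis();
        Object cur = start;
        for (int i = 0; i < hops && cur != null; i++) {
            cur = ((GCWrapper) cur).referenced;
        }
        return System.currentTimeMillis() - tim;
    }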

Conclusion:

Full GC duration depends on the number of objects allocated and the locality of their references. It does not depend that much on actual heap size.


Questionable design patterns regarding (OldGen) GC and overall locality


I think one can safely assume that objects allocated at the same time have a high chance of being "near" each other (note: CMS may behave differently if the application fragments the heap). So whenever long-lived, statically reachable data is updated by referring to a newly allocated object, chances are the new object ends up far away from the object referencing it, which will cause cache misses. Especially the massive use of immutables, as favoured by many designs and newer VM-based languages, is actually deadly with regard to locality and OldGen GC. Patterns that are questionable in this respect:
  • Lazy initialization/instantiation
  • Value classes. E.g. GC-wise it's better to use a primitive long instead of a Date (see the sketch after this list).
  • Immutables (except preallocated Flyweights)
  • Fine-grained class design
  • Preferring composition over subclassing (many wrapper/delegate objects)
  • JDK's HashMap uses entry objects. Use open-addressed hash maps instead
  • Use of Long, Integer, Short instead of primitives in static data (unfortunately Java's generics encourage this). Even though the VM does some optimization here (e.g. internal caching of small Integer values), one can easily verify that these objects still have a severe impact on GC duration
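To illustrate the value-class point from the list above, a minimal sketch (the class names are made up for illustration): storing a timestamp as a primitive long removes one object and one reference per record that the GC would otherwise have to trace.
    // hypothetical example classes, for illustration only
    static class EventWithDate {
        java.util.Date timestamp;    // extra object + reference the GC must trace
    }
    static class EventWithLong {
        long timestampMillis;        // primitive field: no extra object, nothing to trace
    }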
