about java, groovy, grails, software development and more...: Basics of Java's Garbage Collection

Garbage collection:

Every Java program runs as a process and is allocated some memory. The memory allocated to the Java process is called the process heap.

The Java heap, i.e. the Java object heap, is part of the process heap. It stores Java objects, i.e. object instances, and can be configured using several command-line parameters.

Garbage collection is the mechanism Java uses to reclaim the space occupied by objects that are no longer reachable from the program. This is necessary to ensure that the application does not run out of memory.

An object that cannot be reached from any reference within the running program is eligible for garbage collection.

The simplest form of GC iterates through all the references held by the running program and marks every object that is reachable via those references, directly or indirectly. At the end of this process, any object that is not marked is considered garbage and becomes eligible for collection.
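The marking step described above can be sketched as a graph traversal. This is only an illustrative model: the `Node` class and the `mark` method are hypothetical names invented for the example, and a real JVM traces raw heap memory rather than Java-level objects like these.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Illustrative model only, not how the JVM is implemented.
class MarkSketch {
    // Hypothetical stand-in for a heap object that holds references to other objects.
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Returns every node reachable, directly or indirectly, from the given roots.
    static Set<Node> mark(List<Node> roots) {
        Set<Node> marked = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node current = pending.pop();
            if (marked.add(current)) {          // first visit: mark it and follow its references
                pending.addAll(current.refs);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);   // a -> b
        b.refs.add(a);   // the cycle back to a is handled by the marked set
        c.refs.add(d);   // c -> d, but no root reaches either of them
        Set<Node> live = mark(Arrays.asList(a));
        for (Node n : Arrays.asList(a, b, c, d)) {
            System.out.println(n.name + (live.contains(n) ? " is live" : " is garbage"));
        }
    }
}
```

Note that the loop only ever touches reachable objects, which is why the cost of marking grows with the number of live objects rather than with the amount of garbage.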

Clearly, with this particular algorithm, the duration of a collection is proportional to the number of live objects referenced by the program.

Generational Collection

In most applications, the large majority of objects are short-lived; relatively few objects are created when the application is initialized and live until the end. It therefore makes sense for the GC to focus on the short-lived objects.

To exploit this property, the JVM manages memory in generations, i.e. it divides the object memory into groups according to the age of the objects. New objects are allocated in the young generation, and objects that have survived a good number of collections are moved to the tenured (old) generation.

There is also a permanent generation, where metadata for classes and methods is stored. The default permanent generation size is generally enough, but applications that load a large number of classes may want to increase it. This can be done using the -XX:MaxPermSize option. (This applies to JVMs up to Java 7; Java 8 replaced the permanent generation with Metaspace.)

But how does this grouping help?

As we have already seen, the amount of time spent in GC depends on the number of live objects. Since most objects in the young generation are short-lived (they die soon after creation), running a GC on this area should not take very long. Only a few live objects need to be copied, so the GC pause is short. The young generation also needs to be sized sensibly. If it is too small, it fills up quickly and the GC runs too soon, while many of the objects in it are still alive, which defeats the purpose. The young generation should therefore be big enough to take advantage of infant mortality (most objects die soon, but not immediately). As a rule of thumb, the young generation should be around one third of the total heap size.

Minor collections:

The young generation is divided into an area called ‘eden’, where new allocations are made initially, and two ‘survivor spaces’, which together act as a copying collector. When eden becomes full, a minor collection takes place.

The minor collection discards the dead objects in eden and copies any live objects from eden into one of the survivor spaces. When eden fills up again and another minor collection runs, the live objects from eden and from the first survivor space are copied to the second survivor space, and so on, with the two survivor spaces swapping roles each time.

Thus every minor collection copies the live objects from eden and one of the survivor spaces into the other survivor space. In addition, any live objects that have survived a certain number of collections are promoted to the tenured generation.

Note: In such a generational collection, the collector does not visit the tenured generation objects referenced by young generation objects. There are ways to track inter-generational references (out of scope for this article).
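The copying scheme described above can be sketched with a toy model. Everything here is an assumption made for illustration: the class and field names are hypothetical, the `live` flag stands in for real reachability, and the tenuring threshold of 2 is picked only to keep the demo short.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a copying young generation; not how the JVM is implemented.
class MinorGcSketch {
    static final int TENURING_THRESHOLD = 2;   // assumed value for the demo

    static final class Obj {
        final String name;
        boolean live = true;   // stands in for "still reachable"
        int age = 0;           // number of minor collections survived
        Obj(String name) { this.name = name; }
    }

    final List<Obj> eden = new ArrayList<>();
    List<Obj> from = new ArrayList<>();   // survivor space currently holding objects
    List<Obj> to = new ArrayList<>();     // empty survivor space, target of the next copy
    final List<Obj> tenured = new ArrayList<>();

    Obj allocate(String name) {
        Obj o = new Obj(name);
        eden.add(o);           // new objects always start in eden
        return o;
    }

    void minorCollection() {
        copySurvivors(eden);
        copySurvivors(from);
        eden.clear();          // whatever was not copied is garbage
        from.clear();
        List<Obj> tmp = from;  // the two survivor spaces swap roles
        from = to;
        to = tmp;
    }

    private void copySurvivors(List<Obj> space) {
        for (Obj o : space) {
            if (!o.live) continue;             // dead objects are simply left behind
            o.age++;
            if (o.age >= TENURING_THRESHOLD) {
                tenured.add(o);                // old enough: promote
            } else {
                to.add(o);                     // otherwise copy to the other survivor space
            }
        }
    }

    public static void main(String[] args) {
        MinorGcSketch heap = new MinorGcSketch();
        Obj a = heap.allocate("a");
        Obj b = heap.allocate("b");
        b.live = false;                        // b becomes unreachable
        heap.minorCollection();                // a moves to a survivor space, b is dropped
        heap.allocate("c");
        heap.minorCollection();                // a is promoted, c moves to a survivor space
        System.out.println("survivor: " + heap.from.size() + ", tenured: " + heap.tenured.size());
    }
}
```

Only live objects are ever copied, so the work done by a minor collection depends on how many objects survive, not on how many were allocated.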

Major collections:

If a minor collection cannot reclaim a significant amount of memory, a major collection takes place. Such collections cover both generations, i.e. the young as well as the tenured.

Note: Calling System.gc() requests a major collection. Since major collections take longer to run, we should be very careful about using System.gc(), as it can hurt performance.

Collection Strategies:

There are several GC strategies that you can choose from to improve performance/throughput.

1. Serial collector: What we have discussed so far is the serial way of collection, also known as ‘stop-the-world’ collection. As the name suggests, program execution is halted while this kind of collector runs. The advantage is that no new objects are allocated and object reachability does not change until the collection finishes. The disadvantage is that program execution is paused. This is the simplest collection strategy, and it is useful for applications that are not interactive.

2. Parallel collector: Also called the ‘throughput collector’. Here the young generation is collected in parallel, i.e. multiple GC threads are used for minor collections, while the tenured generation is still collected serially. The application threads remain paused during a collection; the parallelism is among the GC threads, which shortens the pause on machines with multiple processors. This strategy can be enabled using the -XX:+UseParallelGC option. It is suitable for applications with a large set of short-lived data (a large young generation) running on multiple processors.

3. Incremental collector: In this strategy, GC is performed in phases, and program execution resumes between the phases. It can be enabled using the -XX:+UseTrainGC option. This collector is no longer used in versions after Java 1.4.

4. Concurrent collector: In this strategy, most of the collection of the tenured generation takes place concurrently with program execution, so pause periods are very short. The young generation is still collected with short stop-the-world pauses. This strategy can be enabled using the -XX:+UseConcMarkSweepGC option. It is suitable for applications with a large set of long-lived data (a large tenured generation) running on multiple processors.

How can GC be monitored?

The following command line options will output the details of GC:

-verbose:gc

-XX:+PrintGCDetails

-Xloggc:gc.log
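As a complement to the flags above, collection counts and accumulated collection times can also be read from inside a running program through the standard java.lang.management API. The sketch below is a minimal example; the class name `GcStatsSketch` and the `describe` helper are hypothetical, and the collector names printed depend on the JVM and the strategy in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStatsSketch {
    // Formats one collector's counters; a count or time of -1 means "undefined for this collector".
    static String describe(GarbageCollectorMXBean gc) {
        return gc.getName()
                + ": collections=" + gc.getCollectionCount()
                + ", time(ms)=" + gc.getCollectionTime();
    }

    public static void main(String[] args) {
        // Typically one bean for the young generation and one for the old generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(describe(gc));
        }
    }
}
```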

Heap and generation sizes

Throughput is the percentage of the total run time not spent performing GC.

Pauses are the times when the application execution stops to allow GC to run.

Different applications have different requirements as far as the above two metrics are concerned.

The throughput of the application depends on the memory available. Collections occur when generations fill up, so small generations fill quickly and are collected often, which lowers throughput. Larger generations are collected less often, which raises throughput, but each collection may then pause the application for longer.

Total available memory is the most important factor affecting garbage collection performance.

When the JVM initializes, space is reserved for the Java object heap. The size of this reservation can be specified using the -Xmx option.

-Xms specifies the minimum (initial) heap size and -Xmx specifies the maximum heap size.

If -Xms is less than -Xmx, the remainder is held as so-called ‘virtual space’, which is reserved but not committed up front.

The heap is grown or shrunk into this virtual space after each collection to keep the ratio of the ‘available space’ to the ‘space used by live objects’ within a range.

This range can be specified using the -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio parameters.
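To see these quantities for a running JVM, the standard `Runtime` methods can be used. This is a rough sketch: the class name `HeapSpaceSketch` and the `freeRatioPercent` helper are made up for the example, and the percentage it prints is only an approximation of the free ratio the JVM tracks internally.

```java
class HeapSpaceSketch {
    // Free space as a percentage of the currently committed heap.
    static double freeRatioPercent(long freeBytes, long committedBytes) {
        return 100.0 * freeBytes / committedBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper limit, roughly what -Xmx sets
        long committed = rt.totalMemory();  // heap currently committed to the process
        long free = rt.freeMemory();        // unused part of the committed heap

        System.out.println("max (bytes):       " + max);
        System.out.println("committed (bytes): " + committed);
        System.out.println("not yet committed: " + (max - committed));
        System.out.println("free ratio (%):    " + freeRatioPercent(free, committed));
    }
}
```

Running it with different -Xms and -Xmx values shows the gap between the committed and the maximum heap, which roughly corresponds to the virtual space described above.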

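Some other interesting options are:

-XX:NewRatio: ratio of the size of the old generation to the size of the young generation. For example, -XX:NewRatio=2 makes the old generation twice as large as the young generation, so the young generation is one third of the heap.

-XX:NewSize: initial size of the young generation in bytes.

-XX:SurvivorRatio: ratio of the size of eden to the size of one survivor space. For example, -XX:SurvivorRatio=6 makes each survivor space one eighth of the young generation.

-XX:GCTimeRatio: specifies a throughput goal as the ratio of time spent in program execution to time spent in garbage collection. With -XX:GCTimeRatio=n, the goal is to spend no more than 1/(1+n) of the total time in GC.

-XX:MaxGCPauseMillis: specifies a maximum pause goal, i.e. the longest time, in milliseconds, that program execution should be halted to allow GC to finish. It is a goal for the collector, not a guarantee.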
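The arithmetic behind these ratios can be worked through in a few lines. The sketch below assumes the usual HotSpot interpretation of the options (young = heap / (NewRatio + 1), one survivor space = young / (SurvivorRatio + 2), GC time fraction = 1 / (1 + GCTimeRatio)); the class and method names are hypothetical, and a real JVM rounds and aligns the actual sizes.

```java
class SizingSketch {
    // NewRatio = old / young, so the young generation gets 1 / (NewRatio + 1) of the heap.
    static long youngSize(long heap, int newRatio) {
        return heap / (newRatio + 1);
    }

    // SurvivorRatio = eden / one survivor space; there are two survivor spaces.
    static long survivorSize(long young, int survivorRatio) {
        return young / (survivorRatio + 2);
    }

    // Eden is whatever remains of the young generation after the two survivor spaces.
    static long edenSize(long young, int survivorRatio) {
        return young - 2 * survivorSize(young, survivorRatio);
    }

    // GCTimeRatio = n means a goal of at most 1 / (1 + n) of total time spent in GC.
    static double gcTimeFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }

    public static void main(String[] args) {
        long heapMb = 600;                          // example heap size in MB (assumed)
        long youngMb = youngSize(heapMb, 2);        // -XX:NewRatio=2
        System.out.println("young (MB):    " + youngMb);
        System.out.println("survivor (MB): " + survivorSize(youngMb, 8));   // -XX:SurvivorRatio=8
        System.out.println("eden (MB):     " + edenSize(youngMb, 8));
        System.out.println("GC time goal:  " + gcTimeFraction(19));         // -XX:GCTimeRatio=19
    }
}
```

With these example values, a 600 MB heap gives a 200 MB young generation (the one third mentioned earlier), split into a 160 MB eden and two 20 MB survivor spaces, and a GC time goal of 5%.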