January 17, 2007

Don't Sacrifice Object Identity to Cluster Your Java Application

Two objects may have the same value yet not be the same object. In Java you test whether objects have the same value using equals() and test whether they are actually the same object using the == operator. The == operator compares object references (in effect, object IDs) to test whether the two are truly the same object. James Brundege explains, "In the VM you do not get an ID for an object; you simply hold direct references to the object. Behind the scenes, the VM does assign an eight-byte ID, which is what a reference to an object really is." This notion of object identity is fundamental to Java programming semantics and is part of everything from passing objects into methods to referencing objects from other objects.
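The difference between value equality and identity can be sketched in a few lines. The class and method names here (IdentityDemo, sameValue, sameObject) are made up for illustration; only equals() and == come from the text above.

```java
class IdentityDemo {
    /** True when the two objects hold equal values, as defined by equals(). */
    static boolean sameValue(Object a, Object b) {
        return a.equals(b);
    }

    /** True only when both references point at the very same object. */
    static boolean sameObject(Object a, Object b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) always allocates a fresh object, so these two are distinct.
        String first = new String("Ground");
        String second = new String("Ground");
        String alias = first; // a second reference to the same object

        System.out.println(sameValue(first, second));  // true: same characters
        System.out.println(sameObject(first, second)); // false: two separate objects
        System.out.println(sameObject(first, alias));  // true: one object, two references
    }
}
```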

For example, a shopping cart might have a shipping method set which in turn has a rate used for calculating the shipping charge.


Java Classes for a Simple Cart


You could access the rate object using simple getters with code such as cart.getShippingMethod().getRate(). This is straightforward for Java developers to code and is very efficient within the JVM. Compare this approach to the database paradigm, which has a radically different set of constraints from Java programming.
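As a rough sketch of what such classes might look like, the following assumes a cart that holds one shipping method, which in turn holds one rate. The field names, the constructor shapes, and the choice to store the rate as a whole number of cents are assumptions made for this example, not details from the original application.

```java
class Rate {
    private final long amountInCents; // assumed representation of the rate

    Rate(long amountInCents) {
        this.amountInCents = amountInCents;
    }

    long getAmountInCents() {
        return amountInCents;
    }
}

class ShippingMethod {
    private final String name;
    private final Rate rate;

    ShippingMethod(String name, Rate rate) {
        this.name = name;
        this.rate = rate;
    }

    String getName() {
        return name;
    }

    Rate getRate() {
        return rate;
    }
}

class Cart {
    private ShippingMethod shippingMethod;

    ShippingMethod getShippingMethod() {
        return shippingMethod;
    }

    void setShippingMethod(ShippingMethod shippingMethod) {
        this.shippingMethod = shippingMethod;
    }
}

class CartDemo {
    public static void main(String[] args) {
        Rate groundRate = new Rate(495);
        Cart cart = new Cart();
        cart.setShippingMethod(new ShippingMethod("Ground", groundRate));

        // Plain reference traversal: no keys, no lookups.
        Rate rate = cart.getShippingMethod().getRate();
        System.out.println(rate == groundRate);      // true: the very same Rate object
        System.out.println(rate.getAmountInCents()); // 495
    }
}
```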

Relational databases have tables with rows and columns where information is tied together with keys.  This structure is appropriate when storing data to disk, but it results in an object-relational impedance mismatch when moving between the database world and the Java world.  Keys work well in relational databases to hook up the bits of data, but they create undesirable overhead and complexity when introduced into your application code.  In our example of the shopping cart, you would end up writing special methods for passing around keys and performing lookups in tables for shipping methods and rates instead of using simple getters leveraging the natural semantics of Java.  Or put more simply, Java is well suited for business logic working on transient data and the database is well suited to handling persistent data.
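To make the contrast concrete, here is a hedged sketch of the key-passing style described above. The two maps stand in for database tables, and every name in it (KeyedCartDemo, KeyedCart, shippingRateCents, the ids and amounts) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class KeyedCartDemo {
    // Stand-ins for two tables: shipping method id -> rate id, and rate id -> amount in cents.
    static final Map<Integer, Integer> RATE_ID_BY_SHIPPING_METHOD_ID = new HashMap<>();
    static final Map<Integer, Integer> CENTS_BY_RATE_ID = new HashMap<>();

    static {
        RATE_ID_BY_SHIPPING_METHOD_ID.put(1, 10); // method 1 uses rate 10
        CENTS_BY_RATE_ID.put(10, 495);            // rate 10 costs 495 cents
    }

    /** The cart no longer references a shipping method; it only carries a key. */
    static class KeyedCart {
        final int shippingMethodId;

        KeyedCart(int shippingMethodId) {
            this.shippingMethodId = shippingMethodId;
        }
    }

    /** Two lookups and two failure cases replace one chain of getters. */
    static int shippingRateCents(KeyedCart cart) {
        Integer rateId = RATE_ID_BY_SHIPPING_METHOD_ID.get(cart.shippingMethodId);
        if (rateId == null) {
            throw new IllegalStateException("Unknown shipping method id: " + cart.shippingMethodId);
        }
        Integer cents = CENTS_BY_RATE_ID.get(rateId);
        if (cents == null) {
            throw new IllegalStateException("Unknown rate id: " + rateId);
        }
        return cents;
    }

    public static void main(String[] args) {
        System.out.println(shippingRateCents(new KeyedCart(1))); // 495
    }
}
```

Even in this tiny case the application code has to know about keys, table layout, and missing-row handling, none of which appear in the getter-based version.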

This leads us to a fundamental guideline of building real world applications: Resist with all your ability treating transient application data as persistent data to solve some architectural need.  Only make data persistent when it is truly ready to be stored in the database.  This guideline can even be applied to clustering.

Traditional clustering solutions expose an API for accessing a clustered HashMap. They require serialization (your classes must implement java.io.Serializable) and cause Java objects to lose their identity, which essentially forces the overhead of the database paradigm onto the code for your transient application data. The API is the interface between the application and the cluster. Think of the API as the impedance resulting from the mismatch between objects and the clustered HashMap.

The objects in the clustered HashMap are isolated such that re-materializing them on a different node puts the burden of managing the object relationships on the application developer.


Relationships for Cart Objects


If you put a cart object into a clustered HashMap or database and later retrieve the shipping method and the rate, your application has to know how to reconstruct those object relationships.
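The loss of identity can be reproduced with nothing but standard Java serialization. The sketch below is a simulation, not the behavior of any particular clustering product: it assumes each cart is serialized as a separate entry, as a map that copies values between nodes would do, and all class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationIdentityDemo {
    static class Rate implements Serializable {
        private static final long serialVersionUID = 1L;
        final long cents;
        Rate(long cents) { this.cents = cents; }
    }

    static class ShippingMethod implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final Rate rate;
        ShippingMethod(String name, Rate rate) { this.name = name; this.rate = rate; }
    }

    static class Cart implements Serializable {
        private static final long serialVersionUID = 1L;
        final ShippingMethod shippingMethod;
        Cart(ShippingMethod shippingMethod) { this.shippingMethod = shippingMethod; }
    }

    /** Serializes and deserializes one object, as if it had been shipped to another node. */
    static <T extends Serializable> T copyViaSerialization(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ShippingMethod ground = new ShippingMethod("Ground", new Rate(495));
        Cart cartA = new Cart(ground);
        Cart cartB = new Cart(ground);

        // In the original JVM both carts share one ShippingMethod object.
        System.out.println(cartA.shippingMethod == cartB.shippingMethod); // true

        // Each cart is serialized on its own, so each copy gets its own ShippingMethod.
        Cart copyA = copyViaSerialization(cartA);
        Cart copyB = copyViaSerialization(cartB);
        System.out.println(copyA.shippingMethod == copyB.shippingMethod); // false
        System.out.println(copyA.shippingMethod.rate.cents);              // 495: value survives
    }
}
```

The values survive the round trip, but the fact that both carts pointed at one shared shipping method does not. Restoring that relationship is the work that falls to the application.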


Distributed HashMaps Wreak Havoc on Object Identity


Open Terracotta is an open source clustering solution that takes a very different approach to clustering — one that enables it to preserve object identity across the cluster.  A shared object is logically the same object across the cluster.


Preserving Object Identity Across the Cluster


The core of Open Terracotta is a technology called DSO (Distributed Shared Objects), which clusters Java objects across servers, coordinates threads among the different JVMs, and performs distributed method invocation. DSO clusters at the JVM level using bytecode instrumentation. This lets it monitor when your application reads or modifies any shared object and then transmit only the fields that actually changed, and only when they are needed. You use an XML configuration file to tell DSO which objects to share.

<dso>
  <instrumented-classes>
    <include>
      <class-expression>com.townsend-camera.cart.*</class-expression>
    </include>
  </instrumented-classes>
  <web-applications>
    <web-application>Cart</web-application>
  </web-applications>
</dso>

Sample Configuration for Clustering Cart Application


Object sharing effectively takes place below the application, so the application is not specifically aware of or directly involved in the clustering.  DSO materializes the shared objects in the JVMs and keeps the objects coherent, including object identity, across the JVMs.

Preserving object identity makes clustering more transparent for application developers but has a potentially greater impact on clustering of frameworks and tools, such as Spring, Lucene, and Glassbox.  Just like the Java applications that use them, frameworks and tools written in Java are built on objects.  One key difference is that it is often impractical to access and modify the source code of frameworks and tools used by your application.  By weaving the clustering code into classes as they are loaded into the JVM, Open Terracotta does not rely on the source code.


Clustering at the JVM Level


Of course, some classes are inherently unclusterable, such as classes that manage threads or perform disk I/O. Writing a sales transaction to disk from one application node should not automatically cause every other application node in the cluster to attempt to do the same. This brings us back to the contrast of transient application data versus persistent data, and our earlier guideline: Resist with all your ability treating transient application data as persistent data to solve some architectural need.

In our shopping cart example, a cart is generally transient application data until the user checks out and makes a purchase. Clustering the cart enables scaling out the application across a server farm and ensures that if an application node fails, the user's cart is preserved and made available on another application node. Deferring persistence until the application specifically needs it makes it easier to cluster your application while still preserving object identity across the cluster, which keeps your application code simple.