One of the most interesting problems I worked on was a deadlock between the JVM and the Database.

This blog post is a case study of sorts.
I will talk about the symptoms we saw, how we analysed the problem, and the solution.

Technologies involved
A J2EE web application running on Tomcat
Oracle Database

Remind me one more time – what is a deadlock?
A deadlock happens when two (or more) processes have each acquired a lock and are now waiting for the other to release its lock before they can proceed. Since each process waits for the other to finish first, neither can make progress – there is no way out unless someone intervenes. Oracle detects such a deadlock almost immediately and breaks it by rolling back one of the offending statements; the JVM can detect a deadlock between threads (a thread dump will report it), but it does not break it on its own.
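As a refresher, here is a minimal, self-contained Java sketch (not taken from our application) of a classic two-thread deadlock inside the JVM: each thread grabs one lock and then waits forever for the other.

public class JvmDeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        // Thread 1 takes LOCK_A, then wants LOCK_B.
        new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK_A) {
                    pause();
                    synchronized (LOCK_B) { System.out.println("thread-1 done"); }
                }
            }
        }, "thread-1").start();

        // Thread 2 takes LOCK_B, then wants LOCK_A - each now waits on the other forever.
        new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK_B) {
                    pause();
                    synchronized (LOCK_A) { System.out.println("thread-2 done"); }
                }
            }
        }, "thread-2").start();
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}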

So what's so special about this deadlock?
When the deadlock spans two systems – one lock lives in the JVM and the other in Oracle – it is impossible for either system to detect it, because neither knows the other's lock exists. Each system sees only a one-way wait, and one-way waits are (generally) not a problem. Deadlocks are. In our case, the lock in the JVM came from a synchronized block, and the lock in Oracle came from an update on a table with a pending commit. Both locks would exist forever. Why? Read on…
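To make the shape of the problem concrete, here is a simplified, hypothetical sketch of the pattern (the table, class and method names are made up; the real legacy code was far messier). One code path updates a row and then tries to enter a synchronized block before committing; the other enters the synchronized block first and then updates the same row. Run by two threads at the wrong moment, each ends up waiting on a lock the other will never release.

import java.sql.Connection;
import java.sql.PreparedStatement;

// Illustrative only. Assumes each thread uses its own JDBC connection with auto-commit off.
public class CrossSystemDeadlockSketch {
    private static final Object JVM_LOCK = new Object();

    // Thread B runs this: it updates a row (taking an Oracle row lock)
    // and, before committing, tries to enter the synchronized block.
    void pathB(Connection conn) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE accounts SET balance = balance - 10 WHERE id = 1");
        ps.executeUpdate();            // Oracle row lock on id = 1 is now held, commit pending
        synchronized (JVM_LOCK) {      // blocks: Thread A already owns JVM_LOCK
            conn.commit();             // never reached, so the row lock is never released
        }
    }

    // Thread A runs this: it enters the synchronized block first and then issues
    // an update on the same row, so it waits inside Oracle for Thread B's pending commit.
    void pathA(Connection conn) throws Exception {
        synchronized (JVM_LOCK) {      // JVM lock is now held by Thread A
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE accounts SET balance = balance + 10 WHERE id = 1");
            ps.executeUpdate();        // blocks in Oracle: the row is locked by Thread B's session
            conn.commit();
        }
    }
}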

Symptoms:

The first symptom we saw was that the application became unresponsive: any request sent through the browser would never complete. The second symptom was that the application was running out of connections. Our knee-jerk reaction was to increase the connection pool size, but that simply delayed the inevitable.

Growing suspicion:
After we started looking at the code, it was soon clear that something very fishy was going on. The application had some strange legacy code that relied heavily on JVM locks during database interactions. We soon realized that the “Out of connections” error was just a red herring. The real cause was something else.

Diagnosis:

There were two things that were instrumental in our analysis.
The first was the JVM thread dump. The second was a report generated against the database, showing hung database sessions.
Both showed a very similar pattern. In the thread dump, one thread had acquired a lock and every other thread was waiting for that lock to be released.
The database session report showed the same shape: one session was doing stuff and every other session was waiting for that session to complete.

The database session report said something like this
DB Session 563 is waiting for DB Session 234 to complete


And in the JVM thread dump, we saw
Thread B is waiting for Thread A to release the lock
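As an aside, this kind of information is not hard to gather. A thread dump can be taken with jstack (or kill -3 on the Tomcat process), and on Oracle 10g and later a blocking-session report can be built from the V$SESSION view, which exposes a BLOCKING_SESSION column. Here is a rough sketch of such a report (our actual report came from the DBAs and looked a little different):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: list DB sessions that are blocked, and the session blocking each of them.
// Assumes Oracle 10g+ where V$SESSION has a BLOCKING_SESSION column.
public class BlockedSessionReport {
    static void print(Connection conn) throws Exception {
        String sql = "SELECT sid, blocking_session, event FROM v$session "
                   + "WHERE blocking_session IS NOT NULL";
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(sql);
        while (rs.next()) {
            System.out.println("DB Session " + rs.getLong("sid")
                    + " is waiting for DB Session " + rs.getLong("blocking_session")
                    + " (wait event: " + rs.getString("event") + ")");
        }
        rs.close();
        st.close();
    }
}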

The proof
The similarity in the pattern between the two reports, coupled with our knowledge of the code, quickly led us to conclude that it must be a deadlock between the JVM and Oracle. But we didn’t have the smoking gun… not yet.

What we needed was a way to associate the two pieces of information. We needed to stamp each database session with the name of the JVM thread using it. Oracle provides a mechanism to attach client info to a database session, via the DBMS_APPLICATION_INFO package. This is how we used it…

String clientInfo = "<" + Thread.currentThread().getId() + ":" + Thread.currentThread().getName() + ">";
CallableStatement cs = getConnection().prepareCall("{ call dbms_application_info.set_client_info(?) }");
cs.setString(1, clientInfo);
cs.execute();
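With every session stamped like this, the blocking-session query shown earlier can also pull in the CLIENT_INFO column from V$SESSION, so each database session in the report carries the id and name of the JVM thread that owns it. A sketch again, reusing the same getConnection() as above:

String sql = "SELECT sid, client_info, blocking_session FROM v$session WHERE blocking_session IS NOT NULL";
ResultSet rs = getConnection().createStatement().executeQuery(sql);
while (rs.next()) {
    // client_info carries the "<threadId:threadName>" stamp set above
    System.out.println("DB Session " + rs.getLong("sid") + " (" + rs.getString("client_info")
            + ") is waiting for DB Session " + rs.getLong("blocking_session"));
}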

Now the database session report gave us the proof we needed:

DB Session 563 (associated with Thread A) is waiting for DB Session 234 (associated with Thread B) to complete

And in the JVM thread dump, we already had
Thread B is waiting for Thread A to release the lock

Q.E.D !!!

The cause and the solution
In our case, the cause was incorrect use of JVM locks in the application’s legacy code. After some very intense analysis, we concluded that the locks in the JVM existed for no good reason and could simply be eliminated.

Alternate solution
An alternate solution (suggested by Venkat when we met over lunch at NFJS), which we considered very closely but didn’t pursue, was to use the enhanced Java 5 Lock API, which allows a timeout on lock acquisition and hence a way to recover from such a situation. I think it is quite an effective strategy for getting out of a sticky spot like this. The main problem we anticipated was that it would cause the application to occasionally lose a business transaction, and without an effective recovery mechanism built into the system, the consequences of that were unknown.
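For completeness, this is roughly what that alternative would have looked like: replace the synchronized block with a java.util.concurrent.locks.ReentrantLock, try to acquire it with a timeout, and roll back the in-flight database work if the lock never arrives. Just a sketch of the idea, not something we shipped:

import java.sql.Connection;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    private final ReentrantLock lock = new ReentrantLock();

    void updateWithTimeout(Connection conn) throws Exception {
        // Wait at most 30 seconds for the JVM lock instead of blocking forever.
        if (!lock.tryLock(30, TimeUnit.SECONDS)) {
            conn.rollback();   // give up: release any Oracle locks this session holds
            throw new IllegalStateException("Could not acquire lock - backing off to break a possible deadlock");
        }
        try {
            // ... do the database work ...
            conn.commit();
        } finally {
            lock.unlock();
        }
    }
}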
