[Deploying Sakai] Deadlock in portal/velocity intermittently seen soon after startup

David Adams da1 at vt.edu
Sat Nov 2 08:49:55 PDT 2013


Hey everyone,
I wanted to run something by you that we've been seeing every other
week or so, in case anyone else has run into it or has any insight
into the root cause or avenues for further investigation.

Some background: We restart our six app servers each Saturday morning,
partly to keep things from getting too crufty and partly because our
DBAs have to restart Oracle each Saturday to keep it happy, so we have
to restart our app servers anyway.

What we've been seeing, not every weekend but many weekends,
particularly since we upgraded to 2.9 (although I'm pretty sure this
was happening on 2.8 and older as well, just far less often), is that
one, two, or three of the app servers will start to report an
ever-growing load average, even though CPU utilization remains low and
response time does not slow down. The load might grow in fits and
spurts, and if we ignore it because response time is fine, it
continues to grow until eventually Tomcat stops responding and we're
forced to restart.

The first few times we just restarted Tomcat and went on with our
lives, but it's happening often enough now that I started looking into
it. In our httpd access logs (we reverse proxy Tomcat via Apache), we
see 502 errors logged after Apache gives up waiting for a response
from Tomcat. The number of these errors corresponds exactly to the
base load average the server is stuck at. For example, if the load
average is stuck at 46.xx to 47.xx or so, we will find exactly 46
"502" responses in the access log since the restart.
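
For what it's worth, the count itself is nothing fancy. Here's a
minimal sketch of it in Java, assuming a combined-format access log
where the status code is the ninth space-delimited field; the class
name and the log path are just placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Quick count of 502 responses in an Apache access log. Assumes a
// combined-format log where the status code is the ninth
// space-delimited field; the path below is just an example.
public class Count502 {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("/var/log/httpd/access_log"));
        long count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(" ");
            if (fields.length > 8 && "502".equals(fields[8])) {
                count++;
            }
        }
        in.close();
        System.out.println("502 responses since restart: " + count);
    }
}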

So clearly this is an issue of stuck threads, and a thread dump of the
Java process confirms it: we see exactly 46 threads with a stack trace
like:

"tomcat-1" daemon prio=10 tid=... nid=... runnable [0x...]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.getEntry(HashMap.java:446)
        at java.util.HashMap.get(HashMap.java:405)
        at org.apache.velocity.util.introspection.ClassMap$MethodCache.get(ClassMap.java:249)
        [...]
        at org.apache.velocity.app.VelocityEngine.mergeTemplate(VelocityEngine.java:345)
        at org.sakaiproject.portal.charon.velocity.VelocityPortalRenderEngine.render(VelocityPortalRenderEngine.java:231)

All 46 (or whatever the number happens to be at the moment) will be
exactly the same. The last bit of Sakai code in the trace is
VelocityPortalRenderEngine.java line 231 (our code is based off the
2.9.0 tag, but this code hasn't changed significantly in trunk), where
it calls mergeTemplate in the Velocity library.
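
I don't know the Velocity internals well enough to say for certain,
but that trace looks a lot like the well-known failure mode of a plain
java.util.HashMap being read and written from multiple threads without
synchronization: on Java 7 and earlier, a resize racing with other
operations can corrupt a bucket's linked list and leave readers
looping inside getEntry() while still showing as RUNNABLE. The low CPU
we see doesn't entirely fit that picture, so treat this as a guess.
Here's a toy sketch of that failure mode; it has nothing to do with
Velocity's actual code:

import java.util.HashMap;
import java.util.Map;

// Toy illustration of the unsynchronized-HashMap hazard (not Velocity
// code). With concurrent puts racing a resize on Java 7 and earlier,
// the bucket chains can become corrupted and a later get() may loop
// forever inside getEntry(), showing as RUNNABLE just like the traces
// above. Whether it hangs on any given run depends on timing, so
// treat it as illustrative only.
public class HashMapHangDemo {
    private static final Map<Integer, Integer> cache = new HashMap<Integer, Integer>();

    public static void main(String[] args) {
        // Several writers populate the "cache" with no synchronization.
        for (int t = 0; t < 4; t++) {
            final int offset = t * 1000000;
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1000000; i++) {
                        cache.put(offset + i, i);
                    }
                }
            }).start();
        }
        // A reader that may eventually spin forever in HashMap.getEntry().
        new Thread(new Runnable() {
            public void run() {
                while (true) {
                    cache.get(42);
                }
            }
        }).start();
    }
}

Whether Velocity's introspection cache actually works that way in the
version we're running, I haven't verified.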

Going back to the access logs, all of the hung threads were requests
for /portal/site/<site-id> or /portal/site/<site-id>/page/<page-id>.
It didn't affect every request with that pattern, just about a fourth
of them (a number inflated somewhat by users' tendency to reload a
stuck page a few times before giving up). The hung threads started
about two hours after we had started the server, and of the 10
distinct URIs that were hanging, none had loaded successfully at any
point between the restart and the first request that hung, and every
subsequent request to those URIs, whether by the same person or
different people, would hang as well.

So there's clearly some sort of resource contention going on here;
whether or not it's technically a deadlock, it behaves like one, with
the threads just waiting for something and not taking any CPU time.
Based on what I've learned so far, I would surmise that the times in
the past when Tomcat stopped responding entirely were when all of its
request threads had been taken up by these hung requests.

So, I don't know whether this is a concurrency issue that could be
fixed in the Sakai code, or whether it's a problem with Velocity, or
something else. I'm also at a loss to explain why it happens soon
after startup, yet we never see it start on servers that have been up
and running for a few days. That makes me think it's a cache-loading
issue of some kind, and indeed the Velocity calls in the stack do
reference a "cache".
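
If the root cause does turn out to be a bare HashMap used as a cache
without synchronization, the general fix for that class of problem is
to make the cache thread-safe, e.g. a ConcurrentHashMap populated via
putIfAbsent. Just to illustrate the pattern (this is not actual
Velocity or Sakai code, and the SafeMethodCache name and lookup method
are made up for the example):

import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of a thread-safe, lazily populated method cache, i.e. the
// general shape of the fix if the culprit really is an unsynchronized
// HashMap cache. Illustrative only; not actual Velocity or Sakai code.
public class SafeMethodCache {
    private final ConcurrentMap<String, Method> cache =
            new ConcurrentHashMap<String, Method>();

    // Resolves and caches no-arg methods only, to keep the example short.
    public Method lookup(Class<?> clazz, String name) throws NoSuchMethodException {
        String key = clazz.getName() + "#" + name;
        Method method = cache.get(key);
        if (method == null) {
            // Two threads may race to resolve the same key; putIfAbsent
            // keeps the map internally consistent and both end up with
            // the single winning entry.
            Method resolved = clazz.getMethod(name);
            Method existing = cache.putIfAbsent(key, resolved);
            method = (existing != null) ? existing : resolved;
        }
        return method;
    }
}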

Anyway, I'm mainly wondering whether anyone out there has run into
anything similar and/or has any ideas about how to address it. Since
the server doesn't stop responding until hours or days after we notice
the increasing load, it's simple enough to drain the app server of
active users and restart it, so this isn't an emergency, but it's
happening often enough that it's a pain to deal with. Any advice or
ideas will be appreciated. Thanks!



--
David Adams
Director of Server and Network Operations
Virginia Tech TLOS, Technology-enhanced Learning and Online Strategies

