[Building Sakai] 1000s of threads and the org.sakaiproject.site.api.SiteService.userSiteCache

Thu Sep 18 02:32:19 PDT 2014

We've recently gone live with Sakai 10 and are seeing huge numbers (up
to 15,000) of threads being spawned at various points throughout the
day. In a thread dump, all of these threads are shown as:

"Thread-40770" daemon prio=10 tid=0x00007fd1acca2800 nid=0x4966 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

As they don't have a Java stack trace there isn't much to debug. My
guess is that they are running C/C++ code and got spawned through JNI,
but we haven't yet managed to capture a thread dump with the -m option
to show any native frames.
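
As a rough way to count how many live threads have no Java frames
without waiting for a full dump, something like the sketch below could
be run inside the affected JVM. It only uses Thread.getAllStackTraces()
from the JDK. The class and method names (EmptyStackThreads,
countEmptyStacks) are made up for illustration, and it is an assumption
that these threads would be reported there with an empty stack trace.

```java
import java.util.Map;

public class EmptyStackThreads {

    /** Counts the entries whose Java stack trace has no frames. */
    public static int countEmptyStacks(Map<Thread, StackTraceElement[]> traces) {
        int count = 0;
        for (StackTraceElement[] frames : traces.values()) {
            if (frames.length == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM and their Java frames.
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        System.out.println("live threads: " + traces.size()
                + ", without Java frames: " + countEmptyStacks(traces));
    }
}
```

This would have to be called from code running inside the Sakai JVM,
which is not shown here. It cannot show native frames, so a dump taken
with -m is still needed for that.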

Although we have 1000s of runnable threads, processing them only uses
one core of the machine, and they all slowly die off (it took 4+ hours
for 17k threads to die off on a slow machine).

What is a little strange is that these huge spikes in threads
coincide with a steep drop in the number of objects in the
org.sakaiproject.site.api.SiteService.userSiteCache, and these drops in
the cache are mirrored across all nodes in the cluster. It turns out
we have a Quartz job that runs twice a day to update all the groups
that come in through the GroupProvider, and the spikes match its runs.
But why should evicting lots of rows from the userSiteCache spawn lots
of threads? Object finalisation is normally done on one thread (that's
well named).
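
One way to line the thread spikes up against the job runs more
precisely would be to sample the JVM's thread counters at a fixed
interval and compare the timestamps. The sketch below uses the standard
ThreadMXBean from java.lang.management. The class name
ThreadCountSampler, the three samples and the one-second interval are
arbitrary choices for illustration, not anything from Sakai.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountSampler {

    /** Formats the current thread counters as a single log line. */
    public static String sample(ThreadMXBean threads) {
        return "live=" + threads.getThreadCount()
                + " daemon=" + threads.getDaemonThreadCount()
                + " peak=" + threads.getPeakThreadCount()
                + " started=" + threads.getTotalStartedThreadCount();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Take a few timestamped samples; a real run would loop for hours.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.currentTimeMillis() + " " + sample(threads));
            Thread.sleep(1000L);
        }
    }
}
```

The "started" counter only ever increases, so a jump between two
samples shows how many threads were created in that interval even if
some of them have already exited.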

Does anyone have any thoughts?

This is Sakai 10.x on Debian with OpenJDK 1.7.

-- 
  Matthew Buckett, VLE Developer, IT Services, University of Oxford