[Deploying Sakai] Deployment sizing question

Thu May 14 13:01:07 PDT 2009

In gmane.comp.cms.sakai.production, you wrote:
> Are you using dbcp or c3p0 for your connection pool? I also wonder if your
> appservers are really not memory-bound - what maximum full GC times do you
> see?

We are using dbcp.

Since moving to 6GB heaps we rarely see full GCs (three times on individual
servers in the last year), and when we do it has typically been due to a very
large result set from a query (Noah can point you to some OSP ones he found
that were caused by a Hibernate bug, and Jim Eng can probably point you to
some resources-related ones).
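
For anyone wanting to check GC activity on their own appservers, here is a
minimal sketch using the standard java.lang.management API. The class name
GcTotals and the method summary() are hypothetical, not part of Sakai. Note
that these counters are cumulative totals per collector since JVM start, not
maximum pause times; individual pause times would have to come from the GC
logs instead.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Sakai: summarizes cumulative GC activity.
class GcTotals {

    // Returns one line per collector with its collection count and the
    // total time spent collecting, in milliseconds, since JVM start.
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(String.format("%s: collections=%d, totalMillis=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summary());
    }
}
```

Collector names vary by JVM and GC settings, so which line corresponds to
full collections depends on the configuration in use.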

> What in your view are the known and yet-to-be-determined reasons for Sakai
> going unresponsive - can you post some JIRA refs (or create new JIRAs if there
> are problems you're aware of that aren't in JIRA, no matter how sketchy)? What
> do the logs say when they're unresponsive?

In most cases, when individual appservers go unresponsive, there is no clear
indication as to what the problem is -- we automatically trigger thread dumps
and capture the SQL of active queries from that server to the database. There
is usually nothing obvious on the db side, and the CPU on the appserver is not
pegged. In the thread dumps in the appserver logs we see lots of blocked
threads, but by the time we see the alarm it is hard to find the cause. The
last thing in the logs in some cases has been email digest processing (there
are a bunch of synchronized calls in the digest processing code), and more
recently we saw the OSP-related large result set/mis-constructed Hibernate
queries as probable causes.
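
To illustrate the kind of blocked-thread analysis described above, here is a
minimal sketch, assuming a Java 8 or later JVM. It is not how our dumps are
actually triggered; the class BlockedThreadReport and its methods
countWaitersByOwner() and report() are hypothetical names. It uses the
standard ThreadMXBean API to count how many BLOCKED threads are waiting on a
lock held by each owning thread, and prints the top frames of each owner.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Sakai: groups blocked threads by lock owner.
class BlockedThreadReport {

    // For each lock-owning thread id, counts the threads BLOCKED waiting on it.
    static Map<Long, Integer> countWaitersByOwner(ThreadInfo[] infos) {
        Map<Long, Integer> waiters = new HashMap<>();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread ended while the dump was being taken
            }
            // getLockOwnerId() returns -1 when no thread owns the lock.
            if (info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockOwnerId() != -1) {
                waiters.merge(info.getLockOwnerId(), 1, Integer::sum);
            }
        }
        return waiters;
    }

    // Takes a dump of all live threads and describes each lock owner
    // that has at least one blocked waiter.
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);
        Map<Long, ThreadInfo> byId = new HashMap<>();
        for (ThreadInfo info : infos) {
            if (info != null) {
                byId.put(info.getThreadId(), info);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Long, Integer> e : countWaitersByOwner(infos).entrySet()) {
            ThreadInfo owner = byId.get(e.getKey());
            String name = (owner == null) ? "unknown" : owner.getThreadName();
            sb.append(e.getValue()).append(" thread(s) blocked on a lock held by \"")
              .append(name).append("\" (id ").append(e.getKey()).append(")\n");
            if (owner != null) {
                // Show only the top few frames of the owning thread.
                StackTraceElement[] stack = owner.getStackTrace();
                for (int i = 0; i < Math.min(5, stack.length); i++) {
                    sb.append("    at ").append(stack[i]).append('\n');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

An owner with many waiters is only a candidate for the originating thread; it
may itself be waiting on something else, such as a slow query.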

Part of the problem is that this doesn't happen all the time, or often enough
to suggest a pattern. There are some edge cases or seldom-used functions in
currently seldom-used tools which likely cause thread lockups, but by the time
we can look there is enough other activity that it is hard to tell which
thread originated the problems...

Most recently, we had end-of-semester problems, most probably related to some
set of queries causing the db to go unresponsive. As soon as the appservers
were restarted, the db returned to normal, suggesting some sort of locking
problem or concurrency-related lockup. However, we still haven't found
anything definitive, and there has been discussion about this on the -devel
list, so I won't go into it here.

Adi