[Deploying Sakai] [Building Sakai] Strange database connection memory issue at U-M
botimer at umich.edu
Thu Sep 17 12:24:05 PDT 2009
I think we would do well to consider a simple work queue. We would
push things that are expected to run long there. I'd say start with a
generous estimate (10 sec at our scale, maybe?). Instrumenting the
pool to know how long connections are held would help identify
candidates to be taken out of immediate mode into queued mode. Some
of these pieces would need asynchronous handling / notification.

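The pool instrumentation above could be as simple as recording a timestamp at checkout and comparing it at release. A minimal sketch (hypothetical names, not Sakai's actual pool code), generic over the pooled type so it needs no live database:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of hold-time instrumentation: wrap checkout/release so the pool
// records how long each connection is held, making long-held connections
// visible as candidates for queued rather than immediate execution.
public class InstrumentedPool<T> {
    private final Queue<T> idle = new ArrayDeque<>();
    private final Map<T, Long> checkedOutAt = new HashMap<>();
    private long longestHoldNanos = 0;

    public InstrumentedPool(Collection<T> resources) {
        idle.addAll(resources);
    }

    public synchronized T checkout() {
        T r = idle.poll();   // null when exhausted; a real pool would block
        if (r != null) {
            checkedOutAt.put(r, System.nanoTime());
        }
        return r;
    }

    public synchronized void release(T r) {
        Long start = checkedOutAt.remove(r);
        if (start != null) {
            longestHoldNanos = Math.max(longestHoldNanos, System.nanoTime() - start);
        }
        idle.add(r);
    }

    public synchronized long longestHoldMillis() {
        return longestHoldNanos / 1_000_000;
    }
}
```

A real version would log the checkout stack trace alongside the duration, so the worst offenders can be traced back to their callers.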
Once there, we could incrementally tune our attention threshold
closer to the practical limits of instantaneous vs. queued work
(maybe 2 seconds per thread?). As the set of things we'd break by
imposing thread timeouts dwindled, we could consider being vicious to
threads running unexpectedly long. With a reasonable, consistent
infrastructure for deciding which mode something should use, the
decision is much less daunting and the rules of "fast, queued, or
killed" are tolerable.

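The "fast, queued, or killed" rule could be sketched as below. The thresholds (2 seconds for in-thread work, 10 seconds before a queued task is cancelled) are the illustrative numbers from the discussion above, not measured limits:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of routing work by expected duration: short tasks run in the
// request thread, long tasks go to a queue with a hard cancellation limit.
public class WorkRouter {
    public enum Mode { FAST, QUEUED }

    static final long FAST_LIMIT_MS = 2_000;   // run immediately in the request thread
    static final long HARD_LIMIT_MS = 10_000;  // queued work beyond this is "killed"

    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    // Decide the mode from a measured or estimated duration.
    public Mode classify(long estimatedMs) {
        return estimatedMs <= FAST_LIMIT_MS ? Mode.FAST : Mode.QUEUED;
    }

    // Queued work runs under a hard timeout; overruns are cancelled.
    public <T> T runQueued(Callable<T> task) throws Exception {
        Future<T> f = queue.submit(task);
        try {
            return f.get(HARD_LIMIT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupt the runaway task
            throw e;
        }
    }

    public void shutdown() {
        queue.shutdownNow();
    }
}
```

The point is less the code than the shared decision point: once every long-running piece goes through one router, tightening the thresholds is a configuration change rather than a rewrite.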
In this model, we would gain practical expertise in how to deal with
failures more gracefully. I believe our current model is too
optimistic, generally assuming complete success of every operation
attempted, which adds brittleness to the app and data. It is worth
studying each of the big cloud solutions here to see the patterns
they use.

The important thing to note is that there is a very gentle ramp
available here. A little tooling for deferred/notified work and
tracking ill-behaved transactions wouldn't mandate immediate overhaul
of all code. It would just give us a way to start moving toward a
more resilient platform without the all-or-nothing paralysis that is
easy to find in these topics.

Please also note that this exercise need not be restricted to 2.x or
3.x. There are real scalability needs to consider on either platform
and the basic techniques and wisdom (if not code) are transferable.

On Sep 17, 2009, at 2:53 PM, Speelmon, Lance Day wrote:
> Perhaps - probably the right place to start looking. The problem is
> that the tools to help troubleshoot these kinds of leaks cannot be used
> in production environments. For example, if you tell the connection
> pool to force return connections to the pool after a set amount of
> time - it interferes with batch activities. So, then the question
> becomes: can these leaks be triggered in a test environment? Or maybe
> batch should be running out of a separate pool to allow this kind of
> troubleshooting?
> Lance Speelmon
> Scholarly Technologist
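[A separate batch pool as Lance describes could be sketched with commons-dbcp, the pool Sakai 2.x ships with. The abandoned-connection setters are real DBCP 1.x properties; the URLs, credentials, and class name here are placeholders. The batch pool turns on abandoned-connection tracking so suspected leaks are logged and reclaimed, while the interactive pool is left alone:]

```java
import org.apache.commons.dbcp.BasicDataSource;

// Configuration sketch: two DBCP pools, with leak-hunting settings
// enabled only on the batch pool so interactive traffic is unaffected.
public class PoolConfig {
    public static BasicDataSource interactivePool() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://db.example.edu/sakai");  // placeholder
        ds.setUsername("sakai");
        ds.setPassword("secret");                         // placeholder
        ds.setMaxActive(100);
        return ds;
    }

    public static BasicDataSource batchPool() {
        BasicDataSource ds = interactivePool();
        ds.setMaxActive(10);
        ds.setRemoveAbandoned(true);        // force-return leaked connections...
        ds.setRemoveAbandonedTimeout(300);  // ...after 300 seconds in use
        ds.setLogAbandoned(true);           // log the checkout stack trace
        return ds;
    }
}
```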
> On Sep 17, 2009, at 2:31 PM, Seth Theriault wrote:
>> Sean DeMonner wrote:
>>> We've wondered that too, but haven't found anything that is
>>> running at those times. The jumps are also not consistently
>>> spaced so we haven't seen a pattern.
>> I am thinking that there is some new JDBC-related code in 2.6 and
>> the database connections are not being closed properly (I guess
>> in this case, it would mean releasing the conns back to the
>> pool). I know that Michigan has the capacity to have large sets
>> of results returned from any query.
>> Just a thought.
>> production mailing list
>> production at collab.sakaiproject.org
>> TO UNSUBSCRIBE: send email to production-
>> unsubscribe at collab.sakaiproject.org
>> with a subject of "unsubscribe"