[Deploying Sakai] [Building Sakai] Strange database connection memory issue at U-M

Noah Botimer botimer at umich.edu
Thu Sep 17 12:24:05 PDT 2009


I think we would do well to consider a simple work queue. We would
push anything expected to run long onto it. I'd say start with a
generous threshold for what counts as long-running (10 seconds at our
scale, maybe?). Instrumenting the pool to know how long connections
are held would help identify candidates to be moved out of immediate
mode into queued mode. Some of these pieces would need asynchronous
handling / notification at the UI.
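
For what it's worth, here is a rough sketch of the kind of pool
instrumentation I have in mind -- a wrapper that notes how long each
connection is held between checkout and close. The class name,
threshold, and logging are made up for illustration; this is not
existing Sakai code.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

/**
 * Sketch: wraps the pool's DataSource and logs any connection held
 * longer than a threshold between getConnection() and close().
 */
public class ConnectionHoldTimer {

    private final DataSource pool;
    private final long warnAfterMillis;

    public ConnectionHoldTimer(DataSource pool, long warnAfterMillis) {
        this.pool = pool;
        this.warnAfterMillis = warnAfterMillis;
    }

    public Connection getConnection() throws SQLException {
        final Connection real = pool.getConnection();
        final long checkedOut = System.currentTimeMillis();
        final String thread = Thread.currentThread().getName();

        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args)
                    throws Throwable {
                // Measure the hold time when the caller returns the connection.
                if ("close".equals(method.getName())) {
                    long held = System.currentTimeMillis() - checkedOut;
                    if (held > warnAfterMillis) {
                        System.err.println("connection held " + held
                                + " ms by " + thread);
                    }
                }
                return method.invoke(real, args);
            }
        };

        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class[] { Connection.class }, handler);
    }
}

Connections held past the threshold would show up in the logs along
with the thread that held them, which is exactly the list of
candidates for queued mode.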

Once there, we could incrementally tune our attention threshold  
closer to the practical limits of instantaneous vs. queued work  
(maybe 2 seconds per thread?). As the set of things we'd break by  
imposing thread timeouts dwindled, we could consider being vicious to  
threads running unexpectedly long. With a reasonable, consistent  
infrastructure for deciding which mode something should use, the  
decision is much less daunting and the rules of "fast, queued, or  
killed" are tolerable.

In this model, we would gain practical expertise in how to deal with  
failures more gracefully. I believe our current model is too  
optimistic, generally assuming complete success of every operation  
attempted, which adds brittleness to the app and data. It is worth  
studying each of the big cloud solutions here to see patterns and  
techniques.
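
As one small, purely illustrative example of that mindset -- the
DeferredWorkQueue type below is hypothetical tooling, not existing
code -- an operation could be retried a bounded number of times and
then deferred rather than failing the whole request:

import java.util.concurrent.Callable;

public class RetryOrDefer {

    /** Hypothetical hook into the deferred-work tooling described above. */
    public interface DeferredWorkQueue {
        void enqueue(Callable<?> work);
    }

    // Assumes maxAttempts >= 1.
    public static <T> T call(Callable<T> op, int maxAttempts,
            DeferredWorkQueue fallback) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;   // treat failure as a normal, retryable outcome
            }
        }
        fallback.enqueue(op);   // still failing: defer and notify the user later
        throw last;
    }
}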

The important thing to note is that there is a very gentle ramp  
available here. A little tooling for deferred/notified work and  
tracking ill-behaved transactions wouldn't mandate immediate overhaul  
of all code. It would just give us a way to start moving toward a  
more resilient platform without the all-or-nothing paralysis that is  
easy to find in these topics.

Please also note that this exercise need not be restricted to 2.x or  
3.x. There are real scalability needs to consider on either platform,
and the basic techniques and wisdom (if not the code) are transferable.

Thanks,
-Noah

On Sep 17, 2009, at 2:53 PM, Speelmon, Lance Day wrote:

> Perhaps - probably the right place to start looking.  The problem is
> that the tools to help troubleshoot this kind of leak cannot be used
> in production environments.  For example, if you tell the connection
> pool to force return connections to the pool after a set amount of
> time - it interferes with batch activities.  So, then the question
> becomes: can these leaks be triggered in a test environment?  Or maybe
> batch should be running out of a separate pool to allow this kind of
> troubleshooting?  L
>
>
> Lance Speelmon
> Scholarly Technologist
>
> On Sep 17, 2009, at 2:31 PM, Seth Theriault wrote:
>
>> Sean DeMonner wrote:
>>
>>> We've wondered that too, but haven't found anything that is
>>> running at those times. The jumps are also not consistently
>>> spaced so we haven't seen a pattern.
>>
>> I am thinking that there is some new JDBC-related code in 2.6 and
>> the database connections are not being closed properly (I guess
>> in this case, it would mean releasing the conns back to the
>> pool). I know that Michigan has the capacity to have large sets
>> of results returned from any query.
>>
>> Just a thought.
>>
>> Seth