[Deploying Sakai] Deployment sizing question

Noah Botimer botimer at umich.edu
Sat May 16 05:19:21 PDT 2009


Though we saw it there, the large result set problem is not really
inherent in OSP. The problem was that I was extracting data in a new
way with some direct HQL queries. There is a Hibernate bug [1] that
turned a query for a reasonable set of Forms into a huge cross product
(millions of rows). It is not yet fixed in their main line, so it
could recur anywhere we write HQL by hand.
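
For illustration, the shape of the problem is roughly this (just a
sketch with made-up entity names, assuming an open org.hibernate.Session
called "session" -- not the actual OSP query or the exact bug trigger):

    // Listing two entities with no join condition between them asks
    // Hibernate for a cartesian product: every Form paired with every
    // SiteMember, i.e. n * m rows.
    org.hibernate.Query bad = session.createQuery(
        "from Form f, SiteMember m where f.siteId = :siteId");

    // The intended query ties the two entities together explicitly:
    org.hibernate.Query good = session.createQuery(
        "from Form f, SiteMember m " +
        "where f.siteId = :siteId and m.siteId = f.siteId");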

The huge dataset causes a problem in two ways: the query takes a while
to run but, more importantly, a .list() call on the result set tries
to pull an enormous amount of data into memory and never aborts. We
are not aware of anywhere this occurs in the main or contrib Sakai
code. I just hit a subtle typo case that happens to crash servers as
fast as the heap fills up (about 15 minutes on our heaps).
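
As a defensive pattern (again just a sketch, assuming Hibernate 3.x,
the same hypothetical Form entity, and a siteId variable in scope),
capping or scrolling the results keeps a bad query from taking the
whole heap with it:

    // Cap the result size so a mis-constructed query can only pull
    // back a bounded number of rows instead of exhausting the heap.
    java.util.List forms = session.createQuery(
            "from Form f where f.siteId = :siteId")
        .setParameter("siteId", siteId)
        .setMaxResults(1000)
        .list();

    // Or walk the rows with a scrollable cursor instead of
    // materializing the entire result set in memory at once.
    org.hibernate.ScrollableResults rows = session.createQuery("from Form")
        .setFetchSize(100)
        .scroll(org.hibernate.ScrollMode.FORWARD_ONLY);
    while (rows.next()) {
        Form form = (Form) rows.get(0);
        // ...process one row...
        session.evict(form);  // keep the first-level cache small
    }
    rows.close();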

Thanks,
-Noah

[1] http://opensource.atlassian.com/projects/hibernate/browse/HHH-2647

On May 14, 2009, at 4:01 PM, R.P. Aditya wrote:

> In gmane.comp.cms.sakai.production, you wrote:
>> Are you using dbcp or c3p0 for your connection pool? I also wonder if
>> your appservers are really not memory-bound - what maximum full GC
>> times do you see?
>
> We are using dbcp.
>
> Since moving to 6GB heaps we rarely see full GCs (as in, three times
> on individual servers in the last year), and typically they have been
> due to a very large result set in a query (Noah can point you to some
> OSP ones he's found due to a Hibernate bug, and Jim Eng can probably
> point you to some resources-related ones).
>
>> What in your view are the known and yet-to-be-determined reasons for
>> Sakai going unresponsive - can you post some JIRA refs (or create new
>> JIRAs if there are problems you're aware of that aren't in JIRA, no
>> matter how sketchy)? What do the logs say when they're unresponsive?
>
> In most cases, when individual appservers go unresponsive, there is no
> clear indication as to what the problem is -- we automatically trigger
> thread dumps and capture the SQL of active queries from that server to
> the database. There is usually nothing obvious on the db side, and the
> CPU on the appserver is not pegged. In the thread dumps in the
> appserver logs we see lots of blocked threads, but by the time we see
> the alarm it is hard to find the cause. The last thing in the logs in
> some cases has been email digest processing (there are a bunch of
> synchronized calls in the digest processing code), and more recently
> we saw the OSP-related large result set / mis-constructed Hibernate
> queries as probable causes.
>
> Part of the problem is that this doesn't happen all the time, or often
> enough to suggest a pattern. There are some edge cases and seldom-used
> functions in currently little-used tools which likely cause thread
> lockups, but by the time we can look there is enough other activity
> that it is hard to tell which thread originated the problems...
>
> Most recently, we had end-of-semester problems most probably related
> to some set of queries causing the db to go unresponsive, but as soon
> as the appservers were restarted the db returned to normal, suggesting
> some sort of locking problem or concurrency-related lockup. However,
> we still haven't found anything definitive, and there has been
> discussion about this on the -devel list, so I won't go into it here.
>
> Adi


