[Building Sakai] production disaster apparently due to chat

Noah Botimer botimer at umich.edu
Thu Nov 3 17:16:40 PDT 2011


I'm about to go beyond the first pass of optimization (reducing queries and time). Be warned that this may be wandering in the woods...


Is the determinism of maintaining our own cache of message data, keyed on message ID, worth it? Is it better than just using the Hibernate L2 cache? With people bouncing around, do we know how likely it is that there are stranded messages that never get picked up? Can you see the heap in production to check how many ChatDelivery/ChatMessage objects are floating around?

Thinking about the GC problem...

The ChatDelivery objects always end up embedding the full message, and that message is an attached Hibernate entity, which could be causing overhead (and the L2 cache won't save us here, since it clones). If these hang around, especially with higher concurrency in a room, that could be a lot of heap.

Is there a reasonable way to do a flyweight for the ChatDelivery objects? They embed the address (session ID + chat room ID), so there has to be an object per user/window. Beyond the address, though, they are identical for every user and simply duplicated. I'm not sure how long they live on the heap, but presumably they get collected relatively easily post-delivery. If they are thinner, the heap churn will be reduced.

It might be possible to make a thinner Delivery type that just has the address and a reference to a shared instance (possibly with an evicted/transient message to avoid Hibernate session management). The shared stuff should get collected after all deliveries have been made but we should probably keep the raw message data cached for a while.
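
Roughly the shape I'm picturing -- a sketch only, with made-up names; the real thing would extend BaseDelivery (or implement whatever the courier expects) and carry whatever fields the chat tool actually needs:

    // Immutable holder for the raw message data, evicted/detached from the
    // Hibernate session so deliveries never pin an attached entity on the heap.
    public class SharedChatMessageData {
        private final String messageId;
        private final String ownerId;
        private final String body;

        public SharedChatMessageData(String messageId, String ownerId, String body) {
            this.messageId = messageId;
            this.ownerId = ownerId;
            this.body = body;
        }

        public String getMessageId() { return messageId; }
        public String getOwnerId()   { return ownerId; }
        public String getBody()      { return body; }
    }

    // One thin delivery per user/window: just the address plus a reference to
    // the shared payload, rather than a full copy of the message every time.
    public class ThinChatDelivery {
        private final String address; // session ID + chat room ID
        private final SharedChatMessageData message;

        public ThinChatDelivery(String address, SharedChatMessageData message) {
            this.address = address;
            this.message = message;
        }

        public String getAddress() { return address; }
        public SharedChatMessageData getMessage() { return message; }
    }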

Thinking about it, there may be a more direct route to a truly shared ChatDelivery instance. The only thing that varies is the address, and there is a getAddress(). This could be overridden to always be request-bound and defer to SessionManager.getCurrentSession(). This may be safe if it is always called in a request thread and doesn't get some unwanted disposal treatment -- both of these appear to be true. BaseDelivery.act() is a no-op and picking up a delivery just removes the instance from an address list. It would be a matter of holding the delivery instances in the manager, keyed by message ID, to be able to use the shared object in receivedMessage().
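
To make that concrete, here is the sort of thing I mean -- a sketch only; I'm assuming the static SessionManager cover (org.sakaiproject.tool.cover.SessionManager), and the class and method names below are hypothetical:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    import org.sakaiproject.tool.cover.SessionManager;

    // One shared delivery per message. The address is not stored per window;
    // it is resolved from the current request's session every time getAddress()
    // is called, which is safe only because deliveries appear to be addressed
    // exclusively on request threads.
    public class SharedChatDelivery /* would extend BaseDelivery/ChatDelivery */ {
        private final String chatChannelId;

        public SharedChatDelivery(String chatChannelId) {
            this.chatChannelId = chatChannelId;
        }

        public String getAddress() {
            // Request-bound: whoever is asking right now is the addressee.
            return SessionManager.getCurrentSession().getId();
        }
    }

    // Manager-side sketch: hold the shared instances keyed by message ID so
    // receivedMessage() can hand the same object to the courier for everyone,
    // then drop them once deliveries are done or the message data has aged out.
    public class SharedDeliveryRegistry {
        private final ConcurrentMap<String, SharedChatDelivery> byMessageId =
                new ConcurrentHashMap<String, SharedChatDelivery>();

        public SharedChatDelivery deliveryFor(String messageId, String chatChannelId) {
            SharedChatDelivery d = byMessageId.get(messageId);
            if (d == null) {
                SharedChatDelivery created = new SharedChatDelivery(chatChannelId);
                SharedChatDelivery existing = byMessageId.putIfAbsent(messageId, created);
                d = (existing != null) ? existing : created;
            }
            return d;
        }
    }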

I'll stop here. I am willing to poke at this some, though. Any of this sound like a good idea?

Thanks,
-Noah

On Nov 3, 2011, at 7:00 PM, Hedrick Charles wrote:

> Ugh. I just realized a tradeoff. If you have any windows that are observing the chat session but for some reason the courier is not running, you can have a buildup in the delivery queue. If you distribute a full copy of the message with each delivery, the amount of memory used by that buildup is larger. My suspicion is that this is not enough to make it a bad tradeoff.
> 
> On Nov 3, 2011, at 6:49:12 PM, Charles Hedrick wrote:
> 
>> Please look at the latest. I still think the cache is a good idea, but I believe it's best to send around a single copy of the message, and avoid having Hibernate fetch the same object more than once in the first place. With the fix to update, the rest of the fixes, including the one I did yesterday, aren't actually needed.
>> 
>> I'm going to put some version of this in production tomorrow. Not sure whether I'll use the level 2 cache or not.
>> 
>> 
>> On Nov 3, 2011, at 5:49 PM, Noah Botimer wrote:
>> 
>>> I think this is the right approach, but it will need some monitoring before it's ready for 2.8.x. It sounds like you're willing to give that attention.
>>> 
>>> There are two things to look out for with the L2 cache:
>>> 
>>>   1. The nodes can get desynchronized because we're using Ehcache in non-cluster mode.
>>>   2. There is, by default, some overflow to disk, and the whole lot has to be serializable or you'll get loads of errors in the logs.
>>> 
>>> In this case, I think we have a very low likelihood of actually getting desynchronized, and nonstrict-read-write appears to be the right choice since these are almost exclusively write-once entities. Also, thanks to John Hall for the tip about #2 -- he just mentioned turning the overflow off for a handful of things. We definitely use the L2 cache in OSP, and it was throwing nonserializable errors for him.
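>>> 
>>> For illustration, the kind of mapping I mean -- a sketch in annotation form only; the actual chat entities may well be mapped via hbm.xml, and the class below is made up:
>>> 
>>>     import java.io.Serializable;
>>> 
>>>     import javax.persistence.Entity;
>>>     import javax.persistence.Id;
>>> 
>>>     import org.hibernate.annotations.Cache;
>>>     import org.hibernate.annotations.CacheConcurrencyStrategy;
>>> 
>>>     // A write-once entity cached with nonstrict-read-write, and made
>>>     // Serializable so Ehcache's disk overflow (if left on) doesn't fill
>>>     // the logs with errors.
>>>     @Entity
>>>     @Cache(usage = CacheConcurrencyStrategy.NONSTRICT_READ_WRITE)
>>>     public class ExampleChatMessage implements Serializable {
>>>         private static final long serialVersionUID = 1L;
>>> 
>>>         @Id
>>>         private String id;
>>> 
>>>         private String body;
>>> 
>>>         // getters/setters omitted in this sketch
>>>     }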
>>> 
>>> Minus a few whitespace issues, the patches look good to me. Nice work.
>>> 
>>> Thanks,
>>> -Noah
>>> 
>>> On Nov 3, 2011, at 5:19 PM, Charles Hedrick wrote:
>>> 
>>>> Yup. Enabling level 2 cache turns N * 2 selects per message (one of which is pretty hairy) into 1 simple select per message.
>>>> 
>>>> See https://jira.sakaiproject.org/browse/SAK-21353 the second patch.
>>>> 
>>>> Can you think of any reason not to do this?
>>>> 
>>>> Note the setting of cacheable in the template; some tools forget to do this, and adding it to the XML mapping file isn't enough. Also note how it's done: some tools set it for each query, which isn't considered kosher.
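>>>> 
>>>> Roughly the difference I mean -- a sketch, assuming Spring's HibernateTemplate; the class and method below are just for illustration:
>>>> 
>>>>     import org.hibernate.SessionFactory;
>>>>     import org.springframework.orm.hibernate3.HibernateTemplate;
>>>> 
>>>>     public class CacheableTemplateSketch {
>>>>         // Set cacheQueries once on the template (equivalently, via the
>>>>         // cacheQueries property on the Spring bean definition), rather
>>>>         // than calling setCacheable(true) on each individual query.
>>>>         public HibernateTemplate buildTemplate(SessionFactory sessionFactory) {
>>>>             HibernateTemplate template = new HibernateTemplate(sessionFactory);
>>>>             template.setCacheQueries(true);
>>>>             return template;
>>>>         }
>>>>     }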
>>>> 
>>>> I've run the first patch in production, but not yet this one. I probably won't be able to deploy it until tomorrow.
>>>> 
>>>> Should this Jira be elevated to blocker? This is a pretty massive performance hole.
>>>> 
