[Building Sakai] production disaster apparently due to chat

Hedrick Charles hedrick at rutgers.edu
Thu Nov 3 17:31:28 PDT 2011


If you're right, then my latest code may be about right, except that it might be worth evicting the messages. I would never have thought of that. I'm going to deploy the patches from SAK-21353 tomorrow. The original code is pretty clearly a disaster waiting to happen.

However I agree that a more serious look at the design might produce better results. The delivery object seems to need the address, the text of the message, the author and the date. The ChatMessage object doesn't seem too bad, except maybe for the ChatChannel reference, but hopefully there won't be many different ChatChannel objects. ChatDelivery has a bit more overhead, but it may not be too much.

I would appreciate it if you'd think about this. Sakai chat is primarily useful for things like online office hours; AIM, Skype, etc., are more interesting for individual communications. At big places like Rutgers, chat is always going to end up being used for very large classes. It's been a continuing source of performance problems. Heaven help us if the whole senior class decides to hold a group chat, which is not impossible… We're starting to see sites created automatically based on LDAP queries. These tend to be things like all majors, all students in a college, etc., i.e. very large memberships.
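As a self-contained illustration of the thin-delivery idea under discussion: share one immutable message payload across all recipients, and keep only the address per delivery. The class and field names below are hypothetical stand-ins, not the actual Sakai ChatDelivery/ChatMessage classes.

```java
import java.util.ArrayList;
import java.util.List;

public class FlyweightDeliveryDemo {
    // One shared, immutable payload per chat message (stand-in for ChatMessage).
    public static final class SharedMessage {
        public final String messageId;
        public final String author;
        public final String body;
        public SharedMessage(String messageId, String author, String body) {
            this.messageId = messageId; this.author = author; this.body = body;
        }
    }

    // Thin per-recipient delivery: only the address varies; the payload is shared.
    public static final class ThinDelivery {
        public final String address;        // session ID + chat room ID
        public final SharedMessage message; // shared flyweight, not a per-user copy
        public ThinDelivery(String address, SharedMessage message) {
            this.address = address; this.message = message;
        }
    }

    // Fan a single message out to many addresses without copying the payload.
    public static List<ThinDelivery> fanOut(SharedMessage msg, List<String> addresses) {
        List<ThinDelivery> deliveries = new ArrayList<>();
        for (String addr : addresses) {
            deliveries.add(new ThinDelivery(addr, msg));
        }
        return deliveries;
    }

    public static void main(String[] args) {
        SharedMessage msg = new SharedMessage("42", "hedrick", "office hours start now");
        List<ThinDelivery> out = fanOut(msg, List.of("sess1:room1", "sess2:room1"));
        // N recipients cost N small address holders plus one payload,
        // not N payload copies.
        System.out.println(out.get(0).message == out.get(1).message); // true
    }
}
```

With a queue that backs up (courier not running), the heap cost is then bounded by one payload per message plus one small object per observing window.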


On Nov 3, 2011, at 8:16:40 PM, Noah Botimer wrote:

> I'm about to go beyond the first pass of optimization (reducing queries and time). Be warned that this may be wandering in the woods...
> 
> 
> Is the determinism of making our own cache for message data keyed on ID worth it? Better than just using the Hibernate L2? With people bouncing around, do we know how likely it is that there are stranded messages that never get picked up? Can you see the heap in production to see how many ChatDelivery/ChatMessage objects are floating?
> 
> Thinking about the GC problem...
> 
> The ChatDelivery objects always embed the full message eventually, and it's an attached Hibernate entity, which could be causing overhead (and the L2 won't save us here, since it clones). If these hang around, and especially with higher concurrency in a room, it could be a lot of heap.
> 
> Is there a reasonable way to do a flyweight for the ChatDelivery objects? They embed the address (session ID + chat room ID), so there has to be an object per user/window. Otherwise, they are the same for each user and duplicated. I'm not sure how long they live on the heap but presumably they get collected relatively easily post-delivery. If they are thinner, the heap churn will be reduced.
> 
> It might be possible to make a thinner Delivery type that just has the address and a reference to a shared instance (possibly with an evicted/transient message to avoid Hibernate session management). The shared stuff should get collected after all deliveries have been made but we should probably keep the raw message data cached for a while.
> 
> Thinking about it, there may be a more direct route to a truly shared ChatDelivery instance. The only thing that varies is the address, and there is a getAddress(). This could be overridden to always be request-bound and defer to SessionManager.getCurrentSession(). This may be safe if it is always called in a request thread and doesn't get some unwanted disposal treatment -- both of these appear to be true. BaseDelivery.act() is a no-op and picking up a delivery just removes the instance from an address list. It would be a matter of holding the delivery instances in the manager, keyed by message ID, to be able to use the shared object in receivedMessage().
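A minimal sketch of this request-bound getAddress() idea, with a stub in place of Sakai's SessionManager (the ThreadLocal below is a stand-in, not the actual API): a single shared delivery instance whose address is computed from the current request thread's session, so one object can serve every recipient.

```java
public class SharedDeliverySketch {
    // Stand-in for SessionManager.getCurrentSession() (assumption: a per-thread
    // session ID, as in a request thread).
    public static final ThreadLocal<String> currentSessionId = new ThreadLocal<>();

    public static class BaseDelivery {
        public String getAddress() { return null; }
        public void act() { /* no-op, as in Sakai's BaseDelivery */ }
    }

    // One shared instance per message; the address is never stored, only derived.
    public static final class SharedChatDelivery extends BaseDelivery {
        public final String chatChannelId;
        public SharedChatDelivery(String chatChannelId) { this.chatChannelId = chatChannelId; }
        @Override
        public String getAddress() {
            // Request-bound: defer to the current thread's session at call time.
            return currentSessionId.get() + ":" + chatChannelId;
        }
    }

    public static void main(String[] args) {
        SharedChatDelivery shared = new SharedChatDelivery("room1");
        currentSessionId.set("sessA");
        System.out.println(shared.getAddress()); // sessA:room1
        currentSessionId.set("sessB");
        System.out.println(shared.getAddress()); // sessB:room1
    }
}
```

The safety condition named above carries over directly: this only works if getAddress() is always called on the request thread whose session it should report.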
> 
> I'll stop here. I am willing to poke at this some, though. Any of this sound like a good idea?
> 
> Thanks,
> -Noah
> 
> On Nov 3, 2011, at 7:00 PM, Hedrick Charles wrote:
> 
>> ugh. I just realized a tradeoff. If you have any windows that are observing the chat session but for some reason the courier is not running, you can have a buildup in the delivery queue. If each delivery carries its own copy of the message, the memory consumed by that buildup is larger. My suspicion is that this is not enough to make it a bad tradeoff.
>> 
>> On Nov 3, 2011, at 6:49:12 PM, Charles Hedrick wrote:
>> 
>>> Please look at the latest. I still think the cache is a good idea, but I believe it's best to send around a single copy of the message, and avoid having Hibernate fetch the same object more than once in the first place. With the fix to update, the rest of the fixes, including the one I did yesterday, aren't actually needed.
>>> 
>>> I'm going to put some version of this in production tomorrow. Not sure whether I'll use the level 2 cache or not.
>>> 
>>> 
>>> On Nov 3, 2011, at 5:49 PM, Noah Botimer wrote:
>>> 
>>>> I think this is the right approach, but it will need some monitoring before it's ready for 2.8.x. It sounds like you're willing to give that attention.
>>>> 
>>>> There are two things to look out for with the L2 cache:
>>>> 
>>>>   1. The nodes can get desynchronized because we're using Ehcache in non-cluster mode.
>>>>   2. There is, by default, some overflow to disk, and the whole lot has to be serializable or you'll get loads of errors in the logs.
>>>> 
>>>> In this case, I think we have a very low chance of actually becoming desynchronized, and nonstrict-read-write appears to be the right choice since these are almost exclusively write-once entities. Also, thanks to John Hall for the tip about #2 -- he just mentioned turning the overflow off for a handful of things. We definitely use the L2 cache in OSP and it was throwing nonserializable errors for him.
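Both cautions can be handled in configuration. A sketch under stated assumptions: the cache region name below matches the entity's fully qualified class name as is conventional, but the exact names, sizes, and timeouts are illustrative, not Sakai's actual settings.

```xml
<!-- ehcache.xml: keep this region in memory only, so cached entities
     never need to be serializable for disk overflow (caution #2) -->
<cache name="org.sakaiproject.chat2.model.ChatMessage"
       maxElementsInMemory="5000"
       eternal="false"
       timeToIdleSeconds="600"
       timeToLiveSeconds="3600"
       overflowToDisk="false"/>

<!-- ChatMessage.hbm.xml: write-once entities suit nonstrict-read-write,
     which tolerates the non-clustered Ehcache setup (caution #1) -->
<class name="org.sakaiproject.chat2.model.ChatMessage" table="CHAT2_MESSAGE">
  <cache usage="nonstrict-read-write"/>
  ...
</class>
```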
>>>> 
>>>> Minus a few whitespace issues, the patches look good to me. Nice work.
>>>> 
>>>> Thanks,
>>>> -Noah
>>>> 
>>>> On Nov 3, 2011, at 5:19 PM, Charles Hedrick wrote:
>>>> 
>>>>> Yup. Enabling level 2 cache turns N * 2 selects per message (one of which is pretty hairy) into 1 simple select per message.
>>>>> 
>>>>> See https://jira.sakaiproject.org/browse/SAK-21353 the second patch.
>>>>> 
>>>>> Can you think of any reason not to do this?
>>>>> 
>>>>> Note setting cacheable in the template. Some tools forget to do this. Adding it to the XML file isn't enough. Also note how it's done. Some tools do it for each query, which isn't considered kosher.
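For reference, the template-level approach described above looks roughly like this (a fragment, assuming Spring's HibernateTemplate; the DAO context and query name are hypothetical). Setting it once on the template makes every query it creates cacheable, instead of repeating query.setCacheable(true) per finder.

```java
// In the DAO's initialization: mark all template-created queries cacheable.
getHibernateTemplate().setCacheable(true);

// Finders then pick up query caching without per-query flags
// ("findChannelMessages" is an illustrative named query, not Sakai's).
List messages = getHibernateTemplate()
        .findByNamedQuery("findChannelMessages", channelId);
```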
>>>>> 
>>>>> I've run the first patch in production, but not yet this one. I probably won't be able to deploy it until tomorrow.
>>>>> 
>>>>> Should this Jira be elevated to blocker? This is a pretty massive performance hole.
>>>>> 
> 
