[Building Sakai] production disaster apparently due to chat

Hedrick Charles hedrick at rutgers.edu
Thu Nov 3 16:00:00 PDT 2011


ugh. I just realized a tradeoff. If you have any windows that are observing the chat session but for some reason the courier is not running, you can have a buildup in the delivery queue. If you distribute the message, the amount of memory used by this is larger. My suspicion is that this is not enough to make it a bad tradeoff.

On Nov 3, 2011, at 6:49:12 PM, Charles Hedrick wrote:

> Please look at the latest. I still think the cache is a good idea, but I believe it's best to send around a single copy of the message, and avoid having Hibernate fetch the same object more than once in the first place. With the fix to update, the rest of the fixes, including the one I did yesterday, aren't actually needed.
> 
> I'm going to put some version of this in production tomorrow. Not sure whether I'll use the level 2 cache or not.
> 
> 
> On Nov 3, 2011, at 5:49 PM, Noah Botimer wrote:
> 
>> I think this is the right approach, but it will need some monitoring before it's ready for 2.8.x. It sounds like you're willing to give that attention.
>> 
>> There are two things to look out for with the L2 cache:
>> 
>>   1. The nodes can get desynchronized because we're using Ehcache in non-cluster mode.
>>   2. There is, by default, some overflow to disk, and the whole lot has to be serializable or you'll get loads of errors in the logs.
>> 
>> In this case, I think we have very low possibility of really being desynchronized and nonstrict-read-write appears to be the right choice since these are almost exclusively write-once entities. Also, thanks to John Hall for the tip about #2 -- he just mentioned turning the overflow off for a handful of things. We definitely use the L2 cache in OSP and it was throwing nonserializable errors for him.
>> 
>> Minus a few whitespace issues, the patches look good to me. Nice work.
>> 
>> Thanks,
>> -Noah
>> 
>> On Nov 3, 2011, at 5:19 PM, Charles Hedrick wrote:
>> 
>>> Yup. Enabling level 2 cache turns N * 2 selects per message (one of which is pretty hairy) into 1 simple select per message.
>>> 
>>> See https://jira.sakaiproject.org/browse/SAK-21353 the second patch.
>>> 
>>> Can you think of any reason not to do this?
>>> 
>>> Note setting cacheable in the template. Some tools forget to do this. Adding it to the XML file isn't enough. Also note how it's done. Some tools do it for each query, which isn't considered kosher.
>>> 
>>> I've run the first patch in production, but not yet this one. I probably won't be able to deploy it until tomorrow.
>>> 
>>>  Should this Jira be elevated to blocker? This is a pretty massive performance hole.
>>> 
>>> 
>>> On Nov 3, 2011, at 1:05 PM, Noah Botimer wrote:
>>> 
>>>> Charles,
>>>> 
>>>> I think your analysis is sound -- avoiding the call to the service entirely during .equals() is best -- so the propagation settings may not be implicated in this case. The L2 cache should also be picking these up since they are ID lookups -- if not, it's probably fixable with configuration. But I did want to circulate this to make folks aware:
>>>> 
>>>> 
>>>> I recently found a serious oversimplification we use all over. The Spring / Hibernate proxies for service/manager classes are generally marked with PROPAGATION_REQUIRED for all methods.
>>>> 
>>>> What this means in practical terms is that each method call at this service boundary (e.g., looped tool calls to the service) spawns both Spring / Hibernate and JDBC transactions. This can cause lots of time wasted making null transactions and also flushing the Hibernate session (especially with lazy collections and .list() / N+1s).
>>>> 
>>>> If a method is truly read-only, it should be marked as PROPAGATION_SUPPORTS, which will use a transaction if established, but not require one.
>>>> 
>>>> One other workaround I found (when you've got a busy tool method that goes to a proxied service often) is to add the @Transactional annotation to the tool method to start up an enclosing transaction. This is helpful if it isn't practical to mark specific service methods as non-transactional. It requires one addition to the bean XML file: <tx:annotation-driven transaction-manager="org.sakaiproject.springframework.orm.hibernate.GlobalTransactionManager" />.
>>>> 
>>>> Thanks,
>>>> -Noah
>>>> 
>>>> On Nov 2, 2011, at 9:53 PM, Hedrick Charles wrote:
>>>> 
>>>>> I see two problems. I've just fixed one, and will be rushing it into production. I haven't fixed the other.
>>>>> 
>>>>> When delivering a chat message, a list of ChatDelivery is created per user session (more or less). ChatDelivery has either a message ID or the actual message. When adding to the list, it checks whether the message is already there. That check uses equals. One thinks of equals as quick. Unfortunatley equals for a ChatDelivery checks whether the underlying messages are equal, as follows:
>>>>> 
>>>>>   this.getMessage.getId() .equals o.getMessage.getId()
>>>>> 
>>>>> The problem is that normally what you have is the message ID, a simple string. This code forces the message itself to be fetched, just to get the ID, which is of course the string that you already had. The solution is to add a new function getMessageId() which if you have the ID just returns it, while if you have the whole message it returns getId() of it.
>>>>> 
>>>>> That should reduce the database I/O occurring in ClusterEventTracking, where of course it's all happening in one thread, serialized. However when delivering to the user it is still going to have to fetch the message itself. But it's going to do a separate fetch per Delivery, i.e. per user in the site. This is unnecessary. You only need to fetch each message once. The ChatDelivery object shouldn't get its own copy of the message (or you should enable the cache in hibernate for this tool). The advantage of my fix is that those fetches are not serialized inside ClusterEventTracking, but we've still got way too much database activity. I need to get a fix up now, so I'm going to do the fix just outlined and worry later about fetching each message just once for delivery.
>>>>> 
>>>>> 
>>>>> On Nov 2, 2011, at 7:50:54 PM, Hedrick Charles wrote:
>>>>> 
>>>>>> This evening we had a disaster that appears to be due to chat.
>>>>>> 
>>>>>> I found 2 of our 3 front end systems in more or less continuous GC. I also found ClusterEventTracking at 100%. I assume this is not coincidence. The backtrace looked like this. I'm continuing to investigate.
>>>>>> 
>>>>>> com.mysql.jdbc.ResultSetImpl.findColumn(ResultSetImpl.java:1106)
>>>>>> com.mysql.jdbc.ResultSetImpl.getString(ResultSetImpl.java:5617)
>>>>>> org.apache.commons.dbcp.DelegatingResultSet.getString(DelegatingResultSet.java:225)
>>>>>> org.hibernate.type.StringType.get(StringType.java:18)
>>>>>> org.hibernate.type.NullableType.nullSafeGet(NullableType.java:163)
>>>>>> org.hibernate.type.NullableType.nullSafeGet(NullableType.java:154)
>>>>>> org.hibernate.loader.Loader.getKeyFromResultSet(Loader.java:1097)
>>>>>> org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:565)
>>>>>> org.hibernate.loader.Loader.doQuery(Loader.java:701)
>>>>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
>>>>>> org.hibernate.loader.Loader.loadEntity(Loader.java:1864)
>>>>>> org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
>>>>>> org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
>>>>>> org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3039)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:399)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
>>>>>> org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:879)
>>>>>> org.hibernate.impl.SessionImpl.internalLoad(SessionImpl.java:847)
>>>>>> org.hibernate.type.EntityType.resolveIdentifier(EntityType.java:567)
>>>>>> org.hibernate.type.EntityType.resolve(EntityType.java:389)
>>>>>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:120)
>>>>>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
>>>>>> org.hibernate.loader.Loader.doQuery(Loader.java:729)
>>>>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
>>>>>> org.hibernate.loader.Loader.loadEntity(Loader.java:1864)
>>>>>> org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
>>>>>> org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
>>>>>> org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3039)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:399)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
>>>>>> org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
>>>>>> org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:879)
>>>>>> org.hibernate.impl.SessionImpl.get(SessionImpl.java:816)
>>>>>> org.hibernate.impl.SessionImpl.get(SessionImpl.java:809)
>>>>>> org.springframework.orm.hibernate3.HibernateTemplate$1.doInHibernate(HibernateTemplate.java:531)
>>>>>> org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:419)
>>>>>> org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(HibernateTemplate.java:374)
>>>>>> org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:525)
>>>>>> org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:519)
>>>>>> org.sakaiproject.chat2.model.impl.ChatManagerImpl.getMessage(ChatManagerImpl.java:563)
>>>>>> sun.reflect.GeneratedMethodAccessor5759.invoke(Unknown Source)
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>> java.lang.reflect.Method.invoke(Method.java:597)
>>>>>> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
>>>>>> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
>>>>>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
>>>>>> org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
>>>>>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
>>>>>> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
>>>>>> $Proxy48.getMessage(Unknown Source)
>>>>>> org.sakaiproject.chat2.tool.ChatDelivery.getMessage(ChatDelivery.java:80)
>>>>>> org.sakaiproject.chat2.tool.ChatDelivery.equals(ChatDelivery.java:191)
>>>>>> java.util.ArrayList.indexOf(ArrayList.java:216)
>>>>>> java.util.ArrayList.contains(ArrayList.java:199)
>>>>>> org.sakaiproject.courier.impl.BasicCourierService.deliver(BasicCourierService.java:137)
>>>>>> org.sakaiproject.chat2.tool.ChatTool.receivedMessage(ChatTool.java:393)
>>>>>> org.sakaiproject.chat2.model.impl.ChatManagerImpl.update(ChatManagerImpl.java:868)
>>>>>> java.util.Observable.notifyObservers(Observable.java:142)
>>>>>> org.sakaiproject.event.impl.BaseEventTrackingService.notifyObservers(BaseEventTrackingService.java:120)
>>>>>> org.sakaiproject.event.impl.ClusterEventTracking.run(ClusterEventTracking.java:682)
>>>>>> java.lang.Thread.run(Thread.java:662)
>>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>> 
>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>>>> 
>>> 
>> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://collab.sakaiproject.org/pipermail/sakai-dev/attachments/20111103/a23fdf27/attachment.html 


More information about the sakai-dev mailing list