[Building Sakai] my conclusions from yesterday

Charles Hedrick hedrick at rutgers.edu
Fri Oct 4 09:02:01 PDT 2013


Also, two web pages you might find interesting:

real time performance: https://sakai.rutgers.edu/sakai-summary.jsp
historical performance: https://sakai.rutgers.edu/stats/

On Oct 4, 2013, at 11:42 AM, Charles Hedrick <hedrick at rutgers.edu> wrote:

> This was a weird couple of days. I still don’t know what happened. In general we’ve been quite happy with 2.9.1. In particular, database load is a lot lower than in our 2.9.0 beta, and I believe also lower than in 2.8. We’ve had 1500-person tests in Samigo with no problem at due dates. When this happened, there was nothing odd going on that we could see. What’s worse, restarting the systems didn’t fix it: we got right back into it.
> 
> I have seen in the past that adding an application server can clear up odd performance problems, even when there’s no clear reason to think it would be needed. So we’ve done that. But with a strange, unreproducible problem it’s hard to know when it’s fixed.
> 
> We’re using virtual machines under Xen. 
> 
> Front ends: 8 virtual CPUs (4 cores; Xen sees each core as two hyperthreads), 20 GB memory, -Xmx15000m -Xms15000m -Xmn2500m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxPermSize=700m -XX:PermSize=700m. CentOS 6.4.
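> 
> If you want to double-check that flags like these actually took effect, here’s a minimal sketch (not something we deploy; the class name and output format are just illustrative) that asks a running HotSpot JVM for its startup arguments, effective max heap, and active collectors:
> 
>     import java.lang.management.GarbageCollectorMXBean;
>     import java.lang.management.ManagementFactory;
>     
>     public class JvmSettingsCheck {
>         public static void main(String[] args) {
>             // Arguments the JVM was actually started with (-Xmx, -XX:..., etc.)
>             for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
>                 System.out.println("JVM arg: " + arg);
>             }
>             // Max heap as the JVM sees it, reported in GB
>             long max = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
>             System.out.printf("Max heap: %.1f GB%n", max / (1024.0 * 1024 * 1024));
>             // Active collectors; with the flags above expect ParNew and ConcurrentMarkSweep
>             for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
>                 System.out.println("GC: " + gc.getName());
>             }
>         }
>     }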
> 
> Database: 20 virtual CPUs (we’ll probably move to 16; we just don’t need that many), 32 GB memory, with mysqld (actually MariaDB 5.5) taking about 27 GB. CentOS 6.4.
> 
> Storage for the database is a Dell iSCSI array. The CentOS guest handles the iSCSI itself; that is, Xen is just passing network packets. I’m not impressed with Xen’s iSCSI handling.
> 
> We have 6 front ends. Normally 3 are in use at a time, but we’ve just moved to 4 as a result of this problem. That will complicate deploying new code.
> 
> Typically the DB runs around 10% of CPU, with gusts to 15%; front ends run 20-30%, going up to maybe 40%. We are not overcommitted on virtual CPUs, i.e. we have as many physical hyperthreads as virtual CPUs. That’s likely to change this summer.
> 
> We have a redundant pair of F5 load balancers. They’re configured so we can take an app server out of production without disturbing existing jobs. That’s how we do new deploys. Users should never see Sakai down, except when we have a disaster like this one.
> 
> 
> On Oct 4, 2013, at 11:14 AM, Curtis Van-Osch <curtis.van-osch at hec.ca> wrote:
> 
>> Hi Charles, 
>> I'm just wondering if you would be willing to give me a little more information on your production server infrastructure for Sakai. We're looking at improving our own due to some downtime at the beginning of this semester, after upgrading to Sakai 2.9.1. I'm trying to get an idea of what hardware others are running and how many peak concurrent sessions they can handle.
>> 
>> Thanks in advance for any information you can provide us.
>> 
>> Best Regards,
>> 
>> Curtis Van Osch
>> Analyst-Programmer
>> Information Technology Department
>> 3000, chemin de la Côte‑Sainte‑Catherine, Montréal (Québec) H3T 2A7
>> Telephone: 514 340-6000, ext. 2029
>> 
>> 
>> -------- Original Message --------
>> Subject: [Building Sakai] my conclusions from yesterday
>> From: Hedrick Charles <hedrick at rutgers.edu>
>> To: sakai dev <sakai-dev at collab.sakaiproject.org>
>> Date: 2013-09-30 18:28
>>> I believe what happened last night was simply running out of memory. In sakai-summary.jsp, there's a line labelled "mem". It shows the amount of memory in use, in GB, as of the last GC. If it's 12, you've got a problem with that VM.
>>> 
>>> Last night all three systems that were up were at 13. They were unusable, so the load balancer took them out of use. To be honest, there are lots of remaining questions about exactly how the LB works. Does it sometimes say "Sakai is down" when it's just one bad server? Maybe. But last night all 3 were unusable at once.
>>> 
>>> Recommendation: watch memory usage. When it gets to 11 or 12, take that system out of service and restart it.
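>>> 
>>> Here's a minimal sketch of such a watcher (the class name, one-minute poll, and 11 GB threshold are just illustrative assumptions; note also that instantaneous used heap includes garbage not yet collected, so it will read higher than the after-GC "mem" figure above):
>>> 
>>>     import java.lang.management.ManagementFactory;
>>>     import java.lang.management.MemoryMXBean;
>>>     
>>>     public class HeapWatch {
>>>         // Alarm threshold in GB, per the recommendation above
>>>         private static final double ALARM_GB = 11.0;
>>>     
>>>         public static void main(String[] args) throws InterruptedException {
>>>             MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
>>>             while (true) {
>>>                 double usedGb =
>>>                     mem.getHeapMemoryUsage().getUsed() / (1024.0 * 1024 * 1024);
>>>                 if (usedGb >= ALARM_GB) {
>>>                     // Signal that this node should come out of the pool for a restart
>>>                     System.err.printf("WARNING: heap at %.1f GB%n", usedGb);
>>>                 }
>>>                 Thread.sleep(60000); // poll once a minute
>>>             }
>>>         }
>>>     }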
>>> 
>>> If we get into the situation we had last night, about all you can do is restart. I'd also try adding servers; depending on the underlying issue, that may well help. Short of that, try watching the systems closely enough that you can reboot them one at a time before the whole system fails.
>>> 
>>> We need to look for known memory issues. KNL-1037 looks particularly interesting. It might be worth installing.
>>> 
>> 
> 


