[Deploying Sakai] Deployment sizing question [Rutgers data and plans]

Tue May 19 09:00:07 PDT 2009

Data from Rutgers, 2 of our 3 campuses.

Note that Continuing Education uses eCollege, which is also available
for on-campus use. The third campus uses Blackboard. We moved from
WebCT 4 to Sakai; WebCT has been decommissioned for a year.

40,000 students. About 36,000 people use Sakai in a given month,
including up to 600 high school students in a district that has a
contract with us to run Sakai for them. (They are on the same servers.)

About 45% of our undergrad sections use Sakai. We do not precreate  
sites, so this should be a real number. At least half of our sites are  
non-instructional. It is used for all activities, including  
politically sensitive activities involving the upper level  
administration. We also use OSP for portfolios, though that's not in  
heavy use yet. We're slowly phasing in the evaluation system for our  
course evaluations.

I've seen 2500 users on at once, but only one time. That is a
conservative number; Sakai would show a lot more sessions. It is the
number of distinct IP addresses in the output of netstat -n -a, which
should represent the number of people making requests in the last 2
minutes.
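
The distinct-IP count described above can be sketched roughly as follows.
This is only an illustration, not the script used at Rutgers: it assumes
Linux-style netstat columns (Proto, Recv-Q, Send-Q, Local Address,
Foreign Address, State), and the function names and sample addresses are
made up for the example.

```python
def remote_host(addr):
    """Strip the port from 'host:port' (Linux) or 'host.port' (BSD-style)."""
    cut = max(addr.rfind(":"), addr.rfind("."))
    return addr[:cut] if cut > 0 else addr


def distinct_remote_ips(netstat_output):
    """Return the set of distinct remote addresses on non-listening TCP sockets."""
    ignored = {"*", "0.0.0.0", "::", "127.0.0.1", "::1"}
    ips = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP and Unix-socket lines (assumed column layout).
        if len(fields) < 6 or not fields[0].lower().startswith("tcp"):
            continue
        if fields[5] == "LISTEN":
            continue
        host = remote_host(fields[4])
        if host not in ignored:
            ips.add(host)
    return ips


# Hypothetical sample; in practice the text would come from `netstat -n -a`.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             192.0.2.10:51234        ESTABLISHED
tcp        0      0 10.0.0.5:80             192.0.2.10:51240        TIME_WAIT
tcp        0      0 10.0.0.5:80             198.51.100.7:40022      ESTABLISHED
"""

print(len(distinct_remote_ips(SAMPLE)))  # prints 2
```

As written, this also counts outbound connections (to the database, for
example), so a real count would probably want to filter on the local web
port as well.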

Infrastructure:

A pair of Barracuda load balancers, with auto failover.

DB:

A pair of Sun X4150s, each with 2 x Intel 5355, 2.66 GHz, total of 8
cores, and 16 GB of memory.

We run MySQL. The second machine is a slave, maintained in sync with
the primary. The slave is not normally used. It's there in case the  
primary fails. The slave also hosts the database for our test  
infrastructure.

Front ends:

5 Sun X4100, 2 x Opteron 275, 2.2 GHz, total of 4 cores, 16 GB of  
memory. Only 4 of these are in production. The 5th, with the same code  
and pointing at the production database, is used when we need to put  
up a fix or new feature for one faculty member, or we want to verify  
that something works in the production configuration, but don't want  
to redeploy the public systems.

We run a single 64-bit JVM on each, using about 13 GB of memory. The  
JVM typically stays up for at least a month. We've just begun  
restarting them after a month of uptime, though it's not clear whether  
this is needed.

We've seen Sakai become unresponsive for two reasons:

* a problem with a specific application. Very rare. I think it's  
happened once this semester.
* very long GCs, over 3 minutes. One or two a week. This is long
enough that it sometimes triggers the load balancer to mark the system
down, requiring users to log in again.
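
One way to spot pauses like these after the fact is to scan the GC log.
The sketch below assumes the JVM is run with -verbose:gc and
-XX:+PrintGCDetails, so that each collection ends with a
"[Times: user=... sys=..., real=... secs]" entry; whether GC logging is
enabled on these front ends is not stated above. The 180-second
threshold just mirrors the "over 3 minutes" figure, and the sample log
lines are invented.

```python
import re

# Wall-clock time as printed by HotSpot's PrintGCDetails "Times" entry.
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(log_lines, threshold_secs=180.0):
    """Yield (line_number, seconds) for GC events longer than the threshold."""
    for n, line in enumerate(log_lines, start=1):
        for match in REAL_TIME.finditer(line):
            secs = float(match.group(1))
            if secs > threshold_secs:
                yield n, secs


# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = [
    "[GC [PSYoungGen: 1024K->128K(2048K)] 4096K->3200K(8192K), 0.0123 secs]"
    " [Times: user=0.02 sys=0.00, real=0.01 secs]",
    "[Full GC [PSYoungGen: 128K->0K(2048K)] 8000K->6000K(8192K), 191.2300 secs]"
    " [Times: user=190.10 sys=0.52, real=191.23 secs]",
]

for line_no, secs in long_pauses(SAMPLE_LOG):
    print(f"line {line_no}: {secs:.2f} s pause")
```

Comparing the timestamps of flagged entries against the times the load
balancer marked a node down would show whether the two actually line up.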

I'm still worried about the long GCs. Since everyone says a number of
GC bugs have been fixed, we're also moving to Java 1.6.0 update 13 when
we move to 2.6 tomorrow. (We're still building under Java 5, but this
may be paranoia.) We've done the last 2 weeks of testing for 2.6 under
this Java.

If they release update 14 fairly soon, I am hoping to go to the new  
garbage collector, G1, for fall. We'll deploy it slowly, first on test  
systems, then on one of the front ends. The current GCs are simply not
designed for very large JVMs. I don't think we'll ever see trouble-free
operation with them.
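
A staged rollout like that could be expressed as a small helper that
decides which hosts get the G1 flags. This is a sketch only: in Java 6
update 14 G1 is experimental and is enabled with
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC, but the host names, the
base option list, and the helper itself are hypothetical.

```python
# Flags that turn on the experimental G1 collector in Java 6 update 14.
G1_FLAGS = ["-XX:+UnlockExperimentalVMOptions", "-XX:+UseG1GC"]


def jvm_options(host, g1_hosts, base_options=()):
    """Return the JVM options for a host, adding G1 only where it is staged."""
    options = list(base_options)
    if host in g1_hosts:
        options += G1_FLAGS
    return options


# Hypothetical rollout: a test system first, production front ends later.
g1_hosts = {"test1"}
for host in ["test1", "fe1", "fe2"]:
    print(host, " ".join(jvm_options(host, g1_hosts, ["-verbose:gc"])))
```

Moving a front end onto G1 is then a one-line change to g1_hosts, which
keeps the rollout easy to reverse.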