[Deploying Sakai] java 6 update
Charles Hedrick
hedrick at rutgers.edu
Sat Nov 7 05:35:17 PST 2009
I'm now reasonably convinced that Java 6 is OK to use, and is in fact
nearly identical to Java 5, except for needing
-Dsun.lang.ClassLoader.allowArraySyntax=true and having somewhat
different defaults.
It's nearly certain that our problem with long GCs was due to heavy
paging. During a GC, memory is accessed randomly. You really need the
whole system to be in memory, or it's intolerably slow. For some
reason, Solaris sometimes uses more memory than I would expect.
We just restarted our system to switch database servers. Before taking
it down I did "pmap" on Java. That tells the actual physical memory
usage. (It turns out that this agrees almost exactly with the usage
printed by "ps aux", as you'd hope.) Free memory comes from "vmstat."
On one system (which happens to be running Java 6), the amount of
memory used by Java plus free memory is about 15 GB. That's reasonable
for a 16 GB machine. But on the other, they are about 13 GB. So we
have about 2 GB less free memory than I had expected. Where is it? Who
knows? But the upshot is that I have to configure the JVM to be 2 GB
smaller than I had expected. This difference does not seem to depend
upon the version of Java. At the moment the one with the missing 2 GB
is Java 5.
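The accounting above is simple enough to script. Here's a sketch; the arithmetic helper is hypothetical, and the commented-out commands assume Linux-style ps/vmstat output (the original measurements were on Solaris with pmap, where the fields differ):

```shell
#!/bin/sh
# Sum Java's resident set (kB) and free memory (kB), report in GB.
# On a healthy 16 GB box this should come out near 15 GB; a result
# around 13 GB is the "missing 2 GB" described above.
mem_accounting() {
    java_rss_kb=$1
    free_kb=$2
    # 1 GB = 1048576 kB; POSIX shell arithmetic is integer-only,
    # so scale by 10 to keep one decimal place.
    total10=$(( (java_rss_kb + free_kb) * 10 / 1048576 ))
    echo "java+free = $(( total10 / 10 )).$(( total10 % 10 )) GB"
}

# On a live machine the inputs would come from something like:
#   java_rss_kb=$(ps -o rss= -p "$JAVA_PID")
#   free_kb=$(vmstat 1 2 | awk 'END {print $5}')
mem_accounting 13631488 2097152   # 13 GB resident + 2 GB free
```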
Here are the options I'm currently using with Java 6:
JAVA_OPTS=" -d64 -Dsun.lang.ClassLoader.allowArraySyntax=true -Xmx10500m -Xms10500m -Xmn2560m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:MaxPermSize=512m -XX:PermSize=512m -XX:+DisableExplicitGC -XX:+DoEscapeAnalysis "
Also, I strongly recommend adding -Dfile.encoding=UTF-8, but I do that
in the second JAVA_OPTS declaration.
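For concreteness, by "second JAVA_OPTS declaration" I mean an append; a sketch of what that might look like (e.g. in Tomcat's setenv.sh, the exact file is an assumption):

```shell
# Hypothetical second declaration: append the encoding flag to the
# options listed above rather than restating the whole line.
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"
```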
I'd really rather allocate more memory. We're tight on both new and
old. But until we go through a heavy period that's all I feel safe
with at the moment. Some comments on configuration:
* I don't think you need to configure the size of survivor spaces. The
new default looks fine.
* You can probably allow new (-Xmn) to default, but I prefer a
slightly larger new than that, because we sometimes have 1.1GB
objects, and I'd like them to fit in new.
* I think it's necessary to configure
-XX:CMSInitiatingOccupancyFraction=75, but YMMV. In Java 5, the default
is 68%, which is fine. However, in Java 6 the default is dynamic, but
seems to work out to a much higher threshold. At one point I thought
that led to more concurrent mode failures.
* I'm using DisableExplicitGC because there are a few calls to
System.gc(), and I don't really want that to happen. Another
possibility would be -XX:+ExplicitGCInvokesConcurrent.
* I'm using escape analysis in the hopes that it will reduce
fragmentation. At any rate, it seems like a good idea, and I haven't
seen any recent bugs involving it, so it looks like it's safe.
* I start perm out at 512 MB, because expanding it seems to cause a full
GC. I'd like to avoid full GCs as much as possible.
I'm still seeing a few full GCs, like one a week per front end. They
take 24-28 sec. They occur on both Java 5 and 6. The ones I've
tracked down have been due to:
* sometimes the system continued to use permanent space. Eventually a
GC is forced to reclaim space in perm. By default this is a full GC.
It might be worth experimenting with -XX:+CMSClassUnloadingEnabled,
which in Java 5 should be combined with
-XX:+CMSPermGenSweepingEnabled. But we've done enough experiments for a
while.
* in one case we got a concurrent mode failure. From looking at jstat,
I'd say someone was building up a 1.1 GB data structure incrementally.
New couldn't quite handle it, so the big object was put in old. Old
obviously didn't have enough contiguous space. This seems to be rare.
I'm assuming this is the result of uploading a big file, but I thought
that had been fixed so it didn't put the whole file in memory.
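If you do try the class-unloading experiment mentioned above, the flags would just be appended to the same JAVA_OPTS line. A sketch only, not something we've run in production:

```shell
# Java 6: class unloading under CMS.
JAVA_OPTS="$JAVA_OPTS -XX:+CMSClassUnloadingEnabled"
# Java 5: pair it with perm gen sweeping as well, e.g.:
#   JAVA_OPTS="$JAVA_OPTS -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled"
```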
As far as I know, these would be issues on both versions of Java. I
consider 24-28 sec a bit slow, but from a user point of view, once a
pause gets to 10 sec, I'm not sure how much difference going to 20
makes. What killed us was the pauses of several minutes. That caused
the load balancer to consider the machine down, and of course users
would probably give up too. But I now think that was paging. We have
fairly old servers. The next generation should do full GCs a lot
faster. The ultimate solution is probably G1, the new GC. However it
doesn't seem to be ready for production yet.
MONITORING
I use several tools:
* I have a program that tails catalina.out, and pulls out any GC that
takes more than 10 sec. That shows up in my status summary.
* I run "jstat -gc PID 5000" continuously (actually for a minute at a
time, so I can put time stamps in the log)
* I run "vmstat 5" continuously (for a minute at a time, with time
stamps). Free memory from vmstat also goes into my status summary.
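The catalina.out filter is the only nontrivial piece, so here's a sketch of it plus the timestamped loops. The GC line format shown is an assumption (verbose GC output varies by collector and flags), so the pattern would need adjusting to match your logs:

```shell
#!/bin/sh
# Print any GC log line whose pause exceeds 10 seconds, assuming lines
# shaped like "[Full GC 9000K->800K(10240K), 24.1234 secs]".
slow_gcs() {
    awk 'match($0, /[0-9]+\.[0-9]+ secs/) {
        # substr grabs e.g. "24.1234 secs"; adding 0 coerces it to a number.
        secs = substr($0, RSTART, RLENGTH) + 0
        if (secs > 10) print
    }'
}

# Timestamped jstat/vmstat loops, a minute at a time (illustrative only;
# 12 samples at 5-second intervals = one minute per date stamp):
#   while :; do date; jstat -gc "$JAVA_PID" 5000 12; done >> gc.log
#   while :; do date; vmstat 5 12; done >> vm.log

# Example: only the 24-second full GC gets through the filter.
printf '%s\n' \
  '[GC 1024K->512K(2048K), 0.0123 secs]' \
  '[Full GC 9000K->800K(10240K), 24.1234 secs]' | slow_gcs
```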
More information about the production mailing list