[Deploying Sakai] java 6 update
Charles Hedrick
hedrick at rutgers.edu
Sat Nov 7 05:35:17 PST 2009
I'm now reasonably convinced that Java 6 is OK to use, and is in fact
nearly identical to Java 5, except for needing
-Dsun.lang.ClassLoader.allowArraySyntax=true and having somewhat
different defaults.
It's nearly certain that our problem with long GCs was due to heavy
paging. During a GC, memory is accessed randomly. You really need the
whole system to be in memory, or it's intolerably slow. For some
reason, Solaris sometimes uses more memory than I would expect.
We just restarted our system to switch database servers. Before taking
it down I did "pmap" on Java. That tells the actual physical memory
usage. (It turns out that this agrees almost exactly with the usage
printed by "ps aux", as you'd hope.) Free memory comes from "vmstat."
On one system (which happens to be running Java 6), the amount of
memory used by Java plus free memory is about 15 GB. That's reasonable
for a 16 GB machine. But on the other, they are about 13 GB. So we
have about 2 GB less free memory than I had expected. Where is it? Who
knows? But the upshot is that I have to configure the JVM to be 2 GB
smaller than I had expected. This difference does not seem to depend
upon the version of Java. At the moment the one with the missing 2 GB
is Java 5.
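The accounting above is simple enough to script. Here's a sketch; the arithmetic helper is hypothetical, and the commented-out commands assume Linux-style ps/vmstat output (the original measurements were on Solaris with pmap, where the fields differ):

```shell
#!/bin/sh
# Sum Java's resident set (kB) and free memory (kB), report in GB.
# On a healthy 16 GB box this should come out near 15 GB; a result
# around 13 GB is the "missing 2 GB" described above.
mem_accounting() {
    java_rss_kb=$1
    free_kb=$2
    # 1 GB = 1048576 kB; POSIX shell arithmetic is integer-only,
    # so scale by 10 to keep one decimal place.
    total10=$(( (java_rss_kb + free_kb) * 10 / 1048576 ))
    echo "java+free = $(( total10 / 10 )).$(( total10 % 10 )) GB"
}

# On a live machine the inputs would come from something like:
#   java_rss_kb=$(ps -o rss= -p "$JAVA_PID")
#   free_kb=$(vmstat 1 2 | awk 'END {print $5}')
mem_accounting 13631488 2097152   # 13 GB resident + 2 GB free
```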
Here are the options I'm currently using with Java 6:
JAVA_OPTS=" -d64 -Dsun.lang.ClassLoader.allowArraySyntax=true -Xmx10500m -Xms10500m -Xmn2560m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:MaxPermSize=512m -XX:PermSize=512m -XX:+DisableExplicitGC -XX:+DoEscapeAnalysis "
Also, I strongly recommend adding -Dfile.encoding=UTF-8, but I do that
in the second JAVA_OPTS declaration.
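For concreteness, by "second JAVA_OPTS declaration" I mean an append; a sketch of what that might look like (e.g. in Tomcat's setenv.sh, the exact file is an assumption):

```shell
# Hypothetical second declaration: append the encoding flag to the
# options listed above rather than restating the whole line.
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"
```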
I'd really rather allocate more memory. We're tight on both new and
old. But until we go through a heavy period that's all I feel safe
with at the moment. Some comments on configuration:
* I don't think you need to configure the size of survivor spaces. The
new default looks fine.
* You can probably allow new (-Xmn) to default, but I prefer a
slightly larger new than that, because we sometimes have 1.1GB
objects, and I'd like them to fit in new.
* I think it's necessary to configure
-XX:CMSInitiatingOccupancyFraction=75, but YMMV. In Java 5, the default
is 68%, which is fine. However, in Java 6 the default is dynamic, but
seems to work out to a much higher threshold. At one point I thought
that led to more concurrent mode failures.
* I'm using DisableExplicitGC because there are a few calls to
System.gc(), and I don't really want that to happen. Another
possibility would be -XX:+ExplicitGCInvokesConcurrent.
* I'm using escape analysis in the hopes that it will reduce
fragmentation. At any rate, it seems like a good idea, and I haven't
seen any recent bugs involving it, so it looks like it's safe.
* I start perm out at 512 MB, because expanding it seems to cause a full
GC. I'd like to avoid full GCs as much as possible.
I'm still seeing a few full GCs, like one a week per front end. They
take 24-28 sec. They occur on both Java 5 and 6. The ones I've
tracked down have been due to:
* sometimes the system continued to use permanent space. Eventually a
GC is forced to reclaim space in perm. By default this is a full GC.
It might be worth experimenting with -XX:+CMSClassUnloadingEnabled,
which in Java 5 should be combined with
-XX:+CMSPermGenSweepingEnabled. But we've done enough experiments for a
while.
* in one case we got a concurrent mode failure. From looking at jstat,
I'd say someone was building up a 1.1 GB data structure incrementally.
New couldn't quite handle it, so the big object was put in old. Old
obviously didn't have enough contiguous space. This seems to be rare.
I'm assuming this is the result of uploading a big file, but I thought
that had been fixed so it didn't put the whole file in memory.
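If you do try the class-unloading experiment mentioned above, the flags would just be appended to the same JAVA_OPTS line. A sketch only, not something we've run in production:

```shell
# Java 6: class unloading under CMS.
JAVA_OPTS="$JAVA_OPTS -XX:+CMSClassUnloadingEnabled"
# Java 5: pair it with perm gen sweeping as well, e.g.:
#   JAVA_OPTS="$JAVA_OPTS -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled"
```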
As far as I know, these would be issues on both versions of Java. I
consider 24-28 sec a bit slow, but from a user point of view, once a
pause gets to 10 sec, I'm not sure how much difference going to 20
makes. What killed us was the pauses of several minutes. That caused
the load balancer to consider the machine down, and of course users
would probably give up too. But I now think that was paging. We have
fairly old servers. The next generation should do full GCs a lot
faster. The ultimate solution is probably G1, the new GC. However it
doesn't seem to be ready for production yet.
MONITORING
I use several tools:
* I have a program that tails catalina.out, and pulls out any GC that
takes more than 10 sec. That shows up in my status summary.
* I run "jstat -gc PID 5000" continuously (actually for a minute at a
time, so I can put time stamps in the log)
* I run "vmstat 5" continuously (for a minute at a time, with time
stamps). Free memory from vmstat also goes into my status summary.
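The catalina.out filter is the only nontrivial piece, so here's a sketch of it plus the timestamped loops. The GC line format shown is an assumption (verbose GC output varies by collector and flags), so the pattern would need adjusting to match your logs:

```shell
#!/bin/sh
# Print any GC log line whose pause exceeds 10 seconds, assuming lines
# shaped like "[Full GC 9000K->800K(10240K), 24.1234 secs]".
slow_gcs() {
    awk 'match($0, /[0-9]+\.[0-9]+ secs/) {
        # substr grabs e.g. "24.1234 secs"; adding 0 coerces it to a number.
        secs = substr($0, RSTART, RLENGTH) + 0
        if (secs > 10) print
    }'
}

# Timestamped jstat/vmstat loops, a minute at a time (illustrative only;
# 12 samples at 5-second intervals = one minute per date stamp):
#   while :; do date; jstat -gc "$JAVA_PID" 5000 12; done >> gc.log
#   while :; do date; vmstat 5 12; done >> vm.log

# Example: only the 24-second full GC gets through the filter.
printf '%s\n' \
  '[GC 1024K->512K(2048K), 0.0123 secs]' \
  '[Full GC 9000K->800K(10240K), 24.1234 secs]' | slow_gcs
```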
More information about the production mailing list