[Building Sakai] Search tool: memory problem in rebuilding indexes.

Aaron Zeckoski aaronz at vt.edu
Thu Nov 12 02:35:19 PST 2009


POI handling of .???x files is notoriously memory intensive because of
the use of XMLBeans, among other things
(https://issues.apache.org/bugzilla/show_bug.cgi?id=46774). I was
working with xlsx recently and noticed that the exact same document
uses up a lot more memory while running tests than a normal xls
document. I did not measure the exact memory use, but a 300x100 grid
XL document used up all the available memory in maven (max of 512m in
my configuration). That's mostly anecdotal evidence and should be
confirmed with real tests.
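
As a starting point for such a test, here is a minimal sketch of a probe
that reports GC activity and heap growth around a single task. It uses only
the JDK's java.lang.management API. The class name DigestMemoryProbe and the
stand-in workload in main() are hypothetical and not part of Sakai or POI;
in a real test the task would be whatever call digests one xls or xlsx
file, run once per format for comparison. The numbers are approximate,
since other threads also allocate and trigger collections.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Sakai or POI): measures GC activity around a task. */
public class DigestMemoryProbe {

    /** What was observed while the task ran. All values are approximate. */
    public static final class Result {
        public final long gcCount;        // collections that ran during the task
        public final long gcMillis;       // accumulated GC time during the task
        public final long heapDeltaBytes; // heap used after minus before (may be negative)
        public final long elapsedMillis;  // wall-clock time of the task

        Result(long gcCount, long gcMillis, long heapDeltaBytes, long elapsedMillis) {
            this.gcCount = gcCount;
            this.gcMillis = gcMillis;
            this.heapDeltaBytes = heapDeltaBytes;
            this.elapsedMillis = elapsedMillis;
        }

        @Override
        public String toString() {
            return "gcCount=" + gcCount + " gcMillis=" + gcMillis
                    + " heapDeltaBytes=" + heapDeltaBytes
                    + " elapsedMillis=" + elapsedMillis;
        }
    }

    /** Sums collection counts over all collectors; a collector reports -1 if undefined. */
    private static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Sums accumulated collection time (ms) over all collectors. */
    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    /** Runs the task once and reports the GC and heap deltas around it. */
    public static Result measure(Runnable task) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long count0 = totalGcCount();
        long millis0 = totalGcMillis();
        long heap0 = memory.getHeapMemoryUsage().getUsed();
        long start = System.nanoTime();

        task.run();

        long elapsed = (System.nanoTime() - start) / 1_000_000L;
        long heap1 = memory.getHeapMemoryUsage().getUsed();
        return new Result(totalGcCount() - count0, totalGcMillis() - millis0,
                heap1 - heap0, elapsed);
    }

    public static void main(String[] args) {
        // Stand-in workload: allocates about 16 MB. Replace with the code
        // that digests a single document to compare xls against xlsx.
        Result r = measure(() -> {
            List<byte[]> chunks = new ArrayList<>();
            for (int i = 0; i < 64; i++) {
                chunks.add(new byte[256 * 1024]);
            }
        });
        System.out.println(r);
    }
}
```

Running the same document in both formats with a fixed -Xmx and comparing
gcCount and gcMillis should show whether the difference is reproducible.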

-AZ


On Thu, Nov 12, 2009 at 10:18 AM, Stephen Marquard
<stephen.marquard at uct.ac.za> wrote:
> Your 2.5 production will have earlier versions of POI that don't index the OOXML types (xlsx, docx) so you wouldn't see the problem then.
>
> We first noticed it when we deployed 2-6-x with updated POIs that indexed xlsx and docx files. Search indexing has been the single biggest impact on production performance in our 2-6-x system (basically erratic response times from increased GC activity when digesting is taking place).
>
> I'm sure there's some way to instrument exactly what's happening with memory use for a particular document to create a reproducible test case, but we haven't got that far. My guess is that documents with complex internal XML representations are causing the problem, because POI is reading them into a large DOM or something (in fact something even lower than POI, I think ooxml4j).
>
> Cheers
> Stephen
>
>>>> Ian Boston <ian at caret.cam.ac.uk> 2009/11/12 11:18 AM >>>
> Ok, thanks.
> I think we did this when testing Jackrabbit in < 1G without issues,
> but I will check with those who *might* have done it.
>  (btw our 2.5 production is 32 bit (1.5G) and we have rebuilt the
> index several times over the past 24 months, we might have moved to
> 64bit in the last few months)
> Ian
>
> On 12 Nov 2009, at 05:29, Stephen Marquard wrote:
>
>> The easiest way to reproduce it is to extract all the docx and xlsx
>> files from your production system's content hosting, and feed them
>> to the indexer/digester.
>>
>> The visible effects can be reproduced almost immediately (e.g. with
>> a sample set of 500 files or so) by watching the GC activity (even
>> with production-level JVM settings, e.g. 6G total memory for a 64-
>> bit jvm).
>>
>> Regards
>> Stephen
>>
>>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 10:11 PM >>>
>> Do you have any example documents that cause the problem, so I can see
>> if Jackrabbit exhibits the same behavior ?
>>
>> Thanks
>> Ian
>>
>> On 11 Nov 2009, at 19:56, Stephen Marquard wrote:
>>
>>> We have current versions of POI and they don't fix the problem.
>>>
>>> Regards
>>> Stephen
>>>
>>>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 9:13 PM >>>
>>>
>>> On 11 Nov 2009, at 10:41, Stephen Marquard wrote:
>>>
>>>> Hi,
>>>>
>>>> I believe we saw something similar. There may be a fix in trunk
>>>> though I don't have a JIRA reference handy. If you search recent
>>>> JIRAs for Search you may find it, otherwise David Horwitz can tell
>>>> you more though he's away until mid next week.
>>>>
>>>> Also the POI digesters for OOXML (Office 2007+ docx, xlsx, pptx,
>>>> etc.) are particularly bad at using memory - digesting content with
>>>> these digesters _significantly_ increases GC activity.
>>>>
>>>> We haven't yet found a solution to this except to minimize the
>>>> impact through restricting indexing to a single app server.
>>>>
>>>> This is likely to be an issue in Sakai 3 as well AFAIK, as the same
>>>> underlying libraries are used.
>>>
>>>
>>> I think Sakai 2 uses older versions of POI.
>>>
>>> The indexers in Sakai3 (Jackrabbit) are more up to date, not least
>>> because there are committers on POI and Lucene working on or in close
>>> contact with the Jackrabbit team, so the use of Lucene is way, way
>>> more advanced than in Sakai Search.
>>>
>>> The other thing to note is a) Apache Tika is becoming and b) POI is
>>> starting to do releases again, so taking a later version of POI will
>>> almost certainly fix these problems.
>>> IIUC
>>> Ian
>>>
>>>
>>
>>
>>
>
>



-- 
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile

