[Building Sakai] Search tool: memory problem in rebuilding indexes.

Stephen Marquard stephen.marquard at uct.ac.za
Thu Nov 12 02:18:27 PST 2009


Your 2.5 production will have earlier versions of POI that don't index the OOXML types (xlsx, docx) so you wouldn't see the problem then.

We first noticed it when we deployed 2-6-x with updated POIs that indexed xlsx and docx files. Search indexing has been the single biggest impact on production performance in our 2-6-x system (basically erratic response times from increased GC activity when digesting is taking place).

I'm sure there's some way to instrument exactly what's happening with memory use for a particular document to create a reproducible test case, but we haven't got that far. My guess is that documents with complex internal XML representations are causing the problem, because POI is reading them into a large DOM or something (in fact something even lower than POI, I think ooxml4j).
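
One rough way to instrument this, as a sketch only: wrap the digest call for a single document in a probe that records how much GC activity the JVM reports while it runs. GcProbe, Result and measure are hypothetical names, not part of Sakai or POI, and the digester call itself is left to the caller. The counters come from the standard java.lang.management beans and are JVM-wide, so this is only meaningful on an otherwise idle test JVM.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: records the GC activity observed while one task runs. */
final class GcProbe {

    /** Deltas observed across a single task. */
    static final class Result {
        final long gcCount;       // collections that ran during the task
        final long gcMillis;      // approximate time spent collecting
        final long elapsedMillis; // wall-clock time of the task

        Result(long gcCount, long gcMillis, long elapsedMillis) {
            this.gcCount = gcCount;
            this.gcMillis = gcMillis;
            this.elapsedMillis = elapsedMillis;
        }

        @Override
        public String toString() {
            return "gcCount=" + gcCount + " gcMillis=" + gcMillis
                    + " elapsedMillis=" + elapsedMillis;
        }
    }

    /** Sums collection counts over all collectors; -1 means "undefined" and is skipped. */
    private static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sums accumulated collection time over all collectors, in milliseconds. */
    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Runs the task once and returns the JVM-wide GC deltas seen while it ran. */
    static Result measure(Runnable task) {
        long countBefore = totalGcCount();
        long millisBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long elapsed = (System.nanoTime() - start) / 1000000L;
        return new Result(totalGcCount() - countBefore,
                totalGcMillis() - millisBefore, elapsed);
    }
}
```

Calling GcProbe.measure once per file, with a Runnable that hands that file to whatever digester is under test, and logging the Result next to the file name should make the worst documents stand out. It does not show where the memory goes inside POI; a heap dump or profiler would be needed for that.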

Cheers
Stephen

>>> Ian Boston <ian at caret.cam.ac.uk> 2009/11/12 11:18 AM >>>
Ok, thanks.
I think we did this when testing Jackrabbit in < 1G without issues,  
but I will check with those who *might* have done it.
  (btw our 2.5 production is 32 bit (1.5G) and we have rebuilt the  
index several times over the past 24 months, we might have moved to  
64bit in the last few months)
Ian

On 12 Nov 2009, at 05:29, Stephen Marquard wrote:

> The easiest way to reproduce it is to extract all the docx and xlsx
> files from your production system's content hosting, and feed them to
> the indexer/digester.
>
> The visible effects can be reproduced almost immediately (e.g. with  
> a sample set of 500 files or so) by watching the GC activity (even  
> with production-level JVM settings, e.g. 6G total memory for a 64- 
> bit jvm).
>
> Regards
> Stephen
>
>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 10:11 PM >>>
> Do you have any example documents that cause the problem, so I can see
> if Jackrabbit exhibits the same behavior ?
>
> Thanks
> Ian
>
> On 11 Nov 2009, at 19:56, Stephen Marquard wrote:
>
>> We have current versions of POI and they don't fix the problem.
>>
>> Regards
>> Stephen
>>
>>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 9:13 PM >>>
>>
>> On 11 Nov 2009, at 10:41, Stephen Marquard wrote:
>>
>>> Hi,
>>>
>>> I believe we saw something similar. There may be a fix in trunk
>>> though I don't have a JIRA reference handy. If you search recent
>>> JIRAs for Search you may find it, otherwise David Horwitz can tell
>>> you more though he's away until mid next week.
>>>
>>> Also the POI digesters for OOXML (Office 2007+ docx, xlsx, pptx,
>>> etc.) are particularly bad at using memory - digesting content with
>>> these digesters _significantly_ increases GC activity.
>>>
>>> We haven't yet found a solution to this except to minimize the
>>> impact through restricting indexing to a single app server.
>>>
>>> This is likely to be an issue in Sakai 3 as well AFAIK, as the same
>>> underlying libraries are used.
>>
>>
>> I think Sakai 2 uses older versions of POI.
>>
>> The indexers in Sakai3 (Jackrabbit) are more up to date, not least
>> because there are committers on POI and Lucene working on or in close
>> contact with the Jackrabbit team, so the use of Lucene is way way way
>> more advanced than in Sakai Search.
>>
>> The other thing to note is a) Apache Tika is becoming and b) POI is
>> starting to do releases again, so taking a later version of POI will
>> almost certainly fix these problems.
>> IIUC
>> Ian
>>
>>
>
>
>



