[Building Sakai] Search tool: memory problem in rebuilding indexes.

Ian Boston ian at caret.cam.ac.uk
Thu Nov 12 01:18:49 PST 2009


Ok, thanks.
I think we did this when testing Jackrabbit in < 1G without issues,  
but I will check with those who *might* have done it.
(btw our 2.5 production is 32-bit (1.5G) and we have rebuilt the
index several times over the past 24 months, though we might have moved to
64-bit in the last few months)
Ian

On 12 Nov 2009, at 05:29, Stephen Marquard wrote:

> The easiest way to reproduce it is to extract all the docx and xlsx
> files from your production system's content hosting, and feed them to
> the indexer/digester.
>
> The visible effects can be reproduced almost immediately (e.g. with
> a sample set of 500 files or so) by watching the GC activity (even
> with production-level JVM settings, e.g. 6G total memory for a
> 64-bit JVM).
>
> Regards
> Stephen
>
>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 10:11 PM >>>
> Do you have any example documents that cause the problem, so I can see
> if Jackrabbit exhibits the same behavior ?
>
> Thanks
> Ian
>
> On 11 Nov 2009, at 19:56, Stephen Marquard wrote:
>
>> We have current versions of POI and they don't fix the problem.
>>
>> Regards
>> Stephen
>>
>>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 9:13 PM >>>
>>
>> On 11 Nov 2009, at 10:41, Stephen Marquard wrote:
>>
>>> Hi,
>>>
>>> I believe we saw something similar. There may be a fix in trunk
>>> though I don't have a JIRA reference handy. If you search recent
>>> JIRAs for Search you may find it, otherwise David Horwitz can tell
>>> you more though he's away until mid next week.
>>>
>>> Also the POI digesters for OOXML (Office 2007+ docx, xlsx, pptx,
>>> etc.) are particularly bad at using memory - digesting content with
>>> these digesters _significantly_ increases GC activity.
>>>
>>> We haven't yet found a solution to this except to minimize the
>>> impact through restricting indexing to a single app server.
>>>
>>> This is likely to be an issue in Sakai 3 as well AFAIK, as the same
>>> underlying libraries are used.
>>
>>
>> I think Sakai 2 uses older versions of POI.
>>
>> The indexers in Sakai3 (Jackrabbit) are more up to date, not least
>> because there are committers on POI and Lucene working on or in close
>> contact with the Jackrabbit team, so the use of Lucene is way way way
>> more advanced than in Sakai Search.
>>
>> The other thing to note is a) Apache Tika is becoming and b) POI is
>> starting to do releases again, so taking a later version of POI will
>> almost certainly fix these problems.
>> IIUC
>> Ian
>>
>>
>
>
>
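For anyone trying the reproduction described in the thread (feed a sample of docx/xlsx files to the digester and watch the GC activity), the following is a minimal sketch of one way to put numbers on "GC activity" from inside the JVM. It uses only the standard java.lang.management API. The class name GcProbe, its methods, and the placeholder workload are hypothetical and are not part of Sakai Search, POI, or Jackrabbit; the real digester call would replace the lambda in main.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: measures the GC activity caused by a unit of work. */
class GcProbe {

    /** Collection count and accumulated GC time (ms), summed over all collectors. */
    static final class GcSnapshot {
        final long collections;
        final long millis;

        GcSnapshot(long collections, long millis) {
            this.collections = collections;
            this.millis = millis;
        }

        /** Difference between this snapshot and an earlier one. */
        GcSnapshot minus(GcSnapshot earlier) {
            return new GcSnapshot(collections - earlier.collections,
                                  millis - earlier.millis);
        }
    }

    /** Reads the current totals from every garbage collector in this JVM. */
    static GcSnapshot snapshot() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(count, time);
    }

    /** Runs the work and returns the GC activity observed while it ran. */
    static GcSnapshot measure(Runnable work) {
        GcSnapshot before = snapshot();
        work.run();
        return snapshot().minus(before);
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption): allocates short-lived buffers.
        // Replace this with the call that digests the sample documents.
        GcSnapshot delta = measure(() -> {
            List<byte[]> buffers = new ArrayList<>();
            for (int i = 0; i < 2_000; i++) {
                buffers.add(new byte[64 * 1024]);
                if (buffers.size() > 100) {
                    buffers.clear();
                }
            }
        });
        System.out.println("GC runs: " + delta.collections
                + ", GC time (ms): " + delta.millis);
    }
}
```

Comparing the delta for a batch of OOXML files against a batch of similar size in another format should show whether the digesters are responsible for the extra collections. The counters cover the whole JVM, so the comparison is only meaningful when little else is running; starting the JVM with -verbose:gc gives a coarser view of the same thing without any code.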


