[Building Sakai] Search tool: memory problem in rebuilding indexes.

Stephen Marquard stephen.marquard at uct.ac.za
Wed Nov 11 21:29:58 PST 2009


The easiest way to reproduce it is to extract all the docx and xlsx files from your production system's content hosting, and feed them to the indexer/digester.

The visible effects can be reproduced almost immediately (e.g. with a sample set of 500 files or so) by watching the GC activity (even with production-level JVM settings, e.g. 6G total memory for a 64-bit JVM).
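
A rough sketch of such a harness, using only the standard JVM management beans, might look like the following. The class and method names (GcWatch, ooxmlFiles, measure) are made up for illustration, and the "digester" here is only a placeholder that reads each file's bytes; to see the real effect it would need to be replaced with a call into the actual digester under test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GcWatch {

    /** Sum of collection counts over all collectors (undefined values of -1 count as 0). */
    public static long gcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Sum of accumulated collection time in milliseconds over all collectors. */
    public static long gcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    /** Recursively collect the .docx and .xlsx files under root (case-insensitive). */
    public static List<Path> ooxmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase();
                    return name.endsWith(".docx") || name.endsWith(".xlsx");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Run the digester over every file; return {extra GC runs, extra GC millis}. */
    public static long[] measure(List<Path> files, Consumer<Path> digester) {
        long countBefore = gcCount();
        long timeBefore = gcTimeMillis();
        for (Path file : files) {
            digester.accept(file);
        }
        return new long[] { gcCount() - countBefore, gcTimeMillis() - timeBefore };
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> files = ooxmlFiles(root);

        // Placeholder digester (assumption): just reads the bytes.
        // Swap in the real digester call to reproduce the GC load.
        Consumer<Path> digester = p -> {
            try {
                Files.readAllBytes(p);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };

        long[] delta = measure(files, digester);
        System.out.printf("files=%d gcRuns=%d gcMillis=%d%n",
            files.size(), delta[0], delta[1]);
    }
}
```

Running it against a directory of extracted files, once with the placeholder and once with the real digester plugged in, should give a crude before/after comparison of GC runs and GC time. Starting the JVM with GC logging enabled would show the same thing in more detail.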

Regards
Stephen 
 
>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 10:11 PM >>> 
Do you have any example documents that cause the problem, so I can see  
if Jackrabbit exhibits the same behavior?

Thanks
Ian

On 11 Nov 2009, at 19:56, Stephen Marquard wrote:

> We have current versions of POI and they don't fix the problem.
>
> Regards
> Stephen
>
>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 9:13 PM >>>
>
> On 11 Nov 2009, at 10:41, Stephen Marquard wrote:
>
>> Hi,
>>
>> I believe we saw something similar. There may be a fix in trunk
>> though I don't have a JIRA reference handy. If you search recent
>> JIRAs for Search you may find it, otherwise David Horwitz can tell
>> you more though he's away until mid next week.
>>
>> Also the POI digesters for OOXML (Office 2007+ docx, xlsx, pptx,
>> etc.) are particularly bad at using memory - digesting content with
>> these digesters _significantly_ increases GC activity.
>>
>> We haven't yet found a solution to this except to minimize the
>> impact through restricting indexing to a single app server.
>>
>> This is likely to be an issue in Sakai 3 as well AFAIK, as the same
>> underlying libraries are used.
>
>
> I think Sakai 2 uses older versions of POI.
>
> The indexers in Sakai3 (Jackrabbit) are more up to date, not least
> because there are committers on POI and Lucene working on or in close
> contact with the Jackrabbit team, so the use of Lucene is way way way
> more advanced than in Sakai Search.
>
> The other thing to note is a) Apache Tika is becoming and b) POI is
> starting to do releases again, so taking a later version of POI will
> almost certainly fix these problems.
> IIUC
> Ian
>
>