[Building Sakai] Search tool: memory problem in rebuilding indexes.

David Horwitz david.horwitz at uct.ac.za
Wed Nov 18 00:33:34 PST 2009


For clarity there are 2 possible memory issues:

1) Digesting certain Word OOXML docs can lead to GC issues (this is what
Stephen is talking about)
2) Doing a full rebuild on a large installation leads to memory issues
(see SAK-17117)

Some details on 2:
1) It happens while Sakai is building the list of objects to index
2) It is not related to Lucene, POI or any other Apache technology
3) On our hardware it becomes noticeable after about 1 million documents
have been added to the document queue
4) Slowing down the rate at which the list is built mitigates the problem
but does not resolve it

From point 3 I suspect this is related to GC and/or the Sakai caches.
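
One way to check the GC suspicion is to compare the JVM's collector counters
before and after a step of interest. The following is only a minimal sketch
using the standard java.lang.management API. The class name GcProbe and the
allocation loop in main are hypothetical placeholders, not Sakai code; the
loop would be swapped for the real digest or queue-building step. The
counters are JVM-wide, so collections triggered by other threads are
included in the result.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Sakai: reports GC activity around a task.
final class GcProbe {

    /** Returns {collection count, accumulated collection time in ms}, summed over all collectors. */
    static long[] snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    /** Runs the task and returns {collections, GC time in ms} observed JVM-wide while it ran. */
    static long[] measure(Runnable task) {
        long[] before = snapshot();
        task.run();
        long[] after = snapshot();
        return new long[] {after[0] - before[0], after[1] - before[1]};
    }

    public static void main(String[] args) {
        // Placeholder workload: allocate many short-lived buffers.
        // Replace with the real digest or queue-building step.
        long[] delta = measure(() -> {
            List<byte[]> batch = new ArrayList<>();
            for (int i = 0; i < 20_000; i++) {
                batch.add(new byte[16 * 1024]);
                if (batch.size() == 100) {
                    batch.clear();
                }
            }
        });
        System.out.println("collections=" + delta[0] + " gcTimeMs=" + delta[1]);
    }
}
```

Running the same measurement around a small batch and a large batch should
show whether GC time grows out of proportion to the amount of work, which
is the pattern that would point at GC rather than at the indexing libraries.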

David

Stephen Marquard wrote:
> The easiest way to reproduce it is to extract all the docx and xlsx files from your production system's content hosting, and feed them to the indexer/digester.
>
> The visible effects can be reproduced almost immediately (e.g. with a sample set of 500 files or so) by watching the GC activity (even with production-level JVM settings, e.g. 6G total memory for a 64-bit JVM).
>
> Regards
> Stephen 
>  
>   
>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 10:11 PM >>> 
>>>>         
> Do you have any example documents that cause the problem, so I can see  
> if Jackrabbit exhibits the same behavior ?
>
> Thanks
> Ian
>
> On 11 Nov 2009, at 19:56, Stephen Marquard wrote:
>
>   
>> We have current versions of POI and they don't fix the problem.
>>
>> Regards
>> Stephen
>>
>>     
>>>>> Ian Boston <ian at caret.cam.ac.uk> 11/11/2009 9:13 PM >>>
>>>>>           
>> On 11 Nov 2009, at 10:41, Stephen Marquard wrote:
>>
>>     
>>> Hi,
>>>
>>> I believe we saw something similar. There may be a fix in trunk
>>> though I don't have a JIRA reference handy. If you search recent
>>> JIRAs for Search you may find it, otherwise David Horwitz can tell
>>> you more though he's away until mid next week.
>>>
>>> Also the POI digesters for OOXML (Office 2007+ docx, xlsx, pptx,
>>> etc.) are particularly bad at using memory - digesting content with
>>> these digesters _significantly_ increases GC activity.
>>>
>>> We haven't yet found a solution to this except to minimize the
>>> impact through restricting indexing to a single app server.
>>>
>>> This is likely to be an issue in Sakai 3 as well AFAIK, as the same
>>> underlying libraries are used.
>>>       
>> I think Sakai 2 uses older versions of POI.
>>
>> The indexers in Sakai 3 (Jackrabbit) are more up to date, not least
>> because there are committers on POI and Lucene working on or in close
>> contact with the Jackrabbit team, so the use of Lucene is way, way
>> more advanced than in Sakai Search.
>>
>> The other thing to note is a) Apache Tika is becoming and b) POI is
>> starting to do releases again, so taking a later version of POI will
>> almost certainly fix these problems.
>> IIUC
>> Ian
>>
>>
>>     
>
>
>