[Building Sakai] ElasticSearch Testing

John Bush john.bush at rsmart.com
Tue Feb 26 08:33:56 PST 2013


Thanks. I'm going to work on a better query today, because it dawned
on me that a good share of the content isn't actually indexed, and the
unindexed content is typically the big stuff (media files, etc.).  I'm
still doing some configuration discovery, and then I'll put up a
Confluence page with more concrete info about tuning and config.

One thing I've been considering is that maybe we need some more
granular ways to control what gets indexed when you do a full index.
For example, maybe we need a way to index only current terms, or
currently used sites, or something like that.  There is no sense
indexing a bunch of stuff that isn't really used anymore.
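
To make that idea concrete, here is a rough sketch of the kind of
filter I have in mind.  Nothing here is existing Sakai or
ElasticSearch code: the Site record, its fields, and the
select_sites_to_index function are all hypothetical names made up for
illustration, and the one-year idle cutoff is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Set


@dataclass
class Site:
    # Hypothetical record; these fields are assumptions, not Sakai's schema.
    site_id: str
    term: str
    last_accessed: date


def select_sites_to_index(
    sites: Iterable[Site],
    current_terms: Set[str],
    today: date,
    max_idle_days: int = 365,  # arbitrary cutoff, chosen for illustration
) -> List[str]:
    """Return ids of sites worth including in a full index.

    A site is kept if it belongs to a current term, or if it was
    accessed within the last max_idle_days days.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return [
        s.site_id
        for s in sites
        if s.term in current_terms or s.last_accessed >= cutoff
    ]
```

A full index would then walk only the content of the returned sites
and skip everything else.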

On Tue, Feb 26, 2013 at 9:06 AM, Zhen Qian <zqian at umich.edu> wrote:
> Hi, John:
>
> Here is the result for CTools in UMich, with 10+ years of data:
>
> select count(resource_id) from content_resource
> 15652880
>
> select sum(file_size) from content_resource
>
> 15565111245559
>
> So it is about 15.6 million docs totaling roughly 16 TB. The last time the
> system was re-indexed (7/2012), with Sakai 2.7 search, it took 6 hours. I
> don't have the search turnaround data at hand, though.
>
> Search is near the top of our load-testing candidates here at UMich. I hope
> we can do the load test soon. I think it is fine to get the basic search
> (without facet support) working first.
>
> BTW, is there a wiki page for the ElasticSearch project on the Sakai
> Confluence site?
>
> Thanks,
>
> - Zhen
>
>
> On Fri, Feb 22, 2013 at 12:15 PM, John Bush <john.bush at rsmart.com> wrote:
>>
>> I've spent the last few weeks tweaking Sakai's ElasticSearch
>> implementation so that it scales better.  It would be helpful if folks
>> could give me an idea of the number of docs in their Sakai repos, and
>> the total size.  I'm sure this varies, but in general for our clients,
>> especially those that have been using Sakai for a bit, I'm seeing
>> around 400-500k docs and nearly half a terabyte of data.
>>
>> You can simply run these queries to collect that info:
>>
>> select count(resource_id) from content_resource
>> select sum(file_size) from content_resource
>>
>> Currently, using 4 medium-size nodes in AWS, with 35k docs and a repo
>> of 20GB, I'm getting search response times on average around 150ms and
>> often much faster.  I'm going along doubling the repository size and
>> so far not seeing much of any impact on performance, although I
>> imagine there is a point where that changes.
>>
>> The code in trunk does not scale well, so I will be making a big
>> commit once I have all the kinks ironed out.  It turns out that the
>> highlighting in ElasticSearch is slow, and it also greatly increased
>> the size of the index.  I had to rewrite that piece to do my own
>> highlighting, similar to what we were doing in the legacy search.  The
>> side effect of that is that we no longer need to store the whole
>> source doc in ES, so the index size has dropped dramatically.
>> Right now I'm seeing an index size that is about half the size of the
>> repo.  I think I can get that down further, but it's significantly
>> better than triple the repo, which I was seeing before.
>>
>> I still have some work to do to fine-tune or eliminate the use of
>> facets.  Facets add an enormous memory requirement.  I think I can
>> eliminate this with some more careful indexing, which may end up
>> increasing the size of the index again, but I think that is a fair
>> trade-off vs. requiring significantly more RAM.  There is supposed to
>> be a way to rein in the memory consumption of facets, but I have yet
>> to get that configuration working in practice.
>>
>> Attached is a screenshot, 17,618 hits in 0.118 seconds, not bad.
>>
>> --
>> John Bush
>> 602-490-0470
>>
>
>



--
John Bush
602-490-0470

