[Building Sakai] ElasticSearch Testing

Zhen Qian zqian at umich.edu
Tue Feb 26 09:21:13 PST 2013


A correction to the previously reported data:

Jeff pointed out that CTools only has a handful of sites that are
search-enabled.

With the following setting in CTools, the search reindex only targets
sites that have the Search tool enabled.

onlyIndexSearchToolSites@org.sakaiproject.search.api.SearchIndexBuilder=true

So here are the new queries and results:

select  count(*)
from sakai_site_tool
where registration='sakai.search'

112

select  count(t1.resource_id), sum(t1.file_size)
from content_resource t1, sakai_site_tool t2
where t2.registration='sakai.search'
and t1.context=t2.site_id

55253 329016147091

There are 112 sites with the Search tool, which translates to 55K docs and
0.3T of data. It took the system 6 hours to reindex all of that data.
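
For anyone who wants to sanity-check those figures, here is a minimal
sketch of the arithmetic. It assumes decimal units (1 TB = 10^12 bytes)
and treats the reindex time as exactly 6 hours; the function and variable
names are illustrative only and are not part of Sakai.

```python
# Figures reported above for the 112 search-enabled CTools sites.
DOC_COUNT = 55253
TOTAL_BYTES = 329016147091
REINDEX_HOURS = 6  # assumption: treated as exactly 6 hours


def summarize(doc_count, total_bytes, hours):
    """Return repo size in decimal TB plus average reindex throughput."""
    seconds = hours * 3600
    return {
        "size_tb": total_bytes / 1e12,
        "docs_per_sec": doc_count / seconds,
        "mb_per_sec": total_bytes / 1e6 / seconds,
        "avg_doc_mb": total_bytes / doc_count / 1e6,
    }


stats = summarize(DOC_COUNT, TOTAL_BYTES, REINDEX_HOURS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the reindex averaged roughly 2.6 docs and 15 MB
per second, with an average document size of about 6 MB.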

John, what's the reindex time in your setup with the new search code?

Thanks,

- Zhen




On Tue, Feb 26, 2013 at 11:32 AM, Jeff Cousineau <cousinea at umich.edu> wrote:

> Hi Zhen,
>
> Just a quick question.  Since we don't have site search enabled for all
> sites, does the indexing process _only_ index content for those sites that
> it _is_ enabled for or does it index all content regardless?  Guess I was
> assuming the search index was only generated for the former, meaning it
> took ~6 hours to index content for a very small subset of all
> sites/resources.
>
> Thanks!
> Jeff
>
> On Feb 26, 2013, at 11:06 AM, Zhen Qian <zqian at umich.edu> wrote:
>
> Hi, John:
>
> Here is the result for CTools in UMich, with 10+ years of data:
>
> select count(resource_id) from content_resource
> 15652880
>
> select sum(file_size) from content_resource
>
> 15565111245559
>
> So it is 15 million docs, 16T in size. Last time (7/2012) when the
> system was re-indexed with Sakai 2.7 search, it took 6 hours. I don't have
> the search turnaround data at hand, though.
>
> Search is among the top of our load testing candidates here in UMich. I
> hope we can do the load test soon. I think it is fine to get the basic
> search (without facet support) working first.
>
> BTW, is there a wiki page for the elastic search project on Sakai
> confluence site?
>
> Thanks,
>
> - Zhen
>
>
> On Fri, Feb 22, 2013 at 12:15 PM, John Bush <john.bush at rsmart.com> wrote:
>
>> I've been spending the last few weeks tweaking Sakai's elasticsearch
>> impl in order to scale better.  It would be helpful if folks could
>> give me an idea of the number of docs in their sakai repos, and the
>> total size.  I'm sure this varies, but in general for our clients,
>> especially those that have been using Sakai for a bit, I'm seeing
>> around 400-500k docs and nearly a 1/2 terabyte of data.
>>
>> You can simply run these queries to collect that info:
>>
>> select count(resource_id) from content_resource
>> select sum(file_size) from content_resource
>>
>> Currently using 4 medium size nodes in aws, with 35k docs and a repo
>> of 20GB, I'm getting search response times on average around 150ms and
>> often much faster.  I'm going along doubling the repository size and
>> so far not seeing much of any impact in performance, although I
>> imagine there is a point that changes.
>>
>> The code in trunk does not scale well, so I will be making a big
>> commit once I have all the kinks ironed out.  It turns out that the
>> highlighting in ElasticSearch is slow, and it also greatly increased
>> the size of the repo.  I had to rewrite that piece to do my own
>> highlighting, similar to what we were doing in the legacy search.  The
>> side effect of that is that we no longer need to store the whole
>> source doc in ES, so the index size has dropped dramatically.
>> Right now I'm seeing an index size that is about half the size of the
>> repo.  I think I can get that down further, but it's significantly
>> better than triple the repo size, which I was seeing before.
>>
>> I still have some work to do to fine-tune or eliminate the use of
>> facets.  Facets add an enormous memory requirement.  I think I can
>> eliminate this by some more careful indexing, which may end up
>> increasing the size of the index again, but I think that is a fair
>> tradeoff vs requiring significantly more RAM.  There is supposed to be
>> a way to reel in the memory consumption of these guys, but I have yet
>> to get that configuration working in practice.
>>
>> Attached is a screenshot, 17,618 hits in 0.118 seconds, not bad.
>>
>> --
>> John Bush
>> 602-490-0470
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> "unsubscribe"
>>
>
> --
>
> Jeff Cousineau
> Application Systems Administrator Senior
> Information and Technology Services
> University of Michigan
>
>