[Building Sakai] ElasticSearch Testing

John Bush john.bush at rsmart.com
Tue Feb 26 09:59:38 PST 2013


Ok, yeah, I was wondering how you got a 6-hour index time with that much
data.  I'm still trying to get a timing for that with the full cluster
enabled.  My belief is the index time will be about the same as it was
with the old search, since the bulk of the time is spent digesting the
content and that code isn't changing.  I think at one point I calculated
about 6-8 hours for a terabyte with 6-8 nodes.
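
As a back-of-envelope check on that estimate, taking the midpoints (7
hours, 7 nodes) rather than anything measured:

-- 1 TB is ~1024 GB, so 1024 GB / 7 hours / 7 nodes works out to roughly
-- 21 GB per node per hour of digest-and-index throughput.
select 1024/7/7 as gb_per_node_per_hour;  -- MySQL; append "from dual" for Oracle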

So here are a few more queries you can run to narrow down the content,
for MySQL:

-- Rough breakdown by file extension.  RIGHT(resource_id, 4) takes the
-- last four characters of the id and REPLACE strips the dot, so very
-- short or unusually long extensions will come out garbled.
SELECT
    REPLACE(RIGHT(resource_id, 4), '.', '') AS type,
    SUM(file_size)/1073741824 AS size_gb,
    COUNT(*) AS docs
FROM
    content_resource
GROUP BY
    type
ORDER BY
    size_gb DESC;

or in Oracle:

-- Same breakdown for Oracle: the extension is everything after the last
-- dot (ids without a dot fall through unchanged).
select lower(substr(resource_id, instr(resource_id, '.', -1) + 1)) as ext,
       sum(file_size)/1073741824 as size_gb,
       count(*) as docs
from content_resource
group by lower(substr(resource_id, instr(resource_id, '.', -1) + 1))
having count(*) > 100
order by count(*) desc;

This will give you a breakdown by file extension.  I'm typically seeing
a lot of content in PDFs.  I would imagine the majority of PDF content
is not actually text but images and the like, so even that figure is
misleading.
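
If you want to see how much of the repo the PDFs alone account for, a
variant of the queries above should do it (an untested sketch, matching
on the filename extension, which is approximate for the same reason):

-- Approximate PDF share of the repository, by filename extension.
select count(*) as pdf_docs,
       sum(file_size)/1073741824 as pdf_gb
from content_resource
where lower(resource_id) like '%.pdf';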

On Tue, Feb 26, 2013 at 10:21 AM, Zhen Qian <zqian at umich.edu> wrote:
> A correction to the previously reported data:
>
> Jeff pointed out that CTools only has a handful of sites which are
> search-enabled.
>
> With the following setting for CTools, the search reindex will only target
> those sites that have the search tool enabled.
>
> onlyIndexSearchToolSites at org.sakaiproject.search.api.SearchIndexBuilder=true
>
> So here are the new queries and results:
>
> select  count(*)
> from sakai_site_tool
> where registration='sakai.search'
>
> 112
>
> select  count(t1.resource_id), sum(t1.file_size)
> from content_resource t1, sakai_site_tool t2
> where t2.registration='sakai.search'
> and t1.context=t2.site_id
>
> 55253 329016147091
>
> There are 112 sites with the Search tool, which translates to 55K docs and
> 0.3 TB of data.  It took the system 6 hours to reindex all of that data.
>
> John, what's the reindex time in your setup with the new search code?
>
> Thanks,
>
> - Zhen
>
>
>
>
> On Tue, Feb 26, 2013 at 11:32 AM, Jeff Cousineau <cousinea at umich.edu> wrote:
>>
>> Hi Zhen,
>>
>> Just a quick question.  Since we don't have site search enabled for all
>> sites, does the indexing process _only_ index content for those sites that
>> it _is_ enabled for, or does it index all content regardless?  Guess I was
>> assuming the search index was only generated for the former, meaning it took
>> ~6 hours to index content for a very small subset of all sites/resources.
>>
>> Thanks!
>> Jeff
>>
>> On Feb 26, 2013, at 11:06 AM, Zhen Qian <zqian at umich.edu> wrote:
>>
>> Hi, John:
>>
>> Here are the results for CTools at UMich, with 10+ years of data:
>>
>> select count(resource_id) from content_resource
>> 15652880
>>
>> select sum(file_size) from content_resource
>>
>> 15565111245559
>>
>> So that is 15 million docs, roughly 16 TB in size.  The last time the
>> system was re-indexed with the Sakai 2.7 search (7/2012), it took 6 hours.
>> I don't have the search turnaround data at hand, though.
>>
>> Search is near the top of our load testing candidates here at UMich.  I
>> hope we can do the load test soon.  I think it is fine to get the basic
>> search (without facet support) working first.
>>
>> BTW, is there a wiki page for the ElasticSearch project on the Sakai
>> Confluence site?
>>
>> Thanks,
>>
>> - Zhen
>>
>>
>> On Fri, Feb 22, 2013 at 12:15 PM, John Bush <john.bush at rsmart.com> wrote:
>>>
>>> I've spent the last few weeks tweaking Sakai's elasticsearch
>>> impl so that it scales better.  It would be helpful if folks could
>>> give me an idea of the number of docs in their Sakai repos, and the
>>> total size.  I'm sure this varies, but in general for our clients,
>>> especially those that have been using Sakai for a while, I'm seeing
>>> around 400-500k docs and nearly half a terabyte of data.
>>>
>>> You can simply run these queries to collect that info:
>>>
>>> select count(resource_id) from content_resource
>>> select sum(file_size) from content_resource
>>>
>>> Currently, using 4 medium-size nodes in AWS with 35k docs and a repo
>>> of 20 GB, I'm getting search response times averaging around 150 ms
>>> and often much faster.  I've been progressively doubling the repository
>>> size and so far I'm not seeing much impact on performance, although I
>>> imagine there is a point where that changes.
>>>
>>> The code in trunk does not scale well, so I will be making a big
>>> commit once I have all the kinks ironed out.  It turns out that the
>>> highlighting in ElasticSearch is slow, and it also greatly increased
>>> the size of the index.  I had to rewrite that piece to do my own
>>> highlighting, similar to what we were doing in the legacy search.  The
>>> side effect of that is that we no longer need to store the whole
>>> source doc in ES, so the index size has dropped dramatically.  Right
>>> now I'm seeing an index size that is about half the size of the repo.
>>> I think I can get that down further, but it's significantly better
>>> than the three-times-the-repo size I was seeing before.
>>>
>>> I still have some work to do to fine-tune or eliminate the use of
>>> facets.  Facets add an enormous memory requirement.  I think I can
>>> eliminate this with some more careful indexing, which may end up
>>> increasing the size of the index again, but I think that is a fair
>>> tradeoff versus requiring significantly more RAM.  There is supposed
>>> to be a way to rein in the memory consumption of facets, but I have
>>> yet to get that configuration working in practice.
>>>
>>> Attached is a screenshot: 17,618 hits in 0.118 seconds.  Not bad.
>>>
>>> --
>>> John Bush
>>> 602-490-0470
>>>
>>
>>
>>
>>
>> --
>>
>> Jeff Cousineau
>> Application Systems Administrator Senior
>> Information and Technology Services
>> University of Michigan
>>
>
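
For anyone rerunning Zhen's per-site numbers above, here is the same
query sketched with an explicit join and the byte sum converted to GB
(same tables and columns as in the thread, untested here):

-- Docs and GB of content in sites that have the search tool enabled.
select count(t1.resource_id) as docs,
       sum(t1.file_size)/1073741824 as size_gb
from content_resource t1
join sakai_site_tool t2 on t1.context = t2.site_id
where t2.registration = 'sakai.search';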



--
John Bush
602-490-0470
