[Building Sakai] ElasticSearch Testing

Zhen Qian zqian at umich.edu
Tue Feb 26 10:29:08 PST 2013


Hi, John:

Here is the result. Most items are doc, jpg, htm, pdf, ppt, etc. However,
the media types mp4, mp3, and m4b account for the major portion of the data
size. Can Sakai search inside multimedia content?

Thanks,

- Zhen

select lower(substr(t1.resource_id, instr(t1.resource_id, '.', -1) + 1)),
sum(t1.file_size)/1073741824, count(*)
from content_resource t1, sakai_site_tool t2
where t2.registration='sakai.search'
and t1.context=t2.site_id
group by lower(substr(t1.resource_id, instr(t1.resource_id, '.', -1) + 1))
having count(*)>100
order by count(*) desc;
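For anyone checking the query logic: `instr(..., '.', -1)` finds the last dot, so the expression returns the lowercased text after it. A minimal Python sketch of the same extraction (illustrative only, not part of the original message):

```python
def file_ext(resource_id: str) -> str:
    """Return the lowercased extension after the last dot, mirroring
    lower(substr(id, instr(id, '.', -1) + 1)) in the Oracle query."""
    dot = resource_id.rfind(".")
    # Oracle's instr() returns 0 when no dot exists, and substr(id, 1)
    # then yields the whole string; mimic that edge case here.
    return resource_id[dot + 1:].lower() if dot >= 0 else resource_id.lower()

print(file_ext("/group/site1/Lecture01.MP4"))  # mp4
```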


ext     size (GB)   count
doc       2.486      7362
jpg       0.951      6153
htm       0.011      5379
pdf       7.185      4917
ppt      39.796      4352
html      0.013      4162
docx      0.632      3202
mp3      20.199      1897
mp4     203.556      1694
m4b      19.748      1670
swf       0.191      1442
xls       0.587      1435
xlsx      0.269      1233
pptx      1.415       968
sis      <0.001       909
flv       1.937       663
gif       0.007       566
xml       0.002       452
url      <0.001       410
rtf       0.016       404
png       0.032       342
js        0.003       321
fla       0.051       248
mpp       0.166       221
txt       0.148       203
rb       <0.001       144
vsd       0.065       142
java     <0.001       136
mov       0.983       125
css      <0.001       103
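The media-type observation can be quantified directly from the table. A quick sketch using the per-extension sizes above (rounded values, types under 0.001 GB omitted; illustrative only):

```python
# Size in GB per extension, taken (rounded) from the query output above.
sizes_gb = {
    "doc": 2.486, "jpg": 0.951, "htm": 0.011, "pdf": 7.185, "ppt": 39.796,
    "html": 0.013, "docx": 0.632, "mp3": 20.199, "mp4": 203.556,
    "m4b": 19.748, "swf": 0.191, "xls": 0.587, "xlsx": 0.269,
    "pptx": 1.415, "flv": 1.937, "gif": 0.007, "xml": 0.002, "rtf": 0.016,
    "png": 0.032, "js": 0.003, "fla": 0.051, "mpp": 0.166, "txt": 0.148,
    "vsd": 0.065, "mov": 0.983,
}
media = {"mp3", "mp4", "m4b"}  # the audio/video types called out above

total = sum(sizes_gb.values())
media_share = sum(v for k, v in sizes_gb.items() if k in media) / total
print(f"media share of total size: {media_share:.0%}")  # prints "media share of total size: 81%"
```

So roughly four fifths of the bytes in the search-enabled sites are audio/video, even though those types account for a small fraction of the item count.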



On Tue, Feb 26, 2013 at 12:59 PM, John Bush <john.bush at rsmart.com> wrote:

> Ok, yeah, I was wondering how you did a 6-hour index with that much
> data.  I'm still trying to get a timing for that with the full cluster
> enabled; my belief is that the index time will be about the same as it
> was with the old search, as the bulk of the time is spent digesting the
> content and that code isn't changing.   I think at one point I
> calculated about 6-8 hours for a terabyte with 6-8 nodes.
>
> So here are a few more queries you can run to narrow down the content,
> for mysql:
>
> SELECT
>     REPLACE(RIGHT(resource_id, 4),'.','') as type ,
>     sum(file_size)/1073741824 as size, count(*)
> FROM
>     content_resource
> GROUP BY
>     type
> ORDER BY
>     size desc;
>
> or in oracle:
>
> select lower(substr(resource_id, instr(resource_id, '.', -1) + 1)),
> sum(file_size)/1073741824, count(*) from content_resource
> group by lower(substr(resource_id, instr(resource_id, '.', -1) + 1))
> having count(*)>100 order by count(*) desc;
>
> This will give you a breakdown by file extension.  I'm seeing a lot of
> stuff in PDFs, typically.  I would imagine the majority of PDF content
> is not actually text but images and the like.  So even that is
> misleading.
>
> On Tue, Feb 26, 2013 at 10:21 AM, Zhen Qian <zqian at umich.edu> wrote:
> > A correction to the previously reported data:
> >
> > Jeff pointed out that CTools only has a handful of sites that are
> > search-enabled.
> >
> > With the following setting for CTools, the search reindex will only
> > target those sites with the Search tool enabled.
> >
> > onlyIndexSearchToolSites@org.sakaiproject.search.api.SearchIndexBuilder=true
> >
> > So here are the new queries and results:
> >
> > select  count(*)
> > from sakai_site_tool
> > where registration='sakai.search'
> >
> > 112
> >
> > select  count(t1.resource_id), sum(t1.file_size)
> > from content_resource t1, sakai_site_tool t2
> > where t2.registration='sakai.search'
> > and t1.context=t2.site_id
> >
> > 55253 329016147091
> >
> > There are 112 sites with the Search tool, which translates to 55K docs
> > and 0.3T of data. It took the system 6 hours to reindex all of it.
> >
> > John, what's the reindex time in your setup with the new search code?
> >
> > Thanks,
> >
> > - Zhen
> >
> >
> >
> >
> > On Tue, Feb 26, 2013 at 11:32 AM, Jeff Cousineau <cousinea at umich.edu>
> wrote:
> >>
> >> Hi Zhen,
> >>
> >> Just a quick question.  Since we don't have site search enabled for all
> >> sites, does the indexing process _only_ index content for those sites
> >> that it _is_ enabled for, or does it index all content regardless?  I
> >> was assuming the search index was only generated for the former, meaning
> >> it took ~6 hours to index content for a very small subset of all
> >> sites/resources.
> >>
> >> Thanks!
> >> Jeff
> >>
> >> On Feb 26, 2013, at 11:06 AM, Zhen Qian <zqian at umich.edu> wrote:
> >>
> >> Hi, John:
> >>
> >> Here is the result for CTools at UMich, with 10+ years of data:
> >>
> >> select count(resource_id) from content_resource
> >> 15652880
> >>
> >> select sum(file_size) from content_resource
> >>
> >> 15565111245559
> >>
> >> So it is 15 million docs, 16T in size. The last time the system was
> >> re-indexed with Sakai 2.7 search (7/2012), it took 6 hours. I don't
> >> have the search turnaround data at hand, though.
> >>
> >> Search is among our top load-testing candidates here at UMich. I
> >> hope we can do the load test soon. I think it is fine to get basic
> >> search (without facet support) working first.
> >>
> >> BTW, is there a wiki page for the elasticsearch project on the Sakai
> >> Confluence site?
> >>
> >> Thanks,
> >>
> >> - Zhen
> >>
> >>
> >> On Fri, Feb 22, 2013 at 12:15 PM, John Bush <john.bush at rsmart.com>
> wrote:
> >>>
> >>> I've been spending the last few weeks tweaking Sakai's elasticsearch
> >>> impl in order to scale better.  It would be helpful if folks could
> >>> give me an idea of the number of docs in their Sakai repos, and the
> >>> total size.  I'm sure this varies, but in general for our clients,
> >>> especially those that have been using Sakai for a while, I'm seeing
> >>> around 400-500k docs and nearly half a terabyte of data.
> >>>
> >>> You can simply run these queries to collect that info:
> >>>
> >>> select count(resource_id) from content_resource
> >>> select sum(file_size) from content_resource
> >>>
> >>> Currently using 4 medium-size nodes in AWS, with 35k docs and a repo
> >>> of 20GB, I'm getting search response times averaging around 150ms and
> >>> often much faster.  I've been doubling the repository size as I go,
> >>> and so far I'm not seeing much impact on performance, although I
> >>> imagine there is a point where that changes.
> >>>
> >>> The code in trunk does not scale well, so I will be making a big
> >>> commit once I have all the kinks ironed out.  It turns out that the
> >>> highlighting in ElasticSearch is slow, and it also greatly increased
> >>> the size of the repo.  I had to rewrite that piece to do my own
> >>> highlighting, similar to what we were doing in the legacy search.  The
> >>> side effect of that is that we no longer need to store the whole
> >>> source doc in ES, so the index size has dropped dramatically.
> >>> Right now I'm seeing an index size that is about half the size of the
> >>> repo.  I think I can get that down further, but it's significantly
> >>> better than the triple-the-repo size I was seeing before.
> >>>
> >>> I still have some work to do to fine-tune or eliminate the use of
> >>> facets.  Facets add an enormous memory requirement.  I think I can
> >>> eliminate this with some more careful indexing, which may end up
> >>> increasing the size of the index again, but I think that is a fair
> >>> trade-off vs. requiring significantly more RAM.  There is supposed to
> >>> be a way to rein in the memory consumption of these guys, but I have
> >>> yet to get that configuration working in practice.
> >>>
> >>> Attached is a screenshot, 17,618 hits in 0.118 seconds, not bad.
> >>>
> >>> --
> >>> John Bush
> >>> 602-490-0470
> >>>
> >>> _______________________________________________
> >>> sakai-dev mailing list
> >>> sakai-dev at collab.sakaiproject.org
> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>
> >>> TO UNSUBSCRIBE: send email to
> >>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
> >>> "unsubscribe"
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Jeff Cousineau
> >> Application Systems Administrator Senior
> >> Information and Technology Services
> >> University of Michigan
> >>
> >
>
>
>
> --
> John Bush
> 602-490-0470
>
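The size figures throughout the thread come from dividing byte counts by 1073741824, i.e. binary gigabytes (GiB). A quick sanity check of the two repository totals quoted above, written as an illustrative sketch:

```python
GIB = 1024 ** 3   # 1073741824, the divisor used in the queries above
TIB = 1024 ** 4

full_repo_bytes = 15565111245559    # sum(file_size) over all of content_resource
search_sites_bytes = 329016147091   # only the sites with the Search tool enabled

# ~14.2 TiB (about 15.6 TB decimal, hence the "16T" quoted above)
print(f"full repo:    {full_repo_bytes / TIB:.1f} TiB")
# ~306 GiB, i.e. the "0.3T" quoted above
print(f"search sites: {search_sites_bytes / GIB:.0f} GiB")
```

So restricting reindexing to search-enabled sites cuts the corpus by roughly a factor of 47.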

