[Building Sakai] ElasticSearch Testing

John Bush john.bush at rsmart.com
Tue Feb 26 10:37:52 PST 2013


Sakai uses the Tika API; out of the box it's wired up to support the following:

<entry key="application/vnd.oasis.opendocument.text"><value>application/vnd.oasis.opendocument.text</value></entry>
<entry key="application/vnd.oasis.opendocument.text-template"><value>application/vnd.oasis.opendocument.text-template</value></entry>
<entry key="application/vnd.oasis.opendocument.text-web"><value>application/vnd.oasis.opendocument.text-web</value></entry>
<entry key="application/vnd.oasis.opendocument.text-master"><value>application/vnd.oasis.opendocument.text-master</value></entry>
<entry key="application/vnd.oasis.opendocument.graphics"><value>application/vnd.oasis.opendocument.graphics</value></entry>
<entry key="application/vnd.oasis.opendocument.graphics-template"><value>application/vnd.oasis.opendocument.graphics-template</value></entry>
<entry key="application/vnd.oasis.opendocument.presentation"><value>application/vnd.oasis.opendocument.presentation</value></entry>
<entry key="application/vnd.oasis.opendocument.presentation-template"><value>application/vnd.oasis.opendocument.presentation-template</value></entry>
<entry key="application/vnd.oasis.opendocument.spreadsheet"><value>application/vnd.oasis.opendocument.spreadsheet</value></entry>
<entry key="application/vnd.oasis.opendocument.spreadsheet-template"><value>application/vnd.oasis.opendocument.spreadsheet-template</value></entry>
<entry key="application/vnd.oasis.opendocument.chart"><value>application/vnd.oasis.opendocument.chart</value></entry>
<entry key="application/vnd.oasis.opendocument.formula"><value>application/vnd.oasis.opendocument.formula</value></entry>
<entry key="application/vnd.oasis.opendocument.database"><value>application/vnd.oasis.opendocument.database</value></entry>
<entry key="application/vnd.oasis.opendocument.image"><value>application/vnd.oasis.opendocument.image</value></entry>
<entry key="application/vnd.openofficeorg.extension"><value>application/vnd.openofficeorg.extension</value></entry>
<entry key="audio/mpeg"><value>audio/mpeg</value></entry>
<entry key="audio/midi"><value>audio/midi</value></entry>

So it will pick up MPEG files.  I believe it essentially just reports
whatever metadata exists, which I would assume is in the header, so
it doesn't have to munge the whole stream to get at that.
I wouldn't expect there to be a lot of indexed data in
these cases.
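
For an MP3 that metadata is the ID3 tag at the front of the file.  A
minimal sketch in plain Python (illustrative only, not Sakai or Tika
code) showing that the tag header can be read from just the first 10
bytes, without touching the audio stream:

```python
def id3v2_info(first_bytes: bytes):
    """Parse an ID3v2 tag header from the first 10 bytes of an MP3 file.

    Returns (major_version, tag_size_in_bytes), or None if no ID3v2
    tag is present.  The tag size is stored as four 7-bit "syncsafe"
    bytes, so locating the metadata block needs only the file header,
    not the whole stream.
    """
    if len(first_bytes) < 10 or first_bytes[:3] != b"ID3":
        return None
    major = first_bytes[3]
    size = 0
    for b in first_bytes[6:10]:
        size = (size << 7) | (b & 0x7F)
    return major, size

# A fabricated header: "ID3", version 2.4.0, no flags, syncsafe size 257
header = b"ID3" + bytes([4, 0, 0]) + bytes([0, 0, 2, 1])
print(id3v2_info(header))  # (4, 257)
```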

On Tue, Feb 26, 2013 at 11:29 AM, Zhen Qian <zqian at umich.edu> wrote:
> Hi, John:
>
> Here is the result. Most items are doc, jpg, htm, pdf, ppt, etc. However,
> the media type mp4, mp3, m4b counts for the major portion of data size. Can
> Sakai search into the multimedia content?
>
> Thanks,
>
> - Zhen
>
> select lower(substr(t1.resource_id, instr(t1.resource_id, '.', -1) + 1)),
>        sum(t1.file_size)/1073741824, count(*)
> from content_resource t1, sakai_site_tool t2
> where t2.registration = 'sakai.search'
> and t1.context = t2.site_id
> group by lower(substr(t1.resource_id, instr(t1.resource_id, '.', -1) + 1))
> having count(*) > 100
> order by count(*) desc;
>
>
> doc 2.485681333579123020172119140625 7362
> jpg 0.9510437212884426116943359375 6153
> htm 0.011429451406002044677734375 5379
> pdf 7.185246647335588932037353515625 4917
> ppt 39.79564681090414524078369140625 4352
> html 0.012742788530886173248291015625 4162
> docx 0.63168414495885372161865234375 3202
> mp3 20.19908924587070941925048828125 1897
> mp4 203.55557925440371036529541015625 1694
> m4b 19.748263649642467498779296875 1670
> swf 0.19064153917133808135986328125 1442
> xls 0.5870045758783817291259765625 1435
> xlsx 0.268787070177495479583740234375 1233
> pptx 1.4153042249381542205810546875 968
> sis 0.000144752673804759979248046875 909
> flv 1.9373848550021648406982421875 663
> gif 0.00711118616163730621337890625 566
> xml 0.002142951823770999908447265625 452
> url 0.000029064714908599853515625 410
> rtf 0.01623086817562580108642578125 404
> png 0.032393396832048892974853515625 342
> js 0.003008228726685047149658203125 321
> fla 0.051361083984375 248
> mpp 0.1664981842041015625 221
> txt 0.147780408151447772979736328125 203
> rb 0.000320333056151866912841796875 144
> vsd 0.065235137939453125 142
> java 0.00031330995261669158935546875 136
> mov 0.98333104513585567474365234375 125
> css 0.000439577735960483551025390625 103
>
>
>
> On Tue, Feb 26, 2013 at 12:59 PM, John Bush <john.bush at rsmart.com> wrote:
>>
>> Ok, yeah, I was wondering how you did a 6-hour index with that much
>> data.  I'm still trying to get a timing for that with the full cluster
>> enabled; my belief is the index time will be about the same as it was
>> with the old search, since the bulk of the time is spent digesting the
>> content and that code isn't changing.  I think at one point I
>> calculated about 6-8 hours for a terabyte with 6-8 nodes.
>>
>> So here are a few more queries you can run to narrow down the content,
>> for mysql:
>>
>> SELECT
>>     REPLACE(RIGHT(resource_id, 4),'.','') as type ,
>>     sum(file_size)/1073741824 as size, count(*)
>> FROM
>>     content_resource
>> GROUP BY
>>     type
>> ORDER BY
>>     size desc;
>>
>> or in oracle:
>>
>> select lower(substr(resource_id, instr(resource_id, '.', -1) + 1)),
>>     sum(file_size)/1073741824, count(*)
>> from content_resource
>> group by lower(substr(resource_id, instr(resource_id, '.', -1) + 1))
>> having count(*) > 100
>> order by count(*) desc;
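
The extension rule both of these queries rely on is just "everything
after the last dot"; the same logic as a standalone Python sketch
(illustrative only, not Sakai code):

```python
def file_ext(resource_id: str) -> str:
    # Same rule as the Oracle substr/instr expression: everything
    # after the last '.' in the resource id, lower-cased.
    dot = resource_id.rfind(".")
    return resource_id[dot + 1:].lower() if dot != -1 else ""

print(file_ext("/group/site1/Lecture1.PPTX"))  # pptx
print(file_ext("/group/site1/audio.mp3"))      # mp3
```

One caveat: the MySQL variant's RIGHT(resource_id, 4) only looks at
the last four characters, so it silently truncates any extension
longer than four characters; the instr(..., '.', -1) form doesn't
have that limitation.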
>>
>> This will give you a breakdown by file extension.  I'm seeing a lot
>> of content in PDFs, typically.  I would imagine the majority of PDF
>> content is not actually text but images and the like, so even that is
>> misleading.
>>
>> On Tue, Feb 26, 2013 at 10:21 AM, Zhen Qian <zqian at umich.edu> wrote:
>> > A correction to the previously reported data:
>> >
>> > Jeff pointed out that CTools only has a handful of sites that are
>> > search-enabled.
>> >
>> > With the following setting for CTools, the search reindex will only
>> > target
>> > those sites with search tool enabled.
>> >
>> >
>> > onlyIndexSearchToolSites at org.sakaiproject.search.api.SearchIndexBuilder=true
>> >
>> > So here are the new queries and results:
>> >
>> > select  count(*)
>> > from sakai_site_tool
>> > where registration='sakai.search'
>> >
>> > 112
>> >
>> > select  count(t1.resource_id), sum(t1.file_size)
>> > from content_resource t1, sakai_site_tool t2
>> > where t2.registration='sakai.search'
>> > and t1.context=t2.site_id
>> >
>> > 55253 329016147091
>> >
>> > There are 112 sites with the Search tool, which translates to 55K
>> > docs and 0.3 TB of data. It took the system 6 hours to reindex all of
>> > that data.
>> >
>> > John, what's the reindex time in your setup with the new search code?
>> >
>> > Thanks,
>> >
>> > - Zhen
>> >
>> >
>> >
>> >
>> > On Tue, Feb 26, 2013 at 11:32 AM, Jeff Cousineau <cousinea at umich.edu>
>> > wrote:
>> >>
>> >> Hi Zhen,
>> >>
>> >> Just a quick question.  Since we don't have site search enabled for
>> >> all sites, does the indexing process _only_ index content for those
>> >> sites that it _is_ enabled for, or does it index all content
>> >> regardless?  I guess I was assuming the search index was only
>> >> generated for the former, meaning it took ~6 hours to index content
>> >> for a very small subset of all sites/resources.
>> >>
>> >> Thanks!
>> >> Jeff
>> >>
>> >> On Feb 26, 2013, at 11:06 AM, Zhen Qian <zqian at umich.edu> wrote:
>> >>
>> >> Hi, John:
>> >>
>> >> Here is the result for CTools in UMich, with 10+ years of data:
>> >>
>> >> select count(resource_id) from content_resource
>> >> 15652880
>> >>
>> >> select sum(file_size) from content_resource
>> >>
>> >> 15565111245559
>> >>
>> >> So it is 15 million docs with 16T in size. Last time (7/2012) when
>> >> the system was re-indexed with the Sakai 2.7 search, it took 6
>> >> hours. I don't have the search turnaround data at hand, though.
>> >>
>> >> Search is among the top of our load testing candidates here in UMich. I
>> >> hope we can do the load test soon. I think it is fine to get the basic
>> >> search (without facet support) working first.
>> >>
>> >> BTW, is there a wiki page for the elastic search project on Sakai
>> >> confluence site?
>> >>
>> >> Thanks,
>> >>
>> >> - Zhen
>> >>
>> >>
>> >> On Fri, Feb 22, 2013 at 12:15 PM, John Bush <john.bush at rsmart.com>
>> >> wrote:
>> >>>
>> >>> I've been spending the last few weeks tweaking Sakai's elasticsearch
>> >>> impl so it scales better.  It would be helpful if folks could
>> >>> give me an idea of the number of docs in their Sakai repos, and the
>> >>> total size.  I'm sure this varies, but in general for our clients,
>> >>> especially those that have been using Sakai for a while, I'm seeing
>> >>> around 400-500k docs and nearly half a terabyte of data.
>> >>>
>> >>> You can simply run these queries to collect that info:
>> >>>
>> >>> select count(resource_id) from content_resource
>> >>> select sum(file_size) from content_resource
>> >>>
>> >>> Currently, using 4 medium-size nodes in AWS with 35k docs and a repo
>> >>> of 20GB, I'm getting search response times averaging around 150ms,
>> >>> and often much faster.  I'm doubling the repository size as I go and
>> >>> so far not seeing much impact on performance, although I imagine
>> >>> there is a point where that changes.
>> >>>
>> >>> The code in trunk does not scale well, so I will be making a big
>> >>> commit once I have all the kinks ironed out.  It turns out that the
>> >>> highlighting in ElasticSearch is slow, and it also greatly increased
>> >>> the size of the index.  I had to rewrite that piece to do my own
>> >>> highlighting, similar to what we were doing in the legacy search.  The
>> >>> side effect of that is that we no longer need to store the whole
>> >>> source doc in ES, so the index size has dropped dramatically.
>> >>> Right now I'm seeing an index size that is about half the size of the
>> >>> repo.  I think I can get that down further, but it's significantly
>> >>> better than the triple-the-repo size I was seeing before.
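
For reference, "no longer storing the whole source doc in ES"
corresponds to disabling the `_source` field in the index mapping.
A sketch only, using the ES 0.x-era mapping syntax; the type and
field names here are assumptions, not Sakai's actual mapping:

```json
{
  "sakai_doc": {
    "_source": { "enabled": false },
    "properties": {
      "contents": { "type": "string", "store": "no" },
      "title":    { "type": "string", "store": "yes" }
    }
  }
}
```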
>> >>>
>> >>> I still have some work to do to fine-tune or eliminate the use of
>> >>> facets.  Facets add an enormous memory requirement.  I think I can
>> >>> eliminate this with some more careful indexing, which may end up
>> >>> increasing the size of the index again, but I think that is a fair
>> >>> trade-off versus requiring significantly more RAM.  There is supposed
>> >>> to be a way to rein in the memory consumption of facets, but I have
>> >>> yet to get that configuration working in practice.
>> >>>
>> >>> Attached is a screenshot, 17,618 hits in 0.118 seconds, not bad.
>> >>>
>> >>> --
>> >>> John Bush
>> >>> 602-490-0470
>> >>>
>> >>> _______________________________________________
>> >>> sakai-dev mailing list
>> >>> sakai-dev at collab.sakaiproject.org
>> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>
>> >>> TO UNSUBSCRIBE: send email to
>> >>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> >>> "unsubscribe"
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Jeff Cousineau
>> >> Application Systems Administrator Senior
>> >> Information and Technology Services
>> >> University of Michigan
>> >>
>> >
>>
>>
>>
>> --
>> John Bush
>> 602-490-0470
>>



--
John Bush
602-490-0470
