[Building Sakai] Elastic Search (SRCH-111)

Wed Feb 27 19:04:28 PST 2013

I think realms might change quite often, especially if people limit content
based on groups and people move in and out of those groups.

Can you give a bit more context in the search so that the number of results
returned isn't quite so large? I.e. show me everything that jsmith26 can see
(would need to take into account all realms). Or show me everything that
jsmith26 can see in site A, limited to resources only.

This would be what Colin was saying above, where the engine does the
filtering to ensure the user can see the particular item returned in the
search, however that is accomplished.

Or are you saying there is a faster and easier way? How fast is a reindex
of a site with a lot of stuff?
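
To make the "site A, limited to resources only" case concrete, here is a minimal sketch of a scoped search body. The field names (`siteid`, `tool`, `contents`) are hypothetical, not taken from the Sakai index mapping, and the `bool`/`filter` form is the syntax of later Elasticsearch releases; a 2013-era build would more likely express the same thing as a filtered query.

```python
import json


def scoped_query(text, site_id, tool=None):
    """Build a search body limited to one site and, optionally, one tool."""
    # "siteid" and "tool" are assumed field names, for illustration only.
    filters = [{"term": {"siteid": site_id}}]
    if tool is not None:
        filters.append({"term": {"tool": tool}})
    return {
        "query": {
            "bool": {
                # Full-text match on an assumed "contents" field.
                "must": [{"match": {"contents": text}}],
                # Filters narrow the result set without affecting scoring.
                "filter": filters,
            }
        }
    }


body = scoped_query("syllabus", "site-A", tool="resources")
print(json.dumps(body, indent=2))
```

Because the narrowing happens inside the engine, the hit count already reflects the scope and nothing has to be trimmed afterwards.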

cheers,
Steve

On Thu, Feb 28, 2013 at 3:03 AM, John Bush <john.bush at rsmart.com> wrote:

> I think you just capture realm changes and update docs to manage that. Even
> if you have to reindex the whole site, that typically goes pretty fast.
> Steve, the problem with trying to filter is that you may get results on
> the order of 50k hits or something large.  You can't really cycle through
> all that to adjust the counts; it will be really, really slow.
>
> What I was thinking that is different from OAE is that we only need to
> store the groups, not individual users.  So really, you only have to adjust
> ACLs when a realm group is added or changed in a way that affects a
> content.read permission, and that probably doesn't happen very often.
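
A rough sketch of that check, under a deliberately simplified model: a realm is represented here as a plain dict mapping a role or group name to its set of permission names. That shape, and the function names, are assumptions for illustration and not Sakai's actual authz API. The point is only that documents need touching when the set of readers changes, not on every realm edit and not when users join or leave a group.

```python
READ = "content.read"


def readers(realm):
    """Return the roles/groups in a realm that grant content.read."""
    return {role for role, perms in realm.items() if READ in perms}


def needs_acl_update(old_realm, new_realm):
    """True only when the set of readers differs between two realm states."""
    return readers(old_realm) != readers(new_realm)


before = {
    "maintain": {"content.read", "content.new"},
    "access": {"content.read"},
}
# An unrelated permission is added: the indexed ACLs stay valid.
unrelated = {
    "maintain": {"content.read", "content.new"},
    "access": {"content.read", "chat.new"},
}
# content.read is revoked from a role: the affected docs must be updated.
revoked = {
    "maintain": {"content.read", "content.new"},
    "access": set(),
}

print(needs_acl_update(before, unrelated))
print(needs_acl_update(before, revoked))
```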
>
> Not sent with my iphone.
> On Feb 23, 2013 2:17 AM, "Steve Swinsburg" <steve.swinsburg at gmail.com>
> wrote:
>
>> You'd have to deal with group changes and reindexing wouldn't you? With
>> response times as fast as what you posted before, just filter them post
>> search?
>>
>> Sent from my iPhone
>>
>> On 23/02/2013, at 5:53, John Bush <john.bush at rsmart.com> wrote:
>>
>> > Zhen, yes, I think that is correct regarding ACLs: there is no
>> > support for that yet. The legacy search did not deal with ACLs
>> > either.  I do have plans to support that.  Ian blogged about some
>> > fancy stuff OAE had to do regarding this here:
>> >
>> http://blog.tfd.co.uk/2012/02/14/search-acls-part-2-simple-is-always-best/
>> >
>> > I'm not sure we need to go this far. Since in Sakai a principal is not
>> > a group or a user, I think we could simply add to the index the
>> > groups that have access to a doc, and then filter on that.  I
>> > don't think that list would ever get so large that we would reach a
>> > query clause boundary as was the case in OAE.  ES has decent support
>> > for putting lists into fields, although we will have to test how fast
>> > that is.
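
As an illustration of the idea above, the sketch below stores the groups allowed to read a document in a list field and filters on the searcher's groups. The `readers` and `contents` field names and the group ids are hypothetical, and the query uses the `bool`/`terms` syntax of later Elasticsearch releases rather than whatever the 2013 code would have used. The small `visible` helper only mimics what a terms filter does (match if any value overlaps) so the behaviour can be checked locally.

```python
def acl_filtered_query(text, user_groups):
    """Search body that only matches docs readable by one of user_groups."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"contents": text}}],
                # A terms clause matches when the field holds ANY listed value.
                "filter": [{"terms": {"readers": sorted(user_groups)}}],
            }
        }
    }


def visible(doc, user_groups):
    """Local stand-in for the terms filter: any overlap grants access."""
    return bool(set(doc["readers"]) & set(user_groups))


# Hypothetical indexed documents carrying their reader groups.
docs = [
    {"id": "1", "contents": "week one notes", "readers": ["/site/A"]},
    {"id": "2", "contents": "grading notes", "readers": ["/site/A/group/staff"]},
]

student_groups = {"/site/A"}
query = acl_filtered_query("notes", student_groups)
print(query)
print([d["id"] for d in docs if visible(d, student_groups)])
```

The number of clauses grows with the searcher's group count rather than with the number of users, which is why the list is unlikely to hit a clause limit.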
>> >
>> > I think this might be fairly straightforward to implement; it's next
>> > on my list once I get the basics to scale to a point I'm happy with.
>> > One consideration is we will have to capture realm changes in order to
>> > keep the ACLs in sync in the index, but I think that is just another
>> > event we need to capture and update docs as necessary, it might not be
>> > too bad.
>> >
>> > On Fri, Jan 25, 2013 at 10:22 PM, Zhen Qian <zqian at umich.edu> wrote:
>> >> Thanks, John. It does look like an empirical problem.
>> >>
>> >> Another question: how do you handle the acl of searchable items? I
>> >> briefly looked through the source code, and it looks to me the
>> >> permission control is on site level, and there is no support for
>> >> group permission yet?
>> >>
>> >> Here is a question for Adrian: Is your SOLR integration work based on
>> >> the recent Solr 4 release, which brings in many scalability improvements?
>> >>
>> >> Thanks,
>> >>
>> >> - Zhen
>> >>
>> >>
>> >> On Fri, Jan 25, 2013 at 6:03 PM, John Bush <john.bush at rsmart.com> wrote:
>> >>>
>> >>> Yes, it uses the defaults, which based on what I read might be
>> >>> reasonable for most people.  I believe that is 5 shards and 1 replica.
>> >>> There is a lot of discussion about picking the optimal numbers if you
>> >>> google around.
>> >>>
>> >>> You set them in your sakai.properties like this:
>> >>>
>> >>> elasticsearch.index.number_of_shards=5
>> >>> elasticsearch.index.number_of_replicas=1
>> >>>
>> >>> In most cases I believe you want the number of shards to be set to
>> >>> roughly how many nodes you might grow to.  You can't change that
>> >>> number without a full reindex.  The number of replicas is adjustable
>> >>> at runtime, so you could change that on the fly using the JSON API and
>> >>> curl, for example.
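
For reference, a runtime replica change goes through the index settings endpoint. The sketch below only builds the HTTP request and does not send it; the index name `sakai_index` and the `localhost:9200` address are assumptions, so substitute whatever your deployment actually uses.

```python
import json
import urllib.request


def replica_update_request(index, replicas, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT to the index _settings endpoint."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "sakai_index" is a placeholder index name.
req = replica_update_request("sakai_index", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would actually apply the change.
```

The shard count has no equivalent call, which matches the note above that changing it requires a full reindex.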
>> >>>
>> >>> This video does a good job explaining the dynamics; if you have the
>> >>> time, check it out:
>> >>>
>> >>>
>> http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
>> >>>
>> >>> On Fri, Jan 25, 2013 at 1:41 PM, Zhen Qian <zqian at umich.edu> wrote:
>> >>>> John:
>> >>>>
>> >>>> I am new to Elastic Search. When I look at the elastic search impl
>> >>>> code, I cannot find the settings for shards or replicas per node. Is
>> >>>> it using the default setting of ElasticSearch?
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> - Zhen
>> >>>>
>> >>>>
>> >>>> On Fri, Jan 25, 2013 at 11:29 AM, John Bush <john.bush at rsmart.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> If that's the case, then I agree entirely. It seems mad to be forced
>> >>>>>> to cluster your sakai app servers just to scale your search thing.
>> >>>>>> I'm not sure that's what he is saying though. ...Anybody serious
>> >>>>>> about search will need an external search thing.
>> >>>>>
>> >>>>> Unless the use cases for search change dramatically, I find it hard
>> >>>>> to imagine a case where you would need to add nodes just to handle
>> >>>>> search.  Once things are indexed, that workload is not really that
>> >>>>> significant, and as you pointed out, when content is being created
>> >>>>> and indexed on the fly it's not very significant either.  So I
>> >>>>> disagree that an embedded approach can't just scale with the normal
>> >>>>> user load, the only change being perhaps how many users you can fit
>> >>>>> on a node or RAM.
>> >>>>>
>> >>>>> Maybe pounds work differently than dollars, but at the end of the
>> >>>>> day this is all about cost.  If I were to go to any sane operations
>> >>>>> or IT manager and say, if you want search to work in Sakai you can
>> >>>>> add some more RAM to your existing app server nodes (or maybe do
>> >>>>> nothing), or you can set up a new server and potentially a new
>> >>>>> cluster, which option do you think they'd take?  Configuration,
>> >>>>> server deployment, procurement of the machines, the knowledge around
>> >>>>> all that stuff all amounts to cost.  So this argument is not solely
>> >>>>> about what architectures we like more or think might scale better;
>> >>>>> at the end of the day it's about cost.  Personally, I think an
>> >>>>> embedded approach is more cost effective.  For rSmart, which
>> >>>>> literally has hundreds of Sakai nodes, a change in the cost
>> >>>>> structure of that magnitude is very significant.  I realize for
>> >>>>> others the situation is different.
>> >>>>>
>> >>>>> The idea that search is somehow the bottleneck of the system that
>> >>>>> warrants a new app node or that search activity is so great that it
>> >>>>> poses overall risk to the node just isn't consistent with my
>> >>>>> experience.  If you really wanted to protect users from risk, I'd
>> >>>>> start with externalizing msgcntr and samigo.
>> >>>>>
>> >>>>>> Surely the integration is using the REST api, not the internal Java
>> >>>>>> one? I think the embedded/external argument is moot.
>> >>>>>
>> >>>>> The integration uses the internal Java APIs, but that doesn't mean
>> >>>>> you couldn't conceivably run ES as a separate server.  The code as is
>> >>>>> doesn't support that yet. It's certainly possible, but it's not
>> >>>>> something I was ever planning on personally implementing, and I
>> >>>>> don't see the usefulness of such a design.  Understand that even when
>> >>>>> ES is embedded you can access the REST API directly with curl or
>> >>>>> whatever; this is in fact how I typically work to create queries or
>> >>>>> do anything administrative.
>> >>>>>
>> >>>>> --
>> >>>>> John Bush
>> >>>>> 602-490-0470
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> sakai-dev mailing list
>> >>>>> sakai-dev at collab.sakaiproject.org
>> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>>>
>> >>>>> TO UNSUBSCRIBE: send email to
>> >>>>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> >>>>> "unsubscribe"
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> John Bush
>> >>> 602-490-0470
>> >
>> >
>> >
>> > --
>> > John Bush
>> > 602-490-0470
>> >
>>
>