[Building Sakai] Elastic Search (SRCH-111)

Thu Feb 28 04:15:17 PST 2013

I did this before with a video portal system and SOLR, and I was
familiar with the OAE way of doing things. I would say that indexing
the groups/subgroups (role in CLE) along with restricted content would
pretty much cover you, so your plan sounds solid to me.

-AZ
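
As a rough illustration of the plan discussed in the quoted thread below (store the groups that can read each document in the index, then filter on that field at query time), here is a minimal sketch. It only builds a query body and does not talk to a server. The field names `groups` and `contents` and the example group ids are assumptions made for illustration, not the names used by the Sakai integration, and the `bool`/`filter` wrapper follows the current Elasticsearch query DSL rather than whatever the 2013 code would have used.

```python
def build_group_filtered_query(search_terms, user_groups,
                               acl_field="groups", text_field="contents"):
    """Return a search body that matches `search_terms` but only in documents
    whose `acl_field` lists at least one of the caller's groups.

    `acl_field` and `text_field` are hypothetical field names.
    """
    return {
        "query": {
            "bool": {
                # Full-text part of the search.
                "must": {"match": {text_field: search_terms}},
                # ACL part: keep only docs readable by one of the user's groups.
                "filter": {"terms": {acl_field: sorted(user_groups)}},
            }
        }
    }


# Hypothetical group ids for one user.
body = build_group_filtered_query("syllabus", {"/site/abc", "/site/abc/group/g1"})
```

Because the filter runs inside the search itself, the hit counts already reflect what the user is allowed to see, which avoids cycling through a large result set after the fact.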

On Wed, Feb 27, 2013 at 11:03 AM, John Bush <john.bush at rsmart.com> wrote:
> I think you just capture realm changes and update docs to manage that. Even
> if you have to reindex the whole site, that typically goes pretty fast.
> Steve, the problem with trying to filter is that you may get results on the
> order of 50k hits or something large.  You can't really cycle through all
> that to adjust the counts; it would be really, really slow.
>
> What I was thinking that is different from OAE is that we only need to store
> the groups, not individual users.  So really, you only have to adjust ACLs
> when a realm group is added or changed in a way that affects a content.read
> permission, which probably doesn't happen very often.
>
> Not sent with my iphone.
>
> On Feb 23, 2013 2:17 AM, "Steve Swinsburg" <steve.swinsburg at gmail.com>
> wrote:
>>
>> You'd have to deal with group changes and reindexing, wouldn't you? With
>> response times as fast as what you posted before, why not just filter them
>> post-search?
>>
>> Sent from my iPhone
>>
>> On 23/02/2013, at 5:53, John Bush <john.bush at rsmart.com> wrote:
>>
>> > Zhen, yes, I think that is correct regarding ACLs: there is no
>> > support for that yet. The legacy search did not deal with ACLs
>> > either.  I do have plans to support that.  Ian blogged about some
>> > fancy stuff OAE had to do regarding this here:
>> >
>> > http://blog.tfd.co.uk/2012/02/14/search-acls-part-2-simple-is-always-best/
>> >
>> > I'm not sure we need to go this far. Since in Sakai a principal is not
>> > a group or a user, I think we could simply add to the index the
>> > groups that have access to a doc, and then filter on that.  I
>> > don't think that list would ever get so large that we would reach a
>> > query clause boundary, as was the case in OAE.  ES has decent support
>> > for putting lists into fields, although we will have to test how fast
>> > that is.
>> >
>> > I think this might be fairly straightforward to implement; it's next
>> > on my list once I get the basics to scale to a point I'm happy with.
>> > One consideration is that we will have to capture realm changes in order
>> > to keep the ACLs in sync in the index, but I think that is just another
>> > event we need to capture, updating docs as necessary. It might not be
>> > too bad.
>> >
>> > On Fri, Jan 25, 2013 at 10:22 PM, Zhen Qian <zqian at umich.edu> wrote:
>> >> Thanks, John. It does look like an empirical problem.
>> >>
>> >> Another question: how do you handle the ACL of searchable items? I
>> >> briefly looked through the source code, and it looks to me like the
>> >> permission control is at the site level, and there is no support for
>> >> group permissions yet?
>> >>
>> >> Here is a question for Adrian: Is your SOLR integration work based on
>> >> the recent Solr 4 release, which brings in many scalability improvements?
>> >>
>> >> Thanks,
>> >>
>> >> - Zhen
>> >>
>> >>
>> >> On Fri, Jan 25, 2013 at 6:03 PM, John Bush <john.bush at rsmart.com>
>> >> wrote:
>> >>>
>> >>> Yes, it uses the defaults, which based on what I read might be
>> >>> reasonable for most people.  I believe that is 5 shards and 1 replica.
>> >>> There is a lot of discussion about picking the optimal numbers if you
>> >>> google around.
>> >>>
>> >>> You set them in your sakai.properties like this:
>> >>>
>> >>> elasticsearch.index.number_of_shards=5
>> >>> elasticsearch.index.number_of_replicas=1
>> >>>
>> >>> In most cases I believe you want the number of shards to be set to
>> >>> roughly how many nodes you might grow to.  You can't change that
>> >>> number without a full reindex.  The number of replicas is adjustable
>> >>> at runtime, so you could change that on the fly using the JSON API and
>> >>> curl, for example.
>> >>>
>> >>> This video does a good job explaining the dynamics; if you have the
>> >>> time, check it out:
>> >>>
>> >>>
>> >>> http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
>> >>>
>> >>> On Fri, Jan 25, 2013 at 1:41 PM, Zhen Qian <zqian at umich.edu> wrote:
>> >>>> John:
>> >>>>
>> >>>> I am new to Elastic Search. When I look at the elastic search impl
>> >>>> code, I cannot find the settings for shards or replicas per node. Is
>> >>>> it using the default setting of ElasticSearch?
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> - Zhen
>> >>>>
>> >>>>
>> >>>> On Fri, Jan 25, 2013 at 11:29 AM, John Bush <john.bush at rsmart.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> If that's the case, then I agree entirely. It seems mad to be
>> >>>>>> forced to cluster your sakai app servers just to scale your search
>> >>>>>> thing. I'm not sure that's what he is saying though. ...Anybody
>> >>>>>> serious about search will need an external search thing.
>> >>>>>
>> >>>>> Unless the use cases for search change dramatically, I find it hard
>> >>>>> to imagine a case where you would need to add nodes just to handle
>> >>>>> search.  Once things are indexed, that workload is not really
>> >>>>> significant, and as you pointed out, when content is being created
>> >>>>> and indexed on the fly it's not very significant either.  So I
>> >>>>> disagree that an embedded approach can't just scale with the normal
>> >>>>> user load, the only change being perhaps how many users you can fit
>> >>>>> on a node or how much RAM it needs.
>> >>>>>
>> >>>>> Maybe pounds work differently than dollars, but at the end of the
>> >>>>> day this is all about cost.  If I were to go to any sane operations
>> >>>>> or IT manager and say, "If you want search to work in Sakai you can
>> >>>>> add some more RAM to your existing app server nodes (or maybe do
>> >>>>> nothing), or you can set up a new server and potentially a new
>> >>>>> cluster," which option do you think they'd take?  Configuration,
>> >>>>> server deployment, procurement of the machines, the knowledge around
>> >>>>> all that stuff: it all amounts to cost.  So this argument is not
>> >>>>> solely about what architectures we like more or think might scale
>> >>>>> better; at the end of the day it's about cost.  Personally, I think
>> >>>>> an embedded approach is more cost effective.  For rSmart, which
>> >>>>> literally has hundreds of Sakai nodes, a change in the cost structure
>> >>>>> of that magnitude is very significant.  I realize for others the
>> >>>>> situation is different.
>> >>>>>
>> >>>>> The idea that search is somehow the bottleneck of the system that
>> >>>>> warrants a new app node or that search activity is so great that it
>> >>>>> poses overall risk to the node just isn't consistent with my
>> >>>>> experience.  If you really wanted to protect users from risk, I'd
>> >>>>> start with externalizing msgcntr and samigo.
>> >>>>>
>> >>>>>> Surely the integration is using the REST API, not the internal Java
>> >>>>>> one? I think the embedded/external argument is moot.
>> >>>>>
>> >>>>> The integration uses the internal Java APIs, but that doesn't mean
>> >>>>> you couldn't conceivably run ES as a separate server.  The code as is
>> >>>>> doesn't support that yet. It's certainly possible, but it's not
>> >>>>> something I was ever planning on personally implementing, and I don't
>> >>>>> see the usefulness of such a design.  Understand that even when ES is
>> >>>>> embedded you can access the REST API directly with curl or whatever;
>> >>>>> this is in fact how I typically work to create queries or do
>> >>>>> anything administrative.
>> >>>>>
>> >>>>> --
>> >>>>> John Bush
>> >>>>> 602-490-0470
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> sakai-dev mailing list
>> >>>>> sakai-dev at collab.sakaiproject.org
>> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>>>
>> >>>>> TO UNSUBSCRIBE: send email to
>> >>>>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> >>>>> "unsubscribe"
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> John Bush
>> >>> 602-490-0470
>> >
>> >
>> >
>> > --
>> > John Bush
>> > 602-490-0470

-- 
Aaron Zeckoski - Software Architect - http://tinyurl.com/azprofile
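
Earlier in the quoted thread, John notes that the shard count is fixed without a full reindex, while the number of replicas can be changed at runtime through the JSON API. A minimal sketch of that runtime change follows. It builds the HTTP request without sending it; the index name `sakai_index` is a made-up placeholder, and `localhost:9200` assumes the node is listening on the default Elasticsearch HTTP port.

```python
import json
import urllib.request


def build_replica_update(index_name, replicas, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT to the index settings endpoint that
    changes number_of_replicas. `index_name` and `base_url` are assumptions."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# "sakai_index" is a hypothetical index name.
request = build_replica_update("sakai_index", 2)
# To apply it against a running node: urllib.request.urlopen(request)
```

The same body can be sent with curl as a PUT to the index's `_settings` endpoint. The shard count has no equivalent call, which is why it is worth choosing carefully up front.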