[Building Sakai] Elastic Search (SRCH-111)

Fri Feb 22 10:53:28 PST 2013

Zhen, yes, I think that is correct regarding ACLs: there is no
support for that yet. The legacy search did not deal with ACLs
either. I do have plans to support that. Ian blogged about some
fancy stuff OAE had to do regarding this here:
http://blog.tfd.co.uk/2012/02/14/search-acls-part-2-simple-is-always-best/

I'm not sure we need to go this far, since in Sakai a principal is not
a group or a user. I think we could simply add to the index the
groups that have access to a doc, and then filter on that. I
don't think that list would ever get so large that we would reach a
query clause boundary, as was the case in OAE. ES has decent support
for putting lists into fields, although we will have to test how fast
that is.
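
To make the idea concrete, here is a rough sketch in Python (not code
from the Sakai integration). The field name `groups`, the helper names,
and the sample realm ids are all made up for illustration. The
`filtered` wrapper reflects the query DSL of ES releases from around
this time; later releases express the same thing with a `bool` query
and a `filter` clause.

```python
import json


def build_doc(content, groups):
    """Return an index document that carries the groups allowed to read it."""
    # "groups" is a hypothetical field name; a JSON array becomes a multi-valued field.
    return {"contents": content, "groups": sorted(set(groups))}


def build_query(text, user_groups):
    """Wrap a full-text query in a terms filter on the user's groups."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"contents": text}},
                # A doc passes if its groups field holds at least one of these values.
                "filter": {"terms": {"groups": sorted(set(user_groups))}},
            }
        }
    }


def visible(doc, user_groups):
    """Local stand-in for the terms filter: any overlap grants access."""
    return bool(set(doc["groups"]) & set(user_groups))


if __name__ == "__main__":
    doc = build_doc("week 3 lecture notes",
                    ["/site/bio101", "/site/bio101/group/lab-a"])
    print(json.dumps(build_query("lecture", ["/site/bio101/group/lab-a"]), indent=2))
    print(visible(doc, ["/site/chem200"]))
```

For ids like these to match exactly, the `groups` field would probably
need to be mapped so that it is not analyzed; otherwise the ids get
tokenized at index time.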

I think this might be fairly straightforward to implement. It's next
on my list once I get the basics to scale to a point I'm happy with.
One consideration is that we will have to capture realm changes in
order to keep the ACLs in sync in the index, but I think that is just
another event we need to capture, updating docs as necessary. It might
not be too bad.
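
A minimal sketch of what that sync could look like, again in Python and
purely illustrative. The in-memory `AclIndex`, the `lookup_groups`
callback, and the assumption that the relevant event is named
`realm.upd` are all mine; a real version would reissue index updates
to ES instead of mutating a dict.

```python
class AclIndex:
    """In-memory stand-in for the search index, keyed by document id."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, realm_id, groups):
        self.docs[doc_id] = {"realm": realm_id, "groups": sorted(groups)}

    def docs_in_realm(self, realm_id):
        return [d for d, doc in self.docs.items() if doc["realm"] == realm_id]


def on_event(index, event_name, realm_id, lookup_groups):
    """Refresh stored ACLs when a realm changes; return the updated doc ids."""
    # "realm.upd" is assumed here to be the event fired on a realm update.
    if event_name != "realm.upd":
        return []
    # lookup_groups is a hypothetical callback standing in for the authz lookup.
    groups = sorted(lookup_groups(realm_id))
    updated = index.docs_in_realm(realm_id)
    for doc_id in updated:
        index.docs[doc_id]["groups"] = groups
    return updated
```

The cost is proportional to the number of docs in the changed realm, so
a large site would mean a batch of updates for a single realm edit.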

On Fri, Jan 25, 2013 at 10:22 PM, Zhen Qian <zqian at umich.edu> wrote:
> Thanks, John. It does look like an empirical problem.
>
> Another question: how do you handle the acl of searchable items? I briefly
> looked through the source code, and it looks to me the permission control is
> on site level, and there is no support for group permission yet?
>
> Here is a question for Adrian: Is your SOLR integration work based on the
> recent Solr 4 release, which brings in many scalability improvements?
>
> Thanks,
>
> - Zhen
>
>
> On Fri, Jan 25, 2013 at 6:03 PM, John Bush <john.bush at rsmart.com> wrote:
>>
>> Yes, it uses the defaults, which based on what I read might be
>> reasonable for most people.  I believe that is 5 shards and 1 replica.
>> There is a lot of discussion about picking the optimal numbers if you
>> google around.
>>
>> You set them in your sakai.properties like this:
>>
>>  elasticsearch.index.number_of_shards=5
>>  elasticsearch.index.number_of_replicas=1
>>
>> In most cases I believe you want the number of shards to be set to
>> roughly how many nodes you might grow to.  You can't change that
>> number without a full reindex.  The number of replicas is adjustable
>> at runtime, so you could change that on the fly using the JSON API and
>> curl, for example.
>>
>> This video does a good job explaining the dynamics; if you have the
>> time, check it out:
>>
>> http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
>>
>> On Fri, Jan 25, 2013 at 1:41 PM, Zhen Qian <zqian at umich.edu> wrote:
>> > John:
>> >
>> > I am new to Elastic Search. When I look at the elastic search impl code, I
>> > cannot find the settings for shards or replicas per node. Is it using the
>> > default setting of ElasticSearch?
>> >
>> > Thanks,
>> >
>> > - Zhen
>> >
>> >
>> > On Fri, Jan 25, 2013 at 11:29 AM, John Bush <john.bush at rsmart.com>
>> > wrote:
>> >>
>> >> >
>> >> > If that's the case, then I agree entirely. It seems mad to be forced to
>> >> > cluster your sakai app servers just to scale your search thing. I'm not
>> >> > sure that's what he is saying though. ...Anybody serious about search
>> >> > will need an external search thing.
>> >>
>> >> Unless the use cases for search change dramatically, I find it hard
>> >> to imagine a case where you would need to add nodes just to handle
>> >> search.  Once things are indexed, that workload is not really
>> >> significant, and as you pointed out, when content is being created
>> >> and indexed on the fly it's not very significant either.  So I disagree
>> >> that an embedded approach can't just scale with the normal user load,
>> >> the only change being perhaps how many users you can fit on a node or
>> >> the amount of RAM.
>> >>
>> >> Maybe pounds work differently than dollars, but at the end of the day
>> >> this is all about cost.  If I were to go to any sane operations or IT
>> >> manager and say, if you want search to work in Sakai you can add some
>> >> more RAM to your existing app server nodes (or maybe do nothing), or
>> >> you can set up a new server and potentially a new cluster, which
>> >> option do you think they'd take?  Configuration, server deployment,
>> >> procurement of the machines, the knowledge around all that stuff all
>> >> amount to cost.  So this argument is not solely about which
>> >> architectures we like more or think might scale better; at the end of
>> >> the day it's about cost.  Personally, I think an embedded approach is
>> >> more cost effective.  For rSmart, which literally has hundreds of Sakai
>> >> nodes, a change in the cost structure of that magnitude is very
>> >> significant.  I realize for others the situation is different.
>> >>
>> >> The idea that search is somehow the bottleneck of the system that
>> >> warrants a new app node or that search activity is so great that it
>> >> poses overall risk to the node just isn't consistent with my
>> >> experience.  If you really wanted to protect users from risk, I'd
>> >> start with externalizing msgcntr and samigo.
>> >>
>> >> > Surely the integration is using the REST
>> >> > api, not the internal Java one? I think the embedded/external argument
>> >> > is moot.
>> >>
>> >> The integration uses the internal Java APIs, but that doesn't mean you
>> >> couldn't conceivably run ES as a separate server.  The code as is
>> >> doesn't support that yet.  It's certainly possible, but it's not
>> >> something I was ever planning on personally implementing, and I don't
>> >> see the usefulness of such a design.  Understand that even when ES is
>> >> embedded you can access the REST API directly with curl or whatever;
>> >> this is in fact how I typically work to create queries or do anything
>> >> administrative.
>> >>
>> >> --
>> >> John Bush
>> >> 602-490-0470
>> >>
>> >> _______________________________________________
>> >> sakai-dev mailing list
>> >> sakai-dev at collab.sakaiproject.org
>> >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>
>> >> TO UNSUBSCRIBE: send email to
>> >> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> >> "unsubscribe"
>> >
>> >
>>
>>
>>
>> --
>> John Bush
>> 602-490-0470
>
>

--
John Bush
602-490-0470
