[Building Sakai] Elastic Search (SRCH-111)

Colin Hebert colin.hebert at it.ox.ac.uk
Fri Jan 25 03:26:30 PST 2013


This is no different from maintaining support for different
databases (which is already what is done in Sakai).

Regarding the googling, I must say that this is a really bold
statement, and I would really like to see some sources explaining it. I
have nothing against Elastic Search, and I think it has some big
advantages over Solr on some points (schema-less, easy to
embed), but the idea that it's better when reindexing frequently seems
unfounded to me.
And even if it were the case (I still don't think it is), reindexing
isn't the problem, as it doesn't happen (or only rarely) in Sakai; almost
nothing is actually reindexed. Most of the time, documents are indexed
only once, and a reindexation should only occur when the index is
corrupted in one way or another. So I disagree with you saying that
it's the obvious scenario in Sakai. Still, a full reindexation should be
as fast as possible when it is triggered, for obvious
reasons.
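
To make the "as fast as possible" part a bit more concrete, here is a
rough sketch of the idea of splitting one "reindex everything" job into
one small task per site. It is only an illustration: the names
(FullReindexSketch, SiteIndexer, reindexAll) are made up for this mail
and are not taken from Sakai or from either search implementation, a
local thread pool stands in for the Sakai instances that would share
the work, and it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from Sakai itself.
final class FullReindexSketch {

    /** Stand-in for whatever indexes one site; returns the number of documents indexed. */
    interface SiteIndexer {
        int reindexSite(String siteId) throws Exception;
    }

    /** Splits "reindex everything" into one task per site and runs them on a pool. */
    static int reindexAll(List<String> siteIds, SiteIndexer indexer, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String siteId : siteIds) {
                tasks.add(() -> indexer.reindexSite(siteId));
            }
            int total = 0;
            // invokeAll blocks until every per-site task has completed.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reindexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A site failed to reindex", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Something like FullReindexSketch.reindexAll(siteIds, indexer, 4) would
then run up to four sites at a time; with an external queue the same
per-site tasks could be picked up by different Sakai instances instead.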

When it comes to the current two new implementations of search, they
are far from being incompatible, and both are useful for different use
cases (John might want to jump in and give his point of view on that).

In my opinion the ElasticSearch implementation has the huge benefit of
being embedded, and John is right when stating that it requires less
work to set up. This is perfect for a quick deployment with absolutely
no administration, for example for a development setup. It's really
similar to using an embedded database to limit the number of steps
required to set up the environment.
On the other hand, when it comes to scaling, it suffers from being too
tied to Sakai: the index requires a new instance of Sakai to
scale. For the same reason, most people end up using a MySQL server (or
another external database) in production (which is never referred to
as a "MySQL farm" with the negative connotation of "farm", even if
there are multiple servers for load balancing and redundancy), while
they might prefer to have an embedded database in other cases.

The current implementation using Solr isn't embedded, and requires
some setup, which is fine when used in production (you should only have
to set it up once), but I can understand that it's annoying to set up for
developers (by the way, for this kind of problem a Vagrant & Puppet
solution should take care of the "complicated part" of your setup).
But when you need, for some reason, to scale your index (the same way
you would with a database), you just need to set up one more node
instead of adding a new instance of Sakai.
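
The same "embedded by default, external when you need to scale" choice
applies to the task queue mentioned further down the thread (in memory
and dequeued by an ExecutorService, or delegated to an external system
such as an AMQP server). Below is a minimal sketch of that abstraction,
assuming Java 8 or later. The names (TaskQueueSketch, Task, TaskQueue,
InMemoryTaskQueue) are invented for this mail and are not the actual
API of the Solr implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical names throughout; this is not the API of the Solr implementation.
final class TaskQueueSketch {

    /** A unit of indexing work, e.g. "index document X" or "reindex site Y". */
    interface Task {
        void execute();
    }

    /** The only thing the search code sees: somewhere to put tasks. */
    interface TaskQueue {
        void submit(Task task);
    }

    /** Default: tasks stay in this JVM and are run by a small thread pool. */
    static final class InMemoryTaskQueue implements TaskQueue, AutoCloseable {
        private final ExecutorService workers;

        InMemoryTaskQueue(int threads) {
            this.workers = Executors.newFixedThreadPool(threads);
        }

        @Override
        public void submit(Task task) {
            workers.execute(task::execute);
        }

        /** Stops accepting tasks and waits briefly for queued ones to finish. */
        @Override
        public void close() {
            workers.shutdown();
            try {
                workers.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // An external implementation (for example one publishing to an AMQP broker)
    // would implement TaskQueue too, serializing the task instead of running it
    // locally, so that whichever instance consumes the message does the work.
}
```

The point of the interface is that the code submitting tasks does not
change when the in-memory queue is swapped for an external one; only
the wiring does.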


I also think that it's strange that people consider a search index as
something completely different from a database. In the end,
the index will have the same pros and cons as a database, embedded
or not, and will behave the same way when you need to scale (it's a
data storage system). We all know that the database used by Sakai
doesn't scale the same way Sakai itself scales; there is no reason to
assume that a search index would.

Cheers,
Colin

On 25 January 2013 10:05, Adrian Fish <adrian.r.fish at gmail.com> wrote:
> At 2.10 release, will it still be worth the effort of maintaining two
> interfaces, one to ES and one to SOLR?
>
> It seems, from initial unscientific Googling, that ES is gaining traction
> and that it is the preferred solution for scenarios where reindexing occurs
> frequently. That scenario is obviously the Sakai one, especially when you
> hook up tools like Forums and start indexing forum posts on a large Sakai
> install. If the aim of the ES and SOLR interfaces is to enable the hooking
> up of Sakai to pre-existing, potentially large, SOLR or ES farms, then yes,
> it obviously will be worth the effort to maintain two.
>
> Cheers,
> Adrian.
>
>
> On 24 January 2013 23:42, John Bush <john.bush at rsmart.com> wrote:
>>
>> " Right now if I understand your code correctly, a complete
>> reindexation would be executed on only one instance of Sakai, and the
>> indexation operation isn't run on the least busy instance but on the
>> one catching the event. (To be fair, this is also what we have right
>> now in Oxford because we are queuing Tasks in memory.)
>> But as the Tasks queuing system we're using is abstract enough, it's
>> possible to make an implementation (there is already one, but it's not
>> in production yet) which delegates the queuing to an external system
>> (you don't have to use it but you can and should)."
>>
>> Actually no.  The way it works for either a site index or full index
>> is that the node capturing the event will queue up resources for indexing.
>>  This actually goes very fast.  The bulk of the indexing work is
>> actually in digesting the content.  What we do is put in ES the
>> metadata without content.  Then each node has a recurring job that fires
>> and looks for the next batch of docs without content.  This way
>> we spread that work over the cluster.  It's true that initial queuing
>> is done on one node.  I think testing will tell us if that is a
>> problem or not, could be.  That is certainly something to consider.  I
>> understand the risk of each node doing indexing work.  I've been able
>> to mitigate that load by simply adjusting the thread priority of the
>> worker doing that.  It seems in practice that works fine to give
>> priority to user request threads and let that work happen in the
>> background.
>>
>> You have some valid points here, certainly there are differences of
>> opinion regarding system architecture.  We would like to find a way to
>> see Sakai scale horizontally.  Right now the session state issue is
>> the biggest hangup in regards to that.  I've been researching Unicon's
>> work on that from back in 2008 for a while, and I'm anxious to find a way
>> to really solve that universally.  Because once sessions transfer
>> across a node, I think a lot of your arguments become less important.
>>
>> On Thu, Jan 24, 2013 at 11:07 AM, Colin Hebert <colin.hebert at it.ox.ac.uk>
>> wrote:
>> > - Using an embedded elastic search by default:
>> > I agree, having an embedded solution by default is nicer than
>> > requiring an external service. This could also be done by Solr, but I
>> > don't really see the point of it if the ES impl is already doing that.
>> >
>> > - It's easier to scale when it's embedded:
>> > I disagree with that one, I can't back that thought with a lot of
>> > experience, but to me it would be really strange to have performance
>> > issues with my indexing server and say "let's add another instance of
>> > Sakai to solve the problem". If the problem comes with the indexing
>> > server, you would need another instance of that service, not another
>> > instance of Sakai.
>> > I would probably not use instances of Derby (or any DB I could embed
>> > and that supports replication) in Sakai as my database just because I
>> > can spawn a new database with each instance of Sakai, and scale my
>> > database by creating more instances of Sakai. I'm not sure that making
>> > Sakai a monolithic application is the right way to go (even if it
>> > scales), that's just my opinion though.
>> > I also do not like the idea that the performance of the index
>> > could have an impact on an instance of Sakai (using too much
>> > CPU/memory/disk space) and vice-versa; not having embedded instances
>> > of the index allows more control over that.
>> > And I don't agree that it makes it easier to set up
>> > either (it's not that hard in the first place), but that's probably a
>> > subjective opinion.
>> >
>> > - It requires a server farm and a lot of stuff:
>> > Not really, now with VMs and cloud computing it's actually easier to
>> > manage, but then again I'm not a sys-admin and this is not something I
>> > do on a daily basis.
>> > And it doesn't require new services. I recommend using those services,
>> > though. Right now if I understand your code correctly, a complete
>> > reindexation would be executed on only one instance of Sakai, and the
>> > indexation operation isn't run on the least busy instance but on the
>> > one catching the event. (To be fair, this is also what we have right
>> > now in Oxford because we are queuing Tasks in memory.)
>> > But as the Tasks queuing system we're using is abstract enough, it's
>> > possible to make an implementation (there is already one, but it's not
>> > in production yet) which delegates the queuing to an external system
>> > (you don't have to use it but you can and should).
>> > This way, instead of having the current server executing the task, it
>> > will be one of the servers with a low load that will be in charge of
>> > that task. And when it comes to reindexing everything, the "reindex
>> > everything" task is split into a lot of smaller tasks "reindex site X"
>> > spread across all the Sakai instances (it's really neat). But as I
>> > said, you don't have to use it, you just can if you think it's
>> > something that will make things faster.
>> >
>> >
>> > Regarding the indexing time, I totally agree with you, one of the
>> > things I would like to see disappear is the fear of losing the entire
>> > index because a complete reindexation would only take a few
>> > minutes/hours.
>> >
>> >
>> > On 24 January 2013 16:18, John Bush <john.bush at rsmart.com> wrote:
>> >> Based on my work and what I've seen around the solr work, I think we
>> >> are well poised to simply create some configuration that makes a switch
>> >> simple and easy.  Something we can do from sakai.properties and not by
>> >> modifying spring config would be the goal.
>> >>
>> >> The decision to move it in was made at the unconference in Phoenix
>> >> last week after I showed a few TCC members the work.  There probably
>> >> could have been more transparency about it, but everyone agreed this
>> >> was a good out of the box experience and the transition from the
>> >> legacy system should be easy.  Beth is finding some issues, and I'm
>> >> confident I can address those quickly.
>> >>
>> >> In terms of the default experience, I don't think solr is an option
>> >> unless I'm wrong, as it requires a separate search server.  So I think
>> >> ES is more well suited to be the default.  We don't even require that
>> >> for the database right now.
>> >>
>> >> This work was initiated by Ian's blog, and some conversations I had
>> >> with him along the way.  I think there are probably a lot of ways these
>> >> two efforts can join forces.  I realize system architecture varies
>> >> from organization to organization.  Our goal is to have a search impl
>> >> that can scale from small to large without ever needing to reengineer
>> >> anything.  That is what I think ES delivers, you simply add or remove
>> >> nodes as capacity grows and shrinks.
>> >>
>> >> Now others might prefer a separate server farm and some message queues
>> >> and a bunch of stuff like that.  I guess it might be helpful to
>> >> understand who other than oxford is really interested in supporting
>> >> that heavy weight of a Sakai installation.  Because while it isn't
>> >> probably a huge amount of work to align these implementations to be
>> >> easily switchable, it is work.
>> >>
>> >> In terms of load testing we've performed some small amount of testing
>> >> mostly to validate the cluster behavior and get some idea how long
>> >> reindexing a terabyte of data might take.  I don't have any official
>> >> reports to share, but the search times have been impressive, and I
>> >> think the indexing time is acceptable.  Our goal over the next few
>> >> weeks is to produce some benchmarks comparing ES to the old search; as
>> >> the information becomes available I'll share it.
>> >>
>> >> On Thu, Jan 24, 2013 at 7:31 AM, Colin Hebert
>> >> <colin.hebert at it.ox.ac.uk> wrote:
>> >>> In the code written for the Solr implementation we added the
>> >>> possibility of choosing which implementation of search is used,
>> >>> allowing users to keep the current search index and to avoid forcing
>> >>> them to move right away and allowing other implementations to be used.
>> >>>
>> >>> There are also multiple additions that I think could be nice to
>> >>> include in the Elastic Search implementation:
>> >>>  - The Task system (done with TimerTasks in the ES impl), which in the
>> >>> current code of the Solr impl allows Tasks to be submitted to a queue. This
>> >>> queue can either be an in-memory queue, dequeued by a couple of
>> >>> ExecutorServices, or an external queue (such as an AMQP server [yay,
>> >>> scalability]).
>> >>> Another nice thing we've done is allowing the indexing server to do
>> >>> the text extraction (Solr Cell or Attachment Type in ES).
>> >>>
>> >>> I'm also a bit curious: is this implementation of Search with ES
>> >>> related in any way to the work of Ian Boston (
>> >>> http://blog.tfd.co.uk/2012/10/11/sakai-cle-elasticsearch/ )?
>> >>>
>> >>> Anyway the code for SolrSearch is still available (
>> >>> https://github.com/ColinHebert/Sakai-Solr ) if anyone wants to take a
>> >>> look at it. We're also doing a BOF on the subject of Search next week
>> >>> during the EuroSakai conference in Paris to talk about what could be
>> >>> done with search and how to improve it further.
>> >>>
>> >>> On another note, as I said when we started talking about the Solr
>> >>> Implementation, the API could be easier to implement if rewritten.
>> >>> This has been done ( https://github.com/ColinHebert/Sakai-Search2 )
>> >>> and is currently just in need of a nice UI probably designed by
>> >>> someone who has some knowledge of UXP in search.
>> >>>
>> >>> Colin Hebert
>> >>>
>> >>> On 24 January 2013 14:12, Adam Marshall <adam.marshall at it.ox.ac.uk>
>> >>> wrote:
>> >>>> We are running SOLR in production with no issues, we were poised to
>> >>>> contribute this back (possibly with help from Chuck / Adrian Fish) until
>> >>>> this email dropped into my inbox. Now I'm not sure what to do - I think we'd
>> >>>> like to support a configurable plugin approach.
>> >>>>
>> >>>> I'll get Colin Hebert who wrote the implementation to make a post to
>> >>>> outline his thoughts on the matter.
>> >>>>
>> >>>> adam
>> >>>>
>> >>>>
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Beth Kirschner [mailto:bkirschn at umich.edu]
>> >>>> Sent: 24 January 2013 14:06
>> >>>> To: Adam Marshall; John Bush
>> >>>> Cc: sakai-dev (sakai-dev at collab.sakaiproject.org)
>> >>>> Subject: Re: [Building Sakai] Elastic Search (SRCH-111)
>> >>>>
>> >>>> I was wondering the same thing... I remember when discussing SOLR,
>> >>>> the thought was that it would provide potential for new functionality (e.g.
>> >>>> faceted search), but not address the scalability problems with search. The
>> >>>> SRCH-111 JIRA states "The bulk of this work is simply a backend replacement
>> >>>> that fixes most of the indexing/merging problems that have been experienced
>> >>>> in large deployments...". This all sounds very promising. Has any of this
>> >>>> been load tested? I'd like to put this on UM's load test calendar to compare
>> >>>> results. I wonder if there's an opportunity to have configurable plugin
>> >>>> options for a search back end?
>> >>>>
>> >>>> - Beth
>> >>>>
>> >>>> On Jan 24, 2013, at 8:39 AM, Adam Marshall wrote:
>> >>>>
>> >>>>> Has this been discussed before?
>> >>>>>
>> >>>>> I mentioned to the list ages ago that we have reimplemented search
>> >>>>> using SOLR and nobody mentioned this elastic search work. We have been asked
>> >>>>> to contribute our SOLR work to 2.10 (by Chuck) - so I think we should have a
>> >>>>> discussion as to how our implementation and this Elastic search work should
>> >>>>> fit together.
>> >>>>>
>> >>>>> adam
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: sakai-dev-bounces at collab.sakaiproject.org
>> >>>>> [mailto:sakai-dev-bounces at collab.sakaiproject.org] On Behalf Of Beth
>> >>>>> Kirschner
>> >>>>> Sent: 24 January 2013 13:36
>> >>>>> To: John Bush
>> >>>>> Cc: sakai-dev (sakai-dev at collab.sakaiproject.org)
>> >>>>> Subject: Re: [Building Sakai] Elastic Search (SRCH-111)
>> >>>>>
>> >>>>> Thanks!
>> >>>>>
>> >>>>> On Jan 23, 2013, at 8:12 PM, John Bush wrote:
>> >>>>>
>> >>>>>> It's fixed, https://jira.sakaiproject.org/browse/SRCH-112, sorry
>> >>>>>> about that; some refactoring I did for unit tests introduced it.
>> >>>>>>
>> >>>>>> On Wed, Jan 23, 2013 at 5:58 PM, John Bush <john.bush at rsmart.com>
>> >>>>>> wrote:
>> >>>>>>> hmm, that sounds like a bug, it should be 100% backwards
>> >>>>>>> compatible
>> >>>>>>> with existing configuration.  Put a JIRA in and I'll address it.
>> >>>>>>>
>> >>>>>>> On Wed, Jan 23, 2013 at 11:47 AM, Beth Kirschner
>> >>>>>>> <bkirschn at umich.edu> wrote:
>> >>>>>>>> Hi John,
>> >>>>>>>>
>> >>>>>>>> The new elastic search (SRCH-111) does not seem to be backward
>> >>>>>>>> compatible, at least for sakai.properties configuration. My sakai trunk
>> >>>>>>>> build does not boot with "search.enable = true". I've attached the
>> >>>>>>>> catalina.out file, but here's the first error:
>> >>>>>>>>
>> >>>>>>>> 2013-01-23 10:41:35,127 ERROR Thread-3
>> >>>>>>>> org.sakaiproject.search.elasticsearch.ElasticSearchIndexBuilder -
>> >>>>>>>> Failed to load Stop words into Analyzer
>> >>>>>>>> java.lang.NullPointerException
>> >>>>>>>>
>> >>>>>>>> I'm not sure if this is intentional and the other specified
>> >>>>>>>> sakai.properties (elasticsearch.http.*) also need to be set, or is this
>> >>>>>>>> a bug? It will definitely break a lot of implementations as it stands.
>> >>>>>>>> Perhaps I missed some email about this, but we should probably either update
>> >>>>>>>> the JIRA to indicate properties changes will be _required_, or write this up
>> >>>>>>>> as a new bug and make sure previous configurations will boot.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> - Beth
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> John Bush
>> >>>>>>> 602-490-0470
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> John Bush
>> >>>>>> 602-490-0470
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> sakai-dev mailing list
>> >>>>> sakai-dev at collab.sakaiproject.org
>> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>>>
>> >>>>> TO UNSUBSCRIBE: send email to
>> >>>>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> >>>>> "unsubscribe"
>> >>>>
>> >>
>> >>
>> >>
>> >> --
>> >> John Bush
>> >> 602-490-0470
>>
>>
>>
>> --
>> John Bush
>> 602-490-0470
>
>

