[Building Sakai] Elastic Search (SRCH-111)

Thu Jan 24 15:42:00 PST 2013

" Right now if I understand your code correctly, a complete
reindexation would be executed on only one instance of Sakai, and the
indexation operation isn't run on the least busy instance but on the
one catching the event. (To be fair, this is also what we have right
now in Oxford because we are queuing Tasks in memory.)
But as the Tasks queuing system we're using is abstract enough, it's
possible to make an implementation (there is already one, but it's not
in production yet) which delegates the queuing to an external system
(you don't have to use it but you can and should)."

Actually no.  The way it works for either a site index or a full index
is that the node capturing the event will queue up resources for indexing.
This actually goes very fast.  The bulk of the indexing work is
in digesting the content.  What we do is put the metadata into ES
without content.  Then each node has a recurring job that fires and
looks for the next batch of docs without content.  This way
we spread that work over the cluster.  It's true that the initial queuing
is done on one node.  I think testing will tell us whether that is a
problem or not; it could be.  That is certainly something to consider.  I
understand the risk of each node doing indexing work.  I've been able
to mitigate that load by simply adjusting the thread priority of the
worker doing that.  In practice that seems to work fine to give
priority to user request threads and let the indexing happen in the
background.
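
To make the two phases concrete, here is a minimal single-JVM sketch.
It is not the SRCH-111 code: the class name, the method names and the
in-memory map standing in for the ES index are all hypothetical, and
digest() is a placeholder for the real content extraction.  It only
shows the shape of the idea: queue metadata quickly, then let a
low-priority recurring job fill in content one batch at a time.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; an in-memory map stands in for the ES index.
final class ContentIndexSketch {
    // Marker value meaning "metadata is indexed, content not digested yet".
    private static final String PENDING = "";

    // resource reference -> digested content (or PENDING)
    private final ConcurrentMap<String, String> index = new ConcurrentHashMap<>();

    // Phase 1: the node that catches the event queues metadata only (cheap).
    void queue(List<String> resourceRefs) {
        for (String ref : resourceRefs) {
            index.putIfAbsent(ref, PENDING);
        }
    }

    // Phase 2: digest up to batchSize pending docs; returns how many were done.
    int processNextBatch(int batchSize) {
        int done = 0;
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (done >= batchSize) {
                break;
            }
            // replace() only succeeds if the doc is still pending, so two
            // workers racing on the same doc cannot both count it.
            if (PENDING.equals(e.getValue())
                    && index.replace(e.getKey(), PENDING, digest(e.getKey()))) {
                done++;
            }
        }
        return done;
    }

    // Number of docs still waiting for their content.
    int pendingCount() {
        int n = 0;
        for (String v : index.values()) {
            if (PENDING.equals(v)) {
                n++;
            }
        }
        return n;
    }

    // Placeholder for the expensive part (text extraction from the resource).
    private String digest(String ref) {
        return "digested:" + ref;
    }

    // Recurring job on a minimum-priority daemon thread, so user request
    // threads are favoured by the scheduler.
    ScheduledExecutorService startWorker(long periodMillis, int batchSize) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "content-indexer");
                    t.setDaemon(true);
                    t.setPriority(Thread.MIN_PRIORITY);
                    return t;
                });
        scheduler.scheduleWithFixedDelay(
                () -> processNextBatch(batchSize),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

In a real cluster every node would run its own worker against the
shared index, and claiming a batch would need whatever coordination
the index offers; the replace() call above only guards against races
inside one JVM.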

You have some valid points here; certainly there are differences of
opinion regarding system architecture.  We would like to find a way to
see Sakai scale horizontally.  Right now the session state issue is
the biggest hangup in regard to that.  I've been researching Unicon's
work on that from back in 2008 for a while, and I'm anxious to find a way
to really solve that universally.  Because once sessions transfer
across nodes, I think a lot of your arguments become less important.

On Thu, Jan 24, 2013 at 11:07 AM, Colin Hebert <colin.hebert at it.ox.ac.uk> wrote:
> - Using an embedded elastic search by default:
> I agree, having an embedded solution by default is nicer than
> requiring an external service. This could also be done by Solr, but I
> don't really see the point of it if the ES impl is already doing that.
>
> - It's easier to scale when it's embedded:
> I disagree with that one, I can't back that thought with a lot of
> experience, but to me it would be really strange to have performance
> issues with my indexing server and say "let's add another instance of
> Sakai to solve the problem". If the problem comes with the indexing
> server, you would need another instance of that service, not another
> instance of Sakai.
> I would probably not use instances of Derby (or any DB I could embed
> and that supports replication) in Sakai as my database just because I
> can spawn a new database with each instance of Sakai, and scale my
> database by creating more instances of Sakai. I'm not sure that making
> Sakai a monolithic application is the right way to go (even if it
> scales), that's just my opinion though.
> I also do not like the idea that the performance of the index
> could have an impact on an instance of Sakai (using too much
> CPU/memory/disk space) and vice versa; not having embedded instances
> of the index allows more control over that.
> And I don't agree with the claim that it makes it easier to set up
> either (it's not that hard in the first place), but that's probably a
> subjective opinion.
>
> - It requires a server farm and a lot of stuff:
> Not really, now with VM and cloud computing it's actually easier to
> manage, but then again I'm not a sys-admin and this is not something I
> do on a daily basis.
> And it doesn't require new services. I recommend using those services
> though. Right now if I understand your code correctly, a complete
> reindexation would be executed on only one instance of Sakai, and the
> indexation operation isn't run on the least busy instance but on the
> one catching the event. (To be fair, this is also what we have right
> now in Oxford because we are queuing Tasks in memory.)
> But as the Tasks queuing system we're using is abstract enough, it's
> possible to make an implementation (there is already one, but it's not
> in production yet) which delegates the queuing to an external system
> (you don't have to use it but you can and should).
> This way, instead of having the current server executing the task, it
> will be one of the servers with a low load that will be in charge of
> that task. And when it comes to reindex everything, the "reindex
> everything" task is split in a lot of smaller tasks "reindex site X"
> spread across all the Sakai instances (it's really neat). But as I
> said, you don't have to use it, you just can if you think it's
> something that will make things faster.
>
>
> Regarding the indexing time, I totally agree with you, one of the
> things I would like to see disappear is the fear of losing the entire
> index because a complete reindexation would only take a few
> minutes/hours.
>
>
> On 24 January 2013 16:18, John Bush <john.bush at rsmart.com> wrote:
>> Based on my work and what I've seen around the solr work, I think we
>> are well poised to simply create some configuration that makes a switch
>> simple and easy.  Something we can do from sakai.properties and not by
>> modifying spring config would be the goal.
>>
>> The decision to move it in was made at the unconference in Phoenix
>> last week after I showed a few TCC members the work.  There probably
>> could have been more transparency about it, but everyone agreed this
>> was a good out of the box experience and the transition from the
>> legacy system should be easy.  Beth is finding some issues, and I'm
>> confident I can address those quickly.
>>
>> In terms of the default experience, I don't think solr is an option
>> unless I'm wrong, as it requires a separate search server.  So I think
>> ES is more well suited to be the default.  We don't even require that
>> for the database right now.
>>
>> This work was initiated by Ian's blog, and some conversations I had
>> with him along the way.  I think there are probably a lot of ways these
>> two efforts can join forces.  I realize system architecture varies
>> from organization to organization.  Our goal is to have a search impl
>> that can scale from small to large without ever needing to reengineer
>> anything.  That is what I think ES delivers, you simply add or remove
>> nodes as capacity grows and shrinks.
>>
>> Now others might prefer a separate server farm and some message queues
>> and a bunch of stuff like that.  I guess it might be helpful to
>> understand who other than Oxford is really interested in supporting
>> that heavyweight a Sakai installation.  Because while it probably
>> isn't a huge amount of work to align these implementations to be
>> easily switchable, it is work.
>>
>> In terms of load testing, we've performed a small amount of testing,
>> mostly to validate the cluster behavior and get some idea of how long
>> reindexing a terabyte of data might take.  I don't have any official
>> reports to share, but the search times have been impressive, and I
>> think the indexing time is acceptable.  Our goal over the next few
>> weeks is to produce some benchmarks comparing ES to the old search; as
>> the information becomes available I'll share it.
>>
>> On Thu, Jan 24, 2013 at 7:31 AM, Colin Hebert <colin.hebert at it.ox.ac.uk> wrote:
>>> In the code written for the Solr implementation we added the
>>> possibility of choosing which implementation of search is used,
>>> allowing users to keep the current search index and to avoid forcing
>>> them to move right away and allowing other implementations to be used.
>>>
>>> There are also multiple additions that I think could be nice to
>>> include in the Elastic Search implementation:
>>>  - The Task system (done with TimerTasks in the ES impl), which in the
>>> current code of the Solr impl allows Tasks to be submitted to a queue.
>>> This queue can either be an in-memory queue, dequeued by a couple of
>>> ExecutorServices, or an external queue (such as an AMQP server [yay,
>>> scalability]).
>>> Another nice thing we've done is allowing the indexing server to do
>>> the text extraction (Solr Cell or Attachment Type in ES).
>>>
>>> I'm also a bit curious, is this implementation of Search with ES
>>> related in any way to the work of Ian Boston (
>>> http://blog.tfd.co.uk/2012/10/11/sakai-cle-elasticsearch/ )?
>>>
>>> Anyway the code for SolrSearch is still available (
>>> https://github.com/ColinHebert/Sakai-Solr ) if anyone wants to take a
>>> look at it. We're also doing a BOF on the subject of Search next week
>>> during the EuroSakai conference in Paris to talk about what could be
>>> done with search and how to improve it further.
>>>
>>> On another note, as I said when we started talking about the Solr
>>> Implementation, the API could be easier to implement if rewritten.
>>> This has been done ( https://github.com/ColinHebert/Sakai-Search2 )
>>> and is currently just in need of a nice UI probably designed by
>>> someone who has some knowledge of UXP in search.
>>>
>>> Colin Hebert
>>>
>>> On 24 January 2013 14:12, Adam Marshall <adam.marshall at it.ox.ac.uk> wrote:
>>>> We are running SOLR in production with no issues, we were poised to contribute this back (possibly with help from Chuck / Adrian Fish) until this email dropped into my inbox. Now I'm not sure what to do - I think we'd like to support a configurable plugin approach.
>>>>
>>>> I'll get Colin Hebert who wrote the implementation to make a post to outline his thoughts on the matter.
>>>>
>>>> adam
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Beth Kirschner [mailto:bkirschn at umich.edu]
>>>> Sent: 24 January 2013 14:06
>>>> To: Adam Marshall; John Bush
>>>> Cc: sakai-dev (sakai-dev at collab.sakaiproject.org)
>>>> Subject: Re: [Building Sakai] Elastic Search (SRCH-111)
>>>>
>>>> I was wondering the same thing... I remember when discussing SOLR, the thought was that it would provide potential for new functionality (e.g. faceted search), but not address the scalability problems with search. The SRCH-111 JIRA states "The bulk of this work is simply a backend replacement that fixes most of the indexing/merging problems that have been experienced in large deployments...". This all sounds very promising. Has any of this been load tested? I'd like to put this on UM's load test calendar to compare results. I wonder if there's an opportunity to have configurable plugin options for a search back end?
>>>>
>>>> - Beth
>>>>
>>>> On Jan 24, 2013, at 8:39 AM, Adam Marshall wrote:
>>>>
>>>>> Has this been discussed before?
>>>>>
>>>>> I mentioned to the list ages ago that we have reimplemented search using SOLR and nobody mentioned this elastic search work. We have been asked to contribute our SOLR work to 2.10 (by Chuck) - so I think we should have a discussion as to how our implementation and this Elastic search work should work together.
>>>>>
>>>>> adam
>>>>>
>>>>> -----Original Message-----
>>>>> From: sakai-dev-bounces at collab.sakaiproject.org
>>>>> [mailto:sakai-dev-bounces at collab.sakaiproject.org] On Behalf Of Beth
>>>>> Kirschner
>>>>> Sent: 24 January 2013 13:36
>>>>> To: John Bush
>>>>> Cc: sakai-dev (sakai-dev at collab.sakaiproject.org)
>>>>> Subject: Re: [Building Sakai] Elastic Search (SRCH-111)
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Jan 23, 2013, at 8:12 PM, John Bush wrote:
>>>>>
>>>>>> It's fixed, https://jira.sakaiproject.org/browse/SRCH-112, sorry
>>>>>> about that; some refactoring I did for unit tests introduced it.
>>>>>>
>>>>>> On Wed, Jan 23, 2013 at 5:58 PM, John Bush <john.bush at rsmart.com> wrote:
>>>>>>> hmm, that sounds like a bug, it should be 100% backwards compatible
>>>>>>> with existing configuration.  Put a JIRA in and I'll address it.
>>>>>>>
>>>>>>> On Wed, Jan 23, 2013 at 11:47 AM, Beth Kirschner <bkirschn at umich.edu> wrote:
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> The new elastic search (SRCH-111) does not seem to be backward compatible, at least for sakai.properties configuration. My sakai trunk build does not boot with "search.enable = true". I've attached the catalina.out file, but here's the first error:
>>>>>>>>
>>>>>>>> 2013-01-23 10:41:35,127 ERROR Thread-3
>>>>>>>> org.sakaiproject.search.elasticsearch.ElasticSearchIndexBuilder -
>>>>>>>> Failed to load Stop words into Analyzer
>>>>>>>> java.lang.NullPointerException
>>>>>>>>
>>>>>>>> I'm not sure if this is intentional and the other specified sakai.properties (elasticsearch.http.*) also need to be set, or is this a bug? It will definitely break a lot of implementations as it stands. Perhaps I missed some email about this, but we should probably either update the JIRA to indicate properties changes will be _required_, or write this up as a new bug and make sure previous configurations will boot.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> - Beth
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Bush
>>>>>>> 602-490-0470
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Bush
>>>>>> 602-490-0470
>>>>>
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>>>>
>>
>>
>>
>> --
>> John Bush
>> 602-490-0470

-- 
John Bush
602-490-0470