[Building Sakai] TinyURL Service Impl Issue

antranig at caret.cam.ac.uk antranig at caret.cam.ac.uk
Tue Jul 7 16:08:46 PDT 2009


Please take a look at the comments at the head of the "EighteenIDGenerator"
file for some details of the maths involved in computing how many
documents will arise before a collision:
https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java

I wouldn't really recommend using a key space smaller than this one
if you want to avoid collisions - although of course you will reprobe
to be sure. Another issue with LCGs is that they are highly predictable -
this is why I use a SecureRandom here: One of the requirements of the
TinyURLService that I made for the work at CARET was that the ID
sequence could not be predicted - to prevent the possibility of malicious
"hijacking" of upcoming URL slots (this was actually a system designed
to allocate small URLs that were plaintext signup URLs routed to a PHP
system, perhaps more risk than is present in a typical application!)

Note that a further issue you have with an LCG is running it in a cluster -
even if you can lock on the particular instance you have, you can't
guarantee that some other instance in the cluster has not stepped on
past the slot you want. I think that the approach in my service
(a SecureRandom plus a transactional check into persistence) is most
likely to work reasonably well in a cluster - although it hasn't been
tested there :P

Cheers,
Antranig.


Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:

> Carl,
>
> Sounds good, is there an existing LCG Java implementation we can
> leverage or should we roll our own?
>
> cheers,
> Steve
>
>
> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>
> > Summary: There is an ever increasing probability that the collision of
> > IDs in the TinyURL service will slow down the service considerably.
> >
> > In looking at the implementation of the TinyURL service, I've noticed
> > something about the way URLs are generated.  If I'm headed in the
> > wrong direction with this, please correct me.
> >
> > The very general approach is:
> > 1) generate a random ID
> > 2) check that the ID doesn't already exist
> > 3) if exists, go to 1.  if not, record the ID and release it to the
> > caller.
> >
> > The following code for generating a URL:
> >
> > Random generator = new Random();
> > int seed = Math.abs(generator.nextInt());
> >
> > has an issue with collision which is addressed by checking for the
> > existence of that ID before handing it out but the algorithm run time
> > will continue to grow because of the growing chance of collision over
> > time.
> >
> > Explanation of concern:
> > The first generated ID will have a 0/n chance of collision where n is
> > the total number of possible IDs.  The next ID has a 1/n chance, the
> > next 2/n, etc up to the last having a chance of collision being n-1/n.
> > Considering n needs to be sufficiently large to have a low to zero
> > rate of duplication over time, this check will take more time because
> > of the number of collisions over time.  Assuming that the ID column in
> > the database is indexed, the lookup time is minimized but there's
> > still a call to the database lookup on each generated ID.  As the
> > number of IDs generated reaches the max possible, there are possibly
> > n-1 lookups before a valid ID is found.
> >
> > The current code recreates the Random each time which stops nextInt()
> > from knowing history (ie. new random seed on each generation).  Java
> > implements a linear congruential generator to generate random numbers
> > which requires knowledge of the previously generated ID to be
> > effective.  Math.random() creates the random seed once and reuses it
> > for the life of the JVM process which is a better approach.
> >
> > I would suggest creating an LCG in the TinyURL service so that the
> > proper range can be easily managed (better performance due to a
> > narrower scope than Math.random()) and the seed can be restored after
> > a restart.  I have created a prototype in Python based on the rules
> > laid out in the Wikipedia article on LCG's [1] and using a
> > sufficiently large prime number.  This still carries a concern of
> > generating IDs across a cluster but is relatively easy to address.
> >
> > WDYT?
> >
> > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
> > _______________________________________________
> > sakai-dev mailing list
> > sakai-dev at collab.sakaiproject.org
> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >
> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
> >  with a subject of "unsubscribe"
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
> with a subject of "unsubscribe"
>




----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.



More information about the sakai-dev mailing list