[Building Sakai] TinyURL Service Impl Issue

Carl Hall carl.hall at gatech.edu
Mon Jul 6 09:34:49 PDT 2009


Summary: There is an ever increasing probability that the collision of
IDs in the TinyURL service will slow down the service considerably.

In looking at the implementation of the TinyURL service, I've noticed
something about the way URLs are generated.  If I'm headed in the
wrong direction with this, please correct me.

The very general approach is:
1) generate a random ID
2) check that the ID doesn't already exist
3) if exists, go to 1.  if not, record the ID and release it to the caller.

The following code for generating a URL:

Random generator = new Random();
int seed = Math.abs(generator.nextInt());

has an issue with collision which is addressed by checking for the
existence of that ID before handing it out but the algorithm run time
will continue to grow because of the growing chance of collision over
time.

Explanation of concern:
The first generated ID will have a 0/n chance of collision where n is
the total number of possible IDs.  The next ID has a 1/n chance, the
next 2/n, etc up to the last having a chance of collision being n-1/n.
 Considering n needs to be sufficiently large to have a low to zero
rate of duplication over time, this check will take more time because
of the number of collisions over time.  Assuming that the ID column in
the database is indexed, the lookup time is minimized but there's
still a call to the database lookup on each generated ID.  As the
number of IDs generated reaches the max possible, there are possibly
n-1 lookups before a valid ID is found.

The current code recreates the Random each time which stops nextInt()
from knowing history (ie. new random seed on each generation).  Java
implements a linear congruential generator to generate random numbers
which requires knowledge of the previously generated ID to be
effective.  Math.random() creates the random seed once and reuses it
for the life of the JVM process which is a better approach.

I would suggest creating an LCG in the TinyURL service so that the
proper range can be easily managed (better performance due to a
narrower scope than Math.random()) and the seed can be restored after
a restart.  I have created a prototype in Python based on the rules
laid out in the Wikipedia article on LCG's [1] and using a
sufficiently large prime number.  This still carries a concern of
generating IDs across a cluster but is relatively easy to address.

WDYT?

[1] http://en.wikipedia.org/wiki/Linear_congruential_generator


More information about the sakai-dev mailing list