[Building Sakai] TinyURL Service Impl Issue

Aaron Zeckoski aaronz at vt.edu
Wed Jul 8 09:10:15 PDT 2009


Is there a reason why the id generation could be more like tinyURL and
simply be sequential and thus very short (at least initially and
probably for most institutions it would stay small).

Just a thought.
-AZ


On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu> wrote:
> You could consider using 22 character uuids as autogenerated tinies in
> the url. It might not be able to be called tinyurl anymore, but it
> would at least be short, linkable and (much more) collision free.
>
> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>
> I'm not convinced it would be necessary to keep these id's from being
> unpredictable. The current security model should prevent anyone from
> accessing content they don't have permission to even guessing id's. I
> think eventually when a UI was designed for this it might even want to
> allow the user to specify a custom ID, similar to how the Site Aliases
> (on new site creation) works, allowing a user to specify anything they
> want for their site alias. (Like tinyurl.com) So this would almost
> guarantee collisions.
>
> So I wouldn't think that the possibility of a collision on an low use
> feature for an indexed column (when collisions will probably be
> featured in eventually anyway) likely won't be a big deal.
>
> -Matthew
>
> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>> Please take a look at the comments at the head of the "EighteenIDGenerator"
>> file for some details of the maths involved in computing how many
>> documents will arise before a collision:
>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>
>> I wouldn't really recommend using a key space smaller than this one
>> if you want to avoid collisions - although of course you will reprobe
>> to be sure. Another issue with LCGs is that they are highly predictable -
>> this is why I use a SecureRandom here: One of the requirements of the
>> TinyURLService that I made for the work at CARET was that the ID
>> sequence could not be predicted - to prevent the possibility of malicious
>> "hijacking" of upcoming URL slots (this was actually a system designed
>> to allocate small URLs that were plaintext signup URLs routed to a PHP
>> system, perhaps more risk than is present in a typical application!)
>>
>> Note that a further issue you have with an LCG is running it in a cluster -
>> even if you can lock on the particular instance you have, you can't
>> guarantee that some other instance in the cluster has not stepped on
>> past the slot you want. I think that the approach in my service
>> (a SecureRandom plus a transactional check into persistence) is most
>> likely to work reasonably well in a cluster - although it hasn't been
>> tested there :P
>>
>> Cheers,
>> Antranig.
>>
>>
>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>
>>> Carl,
>>>
>>> Sounds good, is there an existing LCG Java implementation we can
>>> leverage or should we roll our own?
>>>
>>> cheers,
>>> Steve
>>>
>>>
>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>
>>> > Summary: There is an ever increasing probability that the collision of
>>> > IDs in the TinyURL service will slow down the service considerably.
>>> >
>>> > In looking at the implementation of the TinyURL service, I've noticed
>>> > something about the way URLs are generated.  If I'm headed in the
>>> > wrong direction with this, please correct me.
>>> >
>>> > The very general approach is:
>>> > 1) generate a random ID
>>> > 2) check that the ID doesn't already exist
>>> > 3) if exists, go to 1.  if not, record the ID and release it to the
>>> > caller.
>>> >
>>> > The following code for generating a URL:
>>> >
>>> > Random generator = new Random();
>>> > int seed = Math.abs(generator.nextInt());
>>> >
>>> > has an issue with collision which is addressed by checking for the
>>> > existence of that ID before handing it out but the algorithm run time
>>> > will continue to grow because of the growing chance of collision over
>>> > time.
>>> >
>>> > Explanation of concern:
>>> > The first generated ID will have a 0/n chance of collision where n is
>>> > the total number of possible IDs.  The next ID has a 1/n chance, the
>>> > next 2/n, etc up to the last having a chance of collision being n-1/n.
>>> > Considering n needs to be sufficiently large to have a low to zero
>>> > rate of duplication over time, this check will take more time because
>>> > of the number of collisions over time.  Assuming that the ID column in
>>> > the database is indexed, the lookup time is minimized but there's
>>> > still a call to the database lookup on each generated ID.  As the
>>> > number of IDs generated reaches the max possible, there are possibly
>>> > n-1 lookups before a valid ID is found.
>>> >
>>> > The current code recreates the Random each time which stops nextInt()
>>> > from knowing history (ie. new random seed on each generation).  Java
>>> > implements a linear congruential generator to generate random numbers
>>> > which requires knowledge of the previously generated ID to be
>>> > effective.  Math.random() creates the random seed once and reuses it
>>> > for the life of the JVM process which is a better approach.
>>> >
>>> > I would suggest creating an LCG in the TinyURL service so that the
>>> > proper range can be easily managed (better performance due to a
>>> > narrower scope than Math.random()) and the seed can be restored after
>>> > a restart.  I have created a prototype in Python based on the rules
>>> > laid out in the Wikipedia article on LCG's [1] and using a
>>> > sufficiently large prime number.  This still carries a concern of
>>> > generating IDs across a cluster but is relatively easy to address.
>>> >
>>> > WDYT?
>>> >
>>> > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>> > _______________________________________________
>>> > sakai-dev mailing list
>>> > sakai-dev at collab.sakaiproject.org
>>> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>> >
>>> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>> >  with a subject of "unsubscribe"
>>>
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>> with a subject of "unsubscribe"
>>>
>>
>>
>>
>>
>> ----------------------------------------------------------------
>> This message was sent using IMP, the Internet Messaging Program.
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>



-- 
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile


More information about the sakai-dev mailing list