[Building Sakai] TinyURL Service Impl Issue

Aaron Zeckoski aaronz at vt.edu
Wed Jul 8 14:35:08 PDT 2009


I would say that the tinyurl service we are using at CARET is a good
starting point but that is just my opinion. It gives us 2 out of 3
things that we want and the 3rd (custom tiny url) is easy to add.
-AZ

On Wed, Jul 8, 2009 at 10:25 PM, <antranig at caret.cam.ac.uk> wrote:
> The issue with collisions occurs when you are hosting URLs which are
> not protected by the container, but held by some external resource -
> for example, a signup system "click here to validate your email to
> Xxxxx service". The implementation I made actually runs in two modes,
> you may request either "SECURE" or "SHORT" URLs by an option to the
> API. The short URLs start at only a couple of characters and
> are indeed sequential.
> Anyway, feel free to implement your own system, it is the Sakai way
> after all :)
>
> Cheers,
> Antranig.
>
> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>
>> The security is handled in the container anyway so you get a shortened
>> URL and it resolves into the full URL, redirects and Sakai takes over.
>> Even if it was sequential it wouldn't be a security concern.
>>
>> With a keyspace of lowercase letters and numbers(36 chars) there are
>> about 2 billion combinations, with uppercase as well there's 50
>> billion. Info here on the theoretical limits:
>> http://bit.ly/3QG2uj
>>
>> Using RandomStringUtils from commons-lang, I wrote a short method to
>> iterate over a million alphanumeric combinations using letters and
>> numbers of length 6 chars and didn't run into a single collision.
>> Using this method and a transactional check when adding into the
>> database is probably the way to go.
>>
>> We can also filter for bad words, BBC has an interesting list (thanks
>> Seth! :)
>>
>> WDYT?
>>
>> Steve
>>
>>
>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>>
>> > Is there a reason why the id generation could be more like tinyURL and
>> > simply be sequential and thus very short (at least initially and
>> > probably for most institutions it would stay small).
>> >
>> > Just a thought.
>> > -AZ
>> >
>> >
>> > On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>> > wrote:
>> >> You could consider using 22 character uuids as autogenerated tinies
>> >> in
>> >> the url. It might not be able to be called tinyurl anymore, but it
>> >> would at least be short, linkable and (much more) collision free.
>> >>
>> >> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>> >>
>> >> I'm not convinced it would be necessary to keep these id's from being
>> >> unpredictable. The current security model should prevent anyone from
>> >> accessing content they don't have permission to even guessing id's. I
>> >> think eventually when a UI was designed for this it might even want
>> >> to
>> >> allow the user to specify a custom ID, similar to how the Site
>> >> Aliases
>> >> (on new site creation) works, allowing a user to specify anything
>> >> they
>> >> want for their site alias. (Like tinyurl.com) So this would almost
>> >> guarantee collisions.
>> >>
>> >> So I wouldn't think that the possibility of a collision on an low use
>> >> feature for an indexed column (when collisions will probably be
>> >> featured in eventually anyway) likely won't be a big deal.
>> >>
>> >> -Matthew
>> >>
>> >> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>> >>> Please take a look at the comments at the head of the
>> >>> "EighteenIDGenerator"
>> >>> file for some details of the maths involved in computing how many
>> >>> documents will arise before a collision:
>> >>>
>>
> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>> >>>
>> >>> I wouldn't really recommend using a key space smaller than this one
>> >>> if you want to avoid collisions - although of course you will
>> >>> reprobe
>> >>> to be sure. Another issue with LCGs is that they are highly
>> >>> predictable -
>> >>> this is why I use a SecureRandom here: One of the requirements of
>> >>> the
>> >>> TinyURLService that I made for the work at CARET was that the ID
>> >>> sequence could not be predicted - to prevent the possibility of
>> >>> malicious
>> >>> "hijacking" of upcoming URL slots (this was actually a system
>> >>> designed
>> >>> to allocate small URLs that were plaintext signup URLs routed to a
>> >>> PHP
>> >>> system, perhaps more risk than is present in a typical application!)
>> >>>
>> >>> Note that a further issue you have with an LCG is running it in a
>> >>> cluster -
>> >>> even if you can lock on the particular instance you have, you can't
>> >>> guarantee that some other instance in the cluster has not stepped on
>> >>> past the slot you want. I think that the approach in my service
>> >>> (a SecureRandom plus a transactional check into persistence) is most
>> >>> likely to work reasonably well in a cluster - although it hasn't
>> >>> been
>> >>> tested there :P
>> >>>
>> >>> Cheers,
>> >>> Antranig.
>> >>>
>> >>>
>> >>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>> >>>
>> >>>> Carl,
>> >>>>
>> >>>> Sounds good, is there an existing LCG Java implementation we can
>> >>>> leverage or should we roll our own?
>> >>>>
>> >>>> cheers,
>> >>>> Steve
>> >>>>
>> >>>>
>> >>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>> >>>>
>> >>>>> Summary: There is an ever increasing probability that the
>> >>>>> collision of
>> >>>>> IDs in the TinyURL service will slow down the service
>> >>>>> considerably.
>> >>>>>
>> >>>>> In looking at the implementation of the TinyURL service, I've
>> >>>>> noticed
>> >>>>> something about the way URLs are generated.  If I'm headed in the
>> >>>>> wrong direction with this, please correct me.
>> >>>>>
>> >>>>> The very general approach is:
>> >>>>> 1) generate a random ID
>> >>>>> 2) check that the ID doesn't already exist
>> >>>>> 3) if exists, go to 1.  if not, record the ID and release it to
>> >>>>> the
>> >>>>> caller.
>> >>>>>
>> >>>>> The following code for generating a URL:
>> >>>>>
>> >>>>> Random generator = new Random();
>> >>>>> int seed = Math.abs(generator.nextInt());
>> >>>>>
>> >>>>> has an issue with collision which is addressed by checking for the
>> >>>>> existence of that ID before handing it out but the algorithm run
>> >>>>> time
>> >>>>> will continue to grow because of the growing chance of collision
>> >>>>> over
>> >>>>> time.
>> >>>>>
>> >>>>> Explanation of concern:
>> >>>>> The first generated ID will have a 0/n chance of collision where
>> >>>>> n is
>> >>>>> the total number of possible IDs.  The next ID has a 1/n chance,
>> >>>>> the
>> >>>>> next 2/n, etc up to the last having a chance of collision being
>> >>>>> n-1/n.
>> >>>>> Considering n needs to be sufficiently large to have a low to zero
>> >>>>> rate of duplication over time, this check will take more time
>> >>>>> because
>> >>>>> of the number of collisions over time.  Assuming that the ID
>> >>>>> column in
>> >>>>> the database is indexed, the lookup time is minimized but there's
>> >>>>> still a call to the database lookup on each generated ID.  As the
>> >>>>> number of IDs generated reaches the max possible, there are
>> >>>>> possibly
>> >>>>> n-1 lookups before a valid ID is found.
>> >>>>>
>> >>>>> The current code recreates the Random each time which stops
>> >>>>> nextInt()
>> >>>>> from knowing history (ie. new random seed on each generation).
>> >>>>> Java
>> >>>>> implements a linear congruential generator to generate random
>> >>>>> numbers
>> >>>>> which requires knowledge of the previously generated ID to be
>> >>>>> effective.  Math.random() creates the random seed once and
>> >>>>> reuses it
>> >>>>> for the life of the JVM process which is a better approach.
>> >>>>>
>> >>>>> I would suggest creating an LCG in the TinyURL service so that the
>> >>>>> proper range can be easily managed (better performance due to a
>> >>>>> narrower scope than Math.random()) and the seed can be restored
>> >>>>> after
>> >>>>> a restart.  I have created a prototype in Python based on the
>> >>>>> rules
>> >>>>> laid out in the Wikipedia article on LCG's [1] and using a
>> >>>>> sufficiently large prime number.  This still carries a concern of
>> >>>>> generating IDs across a cluster but is relatively easy to address.
>> >>>>>
>> >>>>> WDYT?
>> >>>>>
>> >>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>> >>>>> _______________________________________________
>> >>>>> sakai-dev mailing list
>> >>>>> sakai-dev at collab.sakaiproject.org
>> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>>>
>> >>>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>> >>>>>  with a subject of "unsubscribe"
>> >>>>
>> >>>> _______________________________________________
>> >>>> sakai-dev mailing list
>> >>>> sakai-dev at collab.sakaiproject.org
>> >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>>
>> >>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>> >>>> with a subject of "unsubscribe"
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ----------------------------------------------------------------
>> >>> This message was sent using IMP, the Internet Messaging Program.
>> >>>
>> >>> _______________________________________________
>> >>> sakai-dev mailing list
>> >>> sakai-dev at collab.sakaiproject.org
>> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>>
>> >>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>> >>>  with a subject of "unsubscribe"
>> >>>
>> >> _______________________________________________
>> >> sakai-dev mailing list
>> >> sakai-dev at collab.sakaiproject.org
>> >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >>
>> >> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>> >>  with a subject of "unsubscribe"
>> >>
>> >
>> >
>> >
>> > --
>> > Aaron Zeckoski (azeckoski (at) vt.edu)
>> > Senior Research Engineer - CARET - University of Cambridge
>> > https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>> > http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>> > _______________________________________________
>> > sakai-dev mailing list
>> > sakai-dev at collab.sakaiproject.org
>> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>> >
>> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>> >  with a subject of "unsubscribe"
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>> with a subject of "unsubscribe"
>>
>
>
>
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
>



-- 
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile


More information about the sakai-dev mailing list