[Building Sakai] TinyURL Service Impl Issue

antranig at caret.cam.ac.uk antranig at caret.cam.ac.uk
Wed Jul 8 14:25:56 PDT 2009


The issue with collisions occurs when you are hosting URLs which are
not protected by the container, but held by some external resource -
for example, a signup system "click here to validate your email to
Xxxxx service". The implementation I made actually runs in two modes,
you may request either "SECURE" or "SHORT" URLs by an option to the
API. The short URLs start at only a couple of characters and
are indeed sequential.
Anyway, feel free to implement your own system, it is the Sakai way
after all :)

Cheers,
Antranig.

Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:

> The security is handled in the container anyway so you get a shortened
> URL and it resolves into the full URL, redirects and Sakai takes over.
> Even if it was sequential it wouldn't be a security concern.
>
> With a keyspace of lowercase letters and numbers(36 chars) there are
> about 2 billion combinations, with uppercase as well there's 50
> billion. Info here on the theoretical limits:
> http://bit.ly/3QG2uj
>
> Using RandomStringUtils from commons-lang, I wrote a short method to
> iterate over a million alphanumeric combinations using letters and
> numbers of length 6 chars and didn't run into a single collision.
> Using this method and a transactional check when adding into the
> database is probably the way to go.
>
> We can also filter for bad words, BBC has an interesting list (thanks
> Seth! :)
>
> WDYT?
>
> Steve
>
>
> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>
> > Is there a reason why the id generation could be more like tinyURL and
> > simply be sequential and thus very short (at least initially and
> > probably for most institutions it would stay small).
> >
> > Just a thought.
> > -AZ
> >
> >
> > On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
> > wrote:
> >> You could consider using 22 character uuids as autogenerated tinies
> >> in
> >> the url. It might not be able to be called tinyurl anymore, but it
> >> would at least be short, linkable and (much more) collision free.
> >>
> >> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
> >>
> >> I'm not convinced it would be necessary to keep these id's from being
> >> unpredictable. The current security model should prevent anyone from
> >> accessing content they don't have permission to even guessing id's. I
> >> think eventually when a UI was designed for this it might even want
> >> to
> >> allow the user to specify a custom ID, similar to how the Site
> >> Aliases
> >> (on new site creation) works, allowing a user to specify anything
> >> they
> >> want for their site alias. (Like tinyurl.com) So this would almost
> >> guarantee collisions.
> >>
> >> So I wouldn't think that the possibility of a collision on an low use
> >> feature for an indexed column (when collisions will probably be
> >> featured in eventually anyway) likely won't be a big deal.
> >>
> >> -Matthew
> >>
> >> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
> >>> Please take a look at the comments at the head of the
> >>> "EighteenIDGenerator"
> >>> file for some details of the maths involved in computing how many
> >>> documents will arise before a collision:
> >>>
>
https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
> >>>
> >>> I wouldn't really recommend using a key space smaller than this one
> >>> if you want to avoid collisions - although of course you will
> >>> reprobe
> >>> to be sure. Another issue with LCGs is that they are highly
> >>> predictable -
> >>> this is why I use a SecureRandom here: One of the requirements of
> >>> the
> >>> TinyURLService that I made for the work at CARET was that the ID
> >>> sequence could not be predicted - to prevent the possibility of
> >>> malicious
> >>> "hijacking" of upcoming URL slots (this was actually a system
> >>> designed
> >>> to allocate small URLs that were plaintext signup URLs routed to a
> >>> PHP
> >>> system, perhaps more risk than is present in a typical application!)
> >>>
> >>> Note that a further issue you have with an LCG is running it in a
> >>> cluster -
> >>> even if you can lock on the particular instance you have, you can't
> >>> guarantee that some other instance in the cluster has not stepped on
> >>> past the slot you want. I think that the approach in my service
> >>> (a SecureRandom plus a transactional check into persistence) is most
> >>> likely to work reasonably well in a cluster - although it hasn't
> >>> been
> >>> tested there :P
> >>>
> >>> Cheers,
> >>> Antranig.
> >>>
> >>>
> >>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
> >>>
> >>>> Carl,
> >>>>
> >>>> Sounds good, is there an existing LCG Java implementation we can
> >>>> leverage or should we roll our own?
> >>>>
> >>>> cheers,
> >>>> Steve
> >>>>
> >>>>
> >>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
> >>>>
> >>>>> Summary: There is an ever increasing probability that the
> >>>>> collision of
> >>>>> IDs in the TinyURL service will slow down the service
> >>>>> considerably.
> >>>>>
> >>>>> In looking at the implementation of the TinyURL service, I've
> >>>>> noticed
> >>>>> something about the way URLs are generated.  If I'm headed in the
> >>>>> wrong direction with this, please correct me.
> >>>>>
> >>>>> The very general approach is:
> >>>>> 1) generate a random ID
> >>>>> 2) check that the ID doesn't already exist
> >>>>> 3) if exists, go to 1.  if not, record the ID and release it to
> >>>>> the
> >>>>> caller.
> >>>>>
> >>>>> The following code for generating a URL:
> >>>>>
> >>>>> Random generator = new Random();
> >>>>> int seed = Math.abs(generator.nextInt());
> >>>>>
> >>>>> has an issue with collision which is addressed by checking for the
> >>>>> existence of that ID before handing it out but the algorithm run
> >>>>> time
> >>>>> will continue to grow because of the growing chance of collision
> >>>>> over
> >>>>> time.
> >>>>>
> >>>>> Explanation of concern:
> >>>>> The first generated ID will have a 0/n chance of collision where
> >>>>> n is
> >>>>> the total number of possible IDs.  The next ID has a 1/n chance,
> >>>>> the
> >>>>> next 2/n, etc up to the last having a chance of collision being
> >>>>> n-1/n.
> >>>>> Considering n needs to be sufficiently large to have a low to zero
> >>>>> rate of duplication over time, this check will take more time
> >>>>> because
> >>>>> of the number of collisions over time.  Assuming that the ID
> >>>>> column in
> >>>>> the database is indexed, the lookup time is minimized but there's
> >>>>> still a call to the database lookup on each generated ID.  As the
> >>>>> number of IDs generated reaches the max possible, there are
> >>>>> possibly
> >>>>> n-1 lookups before a valid ID is found.
> >>>>>
> >>>>> The current code recreates the Random each time which stops
> >>>>> nextInt()
> >>>>> from knowing history (ie. new random seed on each generation).
> >>>>> Java
> >>>>> implements a linear congruential generator to generate random
> >>>>> numbers
> >>>>> which requires knowledge of the previously generated ID to be
> >>>>> effective.  Math.random() creates the random seed once and
> >>>>> reuses it
> >>>>> for the life of the JVM process which is a better approach.
> >>>>>
> >>>>> I would suggest creating an LCG in the TinyURL service so that the
> >>>>> proper range can be easily managed (better performance due to a
> >>>>> narrower scope than Math.random()) and the seed can be restored
> >>>>> after
> >>>>> a restart.  I have created a prototype in Python based on the
> >>>>> rules
> >>>>> laid out in the Wikipedia article on LCG's [1] and using a
> >>>>> sufficiently large prime number.  This still carries a concern of
> >>>>> generating IDs across a cluster but is relatively easy to address.
> >>>>>
> >>>>> WDYT?
> >>>>>
> >>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
> >>>>> _______________________________________________
> >>>>> sakai-dev mailing list
> >>>>> sakai-dev at collab.sakaiproject.org
> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>>>
> >>>>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>>>  with a subject of "unsubscribe"
> >>>>
> >>>> _______________________________________________
> >>>> sakai-dev mailing list
> >>>> sakai-dev at collab.sakaiproject.org
> >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>>
> >>>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>> with a subject of "unsubscribe"
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>> ----------------------------------------------------------------
> >>> This message was sent using IMP, the Internet Messaging Program.
> >>>
> >>> _______________________________________________
> >>> sakai-dev mailing list
> >>> sakai-dev at collab.sakaiproject.org
> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>
> >>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>  with a subject of "unsubscribe"
> >>>
> >> _______________________________________________
> >> sakai-dev mailing list
> >> sakai-dev at collab.sakaiproject.org
> >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>
> >> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>  with a subject of "unsubscribe"
> >>
> >
> >
> >
> > --
> > Aaron Zeckoski (azeckoski (at) vt.edu)
> > Senior Research Engineer - CARET - University of Cambridge
> > https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
> > http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
> > _______________________________________________
> > sakai-dev mailing list
> > sakai-dev at collab.sakaiproject.org
> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >
> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
> >  with a subject of "unsubscribe"
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
> with a subject of "unsubscribe"
>




----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.



More information about the sakai-dev mailing list