[Building Sakai] TinyURL Service Impl Issue

Michael Osterman ostermmg.sakai at gmail.com
Wed Jul 8 14:30:12 PDT 2009


You realize, of course, that said bad word filter would need to be
internationalized. Could be some of the more entertaining il8n work, though.


-Mike

On Wed, Jul 8, 2009 at 4:51 PM, Jim Martino <jrm at jhu.edu> wrote:

> i love the bad words filter idea, once inserted a one-line filter in a
> perl script (for displaying student comments on course surveys) to
> replace offending constructions with asterisks. easily the funniest line
> of code i ever constructed, especially with the regular expressions :)
>
> jim
>
> Steve Swinsburg wrote:
> > The security is handled in the container anyway so you get a shortened
> > URL and it resolves into the full URL, redirects and Sakai takes over.
> > Even if it was sequential it wouldn't be a security concern.
> >
> > With a keyspace of lowercase letters and numbers(36 chars) there are
> > about 2 billion combinations, with uppercase as well there's 50
> > billion. Info here on the theoretical limits:
> > http://bit.ly/3QG2uj
> >
> > Using RandomStringUtils from commons-lang, I wrote a short method to
> > iterate over a million alphanumeric combinations using letters and
> > numbers of length 6 chars and didn't run into a single collision.
> > Using this method and a transactional check when adding into the
> > database is probably the way to go.
> >
> > We can also filter for bad words, BBC has an interesting list (thanks
> > Seth! :)
> >
> > WDYT?
> >
> > Steve
> >
> >
> > On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
> >
> >
> >> Is there a reason why the id generation could be more like tinyURL and
> >> simply be sequential and thus very short (at least initially and
> >> probably for most institutions it would stay small).
> >>
> >> Just a thought.
> >> -AZ
> >>
> >>
> >> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
> >> wrote:
> >>
> >>> You could consider using 22 character uuids as autogenerated tinies
> >>> in
> >>> the url. It might not be able to be called tinyurl anymore, but it
> >>> would at least be short, linkable and (much more) collision free.
> >>>
> >>>
> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
> >>>
> >>> I'm not convinced it would be necessary to keep these id's from being
> >>> unpredictable. The current security model should prevent anyone from
> >>> accessing content they don't have permission to even guessing id's. I
> >>> think eventually when a UI was designed for this it might even want
> >>> to
> >>> allow the user to specify a custom ID, similar to how the Site
> >>> Aliases
> >>> (on new site creation) works, allowing a user to specify anything
> >>> they
> >>> want for their site alias. (Like tinyurl.com) So this would almost
> >>> guarantee collisions.
> >>>
> >>> So I wouldn't think that the possibility of a collision on an low use
> >>> feature for an indexed column (when collisions will probably be
> >>> featured in eventually anyway) likely won't be a big deal.
> >>>
> >>> -Matthew
> >>>
> >>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
> >>>
> >>>> Please take a look at the comments at the head of the
> >>>> "EighteenIDGenerator"
> >>>> file for some details of the maths involved in computing how many
> >>>> documents will arise before a collision:
> >>>>
> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
> >>>>
> >>>> I wouldn't really recommend using a key space smaller than this one
> >>>> if you want to avoid collisions - although of course you will
> >>>> reprobe
> >>>> to be sure. Another issue with LCGs is that they are highly
> >>>> predictable -
> >>>> this is why I use a SecureRandom here: One of the requirements of
> >>>> the
> >>>> TinyURLService that I made for the work at CARET was that the ID
> >>>> sequence could not be predicted - to prevent the possibility of
> >>>> malicious
> >>>> "hijacking" of upcoming URL slots (this was actually a system
> >>>> designed
> >>>> to allocate small URLs that were plaintext signup URLs routed to a
> >>>> PHP
> >>>> system, perhaps more risk than is present in a typical application!)
> >>>>
> >>>> Note that a further issue you have with an LCG is running it in a
> >>>> cluster -
> >>>> even if you can lock on the particular instance you have, you can't
> >>>> guarantee that some other instance in the cluster has not stepped on
> >>>> past the slot you want. I think that the approach in my service
> >>>> (a SecureRandom plus a transactional check into persistence) is most
> >>>> likely to work reasonably well in a cluster - although it hasn't
> >>>> been
> >>>> tested there :P
> >>>>
> >>>> Cheers,
> >>>> Antranig.
> >>>>
> >>>>
> >>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
> >>>>
> >>>>
> >>>>> Carl,
> >>>>>
> >>>>> Sounds good, is there an existing LCG Java implementation we can
> >>>>> leverage or should we roll our own?
> >>>>>
> >>>>> cheers,
> >>>>> Steve
> >>>>>
> >>>>>
> >>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
> >>>>>
> >>>>>
> >>>>>> Summary: There is an ever increasing probability that the
> >>>>>> collision of
> >>>>>> IDs in the TinyURL service will slow down the service
> >>>>>> considerably.
> >>>>>>
> >>>>>> In looking at the implementation of the TinyURL service, I've
> >>>>>> noticed
> >>>>>> something about the way URLs are generated.  If I'm headed in the
> >>>>>> wrong direction with this, please correct me.
> >>>>>>
> >>>>>> The very general approach is:
> >>>>>> 1) generate a random ID
> >>>>>> 2) check that the ID doesn't already exist
> >>>>>> 3) if exists, go to 1.  if not, record the ID and release it to
> >>>>>> the
> >>>>>> caller.
> >>>>>>
> >>>>>> The following code for generating a URL:
> >>>>>>
> >>>>>> Random generator = new Random();
> >>>>>> int seed = Math.abs(generator.nextInt());
> >>>>>>
> >>>>>> has an issue with collision which is addressed by checking for the
> >>>>>> existence of that ID before handing it out but the algorithm run
> >>>>>> time
> >>>>>> will continue to grow because of the growing chance of collision
> >>>>>> over
> >>>>>> time.
> >>>>>>
> >>>>>> Explanation of concern:
> >>>>>> The first generated ID will have a 0/n chance of collision where
> >>>>>> n is
> >>>>>> the total number of possible IDs.  The next ID has a 1/n chance,
> >>>>>> the
> >>>>>> next 2/n, etc up to the last having a chance of collision being
> >>>>>> n-1/n.
> >>>>>> Considering n needs to be sufficiently large to have a low to zero
> >>>>>> rate of duplication over time, this check will take more time
> >>>>>> because
> >>>>>> of the number of collisions over time.  Assuming that the ID
> >>>>>> column in
> >>>>>> the database is indexed, the lookup time is minimized but there's
> >>>>>> still a call to the database lookup on each generated ID.  As the
> >>>>>> number of IDs generated reaches the max possible, there are
> >>>>>> possibly
> >>>>>> n-1 lookups before a valid ID is found.
> >>>>>>
> >>>>>> The current code recreates the Random each time which stops
> >>>>>> nextInt()
> >>>>>> from knowing history (ie. new random seed on each generation).
> >>>>>> Java
> >>>>>> implements a linear congruential generator to generate random
> >>>>>> numbers
> >>>>>> which requires knowledge of the previously generated ID to be
> >>>>>> effective.  Math.random() creates the random seed once and
> >>>>>> reuses it
> >>>>>> for the life of the JVM process which is a better approach.
> >>>>>>
> >>>>>> I would suggest creating an LCG in the TinyURL service so that the
> >>>>>> proper range can be easily managed (better performance due to a
> >>>>>> narrower scope than Math.random()) and the seed can be restored
> >>>>>> after
> >>>>>> a restart.  I have created a prototype in Python based on the
> >>>>>> rules
> >>>>>> laid out in the Wikipedia article on LCG's [1] and using a
> >>>>>> sufficiently large prime number.  This still carries a concern of
> >>>>>> generating IDs across a cluster but is relatively easy to address.
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
> >>>>>> _______________________________________________
> >>>>>> sakai-dev mailing list
> >>>>>> sakai-dev at collab.sakaiproject.org
> >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>>>>
> >>>>>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>>>>  with a subject of "unsubscribe"
> >>>>>>
> >>>>> _______________________________________________
> >>>>> sakai-dev mailing list
> >>>>> sakai-dev at collab.sakaiproject.org
> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>>>
> >>>>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>>> with a subject of "unsubscribe"
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> ----------------------------------------------------------------
> >>>> This message was sent using IMP, the Internet Messaging Program.
> >>>>
> >>>> _______________________________________________
> >>>> sakai-dev mailing list
> >>>> sakai-dev at collab.sakaiproject.org
> >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>>
> >>>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>>  with a subject of "unsubscribe"
> >>>>
> >>>>
> >>> _______________________________________________
> >>> sakai-dev mailing list
> >>> sakai-dev at collab.sakaiproject.org
> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>>
> >>> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>>  with a subject of "unsubscribe"
> >>>
> >>>
> >>
> >> --
> >> Aaron Zeckoski (azeckoski (at) vt.edu)
> >> Senior Research Engineer - CARET - University of Cambridge
> >> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
> >> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
> >> _______________________________________________
> >> sakai-dev mailing list
> >> sakai-dev at collab.sakaiproject.org
> >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >>
> >> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org
> >>  with a subject of "unsubscribe"
> >>
> >
> > _______________________________________________
> > sakai-dev mailing list
> > sakai-dev at collab.sakaiproject.org
> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
> >
> > TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
> "unsubscribe"
> >
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
> "unsubscribe"
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://collab.sakaiproject.org/pipermail/sakai-dev/attachments/20090708/090bf7eb/attachment.html 


More information about the sakai-dev mailing list