[Building Sakai] TinyURL Service Impl Issue

Wed Jul 8 13:51:21 PDT 2009

i love the bad words filter idea, once inserted a one-line filter in a 
perl script (for displaying student comments on course surveys) to 
replace offending constructions with asterisks. easily the funniest line 
of code i ever constructed, especially with the regular expressions :)

jim

Steve Swinsburg wrote:
> The security is handled in the container anyway so you get a shortened
> URL and it resolves into the full URL, redirects and Sakai takes over.
> Even if it was sequential it wouldn't be a security concern.
>
> With a keyspace of lowercase letters and numbers(36 chars) there are
> about 2 billion combinations, with uppercase as well there's 50
> billion. Info here on the theoretical limits:
> http://bit.ly/3QG2uj
>
> Using RandomStringUtils from commons-lang, I wrote a short method to
> iterate over a million alphanumeric combinations using letters and
> numbers of length 6 chars and didn't run into a single collision.
> Using this method and a transactional check when adding into the
> database is probably the way to go.
>
> We can also filter for bad words, BBC has an interesting list (thanks
> Seth! :)
>
> WDYT?
>
> Steve
>
>
> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>
>   
>> Is there a reason why the id generation could be more like tinyURL and
>> simply be sequential and thus very short (at least initially and
>> probably for most institutions it would stay small).
>>
>> Just a thought.
>> -AZ
>>
>>
>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>> wrote:
>>     
>>> You could consider using 22 character uuids as autogenerated tinies
>>> in
>>> the url. It might not be able to be called tinyurl anymore, but it
>>> would at least be short, linkable and (much more) collision free.
>>>
>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>>
>>> I'm not convinced it would be necessary to keep these id's from being
>>> unpredictable. The current security model should prevent anyone from
>>> accessing content they don't have permission to even guessing id's. I
>>> think eventually when a UI was designed for this it might even want
>>> to
>>> allow the user to specify a custom ID, similar to how the Site
>>> Aliases
>>> (on new site creation) works, allowing a user to specify anything
>>> they
>>> want for their site alias. (Like tinyurl.com) So this would almost
>>> guarantee collisions.
>>>
>>> So I wouldn't think that the possibility of a collision on an low use
>>> feature for an indexed column (when collisions will probably be
>>> featured in eventually anyway) likely won't be a big deal.
>>>
>>> -Matthew
>>>
>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>>>       
>>>> Please take a look at the comments at the head of the
>>>> "EighteenIDGenerator"
>>>> file for some details of the maths involved in computing how many
>>>> documents will arise before a collision:
>>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>>
>>>> I wouldn't really recommend using a key space smaller than this one
>>>> if you want to avoid collisions - although of course you will
>>>> reprobe
>>>> to be sure. Another issue with LCGs is that they are highly
>>>> predictable -
>>>> this is why I use a SecureRandom here: One of the requirements of
>>>> the
>>>> TinyURLService that I made for the work at CARET was that the ID
>>>> sequence could not be predicted - to prevent the possibility of
>>>> malicious
>>>> "hijacking" of upcoming URL slots (this was actually a system
>>>> designed
>>>> to allocate small URLs that were plaintext signup URLs routed to a
>>>> PHP
>>>> system, perhaps more risk than is present in a typical application!)
>>>>
>>>> Note that a further issue you have with an LCG is running it in a
>>>> cluster -
>>>> even if you can lock on the particular instance you have, you can't
>>>> guarantee that some other instance in the cluster has not stepped on
>>>> past the slot you want. I think that the approach in my service
>>>> (a SecureRandom plus a transactional check into persistence) is most
>>>> likely to work reasonably well in a cluster - although it hasn't
>>>> been
>>>> tested there :P
>>>>
>>>> Cheers,
>>>> Antranig.
>>>>
>>>>
>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>
>>>>         
>>>>> Carl,
>>>>>
>>>>> Sounds good, is there an existing LCG Java implementation we can
>>>>> leverage or should we roll our own?
>>>>>
>>>>> cheers,
>>>>> Steve
>>>>>
>>>>>
>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>>
>>>>>           
>>>>>> Summary: There is an ever increasing probability that the
>>>>>> collision of
>>>>>> IDs in the TinyURL service will slow down the service
>>>>>> considerably.
>>>>>>
>>>>>> In looking at the implementation of the TinyURL service, I've
>>>>>> noticed
>>>>>> something about the way URLs are generated.  If I'm headed in the
>>>>>> wrong direction with this, please correct me.
>>>>>>
>>>>>> The very general approach is:
>>>>>> 1) generate a random ID
>>>>>> 2) check that the ID doesn't already exist
>>>>>> 3) if exists, go to 1.  if not, record the ID and release it to
>>>>>> the
>>>>>> caller.
>>>>>>
>>>>>> The following code for generating a URL:
>>>>>>
>>>>>> Random generator = new Random();
>>>>>> int seed = Math.abs(generator.nextInt());
>>>>>>
>>>>>> has an issue with collision which is addressed by checking for the
>>>>>> existence of that ID before handing it out but the algorithm run
>>>>>> time
>>>>>> will continue to grow because of the growing chance of collision
>>>>>> over
>>>>>> time.
>>>>>>
>>>>>> Explanation of concern:
>>>>>> The first generated ID will have a 0/n chance of collision where
>>>>>> n is
>>>>>> the total number of possible IDs.  The next ID has a 1/n chance,
>>>>>> the
>>>>>> next 2/n, etc up to the last having a chance of collision being
>>>>>> n-1/n.
>>>>>> Considering n needs to be sufficiently large to have a low to zero
>>>>>> rate of duplication over time, this check will take more time
>>>>>> because
>>>>>> of the number of collisions over time.  Assuming that the ID
>>>>>> column in
>>>>>> the database is indexed, the lookup time is minimized but there's
>>>>>> still a call to the database lookup on each generated ID.  As the
>>>>>> number of IDs generated reaches the max possible, there are
>>>>>> possibly
>>>>>> n-1 lookups before a valid ID is found.
>>>>>>
>>>>>> The current code recreates the Random each time which stops
>>>>>> nextInt()
>>>>>> from knowing history (ie. new random seed on each generation).
>>>>>> Java
>>>>>> implements a linear congruential generator to generate random
>>>>>> numbers
>>>>>> which requires knowledge of the previously generated ID to be
>>>>>> effective.  Math.random() creates the random seed once and
>>>>>> reuses it
>>>>>> for the life of the JVM process which is a better approach.
>>>>>>
>>>>>> I would suggest creating an LCG in the TinyURL service so that the
>>>>>> proper range can be easily managed (better performance due to a
>>>>>> narrower scope than Math.random()) and the seed can be restored
>>>>>> after
>>>>>> a restart.  I have created a prototype in Python based on the
>>>>>> rules
>>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>>> sufficiently large prime number.  This still carries a concern of
>>>>>> generating IDs across a cluster but is relatively easy to address.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>>> _______________________________________________
>>>>>> sakai-dev mailing list
>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>
>>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>  with a subject of "unsubscribe"
>>>>>>             
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>> with a subject of "unsubscribe"
>>>>>
>>>>>           
>>>>
>>>>
>>>> ----------------------------------------------------------------
>>>> This message was sent using IMP, the Internet Messaging Program.
>>>>
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>  with a subject of "unsubscribe"
>>>>
>>>>         
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>  with a subject of "unsubscribe"
>>>
>>>       
>>
>> --
>> Aaron Zeckoski (azeckoski (at) vt.edu)
>> Senior Research Engineer - CARET - University of Cambridge
>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>  with a subject of "unsubscribe"
>>     
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>