[Building Sakai] TinyURL Service Impl Issue

Steve Swinsburg s.swinsburg at lancaster.ac.uk
Wed Jul 8 14:52:14 PDT 2009


Ok I've added an alternative method to return a longer random id which  
makes it less guessable. The number of permutations with this combo  
has 30 zeroes (on a-z and 0-9).

Have some fun with the calculator! http://bit.ly/gJP1X

cheers,
Steve



On 8 Jul 2009, at 17:25, antranig at caret.cam.ac.uk wrote:

> The issue with collisions occurs when you are hosting URLs which are
> not protected by the container, but held by some external resource -
> for example, a signup system "click here to validate your email to
> Xxxxx service". The implementation I made actually runs in two modes,
> you may request either "SECURE" or "SHORT" URLs by an option to the
> API. The short URLs start at only a couple of characters and
> are indeed sequential.
> Anyway, feel free to implement your own system, it is the Sakai way
> after all :)
>
> Cheers,
> Antranig.
>
> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>
>> The security is handled in the container anyway so you get a  
>> shortened
>> URL and it resolves into the full URL, redirects and Sakai takes  
>> over.
>> Even if it was sequential it wouldn't be a security concern.
>>
>> With a keyspace of lowercase letters and numbers(36 chars) there are
>> about 2 billion combinations, with uppercase as well there's 50
>> billion. Info here on the theoretical limits:
>> http://bit.ly/3QG2uj
>>
>> Using RandomStringUtils from commons-lang, I wrote a short method to
>> iterate over a million alphanumeric combinations using letters and
>> numbers of length 6 chars and didn't run into a single collision.
>> Using this method and a transactional check when adding into the
>> database is probably the way to go.
>>
>> We can also filter for bad words, BBC has an interesting list (thanks
>> Seth! :)
>>
>> WDYT?
>>
>> Steve
>>
>>
>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>>
>>> Is there a reason why the id generation could be more like tinyURL  
>>> and
>>> simply be sequential and thus very short (at least initially and
>>> probably for most institutions it would stay small).
>>>
>>> Just a thought.
>>> -AZ
>>>
>>>
>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>>> wrote:
>>>> You could consider using 22 character uuids as autogenerated tinies
>>>> in
>>>> the url. It might not be able to be called tinyurl anymore, but it
>>>> would at least be short, linkable and (much more) collision free.
>>>>
>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>>>
>>>> I'm not convinced it would be necessary to keep these id's from  
>>>> being
>>>> unpredictable. The current security model should prevent anyone  
>>>> from
>>>> accessing content they don't have permission to even guessing  
>>>> id's. I
>>>> think eventually when a UI was designed for this it might even want
>>>> to
>>>> allow the user to specify a custom ID, similar to how the Site
>>>> Aliases
>>>> (on new site creation) works, allowing a user to specify anything
>>>> they
>>>> want for their site alias. (Like tinyurl.com) So this would almost
>>>> guarantee collisions.
>>>>
>>>> So I wouldn't think that the possibility of a collision on an low  
>>>> use
>>>> feature for an indexed column (when collisions will probably be
>>>> featured in eventually anyway) likely won't be a big deal.
>>>>
>>>> -Matthew
>>>>
>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>>>>> Please take a look at the comments at the head of the
>>>>> "EighteenIDGenerator"
>>>>> file for some details of the maths involved in computing how many
>>>>> documents will arise before a collision:
>>>>>
>>
> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>>>
>>>>> I wouldn't really recommend using a key space smaller than this  
>>>>> one
>>>>> if you want to avoid collisions - although of course you will
>>>>> reprobe
>>>>> to be sure. Another issue with LCGs is that they are highly
>>>>> predictable -
>>>>> this is why I use a SecureRandom here: One of the requirements of
>>>>> the
>>>>> TinyURLService that I made for the work at CARET was that the ID
>>>>> sequence could not be predicted - to prevent the possibility of
>>>>> malicious
>>>>> "hijacking" of upcoming URL slots (this was actually a system
>>>>> designed
>>>>> to allocate small URLs that were plaintext signup URLs routed to a
>>>>> PHP
>>>>> system, perhaps more risk than is present in a typical  
>>>>> application!)
>>>>>
>>>>> Note that a further issue you have with an LCG is running it in a
>>>>> cluster -
>>>>> even if you can lock on the particular instance you have, you  
>>>>> can't
>>>>> guarantee that some other instance in the cluster has not  
>>>>> stepped on
>>>>> past the slot you want. I think that the approach in my service
>>>>> (a SecureRandom plus a transactional check into persistence) is  
>>>>> most
>>>>> likely to work reasonably well in a cluster - although it hasn't
>>>>> been
>>>>> tested there :P
>>>>>
>>>>> Cheers,
>>>>> Antranig.
>>>>>
>>>>>
>>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>>
>>>>>> Carl,
>>>>>>
>>>>>> Sounds good, is there an existing LCG Java implementation we can
>>>>>> leverage or should we roll our own?
>>>>>>
>>>>>> cheers,
>>>>>> Steve
>>>>>>
>>>>>>
>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>>>
>>>>>>> Summary: There is an ever increasing probability that the
>>>>>>> collision of
>>>>>>> IDs in the TinyURL service will slow down the service
>>>>>>> considerably.
>>>>>>>
>>>>>>> In looking at the implementation of the TinyURL service, I've
>>>>>>> noticed
>>>>>>> something about the way URLs are generated.  If I'm headed in  
>>>>>>> the
>>>>>>> wrong direction with this, please correct me.
>>>>>>>
>>>>>>> The very general approach is:
>>>>>>> 1) generate a random ID
>>>>>>> 2) check that the ID doesn't already exist
>>>>>>> 3) if exists, go to 1.  if not, record the ID and release it to
>>>>>>> the
>>>>>>> caller.
>>>>>>>
>>>>>>> The following code for generating a URL:
>>>>>>>
>>>>>>> Random generator = new Random();
>>>>>>> int seed = Math.abs(generator.nextInt());
>>>>>>>
>>>>>>> has an issue with collision which is addressed by checking for  
>>>>>>> the
>>>>>>> existence of that ID before handing it out but the algorithm run
>>>>>>> time
>>>>>>> will continue to grow because of the growing chance of collision
>>>>>>> over
>>>>>>> time.
>>>>>>>
>>>>>>> Explanation of concern:
>>>>>>> The first generated ID will have a 0/n chance of collision where
>>>>>>> n is
>>>>>>> the total number of possible IDs.  The next ID has a 1/n chance,
>>>>>>> the
>>>>>>> next 2/n, etc up to the last having a chance of collision being
>>>>>>> n-1/n.
>>>>>>> Considering n needs to be sufficiently large to have a low to  
>>>>>>> zero
>>>>>>> rate of duplication over time, this check will take more time
>>>>>>> because
>>>>>>> of the number of collisions over time.  Assuming that the ID
>>>>>>> column in
>>>>>>> the database is indexed, the lookup time is minimized but  
>>>>>>> there's
>>>>>>> still a call to the database lookup on each generated ID.  As  
>>>>>>> the
>>>>>>> number of IDs generated reaches the max possible, there are
>>>>>>> possibly
>>>>>>> n-1 lookups before a valid ID is found.
>>>>>>>
>>>>>>> The current code recreates the Random each time which stops
>>>>>>> nextInt()
>>>>>>> from knowing history (ie. new random seed on each generation).
>>>>>>> Java
>>>>>>> implements a linear congruential generator to generate random
>>>>>>> numbers
>>>>>>> which requires knowledge of the previously generated ID to be
>>>>>>> effective.  Math.random() creates the random seed once and
>>>>>>> reuses it
>>>>>>> for the life of the JVM process which is a better approach.
>>>>>>>
>>>>>>> I would suggest creating an LCG in the TinyURL service so that  
>>>>>>> the
>>>>>>> proper range can be easily managed (better performance due to a
>>>>>>> narrower scope than Math.random()) and the seed can be restored
>>>>>>> after
>>>>>>> a restart.  I have created a prototype in Python based on the
>>>>>>> rules
>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>>>> sufficiently large prime number.  This still carries a concern  
>>>>>>> of
>>>>>>> generating IDs across a cluster but is relatively easy to  
>>>>>>> address.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>>>> _______________________________________________
>>>>>>> sakai-dev mailing list
>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>
>>>>>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>>> _______________________________________________
>>>>>> sakai-dev mailing list
>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>
>>>>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------
>>>>> This message was sent using IMP, the Internet Messaging Program.
>>>>>
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>> with a subject of "unsubscribe"
>>>>>
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>> with a subject of "unsubscribe"
>>>>
>>>
>>>
>>>
>>> --
>>> Aaron Zeckoski (azeckoski (at) vt.edu)
>>> Senior Research Engineer - CARET - University of Cambridge
>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>> with a subject of "unsubscribe"
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>> with a subject of "unsubscribe"
>>
>
>
>
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>



More information about the sakai-dev mailing list