[Building Sakai] TinyURL Service Impl Issue

Steve Swinsburg s.swinsburg at lancaster.ac.uk
Wed Jul 8 13:43:50 PDT 2009


The security is handled in the container anyway so you get a shortened  
URL and it resolves into the full URL, redirects and Sakai takes over.  
Even if it was sequential it wouldn't be a security concern.

With a keyspace of lowercase letters and numbers(36 chars) there are  
about 2 billion combinations, with uppercase as well there's 50  
billion. Info here on the theoretical limits:
http://bit.ly/3QG2uj

Using RandomStringUtils from commons-lang, I wrote a short method to  
iterate over a million alphanumeric combinations using letters and  
numbers of length 6 chars and didn't run into a single collision.  
Using this method and a transactional check when adding into the  
database is probably the way to go.

We can also filter for bad words, BBC has an interesting list (thanks  
Seth! :)

WDYT?

Steve


On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:

> Is there a reason why the id generation could be more like tinyURL and
> simply be sequential and thus very short (at least initially and
> probably for most institutions it would stay small).
>
> Just a thought.
> -AZ
>
>
> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>  
> wrote:
>> You could consider using 22 character uuids as autogenerated tinies  
>> in
>> the url. It might not be able to be called tinyurl anymore, but it
>> would at least be short, linkable and (much more) collision free.
>>
>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>
>> I'm not convinced it would be necessary to keep these id's from being
>> unpredictable. The current security model should prevent anyone from
>> accessing content they don't have permission to even guessing id's. I
>> think eventually when a UI was designed for this it might even want  
>> to
>> allow the user to specify a custom ID, similar to how the Site  
>> Aliases
>> (on new site creation) works, allowing a user to specify anything  
>> they
>> want for their site alias. (Like tinyurl.com) So this would almost
>> guarantee collisions.
>>
>> So I wouldn't think that the possibility of a collision on an low use
>> feature for an indexed column (when collisions will probably be
>> featured in eventually anyway) likely won't be a big deal.
>>
>> -Matthew
>>
>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>>> Please take a look at the comments at the head of the  
>>> "EighteenIDGenerator"
>>> file for some details of the maths involved in computing how many
>>> documents will arise before a collision:
>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>
>>> I wouldn't really recommend using a key space smaller than this one
>>> if you want to avoid collisions - although of course you will  
>>> reprobe
>>> to be sure. Another issue with LCGs is that they are highly  
>>> predictable -
>>> this is why I use a SecureRandom here: One of the requirements of  
>>> the
>>> TinyURLService that I made for the work at CARET was that the ID
>>> sequence could not be predicted - to prevent the possibility of  
>>> malicious
>>> "hijacking" of upcoming URL slots (this was actually a system  
>>> designed
>>> to allocate small URLs that were plaintext signup URLs routed to a  
>>> PHP
>>> system, perhaps more risk than is present in a typical application!)
>>>
>>> Note that a further issue you have with an LCG is running it in a  
>>> cluster -
>>> even if you can lock on the particular instance you have, you can't
>>> guarantee that some other instance in the cluster has not stepped on
>>> past the slot you want. I think that the approach in my service
>>> (a SecureRandom plus a transactional check into persistence) is most
>>> likely to work reasonably well in a cluster - although it hasn't  
>>> been
>>> tested there :P
>>>
>>> Cheers,
>>> Antranig.
>>>
>>>
>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>
>>>> Carl,
>>>>
>>>> Sounds good, is there an existing LCG Java implementation we can
>>>> leverage or should we roll our own?
>>>>
>>>> cheers,
>>>> Steve
>>>>
>>>>
>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>
>>>>> Summary: There is an ever increasing probability that the  
>>>>> collision of
>>>>> IDs in the TinyURL service will slow down the service  
>>>>> considerably.
>>>>>
>>>>> In looking at the implementation of the TinyURL service, I've  
>>>>> noticed
>>>>> something about the way URLs are generated.  If I'm headed in the
>>>>> wrong direction with this, please correct me.
>>>>>
>>>>> The very general approach is:
>>>>> 1) generate a random ID
>>>>> 2) check that the ID doesn't already exist
>>>>> 3) if exists, go to 1.  if not, record the ID and release it to  
>>>>> the
>>>>> caller.
>>>>>
>>>>> The following code for generating a URL:
>>>>>
>>>>> Random generator = new Random();
>>>>> int seed = Math.abs(generator.nextInt());
>>>>>
>>>>> has an issue with collision which is addressed by checking for the
>>>>> existence of that ID before handing it out but the algorithm run  
>>>>> time
>>>>> will continue to grow because of the growing chance of collision  
>>>>> over
>>>>> time.
>>>>>
>>>>> Explanation of concern:
>>>>> The first generated ID will have a 0/n chance of collision where  
>>>>> n is
>>>>> the total number of possible IDs.  The next ID has a 1/n chance,  
>>>>> the
>>>>> next 2/n, etc up to the last having a chance of collision being  
>>>>> n-1/n.
>>>>> Considering n needs to be sufficiently large to have a low to zero
>>>>> rate of duplication over time, this check will take more time  
>>>>> because
>>>>> of the number of collisions over time.  Assuming that the ID  
>>>>> column in
>>>>> the database is indexed, the lookup time is minimized but there's
>>>>> still a call to the database lookup on each generated ID.  As the
>>>>> number of IDs generated reaches the max possible, there are  
>>>>> possibly
>>>>> n-1 lookups before a valid ID is found.
>>>>>
>>>>> The current code recreates the Random each time which stops  
>>>>> nextInt()
>>>>> from knowing history (ie. new random seed on each generation).   
>>>>> Java
>>>>> implements a linear congruential generator to generate random  
>>>>> numbers
>>>>> which requires knowledge of the previously generated ID to be
>>>>> effective.  Math.random() creates the random seed once and  
>>>>> reuses it
>>>>> for the life of the JVM process which is a better approach.
>>>>>
>>>>> I would suggest creating an LCG in the TinyURL service so that the
>>>>> proper range can be easily managed (better performance due to a
>>>>> narrower scope than Math.random()) and the seed can be restored  
>>>>> after
>>>>> a restart.  I have created a prototype in Python based on the  
>>>>> rules
>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>> sufficiently large prime number.  This still carries a concern of
>>>>> generating IDs across a cluster but is relatively easy to address.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>  with a subject of "unsubscribe"
>>>>
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>> with a subject of "unsubscribe"
>>>>
>>>
>>>
>>>
>>>
>>> ----------------------------------------------------------------
>>> This message was sent using IMP, the Internet Messaging Program.
>>>
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>>>  with a subject of "unsubscribe"
>>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>>  with a subject of "unsubscribe"
>>
>
>
>
> -- 
> Aaron Zeckoski (azeckoski (at) vt.edu)
> Senior Research Engineer - CARET - University of Cambridge
> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>  with a subject of "unsubscribe"



More information about the sakai-dev mailing list