[Building Sakai] TinyURL Service Impl Issue

Steve Swinsburg s.swinsburg at lancaster.ac.uk
Wed Jul 15 02:51:43 PDT 2009


Good idea, I'll add that as an additional method where you specify the  
TTL before the URL is no longer resolved.


cheers,
Steve

---
Steve Swinsburg
Portal Systems Developer
Centre for e-Science
Lancaster University
Lancaster
LA1 4YT

email: s.swinsburg at lancaster.ac.uk
phone: +44 (0) 1524 594870







On 15 Jul 2009, at 04:27, Jon Gorrono wrote:

> I have been a little curious why timeboxing is not used to decrease
> collisions for this service.... we use the service in a custom guest
> access app we made to put registration urls etc in emails....
> obviously we want to timebox access to the urls but a ttl can be
> xtremely effective in reducing an otherwise skewed data set to
> something resembling white noise... as it give a small window view of
> a larger distribution
>
> Maybe an api call could change the ttl and return the probability of
> collision during that time or some other silliness
>
>
>
> On Wed, Jul 8, 2009 at 2:52 PM, Steve
> Swinsburg<s.swinsburg at lancaster.ac.uk> wrote:
>> Ok I've added an alternative method to return a longer random id  
>> which
>> makes it less guessable. The number of permutations with this combo
>> has 30 zeroes (on a-z and 0-9).
>>
>> Have some fun with the calculator! http://bit.ly/gJP1X
>>
>> cheers,
>> Steve
>>
>>
>>
>> On 8 Jul 2009, at 17:25, antranig at caret.cam.ac.uk wrote:
>>
>>> The issue with collisions occurs when you are hosting URLs which are
>>> not protected by the container, but held by some external resource -
>>> for example, a signup system "click here to validate your email to
>>> Xxxxx service". The implementation I made actually runs in two  
>>> modes,
>>> you may request either "SECURE" or "SHORT" URLs by an option to the
>>> API. The short URLs start at only a couple of characters and
>>> are indeed sequential.
>>> Anyway, feel free to implement your own system, it is the Sakai way
>>> after all :)
>>>
>>> Cheers,
>>> Antranig.
>>>
>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>
>>>> The security is handled in the container anyway so you get a
>>>> shortened
>>>> URL and it resolves into the full URL, redirects and Sakai takes
>>>> over.
>>>> Even if it was sequential it wouldn't be a security concern.
>>>>
>>>> With a keyspace of lowercase letters and numbers(36 chars) there  
>>>> are
>>>> about 2 billion combinations, with uppercase as well there's 50
>>>> billion. Info here on the theoretical limits:
>>>> http://bit.ly/3QG2uj
>>>>
>>>> Using RandomStringUtils from commons-lang, I wrote a short method  
>>>> to
>>>> iterate over a million alphanumeric combinations using letters and
>>>> numbers of length 6 chars and didn't run into a single collision.
>>>> Using this method and a transactional check when adding into the
>>>> database is probably the way to go.
>>>>
>>>> We can also filter for bad words, BBC has an interesting list  
>>>> (thanks
>>>> Seth! :)
>>>>
>>>> WDYT?
>>>>
>>>> Steve
>>>>
>>>>
>>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>>>>
>>>>> Is there a reason why the id generation could be more like tinyURL
>>>>> and
>>>>> simply be sequential and thus very short (at least initially and
>>>>> probably for most institutions it would stay small).
>>>>>
>>>>> Just a thought.
>>>>> -AZ
>>>>>
>>>>>
>>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>>>>> wrote:
>>>>>> You could consider using 22 character uuids as autogenerated  
>>>>>> tinies
>>>>>> in
>>>>>> the url. It might not be able to be called tinyurl anymore, but  
>>>>>> it
>>>>>> would at least be short, linkable and (much more) collision free.
>>>>>>
>>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>>>>>
>>>>>> I'm not convinced it would be necessary to keep these id's from
>>>>>> being
>>>>>> unpredictable. The current security model should prevent anyone
>>>>>> from
>>>>>> accessing content they don't have permission to even guessing
>>>>>> id's. I
>>>>>> think eventually when a UI was designed for this it might even  
>>>>>> want
>>>>>> to
>>>>>> allow the user to specify a custom ID, similar to how the Site
>>>>>> Aliases
>>>>>> (on new site creation) works, allowing a user to specify anything
>>>>>> they
>>>>>> want for their site alias. (Like tinyurl.com) So this would  
>>>>>> almost
>>>>>> guarantee collisions.
>>>>>>
>>>>>> So I wouldn't think that the possibility of a collision on an low
>>>>>> use
>>>>>> feature for an indexed column (when collisions will probably be
>>>>>> featured in eventually anyway) likely won't be a big deal.
>>>>>>
>>>>>> -Matthew
>>>>>>
>>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>>>>>>> Please take a look at the comments at the head of the
>>>>>>> "EighteenIDGenerator"
>>>>>>> file for some details of the maths involved in computing how  
>>>>>>> many
>>>>>>> documents will arise before a collision:
>>>>>>>
>>>>
>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>>>>>
>>>>>>> I wouldn't really recommend using a key space smaller than this
>>>>>>> one
>>>>>>> if you want to avoid collisions - although of course you will
>>>>>>> reprobe
>>>>>>> to be sure. Another issue with LCGs is that they are highly
>>>>>>> predictable -
>>>>>>> this is why I use a SecureRandom here: One of the requirements  
>>>>>>> of
>>>>>>> the
>>>>>>> TinyURLService that I made for the work at CARET was that the ID
>>>>>>> sequence could not be predicted - to prevent the possibility of
>>>>>>> malicious
>>>>>>> "hijacking" of upcoming URL slots (this was actually a system
>>>>>>> designed
>>>>>>> to allocate small URLs that were plaintext signup URLs routed  
>>>>>>> to a
>>>>>>> PHP
>>>>>>> system, perhaps more risk than is present in a typical
>>>>>>> application!)
>>>>>>>
>>>>>>> Note that a further issue you have with an LCG is running it  
>>>>>>> in a
>>>>>>> cluster -
>>>>>>> even if you can lock on the particular instance you have, you
>>>>>>> can't
>>>>>>> guarantee that some other instance in the cluster has not
>>>>>>> stepped on
>>>>>>> past the slot you want. I think that the approach in my service
>>>>>>> (a SecureRandom plus a transactional check into persistence) is
>>>>>>> most
>>>>>>> likely to work reasonably well in a cluster - although it hasn't
>>>>>>> been
>>>>>>> tested there :P
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Antranig.
>>>>>>>
>>>>>>>
>>>>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>>>>
>>>>>>>> Carl,
>>>>>>>>
>>>>>>>> Sounds good, is there an existing LCG Java implementation we  
>>>>>>>> can
>>>>>>>> leverage or should we roll our own?
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Steve
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>>>>>
>>>>>>>>> Summary: There is an ever increasing probability that the
>>>>>>>>> collision of
>>>>>>>>> IDs in the TinyURL service will slow down the service
>>>>>>>>> considerably.
>>>>>>>>>
>>>>>>>>> In looking at the implementation of the TinyURL service, I've
>>>>>>>>> noticed
>>>>>>>>> something about the way URLs are generated.  If I'm headed in
>>>>>>>>> the
>>>>>>>>> wrong direction with this, please correct me.
>>>>>>>>>
>>>>>>>>> The very general approach is:
>>>>>>>>> 1) generate a random ID
>>>>>>>>> 2) check that the ID doesn't already exist
>>>>>>>>> 3) if exists, go to 1.  if not, record the ID and release it  
>>>>>>>>> to
>>>>>>>>> the
>>>>>>>>> caller.
>>>>>>>>>
>>>>>>>>> The following code for generating a URL:
>>>>>>>>>
>>>>>>>>> Random generator = new Random();
>>>>>>>>> int seed = Math.abs(generator.nextInt());
>>>>>>>>>
>>>>>>>>> has an issue with collision which is addressed by checking for
>>>>>>>>> the
>>>>>>>>> existence of that ID before handing it out but the algorithm  
>>>>>>>>> run
>>>>>>>>> time
>>>>>>>>> will continue to grow because of the growing chance of  
>>>>>>>>> collision
>>>>>>>>> over
>>>>>>>>> time.
>>>>>>>>>
>>>>>>>>> Explanation of concern:
>>>>>>>>> The first generated ID will have a 0/n chance of collision  
>>>>>>>>> where
>>>>>>>>> n is
>>>>>>>>> the total number of possible IDs.  The next ID has a 1/n  
>>>>>>>>> chance,
>>>>>>>>> the
>>>>>>>>> next 2/n, etc up to the last having a chance of collision  
>>>>>>>>> being
>>>>>>>>> n-1/n.
>>>>>>>>> Considering n needs to be sufficiently large to have a low to
>>>>>>>>> zero
>>>>>>>>> rate of duplication over time, this check will take more time
>>>>>>>>> because
>>>>>>>>> of the number of collisions over time.  Assuming that the ID
>>>>>>>>> column in
>>>>>>>>> the database is indexed, the lookup time is minimized but
>>>>>>>>> there's
>>>>>>>>> still a call to the database lookup on each generated ID.  As
>>>>>>>>> the
>>>>>>>>> number of IDs generated reaches the max possible, there are
>>>>>>>>> possibly
>>>>>>>>> n-1 lookups before a valid ID is found.
>>>>>>>>>
>>>>>>>>> The current code recreates the Random each time which stops
>>>>>>>>> nextInt()
>>>>>>>>> from knowing history (ie. new random seed on each generation).
>>>>>>>>> Java
>>>>>>>>> implements a linear congruential generator to generate random
>>>>>>>>> numbers
>>>>>>>>> which requires knowledge of the previously generated ID to be
>>>>>>>>> effective.  Math.random() creates the random seed once and
>>>>>>>>> reuses it
>>>>>>>>> for the life of the JVM process which is a better approach.
>>>>>>>>>
>>>>>>>>> I would suggest creating an LCG in the TinyURL service so that
>>>>>>>>> the
>>>>>>>>> proper range can be easily managed (better performance due  
>>>>>>>>> to a
>>>>>>>>> narrower scope than Math.random()) and the seed can be  
>>>>>>>>> restored
>>>>>>>>> after
>>>>>>>>> a restart.  I have created a prototype in Python based on the
>>>>>>>>> rules
>>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>>>>>> sufficiently large prime number.  This still carries a concern
>>>>>>>>> of
>>>>>>>>> generating IDs across a cluster but is relatively easy to
>>>>>>>>> address.
>>>>>>>>>
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>>>>>> _______________________________________________
>>>>>>>>> sakai-dev mailing list
>>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>>
>>>>>>>>> TO UNSUBSCRIBE: send email to
>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> sakai-dev mailing list
>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>
>>>>>>>> TO UNSUBSCRIBE: send email to
>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----------------------------------------------------------------
>>>>>>> This message was sent using IMP, the Internet Messaging Program.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> sakai-dev mailing list
>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>
>>>>>>> TO UNSUBSCRIBE: send email to
>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>> with a subject of "unsubscribe"
>>>>>>>
>>>>>> _______________________________________________
>>>>>> sakai-dev mailing list
>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>
>>>>>> TO UNSUBSCRIBE: send email to
>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Aaron Zeckoski (azeckoski (at) vt.edu)
>>>>> Senior Research Engineer - CARET - University of Cambridge
>>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>> with a subject of "unsubscribe"
>>>>
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>> with a subject of "unsubscribe"
>>>>
>>>
>>>
>>>
>>>
>>> ----------------------------------------------------------------
>>> This message was sent using IMP, the Internet Messaging Program.
>>>
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>>  with a subject of "unsubscribe"
>>
>
>
>
> -- 
> Jon Gorrono
> PGP Key: 5434509D
> email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++++ 
> +.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ 
> +.>---.>>+.<<<----.-.>++.}
> http{ats.ucdavis.edu}
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>  with a subject of "unsubscribe"



More information about the sakai-dev mailing list