[Building Sakai] TinyURL Service Impl Issue

Steve Swinsburg steve.swinsburg at gmail.com
Sat Jul 18 02:29:07 PDT 2009


What, that you had a good idea? Or someone would implement it :) Just  
got a Jira space, I'll file it.


cheers,
Steve


On 18/07/2009, at 2:56 AM, Jon Gorrono wrote:

> I'm sort of reeling over this.... are you sure?
>
> unaccustomed and unprepared for this kind of response...
>
> :)
>
>
>
> On Wed, Jul 15, 2009 at 2:51 AM, Steve
> Swinsburg<s.swinsburg at lancaster.ac.uk> wrote:
>> Good idea, I'll add that as an additional method where you specify  
>> the TTL
>> before the URL is no longer resolved.
>>
>>
>> cheers,
>> Steve
>>
>> ---
>> Steve Swinsburg
>> Portal Systems Developer
>> Centre for e-Science
>> Lancaster University
>> Lancaster
>> LA1 4YT
>>
>> email: s.swinsburg at lancaster.ac.uk
>> phone: +44 (0) 1524 594870
>>
>>
>>
>>
>>
>>
>>
>> On 15 Jul 2009, at 04:27, Jon Gorrono wrote:
>>
>>> I have been a little curious why timeboxing is not used to decrease
>>> collisions for this service.... we use the service in a custom guest
>>> access app we made to put registration urls etc in emails....
>>> obviously we want to timebox access to the urls but a ttl can be
>>> xtremely effective in reducing an otherwise skewed data set to
>>> something resembling white noise... as it give a small window view  
>>> of
>>> a larger distribution
>>>
>>> Maybe an api call could change the ttl and return the probability of
>>> collision during that time or some other silliness
>>>
>>>
>>>
>>> On Wed, Jul 8, 2009 at 2:52 PM, Steve
>>> Swinsburg<s.swinsburg at lancaster.ac.uk> wrote:
>>>>
>>>> Ok I've added an alternative method to return a longer random id  
>>>> which
>>>> makes it less guessable. The number of permutations with this combo
>>>> has 30 zeroes (on a-z and 0-9).
>>>>
>>>> Have some fun with the calculator! http://bit.ly/gJP1X
>>>>
>>>> cheers,
>>>> Steve
>>>>
>>>>
>>>>
>>>> On 8 Jul 2009, at 17:25, antranig at caret.cam.ac.uk wrote:
>>>>
>>>>> The issue with collisions occurs when you are hosting URLs which  
>>>>> are
>>>>> not protected by the container, but held by some external  
>>>>> resource -
>>>>> for example, a signup system "click here to validate your email to
>>>>> Xxxxx service". The implementation I made actually runs in two  
>>>>> modes,
>>>>> you may request either "SECURE" or "SHORT" URLs by an option to  
>>>>> the
>>>>> API. The short URLs start at only a couple of characters and
>>>>> are indeed sequential.
>>>>> Anyway, feel free to implement your own system, it is the Sakai  
>>>>> way
>>>>> after all :)
>>>>>
>>>>> Cheers,
>>>>> Antranig.
>>>>>
>>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>>
>>>>>> The security is handled in the container anyway so you get a
>>>>>> shortened
>>>>>> URL and it resolves into the full URL, redirects and Sakai takes
>>>>>> over.
>>>>>> Even if it was sequential it wouldn't be a security concern.
>>>>>>
>>>>>> With a keyspace of lowercase letters and numbers(36 chars)  
>>>>>> there are
>>>>>> about 2 billion combinations, with uppercase as well there's 50
>>>>>> billion. Info here on the theoretical limits:
>>>>>> http://bit.ly/3QG2uj
>>>>>>
>>>>>> Using RandomStringUtils from commons-lang, I wrote a short  
>>>>>> method to
>>>>>> iterate over a million alphanumeric combinations using letters  
>>>>>> and
>>>>>> numbers of length 6 chars and didn't run into a single collision.
>>>>>> Using this method and a transactional check when adding into the
>>>>>> database is probably the way to go.
>>>>>>
>>>>>> We can also filter for bad words, BBC has an interesting list  
>>>>>> (thanks
>>>>>> Seth! :)
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> Steve
>>>>>>
>>>>>>
>>>>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>>>>>>
>>>>>>> Is there a reason why the id generation could be more like  
>>>>>>> tinyURL
>>>>>>> and
>>>>>>> simply be sequential and thus very short (at least initially and
>>>>>>> probably for most institutions it would stay small).
>>>>>>>
>>>>>>> Just a thought.
>>>>>>> -AZ
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> You could consider using 22 character uuids as autogenerated  
>>>>>>>> tinies
>>>>>>>> in
>>>>>>>> the url. It might not be able to be called tinyurl anymore,  
>>>>>>>> but it
>>>>>>>> would at least be short, linkable and (much more) collision  
>>>>>>>> free.
>>>>>>>>
>>>>>>>>
>>>>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>>>>>>>
>>>>>>>> I'm not convinced it would be necessary to keep these id's from
>>>>>>>> being
>>>>>>>> unpredictable. The current security model should prevent anyone
>>>>>>>> from
>>>>>>>> accessing content they don't have permission to even guessing
>>>>>>>> id's. I
>>>>>>>> think eventually when a UI was designed for this it might  
>>>>>>>> even want
>>>>>>>> to
>>>>>>>> allow the user to specify a custom ID, similar to how the Site
>>>>>>>> Aliases
>>>>>>>> (on new site creation) works, allowing a user to specify  
>>>>>>>> anything
>>>>>>>> they
>>>>>>>> want for their site alias. (Like tinyurl.com) So this would  
>>>>>>>> almost
>>>>>>>> guarantee collisions.
>>>>>>>>
>>>>>>>> So I wouldn't think that the possibility of a collision on an  
>>>>>>>> low
>>>>>>>> use
>>>>>>>> feature for an indexed column (when collisions will probably be
>>>>>>>> featured in eventually anyway) likely won't be a big deal.
>>>>>>>>
>>>>>>>> -Matthew
>>>>>>>>
>>>>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk>  
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Please take a look at the comments at the head of the
>>>>>>>>> "EighteenIDGenerator"
>>>>>>>>> file for some details of the maths involved in computing how  
>>>>>>>>> many
>>>>>>>>> documents will arise before a collision:
>>>>>>>>>
>>>>>>
>>>>>
>>>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>>>>>>>
>>>>>>>>> I wouldn't really recommend using a key space smaller than  
>>>>>>>>> this
>>>>>>>>> one
>>>>>>>>> if you want to avoid collisions - although of course you will
>>>>>>>>> reprobe
>>>>>>>>> to be sure. Another issue with LCGs is that they are highly
>>>>>>>>> predictable -
>>>>>>>>> this is why I use a SecureRandom here: One of the  
>>>>>>>>> requirements of
>>>>>>>>> the
>>>>>>>>> TinyURLService that I made for the work at CARET was that  
>>>>>>>>> the ID
>>>>>>>>> sequence could not be predicted - to prevent the possibility  
>>>>>>>>> of
>>>>>>>>> malicious
>>>>>>>>> "hijacking" of upcoming URL slots (this was actually a system
>>>>>>>>> designed
>>>>>>>>> to allocate small URLs that were plaintext signup URLs  
>>>>>>>>> routed to a
>>>>>>>>> PHP
>>>>>>>>> system, perhaps more risk than is present in a typical
>>>>>>>>> application!)
>>>>>>>>>
>>>>>>>>> Note that a further issue you have with an LCG is running it  
>>>>>>>>> in a
>>>>>>>>> cluster -
>>>>>>>>> even if you can lock on the particular instance you have, you
>>>>>>>>> can't
>>>>>>>>> guarantee that some other instance in the cluster has not
>>>>>>>>> stepped on
>>>>>>>>> past the slot you want. I think that the approach in my  
>>>>>>>>> service
>>>>>>>>> (a SecureRandom plus a transactional check into persistence)  
>>>>>>>>> is
>>>>>>>>> most
>>>>>>>>> likely to work reasonably well in a cluster - although it  
>>>>>>>>> hasn't
>>>>>>>>> been
>>>>>>>>> tested there :P
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Antranig.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>>>>>>
>>>>>>>>>> Carl,
>>>>>>>>>>
>>>>>>>>>> Sounds good, is there an existing LCG Java implementation  
>>>>>>>>>> we can
>>>>>>>>>> leverage or should we roll our own?
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> Steve
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>>>>>>>
>>>>>>>>>>> Summary: There is an ever increasing probability that the
>>>>>>>>>>> collision of
>>>>>>>>>>> IDs in the TinyURL service will slow down the service
>>>>>>>>>>> considerably.
>>>>>>>>>>>
>>>>>>>>>>> In looking at the implementation of the TinyURL service,  
>>>>>>>>>>> I've
>>>>>>>>>>> noticed
>>>>>>>>>>> something about the way URLs are generated.  If I'm headed  
>>>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>> wrong direction with this, please correct me.
>>>>>>>>>>>
>>>>>>>>>>> The very general approach is:
>>>>>>>>>>> 1) generate a random ID
>>>>>>>>>>> 2) check that the ID doesn't already exist
>>>>>>>>>>> 3) if exists, go to 1.  if not, record the ID and release  
>>>>>>>>>>> it to
>>>>>>>>>>> the
>>>>>>>>>>> caller.
>>>>>>>>>>>
>>>>>>>>>>> The following code for generating a URL:
>>>>>>>>>>>
>>>>>>>>>>> Random generator = new Random();
>>>>>>>>>>> int seed = Math.abs(generator.nextInt());
>>>>>>>>>>>
>>>>>>>>>>> has an issue with collision which is addressed by checking  
>>>>>>>>>>> for
>>>>>>>>>>> the
>>>>>>>>>>> existence of that ID before handing it out but the  
>>>>>>>>>>> algorithm run
>>>>>>>>>>> time
>>>>>>>>>>> will continue to grow because of the growing chance of  
>>>>>>>>>>> collision
>>>>>>>>>>> over
>>>>>>>>>>> time.
>>>>>>>>>>>
>>>>>>>>>>> Explanation of concern:
>>>>>>>>>>> The first generated ID will have a 0/n chance of collision  
>>>>>>>>>>> where
>>>>>>>>>>> n is
>>>>>>>>>>> the total number of possible IDs.  The next ID has a 1/n  
>>>>>>>>>>> chance,
>>>>>>>>>>> the
>>>>>>>>>>> next 2/n, etc up to the last having a chance of collision  
>>>>>>>>>>> being
>>>>>>>>>>> n-1/n.
>>>>>>>>>>> Considering n needs to be sufficiently large to have a low  
>>>>>>>>>>> to
>>>>>>>>>>> zero
>>>>>>>>>>> rate of duplication over time, this check will take more  
>>>>>>>>>>> time
>>>>>>>>>>> because
>>>>>>>>>>> of the number of collisions over time.  Assuming that the ID
>>>>>>>>>>> column in
>>>>>>>>>>> the database is indexed, the lookup time is minimized but
>>>>>>>>>>> there's
>>>>>>>>>>> still a call to the database lookup on each generated ID.   
>>>>>>>>>>> As
>>>>>>>>>>> the
>>>>>>>>>>> number of IDs generated reaches the max possible, there are
>>>>>>>>>>> possibly
>>>>>>>>>>> n-1 lookups before a valid ID is found.
>>>>>>>>>>>
>>>>>>>>>>> The current code recreates the Random each time which stops
>>>>>>>>>>> nextInt()
>>>>>>>>>>> from knowing history (ie. new random seed on each  
>>>>>>>>>>> generation).
>>>>>>>>>>> Java
>>>>>>>>>>> implements a linear congruential generator to generate  
>>>>>>>>>>> random
>>>>>>>>>>> numbers
>>>>>>>>>>> which requires knowledge of the previously generated ID to  
>>>>>>>>>>> be
>>>>>>>>>>> effective.  Math.random() creates the random seed once and
>>>>>>>>>>> reuses it
>>>>>>>>>>> for the life of the JVM process which is a better approach.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest creating an LCG in the TinyURL service so  
>>>>>>>>>>> that
>>>>>>>>>>> the
>>>>>>>>>>> proper range can be easily managed (better performance due  
>>>>>>>>>>> to a
>>>>>>>>>>> narrower scope than Math.random()) and the seed can be  
>>>>>>>>>>> restored
>>>>>>>>>>> after
>>>>>>>>>>> a restart.  I have created a prototype in Python based on  
>>>>>>>>>>> the
>>>>>>>>>>> rules
>>>>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>>>>>>>> sufficiently large prime number.  This still carries a  
>>>>>>>>>>> concern
>>>>>>>>>>> of
>>>>>>>>>>> generating IDs across a cluster but is relatively easy to
>>>>>>>>>>> address.
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> sakai-dev mailing list
>>>>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>>>>
>>>>>>>>>>> TO UNSUBSCRIBE: send email to
>>>>>>
>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>>>>>
>>>>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> sakai-dev mailing list
>>>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>>>
>>>>>>>>>> TO UNSUBSCRIBE: send email to
>>>>>>
>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>>>>
>>>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>> This message was sent using IMP, the Internet Messaging  
>>>>>>>>> Program.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> sakai-dev mailing list
>>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>>
>>>>>>>>> TO UNSUBSCRIBE: send email to
>>>>>>
>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>>>
>>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> sakai-dev mailing list
>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>
>>>>>>>> TO UNSUBSCRIBE: send email to
>>>>>>
>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>>
>>>>>>>> with a subject of "unsubscribe"
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Aaron Zeckoski (azeckoski (at) vt.edu)
>>>>>>> Senior Research Engineer - CARET - University of Cambridge
>>>>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>>>>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>>>>>>> _______________________________________________
>>>>>>> sakai-dev mailing list
>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>
>>>>>>> TO UNSUBSCRIBE: send email to
>>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>>> _______________________________________________
>>>>>> sakai-dev mailing list
>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>
>>>>>> TO UNSUBSCRIBE: send email to
>>>>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------
>>>>> This message was sent using IMP, the Internet Messaging Program.
>>>>>
>>>>
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to
>>>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>>>> "unsubscribe"
>>>>
>>>
>>>
>>>
>>> --
>>> Jon Gorrono
>>> PGP Key: 5434509D
>>>
>>> email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++ 
>>> +++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ 
>>> +.>---.>>+.<<<----.-.>++.}
>>> http{ats.ucdavis.edu}
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to
>>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>>> "unsubscribe"
>>
>>
>
>
>
> -- 
> Jon Gorrono
> PGP Key: 5434509D
> email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++++ 
> +.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ 
> +.>---.>>+.<<<----.-.>++.}
> http{ats.ucdavis.edu}
>
> Sent from Davis, CA, United States
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org 
>  with a subject of "unsubscribe"



More information about the sakai-dev mailing list