[Building Sakai] TinyURL Service Impl Issue

Jon Gorrono jpgorrono at ucdavis.edu
Tue Jul 14 20:27:31 PDT 2009


I have been a little curious why timeboxing is not used to decrease
collisions for this service.... we use the service in a custom guest
access app we made to put registration urls etc in emails....
obviously we want to timebox access to the urls but a ttl can be
xtremely effective in reducing an otherwise skewed data set to
something resembling white noise... as it give a small window view of
a larger distribution

Maybe an api call could change the ttl and return the probability of
collision during that time or some other silliness



On Wed, Jul 8, 2009 at 2:52 PM, Steve
Swinsburg<s.swinsburg at lancaster.ac.uk> wrote:
> Ok I've added an alternative method to return a longer random id which
> makes it less guessable. The number of permutations with this combo
> has 30 zeroes (on a-z and 0-9).
>
> Have some fun with the calculator! http://bit.ly/gJP1X
>
> cheers,
> Steve
>
>
>
> On 8 Jul 2009, at 17:25, antranig at caret.cam.ac.uk wrote:
>
>> The issue with collisions occurs when you are hosting URLs which are
>> not protected by the container, but held by some external resource -
>> for example, a signup system "click here to validate your email to
>> Xxxxx service". The implementation I made actually runs in two modes,
>> you may request either "SECURE" or "SHORT" URLs by an option to the
>> API. The short URLs start at only a couple of characters and
>> are indeed sequential.
>> Anyway, feel free to implement your own system, it is the Sakai way
>> after all :)
>>
>> Cheers,
>> Antranig.
>>
>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>
>>> The security is handled in the container anyway so you get a
>>> shortened
>>> URL and it resolves into the full URL, redirects and Sakai takes
>>> over.
>>> Even if it was sequential it wouldn't be a security concern.
>>>
>>> With a keyspace of lowercase letters and numbers(36 chars) there are
>>> about 2 billion combinations, with uppercase as well there's 50
>>> billion. Info here on the theoretical limits:
>>> http://bit.ly/3QG2uj
>>>
>>> Using RandomStringUtils from commons-lang, I wrote a short method to
>>> iterate over a million alphanumeric combinations using letters and
>>> numbers of length 6 chars and didn't run into a single collision.
>>> Using this method and a transactional check when adding into the
>>> database is probably the way to go.
>>>
>>> We can also filter for bad words, BBC has an interesting list (thanks
>>> Seth! :)
>>>
>>> WDYT?
>>>
>>> Steve
>>>
>>>
>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote:
>>>
>>>> Is there a reason why the id generation could be more like tinyURL
>>>> and
>>>> simply be sequential and thus very short (at least initially and
>>>> probably for most institutions it would stay small).
>>>>
>>>> Just a thought.
>>>> -AZ
>>>>
>>>>
>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm at umich.edu>
>>>> wrote:
>>>>> You could consider using 22 character uuids as autogenerated tinies
>>>>> in
>>>>> the url. It might not be able to be called tinyurl anymore, but it
>>>>> would at least be short, linkable and (much more) collision free.
>>>>>
>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string
>>>>>
>>>>> I'm not convinced it would be necessary to keep these id's from
>>>>> being
>>>>> unpredictable. The current security model should prevent anyone
>>>>> from
>>>>> accessing content they don't have permission to even guessing
>>>>> id's. I
>>>>> think eventually when a UI was designed for this it might even want
>>>>> to
>>>>> allow the user to specify a custom ID, similar to how the Site
>>>>> Aliases
>>>>> (on new site creation) works, allowing a user to specify anything
>>>>> they
>>>>> want for their site alias. (Like tinyurl.com) So this would almost
>>>>> guarantee collisions.
>>>>>
>>>>> So I wouldn't think that the possibility of a collision on an low
>>>>> use
>>>>> feature for an indexed column (when collisions will probably be
>>>>> featured in eventually anyway) likely won't be a big deal.
>>>>>
>>>>> -Matthew
>>>>>
>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig at caret.cam.ac.uk> wrote:
>>>>>> Please take a look at the comments at the head of the
>>>>>> "EighteenIDGenerator"
>>>>>> file for some details of the maths involved in computing how many
>>>>>> documents will arise before a collision:
>>>>>>
>>>
>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java
>>>>>>
>>>>>> I wouldn't really recommend using a key space smaller than this
>>>>>> one
>>>>>> if you want to avoid collisions - although of course you will
>>>>>> reprobe
>>>>>> to be sure. Another issue with LCGs is that they are highly
>>>>>> predictable -
>>>>>> this is why I use a SecureRandom here: One of the requirements of
>>>>>> the
>>>>>> TinyURLService that I made for the work at CARET was that the ID
>>>>>> sequence could not be predicted - to prevent the possibility of
>>>>>> malicious
>>>>>> "hijacking" of upcoming URL slots (this was actually a system
>>>>>> designed
>>>>>> to allocate small URLs that were plaintext signup URLs routed to a
>>>>>> PHP
>>>>>> system, perhaps more risk than is present in a typical
>>>>>> application!)
>>>>>>
>>>>>> Note that a further issue you have with an LCG is running it in a
>>>>>> cluster -
>>>>>> even if you can lock on the particular instance you have, you
>>>>>> can't
>>>>>> guarantee that some other instance in the cluster has not
>>>>>> stepped on
>>>>>> past the slot you want. I think that the approach in my service
>>>>>> (a SecureRandom plus a transactional check into persistence) is
>>>>>> most
>>>>>> likely to work reasonably well in a cluster - although it hasn't
>>>>>> been
>>>>>> tested there :P
>>>>>>
>>>>>> Cheers,
>>>>>> Antranig.
>>>>>>
>>>>>>
>>>>>> Quoting Steve Swinsburg <s.swinsburg at lancaster.ac.uk>:
>>>>>>
>>>>>>> Carl,
>>>>>>>
>>>>>>> Sounds good, is there an existing LCG Java implementation we can
>>>>>>> leverage or should we roll our own?
>>>>>>>
>>>>>>> cheers,
>>>>>>> Steve
>>>>>>>
>>>>>>>
>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote:
>>>>>>>
>>>>>>>> Summary: There is an ever increasing probability that the
>>>>>>>> collision of
>>>>>>>> IDs in the TinyURL service will slow down the service
>>>>>>>> considerably.
>>>>>>>>
>>>>>>>> In looking at the implementation of the TinyURL service, I've
>>>>>>>> noticed
>>>>>>>> something about the way URLs are generated.  If I'm headed in
>>>>>>>> the
>>>>>>>> wrong direction with this, please correct me.
>>>>>>>>
>>>>>>>> The very general approach is:
>>>>>>>> 1) generate a random ID
>>>>>>>> 2) check that the ID doesn't already exist
>>>>>>>> 3) if exists, go to 1.  if not, record the ID and release it to
>>>>>>>> the
>>>>>>>> caller.
>>>>>>>>
>>>>>>>> The following code for generating a URL:
>>>>>>>>
>>>>>>>> Random generator = new Random();
>>>>>>>> int seed = Math.abs(generator.nextInt());
>>>>>>>>
>>>>>>>> has an issue with collision which is addressed by checking for
>>>>>>>> the
>>>>>>>> existence of that ID before handing it out but the algorithm run
>>>>>>>> time
>>>>>>>> will continue to grow because of the growing chance of collision
>>>>>>>> over
>>>>>>>> time.
>>>>>>>>
>>>>>>>> Explanation of concern:
>>>>>>>> The first generated ID will have a 0/n chance of collision where
>>>>>>>> n is
>>>>>>>> the total number of possible IDs.  The next ID has a 1/n chance,
>>>>>>>> the
>>>>>>>> next 2/n, etc up to the last having a chance of collision being
>>>>>>>> n-1/n.
>>>>>>>> Considering n needs to be sufficiently large to have a low to
>>>>>>>> zero
>>>>>>>> rate of duplication over time, this check will take more time
>>>>>>>> because
>>>>>>>> of the number of collisions over time.  Assuming that the ID
>>>>>>>> column in
>>>>>>>> the database is indexed, the lookup time is minimized but
>>>>>>>> there's
>>>>>>>> still a call to the database lookup on each generated ID.  As
>>>>>>>> the
>>>>>>>> number of IDs generated reaches the max possible, there are
>>>>>>>> possibly
>>>>>>>> n-1 lookups before a valid ID is found.
>>>>>>>>
>>>>>>>> The current code recreates the Random each time which stops
>>>>>>>> nextInt()
>>>>>>>> from knowing history (ie. new random seed on each generation).
>>>>>>>> Java
>>>>>>>> implements a linear congruential generator to generate random
>>>>>>>> numbers
>>>>>>>> which requires knowledge of the previously generated ID to be
>>>>>>>> effective.  Math.random() creates the random seed once and
>>>>>>>> reuses it
>>>>>>>> for the life of the JVM process which is a better approach.
>>>>>>>>
>>>>>>>> I would suggest creating an LCG in the TinyURL service so that
>>>>>>>> the
>>>>>>>> proper range can be easily managed (better performance due to a
>>>>>>>> narrower scope than Math.random()) and the seed can be restored
>>>>>>>> after
>>>>>>>> a restart.  I have created a prototype in Python based on the
>>>>>>>> rules
>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a
>>>>>>>> sufficiently large prime number.  This still carries a concern
>>>>>>>> of
>>>>>>>> generating IDs across a cluster but is relatively easy to
>>>>>>>> address.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator
>>>>>>>> _______________________________________________
>>>>>>>> sakai-dev mailing list
>>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>>
>>>>>>>> TO UNSUBSCRIBE: send email to
>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>>> with a subject of "unsubscribe"
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> sakai-dev mailing list
>>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>>
>>>>>>> TO UNSUBSCRIBE: send email to
>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>>> with a subject of "unsubscribe"
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----------------------------------------------------------------
>>>>>> This message was sent using IMP, the Internet Messaging Program.
>>>>>>
>>>>>> _______________________________________________
>>>>>> sakai-dev mailing list
>>>>>> sakai-dev at collab.sakaiproject.org
>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>>
>>>>>> TO UNSUBSCRIBE: send email to
>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>>> with a subject of "unsubscribe"
>>>>>>
>>>>> _______________________________________________
>>>>> sakai-dev mailing list
>>>>> sakai-dev at collab.sakaiproject.org
>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>>
>>>>> TO UNSUBSCRIBE: send email to
>>> sakai-dev-unsubscribe at collab.sakaiproject.org
>>>>> with a subject of "unsubscribe"
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Aaron Zeckoski (azeckoski (at) vt.edu)
>>>> Senior Research Engineer - CARET - University of Cambridge
>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
>>>> _______________________________________________
>>>> sakai-dev mailing list
>>>> sakai-dev at collab.sakaiproject.org
>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>>
>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>>> with a subject of "unsubscribe"
>>>
>>> _______________________________________________
>>> sakai-dev mailing list
>>> sakai-dev at collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>>
>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org
>>> with a subject of "unsubscribe"
>>>
>>
>>
>>
>>
>> ----------------------------------------------------------------
>> This message was sent using IMP, the Internet Messaging Program.
>>
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of "unsubscribe"
>



-- 
Jon Gorrono
PGP Key: 5434509D
email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>+++++++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<++++++++.>---.>>+.<<<----.-.>++.}
http{ats.ucdavis.edu}


More information about the sakai-dev mailing list