Wednesday, February 17, 2010

"Microcomputers: the unredeemed lollipop..."

So let us formalize the problem of finding a pair of strings (not necessarily distinct) from a dictionary of size N, such that the Java string hash code of their concatenation is Integer.MIN_VALUE or 0.

How quickly can this be done?

A quote from Joshua Bloch himself on how he found "polygenelubricants" (at 1:07:00 in the video):
I have a dictionary of 200,000 words, which means there are 40 billion word pairs, [...] and it took it 10 minutes to calculate all the hash values of every word pair
The quote suggests at least an O(N^2) solution, assuming that the hash code calculation is O(1) for each word pair (which can in fact be done).

I think you can do MUCH better than that, however. I'm still working on this as we speak, but here's a Maple snippet to shed some light:
> # "lubricants".hashCode() = -573982625
> # "lubricants".length() = 10
> msolve(x * 31^(10) + (-573982625) = 0, 2^31);
                                {x = 561786081}
> # 561786081 = "polygene".hashCode()
Thus, the problem is solvable in O(N) times the time it takes to do msolve.

My goal is to demonstrate the superiority of this technique by coming up with word triplets. If O(N^2) takes 10 minutes, then O(N^3) would take almost 4 years.

Specifically, we need to be able to solve this equation:
x = -H/31^L (mod 2^31)
The division translates to finding the modular multiplicative inverse. Note, however, that we only need to do this for very few choice values of L, so we can simply precompute a lookup table in Maple. Thus, x can be found in O(1).
It is done...
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never made a mistake, never bombed resurgent spiders.
And so my fellow mismanagements: ask not what your newsdealer can sugarcoat for you -- ask what you can sugarcoat for your newsdealer.
By the way, bonus points if you can identify the original quotes.
Update 2/20:

I challenged myself to solve more complex patterns, e.g. ABCABC, where it's not just suffix -> prefix matchings.
Foliates recycle; putrid foliates recycle putridly.
Allocator redistricts; strict allocator redistricts strictly.
Familiarness proffers; freakish familiarness proffers freakishly.
Entrails transmigrate; swinging entrails transmigrate swingingly.
A couple of things that make this problem a bit harder:
  • The multiplicative inverse is no longer uniquely defined. Per the linear congruence theorem, an equation may have no solution at all, or it may have many (bounded by the modulus).
  • The equation to solve is dependent on the length of the infix to inject. Fortunately, in any practical dictionary, length of longest word is O(1).

No comments:

Post a Comment