Entries Tagged 'first principles' ↓
December 16th, 2007 — first principles
Someone at work asked me for help the other day with a problem that boiled down to: “how do I efficiently sort 29,000 lines of pipe-delimited text based on the value in the 6th column?”
29,000 lines sounds pretty specific to me, and I’m inherently suspicious of specifications that are so precise about the scale of the problem. In my experience there are very few static data sets in the world, and a problem that involves 29,000 lines of data today is likely to scale up to involve many more lines of data within the lifetime of the code you write to solve the problem.
There are a few obvious exceptions: national populations, for example, will generally double over a period of time equivalent to many software iterations (although if you’re writing software that concerns itself with national populations you’re most likely on contract to the government and in that case you’ve got a whole different set of challenges). But generally, data sets will grow and it pays to expect them to grow beyond expectations. Hence Whitaker’s First Rule of Scalability:
There are very few real-world data sets that won’t eventually grow to double their current size.
And here’s the kicker: Whitaker’s First Rule is immutable, and hence applies equally once the data set has doubled in size.
Continue reading →
November 18th, 2007 — algorithms, first principles, perl
One for the mathmos: an interesting article on hash functions by Bob Jenkins, formerly of Oracle, now at Microsoft. Jenkins looks at various hash functions in use at the time (1996, updated 2005), then proposes his own, along with a confident promise:
I offer you a new hash function for hash table lookup that is faster and more thorough than the one you are using now. I also give you a way to verify that it is more thorough.
Wikipedia’s interpretation of the function shows it to be the same function as the one Perl 5.8.8 uses for its built-in hashes. (Search for #define PERL_HASH).
Browsing Bob Jenkins’ site turns up many other gems, including this:
Little League did not produce a great baseball player, but Bob did acquire a knack for spotting four-leaf clovers.
An attitude to sport which is achingly familiar to me. My advice to Microsoft, Oracle et al: if you’re looking for the nerds of tomorrow, you could do a lot worse than start your search on the perimeters of the world’s school sports fields.
August 24th, 2007 — coding, first principles
At work we have a standard coding test that we give all of our prospective software developers to complete during their interview day. The coding test revolves around a simple card game, and one of the things candidates have to do is shuffle a deck of cards.
Shuffling a deck of cards, or more generally randomising the order of an array, is a pretty simple task and almost every candidate does it the same way. Here’s my (very simple) version, in Perl:
sub shuffle {
my @array = @_;
my $length = scalar(@array);
for (my $i = 0; $i < $length; $i++) {
my $r = int(rand() * $length);
@array[$i, $r] = @array[$r, $i];
}
return @array;
}
That's pretty much the algorithm that every candidate writes, including me when I sat the test. We call it the Knuth shuffle, because it was described by Donald Knuth in his book The Art of Computer Programming Vol 2: Seminumerical Algorithms.
So, everyone writes it like that. And everyone gets it wrong.
The first clue that the above implementation is flawed is in the numbers. Generating a random number between 1 and N and repeating that N times gives NN possible execution paths, but there are only N! ways you can order the numbers 1 to N. This is a great example of where intuition can be confusing: it feels right that "every card gets the chance to be swapped with every other card", but the numbers don't back up the gut feeling.
I prefer to think of it another way; instead of thinking of a magician shuffling a deck of cards, think of lottery balls coming out of a lottery machine. You start with N balls. One is selected at random (N possible outcomes), then another (N-1), then another (N-2), and so on. If you continue until all the balls are selected you have this many possible outcomes:
N * (N-1) * (N-2) * ... * 1
Or to put it another way:
1 * 2 * ... * N = N!
Hurrah, N! outcomes!
In other words, for each element in the array, choose a random number between that element's index and the last index in the deck, and swap the cards at those two indices. An easier way to do this in code is to iterate backwards from the last index to zero and generate a random number between zero and the current index, then swap those two indices:
sub shuffle {
my @array = @_;
my $length = scalar(@array);
for (my $i = $length - 1; $i > 0; $i--) {
my $r = int(rand() * ($i + 1));
@array[$i, $r] = @array[$r, $i];
}
return @array;
}
And that's the correct Knuth shuffle, exactly as he describes it himself:
Algorithm P (Shuffling). Let X1, X2, ..., Xt be a set of t numbers to be shuffed.
P1. [Initialize.] Set j ← t.
P2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.
P3. [Exchange.] Set k ← [jU] + 1. (Now k is a random integer, between 1 and j. Exchange Xk and Xj.
P4. [Decrease j.] Decrease j by 1. If j > 1, return to step P2.
The Art of Computer Programming, Volume 2: Seminumerical Algorithms, p145. Third Edition. DE Knuth.
August 22nd, 2007 — cryptography, first principles
I think it’s important in any line of work, and especially in software development, to know your limitations. One of my limitations is security in software; cryptography, stuff like that. I only know enough about security in software to know that I don’t know enough. Hence, any attempt I made to develop secure software would almost certainly not produce a robust solution, and a non-robust security solution is worse than no solution at all.
Here’s how Phil “PGP” Zimmerman describes the perils of bad cryptography products:
This is like selling automotive seat belts that look good and feel good, but snap open in the slowest crash test. Depending on them may be worse than not wearing seat belts at all. No one suspects they are bad until a real crash. Depending on weak cryptographic software may cause you to unknowingly place sensitive information at risk when you might not otherwise have done so if you had no cryptographic software at all. Perhaps you may never even discover that your data has been compromised.
From http://www.pgpi.org/doc/guide/6.5/en/intro/