Whitaker’s First Rule of Scalability

Someone at work asked me for help the other day with a problem that boiled down to: “how do I efficiently sort 29,000 lines of pipe-delimited text based on the value in the 6th column?”

29,000 lines sounds pretty specific to me, and I’m inherently suspicious of specifications that are so precise about the scale of the problem. In my experience there are very few static data sets in the world, and a problem that involves 29,000 lines of data today is likely to scale up to involve many more lines of data within the lifetime of the code you write to solve the problem.

There are a few obvious exceptions: national populations, for example, generally take so long to double that many software iterations will come and go in the meantime (although if you’re writing software that concerns itself with national populations, you’re most likely on contract to the government, and in that case you’ve got a whole different set of challenges). But generally, data sets grow, and it pays to expect them to grow beyond expectations. Hence Whitaker’s First Rule of Scalability:

There are very few real-world data sets that won’t eventually grow to double their current size.

And here’s the kicker: Whitaker’s First Rule is immutable, and hence applies equally once the data set has doubled in size.
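To put rough numbers on that, here is a minimal sketch of what repeated doubling does to the 29,000-line figure. The helper name `lines_after_doublings` and the choice of five doublings are made up purely for illustration; nothing here predicts how quickly any real data set grows.

```shell
# Hypothetical helper: print the line count after a number of doublings.
#   $1 = starting line count, $2 = how many times the data set doubles
lines_after_doublings() {
  count=$1
  remaining=$2
  while [ "$remaining" -gt 0 ]; do
    count=$((count * 2))
    remaining=$((remaining - 1))
  done
  echo "$count"
}

# Apply the rule to the 29,000-line file, one doubling at a time.
for n in 1 2 3 4 5; do
  echo "after $n doubling(s): $(lines_after_doublings 29000 "$n") lines"
done
```

Five applications of the rule turn 29,000 lines into 928,000, which is why a solution sized for exactly today’s line count tends not to stay the right size for long.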

By the way, in case you’re wondering, I found that GNU sort could tackle 29,000 lines of data in just over a second on my pretty average Linux desktop:

sort -t '|' -k 6,6 -n filename.txt
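For anyone curious about what each flag is doing, here is a small sketch that runs the same command on a made-up three-line file. The file name `sample.txt` and its contents are invented for illustration, and it assumes the 6th column holds plain integers.

```shell
# Build a tiny, made-up sample: six pipe-delimited columns,
# with an integer in the 6th column.
printf '%s\n' \
  'a|b|c|d|e|300' \
  'f|g|h|i|j|25' \
  'k|l|m|n|o|1000' > sample.txt

# -t '|'  : treat the pipe character as the field separator
# -k 6,6  : the sort key starts and ends at field 6
# -n      : compare the key numerically rather than as text
sort -t '|' -k 6,6 -n sample.txt
```

With `-n` the rows come out in the order 25, 300, 1000. Drop the `-n` and the key is compared as text, so 1000 sorts ahead of 25.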

1 comment so far ↓

#1 Mark on 12.16.07 at 6:47 pm

Actually, if you’re on contract to the government then the population will have more than doubled in size before you’ve even delivered a stable beta. Whitaker’s First Rule stands uncontested! :-)