December 30th, 2007 — os x, software, unix
I’ve been following the news on Daring Fireball about Omniture’s sneaky use of the 192.168.112.2o7.net domain name in phone-home functionality used by iTunes, Adobe CS3 apps and (presumably) others.
Mitcho.com notes that you can opt out of 2o7.net tracking, but only by setting a browser cookie, which won’t have any effect within apps such as iTunes and Photoshop CS3. That site suggests a third-party solution for preventing apps from connecting to Omniture’s servers. But there is a simpler (and free!) way, at least on OS X.
Like most Unix and Unix-like OSs, Mac OS X has a file that maps domain names to IP addresses, /etc/hosts. The OS looks in this file first when it needs to translate a domain name, before hitting your DNS server – so you can consider it a very crude highest-priority local DNS server. /etc/hosts is dead handy for assigning simple names to machines on your local network with static IP addresses. Mine currently looks like this:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
10.0.0.11 imac
10.0.0.5 penguin
10.0.0.1 router
The format is simple:
<ip address> <domain_name> [<domain name> ...]
The essence of this trick is: we’re going to add a new domain name (192.168.112.2o7.net) that points to the localhost entry.
How to edit the hosts file is outside the scope of this article, but typically you’ll need to open a Terminal window and type:
$ sudo editor /etc/hosts
where editor is your text editor of choice. If you’ve never used a command-line text editor before and want something easy, use pico (sudo pico /etc/hosts). If you have TextMate and the mate shell command installed you can use that (sudo mate /etc/hosts). You should probably make a backup of /etc/hosts before you edit it, just in case – you can do that with:
$ sudo cp /etc/hosts /etc/hosts.backup
So I’ve changed my /etc/hosts file to resolve the domain name 192.168.112.2o7.net to 127.0.0.1 – the localhost “loopback” address, which always points to the local machine:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost 192.168.112.2o7.net
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
10.0.0.11 imac
10.0.0.5 penguin
10.0.0.1 router
Now when an app tries to phone home to 192.168.112.2o7.net, it’ll be connecting to the local host instead of Omniture’s server. This will typically result in an error response, which the app should silently ignore.
To confirm that the redirection works as expected, I saved my changes to /etc/hosts, opened a new Terminal window and used traceroute:
$ traceroute 192.168.112.2o7.net
traceroute to localhost (127.0.0.1), 64 hops max, 40 byte packets
1 localhost (127.0.0.1) 0.387 ms 0.039 ms 0.032 ms
Problem solved!
Continue reading →
December 23rd, 2007 — darwin, os x
A while ago I noticed that my home directory on my Macbook Pro wasn’t visible in the Finder. That is, if I opened a Finder window and browsed to /Users, there was no simon subdirectory, which was a bit odd. I could still browse to the stuff inside my home directory (Documents, Music, etc) using the shortcuts in the Finder’s left-hand panel, but my home directory itself was invisible.
Turning to the terminal, the first thing I discovered was an at symbol after the permissions string for my home directory:
$ ls -l /Users/
total 0
drwxrwxrwt 4 root wheel 136 2 Nov 23:34 Shared
drwxr-xr-x@ 40 simon staff 1360 23 Dec 12:05 simon
Hmm… that’s a bit weird. I couldn’t find any mention of a trailing @ in the ls manpage, and Google couldn’t help either. So my next step was to find out the hard way, by browsing Darwin’s source code. And here’s what I found, in print.c from the source for the ls command.
#ifdef __APPLE__
// irrelevant stuff left out
xattr = listxattr(filename, NULL, 0, XATTR_NOFOLLOW);
if (xattr < 0)
xattr = 0;
str[1] = '\0';
if (xattr > 0)
str[0] = '@';
else if (acl != NULL)
str[0] = '+';
else
str[0] = ' ';
#endif /* __APPLE__ */
Note firstly that this stuff is in an #ifdef __APPLE__ block, which might explain why there’s no mention of it in the manpage, which appears to be the standard FreeBSD version. Anyway, according to this code, we’re appending an @ if the file has xattrs, or extended attributes.
Darwin has an xattr command that lists extended attributes for a given file, but it’s a bit cryptic:
$ xattr -l /Users/simon/
com.apple.FinderInfo:
0000 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 ........@.......
0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Extended attributes are stored as key/value pairs, and in this instance the key is com.apple.FinderInfo, the value is the 32-byte blob that follows it. Presumably, byte 8 having a value of 0×40 means something to someone, but it doesn’t mean anything to me.
A regular Finder Info window doesn’t tell you much either, but fortunately Path Finder came up with the goods. Although my home directory doesn’t show up in Path Finder either, if I browse to, for example, my Music directory, again using the shortcuts in the left panel, I can then control-click on “simon” in the breadcrumb browse strip along the top of the main Path Finder window and choose Get Info, and Path Finder’s Info panel does show extended attributes. Attributes like this:

Invisible! Result!
No idea how that extended attribute ever got set, but that’s a different problem.
Update: MacFixIt has a good article on the invisible attribute, including how to set it at the command line.
December 16th, 2007 — first principles
Someone at work asked me for help the other day with a problem that boiled down to: “how do I efficiently sort 29,000 lines of pipe-delimited text based on the value in the 6th column?”
29,000 lines sounds pretty specific to me, and I’m inherently suspicious of specifications that are so precise about the scale of the problem. In my experience there are very few static data sets in the world, and a problem that involves 29,000 lines of data today is likely to scale up to involve many more lines of data within the lifetime of the code you write to solve the problem.
There are a few obvious exceptions: national populations, for example, will generally double over a period of time equivalent to many software iterations (although if you’re writing software that concerns itself with national populations you’re most likely on contract to the government and in that case you’ve got a whole different set of challenges). But generally, data sets will grow and it pays to expect them to grow beyond expectations. Hence Whitaker’s First Rule of Scalability:
There are very few real-world data sets that won’t eventually grow to double their current size.
And here’s the kicker: Whitaker’s First Rule is immutable, and hence applies equally once the data set has doubled in size.
Continue reading →
December 14th, 2007 — amazon, databases, web services
Amazon have released their much-anticipated SimpleDB Web Service. There have long been whispers that such a service is on the cards, and I’ve been eager to find out some details. (Despite working at Amazon, I don’t get advance notice of this stuff!)
Data is stored on S3, as most people predicted it would be. This clearly isn’t a full relational database offering, and Amazon aren’t pretending it is. It’s a means of storing persistent data in domain tables, with a simple query syntax that looks quite unlike SQL and not much like the commonly used param=value web service syntax either. Here’s an example from the developer docs, to return rows from a specified domain table where price is less than 14.99 and color is blue:
['Price' < '14.99'] intersection ['Color' = 'Blue']
I can’t wait for SDB to go live, so that I can start playing with it and exploring the possibilities. More here soon!
November 30th, 2007 — coding, recruitment
We do initial phone screens where I work as part of our hiring process. It’s a great way of checking whether a candidate’s CV is fact or fiction before inviting them in for a full interview.
I never used to ask coding questions in my phone screen, on the basis that they’re difficult to administer over the phone and we’d be testing coding skills at an interview proper anyway. So instead I’d ask a range of theory questions: what’s the difference between a class and an object, what’s the efficiency of retrieval in a balanced binary search tree, that sort of thing.
That led to lots of candidates sailing through the phone screen only to fail miserably on the coding test in their full interview. It’s fascinating that many people with perfectly good CS theory and a long background in software development can’t actually write good code. So eventually, I added a coding question to my phone screen script.
Here’s the question I’m currently using:
Write a function that takes a single integer as input and returns a string that is the binary representation of that integer. For example, if the input is 20, the output is “10100″. You should complete the exercise in C/C++, Java or Perl. You may not use printf or similar functions.
I give them five minutes to complete the task then they have to read their code back to me over the phone.
This is a simple task – in Perl it’s a one-liner, even without sprintf – and it answers a range of questions including:
- Can the candidate understand and follow a simple spec?
- Do they understand binary?
- Can they use bitwise operators?
- Can they write good functional code?
- Can they write generalised code? (For example, do they make assumptions about the size of an integer.)
I look for answers where the syntax is perfect, although I’m more forgiving on semantic errors, provided the candidate can solve them when they’re pointed out. (A common error on the first implementation is to return a reversed string, for example.)
The test also reveals how well the candidate knows their language of choice. For example, candidates answering the question in C typically use a char* to store the result, but often forget to leave room for the trailing null byte; a pretty elemental requirement.
If you have five minutes and you’re up to the challenge, give this one a whirl and let me know how you got on.
November 18th, 2007 — algorithms, first principles, perl
One for the mathmos: an interesting article on hash functions by Bob Jenkins, formerly of Oracle, now at Microsoft. Jenkins looks at various hash functions in use at the time (1996, updated 2005), then proposes his own, along with a confident promise:
I offer you a new hash function for hash table lookup that is faster and more thorough than the one you are using now. I also give you a way to verify that it is more thorough.
Wikipedia’s interpretation of the function shows it to be the same function as the one Perl 5.8.8 uses for its built-in hashes. (Search for #define PERL_HASH).
Browsing Bob Jenkins’ site turns up many other gems, including this:
Little League did not produce a great baseball player, but Bob did acquire a knack for spotting four-leaf clovers.
An attitude to sport which is achingly familiar to me. My advice to Microsoft, Oracle et al: if you’re looking for the nerds of tomorrow, you could do a lot worse than start your search on the perimeters of the world’s school sports fields.
August 24th, 2007 — coding, first principles
At work we have a standard coding test that we give all of our prospective software developers to complete during their interview day. The coding test revolves around a simple card game, and one of the things candidates have to do is shuffle a deck of cards.
Shuffling a deck of cards, or more generally randomising the order of an array, is a pretty simple task and almost every candidate does it the same way. Here’s my (very simple) version, in Perl:
sub shuffle {
my @array = @_;
my $length = scalar(@array);
for (my $i = 0; $i < $length; $i++) {
my $r = int(rand() * $length);
@array[$i, $r] = @array[$r, $i];
}
return @array;
}
That's pretty much the algorithm that every candidate writes, including me when I sat the test. We call it the Knuth shuffle, because it was described by Donald Knuth in his book The Art of Computer Programming Vol 2: Seminumerical Algorithms.
So, everyone writes it like that. And everyone gets it wrong.
The first clue that the above implementation is flawed is in the numbers. Generating a random number between 1 and N and repeating that N times gives NN possible execution paths, but there are only N! ways you can order the numbers 1 to N. This is a great example of where intuition can be confusing: it feels right that "every card gets the chance to be swapped with every other card", but the numbers don't back up the gut feeling.
I prefer to think of it another way; instead of thinking of a magician shuffling a deck of cards, think of lottery balls coming out of a lottery machine. You start with N balls. One is selected at random (N possible outcomes), then another (N-1), then another (N-2), and so on. If you continue until all the balls are selected you have this many possible outcomes:
N * (N-1) * (N-2) * ... * 1
Or to put it another way:
1 * 2 * ... * N = N!
Hurrah, N! outcomes!
In other words, for each element in the array, choose a random number between that element's index and the last index in the deck, and swap the cards at those two indices. An easier way to do this in code is to iterate backwards from the last index to zero and generate a random number between zero and the current index, then swap those two indices:
sub shuffle {
my @array = @_;
my $length = scalar(@array);
for (my $i = $length - 1; $i > 0; $i--) {
my $r = int(rand() * ($i + 1));
@array[$i, $r] = @array[$r, $i];
}
return @array;
}
And that's the correct Knuth shuffle, exactly as he describes it himself:
Algorithm P (Shuffling). Let X1, X2, ..., Xt be a set of t numbers to be shuffed.
P1. [Initialize.] Set j ← t.
P2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.
P3. [Exchange.] Set k ← [jU] + 1. (Now k is a random integer, between 1 and j. Exchange Xk and Xj.
P4. [Decrease j.] Decrease j by 1. If j > 1, return to step P2.
The Art of Computer Programming, Volume 2: Seminumerical Algorithms, p145. Third Edition. DE Knuth.
August 22nd, 2007 — cryptography, first principles
I think it’s important in any line of work, and especially in software development, to know your limitations. One of my limitations is security in software; cryptography, stuff like that. I only know enough about security in software to know that I don’t know enough. Hence, any attempt I made to develop secure software would almost certainly not produce a robust solution, and a non-robust security solution is worse than no solution at all.
Here’s how Phil “PGP” Zimmerman describes the perils of bad cryptography products:
This is like selling automotive seat belts that look good and feel good, but snap open in the slowest crash test. Depending on them may be worse than not wearing seat belts at all. No one suspects they are bad until a real crash. Depending on weak cryptographic software may cause you to unknowingly place sensitive information at risk when you might not otherwise have done so if you had no cryptographic software at all. Perhaps you may never even discover that your data has been compromised.
From http://www.pgpi.org/doc/guide/6.5/en/intro/
August 22nd, 2007 — coding, perl
Perl allows you to do all kinds of weird things to the semantics of the language. One such weird thing is the facility to override the default die handler in perl with something of your own design. The way it works is like so:
function some_func {
local $SIG{__DIE__} = sub {
my $error = shift;
# do weird and wonderful stuff with $error here
};
# your code here
die("there's nothing to live for");
}
Pretty cool, huh? Well, yes, but there are a couple of quirks to self-rolled die handlers that can cause you no end of pain.
Firstly, the keyword local should already have brought you out in a cold sweat. local gives a variable dynamic scope, which means that the variable holds until the block that contains it is popped back off the call stack. And that can really hurt if your code block calls other functions.
Consider a simple scenario: func1 defines a local die handler, then calls func2…
sub func1 {
local $SIG{__DIE__} = sub {
print STDERR $_[0]; exit(1)
};
# do something that might die here
func2();
}
The die handler doesn’t just affect the stuff you do in func1, it affects anything that happens in func2, too. And if func2 calls functions itself, the die handler will apply while those functions are running as well, and the functions they call, and the functions they call…
Well that’s OK, you might think – I want my die handler to be called if anything dies. That’s why I wrote it! But that brings us to the second quirk.
You may have come across the use of eval blocks in perl as a means of emulating the try/catch blocks seen in some other languages, such as Java. Here’s a simple example:
eval {
# do something that might die
};
if ($@) {
# this is your catch block
print STDERR "Something horrible happened: $@";
}
Using an eval block, then testing the value of $@ afterwards, lets you catch instances of die and handle them gracefully. Your code continues to run, unless you specifically state that it doesn’t inside the “catch” block. It’s a nifty trick, and it’s used a lot. For example, the XML::Simple CPAN module uses it to decide which XML parser to use:
eval { require XML::SAX; }; # We didn't need it until now
if($@) { # No XML::SAX - fall back to XML::Parser
if($preferred_parser) { # unless a SAX parser was expressly requested
croak "XMLin() could not load XML::SAX";
}
return($self->build_tree_xml_parser($filename, $string));
}
Very cool.
Until, that is, you mix eval try/catch blocks with custom die handlers.
Why? Because custom die handlers run even if the die happened within an eval block. The perlvar perldoc calls this “an implementation glitch”. Yeah. What this means is that if you use a custom die handler in your code, and it contains a die or exit, you’ll trample all over any attempt to use eval for try/catch semantics anywhere else downstream of the block that declares the die handler! (I found this out the hard way – after an hour in the perl debugger trying to figure out why XML::Simple was crashing my code.)
Fortunately there is a workaround, albeit not a very elegant one: the variable $^S indicates whether code was called from within an eval block or not. So if you write a custom die handler, you should always check $^S, and die again if it’s set. (You can die within a die handler – the die handler doesn’t get called recursively.)
local $SIG{__DIE__} = sub {
die @_ if $^S;
# rest of the die handler goes here
}
Better still, don’t define your own die handlers! They’re rarely needed, you should use an eval block wherever possible to trap errors when they surface, rather than trying to affect the way they surface.