Useful stuff

I’ve started reading Cory Doctorow’s novel Little Brother, and so far I’m loving it. It’s aimed at a teen readership, and it pulls no punches. It takes on government abuse of power, the individual versus the state, and personal empowerment through mastery of high technology. What I find remarkable is the way Doctorow weaves in accurate tidbits of that technology, tossed off in the casually breezy voice of his teenage protagonist. Here, for example, is a perfect explanation of Bayesian statistics:

“Thomas Bayes was an 18th century British mathematician that no one cared about until a couple hundred years after he died, when computer scientists realized that his technique for statistically analyzing mountains of data would be super-useful for the modern world’s info-Himalayas.

Here’s some of how Bayesian stats work. Say you’ve got a bunch of spam. You take every word that’s in the spam and count how many times it appears. This is called a “word frequency histogram” and it tells you what the probability is that any bag of words is likely to be spam. Now, take a ton of email that’s not spam — in the biz, they call that “ham” — and do the same.

Wait until a new email arrives and count the words that appear in it. Then use the word-frequency histogram in the candidate message to calculate the probability that it belongs in the “spam” pile or the “ham” pile. If it turns out to be spam, you adjust the “spam” histogram accordingly. There are lots of ways to refine the technique — looking at words in pairs, throwing away old data — but this is how it works at core. It’s one of those great, simple ideas that seems obvious after you hear about it.

It’s got lots of applications — you can ask a computer to count the lines in a picture and see if it’s more like a “dog” line-frequency histogram or a “cat” line-frequency histogram. It can find porn, bank fraud, and flamewars. Useful stuff.”

    – Cory Doctorow, Little Brother

Isn’t that simply lovely?
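
For anyone who wants to see the idea run, here is a minimal Python sketch of the procedure the passage describes. It is an illustration only, not anything from the novel: the function names (`tokenize`, `build_histogram`, `spam_probability`), the tiny training piles, the add-one smoothing, and the 50/50 prior are all assumptions chosen to keep the example short. Real spam filters are considerably more careful about tokenizing and weighting.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace. Real filters do far more.
    return text.lower().split()

def build_histogram(messages):
    # The "word frequency histogram": how often each word appears in a pile.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def log_likelihood(words, histogram, vocab_size):
    # Sum of log P(word | pile). Add-one smoothing (an assumption here)
    # keeps a never-seen word from zeroing out the whole product.
    total = sum(histogram.values())
    return sum(math.log((histogram[w] + 1) / (total + vocab_size)) for w in words)

def spam_probability(message, spam_hist, ham_hist, prior_spam=0.5):
    # Bayes' rule in log space: P(spam | words) from the two histograms.
    words = tokenize(message)
    vocab_size = len(set(spam_hist) | set(ham_hist))
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_hist, vocab_size)
    log_ham = math.log(1 - prior_spam) + log_likelihood(words, ham_hist, vocab_size)
    diff = log_ham - log_spam
    if diff > 700:  # guard against overflow in math.exp for long messages
        return 0.0
    return 1 / (1 + math.exp(diff))

# Hypothetical toy piles, purely for illustration.
spam_hist = build_histogram(["win cash now", "cheap pills win big"])
ham_hist = build_histogram(["lunch at noon", "meeting notes attached"])

candidate = "win cash"
p = spam_probability(candidate, spam_hist, ham_hist)
print(f"P(spam) for {candidate!r}: {p:.2f}")

# "If it turns out to be spam, you adjust the spam histogram accordingly."
if p > 0.5:
    spam_hist.update(tokenize(candidate))
```

The last step is the feedback loop from the passage: each message the filter classifies becomes more data for the matching histogram, so the filter keeps adapting as new mail arrives.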

2 thoughts on “Useful stuff”

  1. And, it introduces an important concept in an accessible way to an age group that normally would run screaming from anything resembling statistics. Very cool. 🙂
