It’s funny how words, when used in different contexts, take on completely different meanings. Same spelling but different context. Thus a different meaning. Technically speaking, these are called homographs. For example:

The term “axes” can mean either …

(a) a plurality of a certain hand tool, or

(b) a plurality of fixed reference lines for the measurement of coordinates.

The term “number” can describe …

(a) a mathematical object used to count, measure, or label, or

(b) an increased feeling of numbness.

As you may notice, I chose two homographs that have a different meaning in a mathematical versus colloquial context. This is deliberate. Math does funny things to the English language. And vice versa. It’s one reason why I can’t really learn from math textbooks. It’s also why this article might strain to communicate what I want to say. So I’ll just come right out and say it:

Bayes’ Rule is a fascinating, powerful method for prediction based on past information.

It is also a comfortable trap for anyone who doesn’t want to look beyond old intuitions.

Before I expand on this, I need to point everyone to a pair of fantastic articles that explain Bayes’ Rule very well:

  1. For a brief introduction, see Devin Soni’s What Is Bayes Rule?
  2. For a terrific application of the concept, see Will Koehrsen’s Bayes Rule Applied

For an even simpler explanation, I’ll borrow a line from the Wikipedia article:

Bayes’ Rule describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

For one final bit of detail, I’ll borrow from the fantastic book Algorithms to Live By, by Brian Christian and Tom Griffiths:

Bayes Rule … gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.

If you’re like me, that last phrase “multiply their probabilities together” is hard to envision. So let’s see the actual multiplication. Here’s what Bayes’ Rule looks like as a formula:

P(A|B) = P(B|A) × P(A) / P(B)

From Wikipedia

Please examine the notation P(A). This is what’s known as the prior: our belief about A before we see any evidence. As you can see in the formula, it gets multiplied by P(B|A), the likelihood of seeing evidence B if A is true. The result, P(A|B), is our updated belief. I like to think about this in the context of gossip.

Oh, But Did You Know?

Imagine you meet someone new today. They seem pretty nice. So what’s the chance that they really are nice? As an initial belief, your prior P(A), you’re probably 75% certain of their niceness.

Then a friend tells you that this person has a long recorded history of advocating for the destruction of all kittens and puppies. They show you the person’s blog, full of awful rants against kittens and puppies. These wanton screeds are your evidence, B. A nice person would almost never write them, so P(B|A) is tiny, and it has a powerful multiplier effect on your beliefs. In a negative direction. Where you were once 75% certain of the person’s niceness, you are now only 1% certain. That updated belief becomes your new prior. If asked to predict the person’s behavior in the future, you’d use this prior to say bad things.
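Here’s a minimal sketch of that multiplication in Python. The 75% starting belief and the roughly 1% result come from the story. The two likelihoods (how often a nice or a not-nice person would write such a blog) are numbers I invented purely for illustration, and the function name bayes_update is hypothetical.

```python
def bayes_update(belief, p_evidence_if_true, p_evidence_if_false):
    """Return the updated belief after seeing one piece of evidence."""
    # Numerator of Bayes' Rule: the starting belief times the likelihood.
    numerator = belief * p_evidence_if_true
    # Denominator: total probability of the evidence under both possibilities.
    denominator = numerator + (1 - belief) * p_evidence_if_false
    return numerator / denominator


# Belief that the new acquaintance is nice, before reading the blog (from the story).
belief_nice = 0.75

# Hypothetical likelihoods, invented for this sketch:
p_blog_if_nice = 0.001      # a nice person almost never writes anti-kitten rants
p_blog_if_not_nice = 0.30   # a not-nice person plausibly might

updated = bayes_update(belief_nice, p_blog_if_nice, p_blog_if_not_nice)
print(f"Belief after reading the blog: {updated:.1%}")  # prints 1.0%
```

With different made-up likelihoods the drop would be bigger or smaller. The point is only that the starting belief gets multiplied by how well it explains the evidence.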

As our authors from the aforementioned book explain,

The richer the prior information we bring to Bayes’ Rule, the more useful the prediction we can get out of it.

Here’s where we get back to homographs. The word “prior” has a slightly different, more narrow meaning in criminal justice.

The term “prior” can describe …

(a) shorthand for prior probability in mathematics, aka the believed value of an uncertain quantity before evidence is taken into account, or

(b) shorthand for prior convictions in criminal justice, aka the historical record of previous convictions of a defendant in a sentencing case or a suspect in a criminal investigation.

As a non-mathematician, I can’t help but pick up on this parallel in the language. The deliberate cross-pollination of the term “prior” in both fields shows the ways in which Bayes’ Rule, and probabilistic thinking as a whole, emerges in other disciplines.

This is a good thing. As covered during the study of Daniel Kahneman’s Thinking, Fast and Slow, such probabilistic thinking helps us understand the distinction between decisions and outcomes and so much more.

But again, it can be a trap if we’re not careful.

You’ll Probably Rob A Bank

Did you know that you run the risk of robbing a bank? It’s true. There is a real possibility that you will wake up one morning, hop into a car, drive to a bank, politely request that they give you all their money, and then attempt to escape as a rich criminal.

You can say the base rate is 0.0000001%. That’s fine. But there’s still a chance. Thus a risk.

But this only works if you look at this through the lens of a normal distribution. This is something that we examined while studying Taleb’s book The Black Swan. As stated before, we really try to normalize everything. Including your chance of committing a felony.

But criminal statistics don’t always follow normal distributions. There are “mandelbrotian variations” or significant skewness in the data. Much of this activity lives in Taleb’s realm of Extremistan, where power laws romp playfully in fields of green.
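To get a feel for the difference, here is a rough simulation sketch in Python. It draws one sample from a bell curve and one from a Pareto power law, then asks how much of the total the top 1% holds in each. The parameters (mean 100, standard deviation 15, Pareto shape 1.16) and the helper name top_share are my own assumptions for illustration, not figures from Taleb or from any crime data.

```python
import random


def top_share(values, fraction=0.01):
    """Share of the total held by the largest `fraction` of the values."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


random.seed(42)  # fixed seed so the sketch is repeatable
n = 100_000

# Mediocristan: a bell curve (hypothetical mean 100, standard deviation 15).
normal_sample = [random.gauss(100, 15) for _ in range(n)]

# Extremistan: a Pareto power law (hypothetical shape parameter 1.16).
pareto_sample = [random.paretovariate(1.16) for _ in range(n)]

print(f"Top 1% share, normal: {top_share(normal_sample):.1%}")
print(f"Top 1% share, Pareto: {top_share(pareto_sample):.1%}")
```

In the bell-curve sample, the top 1% should hold only a little more than 1% of the total. In the power-law sample, the top 1% should hold a far larger share. That lopsidedness is what “significant skewness” looks like in practice.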

For example, a fascinating analysis from Sweden showed that 1% of the entire population accounted for 63% of all violent crime.

That’s only for violent crime and only for Sweden and only for data from 1973 – 2004. All the same, it shows the real predictive power of prior information. It also shows how prior information uncovers the distribution of the data behind a topic. If you want to find the population that is most likely to commit a violent crime, look no further than the population that has already committed a violent crime. The “prior” matters here.
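A bit of back-of-the-envelope arithmetic shows how lopsided that is. This sketch uses only the two figures quoted above (1% of the population, 63% of violent crime) and assumes, for simplicity, that crime is spread evenly within each group. That assumption is mine, not the study’s.

```python
# Figures quoted from the Swedish analysis above.
population_share = 0.01   # 1% of the population
crime_share = 0.63        # accounted for 63% of violent crime

# Violent crime per person, in relative units, for each group.
rate_in_group = crime_share / population_share
rate_outside = (1 - crime_share) / (1 - population_share)

ratio = rate_in_group / rate_outside
print(f"Per-person rate inside the 1% vs. outside it: about {ratio:.0f}x")  # about 169x
```

A gap of that size is why the “prior” carries so much predictive weight here.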

I wish it didn’t. Because while these findings lead us to more-accurate predictions, they also pull us towards more-rigid dogma. People can see this information and jump immediately to notions like “Once a criminal, always a criminal.”

System 2 Informs System 1

I’ve seen it firsthand. Trial and error, experience and observation, all these things lead criminal justice professionals to snap judgements and thin-sliced decisions based on a single question: “They got any priors?”

I’m not criticizing that question. I’m not questioning the veracity of the logic. This question is the Occam’s Razor for predicting crime.

And it is the valid by-product of legitimate experience. Such experience leads people to formulate variations on the original Bayes’ Rule algorithm. A useful heuristic begets other useful heuristics.

In criminal justice, this means that an inexperienced judge might initially assess a sentencing case from System 2 thinking. But repeated exposure leads them to crystallize the System 2 arguments into System 1 algorithms.

This is why a balanced way of thinking is so crucial.

Because the algorithm works. Until it doesn’t. And there’s always a case where it doesn’t.

Specific to criminal justice and “priors”, there is a switch that occurs in the data when we add a new variable into the mix: time.

So prior criminal behavior is a strong indicator of future criminal behavior. But that relationship weakens as time passes. The pattern is reliable because it emerges on a normal distribution. We’ve now ventured back into Taleb’s Mediocristan. Look no further for proof than the important study by Kurlychek et al., from 2006.

“Big Enough” Data

The point? I’m pulling from a lot of different sources in order to wrestle with something the authors observe in Algorithms To Live By:

Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich.

This is brilliant. This is true. But if anyone were to take that idea at face value (which our authors do not intend), it would be easy to slide into dogma and rigid impressions. It would be easy to say “All stereotypes are true.” In other words, this powerful idea, much like Bayes’ Rule, could be a foundation for lazy thinking.

This isn’t about criminal justice, per se. And it isn’t something the authors condone. But it happens. Why? Because Bayes’ Rule is that powerful.

We can use that power to greater effect. Add more priors. Is a single prior great? Yes. But more priors, more variables, can help.

Just not too many. There is a lot of useful truth in the idea that “small data is big data in disguise.” There is truth in a third way, too. Between the poles of “small data” and “big data,” we have “big enough data”. This is the sort of analysis that is more accurate, more nuanced, and requires only a little bit more effort.

It’s amazing what happens when researchers add a few additional variables (aka priors) to a given formula. In some cases, it uncovers power law distributions. In other cases, it uncovers normal distributions. In every instance, it gives us better intuition.

I should mention, too, that criminal justice professionals first ask about priors. Then they ask how long ago those priors occurred. Just as Kurlychek et al. would recommend. So none of this is new to them, generally speaking. It’s the rest of us that have to be mindful.

Priors are powerful indicators. But they’re better when they aren’t isolated.