Saturday, October 17, 2015

Bayes' Theorem, Library Views, and Xbox Controller Screws

Today I Learned:
1) A great example for teaching Bayes' Theorem, courtesy of one Deniz Senyuz.

Here goes my attempt to explain Bayes' Theorem on Facebook.

Bayes' Theorem is really just a simple equation that tells you how to correctly change how much you believe something based on what you knew before and what you see. In other words, it's the math that explains how learning works (if done correctly).

A quick review of terms:
P(X) = the probability of X, which is a number between 0 and 1.
P(X|Y) = the probability of X *given* that Y is true.
P(X,Y) = the probability of X and Y both being true.
Bayes' Theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
or, as it shows up in scientific contexts:
P(Hypothesis | Data) = P(Data | Hypothesis) * P(Hypothesis) / P(Data)
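
(If it helps to see the equation as plain arithmetic, here's a minimal sketch in Python. The function name is mine, and the body is nothing more than the equation above, spelled out.)

<CODE>

def bayes_posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(X|Y) = P(Y|X) * P(X) / P(Y).

    likelihood -- P(Y|X): how probable the observation is if X is true
    prior      -- P(X):   how probable X was before the observation
    evidence   -- P(Y):   how probable the observation is overall
    """
    return likelihood * prior / evidence

</CODE>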

First, a quick derivation of Bayes' Theorem. I find that going through this proof and knowing how to reproduce it are helpful for understanding how the theorem works. It's a very simple theorem, but feel free to skip this bit to see the example, which is the novel thing I actually learned today.

<PROOF>

First, we note that P(X,Y) = P(Y,X). After all, the order you list them in really doesn't matter.

Now note that P(X,Y) = P(X|Y) * P(Y). For X and Y to be true, Y certainly has to be true, thus the P(Y) term. Once you know that Y is true, the probability of X is, by definition, P(X|Y).

Similarly, P(Y,X) = P(Y|X) * P(X).

Thus, P(X|Y) * P(Y) = P(X,Y) = P(Y,X) = P(Y|X) * P(X)

Divide both sides by P(Y), and you get P(X|Y) = P(Y|X) * P(X) / P(Y).

Next... there is no next. I just wrote Bayes' Theorem. That's the whole proof.

</PROOF>
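
(If you'd rather check that chain of equalities with numbers than with symbols, here's a quick sanity check in Python, using a joint distribution I made up on the spot.)

<CODE>

# A made-up joint distribution over two binary variables X and Y.
p_joint = {(True, True): 0.1, (True, False): 0.2,
           (False, True): 0.3, (False, False): 0.4}

p_y = p_joint[(True, True)] + p_joint[(False, True)]  # P(Y)
p_x = p_joint[(True, True)] + p_joint[(True, False)]  # P(X)
p_x_given_y = p_joint[(True, True)] / p_y             # P(X|Y) = P(X,Y) / P(Y)
p_y_given_x = p_joint[(True, True)] / p_x             # P(Y|X) = P(Y,X) / P(X)

# Both sides of Bayes' Theorem should agree:
print(p_x_given_y)              # 0.25
print(p_y_given_x * p_x / p_y)  # 0.25 (up to float rounding)

</CODE>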

<AN ASIDE ON THE PARTS OF BAYES' THEOREM>

Before I actually get to the example, let me mention how Bayes' Theorem describes the mathematically correct way of updating knowledge when you get some data. As I wrote before, a common use of Bayes' Theorem is the specific case of

P(Hypothesis | Data) = P(Data | Hypothesis) * P(Hypothesis) / P(Data)

The number on the left -- P(Hypothesis | Data) -- is the confidence you have, as a probability, that some hypothesis about the world is true, given that you just saw some evidence (the Data). This term is usually called the "posterior probability" or "posterior likelihood", that is, the probability (of the hypothesis) after seeing the data.

The first term on the right -- P(Data | Hypothesis) -- describes how probable the data would be *IF* the hypothesis is correct. It's often called the "likelihood", short for "likelihood of the data". You can kind of intuit why the posterior likelihood might be proportional to the likelihood. After all, if your hypothesis strongly predicts an outcome, and you get that outcome... you're more likely to believe that hypothesis than if the hypothesis was wishy-washy about the outcome, or made a contradictory prediction.

The second term on the right -- P(Hypothesis) -- is the confidence, as a probability, that you placed in the hypothesis *before* you saw the data. It's usually called the "prior", short for "prior probability", or the probability of the hypothesis *prior* to the data. (If it weirds you out that Bayes' Theorem relies on your prior beliefs... you're not alone. That's a bigger topic than I'm going to address here, but the single-sentence response is that any kind of learning or statistics is going to involve prior beliefs, and Bayes' Theorem makes them nice and explicit.)

The last term is the probability of the data. Not under any particular hypothesis, mind you, just the probability of getting it under *any* possible hypothesis -- to calculate this, you have to sum up all of the probabilities of the data under the different hypotheses, weighted by the (prior) probability of those hypotheses. That's a hell of a pain, but for a lot of purposes you don't actually need to calculate this, for reasons I'll get to in the example. There's a name for this term, but it's terribly unhelpful, so I won't repeat it here. I also don't have a better word for it. For the purpose of this post, I'll call it the "overall likelihood of the data".
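
(Here's that weighted sum in code, in case it's easier to see that way. The two hypotheses and every number below are invented purely for illustration.)

<CODE>

# P(Data) = sum over hypotheses H of P(Data | H) * P(H).
# Invented setup: a drawer holds a fair coin and a biased coin, each
# equally likely to be grabbed; the data is "the coin came up heads".
priors      = {"fair coin": 0.5, "biased coin": 0.5}  # P(H)
likelihoods = {"fair coin": 0.5, "biased coin": 0.9}  # P(heads | H)

p_data = sum(likelihoods[h] * priors[h] for h in priors)
print(p_data)  # 0.7 -- the overall likelihood of the data

</CODE>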

</AN ASIDE ON THE PARTS OF BAYES' THEOREM>

Ok, now the example.

Say you just met me. You know I'm from the US, and you want to know what state I live in. I won't tell you, but I *will* tell you that my district representative is a Democrat. How likely is it that I come from Indiana?

The competing hypotheses here are statements like "Sam lives in Indiana" and "Sam lives in Virginia". The data here is (and yes I know it's "datum" in the singular but SCREW THAT PARTICULAR LINGUISTIC CONVENTION I'M AN ADULT I DO WHAT I WANT) that I live in a district with a Democratic Representative. Bayes' Theorem will tell you how likely it is that I'm from a particular state.

What is the likelihood? That is, what's the likelihood that my representative is a Democrat, under the assumption that I live in Indiana? To save some math, let's assume that every representative has an equal number of constituents, so that if I'm from Indiana, I'm equally likely to have any of the Indiana representatives. The probability of my representative being a Democrat (given that I'm from Indiana), then, is the fraction of Indiana representatives that are Democrats. In this case, as of this writing, that number is 2/9. Not very likely.

What is the prior probability? That is, how likely did you think it was that I'm from Indiana *before* you learned that my representative is a Democrat? Well, that depends on what information you have coming in. You might say that since you don't have any idea what state I live in, you assign equal probabilities to my living in any state. You could also say that I'm a random person, so the probability that I'm from Indiana is the fraction of Americans who are from Indiana. Maybe you hear my accent and don't think it sounds very Indianan, or you see my driver's license and see that it's actually from Michigan, in which case the prior probability would be very low.

Let's keep things minimal and reuse the assumption I made above, namely that every representative has the same number of constituents. Then the probability that I'm from Indiana is the same as the fraction of total representatives that are from Indiana, which is 9/435 ~ 2%.

What's the overall-likelihood-of-the-data? Well, this would be the probability of my having a Democratic representative if you have *no* idea what state I'm from. Currently, the Democrats hold 188 out of 435 seats, so the overall-likelihood-of-the-data is 188/435 ~ 43%.

Now we just plug those numbers into Bayes' Theorem and it tells you how likely I am to live in Indiana. In this case, it's (2/9) * 0.02 / 0.43 ~ 0.01, so it's about 1% probable that I'm from Indiana.
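
(Same calculation in Python, with the numbers above. Notice that the 9s and the 435s cancel, so the answer is exactly 2/188.)

<CODE>

likelihood = 2 / 9      # P(Democrat | Indiana): 2 of Indiana's 9 reps
prior      = 9 / 435    # P(Indiana): Indiana's 9 seats out of 435
evidence   = 188 / 435  # P(Democrat): 188 Democratic seats out of 435

posterior = likelihood * prior / evidence
print(posterior)        # ~0.0106, i.e. about 1% (exactly 2/188)

</CODE>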

Incidentally, there's another question you might ask, which is "what state is Sam most likely to be from?". This kind of question gets asked a lot in science. If you want to find the *most* likely state, you can calculate the posteriors for each of the 50 hypotheses involved and see which one's highest. If you want to save some calculation, you might note that the overall-likelihood-of-the-data, P(Democrat), is the same no matter which hypothesis you're considering. It doesn't care what state you're asking about. Since *every single* hypothesis you're considering is being divided by the same term, you could multiply all of them by that term and it wouldn't change which is most likely. That's why you often just don't bother calculating the overall-likelihood-of-the-data -- if you're comparing *relative* likelihoods of different hypotheses, it doesn't really matter and can be dropped.
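
(Here's a sketch of that shortcut: rank states by likelihood times prior and skip the division entirely. Indiana's seat counts are the real ones from above; the other two "states" and their numbers are placeholders I made up.)

<CODE>

# For each state: (total seats, seats held by Democrats).
# Indiana's counts are from the example; the others are made up.
states = {"Indiana": (9, 2), "State A": (20, 15), "State B": (5, 1)}
TOTAL_SEATS = 435

def unnormalized_posterior(seats, dems):
    likelihood = dems / seats    # P(Democrat | state)
    prior = seats / TOTAL_SEATS  # P(state)
    return likelihood * prior    # no division by P(Democrat)

best = max(states, key=lambda s: unnormalized_posterior(*states[s]))
print(best)  # "State A", under these made-up numbers

</CODE>

(Bonus: likelihood times prior collapses to dems / TOTAL_SEATS, so the ranking boils down to "which state has the most Democratic seats". The division by P(Democrat) really was doing nothing but rescaling.)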

What I love about this example is that there's a nice visual way to think about the terms of Bayes' Theorem, which helps illustrate why it works. Draw out (or imagine drawing out, if you prefer (or go look up a picture of)) a map of the US divided up by representative districts, with the districts colored in by representative-party. In fact, here's a link to such a map: http://tinyurl.com/p6dojc6. Pretty, no?

Now you can start seeing the terms of Bayes' Theorem visually. The likelihood, for instance, is the probability that a given Indiana representative is a Democrat, so you can blot out all the states except Indiana, and the likelihood is the fraction of the districts that are left that are blue. What's the prior? It's the fraction of *all* of the districts that are in Indiana. What's the overall-likelihood-of-the-data? That's the fraction of *all* of the districts that are blue, no blotting required. And the posterior gets the nicest picture of all: blot out all of the Republican districts, and the fraction of what's left that's in Indiana is the posterior -- 2 blue Indiana districts out of 188 blue districts total, which is the ~1% we calculated above.

If you've never seen or used Bayes' Theorem, I hope this teaches you something and convinces you how awesome the theorem is! If you *have* seen Bayes' Theorem before, I hope this explanation helps!

2) Our library has a ninth-floor lounge with a fantastic view of the mountains. Those mountains are a lot more impressive when you're nine stories up and they look just as big.

Also, downtown Pasadena has a lot more trees than I thought.

3) Xbox One controllers still use the same screws as Xbox 360 controllers, the kind that have a pin in the middle so you need a special screwdriver to unscrew them. I also learned that it's supposedly possible to unscrew those with a 2mm flathead screwdriver, though I don't have one, so I wasn't able to test that myself.
