# Intro to Bayesian Inference

Before I start randomly posting mathematical content I thought I should make at least one post that covers the basics of Bayesian Inference. Luckily for me, Bayesian methods all follow from one basic idea: Bayes’ Theorem.

While Bayesian Inference is often touted as modern, exciting and controversial (well, as exciting and controversial as statistics can be), Bayes’ Theorem itself emerges naturally from probability theory. Traditionally we consider two “events”, which can more accurately be thought of as statements that can be either true or false: “it will rain tomorrow”, “I’m 5 feet tall”, “tigers are green”, etc. The joint probability of two events $A$ and $B$ is denoted $\mathrm{Pr}(A, B)$, which means “the probability that $A$ and $B$ are both true”. We also regularly consider conditional probability, $\mathrm{Pr}(A|B)$, which is “the probability that $A$ is true given that $B$ is true”. Joint probability can be defined using conditional probability:

$\mathrm{Pr}(A, B) = \mathrm{Pr}(A|B)\mathrm{Pr}(B)$

The equation above tells us that if we know the probability of event $B$ and the probability that $A$ will happen given that $B$ has happened, then we can work out the probability of them both happening together.
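To make the product rule concrete, here is a minimal sketch that checks it by counting. The day counts are made up purely for illustration; nothing here is real data.

```python
from fractions import Fraction

# Hypothetical counts over 100 days (illustrative numbers, not real data).
days_total = 100
days_b = 40         # days on which B is true
days_a_and_b = 10   # days on which both A and B are true

pr_b = Fraction(days_b, days_total)              # Pr(B)
pr_a_given_b = Fraction(days_a_and_b, days_b)    # Pr(A|B): only look at days where B holds
pr_a_and_b = Fraction(days_a_and_b, days_total)  # Pr(A, B), counted directly

# Product rule: Pr(A, B) = Pr(A|B) Pr(B)
assert pr_a_given_b * pr_b == pr_a_and_b
print(pr_a_given_b, pr_b, pr_a_and_b)  # 1/4 2/5 1/10
```

Using `Fraction` keeps the arithmetic exact, so the two sides of the rule can be compared with `==` rather than a floating point tolerance.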

Now we’ll take a look at some properties of joint and conditional probabilities. Let’s say that event $A$ is “it is raining” and $B$ is “it is cloudy”. Clearly saying “it’s raining and cloudy” is equivalent to saying “it’s cloudy and raining”, so $\mathrm{Pr}(A, B) = \mathrm{Pr}(B, A)$. On the other hand, you can’t reverse conditional probability, because asking “what is the probability it’s raining given that it’s cloudy?” is going to have a very different answer to “what is the probability that it’s cloudy given it’s raining?”. The latter is almost certain and the former rather less likely. While it seems obvious here when considering the weather, in general people have a bad habit of assuming $\mathrm{Pr}(A|B) = \mathrm{Pr}(B|A)$.
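The asymmetry is easy to see with numbers. The sketch below uses an invented table of day counts (not real weather data) and computes each conditional probability by restricting attention to the days where the condition holds.

```python
from fractions import Fraction

# Hypothetical day counts (illustrative only); keys are (raining, cloudy).
counts = {
    (True, True): 19,
    (True, False): 1,
    (False, True): 41,
    (False, False): 39,
}

def count(event):
    """Number of days on which event(raining, cloudy) is true."""
    return sum(n for (r, c), n in counts.items() if event(r, c))

days_rain_and_cloudy = count(lambda r, c: r and c)
days_rain = count(lambda r, c: r)
days_cloudy = count(lambda r, c: c)

# Condition by keeping only the days where the "given" event happened.
pr_rain_given_cloudy = Fraction(days_rain_and_cloudy, days_cloudy)
pr_cloudy_given_rain = Fraction(days_rain_and_cloudy, days_rain)

print(pr_rain_given_cloudy)  # 19/60
print(pr_cloudy_given_rain)  # 19/20
```

Both conditionals share the same numerator (the days that are rainy and cloudy); only the denominator changes, and that is what makes the two answers so different.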

We can use the reversibility of joint probability to derive Bayes’ Theorem. If we take our previous equation and divide by $\mathrm{Pr}(B)$ we get

$\mathrm{Pr}(A|B) = \frac{\mathrm{Pr}(A, B)}{\mathrm{Pr}(B)}$

We can also re-express $\mathrm{Pr}(A, B)$ as $\mathrm{Pr}(B, A) = \mathrm{Pr}(B|A)\mathrm{Pr}(A)$ and hence

$\mathrm{Pr}(A|B) = \frac{\mathrm{Pr}(B|A)\mathrm{Pr}(A)}{\mathrm{Pr}(B)}$
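As a quick sanity check, the formula can be written as a one-line function. The helper name `bayes` and the weather probabilities below are my own illustrative choices, not anything standard.

```python
from fractions import Fraction

def bayes(pr_b_given_a, pr_a, pr_b):
    """Return Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)."""
    return pr_b_given_a * pr_a / pr_b

# Made-up probabilities: A is "it is raining", B is "it is cloudy".
pr_rain = Fraction(1, 5)
pr_cloudy = Fraction(3, 5)
pr_cloudy_given_rain = Fraction(19, 20)

pr_rain_given_cloudy = bayes(pr_cloudy_given_rain, pr_rain, pr_cloudy)
print(pr_rain_given_cloudy)  # 19/60
```

So even though cloud is almost certain when it rains (19/20), rain is far from certain when it is cloudy (19/60), because cloudy days are much more common than rainy ones in this toy setup.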

The above equation is the basic formulation of Bayes’ Theorem. I hasten to add that the actual development of the theory was not quite so straightforward. For a whistle-stop tour of the history of Bayes’ Theorem I would recommend The Theory That Would Not Die by Sharon Bertsch McGrayne, which I’ll be reviewing here soon. The theorem itself, as applied to two generic events, is used widely by both frequentist and Bayesian statisticians because it is simply a consequence of maths and probability. The way Bayesians use Bayes’ Theorem is the controversial part.

The statisticians who actually use Bayesian methodology consider very specific events in place of $A$ and $B$. In general, statisticians want to infer something about a parameter, $\theta$, from some observational data, $y$. We can rewrite Bayes’ Theorem as

$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$

We have changed from using $\mathrm{Pr}(x)$ to $p(x)$ because we’ve moved from looking at a specific probability to a more general probability distribution. This is a function that describes how likely each possible value of $x$ is. A familiar example is the normal distribution (AKA the “bell curve”). Note that, strictly speaking, for a continuous distribution we need to integrate over an interval to get an actual probability, but $p(x)$ still describes how likely we are to get any particular value relative to all the other possible values (we could imagine integrating over an arbitrarily small interval, for example).
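The difference between a density value and a probability can be checked with Python's standard library class `statistics.NormalDist`. This is only a sketch; the means, standard deviations and interval width are arbitrary choices for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # standard normal, chosen for illustration

# The density at a point is a relative likelihood, not a probability.
density_at_zero = dist.pdf(0.0)

# An actual probability comes from integrating over an interval,
# here done via the cumulative distribution function.
prob_within_one_sd = dist.cdf(1.0) - dist.cdf(-1.0)

# Over a very small interval, probability is roughly density times width.
width = 1e-3
prob_tiny = dist.cdf(width / 2) - dist.cdf(-width / 2)
approx_tiny = density_at_zero * width

# A density can even exceed 1, which a probability never can.
narrow_density = NormalDist(mu=0.0, sigma=0.1).pdf(0.0)

print(round(density_at_zero, 4))     # 0.3989
print(round(prob_within_one_sd, 4))  # 0.6827
print(narrow_density > 1.0)          # True
```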

Bayesians often ignore $p(y)$ because we already know our data $y$ before we start making any inferences. This means that $p(y)$ does not depend on $\theta$: it is simply a constant, and consequently so is $\frac{1}{p(y)}$. When two quantities are equal up to a multiplicative constant (e.g. $a = 5b$) you can say they’re proportional to each other, denoted $a \propto b$. Hence Bayesians generally consider

$p(\theta | y) \propto p(y|\theta) p(\theta)$
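To see why dropping the constant is harmless, here is a rough grid sketch. It assumes a coin with unknown probability of heads $\theta$, a flat prior, and invented data of 7 heads in 10 tosses; all of these are my own illustrative choices. Multiplying the unnormalised posterior by any constant and then rescaling so it sums to one gives the same answer.

```python
# Grid of candidate values for theta (an illustrative discretisation).
n_grid = 101
thetas = [i / (n_grid - 1) for i in range(n_grid)]

heads, tosses = 7, 10  # hypothetical data y

prior = [1.0 for _ in thetas]  # flat prior p(theta), assumed for illustration

# Likelihood p(y|theta) up to a constant: the binomial coefficient is
# omitted because it does not depend on theta.
likelihood = [t**heads * (1 - t)**(tosses - heads) for t in thetas]

# p(theta|y) is proportional to p(y|theta) p(theta)
unnormalised = [l * p for l, p in zip(likelihood, prior)]

def normalise(weights):
    """Rescale weights so that they sum to one over the grid."""
    total = sum(weights)
    return [w / total for w in weights]

posterior = normalise(unnormalised)

# Putting the constant back (10 choose 7 = 120) changes nothing after rescaling.
posterior_with_constant = normalise([120 * u for u in unnormalised])

most_probable_theta = thetas[posterior.index(max(posterior))]
print(most_probable_theta)  # 0.7
```

On a grid the sum of the unnormalised values plays the role of $p(y)$, which is why it never has to be worked out separately.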

To figure out why Bayesianism is considered to be so powerful we need to look at this equation from a scientific perspective. We can think of $p(\theta)$ as a hypothesis about the parameter $\theta$, made prior to the experiment in which we collected the data $y$; this is called the prior. Is $\theta$ likely to be near zero? Do we have some evidence about its value? Consider the average height of a giraffe: what kind of guess would you make about it? Most likely you aren’t sure, but you know they’re pretty tall: taller than, say, 2m (a tall human) but shorter than 20m. It is possible to quantify this sort of thinking in a probability distribution, which we can then assign to $p(\theta)$.

We also have the probability of our data given specific values of $\theta$, $p(y|\theta)$, called the likelihood. Our data is a random sample from this probability distribution. Using the equation above we can update our prior hypothesis with our data, via the likelihood, and get an improved hypothesis $p(\theta|y)$, called the posterior. For example, if our prior assumed that giraffes were on average around 1m tall and our data had an average height of 5m, the posterior of the mean height would be moved away from 1m and towards 5m. The data can correct or support our initial hypothesis.

Note that we have to allow for this shift. It is possible to have “the mean height is 1m with certainty and it’s impossible that it’s any other value” as our prior hypothesis, but we cannot update this: the prior is zero everywhere except at 1m, so the posterior is too, whatever the data say. So priors tend to either be defined from $-\infty$ to $\infty$ or over a large plausible range.
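The giraffe example can be sketched numerically. The code below assumes a normal prior on the mean height and normally distributed heights with a known standard deviation, which is one convenient textbook setup rather than the only way to do it. The function name `normal_posterior` and every number are invented for illustration.

```python
def normal_posterior(prior_mean, prior_sd, data_mean, data_sd, n):
    """Posterior mean and sd for a normal mean with known data sd.

    Assumes a normal prior and n independent normal observations,
    so precisions (1 / variance) simply add.
    """
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / data_sd**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * data_mean) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd

# Prior guess: giraffes average about 1m, give or take 2m (hypothetical).
# Data: 10 giraffes with a sample mean of 5m and an assumed sd of 1m.
mean, sd = normal_posterior(prior_mean=1.0, prior_sd=2.0,
                            data_mean=5.0, data_sd=1.0, n=10)
print(round(mean, 2))  # 4.9

# A prior that is (almost) certain the mean is 1m barely moves at all.
stubborn_mean, _ = normal_posterior(prior_mean=1.0, prior_sd=1e-6,
                                    data_mean=5.0, data_sd=1.0, n=10)
print(round(stubborn_mean, 6))  # 1.0
```

The posterior mean is a weighted average of the prior mean and the data mean, so a vague prior lets the data dominate while an overconfident prior effectively ignores them.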

The ability to update your hypotheses with data is very philosophically appealing from a scientific perspective. This is one of the major advantages Bayesian inference has over frequentist (AKA classical) statistics, which is based on the properties of long-run frequencies and is less easy to connect to the scientific process.