Introduction

I came across the two envelopes problem a couple of weeks ago. I read the Wikipedia article, and this article by a mathematician named Keith Devlin, and for a while I felt like I understood the solution. I've been thinking about this a lot since then, and now I believe that those two articles are wrong about the solution.

I've decided to post my thoughts here, so that I can get some input from others. If I'm wrong about something, I'd like to know about it.

The problem

For those who aren't familiar with the problem, here it is:

You have two indistinguishable envelopes in front of you. Both of them contain money, one of them twice as much as the other. You're allowed to choose one of the envelopes and keep whatever's in it. So you pick one at random, and then you hesitate and think "maybe I should pick the other one". Let's call the amount in the envelope that you picked first A. Then the other envelope contains either 2A or A/2. Both amounts are equally likely, so the expectation value of the amount in the other envelope is

$\displaystyle \frac{1}{2}\cdot 2A+\frac{1}{2}\cdot \frac{A}{2}=A+\frac{A}{4}=\frac{5}{4}A$.

Since this is more than the amount in the envelope we have, we should definitely switch.

This conclusion is of course absurd. The symmetry of the problem alone is enough to guarantee that it doesn't matter if we switch or not. A more formal way of showing it is to note that if the smaller amount is X, thenbothexpectation values are equal to

$\displaystyle \frac{1}{2}\cdot X+\frac{1}{2}\cdot 2X=\frac{3}{2}X$,

so it doesn't matter if we switch or not.

The problem is that we have two calculations that look correct, but only one of them can be. It's obvious that the second calculation is the correct one, but it's surprisingly difficult to understand exactly what's wrong with the first one.

Why this post is so long

Because this problem is much more difficult than it seems, and because I've been discussing this problem with a few friends who aren't very familiar with Bayes's theorem. In order to make it easier for them (and others who are interested) to understand the disscussion, I will go through the basics of Bayesian probabilities in the next section. Then I will examine the line of reasoning that Devlin claims solves the problem.

If you already know about Bayesian probabilities, you can skip to the section titled "The solution?".

Probabilities, conditional probabilities and Bayes's theorem

Let S be a set. We define a function P that takes a subset of S to a number in the closed interval [0,1] by

$\displaystyle P(X)=\frac{|X|}{|S|}$,

where |X| and |S| are the number of elements in X and S respectively. We call P(X) theprobabilityof X. Note that this agrees with our intuitive idea of what a probability is: For example, if S is the set of all people and X the subset of all people with dark hair, then P(S) is the probability that a person chosen at random has dark hair. We call S thesample space.

We now define another function, also represented by the letter P, that takestwosubsets of S to a number in the closed interval [0,1] by

$\displaystyle P(X|Y)=\frac{P(X \cap Y)}{P(Y)}$,

where the P on the right-hand side is the probability function we defined above. To interpret this, we note that the right-hand side is equal to

$\displaystyle \frac{|X \cap Y|}{|Y|}$.

Suppose for example that X is the set of all people with dark hair, and Y the set of all females. Then P(X|Y) is the probability that a randomly chosen female has dark hair. We call P(X|Y) theconditional proabilityof X,givenY.

The definition of the conditional probability implies that

$\displaystyle P(X \cap Y)=P(X|Y)P(Y)$

but since the left-hand side is symmetric under exchange of X and Y, the right-hand side must be too, so we must have

$\displaystyle P(X|Y)P(Y)=P(Y|X)P(X)$.

Bayes's theorem is obtained by solving for P(X|Y).

$\displaystyle P(X|Y)=\frac{P(Y|X)P(X)}{P(Y)}$

When we need to calculate conditional probabilites $\displaystyle P(X_i|Y)$, where the $\displaystyle X_i$ are members of a family of pairwise disjoint subsets that cover S (e.g. $\displaystyle X_1=$"people aged 0-9", $\displaystyle X_2$="people aged 10-19", and so on), we often already know the $\displaystyle P(X_i)$, so it's convenient to put Bayes's theorem in a different form.

$\displaystyle P(Y)=P(\bigcup_i{Y \cap X_i})=\sum_i{P(Y \cap X_i)}=\sum_i{P(Y|X_i)P(X_i)}$

$\displaystyle P(X_i|Y)=\frac{P(Y|X_i)P(X_i)}{\sum_j{P(Y|X_j)P(X_ j)}}$.

This is all very abstract, so let's look at an example.

An example of how to use Bayes's theorem

Drawer 1 contains 10 white socks, Drawer 2 contains 10 white socks and 90 black socks. First a drawer is chosen at random, and then a sock from that drawer is chosen at random. This sock is white. What is the probability that we took it from drawer 1?

Let X be the set of events such that a sock from drawer 1 is chosen. Then $\displaystyle X^c$ is the set of events such that a sock from drawer 2 was chosen. The information that the drawer was chosen at random tells us that the number of elements in $\displaystyle X$ and in $\displaystyle X^c$ (=S-X) must be the same. This implies that $\displaystyle P(X)=P(X^c)=1/2$. Let $\displaystyle Y$ be the set of events when a white sock was picked. What we're looking for is the conditional probability of X (drawer 1), given Y (white sock), so we use Bayes's theorem:

$\displaystyle P(X|Y)=\frac{P(Y|X)P(X)}{P(Y|X)P(X)+P(Y|X^c)P(X^c) }$

$\displaystyle P(Y|X)$ is the probability that a random sock from drawer 1 is white. This is 10/10=1. $\displaystyle P(X)$ and $\displaystyle P(X^c)$ are both 1/2. $\displaystyle P(Y|X^c)$ is the probability that a random sock from drawer 2 is white. This is 10/100=1/10. So we get

$\displaystyle P(X|Y)=\frac{1\cdot\frac{1}{2}}{1\cdot\frac{1}{2}+ \frac{1}{10}\cdot\frac{1}{2}}=\frac{10}{11}$

If we had not known that the sock we picked is white, we would have said that the probability that we picked it from drawer 1 is P(X)=1/2. When we learned that it's white, our assessment changed to P(X|Y)=10/11. Because of this, P(X) is often called thepriorprobability of X, and P(X|Y) theposteriorprobability of X.

What if the sample space is infinite?

The equation that defines probability above has the number of elements in the sample space in a denominator. If the sample space is infinite, all probabilities would be zero, unless the subsets we consider are also infinite, in which case the probability wouldn't be well defined. In the discussion that follows, I'm assuming that the definitions above can be generalized to the infinite case, and that Bayes's theorem still holds. If someone has objections to this, or would like to show us the definitions and proofs, I would appreciate it.

The solution?

I don't know who suggested this solution first, but I read about it in Devlin's article. The difference between what I've written here and what Devlin wrote is mainly in the notation. I'm using the same notation as I did in the sections preceding this one.

The first thing we need to realize is that when we try to calculate the expectation value of the contents of the other envelopeas a function of the value in the first envelope, we need to use conditional probabilites. For example, the first 1/2 in the calculation that gave us the answer 5/4*A must be replaced by the conditional probability that the first envelope contains the smaller amount,giventhat the smaller amount is A. Are these conditional probabilities also 1/2? Devlin argues that they can't be. We will soon see why.

Just as in the example with the two sock drawers, we have an sample space that is split in equally large halves (whatever that means when the sample space is infinite) by the requirement that one of two options is chosen at random. Let X be the set of events such that the smaller amount is chosen first. Then X^c is the set of events such that the larger amount is chosen first. Let Y be the set of events such that the envelope that gets chosen first contains A.

With these definitions, the expectation value can be expressed as

$\displaystyle P(X|Y)\cdot 2A+P(X^c|Y)\cdot A/2$

The requirement that the two envelopes are chosen at random means that $\displaystyle P(X)=P(X^c)=1/2$, but the conditional probabilites aren't necessarily both 1/2. Let's find out if theycanbe. Bayes's theorem tells us that

$\displaystyle P(X|Y)=\frac{P(Y|X)P(X)}{P(Y|X)P(X)+P(Y|X^c)P(X^c) }$

We already know what P(X) and P(X^c) are, but what are P(Y|X) and P(Y|X^c)? For example, P(Y|X) is the conditional probability that the envelope we choose first contains A, given that we have chosen the envelope with the smaller amount. Another way of saying that is that P(Y|X) is the probability that the amounts in the envelopes arespecificallyA and 2A, rather than some other amounts.

This is when we must realize that it's not possible to make sense of the expression that we have claimed is the expectation value, unless we postulate the existence of something like a function $\displaystyle Q0,\infty]\rightarrow[0,1]$ that tells us the probability Q(x) that the smaller value is x.

Actually we should postulate the existence of aprobability distributiong on the set of positive real numbers. This is basically a "function" that assigns a probabilitydensityg(x) to each x. The reason I put quotation marks around the word "function" is that g isn't technically a function, since it can have infinitely sharp peaks like a delta function (which is also not a function...yeah, I know). Given the probability distribution g, we could obtain the probability that the smaller value is in the interval (a,b) by integrating g over that interval.

For the moment I will only consider discrete probability distributions. This is equivalent to saying that we are allowing ourselves to use the function Q as I defined it above, with the added requirement that it can only be non-zero on a countable subset of its domain of definition. (This subset can still be unbounded, and hence infinite).

Now that we have defined Q, we see that

$\displaystyle P(Y|X)=Q(A)$

$\displaystyle P(Y|X^c)=Q(\frac{A}{2})$

This means that

$\displaystyle P(X|Y)=\frac{Q(A)\cdot\frac{1}{2}}{Q(A)\cdot\frac{ 1}{2}+Q(\frac{A}{2})\cdot\frac{1}{2}}=\frac{1}{1+\ frac{Q(\frac{A}{2})}{Q(A)}}$

This is only =1/2 if

$\displaystyle Q(A/2)=Q(A)$.

If this doesn't hold for arbitrary A, then the expression we've been using for the expectation value simply isn't valid. We're assuming that it is valid, so we must have Q(x/2)=Q(x) for all x. But we can't have Q(A/2)=Q(A) for arbitrary A, because that would make the sum of all Q(x) infinite. That sum has to be =1 of course, since it's the sum of the probabilities of all possible values of the smaller amount.

According to Devlin, this result solves the problem. He doesn't say much after arriving at this result, and ends his article with the following words:

Originally Posted byDevlin

Why I don't think that this solves the problem

If the above line of reasoning is correct, then the result of the calculation shouldn't just be different from 5/4*A. It should beexactlyA, i.e. we should have

$\displaystyle P(X|Y)\cdot 2A+P(X^c|Y)\cdot A/2=A$

$\displaystyle \frac{Q(A)\cdot\frac{1}{2}}{Q(A)\cdot\frac{1}{2}+Q (\frac{A}{2})\cdot\frac{1}{2}}\cdot 2A+\frac{Q(\frac{A}{2})\cdot\frac{1}{2}}{Q(\frac{A }{2})\cdot\frac{1}{2}+Q(A)\cdot\frac{1}{2}}\cdot \frac{A}{2}=A$,

$\displaystyle Q(A)\cdot 2A+Q(\frac{A}{2})\cdot \frac{A}{2}=\big(Q(A)+Q(\frac{A}{2})\big)A$,

$\displaystyle Q(A)=\frac{1}{2}\cdot Q(\frac{A}{2})$.

So if the expression we've been using for the expectation value really is correct, then Q must satisfy Q(x)=1/2*Q(x/2) for all x. But if it does, then Q has a form that's even more absurd then the uniform distribution that we dismissed earlier, since Q(x) goes to infinity as x goes to zero.

Devlin found that there's no Q that makes $\displaystyle P(X|Y)=P(X^c|Y)=1/2$ (and hence the expectation value equal to 5/4*A) for arbitrary A, and concluded that this resolves the paradox. But we have found, using the same methods, that there's also no Q that makes the expectation value equal to A for arbitrary A.

This makes me think that that Devlin's solution is incorrect.

Interpretation of the probability distribution

When I started thinking about this problem, I thought of the probability distribution g (or the function Q) as representing themethodthat someone used to determine the amounts that he or she later put in the envelopes. The problem with that is that since that method isn't known to us, we won't ever be able to calculate the expectation value this way. The alternative is to think of it as representing ourbeliefabout what the smaller amount is. Then we can assume that the probability distribution is known to us, and we can calculate the expectation value, but we will only get the right answer if we had the right "belief" to begin with!

My conclusions

To define A to be the amount in the first envelope is to treat that value as if it's already known to us (regardless of whether we actually know it or not). Because of this, the factors of 1/2 in the first calculation in this post must be replaced by conditional probabilities, as we did in the section titled "The solution?". These conditional probabilities must be calculated using Bayes's theorem, and but when we try, we see that the result depends on a probability distribution g that isn't known to us. This means that wecan'tcalculate the expectation value this way.

However, since we know that the correct expectation value is A, we might be able to use Bayes's theorem to find a class of probability distributions that are reasonable instead. A "reasonable" probability distribution would be one that makes the expectation value equal to A, when we calculate it using Bayes's theorem. Since the probability distibution can be interpreted as representing our belief about what the smaller amount is, this procedure might tell us which beliefs are reasonable and which ones are absurd. We have only investigated the discrete probability distibutions, and found thatallof them must be considered absurd. However,thatdoesn't seem reasonable, so now I'm back to thinking that there's something fundamentally wrong with this method of calculating the expectation value.

So what is the correct solution of this problem? I still don't know. Maybe I'll find it in some other article.

The mistakes in the Wikipedia article

The simple "solution" that's suggested early in the Wikipedia article isn't a solution at all.

The "paradox" is that we have two calculations, the first and the second in this post, that give us different answers. The challenge isn't to discover the second calculation, but to explain what's wrong with the first one. The text I just quoted is equivalent to saying "you should use the second instead of the first". This is to ignore the problem, not to solve it. It can't possibly be wrong to define A as the amount in the first envelope, and once that definition has been made, A is a constant.Originally Posted byWikipedia

The Wikipedia article is also wrong when it claims that the problem is more difficult when you're allowed to look inside the first envelope and find out what A is. It seems to me that the author has misunderstood why we are using conditional probabilites. The reason is that we're calculating the expectation value as a function of A. That forces us to treat A as if we know its value, even though we don't.