My humble attempt at a clarification:

Plainly, the joint distribution of (X,Y) tells you $\displaystyle P(X\in A,Y\in B)$ for any subsets A,B.

Then, if you take $\displaystyle B=\mathbb{R}$, it gives you $\displaystyle P(X\in A)$ for any subset A, which is the distribution of X. So, at least, when you know the joint distribution of (X,Y), you know the distributions of X and Y.
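To make this concrete, here is a small sketch with a discrete joint distribution (the table and its numbers are made up for illustration): taking $B$ to be all possible values of $Y$ amounts to summing out $Y$, which recovers the distribution of $X$.

```python
# A toy joint distribution of (X, Y) on {0, 1} x {0, 1},
# written as a table P[x][y]. Numbers are illustrative only.
P = {0: {0: 0.1, 1: 0.3},
     1: {0: 0.2, 1: 0.4}}

# Taking B = "all values of Y" sums Y out and leaves the distribution of X.
P_X = {x: sum(row.values()) for x, row in P.items()}
print(P_X)  # {0: 0.4, 1: 0.6}
```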

But you know much more. For instance, X and Y are independent iff $\displaystyle P(X\in A,Y\in B)=P(X\in A)P(Y\in B)$ for all subsets A, B, and this condition only involves the joint distribution. So the joint distribution tells you whether X and Y are independent.
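In the discrete case the independence test is just a finite check: compare each joint probability with the product of the corresponding marginals. A sketch, using a made-up joint table:

```python
# A hypothetical discrete joint, keyed by the pair of values (x, y).
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginals of X and Y, obtained by summing out the other variable.
P_X = {x: sum(p for (a, _), p in P.items() if a == x) for x in {0, 1}}
P_Y = {y: sum(p for (_, b), p in P.items() if b == y) for y in {0, 1}}

# Independence holds iff P(X=x, Y=y) = P(X=x) * P(Y=y) for every pair.
independent = all(abs(P[(x, y)] - P_X[x] * P_Y[y]) < 1e-12
                  for (x, y) in P)
print(independent)  # False: e.g. P(0,0) = 0.1 but P_X[0]*P_Y[0] = 0.12
```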

More generally, it encodes how the values of X and Y relate to each other. The very fact that X=Y (almost surely) can be read from the joint distribution, while it cannot be read from the distributions of X and Y alone. For the same distribution $\displaystyle \mu$, there are many pairs (X,Y) such that X and Y each have distribution $\displaystyle \mu$; extreme cases are X=Y of law $\displaystyle \mu$, and X,Y independent of law $\displaystyle \mu$.
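The two extreme cases can be sketched by simulation (the uniform law on {1, 2, 3} here is an arbitrary choice of $\mu$): both pairs have the same marginals, yet P(X=Y) is completely different.

```python
import random

random.seed(0)
values = [1, 2, 3]  # an arbitrary choice: mu = uniform law on {1, 2, 3}

# Extreme case 1: X = Y, both of law mu.
def draw_equal():
    x = random.choice(values)
    return x, x

# Extreme case 2: X and Y independent, each of law mu.
def draw_indep():
    return random.choice(values), random.choice(values)

# Same marginals in both cases, but P(X = Y) differs: 1 vs 1/3.
n = 100_000
print(sum(x == y for x, y in (draw_equal() for _ in range(n))) / n)  # 1.0
print(sum(x == y for x, y in (draw_indep() for _ in range(n))) / n)  # approx 1/3
```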

Maybe it will be clearer if you notice that the joint distribution not only tells you the distribution of X but also the conditional distribution of X given Y: $\displaystyle P(X=k|Y=l)=\frac{P(X=k,Y=l)}{P(Y=l)}$ (the right-hand side depends only on the joint distribution).
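That formula is easy to sketch on a discrete joint table (the same kind of made-up table as before): divide the row of the joint by the marginal probability of the conditioning value.

```python
# A hypothetical joint table, keyed by (k, l).
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def cond_X_given_Y(l):
    """P(X = k | Y = l) = P(X = k, Y = l) / P(Y = l), computed from the joint."""
    p_y = sum(p for (_, b), p in P.items() if b == l)   # P(Y = l)
    return {k: P[(k, l)] / p_y for k in {0, 1}}

print(cond_X_given_Y(1))  # {0: 0.3/0.7, 1: 0.4/0.7}
```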

A "visual" way: The joint distribution is a probability measure on $\displaystyle \mathbb{R}^2$ that describes how the values of $\displaystyle (X,Y)$ are distributed in the plane. You can think of hot spots (or peaks) where the measure gives more probability, and it gets colder and colder at infinity (nearer to 0). Then for instance you may have some very hot spot near (1,2), which means that $\displaystyle (X,Y)$ has high probability to be near that point, i.e. with high probability X is near 1 and *at the same time* Y is near 2.

Now, one can see the distributions of X and Y in this setting: they are distributions on each of the axes, obtained by collapsing the measure onto the chosen axis. For instance, P(X=x) is obtained by summing the measure over the whole (vertical) line of equation "X=x"; like a "projection" of the measure. If there was a hot spot at (1,2), then by projection there will be a hot spot at x=1, and likewise at y=2.

But if there are, for instance, hot spots at (1,2) and at (3,4), you will have two spots at 1 and 3 for X, and at 2 and 4 for Y. In that case, the distributions of X and Y alone cannot tell you whether (1,4) is a likely spot for (X,Y).
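This ambiguity can be sketched with two made-up joints: one concentrated on (1,2) and (3,4) only, one spread over all four corners. They have exactly the same marginals, yet they disagree about (1,4).

```python
# Two hypothetical joints with the same "hot spots" on each axis.
joint_a = {(1, 2): 0.5, (3, 4): 0.5}
joint_b = {(1, 2): 0.25, (1, 4): 0.25, (3, 2): 0.25, (3, 4): 0.25}

def marginals(joint):
    """Project the joint onto each axis by summing out the other coordinate."""
    mx, my = {}, {}
    for (x, y), p in joint.items():
        mx[x] = mx.get(x, 0) + p
        my[y] = my.get(y, 0) + p
    return mx, my

print(marginals(joint_a) == marginals(joint_b))        # True: same marginals
print(joint_a.get((1, 4), 0), joint_b.get((1, 4), 0))  # 0 vs 0.25
```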

I don't know if this has clarified anything... You'll probably get used to it and understand the concept progressively.