Calculation of probability that two sets of outcomes are always disjoint

Hello all,

I am a bioinformatician with a probability calculation problem. It is seemingly trivial, but I get lost very quickly with probibilities.

I need to infer a calculus, a formula, with which I can assess the probability, no matter how small or big, that two urns with balls give disjoint sets of outcomes by design.

Lets say that we have 2 urns, where there are steel balls and teflon balls. The sampling device (with replacement) is magnetic so the teflon balls are never drawn.

Our urns have exactly 20 balls each, and each ball is represented only once, so our theoretical set of outcomes is 1-20.

So, I know that there are some teflon balls (not drawable) in my 2 urns. **I am suspecting that the remaining sets of steel balls are disjoint** between the 2 urns.

A working example:

Urn A, 20 draws with replacement, [1,2,3,4,5,6,7,8,9,10] the outcomes.

Urn B, 20 draws with replacement, [13,14,15,16,17,18,19,20] the outcomes.

Can I calculate the probability, no matter how small or big,

that in urn A, balls [13-20] are in teflon, given that they were not drawn in the first 20 trials,

**AND**

that in urn B, balls [1-10] are in teflon, given that they were not drawn in the first 20 trials.

In other words, a theoretical calculation of the eventuality that the two urns can only have disjoint outcomes - no matter how small or big the probability of that.

Please let me know if my description is poor, in which case I can rephrase.

Thank you very much in advance.

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**noplacebos** Hello all,

I am a bioinformatician with a probability calculation problem. It is seemingly trivial, but I get lost very quickly with probibilities.

I need to infer a calculus, a formula, with which I can assess the probability, no matter how small or big, that two urns with balls give disjoint sets of outcomes by design.

Lets say that we have 2 urns, where there are steel balls and teflon balls. The sampling device (with replacement) is magnetic so the teflon balls are never drawn.

Our urns have exactly 20 balls each, and each ball is represented only once, so our theoretical set of outcomes is 1-20.

The formulation of your problem is somewhat difficult to understand. Are the balls basically 'labelled'? Is that what you mean by "each ball is represented only once"? Also, if the sampling device only picks up steel balls, then the probability is 1 (i.e. 100%) that you will draw a steel ball. The teflon balls in this case are irrelevant to the question, unless there is some complicated mechanism preventing the steel balls from being picked up, etc.

Does one urn have n steel balls labelled 1 to n ((20 - n) are teflon), and the other urn have m steel balls labelled 1 to m ((20 - m) are teflon)? Is that more accurate?

Quote:

Originally Posted by

**noplacebos** So, I know that there are some teflon balls (not drawable) in my 2 urns. **I am suspecting that the remaining sets of steel balls are disjoint** between the 2 urns.

A working example:

Urn A, 20 draws with replacement, [1,2,3,4,5,6,7,8,9,10] the outcomes.

Urn B, 20 draws with replacement, [13,14,15,16,17,18,19,20] the outcomes.

Can I calculate the probability, no matter how small or big,

that in urn A, balls [13-20] are in teflon, given that they were not drawn in the first 20 trials,

**AND**

that in urn B, balls [1-10] are in teflon, given that they were not drawn in the first 20 trials.

The probability, according to you, of drawing a teflon ball is zero. Therefore the probability of **not** drawing them is 1. Therefore, you may just be asking what the probability of selecting 20 steel balls, with replacement, and never drawing any steel ball more than once? Is that correct?

Quote:

Originally Posted by

**noplacebos** In other words, a theoretical calculation of the eventuality that the two urns can only have disjoint outcomes - no matter how small or big the probability of that.

Please let me know if my description is poor, in which case I can rephrase.

Thank you very much in advance.

It feels like you are trying to give us an analogy to your problem. Perhaps a succinct description of the original issue would be best, if you can't come up with an adequate analogy. Many of us here are degree holders of one form or another and won't shy away if you throw some biology at us.

Re: Calculation of probability that two sets of outcomes are always disjoint

Thank you for your quick reply majamin. I apologise for the poor description.

I will attempt a simplified version of the biological problem, although the math involved surely applies to a great variety of different scenarios. I didn’t want to give biological details because they tend to confuse and turn the attention away from the mathematical problem itself.

So, proteins are constituted by a sequence of amino-acids, which form peptide chains. There are 20 available amino-acids to choose from for each position of the peptide chain.

Each one of the 20 available amino-acid is represented by one letter,

e.g. M for methionine, A for alanine, S for serine, etc.

This is a group of 14 oligopeptides, let's call it set A:

**M**DKFRV

**Q**GPTRL

**Q**GEVTI

**S**GAKNA

**A**LPILF

**A**LLAEE

**P**VEIQN

**V**PKLKD

**I**DTTMK

**M**LTQLG

**I**GTFCV

**P**TDCER

**Q**WDREF

**P**HYTER

Then, we have another group of 12 oligopeptides, we call it set B:

**T**KVERD

**G**SVWID

**H**SNVNN

**F**SAPYD

**W**KTMRA

**K**IWALG

**R**PVARF

**G**QGQVS

**L**PGGCA

**F**GARPV

**W**RTHYD

**L**IWFTR

Now, these two sets come from different experiments.

The only thing we care about at this stage, is the first amino acid in their sequence, noted in bold.

In set A, the first amino-acid is always one of the following:

M, Q, S, A, P, V, I - 7 amino-acids appearing out of the 20 available (note: of course, since we only got 14 peptides, the maximum number of different amino-acids that could appear was just 14).

In set B, the first amino-acid is always one of the following:

T, G, H, F, W, K, R, L – 8 amino-acids appearing out of the 20 available (although, again, here we could only have a maximum of 12 different amino-acids appearing, since the 12 peptides in the set).

These amino-acids, by themselves, are irrelevant to us.

What we do care about, is that our two sets A and B of peptides, never show the same amino-acid at the beginning of their sequence. In other words, the sets of leading amino-acids are disjoint.

I would like to assess the probability of this event (that the sets of leading amino-acids remain disjoint between a group of 14 peptides –setA- and another of 12 peptides -setB). Ultimately, I would like to be able to recognise whether this event is significantly improbable, e.g. with a probability p<0.001. For the biological part, if this event is significantly improbable, the fact that we observed it nonetheless, means something.

Of course, this is just a random example. The two sets A and B, could contain hundreds of peptides each.

I hope all this makes my question more clear – if it is preferable perhaps we could go back to a model with urns and balls?

In the analogy part, urns are the leading position in the peptide sequence,

amino acids are balls in the urn, 20 available, labelled 1-20,

draws, are the peptide population of each set,

steel balls are the amino-acids that can appear at the leading position of that peptide set,

teflon balls are the amino-acids that can never appear at that position of that peptide set.

Therefore, the amino-acids that are observed in our 2 experiments, A and B, correspond to steel balls in each set. We don’t know how many more there are. We only know, that our two experiments, have disjoint outcomes so far. Is it significant, given the trials?

Thanks again for your time.

Re: Calculation of probability that two sets of outcomes are always disjoint

This makes sense, mathematically anyway (I'm no biology major). For set A, there are $\displaystyle 20^{14}$ combinations of oligopeptides (concerned with only the first letter). For set A, there are $\displaystyle {20 \choose 14}=38760$ different combinations of oligopeptides that have non-repeating first letters.

Hence, the probability is $\displaystyle \frac{{20 \choose 14}}{20^{14}}$ which is very small indeed (on the order of $\displaystyle 10^{-13}$).

Similar calculation follows for Set B (again, a tiny probability).

Re: Calculation of probability that two sets of outcomes are always disjoint

This probability corresponds to non-repeated first letters in the 14 peptides of set A. But set A has repeated letters at first position, e.g. in our example, 'Q' appears 2 times, 'P' appears 3 times, etc. The same goes within set B peptides.

The probability of the event I would need to calculate is the observation that within set A, the first letter is never a first letter that appears in set B, and within set B, we never obtain a first letter that appears in A.

The first letters in A and B, are mutually exclusive, i.e. **none** of the [M, Q, S, A, P, V, I] that appeared in A, did appear in B as well.

And **none** of [T, G, H, F, W, K, R, L] that appeared in B, did appear in A as well.

Two urns,disjoint sets of outcomes - can we calculate this probability (that they came out disjoint). Was it normal, given the trials and the draws?

Re: Calculation of probability that two sets of outcomes are always disjoint

I may be way off, but isn't this problem down to multiplying the probabilities of A not drawing an outcome of B, and B not drawing an outcome of A?

P(A) = P(not T, G, H, F, W, K, R, L)^14 = (1-(8/20))^14 = (12/20)^14

P(B) = P(not M, Q, S, A, P, V, I)^12 = (1-(7/20))^12 = (13/20)^12

P(disjoint outcomes between A and B) = P(A)*P(B) =~ 4*10^-6

Perhaps I am missing something?

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**noplacebos** I may be way off, but isn't this problem down to multiplying the probabilities of A not drawing an outcome of B, and B not drawing an outcome of A?

P(A) = P(not T, G, H, F, W, K, R, L)^14 = (1-(8/20))^14 = (12/20)^14

P(B) = P(not M, Q, S, A, P, V, I)^12 = (1-(7/20))^12 = (13/20)^12

P(disjoint outcomes between A and B) = P(A)*P(B) =~ 4*10^-6

Perhaps I am missing something?

I've been thinking about this one for a bit now, and the probability depends on how many symbols appear in set A. I've been looking at the problem on a case by case basis where you check each amount (are you perhaps checking for a fixed number of symbols? not varying?). I'll get back to you, soon hopefully.

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**majamin** I've been thinking about this one for a bit now, and the probability depends on how many symbols appear in set A. I've been looking at the problem on a case by case basis where you check each amount (are you perhaps checking for a fixed number of symbols? not varying?). I'll get back to you, soon hopefully.

Thanks for that. No, there are no fixed constraints.

You may get a case where set A "spits" 12 unique symbols over 100 draws (peptides), while set B shows 2 unique symbols over 2 draws, which don't appear in A.

You may get another case where set A has 5 unique symbols over 5 draws, while set B has 5 unique symbols over 200 draws, different than A.

The only standard parameters are that there are 20 possible outcomes per draw, and that we have observed sets A and B giving mutually exclusive outcomes - what appears in A doesn't appear in B and vice versa (which means that there is no disjoint event if one set shows all 20 symbols after n draws, or if both sets share at least one common symbol between them after m and n draws - sorry, this is just an obvious definition of "disjointness", but I am repeating the description to limit any confusions).

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**noplacebos** I may be way off, but isn't this problem down to multiplying the probabilities of A not drawing an outcome of B, and B not drawing an outcome of A?

P(A) = P(not T, G, H, F, W, K, R, L)^14 = (1-(8/20))^14 = (12/20)^14

P(B) = P(not M, Q, S, A, P, V, I)^12 = (1-(7/20))^12 = (13/20)^12

P(disjoint outcomes between A and B) = P(A)*P(B) =~ 4*10^-6

Perhaps I am missing something?

The only thing I would need to say is that this is a probability of one particular outcome, and you, I'm sure, are interested whether this disjointedness occurs in other ways (i.e. counting how many different ways can the sets be disjoint).

I made a quick table in excel showing some possibilities for set B, that is, if there are n symbols in set A, then here are the probabilities of not seeing said n symbols in set B (for the first 9, this would go up to 12, of course):

n | Probability |

1 | 0.54 |

2 | 0.28 |

3 | 0.14 |

4 | 0.069 |

5 | 0.032 |

6 | 0.014 |

7 | 0.006 |

8 | 0.002 |

9 | 0.0008 |

Using the calculation $\displaystyle \left(\frac{20-n}{20}\right)^{12}$ for each row. Now, this gives the conditional probability P(B not any of Symbols in A|Symbols A), and if indeed set A and set B are independent events, then one only has to rely on the above table, otherwise, you may have to calculate P(B disjoint from A) by other means.

Re: Calculation of probability that two sets of outcomes are always disjoint

That's great for P(B not any of Symbols in A|Symbols A). But in the same time we want no symbols from B to exist in A, so P(A not any of Symbols in B|Symbols B).

In our working example:

P(A not any of Symbols in B|Symbols B) = ((20-8)/20)^14 =~ 0.00078

P(B not any of Symbols in A|Symbols A) = ((20-7)/20)^12 =~ 0.0057

So for the total probability of disjointness for both A and B, given one another, we would get:

P(total) = P(A not any of Symbols in B|Symbols B) * P(B not any of Symbols in A|Symbols A) = 4.4*10^-6

Does this seem right?

1 Attachment(s)

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**noplacebos** That's great for P(B not any of Symbols in A|Symbols A). But in the same time we want no symbols from B to exist in A, so P(A not any of Symbols in B|Symbols B).

In our working example:

P(A not any of Symbols in B|Symbols B) = ((20-8)/20)^14 =~ 0.00078

P(B not any of Symbols in A|Symbols A) = ((20-7)/20)^12 =~ 0.0057

So for the total probability of disjointness for both A and B, given one another, we would get:

P(total) = P(A not any of Symbols in B|Symbols B) * P(B not any of Symbols in A|Symbols A) = 4.4*10^-6

Does this seem right?

Hopefully I'm not mistaken, but, if you are calculating the significance of disjointedness, then it is not sufficient to show the probability of a specific outcome, but a range of them. You want to calculate the total probability as in:

$\displaystyle \displaystyle P(\text{A is disjoint from B}) =\sum_{i=1}^{12} P(\text{A disjoint from B}|\text{i symbols in B}) * P(\text{i symbols in B})$

The conditional probabilities can be calculated from $\displaystyle \left(\frac{(20-n)}{20}\right)^{12}$, but P(n symbols in A) and P(m symbols in B) need to be determined. I think I've found a way, but it involved an excel spreadsheet, which I've attached. The calculation of P(n symbols in A) was determined by counting the number of arrangements of n symbols, chosen from 20, i.e. $\displaystyle {20 \choose n} n^{14} - \text{(Previous Combinations to avoid double counting)}$.

I get P(A is disjoint with B) = 2.61268 E -05 and P(B is disjoint with A) = 2.24132 E -05

This makes sense, since it seems extremely unlikely that both sets would have no peptides in common by chance, and it makes more sense for the probability of B being disjoint with A to be lower than vice versa, since A could potentially have more symbols in it.

Re: Calculation of probability that two sets of outcomes are always disjoint

Thanks again majamin, your help is precious. Your procedure makes sense at first sight. I will study it in more detail and implement the code in my tools to test it on my data.

I will let you know if the use of this formula gives meaningful results regarding significance.

(By the way, the total probability would still be P(A is disjoint with B)*P(B is disjoint with A), since I see no reason why the two experiments wouldn't be independent, right?)

Re: Calculation of probability that two sets of outcomes are always disjoint

Quote:

Originally Posted by

**noplacebos** Thanks again majamin, your help is precious. Your procedure makes sense at first sight. I will study it in more detail and implement the code in my tools to test it on my data.

I will let you know if the use of this formula gives meaningful results regarding significance.

(By the way, the total probability would still be P(A is disjoint with B)*P(B is disjoint with A), since I see no reason why the two experiments wouldn't be independent, right?)

To have disjointedness, I believe you need an observation first. If you observe 14 peptides (of set A) first, then the probability that B will be disjoint is as above. Vice versa for set B. But, as I don't understand the procedure of the experiment, it is hard to determine what is dependent and independent. My hunch, not knowing all of the details, is that multiplying the probabilities is not correct.

Re: Calculation of probability that two sets of outcomes are always disjoint

The nature of the experiment is as follows:

All members of set A share one common property, e.g. they are BLUE (sorry for the oversimplification).

All members of set B are RED.

The hypothesis that is explored is that the color of each set is defined/regulated by the symbols that appear at the leading position of the sequences of their members.

In order for this hypothesis to be correct, we first check whether the respective symbols are specific to each set, i.e. we check if they are mutually exclusive. In our example, the sets of symbols being disjoint is an encouraging sign, but then we need to check whether the observation of disjointness is also statistically significant and that it wasn't just another probable random outcome.

I am certainly missing a lot of subtle details with how probabilities work in more complex problems, but as the two sets are put together based on an independent property (cannot be blue and red at the same time), I would think that their probabilities of disjointness need to be considered together. Otherwise, depending on each specific setting of the data, P(A is disjoint with B) could be wildly different than P(B is disjoint with A). And I can't really tell if set A turned blue first, or was it set B that turned red first ;-)

I am not saying that you are wrong - just trying to understand myself, which one of the P(disjoint) I should consider if not both.

Re: Calculation of probability that two sets of outcomes are always disjoint

I spotted a small problem with the above calculation.

If n=20, we get a non-zero probability of disjointness (of the 10^-6 order). It should give 0 probability, if a set contains all 20 symbols after r(>20) draws.

Also I noticed that the SUM(i=0 to n) of P(A is n) [i.e. sum of C column], is always equal to 1, regardless of n... The same is true for SUM(i=0 to m) of P(B is m) [i.e. sum of H column]. I might be wrong, but shouldn't this sum be equal to 1 when and only when n=20 (or m=20 for B)?

Finally, I believe you are right, only one of P(A is disjoint with B) or P(B is disjoint with A) should be considered as the total probability we want. But in that case, the calculation of these two probabilities should give the exact same result?