probability of DNA sequence errors

I am a molecular biologist with a probability question. Before I get to the question, I will give you a little background information about DNA. As you may know, DNA consists of two long polymers of simple units called nucleotide bases (4 types: A, C, G or T) arranged in an antiparallel double helix. To characterize a DNA molecule, one strand can be sequenced using various procedures to determine the order of the nucleotide bases (to "call the bases"). As the two DNA strands are complementary (A always pairs with T, and G always pairs with C), this information also tells you the sequence of the other strand.

e.g. If your sequence is AAGTCGTTCAAA>, the sequence of the complementary strand is TTTGAACGACTT> (> gives strand orientation). So in the DNA molecule the strands will be arranged:

AAGTCGTTCAAA>

<TTCAGCAAGTTT

Now here is the question. In one commonly used method of DNA sequencing, the accuracy of each base call is 99%, i.e. the probability of a wrong base call is 1 in 100. What I would like to know is if this method is used to sequence separate copies (or clones) of the same DNA molecule, how do I work out the percentage of these sequences with errors? In my work I have sequenced 195 clones of a 113-base pair long DNA and there are differences that may be attributed to this sequencing error. Given that there is a probability of 1 in 100 of an error at each base, how many of these 195 sequences should be identical (i.e. error free) and how many will have single or multiple errors? Can you tell me how to calculate this? Thanks in advance for your help.

Re: probability of DNA sequence errors

Bioinformatician here (all be it a fairly new one).

To answer your question you can use binomial distribution. This page: Binomial Calculator has a calculator. And this page: Binomial Distribution has a little information on how binomial distribution is calculated. A Google search will pull up a lot more too.

To solve the problem we need to calculate the probability of there being a problem in 113 bases. You can use the formula in the second link or use the calculator in the first:Probability your base pair is accurately sequenced (P)

Probability of success on a single trial: 0.99

Number of base pairs (n)

Number of trials: 113

For 0 failures

Number of successes (x): 113

Now if you've gone for the calculator option you want:

$\displaystyle P(X = 113) = 0.321201074564791$

Using the formula if you've got it right you should get the same value for $\displaystyle b(x; n, P)$

So now we have the probability of a given trial containing no errors. Now it's a simple matter to multiply the probability by the 195 of sequences.

$\displaystyle 195 * 0.3212 = 62.634$

So of those 195 sequences on average approximately 63 will be correct and identical. There's also a good chance (well good enough to code for anyway) that there will be a couple more which are wrong and identical.

Hope this helps.

-Matthew

Re: probability of DNA sequence errors

Thanks very much for your answer Matthew. It matches one that I received from a friend who seems to have solved the problem in a slightly different way:

Each base can be represented as a box – the number of boxes is N (113 in this case)

The likelihood of each base being called correctly is 0.99 (i.e. 99% accuracy)

The probability P that the sequence of length N has no errors is (0.99)^{N}

Here P = (0.99)^{113} = 0.321

The number of copies is C (195 in this case)

The expected number of correct copies is C x P = 195 x 0.321 = 62.63

The variance is C x P x (1 - P) = 195 x 0.321 x (1 - 0.321) = 62.63 x 0.679 = 42.52

The SD is the square root of the variance, i.e. √C x P x (1 - P) = √42.52 = 6.52

The number of error free sequences is likely to be 62.63±6.52 (±1SD)

How would I work out the confidence interval in this case, or for ±2SD or ±3SD?

My friend also said that I should be able to use the random number function in Excel to do the experiment in silico and plot the distribution as a nice curve. Can anyone tell me how to do this?

Re: probability of DNA sequence errors

Yeah, your friends way is easier. It's a specific case of Binomial distribution. If you want a nice curve in Excel though, you're going to have to use binomial distribution. I believe Excel comes with a binomial distribution function, if not:

$\displaystyle b(x; n, P) = nCx * P^x * (1 - P)^{n - x}$

Set it up starting with 195 as your x in your first row, 194 in your second select both rows and drag the bottom right hand corner down until x=0. Should note that you may have problems with this method though (explained later).

If you have a lot of memory you can do it by experiment use =IF(RAND()<0.32,1,0) and drag it to the right 194 places. Then sum the row at the end. Select the row and drag downwards as much as you like, the more you do, the more accurate the curve will be. Should note with this method that you won't get accurate numbers on the tails as the probabilities drop as low as $\displaystyle 6.62999*10^{-97}$ and last time I checked Excel only had 65536 rows per sheet. Also note that I don't have Excel and haven't checked the formula works.

For finding the confidence interval, again Bionomial distribution. Obviously binomial distribution is discrete and SD is continuous, so it will need to be rounded.

**Example**

2SD = 13.04

$\displaystyle \theta = 62.63$

$\displaystyle 62.63-13.04 = 49.6$

$\displaystyle 62.63+13.04 = 75.7$

Both of these answers need to be rounded so we get 50 and 76.

So the confidence interval will be the sum of the binomial distributions up to 76 minus the sum of the binomial distributions up to 50.

There is a problem though:

195 C 76 = 24659266981287296105114525661007929017583646799238 439200

For most software 195 choose 76 is going to give you some significant errors (depends how they're representing the number) due to the size. Be wary of this if you use Excel (I don't know if it is the case and can't check, I'm on linux here). The stattrek calculator seems fairly decent but appears to be down at the moment but seeing as you've probably got a good use for it and I had a project with most of the code in already I quickly coded it up in C++ with GMP for you. If you want the code, just ask.

$\displaystyle P(X < 50) = 0.0203332$

$\displaystyle P(X < 76) = 0.974426$

$\displaystyle P(50 < X < 76) = 0.974426 - 0.0203332 = 0.95409$

So ±2SD is around the 95% confidence interval.

I also worked out ±3SD for you too as:

$\displaystyle P(X < 43) = 0.0007203$

$\displaystyle P(X < 82) = 0.998584$

$\displaystyle P(50 < X < 76) = 0.998584 - 0.0007203 = 0.99786$

So ±3SD is around the 99% confidence interval.

-Matthew

P.S. If Excel bugs out on you here are the values you should get:

Quote:

P(X = 0) = 1.54712e-33

P(X = 1) = 1.42756e-31

P(X = 2) = 6.55242e-30

P(X = 3) = 1.99468e-28

P(X = 4) = 4.53055e-27

P(X = 5) = 8.18936e-26

P(X = 6) = 1.22712e-24

P(X = 7) = 1.56779e-23

P(X = 8) = 1.74338e-22

P(X = 9) = 1.71406e-21

P(X = 10) = 1.5086e-20

P(X = 11) = 1.20058e-19

P(X = 12) = 8.71088e-19

P(X = 13) = 5.80238e-18

P(X = 14) = 3.56932e-17

P(X = 15) = 2.03802e-16

P(X = 16) = 1.08492e-15

P(X = 17) = 5.40551e-15

P(X = 18) = 2.52941e-14

P(X = 19) = 1.115e-13

P(X = 20) = 4.64295e-13

P(X = 21) = 1.83083e-12

P(X = 22) = 6.8519e-12

P(X = 23) = 2.43874e-11

P(X = 24) = 8.27024e-11

P(X = 25) = 2.67676e-10

P(X = 26) = 8.28174e-10

P(X = 27) = 2.4529e-09

P(X = 28) = 6.96414e-09

P(X = 29) = 1.89767e-08

P(X = 30) = 4.96872e-08

P(X = 31) = 1.25142e-07

P(X = 32) = 3.03481e-07

P(X = 33) = 7.09319e-07

P(X = 34) = 1.59924e-06

P(X = 35) = 3.48103e-06

P(X = 36) = 7.32083e-06

P(X = 37) = 1.48865e-05

P(X = 38) = 2.92888e-05

P(X = 39) = 5.5792e-05

P(X = 40) = 0.000102961

P(X = 41) = 0.000184186

P(X = 42) = 0.000319568

P(X = 43) = 0.000538049

P(X = 44) = 0.000879527

P(X = 45) = 0.00139653

P(X = 46) = 0.00215486

P(X = 47) = 0.00323253

P(X = 48) = 0.00471627

P(X = 49) = 0.00669509

P(X = 50) = 0.00925071

P(X = 51) = 0.0124454

P(X = 52) = 0.0163081

P(X = 53) = 0.0208209

P(X = 54) = 0.0259078

P(X = 55) = 0.0314284

P(X = 56) = 0.037179

P(X = 57) = 0.0429016

P(X = 58) = 0.0483015

P(X = 59) = 0.053072

P(X = 60) = 0.0569231

P(X = 61) = 0.0596113

P(X = 62) = 0.0609646

P(X = 63) = 0.060901

P(X = 64) = 0.0594366

P(X = 65) = 0.0566824

P(X = 66) = 0.0528303

P(X = 67) = 0.048132

P(X = 68) = 0.0428717

P(X = 69) = 0.0373389

P(X = 70) = 0.0318031

P(X = 71) = 0.0264946

P(X = 72) = 0.0215915

P(X = 73) = 0.0172147

P(X = 74) = 0.0134296

P(X = 75) = 0.0102524

P(X = 76) = 0.00765998

P(X = 77) = 0.0056017

P(X = 78) = 0.00400999

P(X = 79) = 0.0028102

P(X = 80) = 0.00192815

P(X = 81) = 0.00129536

P(X = 82) = 0.000852152

P(X = 83) = 0.000548976

P(X = 84) = 0.00034636

P(X = 85) = 0.000214027

P(X = 86) = 0.000129538

P(X = 87) = 7.67964e-05

P(X = 88) = 4.45982e-05

P(X = 89) = 2.53716e-05

P(X = 90) = 1.41399e-05

P(X = 91) = 7.72023e-06

P(X = 92) = 4.12964e-06

P(X = 93) = 2.16422e-06

P(X = 94) = 1.11125e-06

P(X = 95) = 5.59041e-07

P(X = 96) = 2.75555e-07

P(X = 97) = 1.33078e-07

P(X = 98) = 6.29714e-08

P(X = 99) = 2.91955e-08

P(X = 100) = 1.32624e-08

P(X = 101) = 5.90284e-09

P(X = 102) = 2.57409e-09

P(X = 103) = 1.09978e-09

P(X = 104) = 4.60359e-10

P(X = 105) = 1.88792e-10

P(X = 106) = 7.58502e-11

P(X = 107) = 2.98537e-11

P(X = 108) = 1.15105e-11

P(X = 109) = 4.34733e-12

P(X = 110) = 1.60829e-12

P(X = 111) = 5.82768e-13

P(X = 112) = 2.0682e-13

P(X = 113) = 7.18833e-14

P(X = 114) = 2.44666e-14

P(X = 115) = 8.15448e-15

P(X = 116) = 2.66112e-15

P(X = 117) = 8.50239e-16

P(X = 118) = 2.65944e-16

P(X = 119) = 8.14271e-17

P(X = 120) = 2.44027e-17

P(X = 121) = 7.1573e-18

P(X = 122) = 2.05427e-18

P(X = 123) = 5.76913e-19

P(X = 124) = 1.5851e-19

P(X = 125) = 4.26031e-20

P(X = 126) = 1.11996e-20

P(X = 127) = 2.87929e-21

P(X = 128) = 7.23803e-22

P(X = 129) = 1.77886e-22

P(X = 130) = 4.27344e-23

P(X = 131) = 1.00336e-23

P(X = 132) = 2.30196e-24

P(X = 133) = 5.15968e-25

P(X = 134) = 1.12965e-25

P(X = 135) = 2.41533e-26

P(X = 136) = 5.04226e-27

P(X = 137) = 1.02753e-27

P(X = 138) = 2.04351e-28

P(X = 139) = 3.96527e-29

P(X = 140) = 7.50532e-30

P(X = 141) = 1.38531e-30

P(X = 142) = 2.49281e-31

P(X = 143) = 4.37185e-32

P(X = 144) = 7.47037e-33

P(X = 145) = 1.24331e-33

P(X = 146) = 2.0148e-34

P(X = 147) = 3.17795e-35

P(X = 148) = 4.87711e-36

P(X = 149) = 7.27965e-37

P(X = 150) = 1.05636e-37

P(X = 151) = 1.48965e-38

P(X = 152) = 2.04047e-39

P(X = 153) = 2.71358e-40

P(X = 154) = 3.50192e-41

P(X = 155) = 4.38323e-42

P(X = 156) = 5.31821e-43

P(X = 157) = 6.25124e-44

P(X = 158) = 7.11425e-45

P(X = 159) = 7.83374e-46

P(X = 160) = 8.34041e-47

P(X = 161) = 8.57957e-48

P(X = 162) = 8.52051e-49

P(X = 163) = 8.16258e-50

P(X = 164) = 7.5365e-51

P(X = 165) = 6.70013e-52

P(X = 166) = 5.7297e-53

P(X = 167) = 4.70814e-54

P(X = 168) = 3.71308e-55

P(X = 169) = 2.80703e-56

P(X = 170) = 2.03146e-57

P(X = 171) = 1.40536e-58

P(X = 172) = 9.2791e-60

P(X = 173) = 5.83745e-61

P(X = 174) = 3.49247e-62

P(X = 175) = 1.98312e-63

P(X = 176) = 1.06636e-64

P(X = 177) = 5.4165e-66

P(X = 178) = 2.59183e-67

P(X = 179) = 1.16477e-68

P(X = 180) = 4.89917e-70

P(X = 181) = 1.92119e-71

P(X = 182) = 6.993e-73

P(X = 183) = 2.35067e-74

P(X = 184) = 7.25422e-76

P(X = 185) = 2.04102e-77

P(X = 186) = 5.19243e-79

P(X = 187) = 1.18252e-80

P(X = 188) = 2.38109e-82

P(X = 189) = 4.17299e-84

P(X = 190) = 6.23564e-86

P(X = 191) = 7.72419e-88

P(X = 192) = 7.61461e-90

P(X = 193) = 5.60077e-92

P(X = 194) = 2.7322e-94

P(X = 195) = 6.62999e-97