# A chemistry problem - here?!

• Jan 21st 2011, 11:52 AM
lavoisier
Hi everybody,
I'm new to this forum, and I'm not a mathematician, so please be patient with me. I'm not even sure this is the right subforum, to be honest.
There is a statistical problem I'm trying to solve, which originates in medicinal chemistry. Apologies for the lengthy description below.

Imagine that you have a generic molecule made of a core structure to which are attached 3 substituents, which we can call R1, R2 and R3.
R1, R2 and R3 can each vary within a given finite set, specific to that position. So R1 can take the values {R_1,1, R_1,2, ..., R_1,n}; similarly R2 (1 to p) and R3 (1 to q).
If you define what R1, R2 and R3 are, you define an individual molecule.

By varying these 3 R's, we made a certain number of molecules (note: not the whole set of n*p*q possible molecules), and for each molecule we measured a physical property P (e.g. solubility in water).

In medicinal chemistry, we can make the assumption that the value of P is influenced by the identity of R1, R2 and R3. There are indeed some algorithms (e.g. Free-Wilson) that calculate coefficients for the various R_i,j in order to quantify their impact on P.
The problem is that these methods are computationally very heavy and not easily accessible to non-expert users (like myself).

So I thought a statistical/probabilistic approach might be easier to apply in practice.

Let's say that we can fix a threshold P0 such that molecules with P>=P0 are considered 'good' and those with P<P0 'bad'.

So I have a table with 4 columns: R1, R2, R3 and Good/Bad, and for each R_i,j I can simply count how many times it occurs in total, how many times it occurs in a 'good' molecule, and so on.
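To make the counting concrete, here is a minimal Python sketch of that table (the molecule list and substituent names are invented for illustration):

```python
from collections import Counter

# Hypothetical data: each molecule is (R1, R2, R3, good?), where good means P >= P0.
molecules = [
    ("R1_1", "R2_1", "R3_2", True),
    ("R1_1", "R2_2", "R3_1", False),
    ("R1_2", "R2_1", "R3_1", True),
    ("R1_2", "R2_2", "R3_2", False),
    ("R1_1", "R2_1", "R3_1", False),
]

total = Counter()  # how many molecules contain each substituent
good = Counter()   # how many of those molecules are 'good'

for r1, r2, r3, is_good in molecules:
    for sub in (r1, r2, r3):
        total[sub] += 1
        if is_good:
            good[sub] += 1

for sub in sorted(total):
    print(sub, good[sub], "/", total[sub])
```

From these two counters, p(good|R_i,j) is just `good[sub] / total[sub]`, and the background p(good) is the overall fraction of good molecules.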

What I'm trying to do is to use those counts to generate a meaningful 'score' S_i,j for each of the R_i,j substituents, such that the value of S_i,j quantifies how good (when positive) or how bad (when negative) each R_i,j is. This would help me rank the various R_i,j and decide what molecules to make next.

If you could please point me to the right method/theoretical field that can help me tackle this problem, it would be great.

In the meantime I'll tell you what I attempted so far.

I started with a simple Bayesian approach, i.e. I calculated p(good|R_i,j) and subtracted the background p(good) from it. The result was already interesting, but the problem was that I didn't know how to account for the fact that some R_i,j are better represented than others.
For instance, an R_i,j resulting in 3 good outcomes out of 6 molecules containing it will have a p(good|R_i,j)=0.5, but so will an R_i,j with 30 good out of 60.
So one question would be how to account for the significance of the probability I calculate, in such a way that it is incorporated into the score itself, i.e. not reported as a separate value.
I tried the chi-squared test, but I'm not sure it's applicable in this case, or how I would then fold it into the score.
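To illustrate the problem, here is a sketch of the raw score S = p(good|R) − p(good) on invented counts, together with a Pearson chi-squared statistic on the 2x2 table (R present/absent vs good/bad) computed by hand. The counts are made up; the point is only that the raw score is blind to sample size while the chi-squared is not:

```python
def score(good_with_r, total_with_r, good_overall, total_overall):
    """Raw score: p(good | R) - p(good)."""
    return good_with_r / total_with_r - good_overall / total_overall

# Invented background: 40 good molecules out of 100 overall, so p(good) = 0.4.
# Two substituents with the same proportion of good outcomes but different sample sizes:
s_small = score(3, 6, 40, 100)    # 3 good out of 6 molecules containing R
s_large = score(30, 60, 40, 100)  # 30 good out of 60
print(s_small, s_large)  # identical scores: sample size is invisible

def chi2_2x2(good_r, bad_r, good_not_r, bad_not_r):
    """Pearson chi-squared for the 2x2 table (R present/absent vs good/bad)."""
    n = good_r + bad_r + good_not_r + bad_not_r
    row1, row2 = good_r + bad_r, good_not_r + bad_not_r
    col1, col2 = good_r + good_not_r, bad_r + bad_not_r
    chi2 = 0.0
    for obs, r, c in [(good_r, row1, col1), (bad_r, row1, col2),
                      (good_not_r, row2, col1), (bad_not_r, row2, col2)]:
        exp = r * c / n  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    return chi2

# Same proportions, different n: the chi-squared grows with sample size,
# so it does capture the significance the raw score misses.
print(chi2_2x2(3, 3, 37, 57))    # small-sample substituent: small statistic
print(chi2_2x2(30, 30, 10, 30))  # large-sample substituent: much larger
```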

Then I read a bit about 'association rule learning' and 'data mining', and found concepts similar to the Bayesian ones above, but with additional measures defined on top, such as 'confidence', 'lift' and 'leverage'.
I didn't really grasp what each of these means in practice; perhaps the one I need is already among them, but if so I can't tell which.
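For what it's worth, those measures map directly onto the quantities already being computed here. For the rule "contains R ⇒ good": confidence is p(good|R), lift is p(good|R)/p(good), and leverage is p(R and good) − p(R)·p(good). A sketch with invented counts:

```python
def rule_measures(n_r_and_good, n_r, n_good, n_total):
    """Measures for the association rule 'molecule contains R => molecule is good'."""
    p_r = n_r / n_total
    p_good = n_good / n_total
    p_r_good = n_r_and_good / n_total
    confidence = p_r_good / p_r          # = p(good | R)
    lift = confidence / p_good           # > 1 means R is associated with 'good'
    leverage = p_r_good - p_r * p_good   # additive counterpart of lift
    return confidence, lift, leverage

# Invented counts: R appears in 20 of 100 molecules, 12 of those are good,
# and 40 molecules are good overall.
conf, lift, lev = rule_measures(12, 20, 40, 100)
print(conf, lift, lev)
```

Note that confidence − p(good) is exactly the Bayesian score above, and leverage equals p(R) times that same score. None of these measures, however, solves the sample-size problem by itself.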

So this is where I got stuck. Another very important aspect I'm not sure the above analysis could detect is what we call 'synthetic bias'.
For instance, it can happen that most of the molecules containing R_2,1 also contain R_1,1, because it was easier for us to make those than to put R_2,1 into molecules with R_1,j for j ≠ 1.
This means that if R_1,1 is particularly bad for P, then not only will R_1,1 have a low p(good|R_1,1), but so will R_2,1, simply because there are very few or no molecules where R_2,1 appears independently of R_1,1. (Note, though, that there will most likely be molecules that contain R_1,1 and not R_2,1.)
So the other question is whether the score I calculate can discriminate between groups that really are bad for P and those that only look bad because of this bias.
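To see the effect, one can build a deliberately biased toy dataset: below, the hypothetical R2_1 only ever appears together with R1_1, and R1_1 drags P down. R2_1's marginal p(good|R) then looks just as bad, even though R2_1 itself is neutral; splitting the counts by the co-occurring R1 makes the confounding visible (a sketch, all data invented):

```python
# Invented, deliberately biased data: (R1, R2, good?).
# R1_1 is genuinely bad; R2_1 is neutral but was only ever made with R1_1.
molecules = (
    [("R1_1", "R2_1", False)] * 9 + [("R1_1", "R2_1", True)] * 1 +  # R2_1 stuck with R1_1
    [("R1_1", "R2_2", False)] * 8 + [("R1_1", "R2_2", True)] * 2 +  # R1_1 bad with R2_2 too
    [("R1_2", "R2_2", True)] * 7 + [("R1_2", "R2_2", False)] * 3    # R1_2 mostly fine
)

def p_good_given(sub, position):
    """Marginal p(good | substituent sub at the given position)."""
    hits = [m for m in molecules if m[position] == sub]
    return sum(m[2] for m in hits) / len(hits)

print("p(good|R1_1) =", p_good_given("R1_1", 0))  # low: genuinely bad
print("p(good|R2_1) =", p_good_given("R2_1", 1))  # also low: pure synthetic bias
# Conditioning on R1 tells the two cases apart: among the R1_1 molecules,
# R2_1 (1/10 good) does no worse than R2_2 (2/10 good), so R2_1 is not the culprit.
```

A purely marginal score cannot distinguish the two cases; some form of conditioning on (or stratifying by) the co-occurring substituents seems unavoidable here.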