Principal component analysis

Hi,

I have a large set of experimental data (26 batches) with multiple variables (43) sampled at certain time points.

The data all centres around cell growth: it includes the nutrients, product, waste products and cell growth, so it is all highly correlated and dependent.

I have tested varying one control level and measured the variables as responses. Can I use principal component analysis to examine this data, and will it be worthwhile? I want to see how the control level affects the process and whether it changes the correlation of the data.

Would it be best to build a model using the control data (i.e. the normal process) and then use the batches with the varied parameter as the validation data? Or is there a better method? And is there a way to use one or more of the variables as the desired responses (i.e. product quantity)?

Thanks,

Dan

Re: Principal component analysis

Hey DanGosling.

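Principal Component Analysis is a technique that takes a set of correlated random variables and transforms them into a new set of uncorrelated variables (the principal components), each of which is a linear combination of the originals. In the process it also orders these components by how much of the variation in the data each one accounts for, from high contribution to low contribution.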
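To make that concrete, here is a minimal sketch in Python with NumPy, assuming PCA is done by eigen-decomposition of the sample covariance matrix. The data is synthetic and the names (`pca`, `explained`) are just for illustration, not anything from your data set.

```python
import numpy as np

def pca(X):
    """PCA of a data matrix X (rows = observations, columns = variables).

    Returns eigenvalues (variance along each component, largest first),
    eigenvectors (one component per column) and the scores
    (the data expressed in the new component axes).
    """
    Xc = X - X.mean(axis=0)                  # centre each variable
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs
    return eigvals, eigvecs, scores

# Synthetic example: two variables driven by one hidden factor, plus one
# independent variable (made-up numbers, purely illustrative).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 1))
X = np.hstack([
    hidden + 0.1 * rng.normal(size=(200, 1)),
    2 * hidden + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

eigvals, eigvecs, scores = pca(X)
explained = eigvals / eigvals.sum()   # fraction of total variance per component
print("fraction of variance per component:", np.round(explained, 3))
print("covariance of scores (should be diagonal):")
print(np.round(np.cov(scores, rowvar=False), 6))
```

The off-diagonal entries of the score covariance come out as zero (up to rounding), which is the "de-correlation", and `explained` is the sorted contribution to variation.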

If you are looking at using regression modelling (i.e. Y = f(X), where Y is the response and X is the predictor, which can be a vector), then PCA is useful in some circumstances, particularly when you have a high degree of correlation or linear dependence between the different random variables.

One particular technique is to do a PCA on your predictors and, whenever a component contributes little to the variation in your data (as measured by the eigenvalues obtained in the PCA process, which represent this contribution to variation), you basically throw that component out and do a regression using all of the others.

Intuitively, PCA basically "rotates" the random vectors to make them un-correlated just like you can rotate a bunch of vectors to be aligned with some specified axes or co-ordinate system.
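A small two-variable sketch of the rotation idea, in Python with NumPy and on synthetic data (all names and numbers here are made up for illustration). Strictly speaking the eigenvector matrix is orthogonal, so it may include a reflection as well as a rotation, but either way lengths are preserved and only the axes change.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
# Second variable is strongly tied to the first, so the pair is correlated.
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=300)])
centered = data - data.mean(axis=0)

# Columns of `axes` are the principal directions (the new co-ordinate axes).
_, axes = np.linalg.eigh(np.cov(centered, rowvar=False))
rotated = centered @ axes

# An orthogonal change of axes: axes^T axes = I, so point lengths are unchanged.
is_orthogonal = np.allclose(axes.T @ axes, np.eye(2))
lengths_preserved = np.allclose(
    np.linalg.norm(centered, axis=1), np.linalg.norm(rotated, axis=1)
)

corr_before = np.corrcoef(centered, rowvar=False)[0, 1]
corr_after = np.corrcoef(rotated, rowvar=False)[0, 1]
print("orthogonal:", is_orthogonal, "| lengths preserved:", lengths_preserved)
print("correlation before:", round(corr_before, 3))
print("correlation after: ", round(abs(corr_after), 6))
```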

If you end up using PCA and throwing out the components that aren't significant (due to the nature of the correlation), then you just take the remaining components and do whatever regression you want.
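This is usually called principal component regression. Below is a rough sketch in Python with NumPy, under some assumptions of mine: components are kept until they account for 95% of the variance (the `var_threshold` value is an arbitrary choice, not a rule), the regression is ordinary least squares, and the data is synthetic. The function names `pcr_fit` and `pcr_predict` are hypothetical helpers, not part of any library.

```python
import numpy as np

def pcr_fit(X, y, var_threshold=0.95):
    """Fit y on the leading principal components of X.

    var_threshold is an assumed cut-off: keep the fewest components whose
    cumulative share of variance reaches it.
    """
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    cum = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cum, var_threshold)) + 1, X.shape[1])
    components = eigvecs[:, :k]             # low-variance components are dropped

    scores = Xc @ components
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"x_mean": x_mean, "components": components, "coef": coef, "k": k}

def pcr_predict(model, X):
    scores = (X - model["x_mean"]) @ model["components"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic data: 5 correlated predictors driven by 2 hidden factors.
rng = np.random.default_rng(0)
n = 200
factors = rng.normal(size=(n, 2))
loadings = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]], dtype=float)
X = factors @ loadings.T + 0.05 * rng.normal(size=(n, 5))
y = 3 * factors[:, 0] - 2 * factors[:, 1] + 0.1 * rng.normal(size=n)

model = pcr_fit(X, y)
pred = pcr_predict(model, X)
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("components kept:", model["k"], "| R^2 on the fitting data:", round(r2, 3))
```

For your case, one possible use is to fit on the control batches and call `pcr_predict` on the batches with the varied parameter. One caveat: components are chosen by the variance of X alone, so a low-variance component can still matter for the response.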

If you want to look at relationships between many variables, then one way to do this is to use regression models with interaction terms and test whether the coefficient of an interaction term is 0. If you can't reject that it is 0, you have no evidence that the interaction of those predictors has an effect on the response.

An example would be Y = B0 + B1*(A*B*C) where you test B1 = 0 against B1 != 0 for an interaction effect of (A,B,C) on Y.
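Here is a minimal sketch of that test in Python with NumPy, on synthetic data where the true B1 is set to 0.5 (an invented value). It fits the model by ordinary least squares and computes the t-statistic for B1, which you would compare against a t distribution with n - 2 degrees of freedom; for large n, |t| above roughly 1.96 corresponds to the 5% level. The helper name `slope_t_stat` is hypothetical. In practice one would usually also include the main effects and lower-order interactions in the model; they are left out here only to match the example above.

```python
import numpy as np

def slope_t_stat(x, y):
    """OLS fit of y = B0 + B1*x; returns (B1 estimate, t-statistic for B1 = 0)."""
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - 2)                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)   # covariance of the estimates
    return coef[1], coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
n = 500
A, B, C = rng.normal(size=(3, n))
# Synthetic response with a genuine three-way interaction effect.
Y = 1.0 + 0.5 * (A * B * C) + rng.normal(scale=0.5, size=n)

b1, t = slope_t_stat(A * B * C, Y)
print("estimated B1:", round(b1, 3), "| t-statistic:", round(t, 2))
```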