I have been trying to find the best method for analysing my data for a while now, and I wonder whether anyone could offer an outside perspective. I want to undertake some kind of regression analysis to investigate the main influences on a dichotomous outcome. However, my data have certain properties that mean a straightforward logistic regression is potentially not the best way forward:
1) some variables are correlated
2) the data are nonparametric, with no assumption of a normal distribution
3) the events are rare
4) the dependent variable is dichotomous
5) the independent variables are a mixture of nominal, ordinal and numerical
6) there is a large number of observations (10,000)
To address 1), I have run a VIF analysis, but I don't want to simply throw away redundant variables without more information about whether they are marginally or centrally important. So I wanted to undertake a conditional inference tree analysis instead (using the party package in R, based on Carolin Strobl's work), but because of the number of observations, the program wouldn't run. Is there another way to approach the influence of correlated variables?
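To make the collinearity check concrete, this is essentially what I understand the VIF computation to be doing: regress each predictor on all the others and take VIF = 1/(1 - R²). A minimal sketch in Python with numpy, on made-up data rather than my real variables:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (with an intercept).
    """
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# toy data: x2 is nearly a copy of x1, so both get large VIFs,
# while the independent x3 stays near 1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])))  # first two large, third near 1
```

As I understand it, a VIF above ~10 is the usual rule-of-thumb flag, but that is exactly my problem: it tells me *that* variables are redundant, not which one matters substantively.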
I also want to account in some way for rare events bias. I have looked at King and Zeng's relogit package in R, but I need to address 1) first so that I know which variables to enter.
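For what it's worth, the part of King and Zeng's approach I think I understand is the prior correction of the intercept: if the model is fit on a (sub)sample whose event fraction differs from the true population rate, the intercept can be shifted afterwards. A minimal sketch in Python (the function name and the toy numbers are mine, not from the relogit package):

```python
import numpy as np

def intercept_correction(beta0, ybar_sample, tau):
    """Prior correction of a logit intercept, as I understand
    King and Zeng: beta0 was estimated on a sample with event
    fraction ybar_sample, and tau is the true population rate."""
    return beta0 - np.log(((1.0 - tau) / tau) * (ybar_sample / (1.0 - ybar_sample)))

# sanity check: an intercept-only model fit on a 50/50 subsample
# (beta0 = 0) should predict the population rate tau afterwards
b0 = intercept_correction(0.0, 0.5, 0.01)
print(1.0 / (1.0 + np.exp(-b0)))  # ~ 0.01
```

My worry is that this only fixes the intercept, so the choice of which (correlated) covariates go in still has to come first.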
Another paper suggested that ridge regression may dampen the MSE enough that rare events are less of an issue, so I am also trying the CATREG model in SPSS (homals in R) with ridge regression. However, I don't want to keep switching between analyses, as each has different assumptions, equations and results. Is there something obvious I have missed in terms of the best analysis for the many idiosyncrasies of my data?
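In case it clarifies what I mean by ridge here: as I understand it, the penalty just adds lam·‖beta‖² to the negative log-likelihood, which shrinks the coefficients and stabilises correlated ones. A rough numpy sketch of ridge-penalised logistic regression via Newton's method (my own toy implementation, not what CATREG or relogit actually do):

```python
import numpy as np

def ridge_logit(X, y, lam=1.0, n_iter=50):
    """Logistic regression with an L2 (ridge) penalty, fit by
    Newton's method. The intercept is left unpenalised."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])   # add intercept column
    beta = np.zeros(p + 1)
    P = lam * np.eye(p + 1)
    P[0, 0] = 0.0                          # don't shrink the intercept
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-A @ beta))   # fitted probabilities
        W = mu * (1.0 - mu)                    # IRLS weights
        grad = A.T @ (y - mu) - P @ beta       # penalised score
        H = (A * W[:, None]).T @ A + P         # penalised Hessian
        beta = beta + np.linalg.solve(H, grad)
    return beta

# toy data: three predictors, true coefficients (1, -1, 0.5)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))
y = (rng.random(500) < p_true).astype(float)
print(ridge_logit(X, y, lam=0.01))   # close to the true coefficients
print(ridge_logit(X, y, lam=100.0))  # noticeably shrunk towards zero
```

But this is exactly my dilemma: the ridge version answers a slightly different question (biased but lower-variance coefficients) than the unpenalised or rare-events models would.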
Thanks in advance