Hey everyone, I have a massive problem with logistic regression. Don't let the length of this post put you off: you can pretty much ignore all the technical background at the beginning and jump to PROBLEM 2 and PROBLEM 3 at the end of the post.
Thanks so much for your help :)
I had a problem regarding logistic regression (see the thread "Logistic Regression in Social Sciences", although the material in this question is largely self-contained), which another user helped me with.
Now I have a better understanding of logistic regression, and of what my main issue was: the variable selection phase. I understood that multicollinearity is something to avoid at the variable selection stage, but the literature suggests 6 or 7 variables that may be confounders in my exposure-risk problem (exposure = "Does the student like school" LIKESCHOOL, risk = "Has the respondent ever drunk alcohol" DRUNKALC), and I wasn't sure which of these variables to choose. On top of this, what about other confounders (namely products of these variables), and interaction terms of the form LIKESCHOOL*confounder? The method of variable selection that Kleinbaum [Logistic Regression: A Self-Learning Text] suggests is as follows:
Firstly, as a reminder for readers (and for anyone who wants to answer the questions below), the model I am discussing is the E,V,W model: we have an exposure variable E (LIKESCHOOL), confounders V (such as AGE, SEX, etc., and also products and powers of these), and interaction terms built from variables W (of the form LIKESCHOOL*W, where the W's are typically chosen from the V's).
1. "we recommend that the choice of V's be based primarily on prior research or theory, with some consideration of possible statistical problems like multicollinearity that might result from certain choices."
So, if the literature suggested that DRUNKALC depends on two variables, so that both would need to be considered as confounders in my model, how would I test whether they are collinear, and from that decide whether one (or both) should go into the model? A simple linear model with one as the dependent and the other as the independent variable?
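One check that is often suggested for this is the variance inflation factor (VIF): regress each candidate V on all the other candidate V's and compute 1 / (1 - R^2). With only two variables this reduces to the simple one-on-one linear regression mentioned above. Below is a minimal sketch, assuming NumPy is available; the data are synthetic and the variable `school_year` is made up purely to show a collinear pair. A commonly quoted rule of thumb treats a VIF above roughly 10 as a warning sign, but that cutoff is a convention, not something taken from Kleinbaum.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = respondents, columns = candidate V's)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + the other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()               # R^2 of column j on the rest
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic illustration only: school_year is nearly a linear function of age.
rng = np.random.default_rng(0)
age = rng.normal(15, 1.5, 500)
school_year = age - 5 + rng.normal(0, 0.3, 500)
sex = rng.integers(0, 2, 500).astype(float)

vifs = variance_inflation_factors(np.column_stack([age, school_year, sex]))
print(dict(zip(["AGE", "SCHOOL_YEAR", "SEX"], vifs.round(2))))
```

In this made-up example AGE and SCHOOL_YEAR should show large VIFs while SEX stays close to 1, which would argue for keeping only one of the first two.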
2. "we would recommend any of the latter four variables be placed in the initial model [he is referring to placing products and powers of the confounders in the initial model] if prior research or theory supported their inclusion in the model."
He also then mentions that such variables may be omitted from consideration if multicollinearity is found (I presume multicollinearity is found with variables already chosen)... Again, how would I test this?
Either way, I think it's safe to say I don't need to include any terms like AGE*SEX, etc. in my model... right? This is hammered home by his summary: "the simplest choice for the V's is the C's [the independent variables AGE, SEX, etc...] themselves. If the number of C's is very large, it may even be appropriate to consider a smaller subset of the C's considered to be most relevant and interpretable based on prior knowledge."
4. The next step is to determine the W terms - which are variables which go into the model in the form E*W (being interaction terms). I quote: "We recommend that the choice of W's be restricted either to the V's themselves or to product terms involving two V's. Correspondingly, the product terms in the model are recommended to be of the form E*V_i and E*V_i*V_j".
5. NOTE: A condition that is adopted in variable specification is the Hierarchically Well Formulated specification. This states that if, say, E*V_i*V_j is chosen for the initial model, then all lower-order components of this variable (namely E, V_i, V_j, V_i*V_j, E*V_i and E*V_j) must be in the initial model too.
Thus at the end of the variable specification process we have chosen the variables E, V and W to give the initial model, via the methods above.
Now for the reduction of the model: here we go through a number of steps that look at each variable and decide whether to keep it (based on a statistical test or on interpretation), keeping in mind the Hierarchical Principle [any variable that is "decided" (by some means) to be kept in the model at this stage of variable selection/model reduction has the property that all of its components are automatically kept too]. For example, if for some reason the term LIKESCHOOL*SEX*AGE is kept in a reduced version of the initial model, the Hierarchical Principle states that we should also keep SEX, AGE, LIKESCHOOL, etc. in this reduced model.
The steps for deciding which variables are kept in the reduced and final model are described by the Hierarchical Backward Elimination Approach. Given the variables of the initial model, we take the highest-order variables first, go through each one, and (using some statistical test) decide whether to keep it in the reduced (and final) model, not forgetting to apply the Hierarchical Principle...
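To make the condition concrete, here is a small sketch that lists the lower-order components of a product term and checks whether a set of terms is hierarchically well formulated. The string convention ("E*AGE*SEX") and the function names are my own assumptions for illustration, not notation from the book.

```python
from itertools import combinations

def term_key(term):
    """Canonical form of a product term, so 'AGE*E' and 'E*AGE' compare equal."""
    return tuple(sorted(term.split("*")))

def lower_order_components(term):
    """All proper lower-order components of a product term such as 'E*AGE*SEX'."""
    factors = term.split("*")
    components = []
    for size in range(1, len(factors)):
        for combo in combinations(factors, size):
            components.append("*".join(combo))
    return list(dict.fromkeys(components))  # drop duplicates, keep order

def is_hierarchically_well_formulated(terms):
    """True if every lower-order component of every term is also in the model."""
    present = {term_key(t) for t in terms}
    return all(
        term_key(c) in present
        for t in terms
        for c in lower_order_components(t)
    )

print(lower_order_components("E*AGE*SEX"))
# ['E', 'AGE', 'SEX', 'E*AGE', 'E*SEX', 'AGE*SEX']
```

For example, `is_hierarchically_well_formulated(["E", "AGE", "E*AGE*SEX"])` returns False, because SEX and the two-way products are missing.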
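As I understand it, the practical consequence is that at any point only terms that are not a component of some other term still in the model may be considered for removal. A rough sketch of that bookkeeping follows; the term strings and function names are hypothetical, and the sketch captures only the hierarchy constraint. It does not run any statistical test, and it does not encode the fact that the exposure E itself would presumably be kept regardless.

```python
def term_key(term):
    """Sorted list of the factors in a product term such as 'E*AGE'."""
    return sorted(term.split("*"))

def is_component(lower, higher):
    """True if `lower` is a proper lower-order component of `higher`."""
    lo, hi = term_key(lower), term_key(higher)
    if len(lo) >= len(hi):
        return False
    for factor in lo:
        if factor in hi:
            hi.remove(factor)   # consume one occurrence (handles powers like AGE*AGE)
        else:
            return False
    return True

def candidates_for_testing(terms):
    """Terms that are not a component of any other term currently in the model."""
    return [t for t in terms if not any(is_component(t, other) for other in terms)]

model = ["E", "AGE", "SEX", "E*AGE", "E*SEX", "AGE*SEX", "E*AGE*SEX"]
print(candidates_for_testing(model))   # ['E*AGE*SEX']

# If the three-way term is dropped, the two-way products become eligible.
model.remove("E*AGE*SEX")
print(candidates_for_testing(model))   # ['E*AGE', 'E*SEX', 'AGE*SEX']
```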
What statistical tests do you impose to decide which variables should be kept in the reduced/final model?
ANSWER [Recently Added]
There are two tests you can impose to determine whether to keep a variable in the final model you use for analysis.
The first is the likelihood ratio test, which uses the maximized log-likelihood, ln L. Fit the full model and the reduced model in which the relevant parameter(s) (those corresponding to the variable(s) you wish to test for removal) are set to 0, then compute the statistic LR = -2(ln L_reduced - ln L_full) from the outputs of the two fits. Under the null hypothesis this statistic is approximately chi-squared with degrees of freedom = the number of parameters set to 0.
The second test is the Wald test, which is more time-efficient. There are two drawbacks:
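Here is a minimal sketch of the likelihood ratio test, assuming NumPy is available. The data are synthetic (generated with no true interaction), the Newton-Raphson fit is a bare-bones stand-in for whatever package you normally use, and all names (`fit_logistic`, `chi2_sf`, `age_c`) are my own. It tests whether the product term LIKESCHOOL*AGE can be dropped from a model containing LIKESCHOOL and AGE.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic fit by Newton-Raphson; returns (beta, log_likelihood)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])      # information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    return beta, loglik

def chi2_sf(x, df):
    """P(chi-squared with integer df exceeds x), via the incomplete-gamma recurrence."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(z)), 0.5
    else:
        q, a = math.exp(-z), 1.0
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Synthetic data: DRUNKALC depends on LIKESCHOOL and (centred) AGE, no interaction.
rng = np.random.default_rng(1)
n = 2000
like_school = rng.integers(0, 2, n).astype(float)
age_c = rng.normal(0.0, 1.5, n)
true_logit = -0.5 - 0.8 * like_school + 0.4 * age_c
drunk_alc = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X_full = np.column_stack([np.ones(n), like_school, age_c, like_school * age_c])
X_reduced = X_full[:, :3]                  # same model with the product term set to 0

_, ll_full = fit_logistic(X_full, drunk_alc)
_, ll_reduced = fit_logistic(X_reduced, drunk_alc)

lr_stat = -2.0 * (ll_reduced - ll_full)    # one parameter zeroed, so df = 1
p_value = chi2_sf(lr_stat, df=1)
print(f"LR = {lr_stat:.3f}, p = {p_value:.3f}")
```

A large p-value would support dropping the product term; zeroing two parameters at once works the same way with df=2.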
- You need a large sample size
- Unlike the likelihood ratio test, the Wald test (at least in the one-parameter form given in the book) cannot be used to test a null hypothesis of the form "two of the parameters are 0".
However, for the second point above, I was wondering whether you could just use a "nested Wald test": if you wanted to know whether B_1 = 0 and B_2 = 0, couldn't you apply the Wald test to B_1 and, if the test allows us to say that B_1 = 0, fit a new logistic model with the variable X_1 corresponding to B_1 deleted, and then repeat the Wald test for B_2? It's just that he mentions this as a drawback in the book I have been referencing (Kleinbaum), and I guess it is BECAUSE you can't simply use the same regression model...
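For comparison, as far as I can tell there is also a multi-parameter version of the Wald statistic, W = b' V^{-1} b, where b holds the estimates being tested and V is the matching block of their estimated covariance matrix; it is referred to a chi-squared distribution with as many degrees of freedom as parameters tested. That is a different thing from the sequential "nested" idea above, and it is not taken from Kleinbaum. A minimal sketch on synthetic data, assuming NumPy; `x1` and `x2` are made-up noise variables, and the function names are my own.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; returns (beta, estimated covariance of beta)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)     # inverse information = covariance estimate

def wald_statistic(beta, cov, idx):
    """Wald chi-squared statistic for H0: all coefficients in idx are 0."""
    b = beta[idx]
    V = cov[np.ix_(idx, idx)]
    return float(b @ np.linalg.solve(V, b))

# Synthetic data: the outcome depends on LIKESCHOOL only; x1 and x2 are pure noise.
rng = np.random.default_rng(2)
n = 2000
like_school = rng.integers(0, 2, n).astype(float)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_logit = 0.3 - 1.0 * like_school
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([np.ones(n), like_school, x1, x2])
beta, cov = fit_logistic(X, y)

z = beta[1] / np.sqrt(cov[1, 1])              # usual one-parameter Wald Z
w_single = wald_statistic(beta, cov, [1])     # equals z squared, df = 1
w_joint = wald_statistic(beta, cov, [2, 3])   # H0: B_x1 = B_x2 = 0, df = 2

print(f"Z^2 = {z**2:.2f}, single Wald = {w_single:.2f} (5% critical value 3.84)")
print(f"joint Wald = {w_joint:.2f} (5% critical value 5.99)")
```

Both statistics come from a single fit, whereas the sequential approach refits after each deletion, so the two procedures need not agree.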
Ok so, the question:
Are these the tests you apply to determine what variables get removed from the initial model, eventually leaving you with the final model ready for analysis?
Once I have determined my final model, which say is of the form:
what do I do? The aim was to get rid of the confounders from the models, because we ultimately want to relate the disease variable D to our exposure variable(s) of interest E = SCHOOLDISLIKE, controlling for AGE, SEX, etc. But now I have a model with some confounders in it, so when I run the analysis (presumably to get some odds ratios), does it not matter that I've got all these confounders in the model?
Thanks again ;)
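For what it's worth, my understanding of the E,V,W model is that for a 0/1 exposure the odds ratio comparing E = 1 with E = 0 at the same values of the V's is exp(B_E + sum of delta_j * W_j): the V terms cancel, which is what "controlling for" them means, and they stay in the model so that B_E is an adjusted estimate. A tiny sketch with made-up coefficient values (none of these numbers come from real output):

```python
import math

def adjusted_odds_ratio(beta_e, interaction_coefs=None, w_values=None):
    """OR for E=1 vs E=0 at fixed V's: exp(beta_E + sum_j delta_j * W_j)."""
    interaction_coefs = interaction_coefs or {}
    w_values = w_values or {}
    log_or = beta_e + sum(
        delta * w_values[name] for name, delta in interaction_coefs.items()
    )
    return math.exp(log_or)

# No interaction terms left in the final model: a single adjusted odds ratio.
or_no_interaction = adjusted_odds_ratio(-0.8)

# With an E*AGE term kept, the odds ratio depends on the chosen AGE value.
or_at_age_2 = adjusted_odds_ratio(-0.8, {"AGE": 0.1}, {"AGE": 2.0})

print(round(or_no_interaction, 3), round(or_at_age_2, 3))
```

So if product terms survive the elimination steps, a single odds ratio is no longer enough and one has to report it at specified values of the W's.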