# Thread: Logistic Regression in Social Sciences

1. Logistic Regression in Social Sciences

*** NOTE TO ANYONE INTERESTED IN THE CONTENTS OF, OR RESPONDING TO, THIS THREAD: a new thread has been created on the subject, titled "Logistic Regression 2". ***

I have a number of variables regarding teenage drinking and attitudes towards school. These include a mixture of binary/dichotomous variables and variables of the form "Your teachers treat you fairly" (Strongly Agree, Agree, Disagree, Strongly Disagree). I want to analyse which variables drive responses to "Have you ever drunk alcohol?", and take it that the way forward is logistic regression. I am using SPSS.

Here is what I plan to do, if anyone has any advice please respond!

Run a logistic regression with "Have you ever drunk alcohol?" as my dependent variable, and with the independent variables being all of those variables I consider to have an effect on it.

Question 1: is there a concrete method for testing which variables have a "significant" effect on this dependent variable, so that I can justify including them as independent variables in my logistic regression model?

My uncle told me that the best thing to do is to look at a correlation matrix for all the independent variables I plan to include, and if two variables are strongly correlated then you should leave one of them out (which one?) because the collinearity would make the model unstable. Can someone clarify this, and is there a more concrete method for this process?

Question 2: So when I've got all my independent variables right I'll have my model, and all my beta_i's etc... but what kind of conclusion can I draw if, say, my final model has beta_1 = 0.8 (this being the parameter corresponding to age)?

Basically, I have a lot to learn.

Thanks.

L-dawg

2. Logistic is fine, but in SPSS make sure you treat any independent variables that have more than 2 categories as "Categorical", and then for each one specify the "cornerpoint", or low-risk category, as the baseline. The remaining continuous variables (e.g., age) and binary variables can be added to the model directly.

The easiest approach is to run forward stepwise logistic regression using, e.g., the Wald or likelihood-ratio method of forward stepping. This will only build a model with significant predictors. Stepwise is somewhat biased, however, since you are only picking the best "fish from the bucket." Arguably, backward stepping better preserves the correlation structure among subsets of predictors.
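The forward-stepping logic described above can be sketched as a loop (pure Python; the `entry_test` function, the variable names, and the p-values below are invented for illustration — SPSS's "Forward: Wald" option does the real work for you):

```python
# Skeleton of forward stepwise selection (illustration only).
# entry_test(selected, candidate) is assumed to return the p-value
# for the candidate's entry test given the current model.
def forward_stepwise(candidates, entry_test, alpha=0.05):
    """Greedily add the candidate with the smallest entry p-value."""
    selected = []
    remaining = list(candidates)
    while remaining:
        p_values = {v: entry_test(selected, v) for v in remaining}
        best = min(p_values, key=p_values.get)
        if p_values[best] >= alpha:
            break                     # nothing left that qualifies to enter
        selected.append(best)
        remaining.remove(best)
    return selected

# Tiny fake entry test: fixed p-values that ignore the current model.
fake_p = {"age": 0.001, "rules": 0.02, "teachers": 0.40}
print(forward_stepwise(fake_p, lambda model, v: fake_p[v]))
```

Here "age" enters first (p = 0.001), then "rules" (p = 0.02), and "teachers" (p = 0.40) never enters, which is exactly the "only significant predictors get in" behaviour described above.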

Most psychologists or psychometricians would probably use the hierarchical approach with SPSS, where all the family variables are added as a group, all the school-environment variables are added as a group, the peer-group variables as a group, etc. This gives a chi-square test statistic (with degrees of freedom equal to the number of parameters added with the group) that you can use to determine whether each group of variables is a significant predictor. Using this approach will also reflect that, by design, you have a theoretical idea about the constructs and domains that predict risk. In fact, in psychometrics, I probably would never simply throw all variables into a logistic regression equation and see what happens.

The last approach is to evaluate univariate models, and then add the single variables whose beta coefficients are significant (e.g., p<0.25) into a full model. This full model can contain risk factors, adjustments (age, family income, grade point average), and nuisance factors (variables you don't want to study but which are significantly different across drinkers and non-drinkers).

Finally, I would recommend identifying variables that are significantly different across drinkers/non-drinkers, which are not really of interest to you but nevertheless differ across the groups. In disease research, these are commonly the comorbidities that patients have, since patients have multitudes of problems at older ages (depression, electrolyte imbalance, hypertension, etc.). Once you identify these variables, run a logistic regression of the same dependent variable on only these variables. Before the run, specify in SPSS that you want to save the "logit". This logit is called the "propensity score." Do the run, and in the far rightmost column of the data set you will see "logit_1". Next, in your risk prediction models, use your primary risk and adjustment factors, plus the logit to represent all the junk variables (those that were different across drinkers/non-drinkers but not really of primary interest to you -- these are also called confounders). This latter model, with the propensity score representing the nuisance factors, may be better than a model including all the nuisance variables individually.
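Numerically, that saved "logit_1" column is just the linear predictor from the nuisance-only model. A rough pure-Python sketch (the variable names and coefficient values here are invented for illustration, not real SPSS output):

```python
import math

# Hypothetical coefficients from a logistic regression of the outcome
# on the nuisance variables only (invented numbers; in practice SPSS
# estimates these and saves the logit for each case).
intercept = -1.2
coef = {"bullied": 0.6, "family_income": -0.3}

def propensity_logit(row):
    """Linear predictor (the 'logit'), i.e. what SPSS saves as logit_1."""
    return intercept + sum(coef[k] * row[k] for k in coef)

row = {"bullied": 1, "family_income": 2}
logit = propensity_logit(row)        # -1.2 + 0.6 - 0.6 = -1.2
prob = 1 / (1 + math.exp(-logit))    # implied probability of the outcome
print(round(logit, 3), round(prob, 3))
```

That single logit value per case then stands in for all the nuisance variables at once in the main risk model.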

Last, for any regression model, I apply my own "DACOD" principle.

D-check the distributions of the independent variables and the model residual.

A-don't forget the important adjustment variables. Age is usually in every model, since disease or risk prevalence correlates with age.

C-coefficients, check the significance of coefficients and their sign. (+ or -).

O-outliers. Try to identify them and remove them, since they usually strongly bias almost every regression model.

D-Diagnostics: look at residuals, DFFITS, DFBETAs, leverage, and variance inflation factors (VIFs).

For your question about a beta value of 0.8 for age: it means that for each additional year of age, the odds of the outcome are multiplied by exp(0.8) ≈ 2.23.
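In concrete numbers (a quick Python check; the 0.8 is the beta from the question, everything else is just arithmetic):

```python
import math

beta_age = 0.8
odds_ratio = math.exp(beta_age)          # per one extra year of age
two_year_ratio = math.exp(2 * beta_age)  # betas add on the log-odds scale

print(round(odds_ratio, 4))     # 2.2255
print(round(two_year_ratio, 4))
```

So one extra year multiplies the odds by about 2.23, and two extra years multiply them by exp(1.6), i.e. 2.2255 squared.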

Get Hosmer & Lemeshow's "Applied Logistic Regression" textbook (New York: Wiley) if you want to read up on logistic regression. Also, UCLA has a good stats resource for almost every type of regression; it is geared towards Stata software, but the principles are the same.

3. Originally Posted by lep11
Logistic is fine, but in SPSS make sure you treat any independent variables that have more than 2 categories as "Categorical", and then for each one specify the "cornerpoint", or low-risk category, as the baseline. The remaining continuous variables (e.g., age) and binary variables can be added to the model directly.
By "low risk" do you mean, say, for a variable which is Drink frequency with values of Never through to Everyday, the baseline category would be "Never"?

Originally Posted by lep11
The easiest approach is to run forward stepwise logistic regression using, e.g., the Wald or likelihood-ratio method of forward stepping. This will only build a model with significant predictors. Stepwise is somewhat biased, however, since you are only picking the best "fish from the bucket." Arguably, backward stepping better preserves the correlation structure among subsets of predictors.
Do you mind elaborating on this? I would be most grateful. I have a strong pure maths background, but haven't done any statistics for 4 years (I left off my statistics career having completed the basics of the normal distribution!). I'm not asking you to go into the smallest detail with everything, but at the same time to put everything in layman's terms would be amazing!

Originally Posted by lep11
Most psychologists or psychometricians would probably use the hierarchical approach with SPSS, where all the family variables are added as a group, all the school environment variables added as a group, peer group, etc, added as a group.
Not really sure what you mean by this, could you explain?

This will give a chi-square test statistic (with degrees of freedom equal to the number of parameters added with the group) that you can use to determine whether each group of variables is a significant predictor. Using this approach will also reflect that, by design, you have a theoretical idea about constructs and domains that predict risk. In fact, in psychometrics, I probably would never simply throw all variables into a logistic regression equation and see what happens.
I feel here that you're really answering my question, but I'm a bit lost...

The last approach is to evaluate univariate models, and then add the single variables whose beta coefficients are significant (e.g., p<0.25) into a full model. This full model can contain risk factors, adjustments (age, family income, grade point average), and nuisance factors (variables you don't want to study but which are significantly different across drinkers and non-drinkers).
What's a univariate model?

Finally, I would recommend identifying variables that are significantly different across drinkers/non-drinkers, which are not really of interest to you but nevertheless differ across the groups. In disease research, these are commonly the comorbidities that patients have, since patients have multitudes of problems at older ages (depression, electrolyte imbalance, hypertension, etc.). Once you identify these variables, run a logistic regression of the same dependent variable on only these variables. Before the run, specify in SPSS that you want to save the "logit". This logit is called the "propensity score." Do the run, and in the far rightmost column of the data set you will see "logit_1". Next, in your risk prediction models, use your primary risk and adjustment factors, plus the logit to represent all the junk variables (those that were different across drinkers/non-drinkers but not really of primary interest to you -- these are also called confounders). This latter model, with the propensity score representing the nuisance factors, may be better than a model including all the nuisance variables individually.
So this is tackling the so-called "nuisance variables", which is great. Before I try to get my head around that, I want to understand the stages you explained up to this point.

All in all I can't thank you enough for your reply, and hope to hear from you (or anyone else who can add to this post) soon.

4. Originally Posted by lep11
For your question about a beta value of 0.8 for age: it means that for each additional year of age, the odds of the outcome are multiplied by exp(0.8).
So I've done some extra reading since I posted a reply this morning, and seem to have a better grasp of the basics of logistic regression.

Just to check: we have two groups, defined by an age difference of 1 (denoted $Age1$ and $Age2$). Suppose $X$ is a k-tuple of our independent variables, and $X_{Age1}$ and $X_{Age2}$ represent k-tuples that differ only in that the Age component changes from $Age1$ to $Age1 + 1 = Age2$. Then the odds for $Age1$ are given by:

$a=\frac{P(X_{Age1})}{1-P(X_{Age1})}$

and for $Age2$ by:

$b=\frac{P(X_{Age2})}{1-P(X_{Age2})}$.

The odds ratio (comparing $Age2$ to $Age1$) is $\frac{b}{a}=e^{0.8}$.

You are saying that if we increase $Age2$ again by one unit, and repeat the process to get odds $c$, then $\frac{c}{a}=\frac{b}{a}\cdot e^{0.8}=e^{1.6}$.

Very sloppy I know, but let me know what you think.
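A quick numeric sanity check of the above (pure Python; the value of the non-age terms is a made-up constant, since only the ratios matter):

```python
import math

beta_age = 0.8
other_terms = -3.0   # stands in for the intercept plus all non-age effects

def odds(age):
    """Odds of the outcome at a given age under the assumed model."""
    return math.exp(other_terms + beta_age * age)

a, b, c = odds(15), odds(16), odds(17)
print(round(b / a, 4))   # exp(0.8)
print(round(c / a, 4))   # exp(1.6)
```

The other terms cancel in each ratio, so one year of age multiplies the odds by exp(0.8) and two years by exp(1.6), whatever the rest of the model looks like.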

5. By "low risk" do you mean, say, for a variable which is Drink frequency with values of Never through to Everyday, the baseline category would be "Never"?
Yes

Do you mind elaborating on this? I would be most grateful. I have a strong pure maths background, but haven't done any statistics for 4 years (I left off my statistics career having completed the basics of the normal distribution!). I'm not asking you to go into the smallest detail with everything, but at the same time to put everything in layman's terms would be amazing!
Just try a run with variable selection using the options "Forward" and "Wald".

The rest is:

Hierarchical linear models commonly used by psychologists or psychometricians add groups of variables into a model simultaneously. Just scroogle "hierarchical logistic regression." The test to determine whether entry of a group of variables is significant is called a likelihood ratio test (LRT) -- that is, you will get a p-value for the LRT. The test statistic happens to follow a chi-square distribution, and chi-square tests are based on degrees of freedom, which here equal the number of parameters added to the model with the group.
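For example, the LRT arithmetic looks like this (the log-likelihood values are invented purely to show the calculation):

```python
# Likelihood ratio test for entering a group of variables.
# The log-likelihoods below are invented for illustration.
ll_without_group = -210.4   # model before the school-environment block
ll_with_group = -204.1      # model after adding the block (3 parameters)

lrt = 2 * (ll_with_group - ll_without_group)   # test statistic
critical_5pct = 7.815                          # chi-square 0.95 quantile, df = 3

print(round(lrt, 1), lrt > critical_5pct)      # 12.6 True
```

Since 12.6 exceeds the 5% critical value for 3 degrees of freedom, the group as a whole would count as a significant predictor here.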

I would start scroogling the terms you don't know about. And if you are serious about logistic regression, then look at Hosmer & Lemeshow's Applied Logistic Regression textbook.

Univariate models have only one variable in them. If the beta's p-value is less than 0.25, for example, then you can add the variable to a larger model with many variables, and then run the larger model.
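As a toy illustration of fitting one such univariate model (pure Python with invented data and a hand-rolled gradient-ascent fit; in practice SPSS does the estimation and reports the p-value for you):

```python
import math

# Invented data: y = 1 if the pupil has ever drunk alcohol.
y   = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
age = [13, 13, 14, 14, 15, 15, 16, 16, 17, 17]

def fit_univariate(x, y, steps=5000, lr=0.05):
    """Fit logit P(y=1) = b0 + b1*x by gradient ascent on the log-likelihood."""
    xc = [xi - sum(x) / len(x) for xi in x]   # centre x for numerical stability
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(xc, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # gradient w.r.t. intercept
            g1 += (yi - p) * xi   # gradient w.r.t. slope
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

b0, b1 = fit_univariate(age, y)
print(round(b1, 2))   # positive slope: odds of drinking rise with age
```

A screening pass would fit one such model per candidate variable and carry the ones with small enough p-values forward into the full model.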

6. Hi again,

So yesterday I finished reading the first few chapters of Kleinbaum - an excellent book on logistic regression. The only trouble is that this text seems to concentrate on everything apart from which variables should be considered in a model. I now know that a general model takes "the shape" of an $E,V,W$ model, i.e. it has exposure variable(s) E of interest, confounders and interaction terms. My task now is to decide which variables are the confounders and which are the interaction terms (?)... but first I obviously need to decide what hypothesis I am testing, so I can decide on my exposure variable(s)... right? And the task of deciding on confounders is what is addressed in the text you gave above (the stuff on nuisance variables and the "logit" approach in SPSS)?

Thanks,

Ldawg5962

7. Yes, exposure should be in your model, then adjustment covariates such as age and maybe gender, and then nuisance factors (confounders). I wouldn't get formal with these terms, since they really amount to study factors and nuisance factors. You should try to get rid of confounders, and show that the odds ratio of the risk factor (E) is not altered significantly by the confounder(s). Confounders are basically a nuisance, so you would rather not deal with them. However, confounders in your study would be variables significantly different between drinkers and non-drinkers that are not your exposure variable.

9. Hey lep11, I don't know if you've checked out my Logistic Regression 2 post, but I am still having difficulties with logistic regression. I am going to run you through everything as best I can, and I would really appreciate it if you could explain to me in layman's terms what I have to do.

I've decided which variables I am going to use: I have an exposure variable E=DISLIKE OF SCHOOL and the disease variable D=DRUNKALC. I have other covariates: AGE, SEX, TEACHERS (teachers expect too much of you), RULES (rules are too strict in school), BULLIED (you have been bullied by another student).

If I run this initial model and find that, say, TEACHERS is not significant (by Wald's test), I can remove it and get a reduced model that better fits my data (correct, lep11?).

So... you said:

Originally Posted by lep11
You should try to get rid of confounders and show that odds ratio of the risk factor (E) is not altered significantly by the confounder(s).
and since presumably I haven't got rid of all the confounders (for example, RULES is a confounder that I haven't been able to get rid of, yet we may assume from the literature that it is a variable that will affect whether a student has drunk alcohol, and since it is not an exposure variable of interest we must control for it - CAN YOU CLARIFY THAT THIS PHILOSOPHY IS CORRECT?), I must do as you suggested and "show that the odds ratio of the risk factor (E) is not altered significantly by the confounder(s)." What do you mean by this? Can you give me an example in terms of the (assumed) confounder RULES and my exposure variable E? You seem to know what you are talking about, but I am not a statistician, so what you say keeps passing over me.
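From my reading so far, I think the check amounts to comparing E's odds ratio with and without the candidate confounder in the model. A sketch with invented betas (these are not real estimates, just numbers to show the comparison):

```python
import math

# Hypothetical betas for the exposure E (DISLIKE OF SCHOOL) from two
# models: one without RULES, one adjusting for RULES. Numbers invented.
beta_E_crude = 0.95      # model without RULES
beta_E_adjusted = 0.60   # model with RULES included

or_crude = math.exp(beta_E_crude)
or_adjusted = math.exp(beta_E_adjusted)
pct_change = abs(or_crude - or_adjusted) / or_adjusted * 100

# A common rule of thumb: if adjusting shifts the OR by more than ~10%,
# the variable is acting as a confounder and should stay in the model.
print(round(or_crude, 2), round(or_adjusted, 2), round(pct_change, 1))
```

In this made-up case the OR for E shifts by well over 10% once RULES is included, so RULES would be acting as a confounder; is that the kind of comparison you meant?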

Thanks very much,

L-dawg