# Binary Logistic Regression but everything is significant

#### mkmath

I have a question related to Binary Logistic Regression in SPSS.

I have a dataset containing longitudinal data related to deaths. The lived/died indicator (Event) is coded 0 (lived) / 1 (died).

The dataset contains 10,000,000 observations of 200,000 people. Every person experiences the Event at some time in the dataset. Therefore, there are 200,000 Event observations and 9,800,000 No Event Observations.

The data set contains fields laid out similar to the following:

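| Date | ID | Event | X1 | X2 | ... X19 |
|---|---|---|---|---|---|
| 1/1/1900 | Alice | 0 | 1.2 | 34.5 | ... |
| 2/1/1900 | Alice | 0 | 1.4 | 23.4 | ... |
| 3/1/1900 | Alice | 1 | 1.6 | 12.5 | ... |
| 7/1/1942 | Bob | 1 | 2.3 | 98.3 | ... |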

There are 19 possible regressors (X1, X2, X3, ..., X19); 2 of them are categorical.

I decided to use Binary Logistic Regression to determine a model for likelihood of death, like so:

However, no matter what I do every possible regressor is being returned as significant, whether I try 1 regressor or all 19 of them. What fundamental assumption/error am I making?

#### chiro

MHF Helper
Hey mkmath.

When you have lots of information you tend to get an effect where the variance gets really small really quickly.
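To make that concrete, here is a minimal Python sketch (not SPSS output, and not based on the real data). It assumes a single binary regressor and two made-up event rates, 2.0% versus 2.1% per observation, and computes the Wald z-statistic for the log odds ratio from expected cell counts. The function name `log_odds_ratio_z` and all numbers are hypothetical. The effect size stays fixed while the standard error shrinks roughly like 1/sqrt(n), so the same tiny difference goes from non-significant to overwhelmingly significant as the sample grows.

```python
import math

def log_odds_ratio_z(n_per_group, p_unexposed, p_exposed):
    """Wald z-statistic for the log odds ratio of a 2x2 table of expected counts."""
    a = n_per_group * p_exposed          # exposed, event
    b = n_per_group * (1 - p_exposed)    # exposed, no event
    c = n_per_group * p_unexposed        # unexposed, event
    d = n_per_group * (1 - p_unexposed)  # unexposed, no event
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log odds ratio
    return log_or, se, log_or / se

# Hypothetical event rates (assumed for illustration): 2.0% vs 2.1%
results = {}
for n in (5_000, 500_000, 5_000_000):
    log_or, se, z = log_odds_ratio_z(n, 0.020, 0.021)
    results[n] = (log_or, se, z)
    print(f"n per group = {n:>9,}: log OR = {log_or:.4f}, SE = {se:.4f}, z = {z:.2f}")
```

With millions of rows, a |z| above 1.96 says very little about whether an effect is large enough to matter, so it is worth looking at the size of the coefficients (odds ratios) as well as the p-values.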

If you are measuring survival information then I'm wondering why you aren't using survival analysis (which is an entirely distinct branch of bio-statistics and has its own complications from a mathematical point of view).

Also - there are going to be lots of conditional relationships and asking to estimate a probability without factoring them in (especially with the information you have) is going to be difficult.

Basically there are so many points of variation for your problem that it is difficult to give feedback.

If you can add some specifics then please do so.

#### mkmath

Thank you so much, chiro. This is what I was looking for:

> When you have lots of information you tend to get an effect where the variance gets really small really quickly.

I have seen examples that use counting-process-style data for survival analysis, but the SPSS version I have does not handle data in that format. I therefore turned to basic logistic regression as a fallback, to see whether I could get anything useful without having to resort to different software.

Let me give you some more information on the dataset:

- The data was collected monthly between June 2006 and March 2016.
- At each month end, the individual's Alive/Dead state was recorded to determine whether the individual experienced a "Death" event during the previous period.
- No observations are made of an individual after a "Death" event.
- Individuals enter the dataset in different calendar months and are observed for different lengths of time (e.g. one individual has observations from Jun 2006 to Dec 2008, and another from Jun 2006 to Jul 2006).
- Most regressors depend on calendar time (e.g. GDP, national unemployment rate).
- Some regressors are time independent (e.g. the individual's geographic location, race).

A more realistic (but synthesized) dataset representation is shown below:
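| Individual | Observation Number | Observation Date | GDP | Unemployment Rate | Race | Event |
|---|---|---|---|---|---|---|
| Alice | 1 | Sep-06 | 0.5 | 8 | | |
| Alice | 2 | Oct-06 | 0.48 | 7.8 | | |
| Alice | 3 | Nov-06 | 0.5 | 9 | | 1 |
| Bob | 1 | Jun-06 | 0.3 | 6 | 1 | |
| Bob | 2 | Jul-06 | 0.5 | 7 | 1 | |
| Bob | 3 | Aug-06 | 0.3 | 7 | 1 | |
| Bob | 4 | Sep-06 | 0.5 | 8 | 1 | |
| Bob | 5 | Oct-06 | 0.48 | 7.8 | 1 | 1 |
| Chris | 1 | Jan-08 | 0.7 | 9 | | 1 |
| Dave | 1 | Oct-06 | 0.48 | 7.8 | 1 | |
| Dave | 2 | Nov-06 | 0.5 | 9 | 1 | 1 |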


#### chiro

There is a lot of categorical data - have you used any categorical data analysis?

#### mkmath

I have not, aside from the binary logistic regression. Do you have an opinion on what type of analysis I should perform if I want to answer the question: "Given a mix of categorical and time-dependent continuous variables, what is the probability that a given person will experience the event in the next time period?"

The more I read about applying survival analysis and mixed model approaches to my data, the more difficult it appears...

#### chiro

Basically it depends on the response variable (dependent variable) at the focus of a regression.

If that variable can take any value on the real line, you use ordinary linear regression techniques.

If it can't, you have to use what are called Generalized Linear Models (GLMs). Because you are measuring events (a 0/1 outcome), you will have to use a GLM.
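For a 0/1 response, the usual GLM is a binomial model with a logit link, which is what binary logistic regression fits. As a sketch in generic notation (the symbols $p_{it}$, $\beta_j$ and $x_{ijt}$ are introduced here for illustration; $i$ indexes the person and $t$ the month, and a categorical regressor would in practice enter as several dummy variables rather than a single term):

```latex
\[
\operatorname{logit}(p_{it})
  = \log\frac{p_{it}}{1 - p_{it}}
  = \beta_0 + \sum_{j=1}^{19} \beta_j \, x_{ijt},
\qquad
p_{it} = \Pr\left(\text{Event}_{it} = 1 \mid x_{i1t}, \dots, x_{i19t}\right).
\]
```

Written this way, the model treats every person-month row as an independent observation, which is exactly the assumption that longitudinal data tends to violate.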

If you run the regression for a given model, you will get the contributions of variance from each parameter in the model (the coefficients you are estimating) along with what is left over in the residual.

The larger the residual term is (in terms of its variance), the more information you need to describe the dependent variable of interest.

Also, regarding GLMs: if you are using longitudinal data (which it looks like you are), then there are techniques specific to that. Longitudinal data carries correlation and covariance information between repeated observations of the same individual that non-longitudinal data doesn't. You should tell us if you are familiar with these techniques. I would recommend you look into this and ask questions as necessary on the topic of longitudinal (also known as panel) data.

If you can post this information (after performing a regression) we can give further advice.

#### mkmath

Wow! Thank you for such a detailed reply, chiro.

I will investigate implementing GLMs on longitudinal data. I have implemented GLMs before but not on longitudinal data.

I managed to get access to SAS, which allows me to put my data into counting process format for fitting extended proportional hazards models (I say extended because they allow time-varying variables).
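For anyone following along, here is a rough Python sketch (not SAS, and not my real data) of what counting process format looks like. The rows are made up to resemble the synthesized table earlier in the thread, the helper name `to_counting_process` is hypothetical, blank Event cells are assumed to mean 0, and time is measured in months since the individual entered the dataset, which is only one possible choice of time scale. Each monthly observation becomes an interval (start, stop] with the event flag attached to the interval in which it occurred.

```python
# Hypothetical person-month rows: (individual, observation_number, event)
rows = [
    ("Alice", 1, 0), ("Alice", 2, 0), ("Alice", 3, 1),
    ("Bob", 1, 0), ("Bob", 2, 0), ("Bob", 3, 0), ("Bob", 4, 0), ("Bob", 5, 1),
    ("Chris", 1, 1),
]

def to_counting_process(rows):
    """Convert person-period rows into (id, start, stop, event) intervals.

    Observation k of a person covers the interval (k - 1, k], measured in
    months since that person entered the dataset (an assumed time scale).
    """
    intervals = []
    for person, obs_number, event in sorted(rows):
        intervals.append((person, obs_number - 1, obs_number, event))
    return intervals

intervals = to_counting_process(rows)
for person, start, stop, event in intervals:
    print(f"{person:<6} ({start}, {stop}]  event={event}")
```

Time-varying regressors such as GDP would be carried along as extra columns on each interval, holding the value that applied during that month.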

I will attempt to use GLM and survival analysis models and interpret/compare the results. I'm sure I will need advice on interpreting the results, so I'll post when I get stuck.

I would love to share my progress as I go as well so others can learn from my mistakes, but I wonder if it is appropriate to do so on this forum. Otherwise I may add it to a blog...

#### chiro

There should be no problem doing that - just make sure that anything confidential stays that way.
