I have a large amount of data that I wish to analyse. The data is structured as follows:
- 1 dependent variable
- 9 independent variables
- The independent variables are categorical and do not all have the same number of levels. That is, one variable has 4 values, one has 10, one has 8, etc.
- The dependent variable is numeric (quantitative)
- Approx 15,000 data points, representing various, but not all, combinations of the 9 independent variables.
I want to work out the relative impact of each of the independent variables. Common questions will be: “What is the effect of having value X compared to value Y?”, “Which variable has the largest effect on the dependent variable?” etc.
I was advised that stepwise regression would be the best approach; however, I believe that to use it, all the categories must be the same size. Is this true?
I am at a total loss as to how to tackle this. Any suggestions appreciated!
What you are looking to do is pretty common in statistics and is referred to as correspondence analysis.
My first stab at analysing data is always to perform PCA (Principal Component Analysis) on it. However, PCA is only applicable if your data can be assumed to be continuous. I don't know much about stepwise regression, but I found a paper on an extension of PCA to categorical data here: http://www.dm-lab.ws/www/contents05/covEigGiniRep16.pdf .
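To make the PCA step concrete, here is a minimal sketch in Python with NumPy. It is only an illustration for continuous data (so it does not apply directly to the categorical variables in the question), the function name `pca` is my own, and the data are random numbers generated purely for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """Plain PCA via SVD of the column-centred data matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2 / (X.shape[0] - 1)         # variance along each component
    ratio = variance / variance.sum()            # share of total variance
    scores = Xc @ Vt[:n_components].T            # data projected onto the components
    return scores, Vt[:n_components], ratio[:n_components]

# Made-up continuous data: column 1 is almost a multiple of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

scores, components, ratio = pca(X, n_components=2)
print(ratio)  # share of variance captured by the first two components
```

Because two of the made-up columns are nearly collinear, the first component should pick up most of the variance, which is the kind of structure PCA is meant to reveal.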
Furthermore, if you are looking for a stats package that can perform correspondence analysis, R does it and is freely available: http://www.r-project.org/ .
I found a discussion on correspondence analysis in R here: http://www.jstatsoft.org/v20/a03/paper .
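If it helps to see what simple correspondence analysis computes, here is a rough sketch in Python with NumPy rather than R. It follows the usual textbook recipe (SVD of the standardized residuals of a two-way contingency table). The function name and the counts are made up for illustration, and this is not the code from the linked paper. Note that the two variables have different numbers of levels (4 and 3), which is not a problem for the method.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA of a two-way contingency table (hypothetical helper).

    Assumes every row and every column has at least one observation.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]     # row principal coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # column principal coordinates
    inertia = s ** 2                     # principal inertias, one per axis
    return row_coords, col_coords, inertia

# Made-up counts: a 4-level variable (rows) crossed with a 3-level variable (columns).
table = np.array([
    [20,  5,  5],
    [ 5, 20,  5],
    [ 5,  5, 20],
    [10, 10, 10],
])

row_coords, col_coords, inertia = correspondence_analysis(table)
print(inertia / inertia.sum())  # share of total inertia per axis
```

The total inertia equals the chi-squared statistic of the table divided by the number of observations, so levels that sit far from the origin are the ones that depart most from independence. With 9 variables you would need the multiple-variable version of the method, which this sketch does not cover.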