Couple Regression Questions
I'm using a regression to look at the effect that tax progressivity (how large the difference is between the top and bottom marginal rates) has on GDP growth.
My thought was to use the US GDP to control for exogenous variables and have the regression be
State GDP = alpha + beta1*progressivity + beta2*usGDP
1. How do I determine whether to use the level of GDP/GSP, log(GDP/GSP), or the actual rate of change? (GSP is gross state product, basically the GDP of a state.)
2. The data I have is annual and exists for all 50 states + DC. What sort of issues do I need to be looking for when running regressions with this sort of data?
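For the first question, the three candidate forms of the dependent variable can be sketched as below. This is only an illustration: the `gsp` numbers are made up, and the function names are hypothetical rather than anything from EViews or the actual data set. The one general fact it relies on is that a difference of logs is approximately the percentage growth rate when the changes are small.

```python
import math

# Hypothetical annual GSP figures for one state (invented numbers, not real data).
gsp = [100.0, 104.0, 109.2, 112.5]

def log_levels(series):
    """Natural log of each level."""
    return [math.log(v) for v in series]

def pct_growth(series):
    """Year-over-year rate of change, (this year - last year) / last year."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

def log_growth(series):
    """Difference of logs, which approximates pct_growth for small changes."""
    logs = log_levels(series)
    return [b - a for a, b in zip(logs, logs[1:])]

levels = gsp                    # option 1: raw level
logs = log_levels(gsp)          # option 2: log level
growth = pct_growth(gsp)        # option 3: rate of change
approx_growth = log_growth(gsp)
```

Note that either growth measure loses the first year of each state's series, so it has to be computed state by state and never across the boundary between two states.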
I ran some regressions trying all three of those and wasn't finding the model to be very good. I decided to add in dummy variables for the 50 states and DC. Now I'm seeing a very high R^2, and both flatness and GDP are statistically significant.
This image is the output from EViews; flatness is what the progressivity variable was named. How does that look?
It's pretty big, so I'm just linking to it instead of embedding it.
http://i.imgur.com/SO3Ko.png
Re: Couple Regression Questions
One important assumption that you are overlooking is homogeneity. The 50+1 states just aren't that similar in almost any respect. Adding the category variable for state should have helped.
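A minimal sketch of what the state category variable does, assuming a single regressor and invented numbers (the `rows` data and all function names here are hypothetical). Including one dummy per state gives the same slope as subtracting each state's own mean from x and y before regressing. The sketch also computes how much of the variation in y the dummies absorb on their own, which may be one reason R^2 jumps when the dependent variable is a level and the states differ greatly in size.

```python
from collections import defaultdict

# Hypothetical (state, x, y) observations; two states with very different levels of y.
rows = [
    ("A", 1.0, 10.0), ("A", 2.0, 11.0), ("A", 3.0, 12.5),
    ("B", 1.0, 50.0), ("B", 2.0, 50.5), ("B", 3.0, 52.0),
]

def state_means(rows):
    """Mean of x and y within each state."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for state, x, y in rows:
        t = totals[state]
        t[0] += x
        t[1] += y
        t[2] += 1
    return {s: (t[0] / t[2], t[1] / t[2]) for s, t in totals.items()}

def demean_by_state(rows):
    """Subtract each state's own mean from its x and y values."""
    means = state_means(rows)
    return [(x - means[s][0], y - means[s][1]) for s, x, y in rows]

def within_slope(rows):
    """Slope on x after state demeaning (same as the slope with state dummies)."""
    pairs = demean_by_state(rows)
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx

def dummy_only_r2(rows):
    """Share of the variation in y explained by the state dummies alone."""
    ys = [y for _, _, y in rows]
    grand_mean = sum(ys) / len(ys)
    total_ss = sum((y - grand_mean) ** 2 for y in ys)
    within_ss = sum(y * y for _, y in demean_by_state(rows))
    return 1.0 - within_ss / total_ss

slope = within_slope(rows)
r2_from_dummies = dummy_only_r2(rows)
```

In this toy data the dummies alone account for nearly all of the variation in y, so a high overall R^2 would say little about the regressor of interest.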
National GDP is a conglomerate of the 50 states (plus a few spares). I would suggest a lack of homogeneity here, too.
I dare you to look me straight in the eye and tell me that the sorts of relationships you are trying to model are ONLY linear. If you're not modelling at least a first difference, or including at least a quadratic factor, you are barking up the wrong tree.
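One rough way to check the linearity point is to fit a straight line and look at the residuals. The sketch below uses made-up data that is exactly quadratic, and the helper name `fit_line` is hypothetical. It shows the pattern to look for, not a result about the tax data.

```python
def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented data where y is exactly x squared.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [v * v for v in x]

intercept, slope = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
# The residuals are positive at both ends and negative in the middle.
```

A U-shaped run of residuals like this suggests the straight line is missing curvature, which is the kind of evidence that would justify trying a quadratic term.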
Oftentimes, a model can be improved by conditioning (rescaling) some variables. Generally, if all your inputs are roughly the same order of magnitude, the outcome is improved. Think of it this way: "Given a rational estimate for the mass of the Earth, how would you revise the estimate after including one more grain of sand?" The answer to this question suggests that variables should be significant - and one should not dominate all others. Sometimes, I see guidelines like this: "If your coefficient is less than 0.0001, you should consider discarding the variable associated with that coefficient." Generally, that is just wrong. If your input data are nowhere near the same order of magnitude, such a coefficient might be quite reasonable.
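To illustrate why a tiny coefficient is not by itself a reason to drop a variable, here is a small sketch with invented numbers (the data and the name `ols_slope` are hypothetical). The same regressor is expressed once in billions and once in single units. The fitted relationship is identical, but the coefficient shrinks by a factor of a billion.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Invented data: the same regressor in two different units.
x_billions = [1.0, 2.0, 3.0, 4.0]
x_units = [v * 1e9 for v in x_billions]
y = [2.1, 3.9, 6.2, 7.8]

slope_billions = ols_slope(x_billions, y)  # about 1.94
slope_units = ols_slope(x_units, y)        # about 1.94e-9, same relationship
```

The size of a coefficient only has meaning relative to the units of its variable, so a cutoff like 0.0001 cannot be applied across variables measured on different scales.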
Lastly, have you any formal background in this area? It is not an area that should be dabbled in unless you have some reasonable expectation to understand the results. One can be quite misled by sexy looking results, if one fails to consider what might or might not be valid.
My Views. I welcome others'.