Multiple regression analysis and qualitative variables

Hello all,

I am currently analysing a data set on household energy usage and trying to build a statistical model for predicting the daily energy consumption of any house. So far I've determined that house type, household occupancy and socio-economic status are the key predictor variables. (Note that socio-economic status in the UK is often measured using the ACORN system, a categorical ranking in which A is roughly the highest ranking (the wealthiest households), with wealth decreasing as you go through the alphabet.)

So basically I have one quantitative predictor variable (occupancy: the number of people in the house) and two qualitative (categorical) variables: house type (detached, semi-detached, flat or terraced) and ACORN group.

At some point I want to perform a multiple regression analysis on the data and derive an equation which predicts y (household energy consumption) from the three variables.

Right, so here's my question: for the *qualitative* variables, is it necessary to have data for ALL the possible household types in order to perform a regression analysis? For example, I have households with all possible occupancies (1-6), but I DO NOT HAVE data for a detached 1-person household in ACORN group B (or in many other ACORN groups either). Does this matter? Do I need a sample so large that it covers every possible combination of the categorical variables? (There are many hundreds of possible combinations in this instance...)

Many Thanks

-Rob

Re: Multiple regression analysis and qualitative variables

I assume your regression model is of the form:

Where X1 and X2 are categorical variables.
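
To spell out what an additive model with categorical predictors looks like once each category is dummy coded, here is a sketch. The symbols are my own assumptions for illustration: $x_{occ}$ is occupancy, and $D^{house}_j$ and $D^{ACORN}_k$ are 0/1 indicators for house type $j$ and ACORN group $k$, with the first level of each factor taken as the baseline.

$$
E(Y) = a + b\,x_{occ} + \sum_{j=2}^{J} c_j\, D^{house}_j + \sum_{k=2}^{K} d_k\, D^{ACORN}_k
$$

Each category contributes its own shift ($c_j$ or $d_k$) to $E(Y)$, and there is no term that depends on a house type and an ACORN group jointly.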

My thinking:

You don't need data for all combinations, because the (basic) OLS model assumes that the *effect of each variable on E(Y) is independent of the values of the other variables (unless you use combined indicators, i.e. interaction terms, but you haven't said anything to suggest you intend to do that).*

Of course, the assumption in italics is a strong one, and you will have less chance of detecting that it is wrong if you don't collect data from all combinations of categories.
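
If it helps, the point can be checked with a small simulation. This is only a sketch under stated assumptions: the effect sizes are made up, the data are noise-free, there are only three ACORN groups, and the names (`design_row`, `true_mean`, and so on) are hypothetical. It fits an additive dummy-coded model with NumPy's least-squares solver on data where detached houses in ACORN group B never appear, then predicts for that unseen combination.

```python
import itertools
import numpy as np

house_types = ["detached", "semi-detached", "flat", "terraced"]
acorn_groups = ["A", "B", "C"]  # assumption: only three groups, to keep it small

# Made-up "true" additive effects (kWh/day), purely for illustration.
true_intercept = 5.0
true_per_person = 2.0
true_house = {"detached": 0.0, "semi-detached": -1.0, "flat": -3.0, "terraced": -2.0}
true_acorn = {"A": 0.0, "B": -0.5, "C": -1.5}


def design_row(house, acorn, occupancy):
    """Intercept, occupancy, then 0/1 dummies (first level of each factor is the baseline)."""
    house_dummies = [1.0 if house == h else 0.0 for h in house_types[1:]]
    acorn_dummies = [1.0 if acorn == g else 0.0 for g in acorn_groups[1:]]
    return [1.0, float(occupancy)] + house_dummies + acorn_dummies


def true_mean(house, acorn, occupancy):
    """Expected consumption under the made-up additive model."""
    return (true_intercept + true_per_person * occupancy
            + true_house[house] + true_acorn[acorn])


# Observe every combination EXCEPT detached houses in ACORN group B.
observed = [
    (h, g, occ)
    for h, g, occ in itertools.product(house_types, acorn_groups, range(1, 7))
    if not (h == "detached" and g == "B")
]

X = np.array([design_row(*obs) for obs in observed])
y = np.array([true_mean(*obs) for obs in observed])  # noise-free for clarity

# Ordinary least squares; rank tells us whether all coefficients are identifiable.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Predict for a combination that never appeared in the data.
unseen = ("detached", "B", 1)
prediction = float(np.array(design_row(*unseen)) @ coef)

print("design matrix rank:", rank, "of", X.shape[1], "columns")
print("prediction for unseen cell:", round(prediction, 3))
print("true mean for unseen cell: ", true_mean(*unseen))
```

The design matrix keeps full column rank despite the empty cells, so every coefficient is still estimable and the unseen combination gets a prediction. That prediction is only as good as the additivity assumption: if detached houses in group B really behaved differently, this data set would give no way to notice.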

Does that make sense?

Re: Multiple regression analysis and qualitative variables

Hello,

Yes, that does make sense, thanks, and you're correct that the model will be of the form *y = a + b1x1 + b2x2 + ... + bnxn*. The predictor variables are independent of each other, so from what you posted it appears that I don't need to have all possible combinations of the categorical variables covered (thankfully!).

Thanks for the reply. :)

-Rob