I am currently analysing a data set on household energy usage and trying to build a statistical model for predicting the daily energy consumption of any house. So far I've determined that house type, household occupancy and socio-economic status are the key predictor variables. (Note that socio-economic status in the UK is often measured using the ACORN system, which is a categorical ranking system with A basically being the highest ranking (rich people) and wealth decreasing as you go through the alphabet.)
So basically I have one quantitative predictor variable (occupancy, i.e. the number of people in the house) and two qualitative (categorical) variables: house type (detached, semi-detached, flat or terraced) and ACORN group.
At some point I want to perform a multiple regression analysis on the data and derive an equation which predicts y (household energy consumption) from the three variables.
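To make the shape of that equation concrete, here is a minimal sketch of how the three predictors could be laid out as columns of a regression design matrix using dummy (indicator) coding. It is only an illustration under assumptions: the level lists, the `design_row` helper and the example households are all made up, the ACORN list is a small illustrative subset, and the model shown has main effects only (no interaction terms).

```python
# Hypothetical category levels; the real data set would supply its own.
HOUSE_TYPES = ["detached", "semi-detached", "flat", "terraced"]
ACORN_GROUPS = ["A", "B", "C"]  # illustrative subset only


def design_row(occupancy, house_type, acorn):
    """Return one row of a dummy-coded design matrix.

    Layout: [intercept, occupancy, house-type dummies, ACORN dummies].
    The first level of each categorical variable is the reference level
    and gets no column of its own.
    """
    house_dummies = [1.0 if house_type == level else 0.0 for level in HOUSE_TYPES[1:]]
    acorn_dummies = [1.0 if acorn == level else 0.0 for level in ACORN_GROUPS[1:]]
    return [1.0, float(occupancy)] + house_dummies + acorn_dummies


# Made-up households: (occupancy, house type, ACORN group)
households = [
    (2, "detached", "A"),
    (4, "semi-detached", "B"),
    (1, "flat", "C"),
]

X = [design_row(*h) for h in households]
for row in X:
    print(row)
```

With this coding, each row has one column for the intercept, one for occupancy, and one per non-reference level of each categorical variable, so the predicted y is a weighted sum of those columns.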
Right - so here's my question: for the qualitative variables, is it necessary to have data for ALL the possible combinations of household characteristics in order to perform a regression analysis? For example, I have households with all possible occupancies 1-6, however I DO NOT HAVE data for a detached 1 person household in ACORN group B (or in many other ACORN groups either). Does this matter? Do I need a sample so large that it covers every combination that is possible with the categorical variables? (Many hundreds of possible combinations in this instance...)
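To show what "missing combinations" means in practice, here is a small sketch that tabulates which occupancy / house type / ACORN cells have no observations. The data and the ACORN subset are invented purely for illustration, and the variable names are hypothetical; the real data set would have far more rows and groups.

```python
from collections import Counter
from itertools import product

# Hypothetical levels; the ACORN list is an illustrative subset only.
OCCUPANCIES = range(1, 7)
HOUSE_TYPES = ["detached", "semi-detached", "flat", "terraced"]
ACORN_GROUPS = ["A", "B", "C"]

# Made-up observations: (occupancy, house type, ACORN group)
observed = [
    (2, "detached", "A"),
    (4, "semi-detached", "B"),
    (1, "flat", "C"),
    (2, "detached", "A"),
]

# Count how many households fall in each combination (cell).
counts = Counter(observed)

# Enumerate every possible cell and keep the ones with no data.
all_cells = list(product(OCCUPANCIES, HOUSE_TYPES, ACORN_GROUPS))
empty_cells = [cell for cell in all_cells if counts[cell] == 0]

print(f"{len(all_cells)} possible cells, {len(empty_cells)} with no data")
print((1, "detached", "B") in empty_cells)
```

Running the same tabulation on the real data would show exactly which combinations, such as the detached 1 person household in group B, are unobserved.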