# Thread: Regression Analysis - Transformations and Non-Normality

1. ## Regression Analysis - Transformations and Non-Normality

Hello,

I am fairly new to regression analysis and have a few questions.

I have a fairly large dataset containing information about 144 municipalities in a country. I am trying to create a model that uses a number of independent variables (urban ratio, literacy rate, etc.) to model the migration in each municipality. My independent variables are a mix of binary (0,1) and continuous. Could anyone point me in the right direction on where to start with analysing my variables? I have run a simple linear regression and it produced no significant results. It has been suggested to me that I could try to transform some of the variables (as they are non-normal) or use a model that does not assume normality.

What sort of transformations are there and why might they be useful? What are the other models available that do not assume normality?

I have also been told to look at GLMs.

Sorry for the long message, but I desperately need help!

2. ## Re: Regression Analysis - Transformations and Non-Normality

Hey larrikinlover.

There are what we call GLMs (generalized linear models), which use link functions. You can also use additive models, which transform your input variables (this sounds like something you should check out).

The thing, though, is that you need to describe what you are modelling: what are the variable types of your response and predictor variables? The answer makes a big difference to which analytic techniques you choose, as well as to whether the assumptions of those techniques are valid.

With normal multiple linear regression, you can transform variables using various transformations (as is commonly done), but ultimately you need to say what you will be doing with the results and how you will use the model.

3. ## Re: Regression Analysis - Transformations and Non-Normality

Attach a sample of the complete variables, their definitions, etc., and the required stat info.

4. ## Re: Regression Analysis - Transformations and Non-Normality

Thank you for your help, chiro.

I have data from 2000 and 2010. I am trying to find out whether an increase in an industrial activity has had a more than proportionate effect on population growth (or migration). I am going to do this by comparing municipalities that contain employment in an industrial activity to those without. I have been asked to construct a simple model first with population change between 2000 and 2010 (percentage increase) as the dependent variable and the presence of the industrial activity (as a percentage of total employment), the distance from the capital (in km) and the presence of a federal road (binary - 0,1) as independent variables. I am still unsure how this will help in deciding whether there is a more than proportionate population increase in industrial municipalities but perhaps that is something I will get to later.

I suppose my problem is deciding which analytical techniques to use for the data I have. Could you possibly explain a little more about additive models? This is something that has been brought up, but I am having a hard time understanding what they are. Also, is there a systematic way of finding out which model or transformations are best, or am I just trying to maximise the R² value?

Thank you again, this is a big help!

5. ## Re: Regression Analysis - Transformations and Non-Normality

Also, I have uploaded a sample of my data (PopChange is the DV).

6. ## Re: Regression Analysis - Transformations and Non-Normality

Since you are using count data, you probably want to use a GLM with your Y being Poisson distributed with the appropriate link function (we will discuss this later).

As for having categorical variables, you simply set up the right dummy variables in your model to take care of these. How you do this depends on how many categories there are and how independent or dependent they are (if they are dependent you will have interaction terms, and you can check whether these are significant or not).
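As a rough illustration of the dummy-variable idea, here is a minimal Python sketch. The rows are made up, and the `region` variable and the helper names `dummy_code` and `design_row` are hypothetical (only the federal-road flag and the distance from the capital come from the thread). It shows a binary dummy, an interaction column, and k-1 indicator columns for a categorical variable with k levels.

```python
# Hypothetical rows: (distance_km, has_federal_road, region). Values are made up.
rows = [
    (12.0, 1, "north"),
    (85.5, 0, "south"),
    (40.0, 1, "east"),
]

# Hypothetical category levels; the first one is treated as the reference level.
REGION_LEVELS = ["north", "south", "east"]


def dummy_code(value, levels):
    """Indicator columns for every level except the first (the reference level)."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def design_row(distance_km, has_road, region):
    """Build one row: intercept, distance, road dummy, distance x road, region dummies."""
    interaction = distance_km * has_road  # non-zero only where a road is present
    return [1.0, distance_km, float(has_road), interaction] + dummy_code(region, REGION_LEVELS)


design = [design_row(*row) for row in rows]
for row in design:
    print(row)
```

Whether the interaction column belongs in the model is exactly the significance check mentioned above: fit with and without it and compare.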

Additive models are basically Y = f1(X1) + f2(X2) + ... + fn(Xn) so they try to find general functions to fit the best model given the data.
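To give a feel for how such functions can be estimated, here is a minimal backfitting sketch in Python. It assumes two predictors, uses a crude k-nearest-neighbour running mean as the smoother, and runs on synthetic data; all names (`running_mean_smooth`, `backfit`) are hypothetical, and real software uses better smoothers (splines, for example).

```python
import math


def running_mean_smooth(x, r, k=5):
    """Smooth r against x by averaging r over the k nearest x-neighbours of each point."""
    fitted = []
    for xi in x:
        nearest = sorted(range(len(x)), key=lambda j: abs(x[j] - xi))[:k]
        fitted.append(sum(r[j] for j in nearest) / k)
    return fitted


def centre(values):
    """Subtract the mean so each component function averages to zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def backfit(x1, x2, y, n_iter=20, k=5):
    """Estimate y ~ alpha + f1(x1) + f2(x2) by alternately smoothing partial residuals."""
    n = len(y)
    alpha = sum(y) / n
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(n_iter):
        # Update f1 using what is left after removing alpha and the current f2.
        partial = [y[i] - alpha - f2[i] for i in range(n)]
        f1 = centre(running_mean_smooth(x1, partial, k))
        # Update f2 using what is left after removing alpha and the new f1.
        partial = [y[i] - alpha - f1[i] for i in range(n)]
        f2 = centre(running_mean_smooth(x2, partial, k))
    return alpha, f1, f2


# Synthetic data: y depends non-linearly on both predictors.
n = 40
x1 = [i / 10 for i in range(n)]
x2 = [((i * 7) % n) / 10 for i in range(n)]  # a shuffled copy of the same grid
y = [a ** 2 + math.sin(2 * b) for a, b in zip(x1, x2)]

alpha, f1, f2 = backfit(x1, x2, y)
fitted = [alpha + f1[i] + f2[i] for i in range(n)]
rss = sum((y[i] - fitted[i]) ** 2 for i in range(n))
tss = sum((v - alpha) ** 2 for v in y)
print("share of variation explained:", 1 - rss / tss)
```

The point of the sketch is only that each f is found from the data rather than chosen in advance, which is also why these models are harder to interpret than a fixed transformation.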

Personally, I think you should consider transforming your data before looking at additive models, since they are more complex. If you are new to this, you want to go with the simplest thing that you are comfortable with. I have to stress that it's better to use a simpler technique that you understand than a complex one that you have no idea about, because stuff you don't understand tends to blow up in people's faces.

I've just noticed your pop-change data is in decimal form (so not integers). Can you explain what this variable refers to? Is it some kind of relative change?

I would take a look at either the Poisson or the exponential distribution for rate distributions and see what kind of link functions are used for these as well as what these link functions are used for in the context of your experiment (and please share your thoughts with us). The link function will relate directly to the model you are using (i.e. the predictor variables).

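The main difference between the Poisson and the exponential is discrete versus continuous: the Poisson describes counts (0, 1, 2, ...), while the exponential describes continuous positive values.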
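Here is a small Python sketch of what a log link does and of the discrete/continuous difference. The coefficients `b0` and `b1` and the distance value are made-up placeholders, not estimates from your data, and the function names are only for illustration.

```python
import math


def log_link(mu):
    """Map a positive mean onto the unrestricted linear-predictor scale."""
    return math.log(mu)


def inverse_log_link(eta):
    """Map a linear predictor back to a mean; the result is always positive."""
    return math.exp(eta)


def poisson_pmf(k, mu):
    """Probability of observing the count k (0, 1, 2, ...) when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def exponential_pdf(y, mu):
    """Density at a continuous value y >= 0 for an exponential with mean mu."""
    return math.exp(-y / mu) / mu


# Made-up coefficients and predictor value, purely for illustration.
b0, b1 = 0.5, -0.01
distance_km = 40.0
eta = b0 + b1 * distance_km           # linear predictor, can be any real number
expected_mean = inverse_log_link(eta)  # back on the scale of the response

print("linear predictor:", eta)
print("expected mean:", expected_mean)
print("P(count = 2):", poisson_pmf(2, expected_mean))
print("density at y = 2.0:", exponential_pdf(2.0, expected_mean))
```

With a log link the coefficients act multiplicatively on the mean: a one-unit increase in a predictor multiplies the expected response by exp(coefficient).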

7. ## Re: Regression Analysis - Transformations and Non-Normality

I would like to use a Poisson distribution with a log link function. However, when I run the model in R it comes up with a warning because some of the DV data is negative. Could I get round this by transforming the DV (log, sqrt, etc.)?

The PopChange data is the percentage increase in population between 2000 and 2010. This could always be changed to an integer value if this is easier.
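To make the problem concrete, here is a minimal Python sketch with made-up percentages and hypothetical helper names. It shows that a log cannot be taken of a negative percentage change directly, and that the same information can be re-expressed as a growth ratio (2010 population divided by 2000 population), which stays positive as long as the change is above -100%.

```python
import math

# Made-up percentage changes between 2000 and 2010, including a decline.
pct_changes = [12.5, -3.2, 0.0, 48.0]


def growth_ratio(pct):
    """Convert a percentage change into the ratio pop_2010 / pop_2000."""
    return 1.0 + pct / 100.0


def log_growth(pct):
    """Log of the growth ratio; defined whenever the change is above -100%."""
    ratio = growth_ratio(pct)
    if ratio <= 0:
        raise ValueError("a change of -100% or less has no log growth ratio")
    return math.log(ratio)


# Taking the log of the raw negative percentage fails outright.
try:
    math.log(-3.2)
except ValueError as err:
    print("log of a negative value fails:", err)

log_changes = [log_growth(p) for p in pct_changes]
for pct, lg in zip(pct_changes, log_changes):
    print(pct, "->", lg)
```

On this scale a decline is simply a negative number and no change is zero, so the variable can take either sign, which is not what a Poisson response looks like.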

I do not know how to determine which transformations are appropriate for the IVs... what do I need to look at?

Thanks again, and apologies, I'm really new to all this.

8. ## Re: Regression Analysis - Transformations and Non-Normality

Industry employment values are positive only, and I suspect you have not included layoffs! I made use of this fact and took industry employment as the dependent variable; you can then find the inverse function and its parameters.

Case numbers are indicated on outliers but they are not excluded in this model.

A quick check indicates that industry employment, modelled in terms of the other variables, follows a pattern consistent with a Tweedie distribution and a log link.

Each parameter seems to be significant in this model:

It is interesting to note that industry employment is higher in localities with no federal roads.
This is a quick and crude analysis and model, intended just as a guide, and it should be properly checked in detail.