# Thread: Detecting significant change in relationship between variables over time

1. ## Detecting significant change in relationship between variables over time

Hi,

I am trying to monitor the relationship between two variables over time and detect when a statistically significant change in the relationship between them occurs. This process should occur online, so the analysis takes place whilst the variables are being received.

I have only a basic level of statistical knowledge; however, from what I have read I have come up with the following, and I would appreciate it if somebody could let me know whether it makes sense and is legitimate, or whether I am completely off track. If the latter, a pointer in the right direction would be great.

The two variables, A and B, are sensed data, each recorded at every time step. My approach is to keep the most recent 30 values of each variable. Once the initial 30 have been gathered, each new value replaces the oldest one, so the window slides and the number of readings remains at 30 for every time step.

To measure the relationship between the two variables I periodically compute the Pearson correlation coefficient, which quantifies the strength of their linear association. After every Pearson calculation I record the result, C, in a time series and, as above, use a sliding-window technique.

By analysing this time series I can detect significant changes in the relationship when they occur, using a change-detection algorithm such as CUSUM or the Shiryaev sequential probability ratio test. If the variables are always strongly correlated, that is not what I am looking for; I am looking for when the variables have been strongly correlated and then stop being correlated, or vice versa.
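To make the idea concrete, here is a minimal Python sketch of the sliding-window correlation plus a two-sided CUSUM detector on the correlation series; the window size, drift, and threshold values are illustrative and would need tuning for real data:

```python
from collections import deque
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

class CorrelationMonitor:
    """Sliding-window correlation with a two-sided CUSUM change detector."""

    def __init__(self, window=30, drift=0.05, threshold=0.5):
        self.a = deque(maxlen=window)
        self.b = deque(maxlen=window)
        self.drift = drift          # slack absorbed before CUSUM accumulates
        self.threshold = threshold  # alarm level on the CUSUM statistic
        self.mean_c = None          # reference correlation level
        self.pos = self.neg = 0.0   # upward / downward CUSUM sums

    def update(self, a, b):
        """Feed one (A, B) reading; return True when a change is flagged."""
        self.a.append(a)
        self.b.append(b)
        if len(self.a) < self.a.maxlen:
            return False  # still filling the initial window
        c = pearson(self.a, self.b)
        if self.mean_c is None:
            self.mean_c = c  # first full window sets the reference
            return False
        # Two-sided CUSUM on the correlation time series C
        self.pos = max(0.0, self.pos + (c - self.mean_c) - self.drift)
        self.neg = max(0.0, self.neg + (self.mean_c - c) - self.drift)
        if self.pos > self.threshold or self.neg > self.threshold:
            self.pos = self.neg = 0.0
            self.mean_c = c  # restart from the new regime
            return True
        return False
```

Feeding in strongly correlated readings followed by anti-correlated ones should produce an alarm shortly after the switch.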

Is the process described above valid from a statistical point of view?

All feedback is greatly appreciated.

2. ## Re: Detecting significant change in relationship between variables over time

Hey tuathal.

Correlation statistics only capture relationships in a linear sense: Pearson's coefficient can badly misrepresent a non-linear relationship, and the way non-linear relationships are assessed in general is through regression modeling.

The benefit of regression modeling is that you can specify any model to fit the data to (what you are doing is essentially "projecting" the data into some high-dimensional space), and you can look at the fit statistics to evaluate how strong the fit really is.

In the general case, you basically look at the coefficients of the fitted model (for example Y = a + bX + cX^2 + dX^3 will have coefficients a, b, c, d, which are themselves random variables with distributions), and based on that you can look at how the relationship changes through the rates of change expressed within these statistics (and their variance measures).
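As an illustration (assuming Python with numpy, and synthetic data generated from a known cubic), fitting such a polynomial and reading off the coefficients together with their standard errors might look like:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 200)
# Synthetic data from a known cubic: Y = 1 + 2X + 0.5X^2 - 0.3X^3 + noise
y = 1 + 2 * x + 0.5 * x**2 - 0.3 * x**3 + rng.normal(0, 0.2, x.size)

# Fit Y = a + bX + cX^2 + dX^3. polyfit returns the coefficients
# (highest degree first) plus their covariance matrix, so each
# coefficient comes with a distribution (a standard error) attached.
coeffs, cov = np.polyfit(x, y, deg=3, cov=True)
std_errs = np.sqrt(np.diag(cov))
d, c, b, a = coeffs  # unpack: highest degree first
```

With enough data the estimates land close to the true generating values, and `std_errs` quantifies how much each coefficient could wander.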

Note that regression modeling can be applied to time-series models and to models with correlation structures between observations. Common statistical packages like SAS and R allow you to specify these constraints if you need them. You can also use Markov chain Monte Carlo (MCMC) if you have complicated distributions with all sorts of conditional relationships between the variables themselves. MCMC rests on theorems guaranteeing that the Markov chain converges to its stationary distribution (here, the posterior), which basically allows you to specify all kinds of complicated relationships with a probabilistic guarantee that, given enough iterations, you will sample from the right distribution. If you need to model complicated conditional distributions, then I would suggest looking at MCMC and something like WinBUGS (which does MCMC) or its front-end packages in R. WinBUGS is free, by the way.

In terms of the relationship, after you fit a regression model, look at the coefficients in terms of rates of change. Remember that if you are plotting a mean response function (which is typically what you do in regression: you fit a population mean model), then consider rates of change in terms of derivatives. Although you can't really do this properly for random variables (you can, but it's a lot more complicated because they are random variables with different measures from a measure-theoretic point of view), you can use simple calculus to get the rates of change if you focus only on the population mean model (which is deterministic), as opposed to differentiating the random variables themselves.

What I mean by the above is that you focus on dP/dx where P(x) is the mean response at a given x instead of focusing on dY/dX where Y and X are random variables. The
stochastic calculus is very complicated, but you do not need it if you are looking for mean response models.
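For a fitted polynomial mean response, dP/dx is just the polynomial's derivative, so ordinary calculus suffices. A small sketch (the coefficients here are hypothetical, matching the cubic form discussed above):

```python
import numpy as np

# Hypothetical fitted mean response P(x) = 1 + 2x + 0.5x^2 - 0.3x^3,
# stored highest degree first (numpy polynomial convention).
p = np.array([-0.3, 0.5, 2.0, 1.0])

dp = np.polyder(p)               # dP/dx = -0.9x^2 + x + 2
rate_at_1 = np.polyval(dp, 1.0)  # rate of change of the mean response at x = 1
```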

3. ## Re: Detecting significant change in relationship between variables over time

Hi Chiro,

In terms of the regression model and looking at the coefficients over time, will that work when the exact nature of the relationship between the variables is not known ahead of time? In this sense I mean that the analysis of the variables will be automatic and undertaken by a programmed agent. Additionally, would your suggestion hold with the "sliding window" approach I mentioned above, or would all historic values need to be maintained?

Thanks
Tuathal

4. ## Re: Detecting significant change in relationship between variables over time

Although I do know something about 'simple statistics' I am certainly no statistician, so please bear with me if I ask an obvious question. In doing modeling like this, don't you really need to assume (measure and deduce) some of the underlying properties of your distributions before you can choose your time window [sample length]? As a simple example, if your processes were periodic and your sliding time window were 'significantly less' than the period, incorrect conclusions might be drawn. In one sense, I think this has to do with the stationarity of the data, as was mentioned above.
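That pitfall is easy to demonstrate: for two periodic signals, a window much shorter than the period reports strong, swinging local correlation even though the full-series correlation is essentially zero. A sketch in Python (the signal parameters are illustrative):

```python
import numpy as np

t = np.arange(0, 1000)
period = 200
a = np.sin(2 * np.pi * t / period)
b = np.sin(2 * np.pi * t / period + np.pi / 2)  # 90 degrees out of phase

# Over several full periods the two series are uncorrelated...
full = np.corrcoef(a, b)[0, 1]  # essentially zero

# ...but a 30-sample window sees only a fragment of the cycle and
# reports strong local correlation that swings between +1 and -1.
w = 30
local = [np.corrcoef(a[i:i + w], b[i:i + w])[0, 1]
         for i in range(0, len(t) - w, 10)]
```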

5. ## Re: Detecting significant change in relationship between variables over time

The first thing you should do is describe the kinds of classes of regression models you wish to fit the data to.

If you have correlations between random variables or aspects found in time series, then you should include this as a constraint.

One must remember that the coefficients, if they are based on means (which they should be), will be approximately normally distributed given enough data (see the Central Limit Theorem for more information). The only thing you have to decide is the distribution of the variable you are modelling (like the Y in Y = a + bX + cX^2 + dX^3, as a non-linear example). The Y variable has a range of values, and if it doesn't cover the whole real line (remember you are estimating an expectation in a population response model), then you need to turn to Generalized Linear Models.
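As a sketch of what fitting such a GLM involves, here is an illustrative logistic-regression example (Y restricted to {0, 1}) fitted by iteratively reweighted least squares in plain numpy; this is a teaching sketch, not a substitute for a proper statistics package:

```python
import numpy as np

def fit_logistic_irls(X, y, iters=25):
    """Fit a logistic-regression GLM by iteratively reweighted least squares.
    X: (n, p) design matrix (include a column of ones for the intercept).
    y: (n,) array of 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))             # mean response in (0, 1)
        w = mu * (1.0 - mu)                          # GLM working weights
        z = eta + (y - mu) / np.clip(w, 1e-9, None)  # working response
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return beta
```

On simulated data with known coefficients, the fit recovers them to within sampling error.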

Unless you can describe a class of models for your relationship, there is not much we can do to help you. You know the models and their details far better than any of us.

6. ## Re: Detecting significant change in relationship between variables over time

Apologies for the lack of detail, I realise now that some additional context might help.

I am a computer scientist and I am working on this approach in the area of agent-based systems. An agent is a computational entity, such as a program or device, that perceives and acts upon the environment in which it is located and is autonomous in its behaviour.

It was my desire to devise a somewhat generalised approach that would allow the agents to detect changes in the relationship between variables they generate (inner variables) and variables they sense from their environment (outer variables). As the approach is general, I don't know in advance what the relationship between the variables will be in all cases as it should be applicable in a number of different domains.

In terms of the distribution of the variables, I have read about the effect of averaging, as you suggest, chiro, and that this can move a distribution towards normality. So we can assume that the two variables have distributions at least approaching normal.

It's important to note that the relationship may change over time, as other agents will act on the environment, something will naturally occur in the environment, or the agent may change its own behaviour. It is not a static relationship. So from there, what I want to do is track the relationship between the two variables over time. I don't know whether the relationships between the variables will be linear or non-linear, so I guess the first step will be to figure that out in each case?

However, assuming the relationship is quantifiable, I want to detect when it becomes statistically different from what the agent has experienced up to now. Chiro suggested using calculus for rates of change in historical values of the coefficients in a regression model; however, can this tell me when a statistically significant change has occurred? As a simple example, I had thought that if I track the regression coefficients (or, initially, the correlation coefficient) as a time series, I would obtain a mean and standard deviation for the observed coefficients. Then, when the relationship changed, the mean and/or standard deviation would change. However, I do not know of a technique to do this while observations continue to arrive.
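One simple online option for that last point, tracking the running mean and standard deviation of a coefficient series while observations keep arriving, is Welford's algorithm combined with a z-score rule. A minimal sketch (the z threshold and warm-up length are illustrative choices):

```python
class OnlineStats:
    """Welford's algorithm: running mean and variance in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def push(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

def is_change(stats, c, z=3.0, warmup=10):
    """Flag coefficient c as a change if it lies more than z standard
    deviations from the running mean; otherwise absorb it into the stats."""
    if stats.n >= warmup and stats.std > 0 and abs(c - stats.mean) > z * stats.std:
        return True
    stats.push(c)
    return False
```

A stream of coefficients hovering around one level passes quietly; a value from a clearly different regime is flagged immediately.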

Thanks for your patience

7. ## Re: Detecting significant change in relationship between variables over time

In statistics (and all of applied mathematics in general), you have to specify some constraints and these include a model definition.

The most general kind of system for learning is a neural network, and this is not the same as, say, a statistical regression or some sort of dynamical system that you would find in applied mathematics.

In terms of detecting changes, I would suggest you construct a conditional distribution and use it to form hypotheses that you can test. A conditional distribution can be used to generate a test statistic that is compared against some critical value, and you either accept or reject based on your decision boundary. I would recommend you work on defining exactly what hypothesis you want to test and then take it from there: construct a distribution so that you can compute the test statistic from your sample, and use other statistical techniques (like the CLT and the likelihood ratio test) to make that decision. Any graduate book on statistical inference will give you all the theory and techniques to do this.
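As a concrete instance of such a test statistic for the original correlation problem, Fisher's z-transform gives an approximately standard-normal statistic for comparing the correlation in two samples (for example, two windows of readings); the window sizes and 1.96 cutoff below are the usual illustrative choices:

```python
import math

def corr_change_test(r1, n1, r2, n2):
    """Test for a change in Pearson correlation between two independent
    samples using Fisher's z-transform. Returns an approximately
    standard-normal statistic; |stat| > 1.96 rejects 'no change' at 5%."""
    z1 = math.atanh(r1)  # variance-stabilising transform of r
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se
```

For example, a drop from r = 0.9 to r = 0.1 across two 30-sample windows is highly significant, while identical correlations give a statistic of zero.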

With regard to time series, you still need a model that you fit your data to. Most time-series models specify a correlation structure between observations in some pattern and also specify distributions for the residuals. You use this model to derive distributions for the estimators, and from those you get the mean and variance of the estimator of the random variables of interest.

I just want to point out something about statistics: in classical statistics, parameters are considered constant, not random variables (in Bayesian statistics they can be random variables). If you are estimating parameters, then you need to be aware of this fact. Parameters can reflect rates of change and all sorts of complex information, but they are considered constant nonetheless. By focusing on getting the right parameters, you can get the right information to help you identify what you need to. Means can represent static means, rates of change via various derivatives, association (correlation), and many other things. You can only get the right parameters by understanding what it is you are trying to accomplish. Fitting general models basically hides what is important within a tonne of variables, while specific models show what is important, trading off accuracy and representation if the model is too "simple".

Your question is too general to answer in the form you have put it. If you want a truly general approach, take a look at neural networks. If you want a more specific approach, I would start off by considering a regression model. You can also have systems of regression models if you like, but you need something to start with.

To give you some final advice, I would recommend modeling the regression coefficients in terms of rates of change, like you do with, say, a differential equation. This way you are measuring rates of change rather than static means. I would also recommend you look into time-series and longitudinal analysis for the theory and techniques. With these models you can keep incorporating all new information to get updated estimates. I would also like to point out that you can actually make conditional estimates (i.e. using conditional distributions) of the mean and variance given a specific data point - something you asked about in the last paragraph of your response. All the theory for that is in an applied regression textbook, along with the formulas and techniques.
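For the "keep incorporating new information" part, recursive least squares is the standard online formulation: it updates the coefficient estimates one observation at a time, so no history needs to be stored. A minimal sketch (the forgetting factor and initialisation constant are illustrative):

```python
import numpy as np

class RecursiveLeastSquares:
    """Online least squares: one-observation-at-a-time coefficient updates."""

    def __init__(self, p, lam=1.0, delta=1000.0):
        self.beta = np.zeros(p)      # current coefficient estimates
        self.P = np.eye(p) * delta   # inverse-information surrogate (large = vague prior)
        self.lam = lam               # forgetting factor (1.0 = remember everything)

    def update(self, x, y):
        """x: feature vector of length p, y: scalar response."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)                 # gain vector
        self.beta = self.beta + k * (y - x @ self.beta)
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.beta
```

Streaming observations from a simple linear model, the estimates converge to the true coefficients without ever storing past data; a forgetting factor below 1 would down-weight old observations, which suits the sliding-window spirit of the original question.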

In summary - neural networks as a general approach and regression as a more specific approach. If you have any other questions I'll do my best to address them.