# Thread: Classifying degree of covariation/correlation

1. As a non-mathematician, any suggestions would be greatly appreciated.

I have a real-world problem: I am trying to work out the best approach for determining the correlations between multiple variables.

An analogy to the problem might be:
- the average January temperature is calculated for say 50 different cities, forming the first row of a matrix
- there will be correlations, to different extents, between the temperatures of these cities: some might be in a similar geographical location (strong positive correlation), or in the same hemisphere (weak positive correlation), or in a different hemisphere (weak negative correlation), and so on. No assumption is made, however, about the extent or nature of these correlations
- The temperatures for February to December are recorded in successive rows of the matrix
- The result is a 12x50 matrix, with months down the rows and city temperatures across the columns

Here is what I want to do: without taking into account physical location at all, classify the cities into several 'bins', according to how strongly their temperatures are correlated with the rest of the entire dataset.

For example, if 40 cities are located in India and Pakistan, 9 are drawn from the USA and Canada, and one is on an island in the Pacific Ocean, one would expect the 40 South Asian cities to show strong average correlation with the full set of 50 cities (bin 1), the 9 North American cities to show mild average correlation (bin 2), and the island city to show very weak correlation with the overall dataset (bin 3).
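A rough sketch of this binning idea in NumPy, on synthetic stand-in data (the seasonal patterns, noise level, group sizes, and bin thresholds below are all hypothetical choices, not part of the original problem): compute every pairwise correlation, average each city's correlations with the other 49 cities, and cut the averages at two thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(12)

# Hypothetical stand-in data, 12 months (rows) x 50 cities (columns):
# cities 0-39 share one seasonal cycle, cities 40-48 follow it only
# partially, and city 49 has an unrelated pattern (the "island").
cycle = np.sin(2 * np.pi * months / 12)
other = np.cos(2 * np.pi * months / 12)  # orthogonal to `cycle`
temps = np.empty((12, 50))
temps[:, :40] = cycle[:, None] + 0.1 * rng.standard_normal((12, 40))
temps[:, 40:49] = (0.5 * cycle + other)[:, None] + 0.1 * rng.standard_normal((12, 9))
temps[:, 49] = np.sin(2 * np.pi * 2 * months / 12) + 0.1 * rng.standard_normal(12)

# 50 x 50 matrix of pairwise city correlations (columns are the variables).
corr = np.corrcoef(temps, rowvar=False)

# Each city's average correlation with the other 49 cities
# (subtracting its own self-correlation of 1.0 on the diagonal).
n = corr.shape[0]
avg_corr = (corr.sum(axis=0) - 1.0) / (n - 1)

# Bin by threshold: 2 = strong, 1 = mild, 0 = weak.
# The cutoffs 0.25 and 0.7 are arbitrary illustrative choices.
bins = np.digitize(avg_corr, [0.25, 0.7])
```

In practice the thresholds would have to be chosen by inspecting the data rather than fixed in advance, which is one reason the clustering methods discussed below are usually preferred.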

Principal components analysis identifies the eigenvectors and eigenvalues that describe the sources of variation in the sample. But for the kind of classification I want to do, I don't think it is quite the right technique.

If anyone can steer me in the right direction, I would be most appreciative.
vj

2. I believe that what you are asking for is clustering. Conventionally, though, the items (cities, in your case) are stored in the rows, and the features ("variables") are stored in the columns.

Clustering methods attempt to group items into categories ("clusters"), so that items within a cluster are similar, and those in differing clusters are dissimilar. In any given situation, though, the following are subject to interpretation:

- How is similarity determined?
- How many clusters should there be?
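The first question matters more than it might seem for temperature data. A small synthetic illustration (the numbers are made up) of one common choice, correlation distance (1 - Pearson r), versus plain Euclidean distance: two cities with the same seasonal shape but different baseline and amplitude look far apart in Euclidean terms, yet have essentially zero correlation distance.

```python
import numpy as np

months = np.arange(12)

# Two hypothetical cities with identical seasonal shape:
# city b runs at half the amplitude of a, shifted 10 degrees warmer.
a = 15.0 * np.sin(2 * np.pi * months / 12) + 5.0
b = 0.5 * a + 10.0

# Euclidean distance sees them as far apart...
euclid = np.linalg.norm(a - b)

# ...while correlation distance sees the identical shape.
r = np.corrcoef(a, b)[0, 1]
corr_dist = 1.0 - r
```

Which behavior is "right" depends on whether absolute temperature levels or only the shape of the annual cycle should drive the grouping.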

Probably the most common algorithm for performing clustering is k-means clustering, which uses a geometric distance measure to determine similarity, but the analyst needs to indicate the number of clusters, k.
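To make the mechanics concrete, here is a minimal NumPy sketch of the standard k-means loop (Lloyd's algorithm) applied to a cities-in-rows matrix. The two seasonal regimes in the toy data and the simple deterministic initialization are hypothetical stand-ins; real implementations choose starting centroids more carefully (e.g. k-means++ with random restarts).

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Bare-bones Lloyd's algorithm: X has one item per row."""
    # Simple deterministic init: k evenly spaced rows of X
    # (real libraries use k-means++ or random restarts instead).
    centroids = X[:: len(X) // k][:k]
    for _ in range(n_iter):
        # Assign each item to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned items;
        # leave a centroid in place if its cluster went empty.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy data: 50 cities x 12 months, two opposite seasonal regimes
# (a stand-in for northern vs. southern hemisphere cities).
rng = np.random.default_rng(1)
pattern = np.sin(2 * np.pi * np.arange(12) / 12)
north = pattern + 0.1 * rng.standard_normal((25, 12))
south = -pattern + 0.1 * rng.standard_normal((25, 12))
X = np.vstack([north, south])

labels, centroids = kmeans(X, k=2)
```

With well-separated regimes like these, the loop converges in a couple of iterations and recovers the two groups; on real temperature data, the choice of k and of the distance measure would dominate the outcome.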

Unlike classification, there isn't an objective "right answer" in clustering. Some clusterings might make more sense than others, but it all depends on which features are used, which clustering algorithm is chosen, and how its parameters are set.

-Will Dwinnell
Data Mining in MATLAB