I am not a mathematician, so any suggestions would be greatly appreciated.
I have a real-world problem: I am trying to work out the best approach for determining correlations between multiple variables.
An analogy to the problem might be:
- the average January temperature is calculated for say 50 different cities, forming the first row of a matrix
- there will be correlations of varying strength between the temperatures of these cities: some might be in a similar geographical location (strong positive correlation), the same hemisphere (weak positive correlation), a different hemisphere (weak negative correlation), etc. However, no assumption is made about the extent or nature of these correlations
- the temperatures for February to December are recorded in successive rows of the matrix
- the result is a 12x50 matrix, with months in rows and cities in columns
Here is what I want to do: without taking physical location into account at all, classify the cities into several 'bins' according to how strongly their temperatures are correlated with the rest of the dataset.
For example, if 40 cities are located in India and Pakistan, 9 are drawn from the USA and Canada, and one is on an island in the Pacific Ocean, one would expect the 40 South Asian cities to show strong average correlation with the full set of 50 cities (bin 1), the 9 North American cities to show mild average correlation with it (bin 2), and the island city to show very weak correlation with the overall dataset (bin 3).
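To make the goal concrete, here is a rough Python sketch of the naive version of what I have in mind: compute each city's average Pearson correlation with every other city, then bin the cities by that score. Everything specific in it is an assumption made up for illustration: the synthetic data (40 + 9 + 1 cities built from simple seasonal cycles plus a little noise, not real temperatures), the function names, and the bin thresholds of 0.6 and 0.2. It is only meant to show the kind of output I am after, not a recommended method.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_correlation_with_others(series):
    """For each city, average its correlation with every other city."""
    n = len(series)
    scores = []
    for i in range(n):
        others = [pearson(series[i], series[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others))
    return scores

def assign_bins(scores, thresholds=(0.6, 0.2)):
    """Bin 1 = strongest average correlation. Thresholds are arbitrary
    and must be listed in descending order."""
    return [1 + sum(1 for t in thresholds if s < t) for s in scores]

def make_example_data(seed=0):
    """Made-up data: one list of 12 monthly values per city."""
    rng = random.Random(seed)
    months = range(12)
    annual = [math.cos(2 * math.pi * m / 12) for m in months]      # 1 cycle/year
    semiannual = [math.cos(4 * math.pi * m / 12) for m in months]  # 2 cycles/year
    third = [math.sin(6 * math.pi * m / 12) for m in months]       # 3 cycles/year

    def noisy(values):
        return [v + rng.gauss(0, 0.5) for v in values]

    # 40 cities sharing one strong annual cycle
    group_a = [noisy([25 - 10 * a for a in annual]) for _ in range(40)]
    # 9 cities that only partly share that cycle
    group_b = [noisy([10 - 3 * a + 9 * s for a, s in zip(annual, semiannual)])
               for _ in range(9)]
    # 1 city whose pattern is unrelated to the others
    island = [noisy([27 + 5 * t for t in third])]
    return group_a + group_b + island

cities = make_example_data()
scores = mean_correlation_with_others(cities)
bins = assign_bins(scores)
for label in (1, 2, 3):
    print("bin", label, "contains", bins.count(label), "cities")
```

The weakness I can already see is that the score depends on the make-up of the sample: the 40 similar cities dominate every average, and the thresholds have to be chosen by hand.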
Principal components analysis identifies the eigenvectors and eigenvalues that describe the source of the variation in the sample. But when it comes to classification, as I want to do, I don't think that it's quite the right technique.
If anyone can steer me in the right direction, I would be most appreciative.