Hi there,

Let me give you a little background on the problem. I am working on a quasi-experimental design and I need to match each treatment with a control that 1) is geographically close and 2) exhibits similar total energy consumption throughout the year. My first approach was to define a maximum radius around each treatment and then, among the controls inside that radius, choose the one with the closest winter and summer usage, measured by the Euclidean distance between the treatment and control points.
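As a rough illustration of that first approach, here is a minimal Python sketch. The function name, the toy coordinates and usage values, and the 0.5 radius are all made up for the example, and it assumes planar coordinates in the same units as the radius and matching with replacement.

```python
import numpy as np

def match_within_radius(treat_xy, ctrl_xy, treat_use, ctrl_use, radius):
    """For each treatment, return the index of the control inside `radius`
    with the smallest Euclidean distance in usage space (-1 if none)."""
    treat_xy, ctrl_xy = np.asarray(treat_xy, float), np.asarray(ctrl_xy, float)
    treat_use, ctrl_use = np.asarray(treat_use, float), np.asarray(ctrl_use, float)
    # Pairwise geographic and usage distances, each of shape (n_treat, n_ctrl)
    geo = np.linalg.norm(treat_xy[:, None, :] - ctrl_xy[None, :, :], axis=2)
    use = np.linalg.norm(treat_use[:, None, :] - ctrl_use[None, :, :], axis=2)
    # Controls outside the radius are not eligible
    use[geo > radius] = np.inf
    best = use.argmin(axis=1)
    best[np.isinf(use.min(axis=1))] = -1  # no control within the radius
    return best

# Hypothetical data: (x, y) locations and (winter, summer) usage
treat_xy = [[0.0, 0.0], [10.0, 10.0]]
ctrl_xy = [[0.1, 0.0], [0.2, 0.1], [5.0, 5.0]]
treat_use = [[100.0, 200.0], [300.0, 400.0]]
ctrl_use = [[150.0, 260.0], [105.0, 195.0], [300.0, 400.0]]

matches = match_within_radius(treat_xy, ctrl_xy, treat_use, ctrl_use, radius=0.5)
print(matches.tolist())  # -> [1, -1]
```

The second treatment gets -1 because no control falls inside its radius, even though one control has identical usage.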

I realized this doesn't account for the correlation structure between the variables, so I computed the PCA scores and then measured the Euclidean distance between the points in score space (which, as I understand it, matches the Mahalanobis distance when every component is kept and each score is scaled to unit variance). This got me thinking: maybe I can expand the radius to half a mile or so and then add the coordinates to the distance calculation. When I did this I got reasonable results, but the loadings on PC1 and PC2 were really high for the WE-coordinate, which seems strange to me. I am just not sure it makes sense to do this from a practical standpoint.
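To make the PCA/Mahalanobis relationship concrete, here is a small numerical check on synthetic data (the means and covariance are invented for illustration). It suggests that Euclidean distance on the raw PCA scores just reproduces the ordinary Euclidean distance, since PCA with all components is a rotation, and that the Mahalanobis distance only appears once each score is divided by the square root of its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated "winter" and "summer" usage, for illustration only
X = rng.multivariate_normal([900, 600], [[400, 300], [300, 400]], size=200)

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

scores = Xc @ eigvecs              # raw PCA scores (a rotation of the data)
white = scores / np.sqrt(eigvals)  # scores rescaled to unit variance

a, b = 0, 1                        # any two points
diff = X[a] - X[b]
d_raw = np.linalg.norm(diff)
d_scores = np.linalg.norm(scores[a] - scores[b])
d_white = np.linalg.norm(white[a] - white[b])
d_mahal = np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(np.isclose(d_scores, d_raw))   # raw scores: same as plain Euclidean
print(np.isclose(d_white, d_mahal))  # scaled scores: Mahalanobis
```

Keeping only the first few components, or skipping the rescaling, gives a distance that is related to but not the same as the Mahalanobis distance.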

Another thing I was considering was adding all twelve months of usage to the PCA calculation, along with the coordinates, and then measuring the distance between the treatment and control scores on the first 2 or 3 PCs. Using all 12 months would help match on 1) total consumption and 2) usage patterns, rather than on total consumption alone.
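A minimal sketch of that idea, assuming one row per unit with twelve monthly usage columns plus two coordinate columns, might look like the following. The data are randomly generated, the helper names are hypothetical, and two choices are assumptions rather than settled decisions: each column is z-scored so that coordinates and usage are on comparable scales, and the PCA is fit on treatments and controls pooled together.

```python
import numpy as np

def pca_scores(features, n_components):
    """Z-score each column, then return scores on the leading components."""
    features = np.asarray(features, float)
    z = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T

def nearest_control(scores, is_treated):
    """Row index of the closest control (in score space) for each treated row."""
    is_treated = np.asarray(is_treated, bool)
    ctrl_idx = np.flatnonzero(~is_treated)
    t, c = scores[is_treated], scores[ctrl_idx]
    dist = np.linalg.norm(t[:, None, :] - c[None, :, :], axis=2)
    return ctrl_idx[dist.argmin(axis=1)]

rng = np.random.default_rng(1)
n = 50
monthly = rng.gamma(shape=5.0, scale=150.0, size=(n, 12))  # hypothetical usage
coords = rng.uniform(0.0, 1.0, size=(n, 2))                # hypothetical x, y
features = np.hstack([monthly, coords])
is_treated = np.arange(n) < 10                             # first 10 rows treated

scores = pca_scores(features, n_components=3)
matches = nearest_control(scores, is_treated)
print(matches)  # one control row index per treated unit
```

Note that the retained scores are not rescaled here, so the first component carries the most weight in the distance; dividing each score column by its standard deviation would weight the retained components equally.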

Anyway, I apologize for how wordy the question is. I am mostly curious to know what you think about including the coordinates in the distance calculation, and whether PCA can be used on a time series like this (i.e., the monthly consumption).

Thanks in advance.