# Thread: Which clustering algorithm could I use to group 2D points over a time series?

1. ## Which clustering algorithm could I use to group 2D points over a time series?

I am a software developer struggling to understand which clustering method/algorithm would be most appropriate to spatially group 2-dimensional point data (x,y) that is arranged over a time series (an example might be location data as someone moves around a 2D map).

I've read up on techniques such as K-Means, but it doesn't seem like it would respect the ordering of the data at all.

In terms as simplistic as I can explain;

* The sample size is variable.
* I will not know the number of clusters that the algorithm should produce ahead of time. The number of clusters is also not linked to the sample size in any meaningful way.
* I will be able to provide the maximum area/spread that a cluster should cover (in other words; how tightly grouped the points must be to be considered part of the same cluster).
* I will want to ignore outliers to these clusters as the point data may have erroneous values.
* The exact time stamps are not important, but the ordering of the data is critical. For example if the series of points was focussed around a central point 'A', moved away to be focussed around a central point 'B', and then returned to be focussed around a central point near 'A' again, the ordering of the data should dictate that there should be at least 3 clusters produced, rather than the 2 that I believe K-means clustering would produce.

I intend to post this same query on mathforum.org, mymathforum.com, mathhelpforum.com and mathoverflow.net (and possibly any other active forums I might find). I hope this is not unacceptable duplication.

Thanks,
Julius

2. ## Re: Which clustering algorithm could I use to group 2D points over a time series?

Hi, I think maybe you need to illustrate more about under what condition some points should be clustered, e.g., points within a certain distance.

Have you viewed this post? -- Clustering Algorithm for Mapping Application maps - Clustering Algorithm for Mapping Application - Stack Overflow

Seems you could get some clues there.

3. ## Re: Which clustering algorithm could I use to group 2D points over a time series?

Hi Kyle, thanks for your reply. Interesting post, but the clustering they discuss doesn't seem to be on a time series (unless I didn't read it carefully enough!).

Now that you ask me to clarify I'm appreciating how much of a difference the criteria of a cluster would make. I'm not sure what the most meaningful approach would be, but for a set of points to be classed as an area of focus I think it would have to have a minimum number of points. Beyond that I can't really define in my own head any limitations to do with cluster area (which I realise is a big failing on my part), but let me try by explaining the scenario better...

Imagine you had fairly inaccurate GPS location data for a man walking around a 2D map over a span of time. You are interested in the places where he stops to smell the roses (ie areas of focussed points in the time series, that may or may not include erroneous 'bad' data that needs to be excluded from the clusters). He might double back on himself to the same places, but we need to recognise that he left the area and that the subsequent visits to the same areas are later, or that he travelled more than a defined distance from the focus area before returning (or a combination of these things). He may also visit focus points that are fairly close, which need to be grouped desperately, so I will probably have to define a minimum distance between the clusters, if that is possible.

4. ## Re: Which clustering algorithm could I use to group 2D points over a time series?

Originally Posted by iluvyoulongtime
Hi Kyle, thanks for your reply. Interesting post, but the clustering they discuss doesn't seem to be on a time series (unless I didn't read it carefully enough!).

Now that you ask me to clarify I'm appreciating how much of a difference the criteria of a cluster would make. I'm not sure what the most meaningful approach would be, but for a set of points to be classed as an area of focus I think it would have to have a minimum number of points. Beyond that I can't really define in my own head any limitations to do with cluster area (which I realise is a big failing on my part), but let me try by explaining the scenario better...

Imagine you had fairly inaccurate GPS location data for a man walking around a 2D map over a span of time. You are interested in the places where he stops to smell the roses (ie areas of focussed points in the time series, that may or may not include erroneous 'bad' data that needs to be excluded from the clusters). He might double back on himself to the same places, but we need to recognise that he left the area and that the subsequent visits to the same areas are later, or that he travelled more than a defined distance from the focus area before returning (or a combination of these things). He may also visit focus points that are fairly close, which need to be grouped desperately, so I will probably have to define a minimum distance between the clusters, if that is possible.
Hi, your explanation helps me understand the purpose of your clustering algorithm much better. It must be easier for others to discuss and suggest as well. Below is what I think you could consider to implement the clustering alg:

1. I assume that GPS location points are sampled periodically, e.g., one location point per minute. And they are already sorted in a time series, say (x1,y1), (x2,y2) ... for locations at the first and the second minute, respectively.

2. You'd like to find spots where the man stayed for quite a while; then you probably first need to first define the coverage of a spot. For example, when you find that the man lingers on around (x5,y5) after the 5th minute, how close (like 2m or 5m) from each of (x6,y6), (x7,y7)... to (x5,y5) you are going to say the man stays around (x5,y5).

3. If the above pieces fit somewhat your thoughts, here's a possible implementation plan:
__for i in [2, n] %say n location points (x1,y1) ... (xn,yn)
____calculate the distance d[i,i-1] between (xi,yi) and (x[i-1], y[i-1]);
____if d[i,i-1]<spot_coverage_distance %say the distance you use to measure the closeness
______cluster (x[i-1],y[i-1]) and (xi,yi); %say point (x[i-1],y[i-1]) could be a focused spot
______calculate the distance d[i-1,i+1] between (x[i+1],y[i+1]) and (x[i-1],y[i-1]) % to test whether the man still stays around (x[i-1],y[i-1]) after 2 mins;

similar testing for the rest of locations;

Hope they are useful and welcome further discussions

5. ## Re: Which clustering algorithm could I use to group 2D points over a time series?

H Kyle. Thanks again.

Your suggestion was an excellent starting point for me. I have a few ideas now that I've added on top of your idea; including a bit of added resilience to ignore bad data points when creating the cluster. I also thought it would be a good idea to calculate a centre point of all points in the cluster and comparing the distance from that to the new point, rather than comparing all points to an initial point location. Before adding the new point you could check that the proposed new centre point (after adding the currently considered point) would not be too far away from any of the points in the cluster so far. If it would drag the centre point too far away from any already clustered point then not include the new point. Does that make sense?

Thank you again - you gave me a push in the right direction.

6. ## Re: Which clustering algorithm could I use to group 2D points over a time series?

Originally Posted by iluvyoulongtime
H Kyle. Thanks again.

Your suggestion was an excellent starting point for me. I have a few ideas now that I've added on top of your idea; including a bit of added resilience to ignore bad data points when creating the cluster. I also thought it would be a good idea to calculate a centre point of all points in the cluster and comparing the distance from that to the new point, rather than comparing all points to an initial point location. Before adding the new point you could check that the proposed new centre point (after adding the currently considered point) would not be too far away from any of the points in the cluster so far. If it would drag the centre point too far away from any already clustered point then not include the new point. Does that make sense?

Thank you again - you gave me a push in the right direction.
Hi, I'm really glad to hear that you find the suggestions useful.

Your consideration of the cluster center is definitely more thoughtful and necessary. My only further suggestion is that you could start from the basic framework as you illustrated above, test till it meets the basic requirements, and then refine it with more sophisticated features.

Hope you could nail the project soon!