Hi, I'm a new forum member, so apologies if this post is in the wrong section, but it seemed the most appropriate.


I have a set of data that I want to group or cluster optimally. The data itself is non-numeric. What technique(s) can do this?


As a basic example, imagine I have 4 people (1, 2, 3, 4) and each one can have 2 characteristics: town and name. My data might look like this. Assume that Jane in London is the same person as Jane in Luton; she just has different information on different documents.

Document/person   Town        Name
1                 London      Jane
2                 Liverpool   Amy
3                 Liverpool   Amy
4                 Luton       Jane



There is no dominant characteristic, and there are multiple possible groupings. One answer might be to group (2,3,4) together and leave (1) on its own. This would be a poor solution: although group (1) looks fine, group (2,3,4) contains 2 different towns (Liverpool and Luton) and 2 different names (Amy and Jane). Jane has also been split between two groups, which isn't good.


Another solution would be (1), (2,3) and (4). This looks pretty good: all the Amys are together and all the Liverpools are together. The only issue is that Jane has been split between two groups (London and Luton).


So the 'error' can be reduced further by going with (1,4) and (2,3). Although this still has some error (2 towns for Jane), it feels like the best grouping.
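To make the 'error' concrete, here is a rough Python sketch of one way it could be scored. This is only an illustration under my own assumptions, not an established method: the function names (grouping_error, all_partitions, best_grouping), the pairwise scoring rule and the weights are all made up for this post. The rule adds a penalty for every pair of records in the same group whose values differ, and for every pair in different groups whose values match.

```python
from itertools import combinations

# Toy records from the example above: id -> (town, name).
RECORDS = {
    1: ("London", "Jane"),
    2: ("Liverpool", "Amy"),
    3: ("Liverpool", "Amy"),
    4: ("Luton", "Jane"),
}


def grouping_error(groups, records=RECORDS, weights=(1.0, 1.0)):
    """Pairwise disagreement cost of a grouping (hypothetical scoring rule).

    weights holds one assumed penalty per characteristic, here (town, name).
    """
    group_of = {rid: gi for gi, group in enumerate(groups) for rid in group}
    cost = 0.0
    for a, b in combinations(sorted(records), 2):
        together = group_of[a] == group_of[b]
        for w, va, vb in zip(weights, records[a], records[b]):
            # Penalise differing values inside a group and
            # matching values split across groups.
            if together != (va == vb):
                cost += w
    return cost


def all_partitions(items):
    """Yield every way to split a list of items into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in all_partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new group of its own.
        yield [[first]] + partition


def best_grouping(records=RECORDS, weights=(1.0, 1.0)):
    """Exhaustive search for the lowest-cost grouping (tiny inputs only)."""
    return min(
        all_partitions(sorted(records)),
        key=lambda groups: grouping_error(groups, records, weights),
    )


if __name__ == "__main__":
    for groups in ([[1], [2, 3, 4]], [[1], [2, 3], [4]], [[1, 4], [2, 3]]):
        print(groups, grouping_error(groups))
    print(best_grouping(weights=(1.0, 2.0)))
```

With equal weights this rule scores (1)(2,3,4) at 5, while (1)(2,3)(4) and (1,4)(2,3) both score 1, so my preference for (1,4)(2,3) seems to imply that a split name should cost more than a mismatched town. With the assumed weights (1.0, 2.0), (1,4)(2,3) comes out as the single best grouping. The exhaustive search is only workable for a handful of records, because the number of possible groupings grows extremely fast.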

In my real problem I have about eight categories and thousands of lines. Instead of the simple name used here, I am actually using forename plus surname plus date of birth, and instead of town I am using addresses.

Is there a mathematical technique/algorithm to do this sort of thing?

Any help gratefully received

Thanks