# Similarity Score to use in Clustering vectors

• Oct 11th 2012, 08:06 PM
aryanh
Similarity Score to use in Clustering vectors
Hi everyone,

Is there a method for measuring the similarity between two vectors in a set of vectors that takes into account the distribution of each attribute in that set?

To explain by example: assume we have 10000 vectors <x1,x2,...,xn>, and we want to know how similar two specific vectors in this dataset, <t1,t2,...,tn> and <r1,r2,...,rn>, are. Is there a method that takes into account the distribution of each of x1, x2, x3, ... separately?

I want to use it for clustering data.

Thanks

Arian
• Oct 11th 2012, 10:21 PM
chiro
Re: Similarity Score to use in Clustering vectors
Hey aryanh.

There are many different approaches to classification, and this kind of problem is tackled in data mining.

One similarity technique in data mining uses what are called "codebook" vectors, which relate to self-organizing maps:

Self-organizing map - Wikipedia, the free encyclopedia

However, if you have a distribution, then one technique at the heart of statistical inference is a statistical test of whether two parameters differ by a statistically significant amount.

An example of this is the comparison of two means: given samples from two different populations, we can construct a hypothesis test, under various models, of whether one population mean is the same as the other.
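As a rough illustration of the two-means comparison, here is a minimal sketch of Welch's t statistic using only the Python standard library. The function name `welch_t` and the sample values are made up for the example, and Welch's test is just one of the possible models (it does not assume equal variances). Turning the statistic into a p-value would additionally need the t distribution, which is left out here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance uses the n - 1 denominator (sample variance).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Standard error of the difference between the two means.
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err


# Hypothetical samples drawn from two different populations.
sample_a = [5.1, 4.9, 5.0, 5.2, 4.8]
sample_b = [5.9, 6.1, 6.0, 6.2, 5.8]

t_stat = welch_t(sample_a, sample_b)
print(t_stat)
```

A statistic that is large in magnitude relative to the t distribution suggests the two population means differ; a value near zero gives no evidence of a difference.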

Now one analogue (and it's far from the only one) is this: if you have, say, a multi-variable parameterized distribution for your vectors, or some parameterization that accurately captures the "thing" you want to test, and you know its distribution, then you can do a statistical hypothesis test of whether the two are the same.

However, before looking at statistical inference, the first thing I would recommend is to define a metric on your space, that is, a "distance" function between two vectors, and use that metric to decide similarity.
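One metric that takes the distribution of each attribute into account is the standardized Euclidean distance, where each attribute difference is divided by that attribute's standard deviation over the whole dataset. It is the special case of the Mahalanobis distance in which the attributes are treated as uncorrelated. The sketch below is only an illustration: the function names and the tiny dataset are made up, and the choice to skip constant attributes is an assumption.

```python
import math
import statistics


def attribute_scales(dataset):
    """Population standard deviation of each attribute (column) in the dataset."""
    return [statistics.pstdev(column) for column in zip(*dataset)]


def standardized_distance(t, r, scales):
    """Euclidean distance with each attribute difference divided by its spread."""
    total = 0.0
    for t_i, r_i, scale in zip(t, r, scales):
        if scale == 0:
            continue  # assumption: a constant attribute is skipped
        total += ((t_i - r_i) / scale) ** 2
    return math.sqrt(total)


# Hypothetical dataset: the second attribute has a much larger range than the first.
dataset = [
    (1.0, 1000.0),
    (2.0, 3000.0),
    (3.0, 2000.0),
    (4.0, 4000.0),
]

scales = attribute_scales(dataset)
d = standardized_distance(dataset[0], dataset[1], scales)
print(d)
```

Without the scaling, the second attribute would dominate the distance simply because its values are larger; after scaling, each attribute contributes relative to how much it varies in the dataset.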

Data mining does exactly this, but in a variety of contexts: it assigns different kinds of metrics between points and uses a variety of tests to see whether things are "similar". Usually the more advanced part is transforming the data from one space to another and then applying a metric in the new space rather than the old one.

There are lots of reasons to transform data to a new space (even a much higher-dimensional one, as kernel methods do, or a lower-dimensional one, as in multi-dimensional scaling), but the core idea is to transform the data so that it shows a particular attribute that you don't see in the un-transformed space.
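To make the idea of a transformed space concrete, here is a small sketch with made-up data: points on two concentric rings cannot be separated by any single original coordinate, but after adding a third coordinate x^2 + y^2 the rings are separated by that coordinate alone. The helper names `ring` and `add_radius_feature` are hypothetical, and this particular feature map is just one simple choice.

```python
import math


def ring(radius, count):
    """Evenly spaced points on a circle of the given radius around the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


def add_radius_feature(point):
    """Map (x, y) to (x, y, x^2 + y^2), a higher-dimensional space."""
    x, y = point
    return (x, y, x * x + y * y)


inner = ring(1.0, 8)
outer = ring(3.0, 8)

inner_lifted = [add_radius_feature(p) for p in inner]
outer_lifted = [add_radius_feature(p) for p in outer]

# In the lifted space the third coordinate alone separates the two rings.
max_inner = max(p[2] for p in inner_lifted)
min_outer = min(p[2] for p in outer_lifted)
print(max_inner, min_outer)
```

A distance computed in the lifted space can then group points by ring, which a distance on the raw x or y coordinate alone would not do.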