
Math Help - Similarity Score to use in Clustering vectors

  1. #1
    Newbie
    Joined
    Oct 2012
    From
    Montreal
    Posts
    2

    Similarity Score to use in Clustering vectors

    Hi everyone,

    Is there a method for measuring the similarity between two vectors in a set of vectors that takes into account the distribution of each attribute across that set?

    To explain by example: suppose we have 10,000 vectors <x1, x2, ..., xn>, and we want to know how similar two specific vectors in this dataset, <t1, t2, ..., tn> and <r1, r2, ..., rn>, are. Is there a method that takes into account the distributions of x1, x2, x3, ... separately?

    I want to use it for clustering data.

    Thanks

    Arian

  2. #2
    MHF Contributor
    Joined
    Sep 2012
    From
    Australia
    Posts
    4,026
    Thanks
    741

    Re: Similarity Score to use in Clustering vectors

    Hey aryanh.

    There are many different approaches to classification, and this kind of problem is tackled in data mining.

    One similarity technique in data mining, which uses what are called "code-blocks", relates to self-organizing maps:

    Self-organizing map - Wikipedia, the free encyclopedia

    However, if you have a distribution, then one technique at the heart of statistical inference is a statistical test of whether two parameters are significantly different.

    An example is the comparison of two means: under various models, we can construct a hypothesis test of whether one population mean equals another, with each sample drawn from a different population.
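
    As a rough illustration of the two-means idea, here is a minimal sketch of Welch's t statistic (one common choice when the two populations may have unequal variances) using only the Python standard library. The function name welch_t and the sample values are made up for illustration; turning the statistic into a p-value would additionally need the t distribution, which is not shown here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test on the means."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up samples, for illustration only.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.0, 6.2]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

    A large |t| relative to the t distribution with df degrees of freedom is evidence that the two population means differ.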

    Now one analogue (and it's far from the only one): if you have, say, a multivariate parameterized distribution for your vectors, or some parameterization that accurately captures the "thing" you want to test, and you know its distribution, then you can do a statistical hypothesis test of whether the two are the same.

    However, before pursuing the statistical inference idea above, the first thing I would recommend is to define a metric on your space, that is, a "distance" function between two vectors, and use this metric to decide similarity.
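
    One possible metric of this kind, which also uses the per-attribute distribution asked about in the question, is a standardized Euclidean distance: each coordinate difference is divided by that attribute's standard deviation over the whole dataset. This is only a sketch of one option, not the only sensible metric; the function names and the tiny dataset are invented for illustration.

```python
import math
import statistics


def attribute_stdevs(dataset):
    """Population standard deviation of each attribute (column) across the dataset."""
    return [statistics.pstdev(column) for column in zip(*dataset)]


def standardized_distance(u, v, stdevs):
    """Euclidean distance with each coordinate difference scaled by that attribute's spread.

    Attributes with zero spread carry no information and are skipped.
    """
    return math.sqrt(
        sum(((a - b) / s) ** 2 for a, b, s in zip(u, v, stdevs) if s > 0)
    )


# Made-up dataset: the second attribute lives on a scale 100 times larger.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
stdevs = attribute_stdevs(data)
print(standardized_distance(data[0], data[1], stdevs))
```

    Without the scaling, the second attribute would dominate the distance simply because of its units; with it, both attributes contribute in proportion to how unusual the difference is within the dataset.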

    Data mining does exactly this, but in a variety of contexts: it assigns different kinds of metrics between points and uses a variety of tests to see whether things are "similar". Usually, the more advanced part is transforming the data from one space to another and then applying a metric in the new space rather than the old one.

    There are lots of reasons to transform data to a new space (even a much higher-dimensional one, as kernel methods do; multi-dimensional scaling, by contrast, usually maps to a lower-dimensional one), but the real core of this is to transform the data so that a particular attribute becomes visible that you don't see in the un-transformed space.
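
    To make the "transform, then apply a metric" idea concrete, here is a small sketch that maps each attribute to its empirical CDF value within the dataset (a rank-style transform) and then measures ordinary Euclidean distance in the new space. The choice of transform is an assumption made for illustration, as are the names fit_ecdf_transform and euclidean and the toy data; other transforms may suit a given dataset better.

```python
import bisect
import math


def fit_ecdf_transform(dataset):
    """Return a function mapping a vector to per-attribute empirical CDF values in [0, 1]."""
    sorted_columns = [sorted(column) for column in zip(*dataset)]
    n = len(dataset)

    def transform(vector):
        # For each attribute: the fraction of dataset values less than or equal to x.
        return [
            bisect.bisect_right(column, x) / n
            for x, column in zip(vector, sorted_columns)
        ]

    return transform


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up dataset with one extreme value in the first attribute.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [1000.0, 40.0]]
to_cdf = fit_ecdf_transform(data)

raw = euclidean(data[2], data[3])
transformed = euclidean(to_cdf(data[2]), to_cdf(data[3]))
print(raw, transformed)
```

    In the raw space the outlier makes the last two vectors look extremely far apart; in the transformed space they are simply neighbours in rank on both attributes.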
