Say I have two lists, List1 and List2 containing elements such as words. Some words are common two both List1 and List2. I want to create a distance metric that tells me how far apart the two lists are based on a similarity "score". The similarity score and distance metric are as follows:
Similarity score: Intersection(List1,List2) / Union(List1,List2)
Distance = 1 - Similarity Score
In other words, the similarity score is just the percentage overlap between the two lists and the distance is 0 when the two lists are the same. Say I generalize this to n lists and I calculate the distances between lists (a symmetric matrix of distances). My question is, is this distance formula valid? In other words, does it satisfy the triangle inequality? How do I check this?
A first attempt if all list sizes are equal:
Let N = size of the list, x, y & z = pairwise overlaps between three lists. You must have that x >= y+z-N. Distance between lists that give you x is 1-x/(2N-x). With some algebra you can conclude that the triangle inequality is satisfied for all valid x, y & z.
Is this true for arbitrary list sizes?