# Numerical problem: efficient Euclidean distance with very large numbers

• Dec 17th 2007, 11:29 AM
sirandreus
Numerical problem: efficient Euclidean distance with very large numbers
Hi all,

I'm looking for an efficient solution to the following problem:

Let V be a set of m vectors, V = {v1, v2, ..., vm}. Given an input vector vi, I need to find the vector in V with the smallest Euclidean distance to vi (the best matching unit).
Normally not an issue. However, every component of V's vectors and of the input vector vi is a value from the open interval (0,1).

example 2D vector:
v1 = (0.0123123423423 , 0.00023567887)

Assume that every number has several thousand or even a million digits. So, practically speaking, if you implement the Euclidean distance for numbers like these, you need a special high-precision library, where every arithmetic operation (add, multiply, sqrt) runs many times slower than with primitives such as double.

Recently, I had the idea to tokenize a long number into smaller pieces and to run the Euclidean distance comparison on these pieces only.
So, for instance, let's assume we are in 1D (for simplicity) and we have the following two vectors v1, v2 and the input vector vi:

v1 = 0.001222345566
v2 = 0.123456649900
vi = 0.1923343233232

Let's say we tokenize the numbers into pieces of length 5; then the first token (t1) of each vector looks like this:

v1_t1 = 00122
v2_t1 = 12345
vi_t1 = 19233
(here I am leaving out the leading "0." since it is shared by all numbers)

Here we see that the first 5 digits of v2 are closer to vi than those of v1. Whatever comes after these 5 digits can change each value by less than one unit of the last compared digit, so v2 will always be closer to vi, and thus v2 is the best matching unit.
If the first tokens of v1 and v2 had the same (or a very close) distance to vi, then the second tokens, containing the next 5 digits, would be compared.

The questions:

Does this approach make sense? (By "approach" I mean tokenizing the numbers and comparing them part by part.)
Am I overlooking anything fundamental?
How can this idea be improved?

edit @ 18.12.07, 17:56:
For new readers and future responses: please accept the fact that I'm using huge numbers; this is simply a requirement and can't be changed.
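As a rough 1-D sketch of this idea (my own illustration; the function name, the zero-padding, and the elimination threshold of two units are assumptions, not part of the original post):

```python
def best_match(candidates, query, token_len=5):
    """Find the candidate digit string closest to `query`.

    Inputs are the fractional digit strings (the part after "0.").
    Candidates are compared token by token, left to right.
    """
    # Pad everything to a common length that is a multiple of token_len.
    n = max(len(query), max(len(c) for c in candidates))
    n = -(-n // token_len) * token_len  # round up
    query = query.ljust(n, "0")
    padded = {c: c.ljust(n, "0") for c in candidates}

    alive = list(candidates)
    for p in range(token_len, n + 1, token_len):
        q = int(query[:p])
        dist = {c: abs(int(padded[c][:p]) - q) for c in alive}
        best = min(dist.values())
        # The unread digits can change each value by less than one unit
        # of the last compared digit, so any candidate more than two
        # units behind the current leader can never win.
        alive = [c for c in alive if dist[c] <= best + 2]
        if len(alive) == 1:
            break
    # A single survivor, or a tie on all digits: pick the exact minimum.
    return min(alive, key=lambda c: abs(int(padded[c]) - int(query)))
```

For the example above, `best_match(["001222345566", "123456649900"], "1923343233232")` returns `"123456649900"`, i.e. v2, after looking at only the first token.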

Andreas
• Dec 17th 2007, 08:42 PM
Constatine11
Quote:

Originally Posted by sirandreus
[...]

Note that the solution to this problem does not need to be exact. Thus approximations are welcome as well :)

Andreas

If it does not need to be "exact", then why not just truncate all the numbers
to 10 significant digits and work with those in standard double-precision
floating point?

ZB
• Dec 17th 2007, 11:02 PM
sirandreus
hi,

The representation of the numbers must have high precision, but practically speaking it is "not exact".

I can't just take the first 10 digits; see this counterexample:

0.00000000000002354534534
0.00000000000001123545455

You cannot compare these two numbers if you cut off everything but the first 10 positions.

In my original idea (see first post), you would first compare the first 10 digits, which are equal. Because they are equal, you would compare the next 10 positions (assuming you are moving in steps of 10).
There you would compare:

0002354534
0001123545
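A tiny Python demonstration of this point (my own illustration):

```python
a = "00000000000002354534534"  # digits of the first number after "0."
b = "00000000000001123545455"  # digits of the second number

# Truncated to 10 digits, the two numbers are indistinguishable.
print(a[:10] == b[:10])               # True: both are "0000000000"

# The second token of 10 digits decides the comparison.
print(int(a[10:20]) > int(b[10:20]))  # True: 0002354534 > 0001123545
```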

cheers,
Andreas
• Dec 17th 2007, 11:38 PM
CaptainBlack
Quote:

Originally Posted by sirandreus
hi,

the results have to be exact (corrected original post as well).

cheers,
Andreas

Why do you need such high precision?

Working with, say, $13$ digit precision, the probability of two distances being
indistinguishable is $\sim 10^{-13}$ (with a bit of scaling, and to a hand-waving
approximation), so you will need a lot of vectors before the chance
of encountering a problem becomes significant.

RonL
• Dec 18th 2007, 12:11 AM
sirandreus
hi Captain,

why I need such high precision is described here: http://www.mathhelpforum.com/math-he...html#post90666

In short: given (0,1)^n as an input space and k vectors v1,...,vk in that space, these k vectors are mapped uniquely onto one vector in (0,1)^n.
This works by interleaving the decimal expansions of the vectors, and thus a very long vector can be created.
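The interleaving step can be sketched like this (my own illustration; zero-padding shorter inputs to a common length is an assumption):

```python
def interleave(digit_strings):
    """Interleave the fractional digit strings of several numbers
    into one long digit string (a sketch of the mapping described
    above; zero-padding shorter inputs is my own assumption)."""
    n = max(len(s) for s in digit_strings)
    padded = [s.ljust(n, "0") for s in digit_strings]
    # Take the 1st digit of each number, then the 2nd of each, ...
    return "".join("".join(col) for col in zip(*padded))
```

For example, `interleave(["123", "456"])` yields `"142536"`.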

Sorry CaptainBlack, but I can't figure out how your statement about indistinguishable distances is related to my problem :(
As I understand it, you are talking about the probability that 2 vectors have the same digits (to a length of 13). In fact, this probability is not relevant, because I'm looking for the best matching unit (the vector from the set V which is closest, in terms of Euclidean distance, to the input vector v_i).

kind regards,
Andreas
• Dec 18th 2007, 03:36 AM
CaptainBlack
Quote:

Originally Posted by sirandreus
[...]

You are doomed to failure unless you recast your problem.

Almost all reals in (0,1) have non-terminating decimal expansions, which
leaves you with infinitely long calculations.

The interleaving of the decimal expansions is an "in-principle" process; in
practice it is impossible to do exactly.

RonL
• Dec 18th 2007, 03:46 AM
sirandreus
OK, let's assume exactness is not a major requirement. For me it's more important whether the modified Euclidean distance computation works (see first post).

In practice you could use a maximum precision, such as 100000 digits. In practice nothing is exact :) or at least not efficiently possible.
• Dec 18th 2007, 04:03 AM
sirandreus
Quote:

Originally Posted by CaptainBlack
You are doomed to failure unless you recast your problem.

Almost all reals in (0,1) have non-terminating decimal expansions, which
leaves you with infinitely long calculations.

RonL

Correct, but not an issue for me, because I already have the numbers in a finite representation before I do the interleaving.

I don't care what these numbers should exactly be. I only see the numbers and assume that they are accurate. From a practical and numerical perspective there is nothing else you can do. I mean, if your sensor gives you 0.33333333335, that's it; you can't ask yourself whether it could mean 0.3333333333333.....33333....

cheers,
Andreas
• Dec 18th 2007, 04:24 AM
CaptainBlack
Quote:

Originally Posted by sirandreus
[...]

1. Check your sensor's real accuracy; 24-bit ADCs don't give 24-bit accuracy,
maybe 20 bits. You may not have as much real precision as you think.

2. Work with int64s (or, if you have them, int128s).

3. Look at getting an arbitrary-precision arithmetic package/class (google for
one for the programming language you intend to use).
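Point 3 can be tried directly in Python, whose standard library already ships an arbitrary-precision `decimal` type (a sketch of my own; the vectors and the 60-digit precision are made-up values). When only the closest vector is wanted, the final square root can also be skipped, since sqrt is monotone:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # digits of working precision (an arbitrary choice)

# Made-up 2-D example vectors with high-precision components.
v1 = (Decimal("0.0123123423423"), Decimal("0.00023567887"))
vi = (Decimal("0.0123123423424"), Decimal("0.00023567899"))

# Squared Euclidean distance; for "which vector is closest" the final
# sqrt can be skipped, because sqrt is monotone.
d2 = sum((a - b) ** 2 for a, b in zip(v1, vi))
print(d2)
```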

RonL
• Dec 18th 2007, 07:53 AM
sirandreus
Hi,

Captain, thank you for your feedback.

However, the main question of this thread still remains unanswered: does the modified Euclidean distance work? Please read the first post for that.

If not, is it possible to modify it so that it works?
Or is the idea fundamentally wrong? I mean the idea of tokenizing the numbers and comparing them part by part.

kind regards,
Andreas