Hi all,

Iīm looking for an efficient solution for the following problem:

Let V be a set containing m vectors V={v1,v2,...,vm}. Given an input vector vi I need to find the vector from V which has the smallest euclid distance to vi (find best matching unit).

Normally not an issue. However, every element of Vīs vectors and the input vector vi always have a value from the open interval (0,1).

example 2D vector:

v1 = (0.0123123423423 , 0.00023567887)

Assume that every number has several thousand or even million digits. So practically spoken, if you implement euclid distance for this kind of numbers, you will need special high precision libraries, where every mathematical operation (like add, mult or sqrt) will run many times slower than with primitives such as double.

Recently, I had the idea to tokenize a long number into smaller peaces and execute the euclid distance only on this peaces.

so for instance lets assume we are in 1d (for simplicity) and we have the following two vectors v1, v2 and the input vector vi

v1 = 0.001222345566

v2 = 0.123456649900

vi = 0.1923343233232

lets say we are tokenizing the numbers in peaces of length 5, then the first token (t1) for each vector would look like this:

v1_t1 = 00122

v2_t1 = 12345

vi_t1 = 19233

(here I am leaving away "0." since it is shared by all numbers)

here we see that the 5 first numbers of v2 are closer to vi than v1īs. Thus whatever comes after this 5 numbers, v2 will always be closer to vi. And thus vi is the best matching unit.

If the first tokens of v1 and v2 had the same (or very close) distance to vi, then the the second token containing the next 5 numbers would be compared.

The questions:

does this approach make sense ?

Am I overlooking anything fundamental ?

How can this idea be improved ?

Note that the solution to this problem does not need to be exact. Thus approximations are welcome as well

thanks in advance,

Andreas