I have two sets of data, each with approximately 300,000 rows, and 20 columns. My goal is to find the row in set 'b' that would require the fewest changes for it to match the row in set 'a'. The order of the 20 columns is important. The data in each column can be any one of about 300 different values. The data in each column can be utilized as a number or string.
A simple scenario using 10 columns of data:
Given a row 'a' of: 1,3,19,4,12,56,200,9,110,15
Using the sample rows 'b' below, row 3 would be the one I'd want as it would require the fewest changes to become row 'a'
1: 1,3,19,4,99,22,30,66,110,15 (matches 6 columns)
2: 12,56,1,3,19,4,110,15,200,9 (all same values, but different order)
3: 64,1,3,19,4,12,56,200,9,110 (removing 1st & adding last produces 100% match)
I've been looking over different matching algorithms (Levenshtein Distance, etc.) at String Similarity Metrics for Information Integration but haven't found anything I could adapt to my need.
For reference, this is data in a SQL Server database, but can be pulled out and worked on in a client application if necessary.
Thank you in advance for any help you may be able to provide.
Edit - I did want to mention that I'd attempted scoring each row on both sides, and matching based on the closest scores. My plan was to then go through the scores iteratively and find the best one to use. Unfortunately, I wasn't able to produce a score that provided values I could rely on.