Coming soon – bayesian estimates.

Let’s say that you have a collection of results that you wish to sort based on a number of criteria, how do you do this? There are a number of methods to do this, particularly if you can allow user actions to drive your relevance calculation. You can also employ some statistical methods to calculate the combined relevance.

But for us, let’s just assume that we have some data in an unsorted list without such info and that our data is simple enough such that we don’t need to resort to statistical methods to determine the relative relevance of the various attributes. A simple approach is based on using the \(L_2\) norm to determine the “distance” from the ideal result for each data point. In its simplest form, one normalizes each individual coordinate such that the distance along one column yields the same relative importance as the corresponding distance along another coordinate.

For instance, let’s say we have 3 attributes:

- date of birth
- first name
- last name
- geo coordinates

Individually, we might use the following criteria to sort the data. For date of birth, we can convert the whole date into seconds since 1970 and then subtract the two longs to determine their distance. we might also do something a little more complicated such as considering dates by looking at their component month, day and year portions and calculating a distance based on that. The idea is that the larger the distance, the worse the match. For this discussion, we’ll stick to just determing the number of seconds between the two dates. Hence we’d have

\( dateDistance = \sqrt{ (secondsSince1970For1stRecord – secondsSince1970For2ndRecord )^2 } \)For the names, we might just calculate the Levenshtein distance between them. If we get fancy, we can take into account word stems, use synonyms, nicknames etc.

\( firstNameDistance = levenshteinDistance(firstNameFor1stRecord, firstNameFor2ndRecord) \)ditto for the last name. Finally for the geo coordinates, we’ll use another metric to calculate the geographical distance

\( geoDistance = geoDistanceCalculator(geoCoordinateFor1stRecord, geoCoordinateFor2ndRecord) \)Combining these. We need to be able to compare the importance of having the correct birth vs having the same last name or at the same geo coordinates. Ideally, we’ll be able to pick a weight such that if the distance between to birthdates is 1, that will be equally important as the weighted distance between two last names. Ditto for the rest of the attributes.

so our new distance becomes:

\( multifactorDistance = \sqrt{( dateWeight * dateDistance^2 + firstNameWeight * firstNameDistance^2 + lastNameWeight * lastNameDistance^2 + geoWeight * geoDistance^2 ) } \)Note we can adjust the weights to change the relative importance of the various attributes when sorting.

Finally, since all we care about is the sort order, we actually don’t need to do the outer square root \( \sqrt{ – } \) . So we can just calculate:

\( multifactorDistanceSquared =( dateWeight * dateDistance^2 + firstNameWeight * firstNameDistance^2 +

lastNameWeight * lastNameDistance^2 + geoWeight * geoDistance^2 )

\)

And we’re done.