Tuesday, June 21, 2011

SimHash: Hash-based Similarity Detection

"SimHash: Hash-based Similarity Detection", Sadowski, Levin, 2007.

This paper outlines a hash algorithm that can be used for similarity detection. Most hash algorithm are designed to offer low collision and hash values for similar strings can vary quite a bit. This hash basically sets out to achieve the opposite, higher collision and hash keys for similar strings are similar if not the same.

If you cannot use term frequency and need a numeric representation of a string for statistical processing, what process can be used? Using an integer-based hash is one way to achieve this, though in my opinion it is not the most sophisticated of approaches.

An implementation of this algorithm showed that it is a reasonable approach for hashing strings with the intent to determine similarity. I found minor issues which I will address by altering the algorithm.

Results showed that this is a reasonable approach....

No comments: