Tuesday, June 21, 2011

Canopy Clustering

"Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", McCallum,Nigam,Ungar,http://www.kamalnigam.com/papers/canopy-kdd00.pdf

This paper discusses a different type of clustering, titled canopy clustering. It is an interesting idea. There are basically two thresholds, using a 'cheap distance metric', we evaluate a list of points. Threshold 1 is > than Threshold 2. Pick one point to compare with all the other points in the list. When the distance between the two points falls within T1 put the points into a canopy. If the distance falls within T2 then remove point from list. We generate the canopies this way and work through the list until empty.

We can then apply our second level of clustering to each canopy and are pretty much guaranteed that if two points do not fall into the same canopy then they are likely not to be co-referent and therefore do not need to be evaluated.

This is efficient and elegant. Currently the only implementation that I found of canopy clustering is in Mahout. I am building my own implementation though to get a feel for how well it works.

SimHash: Hash-based Similarity Detection

"SimHash: Hash-based Similarity Detection", Sadowski, Levin, 2007.

This paper outlines a hash algorithm that can be used for similarity detection. Most hash algorithm are designed to offer low collision and hash values for similar strings can vary quite a bit. This hash basically sets out to achieve the opposite, higher collision and hash keys for similar strings are similar if not the same.

If you cannot use term frequency and need a numeric representation of a string for statistical processing, what process can be used? Using an integer-based hash is one way to achieve this, though in my opinion it is not the most sophisticated of approaches.

An implementation of this algorithm showed that it is a reasonable approach for hashing strings with the intent to determine similarity. I found minor issues which I will address by altering the algorithm.

Results showed that this is a reasonable approach....