Tuesday, June 21, 2011

Canopy Clustering

"Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", McCallum,Nigam,Ungar,http://www.kamalnigam.com/papers/canopy-kdd00.pdf

This paper discusses a different type of clustering, titled canopy clustering. It is an interesting idea. There are basically two thresholds, using a 'cheap distance metric', we evaluate a list of points. Threshold 1 is > than Threshold 2. Pick one point to compare with all the other points in the list. When the distance between the two points falls within T1 put the points into a canopy. If the distance falls within T2 then remove point from list. We generate the canopies this way and work through the list until empty.

We can then apply our second level of clustering to each canopy and are pretty much guaranteed that if two points do not fall into the same canopy then they are likely not to be co-referent and therefore do not need to be evaluated.

This is efficient and elegant. Currently the only implementation that I found of canopy clustering is in Mahout. I am building my own implementation though to get a feel for how well it works.

No comments: