Tuesday, April 12, 2016

Nice GitHub Deep Learning Summary and Notes

I can't say enough good things about the list of resources here.

Friday, April 1, 2016

Image Thresholding in Python

I found this article to be very useful.

http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
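If you want to try the techniques it covers, here is a minimal sketch using OpenCV's Python bindings (the filename and parameter values are placeholders):

import cv2

# Load an image in grayscale; 'image.png' is a placeholder path
img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# Simple global threshold: pixels above 127 become 255, the rest 0
ret, thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Otsu's method picks the global threshold automatically from the histogram
ret2, otsu = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive thresholding computes a local threshold per neighborhood,
# which helps when lighting varies across the image
adaptive = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)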

Distilling the Knowledge in a Neural Network

http://arxiv.org/abs/1503.02531 

Abstract: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
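The mechanism at the heart of the paper is a softmax with a temperature: divide the teacher's logits by T > 1 before the softmax and the distribution softens, exposing the relative probabilities the teacher assigns to the wrong classes. A minimal sketch of that softening step:

import numpy as np

def softened_softmax(logits, T):
    # Dividing the logits by a temperature T > 1 flattens the distribution,
    # exposing the "dark knowledge" in the non-target class probabilities
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

The student is then trained against these soft targets (and, optionally, the true labels) rather than hard one-hot labels.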

Data Exploration with Kaggle Scripts, Data Science, Data Exploratory Courses

This might be interesting at a surface level.  I haven't evaluated it yet.

Data Exploration with Kaggle Scripts course.


Again more surface level stuff.


Intermediate Python for Data Science course.


This one actually might have more substance; it is taught by a JHU professor.

Coursera course on Exploratory Data Analysis


Thursday, March 31, 2016

A Survey of Graph Theory and Applications in Neo4J - Talk

This is a link to a talk given at a recent meet-up in Arlington, VA.

The talk starts out with pretty introductory material, but as it progresses it gets more interesting.  Definitely worth a read during a treadmill session.

Here is another relevant link.

My opinion of Neo4J, after using it for a year for experimental purposes, is that it is a decent application, but I highly doubt its scalability for big data.  I never tested this; it is a hunch based on my use.

Also, if you are using Neo4J to store triples: no, don't do that, it is way too much work.  Just use a triple store.

Monday, March 28, 2016

CS231n: Convolutional Neural Networks for Visual Recognition Winter Course Project Report

There are lots of interesting reads on this page.   And this is a great course to take if you are researching deep learning for image processing.

Sunday, March 20, 2016

3 Minute Thesis Competition - 3MT

Can you explain your dissertation in 3 minutes?

UMBC has a 3MT competition this Wednesday in Baltimore, MD.

If you are preparing for a 3MT, this is a good resource.

Other good 3MT videos:

2010 Trans-Tasman 3MT Winner - Balarka Banerjee from Three Minute Thesis (3MT®) on Vimeo.

Thursday, March 17, 2016

Markdown

I am starting to use markdown more.  I wanted to know why I should care about it, and this article gives a good view of why to use it, with a link to a tutorial.

Read it here.
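If you have never seen it, a few lines of markdown cover most of what you need day to day:

# A heading
Some *emphasis*, some **bold**, a [link](http://example.com), and `inline code`.

- a bullet item
- another bullet item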

Tuesday, March 15, 2016

Flask

Flask...

"Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions."

I toyed around with the tutorial and was able to get a few simple apps running.  If you are interested in building web sites, this might be worth a try.  I haven't determined whether it is useful for anything else.

http://flask.pocoo.org/
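For a sense of how "micro" the framework is, the canonical hello-world app (essentially the quickstart's first example) fits in a few lines:

from flask import Flask

app = Flask(__name__)

# Map the root URL to a view function
@app.route('/')
def hello():
    return 'Hello, world!'

if __name__ == '__main__':
    # debug=True enables the reloader and interactive debugger
    app.run(debug=True)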

Image Processing

Scipy package documentation:

http://www.scipy-lectures.org/packages/scikit-image/
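For a quick taste of scikit-image, a couple of lines go a long way (coins is one of the package's bundled sample images):

from skimage import data, filters

image = data.coins()          # a bundled grayscale sample image
edges = filters.sobel(image)  # Sobel edge magnitude, returned as floats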

Monday, March 14, 2016

Spring Break.....

I love working on campus during spring break.  A front parking space, an empty lab, no line for coffee.....ahhh, nirvana....


No coffee! Bah!

Thursday, March 10, 2016

Ugh, dissertation

In those moments when you are frustrated with your dissertation, breathe, and know there are others feeling the same pain.....

The valley....



Ride the wave to finish this thing...


Tuesday, March 8, 2016

DL4J

I have been using Java for a long time, but I find DL4J a bit cumbersome to use.  I prefer Torch/Lua or Theano for deep learning.

However, because Java has been such a significant part of my life for so long, I will not give up on DL4J.

More to come once I get this working.

In the meantime, here are a few links, so I can close those tabs:-)....

word2vec in DL4J
deep autoencoders in DL4J
nd4j

Whooo, Ahhh, ....


Nonlinear PCA for Matlab, fun stuff here!

I like popcorn and I like bag of words

I think I am going to like this too.

It is aimed at beginners, so it may not be very useful otherwise, but they said popcorn, so they have my attention....

Data sets for Presidential Debates


I want to play with these data sets, I really do.  Why haven't I done this yet?...

Get them here....

swirl (no, not that one, Semantic Webbers)

swirl is a way to learn R and data science interactively.

Try it out here.

Overlapping histograms and Histogram Tutorial in Python

Here is a thread that talks about this topic.  It is much easier to do in R, but a quick matplotlib version is sketched below.
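With made-up sample data, the trick is shared bins plus alpha blending:

import matplotlib.pyplot as plt
import numpy as np

a = np.random.normal(0, 1.0, 1000)
b = np.random.normal(1, 1.5, 1000)

# Shared bins keep the bars comparable; alpha blending shows the overlap
bins = np.linspace(-5, 6, 40)
plt.hist(a, bins=bins, alpha=0.5, label='sample A')
plt.hist(b, bins=bins, alpha=0.5, label='sample B')
plt.legend()
plt.show()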


More on histograms

matplotlib



Examples using matplotlib.


Tutorial for matplotlib.

A little bit on density plots.

And an introduction to plotting in Python.

knitr

For dynamic report generation in R.


Link for knitr

Monday, January 25, 2016

How to read a research paper

http://www.sciencemag.org/careers/2016/01/how-read-scientific-paper

Friday, November 20, 2015

GenSim and LDA

This is a nice, simple tutorial on using GenSim to run LDA.
GenSim LDA Tutorial
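The basic pipeline is only a few lines in gensim; a sketch with toy documents:

from gensim import corpora, models

# Toy corpus: documents as lists of tokens
docs = [["human", "computer", "interface"],
        ["graph", "trees", "minors"],
        ["human", "interface", "trees"]]

dictionary = corpora.Dictionary(docs)           # token -> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())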

Friday, January 9, 2015

My first teaching experience

I would have to say it was definitely a learning experience. I taught a database course to 40 students, mostly seniors and mostly male. I spent the semester researching and learning which techniques are best for teaching. I also learned how to create my own lectures, projects and exams. This experience definitely gave me a taste of what it would be like to teach professionally. I also learned a bit about myself: by the end of the semester I no longer felt uncomfortable speaking in front of a group or answering questions and challenges on the fly. I was also impressed with many of the students in my class.

Now it is back to research!

And maybe teaching again in the future once I recover from this first one.

Friday, October 3, 2014

Taming Wild Big Data

Our latest paper for the AAAI Fall Symposium.
Abstract: Wild Big Data is data that is hard to extract, understand, and use due to its heterogeneous nature and volume. It typically comes without a schema, is obtained from multiple sources and provides a challenge for information extraction and integration. We describe a way of subduing Wild Big Data that uses techniques and resources that are popular for processing natural language text. The approach is applicable to data that is presented as a graph of objects and relations between them and to tabular data that can be transformed into such a graph. We start by applying topic models to contextualize the data and then use the results to identify the potential types of the graph’s nodes by mapping them to known types found in large open ontologies such as Freebase and DBpedia. The results allow us to assemble coarse clusters of objects that can then be used to interpret the links and perform entity disambiguation and record linking.

Thursday, September 11, 2014

Learning Julia

Julia is a dynamic programming language that is getting a bit of attention. I am running a few tutorials and learning the language. Some resources are listed below in case you are interested....

Learn about julia
Quick tutorial
Another tutorial
Google group

Thursday, December 19, 2013

Lego Car

This is just too cool: a Lego car that runs on compressed air.

Tuesday, November 26, 2013

Kano kit

Something to think about...

Source: https://twitter.com/AIIsAwesome/status/404341645509812224/photo/1

Monday, November 18, 2013

Just received notice Qualcomm Toq Smartwatch is available Dec 2nd

The Qualcomm Toq Smartwatch will be available as of Dec 2nd at a starting cost of $349.99 (OUCHY!).

Sunday, November 17, 2013

AAAI Symposium

I've been spending the weekend at the AAAI Symposium. There have been quite a few interesting talks.
John Laird gave an interesting talk on General Intelligence.
Andrew Ng also gave an interesting talk, on Deep Learning.

Thursday, October 31, 2013

3D on the Web - Introduction to WebGL

This is an interesting talk offered through ACM.
Supplementary Learning Resources from Alain Chesnais
Collada:
Official website http://collada.org/
Tutorials https://collada.org/mediawiki/index.php/Portal:Tutorials
WebGL:
Official website http://www.khronos.org/webgl/
Tony Parisi's Tutorials http://learningwebgl.com/
Three.js:
Official website http://threejs.org/
Ilmari Heikkinen's Tutorial http://fhtr.org/BasicsOfThreeJS/#2
X3Dom:
Official website http://www.x3dom.org/
Introductory tutorial http://x3dom.org/docs/dev/tutorial/firststeps.html

Saturday, October 5, 2013

4D Printers

This is pretty cool: researchers are trying to add a 4th dimension to 3D printing. They are trying to print material that can change over time: not just its shape but, in response to some external stimulus, its behavior. I find this incredibly interesting.

The article is titled, "’4D printing’ adaptive materials" and is found on kurzweilai.net
Read the article at http://www.kurzweilai.net/4d-printing-adaptive-materials.

A great TED talk on this same topic:

Thursday, October 3, 2013

Google Glass

Our lab, ebiquity, has been playing with Google Glass. So now I am hooked and wondering what sort of app I could develop for it. I've been doing a little reading on Google Glass and came across this blog. The developer, Lance Nanek, built an interesting implementation of panning using head movement.

Watch the video to learn more:


Very cool stuff.

More information for developers is here.

And yet more info for developers.

And if you want Google Glass, then keep up with the latest news.

Wednesday, October 2, 2013

Mid-Atlantic Student Colloquium on Speech, Language and Learning

UMBC is hosting the Mid-Atlantic Student Colloquium on Speech, Language and Learning.

Register online.

The schedule is now posted.

Snapshot of the schedule:

09:00-09:45 Registration, set up

09:45-10:00 Opening

10:00-11:20 Oral presentations I

Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield and Jonathan Weese (UMBC & JHU). Semantic Textual Similarity Systems

Keith Levin, Aren Jansen and Ben Van Durme (JHU). Toward Faster Audio Search Using Context-Dependent Hashing

Shawn Squire, Monica Babes-Vroman, Marie desJardins, Ruoyuan Gao, Michael Littman, James MacGlashan and Smaranda Muresan (UMBC & Brown). Learning to Interpret Natural Language Instructions

Viet-An Nguyen, Jordan Boyd-Graber and Philip Resnik (UMCP). Lexical and Hierarchical Topic Regression

11:20-12:10 Poster session I

Posters

12:10-12:40 Lunch

12:40-1:40 Panel

"How to be a successful PhD student and and transition to a great job"
Marie desJardins (UMBC)
Mark Dredze (JHU)
Claudia Pearce (DoD)
Ian Soboroff (NIST)
Hanna Wallach (UMass)


1:50-3:10 Oral presentations II

Qingqing Cai and Alexander Yates (Temple). Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

William Yang Wang and William W. Cohen (CMU). Efficient First-Order Probabilistic Logic Programming for Natural Language Inference

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, Peter Clark (JHU & UPenn & AI2). Semi-Markov Phrase-based Monolingual Alignment

Wei Xu, Alan Ritter and Ralph Grishman (NYU). Gathering and Generating Paraphrases from Twitter with Application to Normalization

3:10-4:00 Poster session II

Posters

4:00-5:00 Breakout sessions

NLP in low resource settings, Ann Irvine (JHU)

Dynamic Programming: Theory and Practice, Alexander Rush (Columbia/MIT)

NELL: Never Ending Language Learning, Partha Pratim Talukdar (CMU)

5:00 Closing

5:15 - 7:00 Wine down

Wine and beer at Flat Tuesdays, UMBC Commons

Sunday, September 29, 2013

Originality

“It is better to fail in originality than to succeed in imitation.”
--Herman Melville

Friday, September 27, 2013

Integrity

“Truth at last cannot be hidden. Dissimulation is of no avail. Dissimulation is to no purpose before so great a judge. Falsehood puts on a mask. Nothing is hidden under the sun.” ― Leonardo da Vinci

“Stars hide your fires; let not light see my black and deep desires: The eyes wink at the hand; yet let that be which the eye fears, when it is done, to see” ― William Shakespeare, Macbeth

Wednesday, September 25, 2013

Ubuntu 12.04 LTS - No video mode activated

After 5 hours of debugging, I finally found a workaround for the issue.
In /etc/default/grub I commented out:
GRUB_DEFAULT
GRUB_HIDDEN_TIMEOUT
GRUB_HIDDEN_TIMEOUT_QUIET
GRUB_TIMEOUT
and changed:
GRUB_CMDLINE_LINUX_DEFAULT="nomodeset"
getting rid of the reference to "splash", because that is what was crashing. (Edits to this file take effect after regenerating the config with sudo update-grub.)
Ref 1
Ref 2
Ref 3
Ref 4
Ref 5
I also copied the GRUB fonts into place:
cd /usr/share/grub/
sudo cp *.pf2 /boot/grub
Finally I could boot, but the mouse pointer was MIA. This got it back:
sudo modprobe -r psmouse
sudo modprobe psmouse proto=imps

NOW I CAN FINALLY CATCH UP ON MY WORK!

Tuesday, September 24, 2013

References for graphs in R

This link provides simple but useful examples. Adding error bars is not completely straightforward; this link gives a good explanation.

Wednesday, September 18, 2013

Data Fusion

My poster on data fusion.

How to write a good research paper and give a good research talk

Good reference for research paper and presentation advice.

Tuesday, September 17, 2013

Interesting Blog for Graduate Students

I found a few posts that were useful here.

Sunday, September 15, 2013

IEEE International Conference on Semantic Computing 2013

The IEEE ICSC.

NLP Resources

I am spending time this semester learning NLP since my work overlaps with it. I found a free ebook for learning NLP in Python. Another link for this book is here.
At the UMBC NLP course web site there are also slides available.
In particular, I have been working with OpenNLP, which so far has been very useful. However, I am now doing in-document coreference and having a lot of problems. There is not really any documentation on the in-doc coref API calls, but I did find someone who dug into the code a bit and found a way to do it. So far, though, my implementation runs forever when executed. I hope to have an update soon.

Never Ending Learning (NELL)

Attended a very interesting talk related to NELL. The work is from Carnegie Mellon. If you are interested in machine learning, particularly in how to perform learning over time, you may find this work of interest.

UMBC ebiquity group

Our group is performing interesting research by blending semantic computing with security, mobile devices, medical informatics and big data problems. Some of this research overlaps quite a bit with natural language processing. Read our publications to see the latest research coming from our lab.

Monday, September 9, 2013

Powerpoint Alternatives

I have been looking for an alternative to PowerPoint that works reasonably well. So far OpenOffice and LibreOffice are not sufficient, especially when working on a team of Windows users. The following article provides a good list. I don't have a suggestion yet but would love to hear what others are using....

MASC 2013 Call For Papers Still Open

The call for papers for the 2013 Mid-Atlantic Student Colloquium on Speech, Language and Learning is still open.

Tuesday, July 23, 2013

System76 Ubuntu laptops

I spent over 6 months researching laptops and finally decided to go with a System76 Ubuntu laptop. It has been just under a month, but I am loving my new Lemur. Review to come... System76 Lemur Ultra

Saturday, May 11, 2013

Bias and Variance Tradeoff

There is a great blog entry that describes this from a practical standpoint. It references the following lecture as well.
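As a quick refresher, for squared error the expected test error decomposes as follows, where f is the true function, fhat the learned model, and sigma^2 the irreducible noise variance:

E[(y - fhat(x))^2] = (E[fhat(x)] - f(x))^2 + E[(fhat(x) - E[fhat(x)])^2] + sigma^2
                   = bias^2 + variance + irreducible error

More complex models typically lower the bias term while raising the variance term, which is the tradeoff in question.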

Tuesday, April 9, 2013

How to store billions of triples...for free

So far I like http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted. I need this too: http://www.openrdf.org/doc/sesame2/users/ch07.html

LDA - Step by Step in R

I was able to get an example running very quickly with this tutorial: http://www.rtexttools.com/1/post/2011/08/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels.html

Topic Modeling - Great Resource

Excellent resource! http://www.cs.princeton.edu/~blei/topicmodeling.html

SVMs - Good Description

http://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html

Building Templates for Basic Entity Types

http://schema.org/Person http://schema.org/Organization http://nlp.cs.nyu.edu/ene/version7_1_0Beng.html

Learning Programming by means of Python

This is a very basic beginners' tutorial, good for those just learning to program. http://cscircles.cemc.uwaterloo.ca/

Monday, September 3, 2012

PCA, SVD, LSA

Great links to learn and understand these concepts:
PCA Tutorial
SVD and PCA discussion
LSA Tutorial
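To see the PCA/SVD connection concretely, here is a short NumPy sketch (random data, purely illustrative):

import numpy as np

X = np.random.rand(100, 5)       # 100 samples, 5 features
Xc = X - X.mean(axis=0)          # PCA requires centered data

# The right singular vectors of the centered data are the principal axes
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc.dot(Vt[:2].T)              # project onto the top 2 components
explained_var = S**2 / (len(X) - 1)    # variance captured by each component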

Saturday, February 25, 2012

Mining Twitter Article (Not technical)

This is an interesting article (http://isc.sans.edu/diary.html?storyid=5728) because it collects links related to mining Twitter. It is a little outdated and a few of the links do not work, but it is still semi-useful. Keep in mind that data sets that were previously available, for example http://snap.stanford.edu/data/twitter7.html, are no longer available due to a Twitter request (read about it here). However, you can still use the Twitter API to get data; you are just limited in the number of tweets you can get per day/session (can't remember which). Also useful is this article.

Paper Summary - Short text classification in twitter to improve information filtering

Short text classification in twitter to improve information filtering, B. Sriram, D. Fuhry, E. Demir and H. Ferhatosmanoglu, 2010
This paper describes research that classifies tweets using a reduced set of features. The approach tries to classify text into the following set of classes: "News, Events, Opinions, Deals, and Private Messages". The problem they address is the curse of dimensionality that results from trying to overcome the sparseness inherent in classifying Twitter messages. Other research typically uses external knowledge bases to support tweet classification; they argue this can be slow due to the need to query the external knowledge base excessively. Important points about this paper:
1. They provide a very useful discussion of Twitter and tweets
2. How they classify tweets is interesting and worth another review

Latex Tip - Controlling placement of figures and tables

This was a useful article for controlling placement of figures and tables. http://robjhyndman.com/researchtips/latex-floats/

Paper Summary - Linking Social Networks on the Web with FOAF: A Semantic Web Case Study

Linking Social Networks on the Web with FOAF: A Semantic Web Case Study, J. Golbeck and M. Rothstein, 2008
Another paper related to FOAF and the Semantic Web. This paper examines linking FOAF identities across social networks. They propose that it would be better to merge these accounts across sites so a user has one social network. This may have been an interesting idea in 2008 but I'm not sure if this is relevant now seeing as some social networks let you bring in contacts from other sites. In general, this paper can be excluded from further analysis.

Resource - Writing and Presenting Your Thesis or Dissertation

Writing and Presenting Your Thesis or Dissertation, S. Joseph Levine, Ph.D. http://www.learnerassociates.net/dissthes/ Another useful resource for writing the dissertation. They have a section on writing the proposal which I found useful.

Paper Summary - Twitter Sentiment Classification using Distant Supervision

Twitter Sentiment Classification using Distant Supervision, A. Go, R. Bhayani and L. Huang, 2009, Technical report, Stanford Digital Library Technologies Project
This paper relates to classifying sentiment found on Twitter. They use machine learning and construct training data using the Twitter API and the emoticons present in tweets. The standard :) and :( are used to determine whether a tweet contains positive or negative sentiment. The key points: 1) they picked an efficient way to construct their training sets, 2) tweets are harder to classify because their length cannot exceed 140 characters, and 3) their results were promising for classifying the sentiment of tweets.

Friday, February 17, 2012

Data Mining Resources

This thread has a lot of useful links for resources to help one studying data mining. It is mainly to help one build the mathematical background.

Friday, February 10, 2012

Data Mining and Machine Learning

I wanted to understand the distinction between data mining and machine learning. This presentation (Machine Learning and Data Mining: 01 Data Mining) is useful.

Tuesday, February 7, 2012

Khan Academy

If you need to review concepts from Calculus, Linear Algebra, or Probability, this is a good resource:
Khan Academy
I used this to review Linear Algebra concepts for my Data Mining course. They have quite a few topics.

Cheers.

Wednesday, January 4, 2012

Advice Collection

I came across a useful collection of links for Ph.D. students. It offers dissertation advice, presentation advice, and more....I read some of these articles in the past but it is nice to have a central location for reference.

Saturday, December 24, 2011

Neural Network Framework for Java

I spent the last month learning how to use Self-Organizing Maps (SOMs) for my Neural Network course. I used a SOM to perform instance matching (which is not typically what it is used for), with the intention that it could possibly be used in a 2-level fashion or with some other supervised approach. Think of a SOM as a clustering technique: it maps high-dimensional data into a lower-dimensional (typically 2-D) space, which lets you see the data visually in 2-D. The beauty of a SOM is that it is unsupervised, which means you do not need to specify the desired output for your training set.

It outperformed K-Means when running comparison tests using the OAEI IIMB benchmark, both in F-Measure scores and in CPU time.


I used Encog for the Neural Net framework. I experienced a lot of memory issues but for the most part I was quickly running a SOM, K-Means and SVM comparison test.

For the next test, I think using Matlab, Weka or R might be a better approach. As much as I like to keep things nice and clean in the code, I quickly run into memory issues as I increase the number of instances to test.

Code Sample for Encog SOM (based on the Encog example):

// Encog 3.x imports (package layout assumed; check against your Encog version)
import org.encog.Encog;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLData;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.som.SOM;
import org.encog.neural.som.training.basic.BasicTrainSOM;
import org.encog.neural.som.training.basic.neighborhood.NeighborhoodSingle;

// trainingInput and input are double[][] arrays prepared by the caller

// Set up the training data; no desired output in this case (unsupervised)
MLDataSet training = new BasicMLDataSet(trainingInput, null);

// Create the SOM neural network with an input count and an output count.
// This is basically your input node size and output node size.
// One point of improvement.
SOM network = new SOM(inputNodeSize, outputNodeSize);

// Reset the network
network.reset();

// Here you specify your parameters, i.e. learning rate and neighborhood function.
// I used NeighborhoodSingle here, but clearly that is not the best choice;
// the next round of tests will use an RBF and a GaussianFunction:
//   new NeighborhoodRBF(sizes, RBFEnum.Gaussian)
// 0.7 for the learning rate is not unreasonable.
BasicTrainSOM train = new BasicTrainSOM(
        network,
        0.7,
        training,
        new NeighborhoodSingle());

// Run the training iterations (the snippet as originally posted skipped this step)
for (int iteration = 0; iteration < 100; iteration++) {
    train.iteration();
}

// Store the winner in the slot reserved in the 2-D array; the calling code
// lumps together instances that have the same winner to determine which
// instances are 'similar'.
double[][] newItems = new double[input.length][];
int i = 0;
for (double[] item : input) {
    item[item.length - 2] = network.winner(new BasicMLData(item));
    newItems[i] = item;
    i++;
}
Encog.getInstance().shutdown();

Saturday, August 20, 2011

Paper Summary - Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference - Part 1

Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference, A. McCallum and B. Wellner


This paper is interesting. They make the point that pairwise coreference decisions may not always be independent of one another; one may be able to resolve inconsistencies by using a dependence model. They mention earlier work, the Relational Probabilistic Model, which captures this dependence. However, since it is a generative model, they state this could lead to complexities due to many features with varying degrees of granularity. They briefly discuss hidden Markov models and conditional random fields, and Relational Markov networks as a similar model with improved classification.

They then discuss their own work specifically, which is "three conditional undirected graphical models for identity uncertainty" that make the coreference decisions. Their first model connects mentions, entity-assignments, and each attribute of the mention; edges indicate dependence. There is the concept of a clique: parameters may be part of different cliques, which results in patterns of parameters called clique templates. Parts of the graph that depend on a number of entities are removed and replaced with random variables indicating coreference (read this paper again to make sure we are clear on this). Per-entity attribute nodes are removed and replaced with attributes of the mention. They then use graph partitioning. There is a lot in this paper, and it really requires another read to understand their methods better.

Paper Summary - Disambiguation and Filter Methods in Using Web Knowledge for Coreference Resolution

Disambiguation and Filter Methods in Using Web Knowledge for Coreference Resolution, O. Uryupina and M. Poesio

They describe how they use Wikipedia and Yago to increase their coreference resolution performance, using BART to support their efforts. Their classification instances consist of an anaphor and a potential antecedent, as they describe. Using the associated feature vectors, they use a maximum entropy classifier to determine coreference. They used Wikipedia to improve their aliasing algorithm, which performs string matching functions; Wikipedia-based information is used as a feature and to disambiguate mentions. They use Yago to supplement their efforts when there are too few features to make any reasonable decision about coreference; Yago information is also incorporated as a feature. They tested using ACE, with reasonable scores.

Using these publicly available knowledge bases appears to improve performance (2-3% in this case). Something to think about....

Monday, August 8, 2011

Stanford Online AI Course

This course is offered for Fall 2011 semester and the instructors are Sebastian Thrun and Peter Norvig. It should be a good class. The formal title is "Introduction to Artificial Intelligence".

Join

Wednesday, July 20, 2011

Boom Time In Silicon Valley?

In case you missed this over the weekend, apparently in Silicon Valley the next big boom is occurring, at least according to the LA Times.

For those keeping their eye on the market for that special time to start-up, this may be good news. Still too early to know for sure.

Friday, July 15, 2011

Intuition

There was an interesting question posed on the LinkedIn AGI group discussion board:
Has anyone put out a suggestion on how to implement an "Intuition" engine?

Responses were also quite interesting...

Has anyone successfully built an intuition engine? The responses ranged from a weak description of how one might do this to long detailed descriptions of current research that has been ongoing for many years.

If this sounds interesting, you may want to join the LinkedIn group AGI — Artificial General Intelligence.

Or go to the AGI Conference in August.

Linked Data Paper Summary

"Linked Data - The Story So Far", C. Bizer, T. Heath, T. Berners-Lee, 2009, http://eprints.ecs.soton.ac.uk/21285/1/bizer-heath-berners-lee-ijswis-linked-data.pdf


My work will support working with linked data, so I am attempting to build a better understanding of this topic.

This paper is good for providing a very basic understanding of linked data. If you are new to the concept of linked data, this is a good starting paper to read. It provides basic principles, examples, and isn't too technical.

If you already have a good understanding of the basics, skip this paper. I read it in about 10 minutes and it was pretty much just a review of what I already knew.

Hadoop Meet-up - Large Scale Graph Processing On HBase and Map/Reduce on Greenplum

I attended the Hadoop Meet-up on Tuesday titled "Large Scale Graph Processing On HBase and Map/Reduce on Greenplum". You can view the event here.

I attended this event for two reasons: It has been a couple of years since I worked intimately with Hadoop and I wanted to see how others are using it. I was also hoping the discussion on Large Scale Graph Processing would be useful for my research.

Though the presentations were somewhat stimulating, I didn't find much I could use for my work.

Tuesday, June 21, 2011

Canopy Clustering

"Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", McCallum,Nigam,Ungar,http://www.kamalnigam.com/papers/canopy-kdd00.pdf

This paper discusses a different type of clustering called canopy clustering. It is an interesting idea: there are two thresholds, T1 > T2, and a 'cheap' distance metric. Pick one point from the list and compare it with all the other points. When the distance between the two points falls within T1, put the point into the canopy; if the distance also falls within T2, remove the point from the list. We generate canopies this way, working through the list until it is empty.

We can then apply our second level of clustering within each canopy, and we are pretty much guaranteed that if two points do not fall into the same canopy, they are likely not co-referent and therefore do not need to be evaluated against each other.

This is efficient and elegant. Currently the only implementation of canopy clustering that I have found is in Mahout. I am building my own implementation, though, to get a feel for how well it works; a sketch of the procedure is below.
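A minimal Python sketch of the procedure as described above, with cheap_distance standing in for whatever inexpensive metric you choose:

def canopy_clusters(points, t1, t2, cheap_distance):
    # t1 > t2; points whose distance falls between t2 and t1 of a center
    # can appear in several canopies, which is why canopies overlap
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)      # pick an arbitrary point as the center
        canopy = [center]
        survivors = []
        for p in remaining:
            d = cheap_distance(center, p)
            if d < t1:
                canopy.append(p)       # close enough to join this canopy
            if d >= t2:
                survivors.append(p)    # points within t2 are removed from the list
        remaining = survivors
        canopies.append(canopy)
    return canopies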

SimHash: Hash-based Similarity Detection

"SimHash: Hash-based Similarity Detection", Sadowski, Levin, 2007.

This paper outlines a hash algorithm that can be used for similarity detection. Most hash algorithms are designed for low collision rates, so hash values for similar strings can vary quite a bit. This hash sets out to achieve the opposite: higher collision rates, with hash keys for similar strings being similar if not the same.

If you cannot use term frequency and need a numeric representation of a string for statistical processing, what can you use? An integer-based hash is one way to achieve this, though in my opinion it is not the most sophisticated of approaches.

An implementation of this algorithm showed that it is a reasonable approach for hashing strings with the intent of determining similarity. I found minor issues, which I will address by altering the algorithm.
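For flavor, here is the general similarity-hashing construction in Python (this is the Charikar-style simhash, sketched as an illustration, not necessarily the exact algorithm from this paper):

import hashlib

def simhash(tokens, bits=64):
    # Accumulate a per-bit vote: +1 if a token's hash has the bit set, -1 otherwise
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the majority vote of each bit position
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Similar token sets produce fingerprints with a small Hamming distance
    return bin(a ^ b).count('1')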