Tuesday, March 30, 2010

Paper Summary - Data Mining: An Overview from a Database Perspective

M. Chen and J. Han and P. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996

This is a seminal paper about mining information from large databases. It is a survey of data mining techniques from a database researcher's perspective.

The paper discusses key features and challenges:


  • Different types of data
  • Efficiency and Scalability of algorithms
  • Accuracy and usefulness of results
  • How results are conveyed
  • Multiple Abstraction Levels
  • Mining different sources
  • Privacy and security



They go on to classify different types of data mining schemes. Schemes can be classified according to the data they examine, the kind of knowledge they mine, and the technique they employ. This paper focuses on the kind of knowledge being mined:


  • Association rules
  • Data Generalization and Summarization
  • Classification of huge amounts of data
  • Data Clustering
  • Pattern based similarity
  • Path traversal patterns


It describes each of these items in great detail. This is a great paper for getting a good foundation on this topic. It is quite long but detailed. I don't see any mention of a confidence approach.

Paper Summary - Link-based text classification

Q. Lu and L. Getoor, "Link-based text classification", IJCAI workshop on text-mining and link-analysis, 2003

This paper examines machine learning over linked objects, using the links as additional information for the classifier. They present a framework that models link distributions, using a logistic regression model for both content and links. They found that using links improved the accuracy of the classifier.

They use an iterative classification algorithm since the attributes can be correlated. There is a joint distribution between links and content attributes.

The main points in the paper are:

* The statistical framework models link distributions
* They show through results how this improved accuracy of the classifier
* They show an evaluation of the iterative classification algorithm
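
The iterative idea can be sketched roughly as follows. This is only an illustration of the inference loop, not the authors' implementation: it assumes binary labels, a single scalar content feature per document, and hand-picked (not learned) logistic weights. The names `link_feature`, `iterative_classification`, `w_content`, `w_link` and `bias` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def link_feature(node, neighbors, labels):
    """Fraction of a node's neighbors currently labeled 1 (0.5 if it has none)."""
    linked = neighbors.get(node, [])
    if not linked:
        return 0.5
    return sum(labels[n] for n in linked) / len(linked)

def iterative_classification(content, neighbors, w_content, w_link, bias, max_iters=10):
    # Step 1: bootstrap every label from the content feature alone.
    labels = {n: int(sigmoid(w_content * x + bias) >= 0.5) for n, x in content.items()}
    # Step 2: re-classify using content plus the neighbors' current labels,
    # repeating until no label changes (or max_iters is reached).
    for _ in range(max_iters):
        changed = False
        for node, x in content.items():
            frac = link_feature(node, neighbors, labels)
            score = sigmoid(w_content * x + w_link * (frac - 0.5) + bias)
            new_label = int(score >= 0.5)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Made-up toy graph: "c" has a weak negative content signal but links to "a" and "b".
content = {"a": 2.0, "b": 1.5, "c": -0.2, "d": -2.0, "e": -1.5}
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["e"], "e": ["d"]}
labels = iterative_classification(content, neighbors, w_content=1.0, w_link=2.0, bias=0.0)
```

In this toy graph, "c" would be labeled 0 from content alone, but its links to two positively labeled neighbors flip it to 1 on the first pass.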

Paper Summary - Unsupervised named-entity extraction from the web: An experimental study

"Unsupervised named-entity extraction from the web: An experimental study", O. Etzioni and M. Cafarella and D. Downey and S. Kok and A. Popescu and T. Shaked and S. Soderland and D. Weld and A. Yates, Artificial Intelligence, 91-134, 2005

This paper describes a system that takes an unsupervised approach to entity extraction. It describes the architecture of the system and defines general principles for extraction. They present three ways to improve recall and extraction rate without compromising precision:
  • Pattern Learning - learns domain-specific extraction rules
  • Subclass Extraction - identifies sub-classes
  • List Extraction - locates lists of class instances

The paper starts with the motivation for using an unsupervised approach. It provides background on information extraction and some of the complexities involved. KNOWITALL uses extraction patterns and pointwise mutual information (PMI) statistics, calculated from web search hit counts, which measure the degree of correlation between pairs of words. The PMIs are used as features for a classifier. It consumes information from search engines.
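
As a rough illustration of the hit-count idea, here is the textbook PMI computed from counts. The paper's exact formulation may differ from this definition. The function name `pmi`, the `total_pages` parameter, and the counts in the example are all made up.

```python
import math

def pmi(hits_both, hits_x, hits_y, total_pages):
    """Textbook pointwise mutual information estimated from hit counts.

    hits_both:   pages matching both strings together (hypothetical count)
    hits_x/y:    pages matching each string on its own
    total_pages: assumed size of the index, used to turn counts into probabilities
    """
    if min(hits_both, hits_x, hits_y) <= 0 or total_pages <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    p_xy = hits_both / total_pages
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    return math.log2(p_xy / (p_x * p_y))

# Made-up counts: the two strings co-occur far more often than chance predicts.
score = pmi(hits_both=50, hits_x=100, hits_y=200, total_pages=10_000)
```

A score near zero suggests the two strings co-occur about as often as chance predicts, while a large positive score suggests they are strongly associated.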

It uses a bootstrapping method that takes a set of predicates as input. Each class is given labels and a symbolic name. The Bootstrapper "uses the labels to instantiate extraction rules". The Extractor formulates queries to send to a search engine and the Assessor evaluates the extractions.
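
A minimal sketch of what instantiating rules from labels might look like, assuming a handful of generic text templates. The templates, the `instantiate_rules` function, and the dictionary layout are hypothetical and are not taken from the system itself.

```python
# Generic, domain-independent templates (hypothetical examples).
GENERIC_PATTERNS = [
    "{label_plural} such as",
    "{label_plural} including",
    "is a {label}",
]

def instantiate_rules(predicate, labels):
    """Fill each generic template with each (singular, plural) class label.

    Returns one rule per (label, template) pair; each rule carries the
    search query an extractor could send to a search engine.
    """
    rules = []
    for singular, plural in labels:
        for template in GENERIC_PATTERNS:
            query = template.format(label=singular, label_plural=plural)
            rules.append({"predicate": predicate, "query": query})
    return rules

# The symbolic name "City" with one pair of labels.
rules = instantiate_rules("City", [("city", "cities")])
for rule in rules:
    print(rule["predicate"], "->", rule["query"])
```

Keeping the templates separate from the labels is what lets the same small rule set be reused for any new class.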

The paper is quite extensive and describes the system well. Overall this paper is a good read and provides a number of tidbits for new ideas.

Thursday, March 18, 2010

Paper Summary - Performing Object Consolidation on the Semantic Web Data Graph

Hogan, A.; Harth, A.; and Decker, S. 2007. Performing object
consolidation on the semantic web data graph. In Proceedings
of I3: Identity, Identifiers, Identification. Workshop at 16th
International World Wide Web Conference (WWW2007).

This paper discusses identity and data integration. They present a method for merging instances across multiple data sources (**large scale**). They describe how they determine that two instances represent the same entity using inverse functional properties. Their dataset includes over 72 million instances (wow).
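
To make the inverse functional property idea concrete, here is a small sketch that groups instances sharing a value for such a property, using a union-find structure. It is an illustration under simple assumptions (in-memory triples, exact string matching on values), not the authors' large-scale method. The `consolidate` function and the `ex:` identifiers are made up.

```python
from collections import defaultdict

def consolidate(triples, inverse_functional):
    """Group subjects that share a value for any inverse functional property."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (property, value) -> first subject seen with that pair
    for subject, prop, value in triples:
        find(subject)  # register every subject, even ones that never merge
        if prop in inverse_functional:
            key = (prop, value)
            if key in first_seen:
                union(first_seen[key], subject)
            else:
                first_seen[key] = subject

    groups = defaultdict(set)
    for subject in parent:
        groups[find(subject)].add(subject)
    return list(groups.values())

triples = [
    ("ex:a", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:b", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:b", "foaf:homepage", "http://example.org/alice"),
    ("ex:c", "foaf:homepage", "http://example.org/alice"),
    ("ex:d", "foaf:name", "Alice"),
    ("ex:e", "foaf:name", "Alice"),
]
groups = consolidate(triples, {"foaf:mbox", "foaf:homepage"})
```

Here `ex:a`, `ex:b` and `ex:c` end up in one group because the merges chain through `ex:b`, while `ex:d` and `ex:e` stay separate since a shared name is not inverse functional.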

Key points:

  • There isn't much agreement on the use of common URIs to identify entities (optional in RDF), so a URI can represent multiple instances at times
  • There is a lack of formal specification for determining equivalences among entities
  • Existing methods that perform object consolidation rely upon probabilistic methods
  • Inverse functional properties are widely used in the Semantic Web world





Paper Summary - On Searching and Displaying RDF Data from the Web

Harth, A., and Gassert, H. 2005. On searching and displaying RDF data
from the web. In Proceedings of Demos and Posters of the 2nd
European Semantic Web Conference.

This document describes an application built to show RDF in a user interface. It describes how it gathered the data and also a smushing technique it used. The goal was to determine if it was feasible to integrate data without an underlying ontology. They describe their challenges with the data in this paper.

The key points of this paper:


  1. To determine if it is possible to integrate data without using an underlying ontology to do so
  2. They describe their data retrieval process
  3. They describe how they smush data
  4. They describe their application and how they make this data available for a user interface


The paper is short and missing a lot of details. Though I referenced this in the "A Machine Learning Approach to Linking FOAF Instances" document, it really was just to give an example of a smushing process.

Wednesday, March 3, 2010

Paper Summary - Semantics for the Semantic Web: The Implicit, the Formal and the Powerful

A. Sheth, C. Ramakrishnan, and C. Thomas. Semantics for the semantic web: The implicit, the
formal, and the powerful. Semantic Web and Information Systems, 1(1), Idea Group, 2005.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.9929&rep=rep1&type=pdf

This paper discusses both the limitations of relying upon description logics alone and the need to utilize different semantics to handle the complexity of exploiting semantic web data. In particular it organizes these semantics into three categories:
  • Implicit - from patterns in the data; examples include co-occurrence and links
  • Formal - a formal language with syntactic rules; Description Logic falls under this category
  • Powerful - statistical analysis that uses patterns in the data

What is interesting about this paper are the following statements:

"Even though it is desirable to have a consistent knowledge base, it becomes impractical as the size of the knowledge base increases or as knowledge from many sources is added. It is rare that human experts in most scientific domains have a full and complete agreement. In these cases it becomes more desirable that the system can deal with inconsistencies."

"Sometimes it is useful to look at a knowledge base as a map. This map can be partitioned according to different criteria, e.g. the source of the facts or their domain. While on such a map the knowledge is usually locally consistent, it is almost impossible and practically infeasible to maintain a global consistency. Experience in developing the Cyc ontology demonstrated this challenge. Hence, a system must be able to identify sources of inconsistency and deal with contradicting statements in such a way that it can still produce derivations that are reliable."

They then go on to discuss current approaches to deal with this inconsistency.

  • Probabilistic reasoning
  • Possibilistic reasoning
  • Fuzzy reasoning

It highlights drawbacks of these methods and argues for standardization in this area.

The paper then discusses correlating semantic capabilities with types of semantics in relation to the bootstrapping and utilization phases.

The last part of this paper discusses information integration, information retrieval and extraction, data mining and analytical applications.

Some of the interesting papers it references:

R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993.

D. Barbará, H. Garcia-Molina and D. Porter. The Management of Probabilistic Data. IEEE Transactions on Knowledge and Data Engineering, Volume 4, Issue 5 (October 1992), Pages 487-502

Jochen Heinsohn: Probabilistic Description Logics. UAI 1994: 311-318.
Int’l Journal on Semantic Web & Information Systems, 1(1), 1-18, Jan-March 2005

Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. "Two Supervised Learning Approaches for Name Disambiguation in Author Citations", in Proceedings of ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004), pages 296-305, 2004.

Birger Hjorland. Information retrieval, text composition, and semantics. Knowledge Organization 25(1/2):16-31, 1998

Vipul Kashyap, Amit Sheth: Semantic Heterogeneity in Global Information Systems: The Role of Metadata, Context and Ontologies, Cooperative Information Systems 1996

M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. In SIAM International Conference on Data Mining (SDM-04), 2004.

Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)

B. Omelayenko. Learning of Ontologies for the Web: the Analysis of Existent approaches. In Proceedings of the International Workshop on Web Dynamics, 2001.

Erhard Rahm, Philip A. Bernstein. A Survey of Approaches to Automatic Schema Matching. In VLDB Journal 10: 4, 2001

Amit P. Sheth, Sanjeev Thacker, Shuchi Patel: Complex relationships and knowledge discovery support in the InfoQuilt system. VLDB J. 12(1): 2-27 (2003)

J. Townley, The Streaming Search Engine That Reads Your Mind, August 10, 2000. http://smw.internet.com/gen/reviews/searchassociation/

William A. Woods: “Meaning and Links: A Semantic Odyssey”. Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference (KR2004), June 2-5, 2004. pp. 740-742

Lotfi A. Zadeh. Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. In Journal of Statistical Planning and Inference 105 (2002) 233-264.