I can't say enough about the list of resources here.
Tuesday, April 12, 2016
Friday, April 1, 2016
Image Thresholding in Python
I found this article to be very useful.
http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
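So I don't forget the gist, here is a minimal sketch of the calls the tutorial covers (this assumes OpenCV and matplotlib are installed; 'gradient.png' is just a placeholder filename):
import cv2
import matplotlib.pyplot as plt

# read the image as grayscale; 'gradient.png' is a placeholder filename
img = cv2.imread('gradient.png', cv2.IMREAD_GRAYSCALE)

# simple global threshold: pixels above 127 become 255, the rest become 0
ret, global_thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Otsu's method picks the threshold automatically from the histogram
ret2, otsu_thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# adaptive thresholding computes a threshold per neighborhood (11x11 blocks here)
adaptive = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)

for i, (title, result) in enumerate([('global', global_thresh),
                                     ('otsu', otsu_thresh),
                                     ('adaptive', adaptive)]):
    plt.subplot(1, 3, i + 1)
    plt.imshow(result, cmap='gray')
    plt.title(title)
plt.show()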
Distilling the Knowledge in a Neural Network
http://arxiv.org/abs/1503.02531
A very simple way to improve the
performance of almost any machine learning algorithm is to train many
different models on the same data and then to average their predictions.
Unfortunately, making predictions using a whole ensemble of models is
cumbersome and may be too computationally expensive to allow deployment
to a large number of users, especially if the individual models are
large neural nets. Caruana and his collaborators have shown that it is
possible to compress the knowledge in an ensemble into a single model
which is much easier to deploy and we develop this approach further
using a different compression technique. We achieve some surprising
results on MNIST and we show that we can significantly improve the
acoustic model of a heavily used commercial system by distilling the
knowledge in an ensemble of models into a single model. We also
introduce a new type of ensemble composed of one or more full models and
many specialist models which learn to distinguish fine-grained classes
that the full models confuse. Unlike a mixture of experts, these
specialist models can be trained rapidly and in parallel.
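This isn't the paper's code, just a tiny numpy sketch of the soft-target idea (temperature-scaled softmax) that distillation is built on; the logits are made up:
import numpy as np

def softmax(logits, temperature=1.0):
    # a temperature > 1 softens the distribution, exposing the small probabilities
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical teacher (ensemble) and student logits for one example over 3 classes
teacher_logits = np.array([8.0, 2.0, 0.5])
student_logits = np.array([5.0, 3.0, 1.0])

T = 4.0
teacher_soft = softmax(teacher_logits, T)   # soft targets from the teacher
student_soft = softmax(student_logits, T)

# distillation loss: cross-entropy between the soft targets and the student's
# temperature-scaled predictions (the paper also mixes in a loss on the true labels)
distill_loss = -np.sum(teacher_soft * np.log(student_soft))
print(teacher_soft, distill_loss)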
Data Exploration with Kaggle Scripts, Data Science, Data Exploratory Courses
This might be interesting at a surface level; I haven't evaluated it yet.
Data Exploration with Kaggle Scripts course.
Again more surface level stuff.
Intermediate Python for Data Science course.
This actually might have more substance, it is taught by a JHU professor.
Coursera course on Exploratory Data Analysis
Labels:
Courses,
Data Analysis,
data exploration,
Data Science,
kaggle,
Python
Thursday, March 31, 2016
A Survey of Graph Theory and Applications in Neo4J - Talk
This is a link to a talk given at a recent meet-up in Arlington, VA.
The talk starts out with pretty introductory material, but as it progresses it gets more interesting. Definitely worth a read during a treadmill session.
Here is another relevant link.
My opinion of Neo4J, after using it for a year for experimental purposes, is that it is a decent application, but I highly doubt its scalability for big data. I never tested this; it is a hunch based on my use.
Also, if you are using Neo4J to store triples: no, don't do that, it is way too much work. Just use a triple store.
Monday, March 28, 2016
CS231n: Convolutional Neural Networks for Visual Recognition Winter Course Project Report
There are lots of interesting reads on this page. And this is a great course to take if you are researching deep learning for image processing.
Tuesday, March 22, 2016
Sunday, March 20, 2016
3 Minute Thesis Competition - 3MT
Can you explain your dissertation in 3 minutes?
UMBC has a 3MT competition this Wednesday.
BALTIMORE, MD
If you are preparing for a 3MT, this is a good resource.
Other good 3MT videos:
2010 Trans-Tasman 3MT Winner - Balarka Banerjee from Three Minute Thesis (3MT®) on Vimeo.
Thursday, March 17, 2016
Markdown
I am starting to use Markdown more. I wanted to know why I should care about Markdown; this article gives a good view of why to use it, and there is a link to a tutorial.
Read it here.
Tuesday, March 15, 2016
Flask
Flask...
"Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions."
I toyed around with the tutorial and was able to get a few simple apps running. I suppose if you are interested in building web sites, this might be interesting to try. I haven't determined if it is useful for anything else.
http://flask.pocoo.org/
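For reference, a minimal sketch of the kind of app the tutorial gets you to (the route and message are just placeholders):
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    # return a plain-text response for the root URL
    return 'Hello, Flask!'

if __name__ == '__main__':
    # run the built-in development server (not meant for production)
    app.run(debug=True)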
"Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions."
I toyed around with the tutorial and was able to get a few simple apps running. I suppose if you are interested in building web sites, this might be interesting to try. I haven't determined if it is useful for anything else.
http://flask.pocoo.org/
Monday, March 14, 2016
Spring Break.....
I love working on campus during spring break. Front space parking, empty lab, no line for coffee.....ahhh nirvana....
No coffee! Bah!
Thursday, March 10, 2016
Ugh, dissertation
In those moments when you are frustrated with your dissertation, breathe, and know there are others feeling the same pain.....
The valley....
Ride the wave to finish this thing...
Tuesday, March 8, 2016
DL4J
I have been using Java for a long time but I find DL4J to be a bit cumbersome to use. I prefer Torch/Lua or Theano for deep learning.
However because Java has been such a significant part of my life for so long, I will not give up on DL4J.
More to come once I get this working.
In the meantime, here are a few links, so I can close those tabs:-)....
word2vec in DL4J
deep autoencoders in DL4J
nd4j
I like popcorn and I like bag of words
I think I am going to like this too.
It's aimed at beginners and may not be very useful, but they said popcorn, so they have my attention....
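As a reminder of what the bag-of-words step looks like, here is a minimal scikit-learn sketch with made-up reviews (the actual tutorial builds its counts from the Kaggle movie-review data, and this assumes a recent scikit-learn):
from sklearn.feature_extraction.text import CountVectorizer

# made-up toy "reviews"; the real tutorial uses the Kaggle movie-review data
docs = ["the popcorn was great",
        "the movie was terrible",
        "great movie, great popcorn"]

vectorizer = CountVectorizer()           # tokenize and count word occurrences
X = vectorizer.fit_transform(docs)       # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())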
Labels:
bag of words,
Data Science,
Tutorial,
word2vec
matplotlib
Examples using matplotlib.
Tutorial for matplotlib.
A little bit on density plots.
And an introduction to plotting in Python.
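A small sketch of a density plot, using scipy's gaussian_kde on random data (purely illustrative):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# random sample standing in for real data
data = np.random.normal(loc=0.0, scale=1.0, size=1000)

# histogram for comparison
plt.hist(data, bins=30, density=True, alpha=0.4, label='histogram')

# kernel density estimate evaluated on a grid
kde = gaussian_kde(data)
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, kde(xs), label='density (KDE)')

plt.legend()
plt.show()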
Labels:
density plots,
matplotlib,
plots,
Python,
Tutorial
Thursday, March 3, 2016
Deep Learning Resources
Great Papers:
http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf
http://deeplearning.net/reading-list/
Tutorials:
TensorFlow:
https://www.tensorflow.org/
Theano:
http://deeplearning.net/
http://deeplearning.stanford.
Important Names and associated tutorials/talks:
Hinton:
https://www.cs.toronto.edu/~hinton/nntut.html
LeCun:
http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf
Socher:
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Common Datasets:
IMAGENET - http://www.image-net.org/challenges/LSVRC/
Courses:
https://www.udacity.com/course/deep-learning--ud730 - Basic but uses TensorFlow, good to get a basic understanding
https://www.coursera.org/course/neuralnets - Provides great intuition, a little more challenging
https://cs231n.github.io/ - Great for understanding deep learning for images
Monday, January 25, 2016
Wednesday, November 25, 2015
word2vec tutorials and references
This is a nice way to get exposed to word2vec and to reduce the number of tabs in my browser:-)
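A minimal gensim sketch of training word2vec on a toy corpus (the sentences are made up, the parameter names assume a recent gensim, and real training needs far more text):
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["deep", "learning", "needs", "data"],
    ["word", "embeddings", "capture", "meaning"],
    ["deep", "learning", "uses", "embeddings"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

# look up the learned vector for a word and its nearest neighbours
print(model.wv["deep"][:5])
print(model.wv.most_similar("deep", topn=2))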
Labels:
Deep Learning,
Neural Networks,
Word Embeddings,
word2vec
Friday, November 20, 2015
GenSim and LDA
This is a nice, simple little tutorial on using GenSim to run LDA.
GenSim LDA Tutorial
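Roughly what the tutorial walks through, sketched on a toy corpus (assumes a recent gensim; the documents are made up):
from gensim import corpora, models

# toy documents, already tokenized and stop-word free
docs = [
    ["graph", "theory", "nodes", "edges"],
    ["topic", "models", "latent", "dirichlet"],
    ["graph", "edges", "paths"],
    ["topic", "models", "corpus"],
]

dictionary = corpora.Dictionary(docs)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics(num_words=4):
    print(topic)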
Friday, October 2, 2015
Saturday, April 25, 2015
Friday, March 13, 2015
Sunday, March 8, 2015
Thursday, February 26, 2015
Monday, February 23, 2015
Large Open Data Sets
Link 1 which is a list of large open data sets.
Link 2 - Common Crawl Corpus Amazon
Link 3 - Memtracker
Link 4 - Apache Access Logs
Link 5 - Clueweb09 Wiki
Link 6 - Click dataset
Link 7 - REDD
Link 8 - NASA
Link 9 - Dartmouth Atlas of Health Care
Link 10 - Data.gov Catalog
Link 11 - Data.gov Catalog - Complete
Link 12 - Awesome Public Datasets
Friday, February 6, 2015
Tuesday, January 13, 2015
Friday, January 9, 2015
My first teaching experience
I would have to say it was definitely a learning experience. I taught a database course to 40 students, mostly seniors and mostly male. I spent the semester researching and learning which techniques were best for teaching. I also learned how to create my own lectures, projects, and exams. This experience definitely gave me a taste of what it would be like to teach professionally. I also learned a bit about myself: by the end of the semester I no longer felt uncomfortable speaking in front of a group or answering questions and challenges on the fly. I was also impressed with many of the students in my class.
Now it is back to research!
And maybe teaching again in the future once I recover from this first one.
Friday, October 3, 2014
Taming Wild Big Data
Our latest paper for the AAAI Fall Symposium.
Abstract: Wild Big Data is data that is hard to extract, understand, and use due to its heterogeneous nature and volume. It typically comes without a schema, is obtained from multiple sources and provides a challenge for information extraction and integration. We describe a way to subdue Wild Big Data that uses techniques and resources that are popular for processing natural language text. The approach is applicable to data that is presented as a graph of objects and relations between them and to tabular data that can be transformed into such a graph. We start by applying topic models to contextualize the data and then use the results to identify the potential types of the graph’s nodes by mapping them to known types found in large open ontologies such as Freebase and DBpedia. The results allow us to assemble coarse clusters of objects that can then be used to interpret the links and perform entity disambiguation and record linking.
Labels:
Big data,
DBpedia,
Freebase,
LDA,
Semantic Web,
Wild Big Data
Tuesday, September 30, 2014
Thursday, September 11, 2014
Learning Julia
Julia is a dynamic programming language that is getting a bit of attention. I am running through a few tutorials and learning the language. Some resources are listed below in case you are interested....
Learn about julia
Quick tutorial
Another tutorial
Google group
Thursday, December 19, 2013
Tuesday, November 26, 2013
Monday, November 18, 2013
Just received notice Qualcomm Toq Smartwatch is available Dec 2nd
Qualcomm Toq Smartwatch is available as of Dec 2nd for a starting cost of $349.99 (OUCHY!).
Sunday, November 17, 2013
AAAI Symposium
I've been spending the weekend at the AAAI Symposium. There have been quite a few interesting talks.
John Laird gave an interesting talk on General Intelligence.
Andrew Ng also gave an interesting talk, on Deep Learning.
Labels:
AAAI,
AGI,
AI,
Deep Learning,
Neural Networks
Thursday, October 31, 2013
3D on the Web - Introduction to WebGL
This is an interesting talk offered through ACM.
Supplementary Learning Resources from Alain Chesnais
Collada:
Official website http://collada.org/
Tutorials https://collada.org/mediawiki/index.php/Portal:Tutorials
WebGL:
Official website http://www.khronos.org/webgl/
Tony Parisi's Tutorials http://learningwebgl.com/
Three.js:
Official website http://threejs.org/
Ilmari Heikkinen's Tutorial http://fhtr.org/BasicsOfThreeJS/#2
X3Dom:
Official website http://www.x3dom.org/
Introductory tutorial http://x3dom.org/docs/dev/tutorial/firststeps.html
Friday, October 25, 2013
Saturday, October 5, 2013
4D Printers
This is pretty cool: researchers are trying to add a 4th dimension to 3D printing. What they are trying to do is print material that can change over time, not just in shape but, based on some external stimulus, in behavior. I find this incredibly interesting.
The article is titled, "’4D printing’ adaptive materials" and is found on kurzweilai.net
Read the article at http://www.kurzweilai.net/4d-printing-adaptive-materials.
A great TED talk on this same topic:
Labels:
4D Printer,
Biology,
Convergence,
Nanotech,
Temporal Change
Thursday, October 3, 2013
Google Glass
Our lab, ebiquity, has been playing with Google Glass. So now I am hooked and wondering what sort of app I could develop for Google Glass. I've been doing a little reading on Google Glass and came across this blog. The developer, Lance Nanek, built an interesting implementation of panning using head movement.
Watch the video to learn more:
Very cool stuff.
More information for developers is here.
And yet more info for developers.
And if you want Google Glass, then keep up with the latest news.
Labels:
ebiquity,
Google Glass,
Panning,
Programming,
SDK
Wednesday, October 2, 2013
Mid-Atlantic Student Colloquium on Speech, Language and Learning
UMBC is hosting the Mid-Atlantic Student Colloquium on Speech, Language and Learning.
Register online.
The schedule is now posted.
Snapshot of the schedule:
09:00-09:45 Registration, set up
09:45-10:00 Opening
10:00-11:20 Oral presentations I
Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield and Jonathan Weese (UMBC & JHU). Semantic Textual Similarity Systems
Keith Levin, Aren Jansen and Ben Van Durme (JHU). Toward Faster Audio Search Using Context-Dependent Hashing
Shawn Squire, Monica Babes-Vroman, Marie Desjardins, Ruoyuan Gao, Michael Littman, James MacGlashan and Smaranda Muresan (UMBC & Brown). Learning to Interpret Natural Language Instructions
Viet-An Nguyen, Jordan Boyd-Graber and Philip Resnik (UMCP). Lexical and Hierarchical Topic Regression
11:20-12:10 Poster session I
Posters
12:10-12:40 Lunch
12:40-1:40 Panel
"How to be a successful PhD student and and transition to a great job"
Marie desJardins (UMBC)
Mark Dredze (JHU)
Claudia Pearce (DoD)
Ian Soboroff (NIST)
Hanna Wallach (UMass)
1:50-3:10 Oral presentations II
Qingqing Cai and Alexander Yates (Temple). Large-scale Semantic Parsing via Schema Matching and Lexicon Extension
William Yang Wang and William W. Cohen (CMU). Efficient First-Order Probabilistic Logic Programming for Natural Language Inference
Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, Peter Clark (JHU & UPenn & AI2). Semi-Markov Phrase-based Monolingual Alignment
Wei Xu, Alan Ritter and Ralph Grishman (NYU). Gathering and Generating Paraphrases from Twitter with Application to Normalization
3:10-4:00 Poster session II
Posters
4:00-5:00 Breakout sessions
NLP in low resource settings, Ann Irvine (JHU)
Dynamic Programming: Theory and Practice, Alexander Rush (Columbia/MIT)
NELL: Never Ending Language Learning, Partha Pratim Talukdar (CMU)
5:00 Closing
5:15 - 7:00 Wine down
Wine and beer at Flat Tuesdays, UMBC Commons
Monday, September 30, 2013
Sunday, September 29, 2013
Friday, September 27, 2013
Integrity
“Truth at last cannot be hidden. Dissimulation is of no avail. Dissimulation is to no purpose before so great a judge. Falsehood puts on a mask. Nothing is hidden under the sun.”
― Leonardo da Vinci
“Stars hide your fires; let not light see my black and deep desires: The eyes wink at the hand; yet let that be which the eye fears, when it is done, to see” ― William Shakespeare, Macbeth
Wednesday, September 25, 2013
Ubuntu 12.04 LTS - No video mode activated
After 5 hours of debugging, I finally was able to work around the issue.
In /etc/default/grub
Commented out:
GRUB_DEFAULT
GRUB_HIDDEN_TIMEOUT
GRUB_HIDDEN_TIMEOUT_QUIET
GRUB_TIMEOUT
Changed
GRUB_CMDLINE_LINUX_DEFAULT="nomodeset"
Get rid of the reference to "splash" because that is what is crashing.
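One step that I believe also belongs here (it isn't in my notes above): after editing /etc/default/grub, the change only takes effect once the grub configuration is regenerated:
sudo update-grub    # regenerates /boot/grub/grub.cfg so the new kernel options are used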
Ref 1
Ref 2
Ref 3
Ref 4
Ref 5
I also did the following:
cd /usr/share/grub/
sudo cp *.pf2 /boot/grub
Finally I could boot but the mouse pointer was MIA.
Did this to get it back:
sudo modprobe -r psmouse
sudo modprobe psmouse proto=imps
NOW I CAN FINALLY CATCH UP ON MY WORK!
Tuesday, September 24, 2013
References for graphs in R
Wednesday, September 18, 2013
How to write a good research paper and give a good research talk
Good reference for research paper and presentation advice.
Tuesday, September 17, 2013
Sunday, September 15, 2013
NLP Resources
I am spending time this semester learning NLP since my work overlaps with NLP. I found a free ebook for learning NLP in Python. Another link for this book is here.
At the UMBC NLP course web site there are also slides available.
In particular, I have been working with OpenNLP, which so far has been very useful. However, I am at the point of doing in-document coreference and having a lot of problems. There is not really any documentation on the in-document coreference API calls, but I did find someone who dug into the code a bit and found a way to do it. So far, though, when executing my implementation, it runs forever. I hope to have an update soon.
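On the Python side, here is a tiny sketch of the kind of thing the ebook's early chapters cover (this assumes NLTK is installed and its standard 'punkt' and tagger data packages are available):
import nltk

# one-time downloads; the names assume the standard NLTK data packages
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

text = "OpenNLP and NLTK both support tokenization and part-of-speech tagging."
tokens = nltk.word_tokenize(text)   # split the sentence into tokens
tags = nltk.pos_tag(tokens)         # tag each token with a part-of-speech label
print(tags)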
Never Ending Learning (NELL)
UMBC ebiquity group
Our group is performing interesting research by blending semantic computing with security, mobile devices, medical informatics and big data problems. Some of this research overlaps quite a bit with natural language processing.
Read our publications to see the latest research coming from our lab.
Monday, September 9, 2013
Powerpoint Alternatives
I have been looking for an alternative to PowerPoint that works reasonably well. So far OpenOffice and LibreOffice are not sufficient, especially when working with a team of Windows users. The following article provides a good list.
I don't have a suggestion yet but would love to hear what others are using....
Labels:
libreoffice,
Linux,
Openoffice,
Powerpoint,
windows
Tuesday, July 23, 2013
System76 Ubuntu laptops
I spent over 6 months researching laptops and finally decided to go with a System76 Ubuntu laptop. It has been just less than 1 month but I am loving my new lemur. Review to come...
System76 Lemur Ultra
Saturday, May 11, 2013
Bias and Variance Tradeoff
There is a great blog entry that describes this from a practical standpoint.
It also references the following lecture.
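To make the tradeoff concrete, here is a small numpy sketch (not from the blog entry) that fits polynomials of increasing degree to noisy data; the low degree underfits (high bias) and the high degree overfits (high variance):
import numpy as np

# toy data: a noisy sine curve
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x, y, degree)               # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # low degree: large train and test error (bias); high degree: tiny train error,
    # larger test error (variance)
    print(degree, round(train_err, 3), round(test_err, 3))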
Tuesday, April 9, 2013
LDA - Step by Step in R
I was able to get an example running very quickly with this tutorial.
http://www.rtexttools.com/1/post/2011/08/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels.html
Topic Modeling - Great Resource
Excellent resource!
http://www.cs.princeton.edu/~blei/topicmodeling.html
Building Templates for Basic Entity Types
http://schema.org/Person
http://schema.org/Organization
http://nlp.cs.nyu.edu/ene/version7_1_0Beng.html
Learning Programming by means of Python
This is a very basic beginners tutorial. Good for those just learning programming.
http://cscircles.cemc.uwaterloo.ca/
Monday, December 17, 2012
Monday, September 3, 2012
PCA, SVD, LSA
Great links to learn and understand these concepts:
PCA Tutorial
SVD and PCA discussion
LSA Tutorial
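A small numpy sketch of the connection the SVD/PCA discussion describes: PCA computed from the SVD of the centered data matrix (random data, purely illustrative):
import numpy as np

# random data: 100 samples, 5 features (purely illustrative)
X = np.random.randn(100, 5)

# center each feature, since PCA works on the covariance structure
Xc = X - X.mean(axis=0)

# SVD of the centered data: Xc = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# principal directions are the rows of Vt; the projections (scores) are U * S
components = Vt
scores = U * S

# the variance explained by each component comes from the singular values
explained_variance = (S ** 2) / (len(X) - 1)
print(explained_variance)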
Saturday, February 25, 2012
Mining Twitter Article (Not technical)
This is an interesting article (http://isc.sans.edu/diary.html?storyid=5728) because it contains interesting links related to mining Twitter. It is a little outdated and a few of the links do not work, but it is still semi-useful. Keep in mind that data sets that were previously available, for example http://snap.stanford.edu/data/twitter7.html, are no longer available due to a Twitter request (read about it here). However, you can still use the Twitter API to get data; you are just limited in the number of tweets you can get per day/session (can't remember which).
Also useful is this article.
Paper Summary - Short text classification in twitter to improve information filtering
Short text classification in Twitter to improve information filtering, B. Sriram and D. Fuhry and E. Demir and H. Ferhatosmanoglu, 2010
This paper describes research that classifies tweets using a reduced set of features. In this approach they try to classify text into the following set of classes: "News, Events, Opinions, Deals, and Private Messages". The problem they address is the curse of dimensionality that results from trying to overcome the sparseness issue in classifying Twitter messages. Other research typically uses external knowledge bases to support tweet classification; they argue this can be slow due to the need to excessively query the external knowledge base. Important points about this paper:
1. They provide a very useful discussion of Twitter and tweets.
2. How they classify tweets is interesting and worth another review.
Latex Tip - Controlling placement of figures and tables
This was a useful article for controlling placement of figures and tables.
http://robjhyndman.com/researchtips/latex-floats/
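The gist, as I use it: pass a placement specifier to the float environment and let LaTeX fall back through the options (the figure file name below is a placeholder):
\begin{figure}[htbp]  % try "here", then top of page, bottom of page, then a float page
  \centering
  \includegraphics[width=0.8\textwidth]{myplot.pdf}  % placeholder file name
  \caption{Example figure.}
  \label{fig:example}
\end{figure}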
Paper Summary - Linking Social Networks on the Web with FOAF: A Semantic Web Case Study
Linking Social Networks on the Web with FOAF: A Semantic Web Case Study, J. Golbeck and M. Rothstein, 2008
Another paper related to FOAF and the Semantic Web. This paper examines linking FOAF identities across social networks. They propose that it would be better to merge these accounts across sites so a user has one social network. This may have been an interesting idea in 2008 but I'm not sure if this is relevant now seeing as some social networks let you bring in contacts from other sites. In general, this paper can be excluded from further analysis.
Labels:
FOAF,
Research Paper Summaries,
Semantic Web,
Social Networks
Resource - Writing and Presenting Your Thesis or Dissertation
Writing and Presenting Your Thesis or Dissertation, S. Joseph Levine, Ph.D.
http://www.learnerassociates.net/dissthes/
Another useful resource for writing the dissertation. They have a section on writing the proposal which I found useful.
Paper Summary - Twitter Sentiment Classification using Distant Supervision
Twitter Sentiment Classification using Distant Supervision, A. Go and R. Bhayani and L. Huang, 2009, Technical report, Stanford Digital Library Technologies Project
This paper relates to classifying sentiment found on Twitter. They use machine learning and construct training data by using the Twitter API and emoticons present in tweets. The standard :) and :( are used to determine whether a tweet contains positive or negative sentiment. The key points are: 1) they picked an efficient way to construct their training sets, 2) tweets are harder to classify because their length cannot exceed 140 characters, and 3) their results were promising for classifying the sentiment of tweets.
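The distant-supervision trick is simple enough to sketch in a few lines of Python (the tweets below are made up; the real work pulls them from the Twitter API):
# build a noisy training set by treating emoticons as sentiment labels
tweets = [
    "just aced my exam :)",
    "my flight got cancelled :(",
    "coffee with friends :)",
]

training = []
for tweet in tweets:
    if ":)" in tweet:
        label = "positive"
    elif ":(" in tweet:
        label = "negative"
    else:
        continue  # tweets without an emoticon stay unlabeled
    # strip the emoticon so a classifier cannot simply memorize it
    text = tweet.replace(":)", "").replace(":(", "").strip()
    training.append((text, label))

print(training)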
Friday, February 17, 2012
Data Mining Resources
This thread has a lot of useful links for anyone studying data mining. It is mainly aimed at helping one build the mathematical background.
Friday, February 10, 2012
Data Mining and Machine Learning
I wished to understand the distinction between data mining and machine learning. This presentation (Machine Learning and Data Mining: 01 Data Mining) is useful.
Tuesday, February 7, 2012
Khan Academy
If you need to review concepts from Calculus, Linear Algebra, or Probability.
This is a good resource:
Khan Academy
I used this to review Linear Algebra concepts for my Data Mining course. They have quite a few topics.
Cheers.
Wednesday, January 4, 2012
Advice Collection
I came across a useful collection of links for Ph.D. students. It offers dissertation advice, presentation advice, and more....I read some of these articles in the past but it is nice to have a central location for reference.
Saturday, December 24, 2011
Neural Network Framework for Java
I spent the last month learning how to use Self-Organizing Maps (SOMs) for my Neural Network course. I used a SOM to perform instance matching (which is not typically what it is used for) with the intention that it could possibly be used in a 2-level fashion or with some other supervised approach. Think of a SOM as a clustering technique: it maps high-dimensional data into a lower-dimensional (typically 2-D) space, which lets you visualize the data in 2-D. The beauty of a SOM is that it is unsupervised, which means you do not need to specify the desired output for your training set.
It outperformed K-Means in comparison tests using the OAEI IIMB benchmark, both in F-measure scores and CPU time.
I used Encog for the neural net framework. I experienced a lot of memory issues, but for the most part I was quickly running a SOM, K-Means and SVM comparison test.
For the next test, I think using Matlab, Weka or R might be a better approach. As much as I like to keep things nice and clean in the code, I quickly run into memory issues as I increase the number of instances to test.
Code Sample for Encog SOM (based on the Encog example):
// Set up the training data; there is no desired (target) output in this case (unsupervised)
MLDataSet training = new BasicMLDataSet(trainingInput, null);
// Create the SOM neural network with an input count and an output count.
// This is basically your input node size and output node size.
// One point of improvement.
SOM network = new SOM(inputNodeSize, outputNodeSize);
// reset (randomize) the network weights
network.reset();
// Here you specify your parameters, i.e. learning rate and neighborhood function.
// I used NeighborhoodSingle here, but clearly that is not the best choice;
// the next round of tests will use an RBF neighborhood with a Gaussian function.
// 0.7 for the learning rate is not unreasonable.
BasicTrainSOM train = new BasicTrainSOM(
    network,
    0.7,
    training,
    new NeighborhoodSingle());
    // new NeighborhoodRBF(sizes, RBFEnum.Gaussian)
// Training iterations are not shown in this snippet; with Encog you would typically
// call train.iteration() in a loop here before querying winners.
// Store the winner in the slot of the 2-D array reserved for it.
// The calling code will lump instances that have the same winner
// to determine which instances are 'similar'.
double[][] newItems = new double[input.length][];
int i = 0;
for (double[] item : input)
{
    item[item.length - 2] = network.winner(new BasicMLData(item));
    newItems[i] = item;
    i++;
}
Encog.getInstance().shutdown();
Saturday, August 20, 2011
Paper Summary - Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference - Part 1
Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference
A. McCallum and B. Wellner
This paper is interesting. They make the point that pairwise coreference decisions may not always be independent of one another, and that one may be able to resolve inconsistencies by using a dependence model. They mention earlier work, the Relational Probabilistic Model, which captures this dependence; however, since it is a generative model, they state this could lead to complexities due to many features with varying degrees of granularity. They briefly discuss Hidden Markov models and conditional random fields, and Relational Markov networks as a similar model with improved classification.
They then discuss their own work, "three conditional undirected graphical models for identity uncertainty," which make the coreference decisions. Their first model connects mentions, entity assignments, and each attribute of the mention; edges indicate dependence. There is the concept of a clique: parameters may be part of different cliques, which results in patterns of parameters called clique templates. Parts of the graph that depend on a number of entities are removed and replaced with random variables indicating coreference (read this paper again to make sure we are clear on this). Per-entity attribute nodes are removed and replaced with attributes of the mention. They then use graph partitioning. There is a lot in this paper, and it really requires another read to understand their methods better.
Paper Summary - Disambiguation and Filter Methods in Using Web Knowledge for Coreference Resolution
Disambiguation and Filter Methods in Using Web Knowledge for Coreference Resolution
O. Uryupina and M. Poesio
They describe how they use Wikipedia and Yago to increase their coreference resolution performance. They use BART to support their efforts. Their classification instance consists of an anaphor and a potential antecedent, as they describe. Using the associated feature vectors, they use a maximum entropy classifier to determine coreference. They used Wikipedia to improve their aliasing algorithm, which performs string matching functions. Wikipedia-based information is used as a feature and to disambiguate mentions. They use Yago to supplement their efforts when there are too few features to make any reasonable decision about coreference; Yago information is also incorporated as a feature. They tested using ACE with reasonable scores.
Using these publicly available knowledge bases appears to improve performance (2-3% in this case). Something to think about....
Monday, August 8, 2011
Stanford Online AI Course
This course is offered for Fall 2011 semester and the instructors are Sebastian Thrun and Peter Norvig. It should be a good class. The formal title is "Introduction to Artificial Intelligence".
Join
Wednesday, July 20, 2011
Boom Time In Silicon Valley?
In case you missed this over the weekend, apparently in Silicon Valley the next big boom is occurring, at least according to the LA Times.
For those keeping their eye on the market for that special time to start-up, this may be good news. Still too early to know for sure.
Friday, July 15, 2011
Intuition
There was an interesting question posed on the LinkedIn AGI group discussion board:
Has anyone put out a suggestion on how to implement an "Intuition" engine?
Responses were also quite interesting...
Has anyone successfully built an intuition engine? The responses ranged from a weak description of how one might do this to long detailed descriptions of current research that has been ongoing for many years.
If this sounds interesting, you may want to join the LinkedIn group AGI — Artificial General Intelligence.
Or go to the AGI Conference in August.
Linked Data Paper Summary
"Linked Data - The Story So Far", C. Bizer, T. Heath, T. Berners-Lee, 2009, http://eprints.ecs.soton.ac.uk/21285/1/bizer-heath-berners-lee-ijswis-linked-data.pdf
My work will support working with linked data so I am attempting to build a better understanding of this topic.
This paper is good for providing a very basic understanding of linked data. If you are new to the concept of linked data, this is a good starting paper to read. It provides basic principles, examples, and isn't too technical.
If you already have a good understanding of the basics, skip this paper. I read it in about 10 minutes and it was pretty much just a review of what I already knew.
Labels:
Linked Data,
Research Paper Summaries,
Semantic Web
Hadoop Meet-up - Large Scale Graph Processing On HBase and Map/Reduce on Greenplum
I attended the Hadoop Meet-up on Tuesday titled "Large Scale Graph Processing On HBase and Map/Reduce on Greenplum". You can view the event here.
I attended this event for two reasons: It has been a couple of years since I worked intimately with Hadoop and I wanted to see how others are using it. I was also hoping the discussion on Large Scale Graph Processing would be useful for my research.
Though the presentations were somewhat stimulating, I didn't find much I could use for my work.
Tuesday, June 21, 2011
Canopy Clustering
"Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", McCallum,Nigam,Ungar,http://www.kamalnigam.com/papers/canopy-kdd00.pdf
This paper discusses a different type of clustering, called canopy clustering. It is an interesting idea. There are basically two thresholds, T1 and T2 with T1 greater than T2, and a 'cheap' distance metric used to evaluate a list of points. Pick one point and compare it with all the other points in the list. When the distance between two points falls within T1, put the point into the canopy; if the distance also falls within T2, remove the point from the list. We generate the canopies this way and work through the list until it is empty.
We can then apply our second level of clustering to each canopy, and we are pretty much guaranteed that if two points do not fall into the same canopy they are not likely to be co-referent and therefore do not need to be evaluated against each other.
This is efficient and elegant. Currently the only implementation that I found of canopy clustering is in Mahout. I am building my own implementation though to get a feel for how well it works.
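Here is a minimal Python sketch of the canopy step as I understand it from the paper, using absolute difference as the 'cheap' metric on toy 1-D points:
import random

def canopy_cluster(points, t1, t2, cheap_distance):
    # group points into (possibly overlapping) canopies; requires t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        still_remaining = []
        for p in remaining:
            d = cheap_distance(center, p)
            if d < t1:
                canopy.append(p)           # loosely similar: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # only tightly similar points leave the list
        remaining = still_remaining
        canopies.append(canopy)
    return canopies

points = [1.0, 1.2, 1.3, 5.0, 5.1, 9.0]
print(canopy_cluster(points, t1=2.0, t2=0.5,
                     cheap_distance=lambda a, b: abs(a - b)))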
SimHash: Hash-based Similarity Detection
"SimHash: Hash-based Similarity Detection", Sadowski, Levin, 2007.
This paper outlines a hash algorithm that can be used for similarity detection. Most hash algorithms are designed to offer low collision rates, and hash values for similar strings can vary quite a bit. This hash basically sets out to achieve the opposite: higher collision rates, with hash keys for similar strings being similar if not the same.
If you cannot use term frequency and need a numeric representation of a string for statistical processing, what process can be used? Using an integer-based hash is one way to achieve this, though in my opinion it is not the most sophisticated of approaches.
An implementation of this algorithm showed that it is a reasonable approach for hashing strings with the intent to determine similarity. I found minor issues which I will address by altering the algorithm.
Results showed that this is a reasonable approach....
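Not the paper's exact algorithm, but a small Python sketch of the general simhash idea (hash each feature, accumulate signed bit votes, read off the signs):
def simhash(text, bits=32):
    # tiny illustrative simhash over character trigrams (not the paper's exact scheme)
    v = [0] * bits
    features = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]
    for feat in features:
        # Python's built-in hash is consistent within a run, which is enough here
        h = hash(feat) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # each output bit is the sign of the accumulated vote
    return sum(1 << i for i in range(bits) if v[i] > 0)

a = simhash("the quick brown fox")
b = simhash("the quick brown foxes")
# similar strings should share most bits; count how many bits differ
print(bin(a ^ b).count("1"))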
Labels:
Classification,
Hashing,
Similarity Detection