**Start here**: this chapter describes the steps for information extraction: http://www.nltk.org/book3/ch07.html Background information can be found in chapters 3 and 5. There is also source code here: http://www.nltk.org/ Source code for the NER chunker: https://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/chunk/named_entity.py Tagger code info: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford
-
This article uses a C# tool, but I thought its description of hidden Markov modeling was good: http://www.codeproject.com/Articles/541428/Sequence-Classifiers-in-Csharp-Part-I-Hidden-Marko
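
To make the hidden Markov idea concrete for tagging, here is a minimal Viterbi decoding sketch in plain Python. It is not taken from the linked article or from NLTK. The two states (`O` for ordinary token, `NE` for named-entity token), the two observation symbols (`lower`, `Cap`), and all probabilities are made-up assumptions chosen only for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            column[s] = max(
                ((prev[p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                 for p in states),
                key=lambda pair: pair[0],
            )
        best.append(column)

    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model (all numbers are assumptions, not estimated from any corpus).
states = ["O", "NE"]
start_p = {"O": 0.8, "NE": 0.2}
trans_p = {"O": {"O": 0.8, "NE": 0.2}, "NE": {"O": 0.5, "NE": 0.5}}
emit_p = {"O": {"lower": 0.9, "Cap": 0.1}, "NE": {"lower": 0.1, "Cap": 0.9}}

tags = viterbi(["lower", "Cap", "Cap", "lower"], states, start_p, trans_p, emit_p)
print(tags)  # ['O', 'NE', 'NE', 'O']
```

A real tagger would estimate these tables from a tagged corpus and work in log space to avoid underflow on long sentences.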
-
(new) Here is an academic paper that talks about a semi-supervised learning algorithm for chunking. (As an aside, on page six there is a list of features they used for NER, something that I had been trying to find for a while): http://stat.rutgers.edu/home/tzhang/papers/acl05-semi.pdf
-
(new) PowerPoint slides discussing semi-supervised algorithms. I'll look for a better source of info: http://pages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdf
-
(new) Another set of PowerPoint slides discussing SSL (semi-supervised learning), from the same source. Seems meatier than the one above: http://pages.cs.wisc.edu/~jerryzhu/pub/acl08tutorial.pdf
-
(new) A paper on unsupervised chunking: http://aclweb.org/anthology//P/P11/P11-1108.pdf
-
(new) Semi-supervised HMM info: http://www.shi-zhong.com/papers/FLAIRS04ZhongS.pdf
-
Blog by one of the authors of NLTK; this entry discusses chunker training: http://streamhacker.wordpress.com/2008/12/29/how-to-train-a-nltk-chunker/
-
LingPipe is an NLP toolkit with an NER module. It has a book associated with it (much like NLTK): http://alias-i.com/lingpipe-book/lingpipe-book-0.5.pdf
- Ch. 6 is about character language models, Ch. 7 is about tokenized language models, Ch. 9 is about classifiers and has
- useful information about scoring classifier performance, Ch. 10 is about the Bayes classifier, Ch. 11 is about tagging...
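
Since scoring classifier performance comes up in the chapter notes above, here is a small sketch of the usual precision / recall / F1 calculation for NER output. It is not LingPipe's or NLTK's API. The function name `precision_recall_f1` and the `(start, end, label)` span format are assumptions made for this example, and the sample spans are invented.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start token, end token, entity label).
gold = [(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "LOC")]
predicted = [(0, 2, "PERSON"), (5, 6, "LOC"), (9, 10, "LOC"), (12, 13, "ORG")]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact-match scoring counts a span with the right boundaries but the wrong label as an error, as with `(5, 6, "LOC")` here; looser matching schemes would give different numbers.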
-
Textbook with lots of information about language processing http://www.cse.iitk.ac.in/users/mohit/Speech-and-Language-Processing.pdf
-
Possibly available corpora for use in the project: http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html