Character Extraction

This program extracts the names of fictional characters from a novel and analyzes the sentences in which each character appears or is referenced, building a profile of data specific to that character. It was written with the 32-bit version of Python 2.7, using the Natural Language Toolkit (NLTK) 2.0.4 and Pattern 2.6 libraries.

To analyze a different book, add it as a text file to the same directory as the program, change the text file's name on line 25 of the program, and rerun it. Alternatively, keep the book in a different directory and use the path to the book file instead of the bare file name.
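As a rough illustration of the two options above, the sketch below reads a book either from a bare file name (resolved against the current working directory) or from a path to another directory. The helper name `read_book`, the example file names, and the UTF-8 encoding are assumptions made for illustration; they are not taken from the program itself.

```python
import io


def read_book(book_path, encoding="utf-8"):
    """Return the full text of the book stored at book_path.

    book_path may be a bare file name such as "oliver_twist.txt"
    (hypothetical name) or a path to a file in another directory.
    io.open is used so the same call works on Python 2.7 and Python 3.
    """
    with io.open(book_path, encoding=encoding) as book_file:
        return book_file.read()


# Hypothetical usage:
# text = read_book("oliver_twist.txt")          # file in the working directory
# text = read_book("/path/to/books/novel.txt")  # file in another directory
```

If a book is not encoded as UTF-8, the `encoding` argument would need to be adjusted accordingly.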

References

Oliver Twist

This and all associated files of various formats will be found in: http://www.gutenberg.org/7/3/730/

Produced by Peggy Gaugy and Leigh Little. HTML version by Al Haines. This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.net

NLTK

Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.

NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.

NLTK source code is distributed under the Apache 2.0 License. NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license. NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable, and available for non-commercial use. NLTK may be freely redistributed, subject to the provisions of these licenses.

https://github.com/nltk/nltk/blob/develop/LICENSE.txt

Pattern

De Smedt, T., Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031–2035.

Pattern is a web mining module for Python. It has tools for data mining (web services for Google, Twitter and Wikipedia, web crawler, HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, classification using KNN, SVM, Perceptron) and network analysis (graph centrality and visualization). It is well documented and bundled with 50+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.

https://github.com/clips/pattern/blob/master/README.txt

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Download resources automatically?

λ python characterExtraction.py
Traceback (most recent call last):
  File "C:\Users\endolith\Documents\Engineering documents\Machine learning neural networks\Language models\character-extraction\characterExtraction.py", line 195, in <module>
    chunkedSentences = chunkSentences(text)
  File "C:\Users\endolith\Documents\Engineering documents\Machine learning neural networks\Language models\character-extraction\characterExtraction.py", line 44, in chunkSentences
    chunkedSentences = nltk.ne_chunk_sents(taggedSentences, binary=True)
  File "C:\Users\endolith\anaconda3\lib\site-packages\nltk\chunk\__init__.py", line 196, in ne_chunk_sents
    chunker = load(chunker_pickle)
  File "C:\Users\endolith\anaconda3\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Users\endolith\anaconda3\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Users\endolith\anaconda3\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource maxent_ne_chunker not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('maxent_ne_chunker')

  For more information see: https://www.nltk.org/data.html

  Attempted to load chunkers/maxent_ne_chunker/english_ace_binary.pickle

  Searched in:
    - 'C:\\Users\\endolith/nltk_data'
    - 'C:\\Users\\endolith\\anaconda3\\nltk_data'
    - 'C:\\Users\\endolith\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\endolith\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\endolith\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************

It says the same for several resources, so I have to install one, restart the script, get another error, and so on.
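One possible way to avoid the install-and-restart loop is to check for every needed resource up front and download the missing ones in a single pass. The sketch below is not part of the project. Only `maxent_ne_chunker` is confirmed by the traceback above; the other resource names (`punkt`, `averaged_perceptron_tagger`, `words`) are guesses at what the script needs and may differ between NLTK versions. The helper names `ensure_resources` and `ensure_nltk_resources` are hypothetical.

```python
# Maps the path NLTK searches for to the name passed to nltk.download().
# Only the maxent_ne_chunker entry is confirmed by the traceback; the
# remaining entries are assumptions and may vary by NLTK version.
ASSUMED_RESOURCES = {
    "chunkers/maxent_ne_chunker": "maxent_ne_chunker",
    "tokenizers/punkt": "punkt",
    "taggers/averaged_perceptron_tagger": "averaged_perceptron_tagger",
    "corpora/words": "words",
}


def ensure_resources(resources, find, download):
    """Download every resource that find() cannot locate.

    find(path) must raise LookupError when the resource is missing and
    download(name) must fetch it. Returns the names that were downloaded.
    """
    downloaded = []
    for path, name in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            downloaded.append(name)
    return downloaded


def ensure_nltk_resources(resources=None):
    """Wire ensure_resources to NLTK's own lookup and downloader."""
    import nltk  # imported lazily so the helper above has no hard dependency

    if resources is None:
        resources = ASSUMED_RESOURCES
    return ensure_resources(resources, nltk.data.find, nltk.download)
```

Calling `ensure_nltk_resources()` once near the top of `characterExtraction.py`, before `chunkSentences` runs, should surface all missing resources in one run rather than one per restart, provided the resource list matches what the script actually loads.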
