ang_models_cltk's Introduction

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for pre-modern languages.

Installation

For the CLTK's latest version:

$ pip install cltk

For more information, see Installation docs or, to install from source, Development.

Pre-1.0 software remains available on the branch v0.1.x and docs at https://legacy.cltk.org. Install it with pip install "cltk<1.0".

Documentation

Documentation is available at https://docs.cltk.org.

Citation

When using the CLTK, please cite the following publication, including the DOI:

Johnson, Kyle P., Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. "The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 20-29. 2021. 10.18653/v1/2021.acl-demo.3

The complete BibTeX entry:

@inproceedings{johnson-etal-2021-classical,
    title = "The {C}lassical {L}anguage {T}oolkit: {A}n {NLP} Framework for Pre-Modern Languages",
    author = "Johnson, Kyle P.  and
      Burns, Patrick J.  and
      Stewart, John  and
      Cook, Todd  and
      Besnier, Cl{\'e}ment  and
      Mattingly, William J. B.",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.3",
    doi = "10.18653/v1/2021.acl-demo.3",
    pages = "20--29",
    abstract = "This paper announces version 1.0 of the Classical Language Toolkit (CLTK), an NLP framework for pre-modern languages. The vast majority of NLP, its algorithms and software, is created with assumptions particular to living languages, thus neglecting certain important characteristics of largely non-spoken historical languages. Further, scholars of pre-modern languages often have different goals than those of living-language researchers. To fill this void, the CLTK adapts ideas from several leading NLP frameworks to create a novel software architecture that satisfies the unique needs of pre-modern languages and their researchers. Its centerpiece is a modular processing pipeline that balances the competing demands of algorithmic diversity with pre-configured defaults. The CLTK currently provides pipelines, including models, for almost 20 languages.",
}

License

Copyright (c) 2014-2024 Kyle P. Johnson under the MIT License.

ang_models_cltk's People

Contributors

free-variation, jds-amplify, kylepjohnson


ang_models_cltk's Issues

Recommendations for next steps

@free-variation This looks so great! I have not checked the details of your work, but I am confident we are off to a very strong start. I have, however, confirmed that everything in the README runs for me as it does for you.

Question: Are the final models trained on your entire training set? If not, they should be: with so little data, you need to make use of all of it. Since you have done a proper 10-fold cross-validation, the averages across the 10 folds already convey the accuracy of each model.
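A minimal sketch of that workflow, for illustration only: estimate accuracy with k-fold cross-validation, then train the model you ship on every sentence. It does not reproduce this repository's scripts. The toy unigram tagger, the function names (train_unigram, accuracy, cross_validate), and the tagged sentences are all hypothetical stand-ins that use only the standard library.

```python
import random
from collections import Counter, defaultdict


def train_unigram(sents):
    """Toy tagger (hypothetical): map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = sum(len(sent) for sent in sents)
    if total == 0:
        raise ValueError("test set is empty")
    correct = sum(model.get(word) == tag for sent in sents for word, tag in sent)
    return correct / total


def cross_validate(sents, k=10, seed=0):
    """Return the mean accuracy over k folds of a shuffled copy of sents."""
    if len(sents) < k:
        raise ValueError("need at least k sentences")
    shuffled = list(sents)
    random.Random(seed).shuffle(shuffled)
    scores = []
    for i in range(k):
        # Every k-th sentence, starting at offset i, is held out for testing.
        test = shuffled[i::k]
        train = [s for j, s in enumerate(shuffled) if j % k != i]
        scores.append(accuracy(train_unigram(train), test))
    return sum(scores) / k


# Made-up tagged sentences, repeated so that every fold is non-empty.
toy_sents = [
    [("se", "DET"), ("cyning", "NOUN"), ("rad", "VERB")],
    [("seo", "DET"), ("cwen", "NOUN"), ("sang", "VERB")],
] * 10

# Step 1: cross-validation gives the accuracy estimate to report.
mean_accuracy = cross_validate(toy_sents, k=10)

# Step 2: the model that ships is trained on all of the data.
final_model = train_unigram(toy_sents)
```

The cross-validated mean is what gets reported; the final model itself is not evaluated again, because no held-out data remains once everything is used for training.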

I'm going to open an issue for you in the main project, so that our GSoC people are aware of your additions.

Accommodate Linux/Mac discrepancies in `shuf` and `head`

@free-variation We're off to a good start here!

I made some trivial changes on branch kj-review; please merge these into master if they look good. Then you can see where I am stuck with scripts/evaluate_all_models.bash.

$ ./scripts/evaluate_all_models.bash
---------- unigram ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- backoff ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- crf ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- perceptron ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

Take a look at what I did on line 17: shuf $1 > $SHUFFLED_FILE || gshuf $1 > $SHUFFLED_FILE, so that gshuf gets called for Mac users. I assume my || trick is working, but I am not totally sure. The ZeroDivisionError seems to stem from head: illegal line count -- 0. What OS are you using now? I can try on Linux later; until then, if you can try these on a Mac, it would help speed things along.
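One way to sidestep the shuf/gshuf and head differences entirely would be to do the shuffle and split in Python, which behaves the same on Linux and macOS. The sketch below is a suggested alternative, not what scripts/split_dataset.bash does; the function name split_lines and its parameters are hypothetical. It also fails early with a clear message when the test split would be empty, instead of letting the evaluation divide by zero later.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=None):
    """Shuffle lines and return (train, test) lists.

    Raises ValueError rather than returning an empty split, so a bad
    input surfaces here and not as a ZeroDivisionError downstream.
    """
    if not lines:
        raise ValueError("no lines to split; was the input file read correctly?")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one line for testing.
    n_test = max(1, round(len(shuffled) * test_fraction))
    if n_test >= len(shuffled):
        raise ValueError("too few lines to build both a train and a test set")
    return shuffled[n_test:], shuffled[:n_test]


# Usage with placeholder data standing in for one-sentence-per-line input.
example_lines = [f"sentence {i}" for i in range(20)]
train_lines, test_lines = split_lines(example_lines, test_fraction=0.1, seed=1)
```

Writing train_lines and test_lines to tmp/oe_train.pos and tmp/oe_test.pos would then replace the shuf and head calls, assuming the dataset holds one sentence per line.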

Requirements.txt

Hey,

It would be great to have a requirements.txt file where packages and their versions are listed so that anyone can reproduce your work.

Thanks.
