cltk / ang_models_cltk

Shell 0.13% Python 74.87% Awk 0.54% Jupyter Notebook 24.45% Dockerfile 0.01%

ang_models_cltk's Introduction

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for pre-modern languages.

Installation

For the CLTK's latest version:

$ pip install cltk

For more information, see Installation docs or, to install from source, Development.

Pre-1.0 software remains available on the v0.1.x branch, with docs at https://legacy.cltk.org. Install it with `pip install "cltk<1.0"`.

Documentation

Documentation is available at https://docs.cltk.org.

Citation

When using the CLTK, please cite the following publication, including the DOI:

Johnson, Kyle P., Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. "The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 20-29. 2021. 10.18653/v1/2021.acl-demo.3

The complete BibTeX entry:

@inproceedings{johnson-etal-2021-classical,
    title = "The {C}lassical {L}anguage {T}oolkit: {A}n {NLP} Framework for Pre-Modern Languages",
    author = "Johnson, Kyle P.  and
      Burns, Patrick J.  and
      Stewart, John  and
      Cook, Todd  and
      Besnier, Cl{\'e}ment  and
      Mattingly, William J. B.",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.3",
    doi = "10.18653/v1/2021.acl-demo.3",
    pages = "20--29",
    abstract = "This paper announces version 1.0 of the Classical Language Toolkit (CLTK), an NLP framework for pre-modern languages. The vast majority of NLP, its algorithms and software, is created with assumptions particular to living languages, thus neglecting certain important characteristics of largely non-spoken historical languages. Further, scholars of pre-modern languages often have different goals than those of living-language researchers. To fill this void, the CLTK adapts ideas from several leading NLP frameworks to create a novel software architecture that satisfies the unique needs of pre-modern languages and their researchers. Its centerpiece is a modular processing pipeline that balances the competing demands of algorithmic diversity with pre-configured defaults. The CLTK currently provides pipelines, including models, for almost 20 languages.",
}

ang_models_cltk's Issues

Recommendations for next steps

@free-variation This looks so great! I have not checked the details of your work, but I am confident we are off to a very strong start. I have however confirmed that everything in the README runs for me as it does for you.

Question: Are the final models trained on your entire training set? If not, they should be: with so little data, you need to use all of it. Since you have done a proper 10-fold cross-validation, the average across the 10 folds already gives an accurate estimate of each model's accuracy.

I'm going to open an issue for you in the main project, so that our GSoC people are aware of your additions.
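
The suggestion above (report the cross-validated average, then train the final model on everything) can be sketched as follows. This is a minimal illustration, not the repository's code: `train_unigram`, `accuracy`, and `cross_validate_then_fit` are hypothetical names, and the most-frequent-tag model is only a stand-in for the unigram, backoff, CRF, and perceptron taggers.

```python
import random
from collections import Counter, defaultdict


def train_unigram(sents):
    """Map each word to its most frequent tag (stand-in for a real tagger)."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, sents, default_tag="UNK"):
    """Token-level accuracy; fails loudly on an empty test set."""
    pairs = [(word, tag) for sent in sents for word, tag in sent]
    if not pairs:
        raise ValueError("empty test set")
    correct = sum(model.get(word, default_tag) == tag for word, tag in pairs)
    return correct / len(pairs)


def cross_validate_then_fit(sents, k=10, seed=0):
    """Return (mean k-fold accuracy, model trained on ALL sentences)."""
    if len(sents) < k:
        raise ValueError("need at least k sentences")
    shuffled = list(sents)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th sentence starting at offset i.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(train_unigram(train), test))
    # The held-out scores estimate accuracy; the shipped model uses all data.
    final_model = train_unigram(shuffled)
    return sum(scores) / k, final_model
```

The returned accuracy describes models trained on nine tenths of the data, so it is, if anything, a slightly pessimistic estimate for the final model.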

Accommodate Linux/Mac discrepancies in `shuf` and `head`

@free-variation We're off to a good start here!

I made some trivial changes on branch kj-review; please merge them into master if they look good. Then you can see where I am stuck with scripts/evaluate_all_models.bash.

$ ./scripts/evaluate_all_models.bash
---------- unigram ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- backoff ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- crf ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

---------- perceptron ----------
scripts/split_dataset.bash: line 17: shuf: command not found
head: illegal line count -- 0
Traceback (most recent call last):
  File "src/python/oe_dev.py", line 19, in <module>
    _, acc = make_pos_model(model_type, 'tmp/oe_train.pos', 'tmp/oe_test.pos')
  File "/Users/kyle/english_models_cltk/src/python/train_pos_tagger.py", line 42, in make_pos_model
    return (tagger, tagger.evaluate(test_sents))
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/tag/api.py", line 72, in evaluate
    return accuracy(gold_tokens, test_tokens)
  File "/Users/kyle/venv36/lib/python3.6/site-packages/nltk/metrics/scores.py", line 41, in accuracy
    return sum(x == y for x, y in zip(reference, test)) / len(test)
ZeroDivisionError: division by zero

Take a look at what I did on line 17: shuf $1 > $SHUFFLED_FILE || gshuf $1 > $SHUFFLED_FILE, so that gshuf gets called for Mac users. I assume my || trick is working, but I'm not totally sure. The ZeroDivisionError seems to come from head: illegal line count -- 0. What OS are you using now? I can try on a Linux machine later; until then, if you can try these on a Mac, it would help speed things along.
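
One way to sidestep the `shuf`/`gshuf` and `head` differences altogether would be to do the shuffle and split in Python, which behaves the same on Linux and macOS. The sketch below is an assumption about how that could look, not a port of scripts/split_dataset.bash: `split_dataset`, `test_fraction`, and `seed` are hypothetical names. It also refuses to return an empty test set, so the problem surfaces at the split rather than as a ZeroDivisionError inside the evaluation.

```python
import random


def split_dataset(lines, test_fraction=0.1, seed=None):
    """Shuffle lines and return (train_lines, test_lines), both non-empty."""
    lines = [line for line in lines if line.strip()]  # drop blank lines
    if len(lines) < 2:
        raise ValueError("need at least two non-empty lines to split")
    shuffled = lines[:]
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one line, and always leave at least one to train on.
    n_test = max(1, int(len(shuffled) * test_fraction))
    n_test = min(n_test, len(shuffled) - 1)
    return shuffled[n_test:], shuffled[:n_test]
```

Passing a fixed `seed` makes the split reproducible across machines, which a bare `shuf` call does not guarantee.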

Requirements.txt

Hey,

It would be great to have a requirements.txt file where packages and their versions are listed so that anyone can reproduce your work.

Thanks.
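
A pinned list can be generated from whatever environment the work was done in. The sketch below assumes Python 3.8 or later (for `importlib.metadata`); `pinned_requirements` is a hypothetical helper, and the package names passed to it are up to the repository owner.

```python
from importlib import metadata


def pinned_requirements(package_names):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in package_names:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed here, so there is nothing to pin
        lines.append(f"{name}=={version}")
    return lines
```

Writing the result to requirements.txt, one line per entry, gives a file that `pip install -r requirements.txt` can consume.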
