
nemo's Introduction

NEMO2 - Neural Modeling for Named Entities and Morphology - Hebrew NER

Introduction

Code and models for neural modeling of Hebrew NER, described in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO2)" together with extensive experiments on the different modeling scenarios provided in this repository.

Main Features

  1. Trained on the NEMO corpus of Hebrew NER and morphology, consisting of gold-annotated Modern Hebrew news articles.
  2. Multiple modeling options to go from raw Hebrew text to morpheme- and/or token-level NER boundaries.
  3. Neural models implemented with NCRF++.
  4. bclm is used for reading and transforming morpho-syntactic information layers.

Setup

Prerequisites:

  1. Clone this NEMO repo: git clone https://github.com/OnlpLab/NEMO.git
  2. Enter the repo directory: cd NEMO
  3. Preferably in a virtual env: pip install -r requirements.txt
  4. Unpack model files: gunzip data/*.gz
  5. Install yap: https://github.com/OnlpLab/yap

To run the API server

  1. In the YAP folder, run the YAP API server: ./yap api
  2. In the NEMO folder, run the NEMO API server: uvicorn api_main:app --port 8090

To run on file input (CLI): nemo.py

  1. Change YAP_PATH in config.py to the path of your local YAP executable.
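For example, the relevant line in config.py might look like the following (the path shown is illustrative):

# config.py (excerpt) -- point this at your local YAP executable
YAP_PATH = "/home/user/yap/yap"  # illustrative path; adjust to your installation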

Setup Using Docker

  1. Run docker-compose up (pulling, building, and startup will take a few minutes, depending on your bandwidth).
  2. That's it. You now have the NEMO API running and available at local port 8090.
    1. The YAP API container also runs in the background; you can expose it by uncommenting the last two lines of docker-compose.yml.

Usage

API Usage

  1. Once the API server is up, check out the API documentation by opening http://localhost:8090/docs in your browser.
  2. You can find the available API endpoints and more usage examples in api_usage.ipynb.
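For instance, here is a minimal sketch of querying the server from Python with the requests library. The HTTP method, endpoint name, and parameter name below are assumptions; consult http://localhost:8090/docs and api_usage.ipynb for the actual schema.

import requests

# Hypothetical request: a scenario-named endpoint with a single text parameter.
# Verify the method, path, and parameter name against http://localhost:8090/docs.
resp = requests.get(
    "http://localhost:8090/morph_hybrid_align_tokens",
    params={"q": "גנן גידל דגן בגן"},
)
resp.raise_for_status()
print(resp.json())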

File Input Usage (CLI)

  1. All you need to do is run nemo.py with a specific command (scenario) on a text file of Hebrew sentences separated by line breaks (one sentence per line); a minimal input-preparation sketch follows this list.
  2. You can run a neural NER model directly, or choose a full end-to-end scenario that includes morphological segmentation and alignments (described fully in the next section), e.g.:
    • the run_ner_model command with the token-single model will tokenize sentences and run the token-single model:
      • python nemo.py run_ner_model token-single example.txt example_output.txt
    • the morph_yap command runs an end-to-end pipeline: YAP morphological segmentation followed by our morph NER model on the predicted morphemes (the related morph_hybrid command, described in the next section, provided our best-performing morpheme-level NER boundaries):
      • python nemo.py morph_yap morph example.txt example_output_MORPH.txt
  3. You can find outputs of the different commands on the input in example.txt for: morph_hybrid_align_tokens, morph_hybrid, morph_yap, multi_align_hybrid, single.
  4. For a full list of the available commands please consult the next section and the inline documentation at the end of nemo.py.
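For example, a minimal sketch of preparing such an input file (the file name and sentences are illustrative):

from pathlib import Path

# One Hebrew sentence per line, UTF-8 encoded.
sentences = ["גנן גידל דגן בגן", "ראש הממשלה שוחח הערב עם קנצלרית גרמניה"]
Path("my_input.txt").write_text("\n".join(sentences), encoding="utf-8")

You can then run, e.g., python nemo.py run_ner_model token-single my_input.txt my_output.txt.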

Models and Scenarios

Models are all standard Bi-LSTM-CRF with character encoding (LSTM/CNN), implemented in NCRF++, with pre-trained fastText embeddings. The differences between the models lie in:

  1. Input units: morphemes (morph) vs. tokens (token-*)
  2. Output label set: token-single predicts single sequence labels (e.g. B-ORG), while token-multi predicts multi-labels (concatenated atomic labels, e.g. O-ORG^B-ORG^I-ORG) that encode, in order, the labels of the morphemes the token is made of.
(Figure: token-based models vs. the morpheme-based model.)
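To make the token-multi label format concrete, here is a minimal sketch of decomposing such a label into its per-morpheme labels (the helper name is ours; the separator follows the example above):

def split_multi_label(multi_label):
    # A token-multi label packs one atomic label per morpheme, in order,
    # joined by '^', e.g. 'O-ORG^B-ORG^I-ORG' for a three-morpheme token.
    return multi_label.split("^")

print(split_multi_label("O-ORG^B-ORG^I-ORG"))  # ['O-ORG', 'B-ORG', 'I-ORG']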

Morphemes must be predicted. This is done by performing morphological disambiguation (MD). We offer two options to do so:

  1. Standard pipeline: MD using YAP. This is used in the morph_yap command, which runs our morph NER model on the output of YAP joint segmentation.
  2. Hybrid pipeline: MD using our best performing Hybrid approach, which uses the output of the token-multi model to reduce the MD option space. This is used in morph_hybrid, multi_align_hybrid and morph_hybrid_align_tokens. We will explain these scenarios next.
MD Approach  | Commands
Standard MD  | morph_yap
Hybrid MD    | morph_hybrid, multi_align_hybrid, morph_hybrid_align_tokens

Finally, to get our desired output (tokens/morphemes), we can choose between different scenarios, some involving extra post-processing alignments:

  1. To get morpheme-level labels we have two options:
    • Run our morph NER model on predicted morphemes. Commands: morph_yap or morph_hybrid (the better-performing option).
    • token-multi labels can be aligned with predicted morphemes to get morpheme-level boundaries. Command: multi_align_hybrid.
Morph NER on Predicted Morphemes | Multi Predictions Aligned with Predicted Morphemes
morph_yap, morph_hybrid          | multi_align_hybrid
  2. To get token-level labels we have three options:
    • run_ner_model command with token-single model.
    • the predicted labels of the token-multi can be mapped to token-single labels to get standard token-single output. The command multi_to_single does this end-to-end.
    • Morpheme-level output can be aligned back to token-level boundaries. Command: morph_hybrid_align_tokens (this achieved best token-level results in our experiments).
Run token-single           | Map token-multi to token-single | Align morph NER with Tokens
run_ner_model token-single | multi_to_single                 | morph_hybrid_align_tokens
  • Note: while the morph_hybrid* scenarios offer the best performance, they are slightly less efficient, since they require running both the morph and token-multi NER models (YAP calls take up most of the runtime anyway, so this is not hugely significant).

Important Notes

  1. NCRF++ was great for our experiments on the NEMO corpus (a fixed dataset), but it has some caveats for real-life scenarios with arbitrary text:
    • fastText is not used on the fly to obtain vectors for OOV words (i.e. words not seen in our Wikipedia corpus); instead, it is used as a regular embedding matrix. The full generalization capacity of fastText shown in our experiments is therefore not available in the currently provided models, which will perform slightly worse than they could on arbitrary text. In our experiments we created such a matrix in advance, containing all the words in the NEMO corpus, and used it during training. Information on training your own model with your own vocabulary appears in the next section.
    • If you do wish to replicate our reported results on the Hebrew treebank, download the *oov* models from here and extract them to the data/ folder (they already appear in config.py).
  2. In the near future we plan to publish a cleaner end-to-end implementation, including use of our new AlephBERT pre-trained Transformer models.
  3. For archiving and reproducibility purposes, our original code used for experiments and analysis can be found in the following repos: https://github.com/cjer/NCRFpp, https://github.com/cjer/NER (beware - 2 years of Jupyter notebooks).

Training your own model

We provide template NCRF++ config files. These files already contain the hyperparameters we used in our training. To train your own model:

  1. Copy the config for the variant (token-multi, token-single, morph) you wish to use from the ncrf_train_configs folder.
  2. Change the parameter word_emb_dir to that of an embedding vectors file in standard word2vec textual format. You can use the fastText bin models we make available (in the next section) or any other embedding vectors of your choice.
  3. Run the following in your shell:
python ncrf_main.py --config <path_to_config> --device <gpu_device_number>
  4. For more information, please consult the NCRF++ documentation.
  5. To evaluate your trained models, please consult the evaluation section.

Morpheme and Word Embeddings

The word embeddings we trained and used in our models are available:

  1. Space-delimited tokens (traditional word embeddings): fastText (bin, text), GloVe, word2vec
  2. Morphemes: fastText (bin, text), GloVe, word2vec

These were trained on a 2013 Wiki dump corpus by Yoav Goldberg, which we re-tokenized and then re-parsed using YAP:

  1. Space-delimited tokens
  2. Morphemes, automatic YAP segmentation (using the morpheme FORM as the unit for embedding)
  3. CONLL files of full morpho-syntactic output of YAP
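The textual-format vectors can be loaded with standard tooling; for example, a minimal sketch using gensim (gensim is an assumption here, not a stated dependency of this repository, and the file name is illustrative):

from gensim.models import KeyedVectors

# Load word2vec-style textual vectors (token-level or morpheme-level).
vectors = KeyedVectors.load_word2vec_format("wiki_tokens.vec", binary=False)
print(vectors.vector_size)
print(vectors.most_similar("ישראל", topn=5))  # any in-vocabulary word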

Evaluation

To evaluate your predictions against gold annotations, use the ne_evaluate_mentions.py script. Evaluation looks for an exact match of mention string and entity category, but it is slightly different from the standard CoNLL-2003 evaluation commonly used for NER. The reason is that the predicted segmentation can differ from the gold segmentation, so positional indexes of sequence labels cannot be used. What we do instead is extract multisets of entity mentions and use multiset operations to compute precision, recall, and F1-score. You can find a more detailed discussion of evaluation in the NEMO2 paper.
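A minimal sketch of this multiset-based scoring, assuming mentions are represented as (mention string, entity category) pairs (this is an illustration, not the actual ne_evaluate_mentions.py implementation; the category names in the example are illustrative):

from collections import Counter

def mention_prf(gold_mentions, pred_mentions):
    # Treat gold and predicted mentions as multisets of (string, category) pairs.
    gold, pred = Counter(gold_mentions), Counter(pred_mentions)
    # Multiset intersection: a predicted mention is credited at most as many
    # times as it occurs in the gold annotation.
    correct = sum((gold & pred).values())
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

print(mention_prf([("דן", "PER"), ("ישראל", "GPE")], [("דן", "PER")]))  # (1.0, 0.5, 0.666...)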

To evaluate an output prediction file against a gold file use:

python ne_evaluate_mentions.py <path_to_gold_ner> <path_to_predicted_ner>

If you're working in Python, just call ne_evaluate_mentions.evaluate_files(...) with the same parameters.
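For example (the file names are illustrative; the argument order mirrors the command-line usage above):

from ne_evaluate_mentions import evaluate_files

# Gold file first, predictions second, as in the CLI.
evaluate_files("gold_morph_ner.txt", "predicted_morph_ner.txt")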

Ben-Mordecai Corpus

In our NEMO2 paper we also evaluate our models on the Ben-Mordecai Hebrew NER Corpus (BMC). The 3 random splits we used can be found here.

Citations

If you use any of the NEMO2 code, models, embeddings or the NEMO corpus, please cite the NEMO2 paper:

@article{10.1162/tacl_a_00404,
    author = {Bareket, Dan and Tsarfaty, Reut},
    title = "{Neural Modeling for Named Entities and Morphology (NEMO2)}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {9},
    pages = {909-928},
    year = {2021},
    month = {09},
    abstract = "{Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries, rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions, namely, what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available). We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich-and-ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00404},
    url = {https://doi.org/10.1162/tacl\_a\_00404},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00404/1962472/tacl\_a\_00404.pdf},
}

If you use NEMO2's NER models, please also cite NCRF++:

@inproceedings{yang2018ncrf,
    title = {{NCRF}++: An Open-source Neural Sequence Labeling Toolkit},
    author = {Yang, Jie and Zhang, Yue},
    booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
    url = {http://aclweb.org/anthology/P18-4013},
    year = {2018}
}

nemo's People

Contributors

alonisser, cjer


nemo's Issues

"md_lattice" output format

Hi,

I'd appreciate your help finding a simple way to extract the lemmas from NEMO's output.

So far I tried extracting them from the "md_lattice" output, but I'm having trouble with the format.

I tried reading the output with pandas:

import pandas as pd
import io
pd.read_csv(io.StringIO(res[0]["md_lattice"]), sep="\t", header=None)[3]

This doesn't always work.

Thanks!

KeyError: 'md_lattice' in morph_hybrid_align_tokens

Hi,
I'm getting an error that repeatedly occurs with some textual inputs. I can't tell what the issue with them is.

The error (as printed in NEMO's server):

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/opt/anaconda3/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/opt/anaconda3/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 580, in __call__
    await route.handle(scope, receive, send)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 241, in handle
    await self.app(scope, receive, send)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/routing.py", line 52, in app
    response = await func(request)
  File "/opt/anaconda3/lib/python3.8/site-packages/fastapi/routing.py", line 219, in app
    raw_response = await run_endpoint_function(
  File "/opt/anaconda3/lib/python3.8/site-packages/fastapi/routing.py", line 154, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/opt/anaconda3/lib/python3.8/site-packages/starlette/concurrency.py", line 40, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/opt/anaconda3/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./api_main.py", line 481, in morph_hybrid_align_tokens
    return morph_hybrid(q, multi_model_name, morph_model_name, align_tokens=True)
  File "./api_main.py", line 424, in morph_hybrid
    md_lattice = run_yap_md(pruned_lattice) #TODO: this should be joint, but there is currently no joint on MA in yap api
  File "./api_main.py", line 91, in run_yap_md
    return resp['md_lattice']
KeyError: 'md_lattice'

Examples of inputs for which this happened to me:

ראש הממשלה, בנימין נתניהו, שוחח הערב בטלפון עם קנצלרית גרמניה אנגלה מרקל, בעקבות הפיגוע אתמול בבית הכנסת בעיר האלה בזמן תפילות יום הכיפורים. בשיחה ציינה הקנצלרית מרקל כי יש בכוונתה להגביר את המאמצים לאבטחת הקהילה היהודית בארצה. פרשניתנו המדינית, אילאיל שחר, מזכירה כי מוקדם יותר שוחח נשיא המדינה, ראובן ריבלין, עם נשיא גרמניה, פרנק וולטר סטיינמאייר והדגיש כי יש להמשיך ו"לעשות ללא פשרות כדי להילחם באנטישמיות". נשיא גרמניה ביקר לפני מספר שעות בזירת הפיגוע בהאלה ואמר: "בהתחשב בהיסטוריה של גרמניה - אנחנו אחוזי אימה כשאנו מבינים שאירוע כזה התרחש כאן אצלנו".
בית המשפט העליון הכריע לפני זמן קצר כי הנאשמת בפדופיליה, מלכה לייפר, תישאר במעצר. לייפר מואשמת בשורת עבירות מין באוסטרליה וטרם הוחלט האם היא כשירה לעמוד שם לדין. כתבנו לענייני משפט, יובל אראל, מוסיף כי בהחלטה נכתב שהספק במצבה הנפשי של לייפר מעורר חשש כי מדובר בנסיון לברוח מאימת הדין. שרת המשפטים לשעבר וחברת הכנסת, איילת שקד, בירכה על ההחלטה בחשבון הטוויטר שלה והוסיפה כי לאחר חמש שנים - יש לסיים את הליך ההסגרה.

Thanks.

CLI morph_hybrid_align_tokens command problem

Hi, I'm having a problem running the morph_hybrid_align_tokens command after pulling the latest changes to the repository.
While other commands such as morph_hybrid do work on the attached input files, the command python nemo.py morph_hybrid_align_tokens morph NER_TEST.txt NER_RESULT.txt produces the following message:

    new_toks = get_fixed_tok(temp_morph_ner_output_path, orig_sents=prun_sents)
  File "nemo.py", line 224, in get_fixed_tok
    new_toks = bclm.get_token_df(new_sents, fields=['bio'])
  File "C:\Users\Tmuna\anaconda3\envs\nemoenv\lib\site-packages\bclm\transforms.py", line 70, in get_token_df
    tok_df = tok_df.sort_index().reset_index()
  File "C:\Users\Tmuna\anaconda3\envs\nemoenv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Tmuna\anaconda3\envs\nemoenv\lib\site-packages\pandas\core\frame.py", line 5794, in reset_index
    new_obj.insert(0, name, level_values)
  File "C:\Users\Tmuna\anaconda3\envs\nemoenv\lib\site-packages\pandas\core\frame.py", line 4409, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert token_str, already exists

The temporary files for NER_TEST_2.txt are also attached.
temp.zip
NER_TEST.txt
NER_TEST_2.txt

Training Data

Hi, is the training data available for use? If so, where can I get it?
