yuweiyin / crop

[EMNLP 2022] CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation

Home Page: https://aclanthology.org/2022.findings-emnlp.34/

License: MIT License

CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation

Abstract

Named entity recognition (NER) suffers from the scarcity of annotated training data, especially for low-resource languages without labeled data. Cross-lingual NER has been proposed to alleviate this issue by transferring knowledge from high-resource languages to low-resource languages via aligned cross-lingual representations or machine translation results. However, the performance of cross-lingual NER methods is severely affected by the unsatisfactory quality of translation or label projection. To address these problems, we propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER with the help of a multilingual labeled sequence translation model. Specifically, the target sequence is first translated into the source language and then tagged by a source NER model. We further adopt a labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence. Ultimately, the whole pipeline is integrated into an end-to-end model by way of self-training. Experimental results on two benchmarks demonstrate that our method outperforms the previous strong baseline by a large margin of +3 to +7 F1 points and achieves state-of-the-art performance.

Data

We use the CCAligned, CoNLL-5, and XTREME-40 datasets. For more details, please refer to Section 4.1 (Dataset) of our paper.

Environment

cd m2m
pip install --editable ./
  • For faster training, install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

CROP Training

NOTE: replace every "/path/to/" placeholder in our code with your own code/data paths.
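
One way to locate and rewrite the placeholders is sketched below. The function names and the example prefix are illustrative and not part of this repository; the sketch assumes bash plus a `grep` and `sed` that support `-F` and `-i.bak` (both GNU and BSD versions do).

```shell
# List every file and line that still contains the "/path/to/" placeholder.
# The function name and the default search root (.) are illustrative choices.
list_path_placeholders() {
  local root="${1:-.}"
  # -r: recurse, -n: print line numbers, -F: match the literal string.
  # grep exits non-zero when nothing matches, so "|| true" keeps the
  # function usable under "set -e".
  grep -rnF "/path/to/" "$root" || true
}

# Rewrite the placeholder prefix in one file, keeping a .bak backup.
# Usage: replace_path_placeholder FILE NEW_PREFIX
replace_path_placeholder() {
  local file="$1" new_prefix="$2"
  # "|" is the sed delimiter, so slashes in the new prefix need no escaping.
  # A prefix containing "|" or "&" would need escaping first.
  sed -i.bak "s|/path/to/|${new_prefix%/}/|g" "$file"
}
```

For example, `list_path_placeholders ./pipeline` prints the remaining placeholders, and `replace_path_placeholder ./pipeline/step0_train_source_ner_model.sh /data/crop` rewrites one script (`/data/crop` is a made-up location). The `.bak` backups still contain the placeholder, so delete them once the edited scripts have been checked.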

Train the NER and Translation models

The Source NER Model

bash ./pipeline/step0_train_source_ner_model.sh

The Multilingual Labeled Sequence Translation Model

bash ./pipeline/step0_train_translation_model.sh
  • Download: Google Drive; Baidu Drive
    • Trained model
      • Trained Baseline Translation Model: m2m_checkpoint_baseline.pt
      • Trained Insert-based Translation Model: m2m_checkpoint_insert_avg_41_60.pt
      • Trained Replace-based Translation Model: m2m_checkpoint_replace_avg_11_20.pt
    • Dictionary for Tokenization (used by all three models above): dict.txt
      • dict-40-lang.zip includes 40 dictionaries of different languages.
    • SentencePiece Model: spm.model
    • XTREME-40 NER Data: xtreme_ner_data.zip
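
After downloading, it can help to confirm that the files are where the scripts expect them. The helper below is a sketch only: the function name is hypothetical, the target directory is whatever you chose when replacing the "/path/to/" placeholders, and the default file list simply repeats the names given above.

```shell
# Report which of the downloaded artifacts are present in a directory.
# Usage: check_downloads DIR [FILE...]
check_downloads() {
  local dir="$1"
  shift
  local missing=0 f
  # Default to the file names listed above when none are given.
  [ "$#" -gt 0 ] || set -- \
    m2m_checkpoint_baseline.pt \
    m2m_checkpoint_insert_avg_41_60.pt \
    m2m_checkpoint_replace_avg_11_20.pt \
    dict.txt \
    spm.model
  for f in "$@"; do
    if [ -s "$dir/$f" ]; then   # -s: file exists and is non-empty
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  # Non-zero exit status when at least one file is missing.
  return "$missing"
}
```

Calling `check_downloads DIR` with your own download directory checks the default list; passing extra file names after the directory checks those instead.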

CROP Pipeline

  1. Prepare Target Translation Data
bash ./pipeline/step1_prepare_tgt_translation_data.sh
  2. Translate Target Data to the Source Language

(use the Baseline Translation Model, the Insert-based Translation Model, or the Replace-based Translation Model)

bash ./pipeline/step2_tgt2src_translation.sh
  3. Prepare Translated NER Data
bash ./pipeline/step3_preapre_src_ner_data.sh
  4. Source NER
bash ./pipeline/step4_src_ner.sh
  5. Prepare Source Translation Data
bash ./pipeline/step5_prepare_src_translation_data.sh
  6. Labeled Translation

(use the Insert-based Translation Model)

bash ./pipeline/step6_labeled_transation.sh
  7. Prepare and Filter the Multilingual NER Data
bash ./pipeline/step7_prepare_pseudo_ner_data1.sh
bash ./pipeline/step7_prepare_pseudo_ner_data2.sh
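
Because each step consumes the output of the previous one, it may be convenient to run the scripts through a small wrapper that stops at the first failure. The wrapper below is a sketch and not part of the repository: the function name is hypothetical, the script names are copied verbatim from the list above, and it assumes the models from the training section are already available and all "/path/to/" placeholders have been replaced.

```shell
# Run the pipeline scripts in order and stop at the first failure.
# Usage: run_crop_pipeline [PIPELINE_DIR]   (defaults to ./pipeline)
run_crop_pipeline() {
  local dir="${1:-./pipeline}" step
  for step in \
    step1_prepare_tgt_translation_data.sh \
    step2_tgt2src_translation.sh \
    step3_preapre_src_ner_data.sh \
    step4_src_ner.sh \
    step5_prepare_src_translation_data.sh \
    step6_labeled_transation.sh \
    step7_prepare_pseudo_ner_data1.sh \
    step7_prepare_pseudo_ner_data2.sh
  do
    echo ">>> running $step"
    # Abort as soon as a step exits non-zero so later steps never
    # run on incomplete inputs.
    bash "$dir/$step" || { echo "failed: $step" >&2; return 1; }
  done
  echo "all steps finished"
}
```

Running `run_crop_pipeline` from the repository root executes steps 1 to 7; after fixing a failed step, rerun the remaining scripts by hand or call the wrapper again if the earlier steps are safe to repeat.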

Citation

@inproceedings{crop,
  title     = {CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation},
  author    = {Yang, Jian and Huang, Shaohan and Ma, Shuming and Yin, Yuwei and Dong, Li and Zhang, Dongdong and Guo, Hongcheng and Li, Zhoujun and Wei, Furu},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
  month     = {Dec},
  year      = {2022},
  pages     = {486--496},
  address   = {Abu Dhabi, United Arab Emirates},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.findings-emnlp.34},
  doi       = {10.18653/v1/2022.findings-emnlp.34},
}

License

Please refer to the LICENSE file for more details.

Contact

If you have any questions, feel free to open a GitHub issue or contact us by email.

crop's Issues

Issue with downloading trained model

I want to use this project for my own research. To do so, I need to download the trained RoBERTa model from (https://pan.baidu.com/s/1YQjJEIVevEHXk-wpxcA8wg?pwd=jp4b). However, I cannot download it: the site requires registration, and registration requires a Chinese phone number, which I do not have because I am not Chinese and not in China. Could you please upload it to a host that does not require registration, e.g. Google Drive?
