
cofenet's Introduction

CofeNet

This is the source code of COLING 2022 paper "CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction". See our paper for more details.

Abstract: Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recall, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.

1. Setup

Environment

# Requires Python 3.7
git clone https://github.com/cofe-ai/CofeNet.git
cd CofeNet
pip install -r requirements.txt

Datasets

The datasets are stored in the ./res directory. We provide the two datasets used in our paper, polnear and riqua. You can place other datasets here for the framework to read.

./res
├── polnear
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── riqua
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── others
└── ...

If you want to use other datasets, you need to create five files for each dataset, keeping exactly these file names:

  • train.txt, test.txt, valid.txt: the structured dataset splits.

Each data item is stored as a JSON object on a single line. The key "tokens" holds the word sequence of the text, and "labels" holds the corresponding label sequence.

{"tokens": ["WikiLeaks", "claims", "`", "state", ...], "labels": ["B-source", "B-cue", "B-content", "I-content", ...]}
  • tag.txt: The set of "labels" in the dataset.
  • voc.txt: Token vocabulary for non-pretrained models (e.g., LSTM).
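As a sanity check when preparing a new dataset, each line can be parsed and its BIO labels decoded into component spans. The sketch below is not part of the repository; it only assumes the one-JSON-object-per-line format shown above.

```python
import json

def decode_bio(tokens, labels):
    """Decode a BIO label sequence into (component, start, end) spans.

    A span covers tokens[start:end]; "B-x" opens a span, matching "I-x"
    extends it, and "O" (or an inconsistent tag) closes it.
    """
    spans, start, ctype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if ctype is not None:
                spans.append((ctype, start, i))
            ctype, start = lab[2:], i
        elif lab.startswith("I-") and ctype == lab[2:]:
            continue
        else:  # "O" or an I- tag that does not match the open span
            if ctype is not None:
                spans.append((ctype, start, i))
            ctype, start = None, None
    if ctype is not None:
        spans.append((ctype, start, len(labels)))
    return spans

# One dataset line, abbreviated from the example above.
line = '{"tokens": ["WikiLeaks", "claims", "state"], "labels": ["B-source", "B-cue", "B-content"]}'
item = json.loads(line)
assert len(item["tokens"]) == len(item["labels"])  # every token needs a label
print(decode_bio(item["tokens"], item["labels"]))
# → [('source', 0, 1), ('cue', 1, 2), ('content', 2, 3)]
```

Running this over every line of train.txt, test.txt, and valid.txt before training catches length mismatches and malformed tag sequences early.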

Experiment Configuration

Configuration files are stored in the conf/setting directory. Below we list the experiment configuration names (exp_name) used in the paper so that you can quickly reproduce the experimental results. You can also configure your own experiments here.

| Base Model | Dataset | Base | with CRF | with Cofe | Dataset | Base | with CRF | with Cofe |
|---|---|---|---|---|---|---|---|---|
| Embedding | polnear | pn_emb | pn_emb_crf | pn_emb_cofe | riqua | rq_emb | rq_emb_crf | rq_emb_cofe |
| CNN | polnear | pn_cnn | pn_cnn_crf | pn_cnn_cofe | riqua | rq_cnn | rq_cnn_crf | rq_cnn_cofe |
| GRU | polnear | pn_gru | pn_gru_crf | pn_gru_cofe | riqua | rq_gru | rq_gru_crf | rq_gru_cofe |
| LSTM | polnear | pn_lstm | pn_lstm_crf | pn_lstm_cofe | riqua | rq_lstm | rq_lstm_crf | rq_lstm_cofe |
| BiLSTM | polnear | pn_blstm | pn_blstm_crf | pn_blstm_cofe | riqua | rq_blstm | rq_blstm_crf | rq_blstm_cofe |
| BiLSTM L2 | polnear | pn_blstm2 | pn_blstm2_crf | pn_blstm2_cofe | riqua | rq_blstm2 | rq_blstm2_crf | rq_blstm2_cofe |
| BERT | polnear | pn_bert | pn_bert_crf | pn_bert_cofe | riqua | rq_bert | rq_bert_crf | rq_bert_cofe |
| BERT-CNN | polnear | pn_bert_cnn | – | – | riqua | rq_bert_cnn | – | – |
| BERT-LSTM | polnear | pn_bert_lstm | – | – | riqua | rq_bert_lstm | – | – |
| BERT-BiLSTM | polnear | pn_bert_blstm | pn_bert_blstm_crf | – | riqua | rq_bert_blstm | rq_bert_blstm_crf | – |

Trained model

Download the trained models and save them in ./conf/models. You can then reproduce our results via the Evaluate step below.

2. Run

Train

(a) Run the code

# Cofe for polnear
python run_train.py --exp_name pn_emb_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_cnn_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_gru_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_lstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm2_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_bert_cofe --trn_name v1 --eval_per_step 500 --max_epoch 6 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

# Cofe for riqua
python run_train.py --exp_name rq_emb_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_cnn_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_gru_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_lstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm2_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_bert_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

(b) Check log

You can find log files in ./log. For each experiment, the following files are generated:

  • Parameter Configuration (e.g., pn_bert_cofe_v1_20221101_040732.json)
  • Training Log (e.g., pn_bert_cofe_v1_20221101_040732.txt)
  • Tensorboard Files (e.g., pn_bert_cofe_v1_20221101_040732/)

In this example, pn_bert_cofe_v1_20221101_040732 is the unique name of an experiment run. It consists of the experiment name (pn_bert_cofe), the training version (v1), and the training start time (20221101_040732).
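Given this naming convention, a run name can be split back into its parts. The helper below is a hypothetical sketch, not part of the repository: since the timestamp occupies the last two underscore-separated fields, splitting from the right keeps the underscores inside the experiment name intact.

```python
def parse_run_name(run_name):
    """Split '<exp_name>_<version>_<date>_<time>' into its three parts."""
    # rsplit from the right: exp_name itself may contain underscores.
    exp_name, version, date, time = run_name.rsplit("_", 3)
    return exp_name, version, f"{date}_{time}"

print(parse_run_name("pn_bert_cofe_v1_20221101_040732"))
# → ('pn_bert_cofe', 'v1', '20221101_040732')
```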

(c) Run Tensorboard

tensorboard --bind_all --port 9900 --logdir ./log

Evaluate

Run the following commands to print the evaluation results of the trained models.

# Cofe for polnear
python run_eval.py --exp_name pn_emb_cofe --gpu 0
python run_eval.py --exp_name pn_cnn_cofe --gpu 0
python run_eval.py --exp_name pn_gru_cofe --gpu 0
python run_eval.py --exp_name pn_lstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm2_cofe --gpu 0
python run_eval.py --exp_name pn_bert_cofe --gpu 0

# Cofe for riqua
python run_eval.py --exp_name rq_emb_cofe --gpu 0
python run_eval.py --exp_name rq_cnn_cofe --gpu 0
python run_eval.py --exp_name rq_gru_cofe --gpu 0
python run_eval.py --exp_name rq_lstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm2_cofe --gpu 0
python run_eval.py --exp_name rq_bert_cofe --gpu 0

3. Experiment

Detailed CofeNet experiment results are available here.

4. Cite

If this code helps you, please cite the following paper.

@inproceedings{wang-etal-2022-cofenet,
    title = "{C}ofe{N}et: Context and Former-Label Enhanced Net for Complicated Quotation Extraction",
    author = "Wang, Yequan  and
      Li, Xiang  and
      Sun, Aixin  and
      Meng, Xuying  and
      Liao, Huaming  and
      Guo, Jiafeng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.215",
    pages = "2438--2449",
    abstract = "Quotation extraction aims to extract quotations from written text. There are three components in a quotation: \textit{source} refers to the holder of the quotation, \textit{cue} is the trigger word(s), and \textit{content} is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the \textbf{Co}ntext and \textbf{F}ormer-Label \textbf{E}nhanced \textbf{Net} (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.",
}

cofenet's People

Contributors: keshuichonglx, thuwyq


cofenet's Issues

What strategies are used to segment documents into sentences?

PolNeAR is a corpus of news articles in which attributions have been annotated. In this program, it is split into sentences. I would like to ask what strategy is used in the segmentation process, because I have encountered a similar problem. I am looking forward to your help.

Inference code - no attribute named dataset_kind

Hi - after downloading the pn_bert_cofe model files and saving them in the conf/models folder, I tried running the following command:
python run_eval.py --exp_name pn_bert_cofe --gpu 0
I ran into the following issue:

device: CPU
Data 'TST' loading ...: 100%|...| 1814/1814 [00:05<00:00, 355.12it/s]
Load -1 -> '/workspaces/CofeNet/conf/models/pn_bert_cofe/param/model_6000.bin' ..
Traceback (most recent call last):
  File "/workspaces/CofeNet/run_eval.py", line 29, in <module>
    ret = Executor.format_result(mod.eval_dataset_tst(), markdown_table=True)
  File "/workspaces/CofeNet/exe/executor.py", line 238, in eval_dataset_tst
    return self.eval_dataset(self.data_tst, batch_size, beam_width)
  File "/workspaces/CofeNet/exe/executor.py", line 104, in eval_dataset
    preds, labels = self.get_preds_trues(dataset, batch_size, beam_width)
  File "/workspaces/CofeNet/exe/executor.py", line 60, in get_preds_trues
    for batch_data, _, lbstrss in dataloder:
  File "/workspaces/CofeNet/data/loader.py", line 71, in __iter__
    return _MYSingleProcessDataLoaderIter(self)
  File "/workspaces/CofeNet/data/loader.py", line 48, in __init__
    super(_MYSingleProcessDataLoaderIter, self).__init__(loader)
  File "/workspaces/CofeNet/data/loader.py", line 20, in __init__
    self.dataset_kind = loader.dataset_kind
AttributeError: 'SingleDataLoader' object has no attribute 'dataset_kind'. Did you mean: '_dataset_kind'?

I changed this to self.dataset_kind = "" and it seems to be running now. Am I missing something here?
