
cofenet's Introduction

CofeNet

This is the source code of COLING 2022 paper "CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction". See our paper for more details.

Abstract: Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recall, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.

1. Setup

Environment

# Requires Python 3.7
git clone https://github.com/cofe-ai/CofeNet.git
cd CofeNet
pip install -r requirements.txt

Datasets

The datasets are stored in the ./res directory. We provide the two datasets used in our paper, polnear and riqua. You can place other datasets here for the framework to read.

./res
├── polnear
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── riqua
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── others
└── ...

If you want to use other datasets, you need to create five files for each dataset, keeping exactly these file names:

  • train.txt, test.txt, valid.txt: the structured dataset splits.

Each data item is stored as a JSON object on a single line. The key "tokens" holds the word sequence of the text, and "labels" holds the corresponding label sequence.

{"tokens": ["WikiLeaks", "claims", "`", "state", ...], "labels": ["B-source", "B-cue", "B-content", "I-content", ...]}
  • tag.txt: The set of "labels" in the dataset.
  • voc.txt: Token vocabulary for non-pretrained models (e.g., LSTM).
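As a sanity check when preparing a new dataset, each line can be parsed and its BIO labels decoded into component spans. The sketch below is not part of the repository; it only assumes the one-JSON-object-per-line format shown above.

```python
import json

def decode_bio(tokens, labels):
    """Decode a BIO label sequence into (component, start, end) spans.

    A span covers tokens[start:end]; "B-x" opens a span, matching "I-x"
    extends it, and "O" (or an inconsistent tag) closes it.
    """
    spans, start, ctype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if ctype is not None:
                spans.append((ctype, start, i))
            ctype, start = lab[2:], i
        elif lab.startswith("I-") and ctype == lab[2:]:
            continue
        else:  # "O" or an I- tag that does not match the open span
            if ctype is not None:
                spans.append((ctype, start, i))
            ctype, start = None, None
    if ctype is not None:
        spans.append((ctype, start, len(labels)))
    return spans

# One dataset line, abbreviated from the example above.
line = '{"tokens": ["WikiLeaks", "claims", "state"], "labels": ["B-source", "B-cue", "B-content"]}'
item = json.loads(line)
assert len(item["tokens"]) == len(item["labels"])  # every token needs a label
print(decode_bio(item["tokens"], item["labels"]))
# → [('source', 0, 1), ('cue', 1, 2), ('content', 2, 3)]
```

Running this over every line of train.txt, test.txt, and valid.txt before training catches length mismatches and malformed tag sequences early.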

Experiment Configuration

Configuration files are stored in the conf/setting directory. Below we list the experiment configuration names (exp_name) used in the paper so that you can quickly reproduce the experimental results. You can also configure your own experiments here.

| Base Model | Dataset | Base | with CRF | with Cofe | Dataset | Base | with CRF | with Cofe |
|---|---|---|---|---|---|---|---|---|
| Embedding | polnear | pn_emb | pn_emb_crf | pn_emb_cofe | riqua | rq_emb | rq_emb_crf | rq_emb_cofe |
| CNN | polnear | pn_cnn | pn_cnn_crf | pn_cnn_cofe | riqua | rq_cnn | rq_cnn_crf | rq_cnn_cofe |
| GRU | polnear | pn_gru | pn_gru_crf | pn_gru_cofe | riqua | rq_gru | rq_gru_crf | rq_gru_cofe |
| LSTM | polnear | pn_lstm | pn_lstm_crf | pn_lstm_cofe | riqua | rq_lstm | rq_lstm_crf | rq_lstm_cofe |
| BiLSTM | polnear | pn_blstm | pn_blstm_crf | pn_blstm_cofe | riqua | rq_blstm | rq_blstm_crf | rq_blstm_cofe |
| BiLSTM L2 | polnear | pn_blstm2 | pn_blstm2_crf | pn_blstm2_cofe | riqua | rq_blstm2 | rq_blstm2_crf | rq_blstm2_cofe |
| BERT | polnear | pn_bert | pn_bert_crf | pn_bert_cofe | riqua | rq_bert | rq_bert_crf | rq_bert_cofe |
| BERT-CNN | polnear | pn_bert_cnn | – | – | riqua | rq_bert_cnn | – | – |
| BERT-LSTM | polnear | pn_bert_lstm | – | – | riqua | rq_bert_lstm | – | – |
| BERT-BiLSTM | polnear | pn_bert_blstm | pn_bert_blstm_crf | – | riqua | rq_bert_blstm | rq_bert_blstm_crf | – |

Trained model

Download the trained models and save them in ./conf/models. You can then reproduce our results via the Evaluate step below.

2. Run

Train

(a) Run the code

# Cofe for polnear
python run_train.py --exp_name pn_emb_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_cnn_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_gru_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_lstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm2_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_bert_cofe --trn_name v1 --eval_per_step 500 --max_epoch 6 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

# Cofe for riqua
python run_train.py --exp_name rq_emb_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_cnn_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_gru_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_lstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm2_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_bert_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

(b) Check log

You can find log files in ./log. For each experiment, the following files are generated:

  • Parameter Configuration (e.g., pn_bert_cofe_v1_20221101_040732.json)
  • Training Log (e.g., pn_bert_cofe_v1_20221101_040732.txt)
  • Tensorboard Files (e.g., pn_bert_cofe_v1_20221101_040732/)

In this example, pn_bert_cofe_v1_20221101_040732 is the unique name of an experiment run. It consists of the experiment name (pn_bert_cofe), the training version (v1), and the training start time (20221101_040732).
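Given this naming convention, a run name can be split back into its parts. The helper below is a hypothetical sketch, not part of the repository: since the timestamp occupies the last two underscore-separated fields, splitting from the right keeps the underscores inside the experiment name intact.

```python
def parse_run_name(run_name):
    """Split '<exp_name>_<version>_<date>_<time>' into its three parts."""
    # rsplit from the right: exp_name itself may contain underscores.
    exp_name, version, date, time = run_name.rsplit("_", 3)
    return exp_name, version, f"{date}_{time}"

print(parse_run_name("pn_bert_cofe_v1_20221101_040732"))
# → ('pn_bert_cofe', 'v1', '20221101_040732')
```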

(c) Run Tensorboard

tensorboard --bind_all --port 9900 --logdir ./log

Evaluate

Run the following commands to print the evaluation results of the trained models.

# Cofe for polnear
python run_eval.py --exp_name pn_emb_cofe --gpu 0
python run_eval.py --exp_name pn_cnn_cofe --gpu 0
python run_eval.py --exp_name pn_gru_cofe --gpu 0
python run_eval.py --exp_name pn_lstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm2_cofe --gpu 0
python run_eval.py --exp_name pn_bert_cofe --gpu 0

# Cofe for riqua
python run_eval.py --exp_name rq_emb_cofe --gpu 0
python run_eval.py --exp_name rq_cnn_cofe --gpu 0
python run_eval.py --exp_name rq_gru_cofe --gpu 0
python run_eval.py --exp_name rq_lstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm2_cofe --gpu 0
python run_eval.py --exp_name rq_bert_cofe --gpu 0

3. Experiment

Detailed CofeNet experiment results are available here.

4. Cite

If this code helps you, please cite the following paper.

@inproceedings{wang-etal-2022-cofenet,
    title = "{C}ofe{N}et: Context and Former-Label Enhanced Net for Complicated Quotation Extraction",
    author = "Wang, Yequan  and
      Li, Xiang  and
      Sun, Aixin  and
      Meng, Xuying  and
      Liao, Huaming  and
      Guo, Jiafeng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.215",
    pages = "2438--2449",
    abstract = "Quotation extraction aims to extract quotations from written text. There are three components in a quotation: \textit{source} refers to the holder of the quotation, \textit{cue} is the trigger word(s), and \textit{content} is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the \textbf{Co}ntext and \textbf{F}ormer-Label \textbf{E}nhanced \textbf{Net} (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.",
}

cofenet's People

Contributors: keshuichonglx, thuwyq


cofenet's Issues

What strategies are used to segment documents into sentences?

PolNeAR is a corpus of news articles in which attributions have been annotated. In this program, it is split into sentences. I would like to ask what strategy is used in the segmentation process, because I have encountered a similar problem. I am looking forward to your help.

Inference code - no attribute named dataset_kind

Hi - after downloading the pn_bert_cofe model files and saving them in the conf/models folder, I tried running the following command:
python run_eval.py --exp_name pn_bert_cofe --gpu 0
I ran into the following issue:

device: CPU
Data 'TST' loading ...: 100%|...| 1814/1814 [00:05<00:00, 355.12it/s]
Load -1 -> '/workspaces/CofeNet/conf/models/pn_bert_cofe/param/model_6000.bin' ..
Traceback (most recent call last):
  File "/workspaces/CofeNet/run_eval.py", line 29, in <module>
    ret = Executor.format_result(mod.eval_dataset_tst(), markdown_table=True)
  File "/workspaces/CofeNet/exe/executor.py", line 238, in eval_dataset_tst
    return self.eval_dataset(self.data_tst, batch_size, beam_width)
  File "/workspaces/CofeNet/exe/executor.py", line 104, in eval_dataset
    preds, labels = self.get_preds_trues(dataset, batch_size, beam_width)
  File "/workspaces/CofeNet/exe/executor.py", line 60, in get_preds_trues
    for batch_data, _, lbstrss in dataloder:
  File "/workspaces/CofeNet/data/loader.py", line 71, in __iter__
    return _MYSingleProcessDataLoaderIter(self)
  File "/workspaces/CofeNet/data/loader.py", line 48, in __init__
    super(_MYSingleProcessDataLoaderIter, self).__init__(loader)
  File "/workspaces/CofeNet/data/loader.py", line 20, in __init__
    self.dataset_kind = loader.dataset_kind
AttributeError: 'SingleDataLoader' object has no attribute 'dataset_kind'. Did you mean: '_dataset_kind'?

I changed this to self.dataset_kind = "" and it seems to be running now. Am I missing something here?
