In "Deep contextualized word representations" (Peters et al. 2018, ELMo) and in the allennlp reimplementation, the CoNLL NER model is only trained on eng.train, using eng.testa as a validation dataset for early stopping. In the earlier TagLM (Peters et al. 2017, "Semi-supervised sequence tagging with bidirectional language models") the final model was trained on both eng.train + eng.testa. I tried to clarify this in Table 12 in the paper: https://arxiv.org/pdf/1802.05365.pdf
from nlp-progress.
Hi all, to clarify from my side:
Whenever possible Flair separately loads train, dev and test set for all datasets for which all three splits are defined. This is not always possible since some datasets (for instance CoNLL-2000 NP chunking) only define a train and a test set. In such cases, a dev dataset is sampled from the train set so that we again have three separate splits.
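The dev-sampling step described above can be sketched as follows. This is an illustrative stand-alone function, not Flair's actual implementation, and the `dev_fraction` and `seed` values are made up:

```python
import random

def sample_dev_split(train_sentences, dev_fraction=0.1, seed=42):
    """Hold out a fraction of the training data as a dev set, for
    corpora (e.g. CoNLL-2000 chunking) that only ship train and test."""
    rng = random.Random(seed)
    indices = list(range(len(train_sentences)))
    rng.shuffle(indices)
    n_dev = int(len(indices) * dev_fraction)
    dev = [train_sentences[i] for i in indices[:n_dev]]
    new_train = [train_sentences[i] for i in indices[n_dev:]]
    return new_train, dev
```

Fixing the seed keeps the sampled split reproducible across runs, so results stay comparable.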
Then, for training a model, hyperparameters are selected using dev data, i.e. by training on train and evaluating on dev (see the included hyperopt model for this). Once we have hyperparameters, Flair supports several methods for the final training run, of which the 2 most commonly used are:
- Train on train data. During training, measure the generalization error using dev data, and do learning rate annealing and early stopping using the dev data. After all epochs are completed, select the model that worked best on the dev data. Finally, evaluate this best model on test.
- Train on train and dev data. In this case, no generalization error can be computed. Instead, anneal against the training loss. Also, since there is no separate dev set, no best model can be selected; instead, we use the last state of the model after the learning rate has annealed to a point where it no longer learns. Finally, evaluate this last model on test.
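The two options above can be sketched generically. This is a hedged illustration, not Flair's actual trainer: `fit_epoch` and `eval_dev` are hypothetical callables standing in for one training epoch and a dev-set evaluation, and the annealing constants are made up:

```python
def final_training_run(fit_epoch, eval_dev=None, max_epochs=100,
                       lr=0.1, anneal_factor=0.5, min_lr=1e-4, patience=2):
    """fit_epoch(lr) -> (training_loss, model_snapshot);
    eval_dev(snapshot) -> dev score (higher is better), or None for option 2."""
    best = None                 # (dev score, snapshot) -- option 1 only
    best_loss = float("inf")
    bad_epochs = 0
    last = None
    for _ in range(max_epochs):
        if lr < min_lr:                     # learning rate annealed away: stop
            break
        loss, snapshot = fit_epoch(lr)
        last = snapshot
        if eval_dev is not None:            # option 1: track dev score
            score = eval_dev(snapshot)
            if best is None or score > best[0]:
                best, bad_epochs = (score, snapshot), 0
            else:
                bad_epochs += 1
        else:                               # option 2: track training loss
            if loss < best_loss:
                best_loss, bad_epochs = loss, 0
            else:
                bad_epochs += 1
        if bad_epochs >= patience:          # anneal the learning rate
            lr *= anneal_factor
            bad_epochs = 0
    # option 1: best model seen on dev; option 2: last model state
    return best[1] if eval_dev is not None else last
```

The only difference between the two regimes is which signal drives annealing and which snapshot is returned, matching the description above.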
In the paper, we report numbers using option 2 for all tasks. We do think that both methods are valid since they only differ in how you use the dev data. Also many tasks do not explicitly define a dev dataset and let you sample your own so you can trade-off yourself how important it is to have more training data vs being able to confidently select the best model from all epochs.
Hope this clarifies!
Good point. Do you want to create a PR for this? What about the recent BERT models? Do they also train on train+dev?
Do you want to create a PR for this?
Not this time :)
What about the recent BERT models? Do they also train on train+dev?
The other models in the table train only on train.
Are you saying that these 4 papers use eng.train + eng.testa for training, and not eng.testa for validation?
They use testa for hyperparameter tuning, then they train the final model on eng.train + dev.
I worked on CoNLL-2003 for a while, and in my experience, they do this for 2 reasons:
- The dev portion contains examples that appear (or are very similar) in the test set.
- Performance on dev is inversely proportional to that on the test. In other words, the best performance on testa will give you bad performance on testb and vice versa. Not because your model is bad, but... I don't know, this dataset is weird.
So if you are going to publish code to replicate your results, you are more comfortable if you mix train and dev and then split off another "unbiased" dev set where performance on this dev is proportional to that on the test.
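The mix-and-resplit procedure described above could be sketched like this; `new_dev_fraction` and the seed are illustrative choices, not values from any of the papers discussed:

```python
import random

def merge_and_resplit(train, dev, new_dev_fraction=0.1, seed=13):
    """Pool the original train and dev (testa) sets, then hold out a
    fresh dev set from the pool to tune against during replication."""
    pooled = list(train) + list(dev)
    rng = random.Random(seed)
    rng.shuffle(pooled)
    n_dev = int(len(pooled) * new_dev_fraction)
    return pooled[n_dev:], pooled[:n_dev]  # (new train, new dev)
```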
Do you mind indicating where you saw this? I'm asking because I directly used allennlp NER training with ELMo and flair from zalando, and in both scenarios they explicitly define testa for validation, train for training, and testb for testing, never mixing any of these at any time during training. And the results were compatible with what they reported in their papers.
http://alanakbik.github.io/papers/coling2018.pdf
Following Peters et al. (2017), we then repeat the experiment for the chosen model 5 times with different random seeds, and train using both the train and development set, reporting both average performance and standard deviation over these runs on the test set as final performance
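The reporting protocol in the quoted passage (train the chosen configuration with 5 random seeds, report mean and standard deviation of test F1) amounts to something like the following; the F1 numbers here are made up for illustration:

```python
import statistics

# hypothetical test-set F1 scores from 5 runs with different random seeds
f1_scores = [92.1, 92.4, 91.9, 92.3, 92.2]

mean_f1 = statistics.mean(f1_scores)
std_f1 = statistics.stdev(f1_scores)  # sample standard deviation
print(f"F1 = {mean_f1:.2f} +/- {std_f1:.2f}")  # prints "F1 = 92.18 +/- 0.19"
```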
But are you sure this should be interpreted as using train + testa for training, and not that they just use train for training and testa for validation during each epoch? This isn't consistent with their code 🤔
Yeah, I am sure that they mix train and testa after hyperparameter tuning. A number of papers have done this. For this dataset, there are 2 settings: train on train only, and train on train+testa.
For example, in https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00104:
As the dataset is small compared to Ontonotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set.
This isn't consistent with their code 🤔
Because (as I suppose) this is particular to this dataset and should not apply to other datasets.
Because (as I suppose) this is particular to this dataset and should not apply to other datasets
I mean that the implementation of flair and allennlp clearly separates each dataset and its role, never mixing them during training. Have you ever seen their code?
No, I didn't see the code, but I read their papers... and I worked on this dataset. As I said before, mixing train and dev is particular to and commonly used on this dataset.
You can communicate with the authors to check.
You are right!!! I thought that Peters et al. 2018 follows Peters et al. 2017. I removed Peters et al. 2018 from the first comment.
Thanks for the clarification, @matt-peters.
Hi @alanakbik , can you confirm the information from @ghaddarAbs ? From the code in flair, I'm under the impression that you also do not mix train and testa for training with CoNLL2003, but perhaps I missed something.
No, I didn't see the code, but I read their papers... and I worked on this dataset. As I said before, mixing train and dev is particular to and commonly used on this dataset.
You can communicate with the authors to check
Sorry @ghaddarAbs , I'm just being thorough because I'm writing a survey on NER 😄
Hi Alan, thanks for pitching in and for clarifying. :) I generally think that training on dev data is not a big issue (though it makes comparing against other results harder). One thing that I think is problematic is sampling the dev dataset from the test set. As far as I'm aware, either taking the dev set from the training data or using cross-validation are the common practices in this case.
Ah oops, yes that should read "dev dataset is sampled from the train set" of course - I typed too hastily. I'll edit the comment above to correct. The test set is never touched or sampled in any way during training / hyperparameter selection.
I generally think that training on dev data is not a big issue (though it makes comparing against other results harder).
Hi @sebastianruder, I didn't follow this part. Did you mean that it's easier or harder to make the comparison?
Not sure if this is what you meant, but I agree with @ghaddarAbs that the comparisons in these different scenarios should be kept separate, right?
Thanks for clarifying, Alan. :)
Pedro, sorry if I was being ambiguous. I meant that it makes it harder to compare results. I don't think we should have different tables, but feel free to add an asterisk to note if the dev set is used in a different way.
Ok, sure. @ghaddarAbs, so we should mark only those 3 that you are aware of?
Flair embeddings (Akbik et al., 2018)
Peters et al. (2017)
Yang et al. (2017)
Thanks!
@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017) yes, but I am not sure about Yang et al. (2017)... the text is ambiguous.
OK, I'll try contacting the author as well.
Also consider adding:

CoNLL 2003:
Model | F1 | Paper / Source | Code
Chiu and Nichols, 2016 | 91.62 | https://www.aclweb.org/anthology/Q16-1026 |

This paper has been cited 350+ times and it uses both train and dev for CoNLL as well.

OntoNotes v5:
Model | F1 | Paper / Source | Code
Chiu and Nichols, 2016 | 86.28 | https://www.aclweb.org/anthology/Q16-1026 |
@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017) yes, but I am not sure about Yang et al. (2017)... the text is ambiguous.
OK, @kimiyoung confirmed by e-mail:
Hi,
Yes we also used the dev set for training, just to be comparable to previous results that adopted this setting.
@sebastianruder I marked the papers that use train and dev with ♦ and added some results in a pull request. Feel free to close the issue.
Thanks for the thoroughness! :)