In "Deep contextualized word representations" (Peters et al. 2018, ELMo) and in the allennlp reimplementation, the CoNLL NER model is only trained on eng.train, using eng.testa as a validation dataset for early stopping. In the earlier TagLM (Peters et al. 2017, "Semi-supervised sequence tagging with bidirectional language models") the final model was trained on both eng.train + eng.testa. I tried to clarify this in Table 12 in the paper: https://arxiv.org/pdf/1802.05365.pdf
from nlp-progress.
Hi all, to clarify from my side:
Whenever possible Flair separately loads train, dev and test set for all datasets for which all three splits are defined. This is not always possible since some datasets (for instance CoNLL-2000 NP chunking) only define a train and a test set. In such cases, a dev dataset is sampled from the train set so that we again have three separate splits.
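The dev-sampling step described above can be sketched as follows. This is an illustrative stand-alone function, not Flair's actual implementation, and the `dev_fraction` and `seed` values are made up:

```python
import random

def sample_dev_split(train_sentences, dev_fraction=0.1, seed=42):
    """Hold out a fraction of the training data as a dev set, for
    corpora (e.g. CoNLL-2000 chunking) that only ship train and test."""
    rng = random.Random(seed)
    indices = list(range(len(train_sentences)))
    rng.shuffle(indices)
    n_dev = int(len(indices) * dev_fraction)
    dev = [train_sentences[i] for i in indices[:n_dev]]
    new_train = [train_sentences[i] for i in indices[n_dev:]]
    return new_train, dev
```

Fixing the seed keeps the sampled split reproducible across runs, so results stay comparable.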
Then, for training a model, hyperparameters are selected using dev data, i.e. by training on train and evaluating on dev (see the included hyperopt model for this). Once we have hyperparameters, Flair supports several methods for the final training run, of which the 2 most commonly used are:
- Train on train data. During training, measure the generalization error using dev data, and do learning rate annealing and early stopping using the dev data. After all epochs are completed, select the model that worked best on the dev data. Finally, evaluate this best model on test.
- Train on train and dev data. In this case, no generalization error can be computed. Instead, anneal against the training loss. Also, since there is no separate dev set, no best model can be selected; instead, we use the last state of the model after the learning rate has annealed to a point where it no longer learns. Finally, evaluate this last model on test.
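The two options above can be sketched generically. This is a hedged illustration, not Flair's actual trainer: `fit_epoch` and `eval_dev` are hypothetical callables standing in for one training epoch and a dev-set evaluation, and the annealing constants are made up:

```python
def final_training_run(fit_epoch, eval_dev=None, max_epochs=100,
                       lr=0.1, anneal_factor=0.5, min_lr=1e-4, patience=2):
    """fit_epoch(lr) -> (training_loss, model_snapshot);
    eval_dev(snapshot) -> dev score (higher is better), or None for option 2."""
    best = None                 # (dev score, snapshot) -- option 1 only
    best_loss = float("inf")
    bad_epochs = 0
    last = None
    for _ in range(max_epochs):
        if lr < min_lr:                     # learning rate annealed away: stop
            break
        loss, snapshot = fit_epoch(lr)
        last = snapshot
        if eval_dev is not None:            # option 1: track dev score
            score = eval_dev(snapshot)
            if best is None or score > best[0]:
                best, bad_epochs = (score, snapshot), 0
            else:
                bad_epochs += 1
        else:                               # option 2: track training loss
            if loss < best_loss:
                best_loss, bad_epochs = loss, 0
            else:
                bad_epochs += 1
        if bad_epochs >= patience:          # anneal the learning rate
            lr *= anneal_factor
            bad_epochs = 0
    # option 1: best model seen on dev; option 2: last model state
    return best[1] if eval_dev is not None else last
```

The only difference between the two regimes is which signal drives annealing and which snapshot is returned, matching the description above.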
In the paper, we report numbers using option 2 for all tasks. We do think that both methods are valid since they only differ in how you use the dev data. Also many tasks do not explicitly define a dev dataset and let you sample your own so you can trade-off yourself how important it is to have more training data vs being able to confidently select the best model from all epochs.
Hope this clarifies!
Good point. Do you want to create a PR for this? What about the recent BERT models? Do they also train on train+dev?
Do you want to create a PR for this?
Not this time :)
What about the recent BERT models? Do they also train on train+dev?
The other models in the table train only on train.
Are you saying that these 4 papers use eng.train + eng.testa for training, and not eng.testa for validation?
They use testa for hyperparameter tuning, then they train the final model on eng.train + dev.
I worked on CoNLL-2003 for a while, and in my experience, they do this for 2 reasons:
- The dev portion contains examples that appear (or are very similar) in the test set.
- Performance on dev is inversely proportional to that on the test. In other words, the best performance on testa will give you bad performance on testb and vice versa. Not because your model is bad, but... I don't know, this dataset is weird.
So if you are going to publish code to replicate your results, you are more comfortable if you mix train and dev and then split off another "unbiased" dev set where performance on this dev is proportional to that on the test.
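The mix-and-resplit procedure described above could be sketched like this; `new_dev_fraction` and the seed are illustrative choices, not values from any of the papers discussed:

```python
import random

def merge_and_resplit(train, dev, new_dev_fraction=0.1, seed=13):
    """Pool the original train and dev (testa) sets, then hold out a
    fresh dev set from the pool to tune against during replication."""
    pooled = list(train) + list(dev)
    rng = random.Random(seed)
    rng.shuffle(pooled)
    n_dev = int(len(pooled) * new_dev_fraction)
    return pooled[n_dev:], pooled[:n_dev]  # (new train, new dev)
```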
Do you mind indicating where you saw this? I'm asking because I directly used allennlp NER training with ELMo and flair from zalando, and in both scenarios they explicitly define testa for validation, train for training, and testb for testing, never mixing any of these at any time during training. And the results were compatible with what they reported in their papers.
http://alanakbik.github.io/papers/coling2018.pdf
Following Peters et al. (2017), we then repeat the experiment for the chosen model 5 times with different random seeds, and train using both the train and development set, reporting both average performance and standard deviation over these runs on the test set as final performance
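The reporting protocol in the quoted passage (train the chosen configuration with 5 random seeds, report mean and standard deviation of test F1) amounts to something like the following; the F1 numbers here are made up for illustration:

```python
import statistics

# hypothetical test-set F1 scores from 5 runs with different random seeds
f1_scores = [92.1, 92.4, 91.9, 92.3, 92.2]

mean_f1 = statistics.mean(f1_scores)
std_f1 = statistics.stdev(f1_scores)  # sample standard deviation
print(f"F1 = {mean_f1:.2f} +/- {std_f1:.2f}")  # prints "F1 = 92.18 +/- 0.19"
```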
But are you sure this should be interpreted as using train + testa for training, and not that they just use train for training and testa for validation during each epoch? This isn't consistent with their code 🤔
Yeah, I am sure that they mix train and testa after hyperparameter tuning. A number of papers have done this. For this dataset, there are 2 settings: train on train only, and train on train+testa.
For example, in https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00104:
As the dataset is small compared to Ontonotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set.
This isn't consistent with their code 🤔
Because (as I suppose) this is particular to this dataset and should not apply to other datasets.
Because (as I suppose) this is particular to this dataset and should not apply to other datasets
I mean that the implementation of flair and allennlp clearly separates each dataset and its role, never mixing them during training. Have you ever seen their code?
No, I didn't see the code, but I read their papers... and I worked on this dataset. As I said before, mixing train and dev is particular to and commonly used on this dataset.
You can communicate with the authors to check.
You are right!!! I thought that Peters et al. 2018 follows Peters et al. 2017. I removed Peters et al. 2018 from the first comment.
Thanks for the clarification, @matt-peters.
Hi @alanakbik , can you confirm the information from @ghaddarAbs ? From the code in flair, I'm under the impression that you also do not mix train and testa for training with CoNLL2003, but perhaps I missed something.
No, I didn't see the code, but I read their papers... and I worked on this dataset. As I said before, mixing train and dev is particular to and commonly used on this dataset.
You can communicate with the authors to check
Sorry @ghaddarAbs , I'm just being thorough because I'm writing a survey on NER 😄
Hi Alan, thanks for pitching in and for clarifying. :) I generally think that training on dev data is not a big issue (though it makes comparing against other results harder). One thing that I think is problematic is sampling the dev dataset from the test set. As far as I'm aware, either taking the dev set from the training data or using cross-validation are the common practices in this case.
Ah oops, yes that should read "dev dataset is sampled from the train set" of course - I typed too hastily. I'll edit the comment above to correct. The test set is never touched or sampled in any way during training / hyperparameter selection.
I generally think that training on dev data is not a big issue (though it makes comparing against other results harder).
Hi @sebastianruder, I didn't follow this part. Did you mean that it's easier or harder to make the comparison?
Not sure if this is what you meant, but I agree with @ghaddarAbs that the comparisons in these different scenarios should be kept separate, right?
Thanks for clarifying, Alan. :)
Pedro, sorry if I was being ambiguous. I meant that it makes it harder to compare results. I don't think we should have different tables, but feel free to add an asterisk to note if the dev set is used in a different way.
Ok, sure. @ghaddarAbs, so we should mark only those 3 that you are aware of?
Flair embeddings (Akbik et al., 2018)
Peters et al. (2017)
Yang et al. (2017)
Thanks!
@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017) yes, but I am not sure about Yang et al. (2017)... the text is ambiguous.
OK, I'll try contacting the author as well.
Also consider adding:

CoNLL 2003:
Model | F1 | Paper / Source | Code
Chiu and Nichols, 2016 | 91.62 | https://www.aclweb.org/anthology/Q16-1026 |

This paper has been cited 350+ times and it uses both train and dev for CoNLL as well.

OntoNotes v5:
Model | F1 | Paper / Source | Code
Chiu and Nichols, 2016 | 86.28 | https://www.aclweb.org/anthology/Q16-1026 |
@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017) yes, but I am not sure about Yang et al. (2017)... the text is ambiguous.
OK, @kimiyoung confirmed by e-mail:
Hi,
Yes we also used the dev set for training, just to be comparable to previous results that adopted this setting.
@sebastianruder I marked the papers that use train and dev with ♦ and added some results in a pull request. Feel free to close the issue.
Thanks for the thoroughness! :)