tigerchen52 / love

ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

License: MIT License

Languages: Python 90.56%, Perl 9.44%
Topics: language-model, out-of-vocabulary, robustness

love's People

Contributors: phongnt570, tigerchen52

love's Issues

Corrupted Extrinsic dataset

Hi,

I was wondering, when generating typos for the extrinsic tasks (Tables 3 and 4) in the paper, did you corrupt both the train and test datasets?

Thank you!

bert_text_classification

Hello, I have recently been trying to use BERT and LOVE for text classification. I have some questions about your latest released code:

  1. Is the embedding vector of each word in the file love.emb generated by the pre-trained model love_bert_base_uncased?
  2. I found that love.emb has 57,459 lines, while vocab.txt has 64,083. Why are they inconsistent? I thought LOVE was used to generate an embedding vector for each word in the vocabulary.
  3. How do I obtain the bert_emb argument of the function get_emb_for_text(text, bert_emb=None, embeddings=None, max_len=50)?

Thank you!
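For reference, embedding files such as love.emb commonly use the plain word2vec text format (one token followed by its vector values per line); a minimal loader sketch under that assumption (not the repo's actual code):

```python
def load_emb(path):
    """Load a whitespace-separated embedding file: `token v1 v2 ...` per line."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip header or blank lines
            emb[parts[0]] = [float(x) for x in parts[1:]]
    return emb
```

Comparing `len(load_emb("love.emb"))` against the line count of vocab.txt would confirm whether the 57,459 vs. 64,083 gap comes from filtered tokens.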

reproducing paper results with different typo probabilities

Hi!
I'm trying to reproduce results from your ACL22 paper for MLRC2022.
In your paper, you show that LOVE+FastText and LOVE+BERT are more robust to OOV words than their respective baselines.
However, I found no instructions for producing typos or a corrupted dataset, apart from the details in Appendix B.3, where you describe simulating post-OCR errors.

Would you be able to provide more insights on simulated typos?
I also found these paths in extrinsic/rnn_ner/gen_vocab.py:

    train_path = 'input/train.txt' 
    dev_path = 'typo_data/typo_dev.txt'
    test_path = 'typo_data/typo_test.txt'
    out_path = 'typo_data/typo_vocab.txt'

which make me believe that some folders or scripts for simulating typos are missing.
Thank you!
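Since the typo-generation script appears to be missing, here is a minimal sketch of how character-level typos are commonly simulated (an assumption about the general technique, not the authors' actual procedure or probabilities):

```python
import random

def add_typos(text, p=0.1, seed=0):
    """Corrupt each word with probability p using one random character edit:
    insertion, deletion, substitution, or adjacent-character swap."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for word in text.split():
        if len(word) > 1 and rng.random() < p:
            i = rng.randrange(len(word))
            op = rng.choice(["ins", "del", "sub", "swap"])
            if op == "ins":
                word = word[:i] + rng.choice(letters) + word[i:]
            elif op == "del":
                word = word[:i] + word[i + 1:]
            elif op == "sub":
                word = word[:i] + rng.choice(letters) + word[i + 1:]
            elif i < len(word) - 1:  # swap with the next character
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)
```

Running this over dev/test files with varying p would yield files shaped like typo_dev.txt and typo_test.txt, though the paper's exact error model (post-OCR simulation) may differ.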

Issue with Data Augmentation in LOVE Reproduction

Hello! I am currently trying to reproduce the LOVE model, but I have encountered an issue with data augmentation.

Specifically, the paper mentions that one of the data augmentation strategies is to replace the original word with a synonym. However, I noticed that the 'data/synonym.txt' file does not cover the full 2M-word vocabulary as expected.

Could you please provide the complete 'data/synonym.txt' file or, alternatively, share the code that can be used to generate this file? Thank you for your assistance!
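As a point of reference, synonym replacement is typically implemented along these lines (a sketch assuming a word-to-synonyms mapping of the kind synonym.txt provides; not the authors' code):

```python
import random

def synonym_augment(words, synonyms, p=0.3, seed=0):
    """Replace each word with a random synonym with probability p.
    `synonyms` maps a word to a list of substitute words."""
    rng = random.Random(seed)
    return [rng.choice(synonyms[w]) if w in synonyms and rng.random() < p else w
            for w in words]
```

Words absent from the mapping are left unchanged, which is why coverage of the synonym file matters for this strategy.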

How can I create hard negative samples?

Hi,

I was wondering about the hard negative samples in LOVE, where the top-100 similar words are extracted for each target word.
Do you build hard negative samples for every target word in the vector file?

Also, how can I create the similar words by edit distance for each target word?
Alternatively, could you please provide the hard negative sample file used in the LOVE framework?

Thank you!
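On the edit-distance part of the question, a common way to collect similar words is a brute-force Levenshtein scan over the vocabulary (a sketch of the general technique, not LOVE's implementation):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similar_words(target, vocab, k=100):
    """Return the k vocabulary words closest to target by edit distance."""
    return sorted((w for w in vocab if w != target),
                  key=lambda w: edit_distance(target, w))[:k]
```

For a large vocabulary this quadratic scan is slow; in practice one would prune by word length or use a BK-tree, but the output is the same top-k list.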

[Table 4 SST2 Task] BERT+LOVE Reproduction Issue

Hi,
I've been trying to reproduce the performance reported in the paper for the SST2 task using the 'BERT+LOVE' embedding you provided.
I tried changing various hyper-parameters and modifying the code, but I could not reproduce the paper's performance.

My reproduced performance is shown below:
[image]

Could you provide the code that performed the SST2 Task?

Thank you!

Train BERT Embedding

Hello! I am currently trying to reproduce the LOVE model.

Regarding this sentence in your paper: "For ease of implementation, we learn only from the words that are not separated into pieces."

As I understand it, in the vocab.txt file you provided, you did not use special tokens (e.g., "[PAD]", "[UNK]") or separated word pieces (e.g., ##ir, ##di).

Is my understanding correct?

Thank you!
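If that reading is correct, filtering the full BERT vocabulary down to whole words could be sketched as follows (a guess at the procedure, not the repo's code):

```python
def whole_words(vocab_lines):
    """Keep only whole-word tokens: drop special tokens such as [PAD]/[UNK]
    and WordPiece continuation pieces such as ##ir."""
    return [w for w in vocab_lines
            if not (w.startswith("[") and w.endswith("]"))
            and not w.startswith("##")]
```

Applying such a filter to the standard bert-base-uncased vocabulary would also explain a line-count gap between vocab.txt and an embedding file built only from whole words.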

Question on FastText baseline in the experiment

Hello! I have a question about a detail of the FastText baseline in Tables 2-4.
For this baseline, there are two choices for handling OOV words:

  1. Representing OOV words using null vectors
  2. Computing vectors for OOV words by summing the n-gram vectors.

In the context of Bojanowski et al. (2017)[1], the first option corresponds to the "sisg-" setting, while the second aligns with the "sisg" setting.
Could you please specify which option was utilized in your experiments?
My conjecture leans towards option 1), because option 2) does not seem to follow a mimick-like model.
Nonetheless, I would greatly appreciate your guidance on this matter.
Thank you in advance for your help!

[1] Bojanowski et al., Enriching Word Vectors with Subword Information, TACL 2017.
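For context, the difference between the two settings can be sketched in a few lines (toy n-gram vectors; `char_ngrams` and `oov_vector` are illustrative helpers, not FastText's actual code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams of a word wrapped in < and >."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vecs, dim, null_vector=False):
    """sisg-: return a zero vector for an OOV word.
    sisg: sum the vectors of the word's known character n-grams."""
    if null_vector:                      # the "sisg-" setting
        return [0.0] * dim
    vec = [0.0] * dim                    # the "sisg" setting
    for g in char_ngrams(word):
        for i, x in enumerate(ngram_vecs.get(g, [0.0] * dim)):
            vec[i] += x
    return vec
```

Under sisg- every OOV word collapses to the same null vector, which is why that setting is the natural baseline for a mimick-style comparison.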

Scores

Hi,

I was wondering how you got the intrinsic scores for KVQ-FH in Table 2. The scores reported in their paper (table 4) are much higher than the scores you report, and higher than the scores reported for LOVE.

Kind regards,
Stéphan

Extrinsic task code

Hi,

I was wondering where I could find the testing/training code for the extrinsic tasks, i.e., SST-2 and CoNLL-03.
Also, are the models included in the repository the models for which you report the scores in the paper?

Thanks!
