tigerchen52 / love

ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

License: MIT License

Languages: Python 90.56%, Perl 9.44%
Topics: language-model, out-of-vocabulary, robustness

love's People

Contributors: phongnt570, tigerchen52

love's Issues

Corrupted Extrinsic dataset

Hi,

I was wondering, when generating typos for the extrinsic tasks (Tables 3 and 4) in the paper, did you corrupt both the train and test datasets?

Thank you!

bert_text_classification

Hello, I have recently been trying to use BERT and LOVE for text classification. I have some questions about your latest released code:

  1. Is the embedding vector of each word in the file love.emb generated by the pre-trained model love_bert_base_uncased?
  2. I found that love.emb has 57,459 lines, while vocab.txt has 64,083. Why are they inconsistent? I thought LOVE was used to generate an embedding vector for each word in the vocabulary.
  3. How do I obtain the bert_emb argument of the function get_emb_for_text(text, bert_emb=None, embeddings=None, max_len=50)?

Thank you!
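For reference, embedding files such as love.emb commonly use the plain word2vec text format (one token followed by its vector values per line); a minimal loader sketch under that assumption (not the repo's actual code):

```python
def load_emb(path):
    """Load a whitespace-separated embedding file: `token v1 v2 ...` per line."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip header or blank lines
            emb[parts[0]] = [float(x) for x in parts[1:]]
    return emb
```

Comparing `len(load_emb("love.emb"))` against the line count of vocab.txt would confirm whether the 57,459 vs. 64,083 gap comes from filtered tokens.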

reproducing paper results with different typo probabilities

Hi!
I'm trying to reproduce results from your ACL22 paper for MLRC2022.
In your paper, you show that LOVE+FastText and LOVE+BERT are more robust to OOV words than their respective baselines.
However, I found no instructions for producing typos or a corrupted dataset, apart from the details in Appendix B.3, where you describe simulating post-OCR errors.

Would you be able to provide more insights on simulated typos?
I also found these paths in extrinsic/rnn_ner/gen_vocab.py:

    train_path = 'input/train.txt' 
    dev_path = 'typo_data/typo_dev.txt'
    test_path = 'typo_data/typo_test.txt'
    out_path = 'typo_data/typo_vocab.txt'

which make me believe that some folders or scripts for simulating typos are missing.
Thank you!
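Since the typo-generation script appears to be missing, here is a minimal sketch of how character-level typos are commonly simulated (an assumption about the general technique, not the authors' actual procedure or probabilities):

```python
import random

def add_typos(text, p=0.1, seed=0):
    """Corrupt each word with probability p using one random character edit:
    insertion, deletion, substitution, or adjacent-character swap."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for word in text.split():
        if len(word) > 1 and rng.random() < p:
            i = rng.randrange(len(word))
            op = rng.choice(["ins", "del", "sub", "swap"])
            if op == "ins":
                word = word[:i] + rng.choice(letters) + word[i:]
            elif op == "del":
                word = word[:i] + word[i + 1:]
            elif op == "sub":
                word = word[:i] + rng.choice(letters) + word[i + 1:]
            elif i < len(word) - 1:  # swap with the next character
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)
```

Running this over dev/test files with varying p would yield files shaped like typo_dev.txt and typo_test.txt, though the paper's exact error model (post-OCR simulation) may differ.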

Issue with Data Augmentation in LOVE Reproduction

Hello! I am currently trying to reproduce the LOVE model, but I have encountered an issue with data augmentation.

Specifically, the paper mentions that one of the data augmentation strategies is to replace the original word with a synonym. However, I noticed that the 'data/synonym.txt' file does not cover the full 2M-word vocabulary as expected.

Could you please provide the complete 'data/synonym.txt' file or, alternatively, share the code that can be used to generate this file? Thank you for your assistance!
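As a point of reference, synonym replacement is typically implemented along these lines (a sketch assuming a word-to-synonyms mapping of the kind synonym.txt provides; not the authors' code):

```python
import random

def synonym_augment(words, synonyms, p=0.3, seed=0):
    """Replace each word with a random synonym with probability p.
    `synonyms` maps a word to a list of substitute words."""
    rng = random.Random(seed)
    return [rng.choice(synonyms[w]) if w in synonyms and rng.random() < p else w
            for w in words]
```

Words absent from the mapping are left unchanged, which is why coverage of the synonym file matters for this strategy.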

How can I create hard negative samples?

Hi,

I was wondering about the hard negative samples in LOVE, where the top-100 similar words are extracted for each target word.
Do you build hard negative samples for every target word in the vector file?

Also, how can I create the similar words by edit distance for each target word?
Alternatively, could you please provide the hard negative sample file used in the LOVE framework?

Thank you!
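On the edit-distance part of the question, a common way to collect similar words is a brute-force Levenshtein scan over the vocabulary (a sketch of the general technique, not LOVE's implementation):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similar_words(target, vocab, k=100):
    """Return the k vocabulary words closest to target by edit distance."""
    return sorted((w for w in vocab if w != target),
                  key=lambda w: edit_distance(target, w))[:k]
```

For a large vocabulary this quadratic scan is slow; in practice one would prune by word length or use a BK-tree, but the output is the same top-k list.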

[Table 4 SST2 Task] BERT+LOVE Reproduction Issue

Hi,
I've been trying to reproduce the performance reported in the paper for the SST2 task using the 'BERT+LOVE' embedding you provided.
I tried changing various hyper-parameters and modifying the code, but I could not reproduce the paper's performance.

My reproduced performance is shown below:
[image]

Could you provide the code that performed the SST2 Task?

Thank you!

Train BERT Embedding

Hello! I am currently trying to reproduce the LOVE model.

Regarding this sentence in your paper: "For ease of implementation, we learn only from the words that are not separated into pieces."

As I understand it, in the vocab.txt file you provided, you did not use special tokens (e.g., "[PAD]", "[UNK]") or separated word pieces (e.g., ##ir, ##di).

Is my understanding correct?

Thank you!
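If that reading is correct, filtering the full BERT vocabulary down to whole words could be sketched as follows (a guess at the procedure, not the repo's code):

```python
def whole_words(vocab_lines):
    """Keep only whole-word tokens: drop special tokens such as [PAD]/[UNK]
    and WordPiece continuation pieces such as ##ir."""
    return [w for w in vocab_lines
            if not (w.startswith("[") and w.endswith("]"))
            and not w.startswith("##")]
```

Applying such a filter to the standard bert-base-uncased vocabulary would also explain a line-count gap between vocab.txt and an embedding file built only from whole words.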

Question on FastText baseline in the experiment

Hello! I have a question about a detail of the FastText baseline in Tables 2-4.
For this baseline, there are two choices for handling OOV words:

  1. Representing OOV words using null vectors
  2. Computing vectors for OOV words by summing the n-gram vectors.

In the context of Bojanowski et al. (2017)[1], the first option corresponds to the "sisg-" setting, while the second aligns with the "sisg" setting.
Could you please specify which option was utilized in your experiments?
My conjecture leans towards option 1), because option 2) does not seem to follow a mimick-like model.
Nonetheless, I would greatly appreciate your guidance on this matter.
Thank you in advance for your help!

[1] Bojanowski et al., Enriching Word Vectors with Subword Information, TACL 2017.
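For context, the difference between the two settings can be sketched in a few lines (toy n-gram vectors; `char_ngrams` and `oov_vector` are illustrative helpers, not FastText's actual code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams of a word wrapped in < and >."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vecs, dim, null_vector=False):
    """sisg-: return a zero vector for an OOV word.
    sisg: sum the vectors of the word's known character n-grams."""
    if null_vector:                      # the "sisg-" setting
        return [0.0] * dim
    vec = [0.0] * dim                    # the "sisg" setting
    for g in char_ngrams(word):
        for i, x in enumerate(ngram_vecs.get(g, [0.0] * dim)):
            vec[i] += x
    return vec
```

Under sisg- every OOV word collapses to the same null vector, which is why that setting is the natural baseline for a mimick-style comparison.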

Scores

Hi,

I was wondering how you got the intrinsic scores for KVQ-FH in Table 2. The scores reported in their paper (table 4) are much higher than the scores you report, and higher than the scores reported for LOVE.

Kind regards,
Stéphan

Extrinsic task code

Hi,

I was wondering where I could find the testing/training code for the extrinsic tasks, i.e., SST-2 and CoNLL-03.
Also, are the models included in the repository the models for which you report the scores in the paper?

Thanks!
