jasonwei20 / eda_nlp

Data augmentation for NLP, presented at EMNLP 2019

Home Page: https://arxiv.org/abs/1901.11196

Python 100.00%

nlp data-augmentation text-classification synonyms embeddings sentence classification rnn cnn swap

eda_nlp's Issues

About the test dataset

I am curious about your data process. I mean do you split the dataset to train and test datasets, then augment the train_dataset or augment all datasets firstly then split the dataset?
Because the only difference between these two processes is whether the test datasets include the augment data.
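One common way to keep augmented sentences out of the test set is to split first and then augment only the training portion. The sketch below illustrates that ordering. It is not a description of what the paper's experiments did, and `augment` is a hypothetical placeholder for any augmentation routine.

```python
import random

def augment(sentence):
    # Hypothetical placeholder: returns the original plus one fake variant.
    return [sentence, sentence + " (augmented)"]

def split_then_augment(rows, test_fraction=0.2, seed=0):
    """rows is a list of (label, sentence) pairs. Returns (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]        # held out, never augmented
    train_orig = shuffled[n_test:]
    # Only the training rows are expanded; each variant keeps its label.
    train = [(label, aug) for label, s in train_orig for aug in augment(s)]
    return train, test
```

With this ordering the test set contains only original sentences, so no augmented copy of a test sentence can appear in training.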

Possibility of masking some tokens?

I am working on a problem with heavily imbalanced datasets. I want to use this tool to augment the positive class in my dataset. The problem is some of the tokens are critical to the problem I am trying to solve and I would like to mask those tokens for this tool. Is that currently possible?
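A minimal sketch of what token masking could look like for one of the four operations, random deletion. The function name and the `protected` parameter are assumptions made for illustration. They are not part of the repository.

```python
import random

def random_deletion_protected(words, p, protected, rng=random):
    """Drop each word with probability p, but never drop a protected token."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if w in protected or rng.random() > p]
    # Fall back to the original sentence rather than returning nothing.
    return kept if kept else list(words)
```

The same idea (skip any index whose token is in `protected`) would carry over to replacement and swapping.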

performance gain of each eda method

Can I know the performance gain of each EDA method?

And in Synonym Replacement (SR), how can I get a similar word?

Augmenting non-english datasets

Hello,

The idea of implementing such data augmentations in such an easy script is superb. I would like to use it in some of my applications. My main concern is that my dataset is in Brazilian Portuguese (pt-BR), and I would like to know what I should do to adapt your code.

There are already some embeddings trained in Portuguese (link). With those available, is it easy to adjust your code?

Again, congrats for the great work.
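One way to approach another language is to make the synonym lookup pluggable and supply a language-specific source. The sketch below assumes a tiny hand-built Portuguese lexicon (`PT_SYNONYMS`, invented for illustration). In practice the candidates might come from a thesaurus or from nearest neighbours in an embedding space.

```python
import random

# Hypothetical toy lexicon; replace with a real Portuguese resource.
PT_SYNONYMS = {"carro": ["automóvel", "veículo"], "bonito": ["belo", "lindo"]}

def get_synonyms_pt(word):
    return PT_SYNONYMS.get(word, [])

def synonym_replacement(words, n, get_synonyms, rng=random):
    """Replace up to n words that have at least one synonym."""
    new_words = list(words)
    candidates = [i for i, w in enumerate(new_words) if get_synonyms(w)]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        new_words[i] = rng.choice(get_synonyms(new_words[i]))
    return new_words
```

Random swap and random deletion do not depend on a lexicon, so only the synonym-based operations and the stop word list need language-specific data.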

ValueError: empty range for randrange() (0,0, 0)

I have processed the data according to the data format you described. Here are my command and the resulting error:

python code/augment.py --input=train_50w.en --output=train_50w._augmented.txt --num_aug=1 --alpha_sr=0.05 --alpha_rd=0.05 --alpha_ri=0 --alpha_rs=0.05

Traceback (most recent call last):
  File "code/augment.py", line 75, in <module>
    gen_eda(args.input, output, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug)
  File "code/augment.py", line 64, in gen_eda
    aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug)
  File "/home/tool/eda_nlp-master/code/eda.py", line 201, in eda
    a_words = random_swap(words, n_rs)
  File "/home/tool/eda_nlp-master/code/eda.py", line 130, in random_swap
    new_words = swap_word(new_words)
  File "/home/tool/eda_nlp-master/code/eda.py", line 134, in swap_word
    random_idx_1 = random.randint(0, len(new_words)-1)
  File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 221, in randint
    return self.randrange(a, b+1)
  File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 199, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0,0, 0)
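The values `(0,0, 0)` in the message are consistent with `random.randint(0, len(new_words)-1)` being called on an empty word list, since `randint(0, -1)` asks for an empty range. A guarded swap might look like the sketch below. `random_swap_safe` is a hypothetical name, not the repository's function.

```python
import random

def random_swap_safe(words, n, rng=random):
    """Swap two distinct positions n times; short inputs are returned as-is."""
    new_words = list(words)
    if len(new_words) < 2:
        return new_words  # nothing to swap, and no call into randint/sample
    for _ in range(n):
        i, j = rng.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words
```

An input line can end up empty after cleaning even if the raw line is not blank, for example when it contains only digits or punctuation.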

Chinese text classification

Is this method suitable for Chinese text classification?

Question/Suggestions

Hi! I would like to say thank you for your contribution to the community. I've been going over your code, and have a few questions and potentially suggestions:

1. For the method "synonym_replacement()", why do you substitute all instances of a word with the selected synonym? For example, if you have the sentence "A B C B" and select B=X as the synonym substitution, you will always end up with "A X C X". There's no possibility of generating "A B C X" or "A X C B". Was that intentional? If not, that could be a potential improvement.

1.1) Another suggestion for "synonym_replacement()": Iterate through random indices of the list rather than the elements themselves. This will allow you to overwrite the words with their synonyms directly in the "new_words" list rather than creating a new list each time a substitution is made. Creating a new list can be expensive, especially if it's large or you're running it thousands of times.

2. For the method "add_words()", you repeatedly select random indices as opposed to creating a random ordering of the indices. Is there a reason for that? With the current code, there's the possibility that it selects the same index every time and never adds a word, even if the list contains a word with synonyms. Alternatively, you could shuffle a list of indices and iterate through the list until you find one with a synonym. This would ensure you're not redundantly getting synonyms for the same word over and over, and ensure you always insert a word when the input list contains a word with synonyms.
3. For the method "random_swap()", a check to ensure the word list has at least 2 words would prevent unnecessary computation in single-word cases (a small thing, but something I added in my code).
4. For the method "swap_words()", I basically have the same question as (2). In addition, I'm pretty sure you can greatly simplify the method to:

def swap_word(new_words):
    index1, index2 = random.sample(range(len(new_words)), 2)
    new_words[index1], new_words[index2] = new_words[index2], new_words[index1]
    return new_words

Please correct me if the above method doesn't do the same thing - it's what I'm using in my code and I'd prefer to not have missed something :)

5. I'm intrigued by this code snippet:

        # this is stupid but we need it, trust me
        sentence = ' '.join(new_words)
        new_words = sentence.split(' ')

Could you tell me what was happening that made this necessary? I'm sure you encountered some edge case that this fixed, but I'm curious what that was.
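One plausible explanation, offered here as an assumption rather than something confirmed in this thread: a synonym can be a multi-word phrase (the output "antiophthalmic factor" reported elsewhere in this list is an example), and after substitution it sits in the list as a single element. Joining and re-splitting turns it back into one token per element, which keeps later word counts and index-based operations consistent.

```python
# A word list in which one element is a two-word phrase.
new_words = ["let", "me", "start", "antiophthalmic factor", "task"]

# Join and split again so that every element is a single token.
retokenized = " ".join(new_words).split(" ")
```

Here `new_words` has 5 elements while `retokenized` has 6.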

6. For the method "eda()", when the user specifies "num_aug=0", what are you trying to do in the "else" block at the end of the method? The output ends up being anywhere from 0 to num_aug sentences, depending on chance. Statistically, you will get on average 1 augmentation back, but this isn't guaranteed.

This is all meant as constructive criticism, so I hope you don't feel I'm being negative towards your work. I'd love to hear your feedback on these questions and thoughts; the work I'm doing is highly related to this topic, so I'm curious about some of the decisions you made. Thanks!

Run augmentation on my dataset, unsure how to proceed.

What should I do in order to use my own dataset for the experiments? I placed my dataset in "data" folder, I augmented it but I don't know what to do after that. Are there any specific commands that I should use?

Thank you in advance!

Confirmation about data augmentation

Hi @jasonwei20 thanks for the great work!

I want to confirm my understanding of data augmentation in your paper:
In the experiment (RNN and CNN), do you ONLY use the output (the augmented data, train_aug_st.txt) for training? Or do you mix them (original training data + augmented training data from the original) and used them for training?
In other words, does your training data in the experiment (experiment +EDA) include the original training data?

Thanks.

How to change alpha?

Hi again,

It works like a charm!

Just a quick question: how do you change the alpha at runtime (as an argument of the command)? As I have seen in augment.py:

import argparse
ap = argparse.ArgumentParser()
ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data")
ap.add_argument("--output", required=False, type=str, help="output file of unaugmented data")
ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence")
args = ap.parse_args()

This does not seem to be possible.

Cheers,
M
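A minimal sketch of how an alpha option could be added to a parser like the one quoted above. The `--alpha` flag, its default, and its help text are written here as assumptions for illustration. Commands quoted in other reports in this list do pass `--alpha` or `--alpha_sr`-style flags, so later versions of the script appear to accept them.

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True, type=str)
    ap.add_argument("--output", required=False, type=str)
    ap.add_argument("--num_aug", required=False, type=int, default=9)
    # Assumed flag: fraction of words changed by each technique.
    ap.add_argument("--alpha", required=False, type=float, default=0.1,
                    help="fraction of words in each sentence to be changed")
    return ap

# Passing an explicit list makes the example runnable without a real command line.
args = build_parser().parse_args(["--input", "train.txt", "--alpha", "0.05"])
```

After parsing, `args.alpha` can be forwarded to whatever function performs the augmentation.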

hi,

Mechanism to choose between EDA tasks

Is there a mechanism to choose which types of augmentation are applied to my data? For example, sometimes you might want to apply only Synonym Replacement (SR) and Random Deletion (RD) and skip the other two techniques (Random Insertion and Random Swap), because they may completely change the label. One dataset I can think of is the Corpus of Linguistic Acceptability (CoLA), where I believe RI and RS will change the target label.

In the current implementation, the alpha argument is applied equally to each augmentation technique. Passing a separate alpha for each technique as a command-line argument would allow fine-grained control and help achieve the desired behaviour.
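A small sketch of the per-technique control being requested, assuming the convention that an alpha of 0 disables a technique. The `eda()` excerpt quoted elsewhere in this list computes `max(1, int(alpha * num_words))`, which would still apply one change at alpha 0, so the zero check below is an assumption and not the quoted behaviour.

```python
def changes_per_technique(num_words, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1):
    """Number of edits each technique should make for one sentence."""
    def n_changes(alpha):
        # alpha <= 0 means "skip this technique"; otherwise change at least one word.
        return 0 if alpha <= 0 else max(1, int(alpha * num_words))
    return {"sr": n_changes(alpha_sr),
            "ri": n_changes(alpha_ri),
            "rs": n_changes(alpha_rs)}
```

For a 20-word sentence, `changes_per_technique(20, alpha_sr=0.1, alpha_ri=0, alpha_rs=0)` requests two synonym replacements and no insertions or swaps.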

Meaning of 0 and 1

Hi,

Can you please explain why some sentences in the dataset have a 0 in front and some have a 1?

Thank you.

Strange occurrences with I'm and It's (apostrophes)

So I augmented my dataset and found some strange things happening:

    I'm standing on it
    It's in front of me

became

    i m standing on it
    it s in front of me

I don't know why it was split like that, and in fact I can't even reproduce this anymore. At first I thought that I had some weird apostrophes, but my tests don't confirm that.

Basically I am creating this issue to ask if anyone had something like this happen or knows a possible cause?

Need Code for paper "Good-Enough Example Extrapolation"

Hi Jason!
Sorry to interrupt you. I can't contact you via email. I have to try this place.

I am very interested in your EMNLP paper "Good-Enough Example Extrapolation", which provides me lots of inspirations.

When reading the paper, I have some questions :

1. You mentioned that "implement GE3 at this final max-pooled hidden layer, which has size 768. That is, the hidden-space augmentation method only updates classifier weights after the BERT encoder". Do you mean the weights of the transformer are frozen during training? This is a very important detail for reimplementing your paper.
2. GE3 needs to average the hidden vectors of all samples in the same class. So how is it implemented in mini-batch training? Or did you implement GE3 in a two-stage way: first use BERT to get all vectors, apply GE3 for feature augmentation, and then train a simple classifier on top of these features?
3. Could you please provide the source code? I am new to this area and I really want to study this method through code.

I would appreciate it if you could answer my questions and provide the source code. In fact, I am also quite interested in data augmentation and have cited your EDA and other works in my paper and working papers. I look forward to communicating with you! Thanks a lot.

experiments section

@jasonwei20 , hi, thank you very much for your project, and I've been waiting for your update. If you can update, thank you very sincerely.

How to use this?

Hi there,

First of all, great paper. I had thought of similar solutions for DA on text, but I'm glad someone put all of them together!

However, I can't seem to run it. First of all, the readme mentions python code/1_data_process.py but there is no such file.

After adding the a_, b_, etc. prefixes, I get the following error:

Using TensorFlow backend.
Traceback (most recent call last):
  File "code/a_1_data_process.py", line 28, in <module>
    gen_sr_aug(train_orig, output_file, alpha, n_aug)
  File "/myworkingdirectory/eda_nlp/code/methods.py", line 173, in gen_sr_aug
    writer = open(output_file, 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'size_data_f1/1_tiny/cr/train_sr_0.05.txt'

and similar to every possible suffix.

Thanks!
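The `FileNotFoundError` is raised by `open(output_file, 'w')`, which does not create missing parent directories such as `size_data_f1/1_tiny/cr/`. A minimal sketch of a workaround, assuming it is acceptable to create the directory on the fly (`open_for_writing` is a hypothetical helper, not part of the repository):

```python
import os

def open_for_writing(output_file):
    """Create the parent directory if needed, then open the file for writing."""
    parent = os.path.dirname(output_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return open(output_file, "w")
```

Creating the expected folders by hand before running the script would have the same effect.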

What is the role of label here?

From README:

You can easily write your own implementation, but this one takes input files in the format label\tsentence (note the \t). So for instance, your input file should look like this (example from stanford sentiment treebank):

What does the label signify here? What should I do if there are more than 2 classes?
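In the `label\tsentence` format, the part before the tab is presumably the class of the sentence, and an augmented sentence keeps the label of its source (a report elsewhere in this list shows the label 5 preserved on every output line). Under that reading, more than two classes simply means more distinct label values. A minimal parsing sketch, with `read_labeled_lines` as a hypothetical helper:

```python
from collections import Counter

def read_labeled_lines(lines):
    """Parse 'label<TAB>sentence' lines into (label, sentence) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        label, sentence = line.split("\t", 1)  # split on the first tab only
        rows.append((label, sentence))
    return rows

rows = read_labeled_lines(["0\tbad movie\n", "1\tgreat movie\n", "2\tit was fine\n"])
label_counts = Counter(label for label, _ in rows)
```

Labels are kept as strings here, so any naming scheme works as long as it contains no tab.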

interesting

    # this is stupid but we need it, trust me
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

License of this repo

Hi, thank you for the great repo. Could you add a license to it, like MIT or Apache 2.0? Thank you very much.

Performance on trec dataset drops significantly after using eda

model: BertForSequenceClassification
train_set size: 120 of 5452, using sklearn.model_selection.StratifiedShuffleSplit so that the class distribution matches that of the original train set:

def split(self, examples, test_size, train_size, n_splits=2, split_idx=0):
    label_map = {"ENTY": 0, "DESC": 1, 'LOC': 2, 'ABBR': 3, 'NUM': 4, 'HUM': 5}
    labels = [label_map[e.label] for e in examples]
    kf = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, train_size=train_size, random_state=C.get()['seed'])
    kf = kf.split(list(range(len(examples))), labels)
    for _ in range(split_idx + 1):  # split_idx equal to cv_fold. this loop is used to get i-th fold
        train_idx, valid_idx = next(kf)
    train_dev_set = np.array(examples)
    return list(train_dev_set[train_idx]), list(train_dev_set[valid_idx])

valid_set size: 180
test_set size: all(500)
alpha: 0.1

augment code:

labels = labels.repeat(n_aug)
aug_texts = [eda(text, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=n_aug - 1) for text in texts]
assert len(labels)==len(aug_texts)

I have checked that the labels and augmented texts are matched correctly.

result when n_aug=16 (the average of three experiments) :

before augment: 0.7975
after augment: 0.7657

result when n_aug=8 (the average of three experiments) :

before augment: 0.813
after augment: 0.7044

It's very confusing. In my experiments, EDA seems to perform well when texts are long (such as the imdb dataset), but performs poorly on datasets like trec and sst5. Did I make any mistakes in the experimental setup?
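One thing worth double-checking in a setup like this is the shape of the augmented data. In the quoted snippet, `aug_texts` holds one inner list per original text while `labels` is repeated `n_aug` times, so the two only line up after flattening. The excerpt does not show whether that happens elsewhere. A sketch that builds texts and labels together, with `toy_augment` as a hypothetical stand-in for the real augmenter:

```python
def augment_with_labels(texts, labels, augment, n_aug):
    """Return flat, aligned lists of augmented texts and their labels."""
    aug_texts, aug_labels = [], []
    for text, label in zip(texts, labels):
        variants = augment(text, n_aug)
        aug_texts.extend(variants)
        # Repeat the label once per variant actually produced.
        aug_labels.extend([label] * len(variants))
    return aug_texts, aug_labels

def toy_augment(text, n_aug):
    # Hypothetical stand-in that just tags each copy with an index.
    return [f"{text} #{k}" for k in range(n_aug)]
```

Deriving the label count from `len(variants)` avoids a mismatch when the augmenter returns fewer or more sentences than requested.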

empty range for randrange()

Hello! Thank you for sharing your code.

I got this error on one of my datasets, is this a known problem? I've checked the text file and there are no empty (zero-length or whitespace-only) lines.

Traceback (most recent call last):
  File "code/augment.py", line 55, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 44, in gen_eda
    aug_sentences = eda(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=num_aug)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 193, in eda
    a_words = random_insertion(words, n_ri)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 153, in random_insertion
    add_word(new_words)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 160, in add_word
    random_word = new_words[random.randint(0, len(new_words)-1)]
  File "/home/stc/miniconda3/lib/python3.7/random.py", line 222, in randint
    return self.randrange(a, b+1)
  File "/home/stc/miniconda3/lib/python3.7/random.py", line 200, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))

BERT + EDA ?

Will Random Swap (RS) and Random Deletion (RD) work well for BERT, given that BERT is based on contextual pre-training?

Thank you very much. @jasonwei20

A little suggestion about error exception

Hello! Thank you for all your contributions on the eda. It is pretty cool.
However, a line that contains only numbers, which is common in conversations (for example, when someone gives a cell phone number), raises an exception. I think the code could handle this case, for instance with a check in eda() right after get_only_chars() is applied.
It is only a suggestion, and thank you for your tools anyway.

Supported languages.

Does anyone have a list of supported languages for this module? I couldn't find it in the original paper.

Something about the best parameters

Hello Jason Wei, I read your paper about Easy Data Augmentation. Great paper and great idea.

I'm trying to reproduce your experimental results, but I don't know how you found the best alpha and num_aug.

In Figures 3 and 4 of the paper, you plot results for different alpha and different num_aug values. How did you choose alpha when you tested num_aug, and how did you choose num_aug when you tested different alpha values?

I checked the code and I find you set
"alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15"
in Figure 4, and
'size_data_f1/1_tiny': [16, 16, 16, 16, 16], 'size_data_f1/2_small': [16, 16, 16, 16, 16], 'size_data_f1/3_standard': [8, 8, 8, 8, 4], 'size_data_f1/4_full': [8, 8, 8, 8, 4]
in Figure 3.

Could you please explain how you set these parameters? thanks !!

Tips for Non-English Augmentation

Hi, I hope you're doing great. I've been using your code with English text for a while and now I need to implement it for Persian Language. (hopefully with minimal change!) Your work is truly impressive! Could you kindly provide some advice on customizing your code for Persian and what I need to change? Your insights would be invaluable.

Thanks a lot,

Could you please upload the data set?

I spent lots of time on it but I still couldn't find them except the SST dataset.
Could you please upload the data set or send it to [email protected]?
thanks!

Semi-supervised

In a semi-supervised setting, EDA is first applied to the labeled data.
If I use 500 examples in the fine-tuning stage, will using EDA improve the results? (The BERT model is the officially released pre-trained one.)
thank you very much!

Languages supported

I've seen the documentation refer to input text in English.

Does it scale to other languages too? Or what do you recommend for supporting other languages?

We can't get the 3 improvement rate

Hi,
We tried the PC and subj datasets with 500 examples each, and ran e_2_rnn_baseline.py and aug.py in experiment 'e'. Our augmentation number is 16. However, the results are not stable, sometimes lower than the baseline, and we didn't get the 3 improvement rate. We would like to know what parameters you used in your experiments. Thanks a lot!

Removal of apostrophes, hyphens and things.

So in eda.py you remove several things like:
line = line.replace("’", "")
line = line.replace("'", "")
line = line.replace("-", " ")

I was wondering why that is. While this augmentation method improved my results dramatically, I now need to somehow add data back in that lets the bot learn that "I'm" is the same as "I am", etc., as the data now only ever includes "im".
Is this a limitation of WordNet or something?
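One possible workaround is to expand contractions before the apostrophes are stripped, so that "I'm" becomes "i am" and not "im". The sketch below is an assumption about how a user might preprocess their own data. The `CONTRACTIONS` table is a tiny illustrative sample, and `strip_like_quoted_code` only mirrors the three `replace()` calls quoted above.

```python
# Small illustrative table; a real one would cover many more forms.
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "don't": "do not", "can't": "cannot"}

def expand_contractions(line):
    """Lowercase, normalise curly apostrophes, and expand known contractions."""
    line = line.replace("’", "'").lower()
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in line.split())

def strip_like_quoted_code(line):
    # Same three replacements as in the snippet quoted in this issue.
    line = line.replace("’", "")
    line = line.replace("'", "")
    line = line.replace("-", " ")
    return line
```

Running `expand_contractions` first and `strip_like_quoted_code` second leaves no apostrophes to remove in the expanded forms.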

random_insertion should take stop words into account

eda_nlp/code/eda.py

Line 151 in d75e8bd

new_words = words.copy()

In your documentation you say that stop words are excluded for random insertion, but the code does not do this.

As a result, a random insertion often does nothing for me, because the randomly chosen word is a stop word for which no synonym is found.
I think only the non-stop words should be taken into account here:

eda_nlp/code/eda.py

Line 160 in d75e8bd

random_word = new_words[random.randint(0, len(new_words)-1)]

Moderation of corpus might be required

While trying out your code, which was in fact very helpful in generating a lot of training data for my model, I found one of the generated sentences to be out of place.

Provided sentence:
5 let me start a task

Output:
5 let me start antiophthalmic factor a task
5 let me start a task
5 lashkar e taiba me start a task
5 let me kickoff a task
5 let task start a me
5 let me start a task
5 let me start a task
5 task me start a let
5 let me start a labor
5 antiophthalmic factor let me start a task
5 let me start a task

The phrase "lashkar e taiba" in the third output line is the name of a terrorist organization. You can find the details at the link below.
https://en.wikipedia.org/wiki/Lashkar-e-Taiba

If possible can you please consider removing that name from synonyms of word "let"
random word for let : lashkar e taiba

Parameters Used:
--num_aug=10 --alpha=0.01
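Until the synonym source itself is changed, a user-side blocklist is one way to keep unwanted phrases out of the output. This is a sketch under the assumption that the synonym candidates are available as a list of strings before one is chosen. `BLOCKED_PHRASES` and `filter_synonyms` are hypothetical names.

```python
# Example entry taken from the report above; extend as needed.
BLOCKED_PHRASES = {"lashkar e taiba"}

def filter_synonyms(synonyms, blocked=BLOCKED_PHRASES):
    """Drop any candidate whose lowercased form is on the blocklist."""
    return [s for s in synonyms if s.lower() not in blocked]
```

Applying the filter to the candidate list for "let" would leave only the ordinary synonyms available for replacement or insertion.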

Can you share the code used to plot the figures in your paper?

Random insertion is not excluding words from the stop words list

Hi, thanks for the code repository and the paper!

I think that the idea behind Easy Data Augmentation is helpful and I am planning to port/adapt it such that it is usable for the German language as well.

Based on your paper random insertion is done in the following way:

Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

However, looking at your implementation, stop words are not excluded:

def random_insertion(words, n):
	new_words = words.copy()
	for _ in range(n):
		add_word(new_words)
	return new_words

def add_word(new_words):
	synonyms = []
	counter = 0
	while len(synonyms) < 1:
		random_word = new_words[random.randint(0, len(new_words)-1)]
		synonyms = get_synonyms(random_word)
		counter += 1
		if counter >= 10:
			return
	random_synonym = synonyms[0]
	random_idx = random.randint(0, len(new_words)-1)
	new_words.insert(random_idx, random_synonym)

def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):
	
	sentence = get_only_chars(sentence)
	words = sentence.split(' ')
	words = [word for word in words if word is not '']
	num_words = len(words)
	
	augmented_sentences = []
	num_new_per_technique = int(num_aug/4)+1
	n_sr = max(1, int(alpha_sr*num_words))
	n_ri = max(1, int(alpha_ri*num_words))
	n_rs = max(1, int(alpha_rs*num_words))
         
       .........

	#ri
	for _ in range(num_new_per_technique):
		a_words = random_insertion(words, n_ri)
		augmented_sentences.append(' '.join(a_words))

        .........

Do you know how this affects the final results? Thanks!
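A sketch of how insertion could follow the paper's description by drawing the source word only from non-stop words. The stop word list and synonym table below are tiny stand-ins invented for illustration, and `add_word_no_stop` is not the repository's function.

```python
import random

STOP_WORDS = {"i", "me", "a", "the", "is"}  # tiny illustrative list
TOY_SYNONYMS = {"start": ["begin"], "task": ["job"],
                "a": ["antiophthalmic factor"]}  # hypothetical lookup table

def get_synonyms(word):
    return TOY_SYNONYMS.get(word, [])

def add_word_no_stop(new_words, rng=random):
    """Insert a synonym of a random non-stop word; return True on success."""
    candidates = [w for w in new_words if w not in STOP_WORDS]
    rng.shuffle(candidates)
    for word in candidates:
        synonyms = get_synonyms(word)
        if synonyms:
            # randint is inclusive, so the new word may also land at the end.
            new_words.insert(rng.randint(0, len(new_words)), rng.choice(synonyms))
            return True
    return False
```

With this filter, the stop word "a" can no longer contribute "antiophthalmic factor" to "let me start a task".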

Will the model be attacked by the adversarial examples?

In a Chinese text corpus, we can generate adversarial examples by random insertion (RI), random deletion (RD), or synonym replacement (SR). I am wondering whether training with EDA leaves a model, such as a text classifier, vulnerable to adversarial examples generated by RI, RD, or SR, which are the same operations EDA uses.
Can you explain this? I did some experiments and they show a decrease in performance.
Thank you very much!

Random Insertion can't insert word into the last position ?

random_idx = random.randint(0, len(new_words)-1)
For "I have a dream", it can't produce "I have a dream [RANDOM]".
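Since `random.randint` is inclusive on both ends and `list.insert(len(lst), x)` appends, widening the upper bound to `len(new_words)` would make the last position reachable. A minimal sketch, with `insert_anywhere` as a hypothetical helper:

```python
import random

def insert_anywhere(words, new_word, rng=random):
    """Insert new_word at a uniformly chosen position, including the end."""
    new_words = list(words)
    # Valid insert positions are 0..len(new_words); the last one appends.
    new_words.insert(rng.randint(0, len(new_words)), new_word)
    return new_words

words = "I have a dream".split()
appended = list(words)
appended.insert(len(appended), "[RANDOM]")  # same effect as append
```

With the original bound of `len(new_words)-1`, the inserted word always ends up before the last word.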