jasonwei20 / eda_nlp
Data augmentation for NLP, presented at EMNLP 2019
Home Page: https://arxiv.org/abs/1901.11196
I am curious about your data processing. Do you split the dataset into train and test sets first and then augment only the training set, or do you augment the whole dataset first and then split it?
The only difference between these two processes is whether the test set includes augmented data.
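To make the first of the two orderings concrete, here is a minimal sketch of "split first, then augment only the training rows". It does not claim to show how the paper's experiments were run; `split_then_augment` and `toy_augment` are hypothetical names, and `toy_augment` stands in for any augmenter that returns a list of sentences.

```python
import random


def split_then_augment(rows, augment_fn, test_fraction=0.2, seed=0):
    """Split (label, sentence) rows first, then augment only the training part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_rows = shuffled[:n_test]      # held out, never augmented
    train_rows = shuffled[n_test:]
    augmented_train = []
    for label, sentence in train_rows:
        # augment_fn is a placeholder for any augmenter returning a list of sentences
        for new_sentence in augment_fn(sentence):
            augmented_train.append((label, new_sentence))
    return augmented_train, test_rows


def toy_augment(sentence):
    # Hypothetical stand-in: the original sentence plus a reversed-word variant.
    words = sentence.split()
    return [sentence, " ".join(reversed(words))]


rows = [
    (1, "a great movie"),
    (0, "a dull plot"),
    (1, "loved every minute"),
    (0, "would not recommend"),
    (1, "truly wonderful acting"),
]
train, test = split_then_augment(rows, toy_augment, test_fraction=0.2, seed=0)
```

With this ordering the test rows are untouched originals, so the reported score is not inflated by near-duplicates of training sentences.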
I am working on a problem with heavily imbalanced datasets. I want to use this tool to augment the positive class in my dataset. The problem is some of the tokens are critical to the problem I am trying to solve and I would like to mask those tokens for this tool. Is that currently possible?
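One possible workaround is to use variants of the operations that skip a set of protected tokens. The sketch below shows this for random deletion only; `random_deletion_protected` and the `protected` set are hypothetical and are not part of the tool.

```python
import random


def random_deletion_protected(words, p, protected, rng=random):
    """Delete each word with probability p, but never delete a protected word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if w.lower() in protected or rng.random() > p]
    # Fall back to one random word if everything was deleted.
    return kept if kept else [rng.choice(words)]


protected = {"insulin", "mg"}  # hypothetical critical tokens
words = "patient was given 10 mg of insulin yesterday".split()
out = random_deletion_protected(words, p=0.5, protected=protected, rng=random.Random(0))
```

The same idea (filter the candidate positions before sampling) would apply to synonym replacement and random swap.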
Could you share the performance gain for each individual EDA method?
And in Synonym Replacement (SR), how can I get a similar word?
Hello,
the idea of implementing these data augmentations in such a simple script is superb. I would like to use it in some of my applications. My main concern is that my dataset is in Brazilian Portuguese, and I would like to know what I should do to adapt your code.
There are already some embeddings trained in Portuguese (link). Given those, is it easy to adjust your code?
Again, congrats on the great work.
I have processed the data according to the data format you described. Here are my command and the resulting error:
python code/augment.py --input=train_50w.en --output=train_50w._augmented.txt --num_aug=1 --alpha_sr=0.05 --alpha_rd=0.05 --alpha_ri=0 --alpha_rs=0.05
Traceback (most recent call last):
File "code/augment.py", line 75, in
gen_eda(args.input, output, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug)
File "code/augment.py", line 64, in gen_eda
aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug)
File "/home/tool/eda_nlp-master/code/eda.py", line 201, in eda
a_words = random_swap(words, n_rs)
File "/home/tool/eda_nlp-master/code/eda.py", line 130, in random_swap
new_words = swap_word(new_words)
File "/home/tool/eda_nlp-master/code/eda.py", line 134, in swap_word
random_idx_1 = random.randint(0, len(new_words)-1)
File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 221, in randint
return self.randrange(a, b+1)
File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 199, in randrange
raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0,0, 0)
Is this method suitable for Chinese text classification?
Hi! I would like to say thank you for your contribution to the community. I've been going over your code, and have a few questions and potentially suggestions:
1.1) Another suggestion for "synonym_replacement()": Iterate through random indices of the list rather than the elements themselves. This will allow you to overwrite the words with their synonyms directly in the "new_words" list rather than creating a new list each time a substitution is made. Creating a new list can be expensive, especially if it's large or you're running it thousands of times.
2) For the method "add_words()", you repeatedly select random indices as opposed to creating a random ordering of the indices - is there a reason for that? With the current code, there's the possibility that it selects the same index every time, and never adds a word even if the list contains a word with synonyms. Alternatively, you could shuffle a list of indices, and iterate through the list until you find one with a synonym. This would ensure you're not redundantly getting synonyms for the same word over and over, and ensure you always insert a word when the input list contains a word with synonyms.
3) For the method "random_swap()", a check to ensure the word list has at least 2 words would prevent unnecessary computation in single-word cases (small thing, but something I added in my code).
4) For the method "swap_words()", I basically have the same question as (2). In addition, I'm pretty sure you can greatly simplify the method to:
def swap_word(new_words):
    index1, index2 = random.sample(range(len(new_words)), 2)
    new_words[index1], new_words[index2] = new_words[index2], new_words[index1]
    return new_words
Please correct me if the above method doesn't do the same thing - it's what I'm using in my code and I'd prefer to not have missed something :)
# this is stupid but we need it, trust me
sentence = ' '.join(new_words)
new_words = sentence.split(' ')
Could you tell me what was happening that made this necessary? I'm sure you encountered some edge case that this fixed, but I'm curious what that was.
This is all meant as constructive criticism, so I hope you don't feel I'm being negative towards your work. I'd love to hear your feedback on these questions and thoughts; the work I'm doing is highly related to this topic, so I'm curious about some of the decisions you made. Thanks!
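The suggestions about shuffled indices and the two-word check can be sketched together as follows. This is an illustration under assumptions, not the repository's code: `TOY_SYNONYMS` is a hypothetical stand-in for the WordNet lookup, and choosing a random synonym and allowing insertion at the very end are choices made for this sketch.

```python
import random

# Hypothetical toy synonym table standing in for a WordNet lookup.
TOY_SYNONYMS = {"task": ["job", "chore"], "start": ["begin"]}


def get_synonyms(word):
    return TOY_SYNONYMS.get(word, [])


def swap_word(words, rng=random):
    """Swap two distinct positions in place; no-op for fewer than two words."""
    if len(words) < 2:
        return words
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return words


def add_word(words, rng=random):
    """Insert a synonym of the first word, in shuffled order, that has one.

    Returns True if a word was inserted, False otherwise.
    """
    indices = list(range(len(words)))
    rng.shuffle(indices)
    for idx in indices:
        synonyms = get_synonyms(words[idx])
        if synonyms:
            position = rng.randint(0, len(words))  # len(words) means "append"
            words.insert(position, rng.choice(synonyms))
            return True
    return False
```

Because every index is visited at most once, each word's synonyms are looked up at most once per call, and a word is always inserted when any word in the list has a synonym.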
What should I do in order to use my own dataset for the experiments? I placed my dataset in "data" folder, I augmented it but I don't know what to do after that. Are there any specific commands that I should use?
Thank you in advance!
Hi @jasonwei20 thanks for the great work!
I want to confirm my understanding of data augmentation in your paper:
In the experiments (RNN and CNN), do you use ONLY the output (the augmented data, train_aug_st.txt) for training? Or do you mix the original training data with the augmented data generated from it and use both for training?
In other words, does your training data in the experiment (experiment +EDA) include the original training data?
Thanks.
Hi again,
It works like a charm!
Just a quick question: how do you change the alpha at runtime (as a command-line argument)? As far as I can see from augment.py:
import argparse
ap = argparse.ArgumentParser()
ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data")
ap.add_argument("--output", required=False, type=str, help="output file of unaugmented data")
ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence")
args = ap.parse_args()
This does not seem to be possible.
Cheers,
M
Is there a mechanism to choose which types of augmentation to apply to my data? For example, sometimes you might want to apply only Synonym Replacement (SR) and Random Deletion (RD) while skipping the other two techniques (Random Insertion and Random Swap), because they may completely change the label. One dataset I can think of is the Corpus of Linguistic Acceptability (CoLA), where I believe RI and RS will change the target label.
In the current implementation the passed argument alpha is applied equally to each of the augmentations. Passing a separate alpha for each technique as an individual command-line argument would allow fine-grained control and would help achieve the desired task.
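A minimal sketch of what per-technique arguments could look like with argparse. The flag names follow the command shown earlier in this thread (`--alpha_sr`, `--alpha_ri`, `--alpha_rs`, `--alpha_rd`); the defaults and the `enabled_techniques` helper are assumptions made for illustration, and treating an alpha of 0 as "disabled" is a convention of this sketch only.

```python
import argparse


def build_parser():
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data")
    ap.add_argument("--output", required=False, type=str, help="output file of augmented data")
    ap.add_argument("--num_aug", required=False, type=int, default=9,
                    help="number of augmented sentences per original sentence")
    # One alpha per technique; the default of 0.1 is an assumption of this sketch.
    for name in ("alpha_sr", "alpha_ri", "alpha_rs", "alpha_rd"):
        ap.add_argument("--" + name, required=False, type=float, default=0.1,
                        help="strength of the " + name[-2:].upper() + " operation")
    return ap


def enabled_techniques(args):
    """Return only the techniques whose alpha is greater than zero."""
    alphas = {"sr": args.alpha_sr, "ri": args.alpha_ri,
              "rs": args.alpha_rs, "rd": args.alpha_rd}
    return {name: alpha for name, alpha in alphas.items() if alpha > 0}


# Example: keep SR and RD, switch off RI and RS.
args = build_parser().parse_args(
    ["--input", "train.txt", "--alpha_ri", "0", "--alpha_rs", "0"]
)
```

The augmentation loop would then need to skip any technique that is missing from `enabled_techniques(args)` rather than applying a minimum of one operation.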
Hi,
can you please explain why some sentences in the dataset have a 0 in front and some have a 1?
Thank you.
So I augmented my dataset and found some strange things happening:
Basically I am creating this issue to ask if anyone had something like this happen or knows a possible cause?
Hi Jason!
Sorry to interrupt you. I can't contact you via email. I have to try this place.
I am very interested in your EMNLP paper "Good-Enough Example Extrapolation", which has given me a lot of inspiration.
While reading the paper, I had some questions:
I would appreciate it if you could answer my questions and provide the source code. In fact, I am also quite interested in data augmentation and have cited EDA and your other work in my paper and working papers. I look forward to communicating with you! Thanks a lot.
@jasonwei20 , hi, thank you very much for your project, and I've been waiting for your update. If you can update, thank you very sincerely.
Hi there,
First of all, great paper. I had thought of similar solutions for DA on text, but I'm glad someone put all of them together!
However, I can't seem to run it. First of all, the readme mentions python code/1_data_process.py, but there is no such file.
After adding the a_, b_, etc. prefixes to the script names, I get the following errors:
Using TensorFlow backend.
Traceback (most recent call last):
File "code/a_1_data_process.py", line 28, in <module>
gen_sr_aug(train_orig, output_file, alpha, n_aug)
File "/myworkingdirectory/eda_nlp/code/methods.py", line 173, in gen_sr_aug
writer = open(output_file, 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'size_data_f1/1_tiny/cr/train_sr_0.05.txt'
and similar errors for each of the other scripts.
Thanks!
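For reference, `open(path, 'w')` raises `FileNotFoundError` when the parent directory of `path` does not exist, which matches the traceback above. A minimal sketch of creating the directory before writing is shown below; `open_for_writing` is a hypothetical helper, and whether the repository expects these directories to be created by some earlier step is not known from this thread.

```python
import os
import tempfile


def open_for_writing(output_file):
    """Open output_file for writing, creating missing parent directories first."""
    parent = os.path.dirname(output_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return open(output_file, "w")


# Usage inside a temporary directory so the sketch does not touch a real project tree.
base = tempfile.mkdtemp()
path = os.path.join(base, "size_data_f1", "1_tiny", "cr", "train_sr_0.05.txt")
with open_for_writing(path) as writer:
    writer.write("1\tan example line\n")
```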
From README:
You can easily write your own implementation, but this one takes input files in the format label\tsentence (note the \t). So for instance, your input file should look like this (example from stanford sentiment treebank):
What does label signify here? What should I do if I have more than 2 classes?
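As a hedged illustration of the label\tsentence format: the label appears to be the class of the sentence, and another example in this thread uses the label 5, which suggests that more than two classes can be written the same way. The helpers below are hypothetical and only show reading and writing such lines; the label is kept as an opaque string.

```python
def read_labeled_lines(lines):
    """Parse 'label<TAB>sentence' lines into (label, sentence) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside the sentence survive.
        label, sentence = line.split("\t", 1)
        rows.append((label, sentence))
    return rows


def write_labeled_lines(rows):
    """Format (label, sentence) tuples back into 'label<TAB>sentence' lines."""
    return [f"{label}\t{sentence}\n" for label, sentence in rows]


sample = [
    "0\tthe plot was dull\n",
    "3\twhere is the nearest station\n",
    "5\tlet me start a task\n",
]
rows = read_labeled_lines(sample)
```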
sentence = ' '.join(new_words)
new_words = sentence.split(' ')
Hi, thank you for the great repo. Could you add a license to it, like MIT or Apache 2.0? Thank you very much.
model: BertForSequenceClassification
train_set size: 120 of 5452, sampled with sklearn.model_selection.StratifiedShuffleSplit so that the class distribution matches that of the original train set:
def split(self, examples, test_size, train_size, n_splits=2, split_idx=0):
    label_map = {"ENTY": 0, "DESC": 1, 'LOC': 2, 'ABBR': 3, 'NUM': 4, 'HUM': 5}
    labels = [label_map[e.label] for e in examples]
    kf = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, train_size=train_size, random_state=C.get()['seed'])
    kf = kf.split(list(range(len(examples))), labels)
    for _ in range(split_idx + 1):  # split_idx equal to cv_fold. this loop is used to get i-th fold
        train_idx, valid_idx = next(kf)
    train_dev_set = np.array(examples)
    return list(train_dev_set[train_idx]), list(train_dev_set[valid_idx])
valid_set size: 180
test_set size: all(500)
alpha: 0.1
augment code:
labels = labels.repeat(n_aug)
aug_texts = [eda(text, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=n_aug - 1) for text in texts]
assert len(labels)==len(aug_texts)
I have checked that the labels and augmented texts are matched correctly.
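One thing worth double-checking in a setup like this is the shape of the augmented data: the list comprehension yields one entry per original text, while the repeated labels have `n_aug` entries per text, so the two only line up after flattening. The sketch below shows that alignment in plain Python; `fake_eda` is a hypothetical stand-in that assumes the augmenter returns a list of `n_aug` sentences per input.

```python
def align_labels(labels, augmented_per_text):
    """Flatten per-text lists of sentences and repeat each label to match."""
    flat_texts, flat_labels = [], []
    for label, variants in zip(labels, augmented_per_text):
        for text in variants:
            flat_texts.append(text)
            flat_labels.append(label)
    return flat_texts, flat_labels


def fake_eda(text, n_out):
    # Hypothetical stand-in for the augmenter: returns n_out variants of text.
    return [f"{text} #{i}" for i in range(n_out)]


labels = ["LOC", "NUM"]
texts = ["where is paris", "how many moons"]
n_aug = 3
augmented = [fake_eda(t, n_aug) for t in texts]
flat_texts, flat_labels = align_labels(labels, augmented)
```

Building both lists in the same loop avoids relying on the repeat order matching the flattening order.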
result when n_aug=16
(the average of three experiments) :
result when n_aug=8
(the average of three experiments) :
It's very confusing. In my experiments, EDA seems to perform well when texts are long (such as the imdb dataset) but performs poorly on datasets like trec and sst5. Did I make any mistakes in the experiment settings?
Hello! Thank you for sharing your code.
I got this error on one of my datasets, is this a known problem? I've checked the text file and there are no empty (zero-length or whitespace-only) lines.
Traceback (most recent call last):
File "code/augment.py", line 55, in <module>
gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
File "code/augment.py", line 44, in gen_eda
aug_sentences = eda(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=num_aug)
File "/home/user/Desktop/eda_nlp/code/eda.py", line 193, in eda
a_words = random_insertion(words, n_ri)
File "/home/user/Desktop/eda_nlp/code/eda.py", line 153, in random_insertion
add_word(new_words)
File "/home/user/Desktop/eda_nlp/code/eda.py", line 160, in add_word
random_word = new_words[random.randint(0, len(new_words)-1)]
File "/home/stc/miniconda3/lib/python3.7/random.py", line 222, in randint
return self.randrange(a, b+1)
File "/home/stc/miniconda3/lib/python3.7/random.py", line 200, in randrange
raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
Will Random Swap (RS) and Random Deletion (RD) work well for BERT, given that BERT is based on contextual pre-training?
Thank you very much. @jasonwei20
Hello! Thank you for all your contributions on the eda. It is pretty cool.
However, I think that if a line contains only numbers, which is common in conversations (for example when asking for a cell phone number), the code raises an exception that it could handle instead. One way to add this is to check for it in eda() right after the call to get_only_chars().
It is only a suggestion, and thank you for your tools anyway.
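The message "empty range for randrange() (0,0, 0)" seen in the tracebacks in this thread is what `random.randint(0, -1)` produces on those Python versions, that is, the word list was empty when an operation tried to pick an index. A minimal sketch of the suggested check follows; `get_only_letters` is a simplified, hypothetical stand-in for the cleaning step, and returning the original sentence unchanged is just one possible policy.

```python
import re


def get_only_letters(line):
    """Simplified, hypothetical cleaning step: keep only a-z and single spaces."""
    cleaned = re.sub(r"[^a-z ]", " ", line.lower())
    return re.sub(r" +", " ", cleaned).strip()


def safe_augment(sentence, augment_fn):
    """Call augment_fn only if at least one word survives cleaning."""
    words = [w for w in get_only_letters(sentence).split(" ") if w != ""]
    if not words:
        # Nothing to augment (e.g. a line of digits): return the line unchanged.
        return [sentence]
    return augment_fn(sentence)
```

For example, `safe_augment("138 0013 8000", some_augmenter)` returns the line as-is without ever calling the augmenter.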
Does anyone have a list of the languages supported by this module? I couldn't find it in the original paper.
Hello Jason Wei, great paper and great idea. I read your paper about Easy Data Augmentation.
I'm trying to reproduce your experimental results, but I don't know how you found the best alpha and num_aug.
In Figures 3 and 4 of the paper, you plot results for different values of alpha and num_aug. How did you choose alpha when testing num_aug, and how did you choose num_aug when testing different alphas?
I checked the code and I find you set
"alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15"
in Figure 4, and
'size_data_f1/1_tiny': [16, 16, 16, 16, 16], 'size_data_f1/2_small': [16, 16, 16, 16, 16], 'size_data_f1/3_standard': [8, 8, 8, 8, 4], 'size_data_f1/4_full': [8, 8, 8, 8, 4]
in Figure 3.
Could you please explain how you set these parameters? thanks !!
Hi, I hope you're doing great. I've been using your code with English text for a while and now I need to implement it for Persian Language. (hopefully with minimal change!) Your work is truly impressive! Could you kindly provide some advice on customizing your code for Persian and what I need to change? Your insights would be invaluable.
Thanks a lot,
I spent lots of time on it but I still couldn't find them except the SST dataset.
Could you please upload the data set or send it to [email protected]?
thanks!
In the semi-supervised field, first perform eda on labeled data
If I use 500 pieces of data in the fine-tuning stage, will the use of the eda method improve the results? (the bert model is officially trained)
thank you very much!
I've seen the documentation refer to input text in English.
Does it scale to other languages too? Or what do you recommend for supporting other languages?
Hi,
We tried the PC and subj datasets with 500 examples each, and ran e_2_rnn_baseline.py and aug.py in experiment 'e'. Our augmentation number is 16. However, the results are not stable and are sometimes lower than the baseline, and we did not get the 3 improvement rate. We would like to know what parameters you used in your experiments. Thanks a lot!
So in eda.py you remove several things like:
line = line.replace("’", "")
line = line.replace("'", "")
line = line.replace("-", " ")
And I was wondering why that is. While this augmentation method improved my results dramatically, I now need to somehow add data back that lets the bot learn that "I'm" is the same as "I am" etc., as the data now only ever includes "im".
Is this some limitation of WordNet or something?
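One possible workaround, independent of why the apostrophes are removed, is to expand contractions before the cleaning step runs, so that "I'm" becomes "i am" instead of "im". The sketch below is an assumption-laden illustration: `CONTRACTIONS` is a tiny hypothetical table, and some expansions are ambiguous in real text (for example "it's" can also mean "it has").

```python
import re

# Tiny hypothetical table; a real one would need many more entries.
CONTRACTIONS = {
    "i'm": "i am",
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(line):
    """Lowercase the line, normalise curly apostrophes, expand known contractions."""
    line = line.replace("’", "'").lower()
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1)], line)
```

Running this on each sentence before augmentation would keep both words in the data even though apostrophes are later stripped.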
Line 151 in d75e8bd
In your documentation you say that for the "insert" operation stop words are excluded.
In the code they are not.
I rarely get a random insertion hit, because the randomly chosen words are often stop words for which no synonyms are found.
You should take only the non-stop words into account here:
Line 160 in d75e8bd
While trying out your code, which was in fact very helpful in generating a lot of training data for my model, I found one of the generated sentences to be out of place.
Provided sentence:
5 let me start a task
Output:
5 let me start antiophthalmic factor a task
5 let me start a task
5 lashkar e taiba me start a task
5 let me kickoff a task
5 let task start a me
5 let me start a task
5 let me start a task
5 task me start a let
5 let me start a labor
5 antiophthalmic factor let me start a task
5 let me start a task
Text marked in bold is a terrorist organization. You can find the details in the link below.
https://en.wikipedia.org/wiki/Lashkar-e-Taiba
If possible, could you please consider removing that name from the synonyms of the word "let"?
random word for let : lashkar e taiba
Parameters Used:
--num_aug=10 --alpha=0.01
Hi, thanks for the code repository and the paper!
I think that the idea behind Easy Data Augmentation is helpful, and I am planning to port/adapt it so that it can be used for German as well.
Based on your paper random insertion is done in the following way:
Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
However, looking at your implementation, stop words are not excluded:
def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words

def add_word(new_words):
    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
    random_synonym = synonyms[0]
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)

def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):
    sentence = get_only_chars(sentence)
    words = sentence.split(' ')
    words = [word for word in words if word is not '']
    num_words = len(words)
    augmented_sentences = []
    num_new_per_technique = int(num_aug/4)+1
    n_sr = max(1, int(alpha_sr*num_words))
    n_ri = max(1, int(alpha_ri*num_words))
    n_rs = max(1, int(alpha_rs*num_words))
    .........
    #ri
    for _ in range(num_new_per_technique):
        a_words = random_insertion(words, n_ri)
        augmented_sentences.append(' '.join(a_words))
    .........
Do you know how this affects the final results? Thanks!
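For comparison, a variant that follows the paper's wording (pick a random non-stop word, insert one of its synonyms at a random position) could be sketched as below. This is not the repository's implementation: `STOP_WORDS` and `TOY_SYNONYMS` are tiny hypothetical tables, and a port to German would need its own stop-word list and synonym source.

```python
import random

STOP_WORDS = {"i", "me", "a", "the", "let", "is"}  # hypothetical, tiny list
# Hypothetical toy synonym table standing in for a WordNet-style lookup.
TOY_SYNONYMS = {"start": ["begin", "commence"], "task": ["job"], "let": ["allow"]}


def get_synonyms(word):
    return TOY_SYNONYMS.get(word, [])


def add_word_no_stop(words, rng=random):
    """Insert a synonym of a random non-stop word; return False if none qualifies."""
    candidates = [w for w in words if w not in STOP_WORDS and get_synonyms(w)]
    if not candidates:
        return False
    synonym = rng.choice(get_synonyms(rng.choice(candidates)))
    words.insert(rng.randint(0, len(words)), synonym)
    return True


def random_insertion(words, n, rng=random):
    """Apply up to n stop-word-aware insertions to a copy of words."""
    new_words = list(words)
    for _ in range(n):
        add_word_no_stop(new_words, rng)
    return new_words


original = "let me start a task".split()
result = random_insertion(original, 2, random.Random(0))
```

In this toy setup "let" has a synonym but is treated as a stop word, so "allow" is never inserted.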
In Chinese text corpora, we can generate adversarial examples by random insertion (RI), random deletion (RD) or synonym replacement (SR). I am wondering whether a model trained with EDA, such as a text classifier, can be attacked by adversarial examples generated with RI, RD or SR, the same operations EDA uses.
Can you explain this? Because I did some experiments and they show a decrease in performance.
Thank you very much!
random_idx = random.randint(0, len(new_words)-1)
For "I have a dream", it can't produce "I have a dream [RANDOM]".
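The reason is that `random.randint(a, b)` includes both endpoints and `list.insert(i, x)` inserts before index `i`, so an upper bound of `len(new_words)-1` never places the new word after the last one. A minimal sketch with the upper bound raised to `len(new_words)` follows; `insert_anywhere` is a hypothetical helper name.

```python
import random


def insert_anywhere(words, new_word, rng=random):
    """Insert new_word at a uniformly random position, including the very end."""
    new_words = list(words)
    position = rng.randint(0, len(new_words))  # inclusive upper bound means "append"
    new_words.insert(position, new_word)
    return new_words


words = "I have a dream".split()
rng = random.Random(0)
# Collect the positions where the marker lands over many draws.
positions = {
    insert_anywhere(words, "[RANDOM]", rng).index("[RANDOM]") for _ in range(200)
}
```

Position 4 corresponds to "I have a dream [RANDOM]", which the original bound can never produce.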