makcedward / nlpaug

4.3K stars · 41 watchers · 455 forks · 3.29 MB

Data augmentation for NLP

Home Page: https://makcedward.github.io/

License: MIT License

Languages: Python 39.66%, Jupyter Notebook 60.19%, Shell 0.16%
Topics: nlp, augmentation, machine-learning, artificial-intelligence, data-science, natural-language-processing, adversarial-attacks, adversarial-example, ai, ml

nlpaug's People

Contributors

abcp4, amitness, avostryakov, baskrahmer, bdalal, bp-high, buihuy1702, chandan047, chiragjn, drmatters, emrecncelik, hsm207, hwchase17, jbitton, jessicasousa, joaoantonioverdade, johngiorgi, karthikmurugadoss, litanlitudan, makcedward, markussagen, narayanacharya6, ricardopieper, rogier-stegeman, sakares, sami-bg, sebastian-sosa, sorrow321, usaiprashanth, vishxl


nlpaug's Issues

Citing Kobayashi (2018) for BertAug seems incorrect and misleading

Thank you for your work, I'm using your library for my experiments.

You cite Kobayashi (2018) as the reference for BertAug, but to me that does not look correct.
Although Kobayashi uses an (RNN-based) LM for similarity-based word replacement, the novelty of his work is the conditional constraint, which allows keeping the words that are important for classification (e.g. positivity-indicating words like "fantastic" for sentiment analysis).
You might want to read Sec. 2.3 of that paper.

As far as I understand, your code does not implement the conditional-constraint objective.
If you keep citing that reference, the objective should also be implemented.
Otherwise, I think the citation is misleading.

No 'tfidfaug_w2idf.txt' file

Under the model folder there is no 'tfidfaug_w2idf.txt' file, which is needed for TfIdfAug. Please upload it, thank you.

WordNetAug seems to only work for English

A simple change to the code may fix this:

synets.extend(self.model.synsets(pos[i][0], pos=word_pos, lang='por'))
And:
for candidate in synet.lemma_names(lang='por'):

With these changes it generates examples in Portuguese; otherwise it only generates examples in English.
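For reference, a minimal sketch of the same idea without patching the source, assuming the installed version's WordNet-backed synonym augmenter exposes a lang argument (newer releases appear to; treat the parameter name as an assumption and check your version's signature):

import nlpaug.augmenter.word as naw

# Assumption: lang takes an Open Multilingual WordNet language code, e.g. 'por' for Portuguese.
aug = naw.SynonymAug(aug_src='wordnet', lang='por')
print(aug.augment("eu gosto de programar"))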

UnicodeDecodeError on bert model

Hi, I tested the bert-base-uncased model, but I had already downloaded it beforehand, so the augmenter is instantiated like this:

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased-pytorch_model.bin', action="substitute")

But when I run the code, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Guide for word-level data augmentation

Hey Edward,
I'm exploring your algorithms one by one. I have tested almost all of the word-level augmentation methods, so I wanted to combine them into a pipeline that augments each row of a pandas DataFrame, but I got the error below (screenshot omitted).

The logic is very simple, the main function is similar as following code
def word_level_augmentation(dataframe, n_times):
    data_aug = []
    for _, row in dataframe.iterrows():
        text = row['text']
        label = int(row['label'])
        data_aug.append([text, temp_id])
        for _ in range(n_times):
            data_aug.append([random_word_aug.swap(text), label])
            data_aug.append([random_word_aug.delete(text), label])
            data_aug.append([spelling_aug.augment(text), label])

    return data_aug

Btw, I have used the same function for character-level augmentation successfully, so I'm wondering what I implemented incorrectly. Could you help me?
Thank you so much.

Short token problems / what happened to min_char?

Hey Edward --
Is there a reason you took out the min_char parameter from the RandomChar augmenter? At the moment it seems like it ignores augmentations made to short words and just returns the original text.

For example, if the augmenter would make "hi there" into "hW there" it just returns "hi there" instead. Seems like "min_char" is permanently set at 3.

Thanks!

Suggestion for GPT2 decoding speedup

According to the transformers documentation, the GPT-2 LM head supports an argument called past that speeds up decoding by reusing the attention tensors computed in previous steps. I have started changes in my fork to get some numbers:

chiragjn#2

I am not entirely sure how to get this working for XLNet, but there is a similar mems argument.

Would you accept such PR upstream if it speeds up decoding?

Producing deterministic results

I think the library should have means to produce the same results given the same inputs. However, I've observed that each time I run any augmenter, the results vary.

This is expected, but sometimes it might be interesting to be a bit more deterministic. I've done it using this piece of code:

import numpy as np
np.random.seed(1000)
import random
random.seed(1000)

This is less than ideal since it changes the seed globally. I wonder if there is a way to implement this properly.
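For what it's worth, a minimal sketch of one possible workaround until the library supports this natively: snapshot and restore the global random state around each call, so the seed is only pinned locally. This is a generic Python/NumPy pattern, not part of nlpaug's API.

import random
from contextlib import contextmanager

import numpy as np

@contextmanager
def fixed_seed(seed):
    # Save the current global states, pin the seed, and restore them afterwards.
    py_state = random.getstate()
    np_state = np.random.get_state()
    random.seed(seed)
    np.random.seed(seed)
    try:
        yield
    finally:
        random.setstate(py_state)
        np.random.set_state(np_state)

# Usage:
# with fixed_seed(1000):
#     augmented_text = aug.augment(text)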

PPDB's Synonym

Thanks for the package. In the example "textual_language_augmenter.ipynb", the first time I run the cell it works, but on repeated runs I get the following error.

Reproducible Example

aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
augmented_text = aug.augment(text)

Error Message

AttributeError                            Traceback (most recent call last)
<ipython-input-8-d5a9a70df104> in <module>()
      1 aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
----> 2 augmented_text = aug.augment(text)
      3 print("Original:")
      4 print(text)
      5 print("Augmented Text:")

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/base_augmenter.py in augment(self, data, n, num_thread)
     78                 # TODO: support multiprocessing for GPU
     79                 # https://discuss.pytorch.org/t/using-cuda-multiprocessing-with-single-gpu/7300
---> 80                 augmented_results = [action_fx(clean_data) for _ in range(n)]
     81             else:
     82                 augmented_results = self._parallel_augment(action_fx, clean_data, n=n, num_thread=num_thread)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/base_augmenter.py in <listcomp>(.0)
     78                 # TODO: support multiprocessing for GPU
     79                 # https://discuss.pytorch.org/t/using-cuda-multiprocessing-with-single-gpu/7300
---> 80                 augmented_results = [action_fx(clean_data) for _ in range(n)]
     81             else:
     82                 augmented_results = self._parallel_augment(action_fx, clean_data, n=n, num_thread=num_thread)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/augmenter/word/synonym.py in substitute(self, data)
     92 
     93         tokens = self.tokenizer(data)
---> 94         pos = self.model.pos_tag(tokens)
     95 
     96         aug_idxes = self._get_aug_idxes(pos)

AttributeError: 'dict' object has no attribute 'pos_tag'

I've downloaded ppdb-2.0-s-all from http://nlpgrid.seas.upenn.edu/PPDB/eng/ppdb-2.0-s-all.gz.

Guide for NER Augmentation

Thanks for sharing your work; I could not find any other NLP augmentation library.

Will this library help with augmenting NER data?

My data looks like this

Ryan B-PER
Dsouza B-PER
/DOB O
11/11/1997 B-DOB
/MALE O
22 B-NUM
56565 B-NUM

Thanks in advance

Avoid the text augmentation of a part of a sentence

Hello,

I would like to ask if it is possible to exclude a particular pattern in a sentence from text augmentation.

Example:
Original sentence: I would like to test @[the code]

How can I prevent the pattern @[the code] from being augmented?
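A minimal sketch of one possible approach, assuming the stopwords_regex parameter that the word augmenters expose (it appears in several signatures quoted on this page) is applied per token after whitespace tokenization; the exact regex and skip semantics are assumptions to verify against your version:

import nlpaug.augmenter.word as naw

text = "I would like to test @[the code]"

# Try to match both halves of the protected pattern ('@[the' and 'code]') so they are skipped.
aug = naw.SynonymAug(aug_src='wordnet', stopwords_regex=r'@\[\S*|\S*\]')
print(aug.augment(text))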

NameError: name 'model_dir' is not defined

Trying to run the below code:

# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir + 'GoogleNews-vectors-negative300.bin',
    action="insert")

It gives the following error:

NameError: name 'model_dir' is not defined

And I did include the following code at the top:

import os
os.environ["MODEL_DIR"] = '../model'

I also tried using MODEL_DIR instead of model_dir.

Reference: https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb
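For reference, the snippet expects a plain Python variable; a minimal sketch that defines it from the environment variable set above (the trailing path separator is an assumption about how the example joins paths):

import os
import nlpaug.augmenter.word as naw

os.environ["MODEL_DIR"] = '../model'
model_dir = os.environ["MODEL_DIR"] + '/'  # the plain variable the example refers to

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir + 'GoogleNews-vectors-negative300.bin',
    action="insert")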

Embedding issues

Hi Edward,

I am using the latest build from GitHub itself and have encountered another issue.
What if there is no index for a particular word? That was my first thought on seeing the error below.

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'thethe'

The above is acceptable, but it should be handled gracefully and is fixable. However, for the case below:

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'How'

This should not happen, as there is definitely an embedding for 'How'.

naw.RandomWordAug() doesn't have an aug_max parameter

Hi,
the __init__ function of naw.RandomWordAug doesn't take an aug_max parameter:

class RandomWordAug(WordAugmenter):
    def __init__(self, action=Action.DELETE, name='RandomWord_Aug', aug_min=1, aug_p=0.3, stopwords=None,
                 target_words=None, tokenizer=None, reverse_tokenizer=None, stopwords_regex=None, verbose=0):
        super().__init__(
            action=action, name=name, aug_p=aug_p, aug_min=aug_min, stopwords=stopwords,
            tokenizer=tokenizer, reverse_tokenizer=reverse_tokenizer, device='cpu', verbose=verbose,
            stopwords_regex=stopwords_regex)

        self.target_words = ['_'] if target_words is None else target_words

I don't know whether this was intentional or an oversight.

Bug in nlpaug.augmenter.char.RandomCharAug(action='swap')

The swap augmenter will often replace characters with other characters in the string instead of swapping them.

For example, running:

import nlpaug
import nlpaug.augmenter.char as nac
from collections import Counter

def char_count(word):
    return Counter(list(word))

swapper = nac.RandomCharAug(action='swap', swap_mode='random')

word = 'testing'
num_t = char_count('testing')['t']

iters=0

while num_t == char_count(word)['t'] and iters < 10000:
    word = swapper.augment(word)
    iters+=1
print(word, iters)

will take the string testing and often output something like:

ttniitg 5

where the output has added one t and one i while removing an e and an s, which is not the expected behavior of multiple swapping operations.

Proposed fix:

  • Remove .copy() method from line 140 of nlpaug/augmenter/char/random.py, or
  • Alternatively, remove the definition of original_chars variable altogether and only augment chars by referencing chars directly.

As it's written on line 151 of random.py, it appears that original_chars is never augmented, so that the characters at the swap indices of chars are being reassigned to those at the corresponding swap indices of original_chars. Since the characters in original_chars are never changed, after the first swap the method is replacing characters at the swap locations in chars with the original characters at those locations, which can end up duplicating some characters from the string, while erasing others.
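A minimal sketch of the intended behavior, independent of nlpaug's internals, showing what the proposed fix amounts to: swapping within a single list preserves the multiset of characters, so nothing is duplicated or lost.

import random

def swap_adjacent_chars(word, n_swaps=1):
    # Swap neighbouring characters in place on one list; the character counts never change.
    chars = list(word)
    for _ in range(n_swaps):
        i = random.randint(0, len(chars) - 2)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

# sorted(swap_adjacent_chars('testing', n_swaps=5)) == sorted('testing') always holds.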

Reducing the number of replaced characters

Hi Edward, thank you for this great work.

I'm trying to use the QwertyAug augmenter, but I'm having some issues controlling the number of replaced characters.

For instance:

aug = naf.Sequential([
    nac.QwertyAug()
])

aug.augment("qual o motivo do meu cartao ainda estar bloqueado?")

This results in far too many characters being replaced:

 wual 9 motibo di m3u carfao aindx esta# bloquezso?

I saw that there are some parameters that affect the number of replaced characters, but even if I pass aug_p = 0.00000000000000001, far too many characters still get replaced. Here is a test with the sentence "why is my credit card still blocked?":

'wh5 id hy crevit carw stilp blpcked?'

There's barely any meaning in this sentence anymore.

The aug_p seems to be the correct parameter. If I pass 1, then all characters change.

Is there a way to further reduce the chance that a given character gets replaced? Ideally, in a sentence with that size, I'd like to replace like 2 or 3 characters.
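A minimal sketch that caps the number of edits explicitly, using the aug_char_max / aug_word_max keywords that appear in the KeyboardAug signature quoted further down this page; whether your installed QwertyAug accepts the same keywords is an assumption, so this uses KeyboardAug:

import nlpaug.augmenter.char as nac

# Allow at most 2 character edits, in at most 2 words of the sentence.
aug = nac.KeyboardAug(aug_char_min=1, aug_char_max=2, aug_word_min=1, aug_word_max=2)
print(aug.augment("qual o motivo do meu cartao ainda estar bloqueado?"))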

NameError while using ContextualWordEmbsAug

I just successfully installed dependencies for ContextualWordEmbsAug and tried one of your examples.

pip install torch>=1.2.0 transformers>=2.0.0

Here is the example code that I used:

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

The error:
NameError: name 'BertTokenizer' is not defined
Do you have any idea why this happens?

Unable to load a custom model

Hi, I trained a custom model, but nlpaug is unable to load it. I think this could be solved by loading models via transformers' AutoModel and AutoTokenizer classes, so that new model types can be handled.

aug = naw.ContextualWordEmbsAug(
----> 4 model_path=model_name, action="substitute")

3 frames
/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in __init__(self, model_path, action, temperature, top_k, top_p, name, aug_min, aug_max, aug_p, stopwords, skip_unknown_word, device, force_reload, optimize, stopwords_regex, verbose)
91 self.model = self.get_model(
92 model_path=model_path, device=device, force_reload=force_reload, temperature=temperature, top_k=top_k,
---> 93 top_p=top_p, optimize=optimize)
94 # Override stopwords
95 if stopwords is not None and self.model_type in ['xlnet', 'roberta']:

/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in get_model(cls, model_path, device, force_reload, temperature, top_k, top_p, optimize)
269 def get_model(cls, model_path, device='cuda', force_reload=False, temperature=1.0, top_k=None, top_p=0.0,
270 optimize=None):
--> 271 return init_context_word_embs_model(model_path, device, force_reload, temperature, top_k, top_p, optimize)

/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in init_context_word_embs_model(model_path, device, force_reload, temperature, top_k, top_p, optimize)
28 model = nml.Roberta(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p)
29 elif 'bert' in model_path:
---> 30 model = nml.Bert(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p)
31 elif 'xlnet' in model_path:
32 model = nml.XlNet(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p, optimize=optimize)

/usr/local/lib/python3.6/dist-packages/nlpaug/model/lang_models/bert.py in __init__(self, model_path, temperature, top_k, top_p, device)
21 self.model_path = model_path
22
---> 23 self.tokenizer = BertTokenizer.from_pretrained(model_path)
24 self.model = BertForMaskedLM.from_pretrained(model_path)
25

NameError: name 'BertTokenizer' is not defined
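As a sanity check that the custom checkpoint itself loads outside nlpaug, the generic transformers auto classes can be tried first; this is plain transformers code, not nlpaug's API, and model_name is the same path passed above.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# If this fails, the checkpoint itself is the problem; if it succeeds, the issue is
# nlpaug guessing the model type (and tokenizer class) from the path name.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)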

performance issue with the sampling method

Hello, I found that the sampling method (line 99) in nlpaug/nlpaug/model/lang_models/language_models.py has poor performance.

top_n_ids = torch.multinomial(probas, num_samples=n, replacement=False).tolist()

torch.multinomial is rather slow, see pytorch/pytorch#11931

After changing it to NumPy, it is much faster: roughly 3.221 s/call -> 5e-6 s/call.

top_n_ids = np.random.choice(probas.size(0), n, False, probas.cpu().numpy()).tolist()

custom word augmenter example not working

When I try your example for a custom word augmenter, I get this error:

AttributeError                            Traceback (most recent call last)
in ()
     35 for token in tokens:
     36     print(token)
---> 37     print('{} --> {}'.format(token, aug.augment([token])[0]))

1 frames
/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/word_augmenter.py in clean(cls, data)
     29     @classmethod
     30     def clean(cls, data):
---> 31         return data.strip()
     32 
     33     def skip_aug(self, token_idxes, tokens):

AttributeError: 'list' object has no attribute 'strip'

I am using the beta version, i.e. cloned directly from your repo.

Bug in naw.RandomWordAug(action='swap')

If the text has only one word, naw.RandomWordAug(action='swap') will try to swap the words at positions 0 and 1 and then throw an error: IndexError: list index out of range. Maybe you can add a check so that if the text has only one word, it is returned directly.
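Until such a check exists in the library, a minimal caller-side guard is possible; this sketch assumes whitespace splitting is close enough for counting words.

import nlpaug.augmenter.word as naw

aug = naw.RandomWordAug(action='swap')

def safe_swap(text):
    # Swapping needs at least two tokens; otherwise return the text unchanged.
    if len(text.split()) < 2:
        return text
    return aug.augment(text)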

WordAugmenter._tokenizer can't remove excess spaces, which leads to an nltk error

Hi,

When there is excess space in a sentence, for example:
text = 'The  quick brown fox jumps over the lazy dog . 1  2'
it causes an index error in nltk because there will be empty tokens. The resulting tokens:
['The', '', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '', '2']

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/base_augmenter.py", line 61, in augment
    result = self.substitute(data)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/augmenter/word/synonym.py", line 83, in substitute
    pos = self.model.pos_tag(tokens)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/model/word_dict/wordnet.py", line 46, in pos_tag
    return nltk.pos_tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 261, in normalize
    elif word[0].isdigit():
IndexError: string index out of range

A quick fix could be as follows.
Original WordAugmenter._tokenizer in word_augmenter.py:
return text.split(' ')
Fix:
return [t for t in text.split(' ') if len(t) > 0]

The implementation of nlpaug.augmenter.word.AntonymAug() is inconsistent with the referenced paper

In the referenced paper Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models, the algorithm of Antonym Substitution is (page 6):

For Antonym, we modify the first verb, adjective or adverb that has an antonym.

But in nlpaug.augmenter.word.AntonymAug(), say the number of words we want to augment is 3: the function first randomly samples 3 candidate words from the original text and then searches for their antonyms. If none of these 3 candidates has an antonym, the text is not augmented. I think we should look for candidate words that actually have antonyms instead of sampling at random.

NameError: name 'BertTokenizer' is not defined

I got this error: NameError: name 'BertTokenizer' is not defined
when running the following code:

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)


Returns:

/usr/local/lib/python3.6/dist-packages/nlpaug/model/lang_models/bert.py in __init__(self, model_path, top_k, top_p, device)
75 self.model_path = model_path
76
---> 77 self.tokenizer = BertTokenizer.from_pretrained(model_path)
78 self.model = BertForMaskedLM.from_pretrained(model_path)
79

NameError: name 'BertTokenizer' is not defined

Hard dependency on librosa because of package level import

Minor issue, but a hard dependency on librosa still exists and is not listed in the package dependencies:

|     import nlpaug.augmenter.char as nac
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/__init__.py", line 2, in <module>
|     from nlpaug.base_augmenter import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 5, in <module>
|     from nlpaug.util import Action, Method, WarningException, WarningName, WarningCode, WarningMessage
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/__init__.py", line 7, in <module>
|     from nlpaug.util.visual import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/visual/__init__.py", line 1, in <module>
|     from nlpaug.util.visual.spectrogram import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/visual/spectrogram.py", line 2, in <module>
|     import librosa.display
| ModuleNotFoundError: No module named 'librosa'

BertAug on Insert action issue

Hi Edward,

I was exploring your repository, and kudos for all the awesome work you have done! But I ran into this issue when running naw.BertAug(action=Action.INSERT) in a loop over a large set of sentences:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-0b61f9f70462> in <module>
      1 # df['questions'].apply(lambda x: aug.augment(x) if len(x.split(' ')) > 4 and len(x.split(' ')) < 20 else x)
      2 for i in range(len(df.index)):
----> 3     print(df['questions'][i],len(df['questions'][i].split(' ')),aug.augment(df['questions'][i]) if len(df['questions'][i].split(' ')) > 4 and len(df['questions'][i].split(' ')) < 20 else df['questions'][i])

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/flow/sometimes.py in augment(self, text)
     20                     continue
     21 
---> 22                 augmented_text = aug.augment(augmented_text)
     23 
     24             results.append(augmented_text)

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/base_augmenter.py in augment(self, tokens)
     42     def augment(self, tokens):
     43         if self.action == Action.INSERT:
---> 44             return self.insert(tokens)
     45         elif self.action == Action.SUBSTITUTE:
     46             return self.substitute(tokens)

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/augmenter/word/bert.py in insert(self, text)
     60         for aug_idx in aug_idexes:
     61             results.insert(aug_idx, nml.Bert.MASK)
---> 62             new_word = self.sample(self.model.predict(results, nml.Bert.MASK, self.aug_n), 1)[0]
     63             results[aug_idx] = new_word
     64 

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/base_augmenter.py in sample(self, x, num)
     67 
     68     def sample(self, x, num):
---> 69         return random.sample(x, num)
     70 
     71     def generate_aug_cnt(self, size):

~/bot/bot-ml/lib/python3.7/random.py in sample(self, population, k)
    319         n = len(population)
    320         if not 0 <= k <= n:
--> 321             raise ValueError("Sample larger than population or is negative")
    322         result = [None] * k
    323         setsize = 21        # size of a small set minus size of an empty list

ValueError: Sample larger than population or is negative

If I understand correctly, sample gets k = 1, with the population being the result of self.model.predict(results, nml.Bert.MASK, self.aug_n) and n being the length of that population; if the model returns fewer than 1 prediction, it fails with the above error.
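Until the empty-prediction case is handled inside the library, a minimal caller-side workaround is to skip the sentences that trigger it and keep the originals; this is generic error handling, not a fix for the underlying sampling.

def augment_or_keep(aug, text):
    # The insert action can raise ValueError when the model returns no candidates;
    # fall back to the original sentence in that case.
    try:
        return aug.augment(text)
    except ValueError:
        return text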

UnpicklingError: invalid load key, '<'.

Hello, when I use sentence augmentation, I get the following error (screenshot of the UnpicklingError omitted):

I just want to download the pre-trained GPT-2 model and try to augment a sentence.

nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')

Stopwords

Hello, it seems like the stopwords aren't being filtered correctly (screenshot omitted):

The word 'quick' is not being ignored. It would be nice if the augmenter simply skipped over stopwords.

WordNetAug - index error

Hello, when I pass text that contains one-character words, like 'I a', to the WordNet synonym replacement, the following bug occurs:

text= "I work in a middle school"
aug = naw.WordNetAug()
augmented_text = aug.augment(text)

(screenshot of the index error omitted)

Bert - pick() in language_models.py can return no candidates

For BERT, the get_candidiates call in the pick function

    def pick(self, logits, target_word, n=1):
        candidate_ids, candidate_probas = self.prob_multinomial(logits, n=n*10)
        results = self.get_candidiates(candidate_ids, candidate_probas, target_word, n)

        return results

can return no candidates when all sampled tokens are sub-tokens (i.e. start with '##').

Should pick return a non-skip token greedily by looking at the logits when results is empty?

`augment_batch` for BertAug, or a GPU option

Thank you for your development.
I have used BertAug for my preliminary experiments.
Since BERT on CPU is awfully slow, it would be great if BertAug had a method to augment examples in a batch so that GPUs can be used.
Just a suggestion for a sort of production-level use.
Thanks!
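For what it's worth, the newer contextual word augmenter already accepts a device argument (it appears in the __init__ signature quoted elsewhere on this page), so a single GPU can be used even without batching; a minimal sketch, assuming BertAug's functionality now lives in ContextualWordEmbsAug:

import nlpaug.augmenter.word as naw

# Runs the underlying BERT model on the GPU; batching across examples would still help further.
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action='substitute', device='cuda')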

ValueError: Sample larger than population or is negative

Hi,

I have a small dataset that I am trying to augment. For some of the questions, I am getting the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-337-336aea02b7a2> in <module>
      2 print(len(text))
      3 aug = naw.BertAug(action="insert")
----> 4 augmented_text = aug.augment(text)
      5 print("Original:")
      6 print(text)

~/anaconda3/lib/python3.7/site-packages/nlpaug/base_augmenter.py in augment(self, data)
     69 
     70         if self.action == Action.INSERT:
---> 71             return self.insert(data)
     72         elif self.action == Action.SUBSTITUTE:
     73             return self.substitute(data)

~/anaconda3/lib/python3.7/site-packages/nlpaug/augmenter/word/bert.py in insert(self, data)
     85         for aug_idx in aug_idxes:
     86             results.insert(aug_idx, nml.Bert.MASK)
---> 87             new_word = self.sample(self.model.predict(results, nml.Bert.MASK, self.aug_n), 1)[0]
     88             results[aug_idx] = new_word
     89 

~/anaconda3/lib/python3.7/site-packages/nlpaug/base_augmenter.py in sample(cls, x, num)
    109     @classmethod
    110     def sample(cls, x, num):
--> 111         return random.sample(x, num)
    112 
    113     def generate_aug_cnt(self, size, aug_p=None):

~/anaconda3/lib/python3.7/random.py in sample(self, population, k)
    319         n = len(population)
    320         if not 0 <= k <= n:
--> 321             raise ValueError("Sample larger than population or is negative")
    322         result = [None] * k
    323         setsize = 21        # size of a small set minus size of an empty list

ValueError: Sample larger than population or is negative

After some research, I came across this https://stackoverflow.com/questions/20861497/sample-larger-than-population-in-random-sample-python
but I am still not sure what exactly the issue is. It works sometimes but other times it returns this error. Is it something to do with my questions? Is there a specific format I need to follow for the questions?

Any help would be much appreciated.

File not found in /model/char/keyboard/en.json

Hello! I'm encountering this error when using KeyboardAug:

Minimal Working Example

import nlpaug.augmenter.char as nac
aug = nac.KeyboardAug()
aug.augment("hello world")

Exception

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-24-c96cc55b7113> in <module>
----> 1 augment(3, nac.KeyboardAug())

~/anaconda3/lib/python3.6/site-packages/nlpaug/augmenter/char/keyboard.py in __init__(self, name, aug_char_min, aug_char_max, aug_char_p, aug_word_p, aug_word_min, aug_word_max, stopwords, tokenizer, reverse_tokenizer, special_char, numeric, upper_case, lang, verbose, stopwords_regex)
     53         self.upper_case = upper_case
     54         self.lang = lang
---> 55         self.model = self.get_model(special_char, numeric, upper_case, lang)
     56 
     57     def skip_aug(self, token_idxes, tokens):

~/anaconda3/lib/python3.6/site-packages/nlpaug/augmenter/char/keyboard.py in get_model(cls, special_char, numeric, upper_case, lang)
     99     @classmethod
    100     def get_model(cls, special_char=True, numeric=True, upper_case=True, lang="en"):
--> 101         return nmc.Keyboard(special_char=special_char, numeric=numeric, upper_case=upper_case, lang=lang)

~/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/keyboard.py in __init__(self, special_char, numeric, upper_case, cache, lang)
     18         self.lang = lang
     19         self.model = self.get_model(
---> 20             model_dir=self.model_dir, special_char=special_char, numeric=numeric, upper_case=upper_case, lang=lang)
     21 
     22     def predict(self, data):

~/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/keyboard.py in get_model(cls, model_dir, special_char, numeric, upper_case, lang)
     31 
     32         model_path = os.path.join(model_dir, lang+'.json')
---> 33         with open(model_path, encoding="utf8") as f:
     34             mapping = json.load(f)
     35 

FileNotFoundError: [Errno 2] No such file or directory: '/home/ljvm/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/../../../model/char/keyboard/en.json'

Thoughts

  • I think the relative path ../../../ resolves differently when nlpaug is installed as a package.

Thank you!

Exception when loading word embedding models with lines containing 2 words

Some model files contain embeddings whose keys span multiple words (e.g. the NILC embeddings for Portuguese), which causes the model-loading code to crash. For instance, a line in the model file might contain this:

Hey there 0.001 0.0003 0.86245 ........

The same does not happen in spaCy, for instance.

I fixed it in my local dev environment and might make a pull request later.

Swap augmentation doesn't preserve internal word casing

Code to reproduce:

import nlpaug.augmenter.word as naw

aug = naw.RandomWordAug(action='swap')
aug.augment('aA bB')
>>> 'bb aa'

# real-world use case
aug.augment('I love McDonalds')
>>> 'LOVE i McDonalds'
aug.augment('I love McDonalds')
>>> 'I mcdonalds Love'

P.S. thank you very much for the amazing library

Part of speech when mapping synonyms

Hi,

I would like to know whether the part of speech is taken into consideration when mapping a word to its synonyms using word2vec, GloVe, and fastText.

Thanks

BERTAug affects proper nouns

Hi,

I have been using your BertAug for text augmentation. It works fine on a lot of tasks, but it messes up proper nouns.

(screenshot omitted)

Is there any fix for this?
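One possible mitigation, as a minimal sketch: pass the proper nouns you want preserved as stopwords, or use a stopwords_regex for capitalized tokens; both parameters appear in the ContextualWordEmbsAug signature quoted earlier on this page. The example names and the regex are placeholders, not part of the library.

import nlpaug.augmenter.word as naw

# Skip explicitly listed names, and (roughly) any capitalized token via the regex.
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action='substitute',
    stopwords=['Edward', 'London'],
    stopwords_regex=r'[A-Z][a-z]+')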
