makcedward / nlpaug

4.3K stars · 41 watchers · 455 forks · 3.29 MB

Data augmentation for NLP

Home Page: https://makcedward.github.io/

License: MIT License

Languages: Python 39.66%, Jupyter Notebook 60.19%, Shell 0.16%
Topics: nlp, augmentation, machine-learning, artificial-intelligence, data-science, natural-language-processing, adversarial-attacks, adversarial-example, ai, ml

nlpaug's People

Contributors

abcp4, amitness, avostryakov, baskrahmer, bdalal, bp-high, buihuy1702, chandan047, chiragjn, drmatters, emrecncelik, hsm207, hwchase17, jbitton, jessicasousa, joaoantonioverdade, johngiorgi, karthikmurugadoss, litanlitudan, makcedward, markussagen, narayanacharya6, ricardopieper, rogier-stegeman, sakares, sami-bg, sebastian-sosa, sorrow321, usaiprashanth, vishxl


nlpaug's Issues

Citing Kobayashi (2018) for BertAug seems incorrect and misleading

Thank you for your work, I'm using your library for my experiments.

You cite Kobayashi (2018) as the reference for BertAug, but to me that does not look correct.
Although Kobayashi uses an (RNN-based) LM for similarity-based word replacement, the novelty of his work is the conditional constraint, which allows keeping the words that are important for classification (e.g. positivity-indicating words like "fantastic" for sentiment analysis).
You might want to read Sec. 2.3 of that paper.

As far as I understand, your code does not implement the conditional-constraint objective.
If you keep citing that reference, the objective should also be implemented.
Otherwise, I think the citation is misleading.

No 'tfidfaug_w2idf.txt' file

Under the model folder there is no 'tfidfaug_w2idf.txt' file, which is needed for TfIdfAug. Please upload it, thank you.

WordNetAug seems to only work for English

A simple change to the code may fix this:

synets.extend(self.model.synsets(pos[i][0], pos=word_pos, lang='por'))
And:
for candidate in synet.lemma_names(lang='por'):

With these changes it generates examples in Portuguese; otherwise it only generates examples in English.
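For reference, a minimal sketch of the same idea without patching the source, assuming the installed version's WordNet-backed synonym augmenter exposes a lang argument (newer releases appear to; treat the parameter name as an assumption and check your version's signature):

import nlpaug.augmenter.word as naw

# Assumption: lang takes an Open Multilingual WordNet language code, e.g. 'por' for Portuguese.
aug = naw.SynonymAug(aug_src='wordnet', lang='por')
print(aug.augment("eu gosto de programar"))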

UnicodeDecodeError on bert model

Hi, I tested the bert-base-uncased model, but I had already downloaded it beforehand, so the augmenter is instantiated like this:

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased-pytorch_model.bin', action="substitute")

But when I run the code, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Guide for word-level data augmentation

Hey Edward,
I'm exploring your algorithms one by one. I have tested almost all of the word-level augmentation methods, so I wanted to combine them into a pipeline that augments each row of a pandas DataFrame, but I got the error below (screenshot omitted).

The logic is very simple, the main function is similar as following code
def word_level_augmentation(dataframe, n_times):
    data_aug = []
    for _, row in dataframe.iterrows():
        text = row['text']
        label = int(row['label'])
        data_aug.append([text, temp_id])
        for _ in range(n_times):
            data_aug.append([random_word_aug.swap(text), label])
            data_aug.append([random_word_aug.delete(text), label])
            data_aug.append([spelling_aug.augment(text), label])

    return data_aug

Btw, I have used the same function for character-level augmentation successfully, so I'm wondering what I implemented incorrectly. Could you help me?
Thank you so much.

Short token problems / what happened to min_char?

Hey Edward --
Is there a reason you took out the min_char parameter from the RandomChar augmenter? At the moment it seems like it ignores augmentations made to short words and just returns the original text.

For example, if the augmenter would make "hi there" into "hW there" it just returns "hi there" instead. Seems like "min_char" is permanently set at 3.

Thanks!

Suggestion for GPT2 decoding speedup

According to the transformers documentation, the GPT-2 LM head supports an argument called past that speeds up decoding by reusing the attention tensors computed in previous steps. I have started changes in my fork to get some numbers:

chiragjn#2

I am not entirely sure how to get this working for XLNet, but there is a similar mems argument.

Would you accept such PR upstream if it speeds up decoding?

Producing deterministic results

I think the library should have means to produce the same results given the same inputs. However, I've observed that each time I run any augmenter, the results vary.

This is expected, but sometimes it might be interesting to be a bit more deterministic. I've done it using this piece of code:

import numpy as np
np.random.seed(1000)
import random
random.seed(1000)

This is less than ideal since it changes the seed globally. I wonder if there is a way to implement this properly.
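For what it's worth, a minimal sketch of one possible workaround until the library supports this natively: snapshot and restore the global random state around each call, so the seed is only pinned locally. This is a generic Python/NumPy pattern, not part of nlpaug's API.

import random
from contextlib import contextmanager

import numpy as np

@contextmanager
def fixed_seed(seed):
    # Save the current global states, pin the seed, and restore them afterwards.
    py_state = random.getstate()
    np_state = np.random.get_state()
    random.seed(seed)
    np.random.seed(seed)
    try:
        yield
    finally:
        random.setstate(py_state)
        np.random.set_state(np_state)

# Usage:
# with fixed_seed(1000):
#     augmented_text = aug.augment(text)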

PPDB's Synonym

Thanks for the package. In the example "textual_language_augmenter.ipynb", the first time I run the cell it works, but on repeated runs I get the following error.

Reproducible Example

aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
augmented_text = aug.augment(text)

Error Message

AttributeError                            Traceback (most recent call last)
<ipython-input-8-d5a9a70df104> in <module>()
      1 aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
----> 2 augmented_text = aug.augment(text)
      3 print("Original:")
      4 print(text)
      5 print("Augmented Text:")

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/base_augmenter.py in augment(self, data, n, num_thread)
     78                 # TODO: support multiprocessing for GPU
     79                 # https://discuss.pytorch.org/t/using-cuda-multiprocessing-with-single-gpu/7300
---> 80                 augmented_results = [action_fx(clean_data) for _ in range(n)]
     81             else:
     82                 augmented_results = self._parallel_augment(action_fx, clean_data, n=n, num_thread=num_thread)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/base_augmenter.py in <listcomp>(.0)
     78                 # TODO: support multiprocessing for GPU
     79                 # https://discuss.pytorch.org/t/using-cuda-multiprocessing-with-single-gpu/7300
---> 80                 augmented_results = [action_fx(clean_data) for _ in range(n)]
     81             else:
     82                 augmented_results = self._parallel_augment(action_fx, clean_data, n=n, num_thread=num_thread)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/nlpaug/augmenter/word/synonym.py in substitute(self, data)
     92 
     93         tokens = self.tokenizer(data)
---> 94         pos = self.model.pos_tag(tokens)
     95 
     96         aug_idxes = self._get_aug_idxes(pos)

AttributeError: 'dict' object has no attribute 'pos_tag'

I've downloaded ppdb-2.0-s-all from http://nlpgrid.seas.upenn.edu/PPDB/eng/ppdb-2.0-s-all.gz.

Guide for NER Augmentation

Thanks for sharing your work; I could not find any other NLP augmentation library.

Will this library help with augmenting NER data?

My data looks like this

Ryan B-PER
Dsouza B-PER
/DOB O
11/11/1997 B-DOB
/MALE O
22 B-NUM
56565 B-NUM

Thanks in advance

Avoid the text augmentation of a part of a sentence

Hello,

I would like to ask if it is possible to exclude a particular pattern in a sentence from text augmentation.

Example:
Original sentence: I would like to test @[the code]

How can I prevent the pattern @[the code] from being augmented?
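A minimal sketch of one possible approach, assuming the stopwords_regex parameter that the word augmenters expose (it appears in several signatures quoted on this page) is applied per token after whitespace tokenization; the exact regex and skip semantics are assumptions to verify against your version:

import nlpaug.augmenter.word as naw

text = "I would like to test @[the code]"

# Try to match both halves of the protected pattern ('@[the' and 'code]') so they are skipped.
aug = naw.SynonymAug(aug_src='wordnet', stopwords_regex=r'@\[\S*|\S*\]')
print(aug.augment(text))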

NameError: name 'model_dir' is not defined

Trying to run the below code:

# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir + 'GoogleNews-vectors-negative300.bin',
    action="insert")

It gives the following error:

NameError: name 'model_dir' is not defined

And I did include the following code at the top:

import os
os.environ["MODEL_DIR"] = '../model'

I also tried using MODEL_DIR instead of model_dir.

Reference: https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb
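For reference, the snippet expects a plain Python variable; a minimal sketch that defines it from the environment variable set above (the trailing path separator is an assumption about how the example joins paths):

import os
import nlpaug.augmenter.word as naw

os.environ["MODEL_DIR"] = '../model'
model_dir = os.environ["MODEL_DIR"] + '/'  # the plain variable the example refers to

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir + 'GoogleNews-vectors-negative300.bin',
    action="insert")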

Embedding issues

Hi Edward,

I am using the latest build from GitHub itself and have encountered another issue.
What if there is no index for a particular word? That was my first thought on seeing the error below.

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'thethe'

The above is acceptable, but it should be handled gracefully and is fixable. However, for the case below:

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'How'

This should not happen, as there is definitely an embedding for 'How'.

naw.RandomWordAug() doesn't have an aug_max parameter

Hi,
the __init__ function of naw.RandomWordAug doesn't take an aug_max parameter:

class RandomWordAug(WordAugmenter):
    def __init__(self, action=Action.DELETE, name='RandomWord_Aug', aug_min=1, aug_p=0.3, stopwords=None,
                 target_words=None, tokenizer=None, reverse_tokenizer=None, stopwords_regex=None, verbose=0):
        super().__init__(
            action=action, name=name, aug_p=aug_p, aug_min=aug_min, stopwords=stopwords,
            tokenizer=tokenizer, reverse_tokenizer=reverse_tokenizer, device='cpu', verbose=verbose,
            stopwords_regex=stopwords_regex)

        self.target_words = ['_'] if target_words is None else target_words

I don't know whether this was intentional or an oversight.

Bug in nlpaug.augmenter.char.RandomCharAug(action='swap')

The swap augmenter will often replace characters with other characters in the string instead of swapping them.

For example, running:

import nlpaug
import nlpaug.augmenter.char as nac
from collections import Counter

def char_count(word):
    return Counter(list(word))

swapper = nac.RandomCharAug(action='swap', swap_mode='random')

word = 'testing'
num_t = char_count('testing')['t']

iters=0

while num_t == char_count(word)['t'] and iters < 10000:
    word = swapper.augment(word)
    iters+=1
print(word, iters)

will take the string testing and often output something like:

ttniitg 5

where the output has added one t and one i while removing an e and an s, which is not the expected behavior of multiple swapping operations.

Proposed fix:

  • Remove .copy() method from line 140 of nlpaug/augmenter/char/random.py, or
  • Alternatively, remove the definition of original_chars variable altogether and only augment chars by referencing chars directly.

As it's written on line 151 of random.py, it appears that original_chars is never augmented, so that the characters at the swap indices of chars are being reassigned to those at the corresponding swap indices of original_chars. Since the characters in original_chars are never changed, after the first swap the method is replacing characters at the swap locations in chars with the original characters at those locations, which can end up duplicating some characters from the string, while erasing others.
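A minimal sketch of the intended behavior, independent of nlpaug's internals, showing what the proposed fix amounts to: swapping within a single list preserves the multiset of characters, so nothing is duplicated or lost.

import random

def swap_adjacent_chars(word, n_swaps=1):
    # Swap neighbouring characters in place on one list; the character counts never change.
    chars = list(word)
    for _ in range(n_swaps):
        i = random.randint(0, len(chars) - 2)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

# sorted(swap_adjacent_chars('testing', n_swaps=5)) == sorted('testing') always holds.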

Reducing the number of replaced characters

Hi Edward, thank you for this great work.

I'm trying to use the QwertyAug augmenter, but I'm having some issues controlling the number of replaced characters.

For instance:

aug = naf.Sequential([
    nac.QwertyAug()
])

aug.augment("qual o motivo do meu cartao ainda estar bloqueado?")

This results in far too many characters being replaced:

 wual 9 motibo di m3u carfao aindx esta# bloquezso?

I saw that there are some parameters that affect the number of replaced characters, but even if I pass aug_p = 0.00000000000000001, far too many characters still get replaced. Here is a test with the sentence "why is my credit card still blocked?":

'wh5 id hy crevit carw stilp blpcked?'

There's barely any meaning in this sentence anymore.

The aug_p seems to be the correct parameter. If I pass 1, then all characters change.

Is there a way to further reduce the chance that a given character gets replaced? Ideally, in a sentence with that size, I'd like to replace like 2 or 3 characters.
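A minimal sketch that caps the number of edits explicitly, using the aug_char_max / aug_word_max keywords that appear in the KeyboardAug signature quoted further down this page; whether your installed QwertyAug accepts the same keywords is an assumption, so this uses KeyboardAug:

import nlpaug.augmenter.char as nac

# Allow at most 2 character edits, in at most 2 words of the sentence.
aug = nac.KeyboardAug(aug_char_min=1, aug_char_max=2, aug_word_min=1, aug_word_max=2)
print(aug.augment("qual o motivo do meu cartao ainda estar bloqueado?"))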

NameError while using ContextualWordEmbsAug

I just successfully installed dependencies for ContextualWordEmbsAug and tried one of your examples.

pip install torch>=1.2.0 transformers>=2.0.0

Here is the example code that I used:

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

The error:
NameError: name 'BertTokenizer' is not defined
Do you have any idea why this happens?

Unable to load a custom model

Hi, I trained a custom model, but nlpaug is unable to load it. I think this could be solved by loading models via transformers' AutoModel and AutoTokenizer classes, so that new model types can be handled.

aug = naw.ContextualWordEmbsAug(
----> 4 model_path=model_name, action="substitute")

3 frames
/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in __init__(self, model_path, action, temperature, top_k, top_p, name, aug_min, aug_max, aug_p, stopwords, skip_unknown_word, device, force_reload, optimize, stopwords_regex, verbose)
91 self.model = self.get_model(
92 model_path=model_path, device=device, force_reload=force_reload, temperature=temperature, top_k=top_k,
---> 93 top_p=top_p, optimize=optimize)
94 # Override stopwords
95 if stopwords is not None and self.model_type in ['xlnet', 'roberta']:

/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in get_model(cls, model_path, device, force_reload, temperature, top_k, top_p, optimize)
269 def get_model(cls, model_path, device='cuda', force_reload=False, temperature=1.0, top_k=None, top_p=0.0,
270 optimize=None):
--> 271 return init_context_word_embs_model(model_path, device, force_reload, temperature, top_k, top_p, optimize)

/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/context_word_embs.py in init_context_word_embs_model(model_path, device, force_reload, temperature, top_k, top_p, optimize)
28 model = nml.Roberta(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p)
29 elif 'bert' in model_path:
---> 30 model = nml.Bert(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p)
31 elif 'xlnet' in model_path:
32 model = nml.XlNet(model_path, device=device, temperature=temperature, top_k=top_k, top_p=top_p, optimize=optimize)

/usr/local/lib/python3.6/dist-packages/nlpaug/model/lang_models/bert.py in __init__(self, model_path, temperature, top_k, top_p, device)
21 self.model_path = model_path
22
---> 23 self.tokenizer = BertTokenizer.from_pretrained(model_path)
24 self.model = BertForMaskedLM.from_pretrained(model_path)
25

NameError: name 'BertTokenizer' is not defined
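As a sanity check that the custom checkpoint itself loads outside nlpaug, the generic transformers auto classes can be tried first; this is plain transformers code, not nlpaug's API, and model_name is the same path passed above.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# If this fails, the checkpoint itself is the problem; if it succeeds, the issue is
# nlpaug guessing the model type (and tokenizer class) from the path name.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)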

performance issue with the sampling method

Hello, I found that the sampling method (line 99) in nlpaug/nlpaug/model/lang_models/language_models.py has poor performance.

top_n_ids = torch.multinomial(probas, num_samples=n, replacement=False).tolist()

torch.multinomial is rather slow, see pytorch/pytorch#11931

After changing it to NumPy, it is much faster: roughly 3.221 s/call -> 5e-6 s/call.

top_n_ids = np.random.choice(probas.size(0), n, False, probas.cpu().numpy()).tolist()

custom word augmenter example not working

When I try your example for a custom word augmenter, I get this error:

AttributeError                            Traceback (most recent call last)
in ()
     35 for token in tokens:
     36     print(token)
---> 37     print('{} --> {}'.format(token, aug.augment([token])[0]))

1 frames
/usr/local/lib/python3.6/dist-packages/nlpaug/augmenter/word/word_augmenter.py in clean(cls, data)
     29     @classmethod
     30     def clean(cls, data):
---> 31         return data.strip()
     32 
     33     def skip_aug(self, token_idxes, tokens):

AttributeError: 'list' object has no attribute 'strip'

I am using the beta version, i.e. cloned directly from your repo.

Bug in naw.RandomWordAug(action='swap')

If the text has only one word, naw.RandomWordAug(action='swap') will try to swap the words at positions 0 and 1 and then throw an error: IndexError: list index out of range. Maybe you can add a check so that if the text has only one word, it is returned directly.
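Until such a check exists in the library, a minimal caller-side guard is possible; this sketch assumes whitespace splitting is close enough for counting words.

import nlpaug.augmenter.word as naw

aug = naw.RandomWordAug(action='swap')

def safe_swap(text):
    # Swapping needs at least two tokens; otherwise return the text unchanged.
    if len(text.split()) < 2:
        return text
    return aug.augment(text)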

WordAugmenter._tokenizer can't remove excess spaces, which leads to an nltk error

Hi,

When there is excess space in a sentence, for example:
text = 'The  quick brown fox jumps over the lazy dog . 1  2'
it causes an index error in nltk because there will be empty tokens. The resulting tokens:
['The', '', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '', '2']

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/base_augmenter.py", line 61, in augment
    result = self.substitute(data)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/augmenter/word/synonym.py", line 83, in substitute
    pos = self.model.pos_tag(tokens)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/model/word_dict/wordnet.py", line 46, in pos_tag
    return nltk.pos_tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 261, in normalize
    elif word[0].isdigit():
IndexError: string index out of range

A quick fix could be as follows.
Original WordAugmenter._tokenizer in word_augmenter.py:
return text.split(' ')
Fix:
return [t for t in text.split(' ') if len(t) > 0]

The implementation of nlpaug.augmenter.word.AntonymAug() is inconsistent with the referenced paper

In the referenced paper Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models, the algorithm of Antonym Substitution is (page 6):

For Antonym, we modify the first verb, adjective or adverb that has an antonym.

But in nlpaug.augmenter.word.AntonymAug(), say the number of words we want to augment is 3: the function first randomly samples 3 candidate words from the original text and then searches for their antonyms. If none of these 3 candidates has an antonym, the text is not augmented. I think we should look for candidate words that actually have antonyms instead of sampling at random.

NameError: name 'BertTokenizer' is not defined

I got this error: NameError: name 'BertTokenizer' is not defined
when running the following code:

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)


Returns:

/usr/local/lib/python3.6/dist-packages/nlpaug/model/lang_models/bert.py in __init__(self, model_path, top_k, top_p, device)
75 self.model_path = model_path
76
---> 77 self.tokenizer = BertTokenizer.from_pretrained(model_path)
78 self.model = BertForMaskedLM.from_pretrained(model_path)
79

NameError: name 'BertTokenizer' is not defined

Hard dependency on librosa because of package level import

Minor issue, but a hard dependency on librosa still exists and is not listed in the package dependencies:

|     import nlpaug.augmenter.char as nac
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/__init__.py", line 2, in <module>
|     from nlpaug.base_augmenter import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 5, in <module>
|     from nlpaug.util import Action, Method, WarningException, WarningName, WarningCode, WarningMessage
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/__init__.py", line 7, in <module>
|     from nlpaug.util.visual import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/visual/__init__.py", line 1, in <module>
|     from nlpaug.util.visual.spectrogram import *
|   File "/usr/local/lib/python3.6/site-packages/nlpaug/util/visual/spectrogram.py", line 2, in <module>
|     import librosa.display
| ModuleNotFoundError: No module named 'librosa'

BertAug on Insert action issue

Hi Edward,

I was exploring your repository, and kudos for all the awesome work you have done! But I ran into this issue when running naw.BertAug(action=Action.INSERT) in a loop over a large set of sentences:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-0b61f9f70462> in <module>
      1 # df['questions'].apply(lambda x: aug.augment(x) if len(x.split(' ')) > 4 and len(x.split(' ')) < 20 else x)
      2 for i in range(len(df.index)):
----> 3     print(df['questions'][i],len(df['questions'][i].split(' ')),aug.augment(df['questions'][i]) if len(df['questions'][i].split(' ')) > 4 and len(df['questions'][i].split(' ')) < 20 else df['questions'][i])

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/flow/sometimes.py in augment(self, text)
     20                     continue
     21 
---> 22                 augmented_text = aug.augment(augmented_text)
     23 
     24             results.append(augmented_text)

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/base_augmenter.py in augment(self, tokens)
     42     def augment(self, tokens):
     43         if self.action == Action.INSERT:
---> 44             return self.insert(tokens)
     45         elif self.action == Action.SUBSTITUTE:
     46             return self.substitute(tokens)

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/augmenter/word/bert.py in insert(self, text)
     60         for aug_idx in aug_idexes:
     61             results.insert(aug_idx, nml.Bert.MASK)
---> 62             new_word = self.sample(self.model.predict(results, nml.Bert.MASK, self.aug_n), 1)[0]
     63             results[aug_idx] = new_word
     64 

~/bot/bot-ml/lib/python3.7/site-packages/nlpaug/base_augmenter.py in sample(self, x, num)
     67 
     68     def sample(self, x, num):
---> 69         return random.sample(x, num)
     70 
     71     def generate_aug_cnt(self, size):

~/bot/bot-ml/lib/python3.7/random.py in sample(self, population, k)
    319         n = len(population)
    320         if not 0 <= k <= n:
--> 321             raise ValueError("Sample larger than population or is negative")
    322         result = [None] * k
    323         setsize = 21        # size of a small set minus size of an empty list

ValueError: Sample larger than population or is negative

If I understand correctly, sample gets k = 1, with the population being the result of self.model.predict(results, nml.Bert.MASK, self.aug_n) and n being the length of that population; if the model returns fewer than 1 prediction, it fails with the above error.
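Until the empty-prediction case is handled inside the library, a minimal caller-side workaround is to skip the sentences that trigger it and keep the originals; this is generic error handling, not a fix for the underlying sampling.

def augment_or_keep(aug, text):
    # The insert action can raise ValueError when the model returns no candidates;
    # fall back to the original sentence in that case.
    try:
        return aug.augment(text)
    except ValueError:
        return text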

UnpicklingError: invalid load key, '<'.

Hello, when I use sentence augmentation, I get the following error (screenshot of the UnpicklingError omitted):

I just want to download the pre-trained GPT-2 model and try to augment a sentence.

nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')

Stopwords

Hello, it seems like the stopwords aren't being filtered correctly (screenshot omitted):

The word 'quick' is not being ignored. It would be nice if the augmenter simply skipped over stopwords.

WordNetAug - index error

Hello, when I pass text that contains one-character words, like 'I a', to the WordNet synonym replacement, the following bug occurs:

text= "I work in a middle school"
aug = naw.WordNetAug()
augmented_text = aug.augment(text)

(screenshot of the index error omitted)

Bert - pick() in language_models.py can return no candidates

For BERT, the get_candidiates call in the pick function

    def pick(self, logits, target_word, n=1):
        candidate_ids, candidate_probas = self.prob_multinomial(logits, n=n*10)
        results = self.get_candidiates(candidate_ids, candidate_probas, target_word, n)

        return results

can return no candidates when all sampled tokens are sub-tokens (i.e. start with '##').

Should pick return a non-skip token greedily by looking at the logits when results is empty?

`augment_batch` for BertAug, or a GPU option

Thank you for your development.
I have used BertAug for my preliminary experiments.
Since BERT on CPU is awfully slow, it would be great if BertAug had a method to augment examples in a batch so that GPUs can be used.
Just a suggestion for a sort of production-level use.
Thanks!
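For what it's worth, the newer contextual word augmenter already accepts a device argument (it appears in the __init__ signature quoted elsewhere on this page), so a single GPU can be used even without batching; a minimal sketch, assuming BertAug's functionality now lives in ContextualWordEmbsAug:

import nlpaug.augmenter.word as naw

# Runs the underlying BERT model on the GPU; batching across examples would still help further.
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action='substitute', device='cuda')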

ValueError: Sample larger than population or is negative

Hi,

I have a small dataset that I am trying to augment. For some of the questions, I am getting the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-337-336aea02b7a2> in <module>
      2 print(len(text))
      3 aug = naw.BertAug(action="insert")
----> 4 augmented_text = aug.augment(text)
      5 print("Original:")
      6 print(text)

~/anaconda3/lib/python3.7/site-packages/nlpaug/base_augmenter.py in augment(self, data)
     69 
     70         if self.action == Action.INSERT:
---> 71             return self.insert(data)
     72         elif self.action == Action.SUBSTITUTE:
     73             return self.substitute(data)

~/anaconda3/lib/python3.7/site-packages/nlpaug/augmenter/word/bert.py in insert(self, data)
     85         for aug_idx in aug_idxes:
     86             results.insert(aug_idx, nml.Bert.MASK)
---> 87             new_word = self.sample(self.model.predict(results, nml.Bert.MASK, self.aug_n), 1)[0]
     88             results[aug_idx] = new_word
     89 

~/anaconda3/lib/python3.7/site-packages/nlpaug/base_augmenter.py in sample(cls, x, num)
    109     @classmethod
    110     def sample(cls, x, num):
--> 111         return random.sample(x, num)
    112 
    113     def generate_aug_cnt(self, size, aug_p=None):

~/anaconda3/lib/python3.7/random.py in sample(self, population, k)
    319         n = len(population)
    320         if not 0 <= k <= n:
--> 321             raise ValueError("Sample larger than population or is negative")
    322         result = [None] * k
    323         setsize = 21        # size of a small set minus size of an empty list

ValueError: Sample larger than population or is negative

After some research, I came across this https://stackoverflow.com/questions/20861497/sample-larger-than-population-in-random-sample-python
but I am still not sure what exactly the issue is. It works sometimes but other times it returns this error. Is it something to do with my questions? Is there a specific format I need to follow for the questions?

Any help would be much appreciated.

File not found in /model/char/keyboard/en.json

Hello! I'm encountering this error when using KeyboardAug:

Minimal Working Example

import nlpaug.augmenter.char as nac
aug = nac.KeyboardAug()
aug.augment("hello world")

Exception

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-24-c96cc55b7113> in <module>
----> 1 augment(3, nac.KeyboardAug())

~/anaconda3/lib/python3.6/site-packages/nlpaug/augmenter/char/keyboard.py in __init__(self, name, aug_char_min, aug_char_max, aug_char_p, aug_word_p, aug_word_min, aug_word_max, stopwords, tokenizer, reverse_tokenizer, special_char, numeric, upper_case, lang, verbose, stopwords_regex)
     53         self.upper_case = upper_case
     54         self.lang = lang
---> 55         self.model = self.get_model(special_char, numeric, upper_case, lang)
     56 
     57     def skip_aug(self, token_idxes, tokens):

~/anaconda3/lib/python3.6/site-packages/nlpaug/augmenter/char/keyboard.py in get_model(cls, special_char, numeric, upper_case, lang)
     99     @classmethod
    100     def get_model(cls, special_char=True, numeric=True, upper_case=True, lang="en"):
--> 101         return nmc.Keyboard(special_char=special_char, numeric=numeric, upper_case=upper_case, lang=lang)

~/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/keyboard.py in __init__(self, special_char, numeric, upper_case, cache, lang)
     18         self.lang = lang
     19         self.model = self.get_model(
---> 20             model_dir=self.model_dir, special_char=special_char, numeric=numeric, upper_case=upper_case, lang=lang)
     21 
     22     def predict(self, data):

~/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/keyboard.py in get_model(cls, model_dir, special_char, numeric, upper_case, lang)
     31 
     32         model_path = os.path.join(model_dir, lang+'.json')
---> 33         with open(model_path, encoding="utf8") as f:
     34             mapping = json.load(f)
     35 

FileNotFoundError: [Errno 2] No such file or directory: '/home/ljvm/anaconda3/lib/python3.6/site-packages/nlpaug/model/char/../../../model/char/keyboard/en.json'

Thoughts

  • I think the relative path ../../../ resolves differently when nlpaug is installed as a package.

Thank you!

Exception when loading word embedding models with lines containing 2 words

Some model files contain embeddings whose keys span multiple words (e.g. the NILC embeddings for Portuguese), which causes the model-loading code to crash. For instance, a line in the model file might contain this:

Hey there 0.001 0.0003 0.86245 ........

The same does not happen in spaCy, for instance.

I fixed it in my local dev environment and might make a pull request later.

Swap augmentation doesn't preserve internal word casing

Code to reproduce:

import nlpaug.augmenter.word as naw

aug = naw.RandomWordAug(action='swap')
aug.augment('aA bB')
>>> 'bb aa'

# real-world use case
aug.augment('I love McDonalds')
>>> 'LOVE i McDonalds'
aug.augment('I love McDonalds')
>>> 'I mcdonalds Love'

P.S. thank you very much for the amazing library

Part of speech when mapping synonyms

Hi,

I would like to know whether the part of speech is taken into consideration when mapping a word to its synonyms using word2vec, GloVe, and fastText.

Thanks

BERTAug affects proper nouns

Hi,

I have been using your BertAug for text augmentation. It works fine on a lot of tasks, but it messes up proper nouns.

(screenshot omitted)

Is there any fix for this?
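One possible mitigation, as a minimal sketch: pass the proper nouns you want preserved as stopwords, or use a stopwords_regex for capitalized tokens; both parameters appear in the ContextualWordEmbsAug signature quoted earlier on this page. The example names and the regex are placeholders, not part of the library.

import nlpaug.augmenter.word as naw

# Skip explicitly listed names, and (roughly) any capitalized token via the regex.
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action='substitute',
    stopwords=['Edward', 'London'],
    stopwords_regex=r'[A-Z][a-z]+')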
