hetpandya / textgenie

A Python package to augment text data using NLP.

Home Page: https://towardsdatascience.com/textgenie-augmenting-your-text-dataset-with-just-2-lines-of-code-23ce883a0715

License: Apache License 2.0

Python 100.00%

textgenie's Introduction

TextGenie

TextGenie is a text data augmentation library that helps you augment your text dataset by generating similar samples, yielding a more robust dataset for training better models. It also handles labeled datasets, carrying each sample's label over to the samples generated from it.

It uses several Natural Language Processing methods: paraphrase generation, BERT mask filling, and conversion of passive-voice text to active voice. The library currently supports English only.

Installation

pip install textgenie

Example

from textgenie import TextGenie

textgenie = TextGenie("hetpandya/t5-small-tapaco", "bert-base-uncased")

# Augment a list of sentences
sentences = [
    "The video was posted on Facebook by Alex.",
    "I plan to run it again this time",
]
textgenie.magic_lamp(
    sentences, "paraphrase: ", n_mask_predictions=5, convert_to_active=True
)

# Augment data in a txt file
textgenie.magic_lamp(
    "sentences.txt", "paraphrase: ", n_mask_predictions=5, convert_to_active=True
)

# Augment data in a csv file with labels
textgenie.magic_lamp(
    "sentences.csv",
    "paraphrase: ",
    n_mask_predictions=5,
    convert_to_active=True,
    label_column="Label",
    data_column="Text",
    column_names=["Text", "Label"],
)

Examples can be found in the examples notebook.

Usage

  • Initializing the augmentor: textgenie = TextGenie(paraphrase_model_name="model_name", mask_model_name="model_name", spacy_model_name="model_name", device="cpu")
    • Parameters:
      • paraphrase_model_name:
        • The name of the T5 paraphrase model.
        • A list of pretrained models for paraphrase generation can be found here.
      • mask_model_name:
        • The BERT model used to fill masks. Mask filling is disabled by default and is enabled by passing the name of the BERT model to use. A list of mask filling models can be found here.
      • spacy_model_name:
        • Name of the spaCy model. Available models can be found here. The default value is en_core_web_sm.
      • device:
        • The device on which the models will be loaded. The default value is cpu.
  • Methods:
    • augment_sent_mask_filling():
      • Generate augmented data using BERT mask filling.
      • Parameters:
        • sent:
          • The sentence on which augmentation has to be applied.
        • n_mask_predictions:
          • The number of predictions the BERT mask filling model should generate. The default value is 5.
    • augment_sent_t5():
      • Generate augmented data using T5 paraphrasing model.
      • Parameters:
        • sent:
          • The sentence on which augmentation has to be applied.
        • prefix:
          • The prefix for the T5 model input.
        • n_predictions:
          • The number of augmentations the function should return. The default value is 5.
        • top_k:
          • The number of predictions the T5 model should generate. The default value is 120.
        • max_length:
          • The maximum length of the sentence fed to the model. The default value is 256.
    • convert_to_active():
      • Converts a sentence to active voice if it is in passive voice; otherwise returns the sentence unchanged.
      • Parameters:
        • sent:
          • The sentence that has to be converted.
    • magic_once():
      • This is a wrapper around the augment_sent_mask_filling(), augment_sent_t5() and convert_to_active() methods. It augments a single sentence using all of the techniques above.
      • Since this method operates on a single sentence, it can be combined with other packages.
      • Parameters:
        • sent:
          • The sentence that has to be augmented.
        • paraphrase_prefix:
          • The prefix for the T5 model input.
        • n_paraphrase_predictions:
          • The number of augmentations the function should return. The default value is 5.
        • paraphrase_top_k:
          • The number of predictions the T5 model should generate. The default value is 120.
        • paraphrase_max_length:
          • The maximum length of the sentence fed to the model. The default value is 256.
        • n_mask_predictions:
          • The number of predictions the BERT mask filling model should generate. The default value is None.
        • convert_to_active:
          • Whether the sentence should be converted to active voice. The default value is True.
    • magic_lamp():
      • This method augments a whole dataset. Currently accepted dataset formats are txt, csv, tsv and list.
      • If the dataset is a list or a txt file, a list of augmented sentences is returned, and the augmented output is also saved to a txt file named sentences_aug.txt.
      • If the dataset is a csv or tsv file with labels, each new sample keeps the label of the sentence it was generated from, and a pandas DataFrame of the augmented data is returned. The augmented output is also written to a tsv file named original_file_name_aug.tsv.
      • Parameters:
        • sentences:
          • The dataset that has to be augmented. This can be a Python List, a txt, csv or tsv file.
        • paraphrase_prefix:
          • The prefix for the T5 model input.
        • n_paraphrase_predictions:
          • The number of augmentations the function should return. The default value is 5.
        • paraphrase_top_k:
          • The number of predictions the T5 model should generate. The default value is 120.
        • paraphrase_max_length:
          • The maximum length of the sentence fed to the model. The default value is 256.
        • n_mask_predictions:
          • The number of predictions the BERT mask filling model should generate. The default value is None.
        • convert_to_active:
          • Whether the sentence should be converted to active voice. The default value is True.
        • label_column:
          • The name of the column that contains the labels. The default value is None. This parameter is not required if the dataset is a Python list or a txt file.
        • data_column:
          • The name of the column that contains the text data. The default value is None. This parameter is also not required if the dataset is a Python list or a txt file.
        • column_names:
          • If the csv or tsv file does not have column names, a Python list of names must be passed to name the columns. The default value is None because the function also accepts a Python list or a txt file, but this parameter has to be set when csv or tsv files are used.
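The label handling described for magic_lamp() can be pictured with a small sketch. This is not TextGenie's implementation: augment_labeled and toy_augmenter are hypothetical names, and the toy augmenter only changes casing so the example runs without downloading any model. It shows one way each generated sample can inherit the label of the sentence it came from.

```python
from typing import Callable, Iterable, List, Tuple


def augment_labeled(
    rows: Iterable[Tuple[str, str]],
    augment_one: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return the original rows plus augmented copies that keep each row's label."""
    out: List[Tuple[str, str]] = []
    for text, label in rows:
        out.append((text, label))
        seen = {text}
        for candidate in augment_one(text):
            # Skip duplicates so the same sentence is not added twice.
            if candidate not in seen:
                seen.add(candidate)
                out.append((candidate, label))
    return out


def toy_augmenter(sent: str) -> List[str]:
    """Hypothetical stand-in for a real augmenter; it only changes the casing."""
    return [sent.lower(), sent.upper(), sent.lower()]


rows = [("Great movie", "pos"), ("Dull plot", "neg")]
augmented = augment_labeled(rows, toy_augmenter)
for text, label in augmented:
    print(label, text)
```

Because augment_one is just a callable that takes one sentence, a per-sentence method such as magic_once() could in principle be plugged in here, assuming it returns a list of strings; the documentation above does not state its return type.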

References

Passive To Active licensed under the Apache License 2.0

Links

An in-depth explanation of the library is available on my blog.

License

Please check LICENSE for more details.

textgenie's People

Contributors

creatorrr, hetpandya, imzachjohnson

textgenie's Issues

version issues

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.3.0 requires spacy<3.4.0,>=3.3.0.dev0, but you have spacy 2.2.4 which is incompatible.

Problem with Spacy matcher

There seems to be a problem with active voice conversion. This is the code I'm running:

from textgenie import TextGenie

textgenie = TextGenie("hetpandya/t5-small-tapaco", "bert-base-uncased")
sentences = [
    "The video was posted on Facebook by Alex.",
    "I plan to run it again this time",
]
x = textgenie.magic_lamp(
    sentences, "paraphrase: ", n_mask_predictions=5, convert_to_active=True
)

print(x)

And this is the error:

Traceback (most recent call last):
  File "test2.py", line 11, in <module>
    sentences, "paraphrase: ", n_mask_predictions=5, convert_to_active=True
  File "\lib\site-packages\textgenie\textgenie.py", line 240, in magic_lamp
    convert_to_active,
  File "\lib\site-packages\textgenie\textgenie.py", line 136, in magic_once
    active_voice = self.convert_to_active(sent)
  File "\lib\site-packages\textgenie\textgenie.py", line 108, in convert_to_active
    if is_passive(sent):
  File "\lib\site-packages\textgenie\grammar_utils.py", line 280, in is_passive
    matcher.add("Passive", None, passive_rule)
  File "spacy\matcher\matcher.pyx", line 76, in spacy.matcher.matcher.Matcher.add
TypeError: add() takes exactly 2 positional arguments (3 given)
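A likely cause of this error is the change in Matcher.add between spaCy v2 and v3: v2 accepted add(key, on_match, *patterns), while v3 expects add(key, patterns) with on_match as a keyword-only argument. The sketch below shows one way to pick the call style by major version. It is an assumption-laden illustration, not a patch from the maintainers: add_passive_rule and FakeV3Matcher are hypothetical, the fake class only mimics the v3 signature so the example runs without spaCy installed, and the pattern is a placeholder rather than the rule used in grammar_utils.py.

```python
def add_passive_rule(matcher, passive_rule, spacy_major):
    """Register the "Passive" pattern using the call style of the given spaCy major version."""
    if spacy_major >= 3:
        # spaCy v3: patterns are passed as one list; on_match is keyword-only.
        matcher.add("Passive", [passive_rule])
    else:
        # spaCy v2 style, as seen in the traceback: add(key, on_match, *patterns).
        matcher.add("Passive", None, passive_rule)


class FakeV3Matcher:
    """Hypothetical stand-in that mimics only the v3 `add` signature (not spaCy itself)."""

    def __init__(self):
        self.patterns = {}

    def add(self, key, patterns, *, on_match=None):
        self.patterns[key] = list(patterns)


# Illustrative token pattern; not the rule used in grammar_utils.py.
passive_rule = [{"DEP": "nsubjpass"}]

matcher = FakeV3Matcher()
add_passive_rule(matcher, passive_rule, spacy_major=3)
print(matcher.patterns)
```

With a real installation, the major version can be read with int(spacy.__version__.split(".")[0]) and a spacy.matcher.Matcher passed in place of the stand-in.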

bug in grammar_utils.py and correction

I found a bug in grammar_utils.py.

When I run this program:

from textgenie import TextGenie

textgenies = TextGenie("hetpandya/t5-small-tapaco", "bert-base-uncased")
sentence = ["It is only to bring the case within the scope of Section a that such an allegation is made"]
textgenies.magic_lamp(sentence, "paraphrase: ", n_mask_predictions=15, n_paraphrase_predictions=15, paraphrase_top_k=5, convert_to_active=True)

an error occurs: TypeError: 'bool' object is not callable

The problem is line 173 in grammar_utils.py, where a parameter is missing:

xcomp = pass2act(xcomp,True).strip(" .")

Modified code:

xcomp = pass2act(xcomp,nlp,True).strip(" .")

With this change, the program works.
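The error message follows from the argument order: when the nlp argument is omitted, True lands in the position where the function expects the spaCy pipeline, and calling it raises the TypeError. The sketch below reproduces this with stand-ins. The pass2act body, the parameter name rec, and fake_nlp are all hypothetical; only the positional order (text, pipeline, flag) is taken from the fix reported in this issue.

```python
def pass2act(doc, nlp, rec=False):
    """Hypothetical stand-in: only the parameter order mirrors the fix in the issue."""
    parsed = nlp(doc)  # fails if `nlp` is not callable
    return parsed if rec else parsed + "."


def fake_nlp(text):
    """Hypothetical stand-in for a loaded spaCy pipeline."""
    return text.strip()


# Buggy call: True lands in the `nlp` slot.
try:
    pass2act("the ball was thrown", True)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Fixed call: the pipeline is passed explicitly, then the flag.
fixed = pass2act("the ball was thrown ", fake_nlp, True).strip(" .")

print(error_message)
print(fixed)
```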
