
fabricator's People

Contributors

deathreaper0965, eltociear, fhamborg, hallerpatrick, julian-risch, michelbartels, whoisjones


fabricator's Issues

Refactorings for submission

We should clean up and unify all contributions made so far and refactor the main concepts such that they fulfill the following desiderata. The criterion is that it is easy to understand what to do when I want to perform any of the following:

  • I want to generate texts about a certain topic
  • I have unlabeled texts and want to classify them into predefined categories, e.g. for text classification
  • I have unlabeled text (+ an optional label) and want to generate related texts, e.g. for NLI, summarization, or QA
  • I have tokens and want to annotate each token with a label, e.g. for named entity recognition

This way I can create any dataset I want (generate texts / tokens from scratch and annotate them).

The following adjustments need to be made (or at least checked to see how they currently work):

  • Generation: The minimal input to DatasetGenerator is a prompt template + task description. No fewshot examples or unlabeled data are required (e.g. Write me news articles.)
  • Generation: Include label options to generate texts for certain classes (e.g. Generate me a question about class x, for classes in X). It needs to be clear how to control the generated label distribution. I observed it's better not to let the LLM choose what to generate.
  • Generation: Add fewshot examples: This needs to be combinable with the above, in the sense that our repo automatically iterates over all classes in the label options or in the fewshot examples, such that the prompt is "generate me a question about class x. Here are y examples of class x." (see the sketch after this list)
  • Annotation: Provide prompt + unlabeled text. We can think of this as plain annotation by instruction-tuned models (e.g. Generate a question for the given context).
  • Annotation: Prompt + unlabeled text + label options: like the above, but classes will automatically be included in the prompt.
  • Annotation: Prompt + unlabeled text + fewshot examples: must be combined with the above. One now needs to include the label column from the fewshot dataset.
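A minimal sketch of the pure-generation scenario with label options, assuming the current interface where DatasetGenerator wraps a haystack PromptNode and BasePrompt is importable from fabricator.prompts (the renamed ai_dataset_generator.prompts); the generate() parameter names (prompt_template, max_prompt_calls) are assumptions:

import os

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

# Only a task description and label options, no fewshot examples or unlabeled data.
prompt = BasePrompt(
    task_description="Generate a movie review that is {label_options}.",
    label_options=["positive", "negative"],
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)
generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)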

Guided Generation for syntax-dependent tasks

Tools like Guidance help during text generation without necessarily improving the prompt itself.

A new paper called "Efficient Guided Generation for Large Language Models" does the same but with a cheaper runtime. One can provide a regex with which the model is guided during text generation. This might help with syntax-heavy tasks, e.g. NER token label lists ([0, 1, 2, 3]) or inline tagging (Alex B-PER is O going O to O Los B-LOC Angeles I-LOC). A rough sketch of such a regex constraint follows below.

The algorithm requires access to the generation process itself, so this would only work with self-hosted models.
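As an illustration of the kind of constraint one could pass to a guided-generation backend (the regex below and the validation with Python's re module are only a sketch, not the paper's API):

import re

# Constrain the model to emit a bracketed list of integer tag ids such as [0, 1, 2, 3].
TAG_LIST_PATTERN = r"\[\d+(,\s*\d+)*\]"

# Sanity check that typical target outputs match the constraint.
assert re.fullmatch(TAG_LIST_PATTERN, "[0, 1, 2, 3]")
assert not re.fullmatch(TAG_LIST_PATTERN, "0 1 2 3")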

improve logo + fix width in readme

GitHub markdown can't take a fixed width; if we use regular HTML tags, we need some logic to switch between light and dark mode.
Improve the logo at some point; I can ask a friend to design a proper one in the future.

Idea on how to structure generation / annotation

Instead of having a unified generation function as we have now, we might want to adjust our repo in the future so that users can pick different approaches, such as:

For Generation:
ZEROGEN: efficient zero-shot learning via dataset generation (paper)
PROGEN: progressive dataset generation via in-context feedback (paper)

For Annotation:
CALIBRATION: prompt-based zero-shot learning with calibration (paper)
...

Lastly, we should keep the possibility for users to generate datasets on their own, defining their own sampling strategy, sample information criterion, etc. A rough structural sketch follows below.
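A rough sketch of what such an approach-based API could look like; all class and method names below are hypothetical and do not exist in the repo yet:

from abc import ABC, abstractmethod

class GenerationApproach(ABC):
    """Base class for pluggable dataset generation approaches."""

    @abstractmethod
    def generate(self, prompt_template, fewshot_dataset=None):
        ...

class ZeroGenApproach(GenerationApproach):
    """Zero-shot dataset generation in the spirit of ZEROGEN."""

    def generate(self, prompt_template, fewshot_dataset=None):
        raise NotImplementedError

class ProGenApproach(GenerationApproach):
    """Progressive generation with in-context feedback in the spirit of PROGEN."""

    def generate(self, prompt_template, fewshot_dataset=None):
        raise NotImplementedError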

Improve fewshot sampling naming convention

Currently we support:
sampling strategy = uniform, stratified and None
fewshot examples per class = int
fewshot sampling column = None (needs to be set when I want to generate data for a column different from the column to sample fewshot examples from, e.g. generate new movie reviews but sample one positive and one negative example)

I will complete this issue soon with all use cases we have so that we can improve the naming. A rough sketch of how these knobs currently appear in a call follows below.
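A rough sketch of the movie-review use case from above, assuming the current BasePrompt / DatasetGenerator interface; the exact generate() parameter names (fewshot_sampling_strategy, fewshot_examples_per_class, fewshot_sampling_column) are assumptions that this issue may rename:

import os

from datasets import Dataset
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"],
})

# Generate new review texts, but sample one positive and one negative fewshot example.
prompt = BasePrompt(
    task_description="Generate a {label_options} movie review.",
    label_options=["positive", "negative"],
    generate_data_for_column="text",
)

generator = DatasetGenerator(
    PromptNode(model_name_or_path="gpt-3.5-turbo", api_key=os.environ.get("OPENAI_API_KEY"))
)

generated_dataset = generator.generate(
    prompt_template=prompt,
    fewshot_dataset=fewshot_examples,
    fewshot_sampling_strategy="stratified",  # or "uniform" or None
    fewshot_examples_per_class=1,
    fewshot_sampling_column="label",
)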

Cannot import DatasetGenerator

In a tutorial, we see the line:

from fabricator import DatasetGenerator

I can import fabricator after installing fabricator. However, no amount of path manipulation can get DatasetGenerator to be recognized for me.

Is Python 3.10 a firm requirement? It looks like I tried 3.9.

Supporting other open-source LLMs

Hi team,
Thanks for such a great library. It looks very promising.
I am wondering if it's possible to use an open-source model, instead of gpt-3.5-turbo. Is this possible?
Thanks
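Not a maintainer answer, but a minimal sketch of what this could look like, assuming fabricator accepts any haystack PromptNode and that the chosen Hugging Face model is supported by haystack's inference extras:

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator

# Swap the OpenAI model for a locally hosted Hugging Face model.
prompt_node = PromptNode(
    model_name_or_path="google/flan-t5-base",
    max_length=256,
)
generator = DatasetGenerator(prompt_node)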

Remove spacy dependency

Currently we use spacy in token classification datasets (more precisely NER datasets) to convert a sequence of BIO tags into spans in order to prompt the LLM in natural language.

Goal: Write our own conversion functions for BIO tags -> spans and spans -> BIO tags by searching substrings in the text. It is important to keep the tokenization of the original dataset, which is currently an issue. This way, we can remove the spacy dependency entirely. A rough sketch of the BIO tags -> spans direction follows below.
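A minimal sketch of a spacy-free BIO tags -> spans conversion on token level (the function name is hypothetical; the spans -> BIO direction and character offsets are not covered here):

def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (span_text, label) pairs."""
    spans, start, label = [], None, None

    def close(end):
        spans.append((" ".join(tokens[start:end]), label))

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                close(i)
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue
        else:  # "O" or an inconsistent "I-" tag closes the current span
            if start is not None:
                close(i)
            start, label = None, None
    if start is not None:
        close(len(tags))
    return spans

assert bio_to_spans(
    ["Alex", "is", "going", "to", "Los", "Angeles"],
    ["B-PER", "O", "O", "O", "B-LOC", "I-LOC"],
) == [("Alex", "PER"), ("Los Angeles", "LOC")]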

Custom Prefixes for data column and few shot column for prompt

Currently, generate_data_for_column and fewshot_example_columns are used as the prefixes for the prompt.

E.g.:

from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"]
})

prompt_template = BasePrompt(
    task_description="Annotate movie reviews as either: {label_options}",
    label_options=["positive", "negative"],
    generate_data_for_column="label",
    fewshot_example_columns="text",
)

This produces the output:

Annotate movie reviews as either: positive, negative

text: This movie is great!
label: positive

text: This movie is bad!
label: negative

text: {text}
label: 

With text: and label: as the prefixes.

Proposal/Motivation

What if I use a custom fine-tuned model that does not work well with text and label as prefixes in the prompt, but was trained with sentence and prediction?

For more flexibility, those prefixes should be optionally configurable. For example:

from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"]
})

prompt_template = BasePrompt(
    task_description="Annotate movie reviews as either: {label_options}",
    label_options=["positive", "negative"],
    generate_data_for_column=("label", "sentence"),  # Second tuple item contains the new prefix string
    fewshot_example_columns=("text", "prediction"),  # Second tuple item contains the new prefix string
)

This produces the output:

Annotate movie reviews as either: positive, negative

sentence: This movie is great!
prediction: positive

sentence: This movie is bad!
prediction: negative

sentence: {text}
prediction: 

The default behaviour could stay the same: the column name is used as the prefix. If it is a tuple (or another structure), the second item is used instead.

Split haystack dependencies

Instead of always installing both haystack and haystack[inference], we should install the latter only if required. This is the case if anyone wants to use their own resources for model hosting instead of APIs such as the ones from OpenAI. To do so, we need to adapt the tests and probably re-implement the dry_run logic from @HallerPatrick.

Reduce dependencies

  • Langchain can be replaced at some point by writing our own string prompt template (see the sketch below).
  • torch currently needs to be included so that tests with local models pass.
  • spacy provides a convenient solution to convert spans to BIO tags and back; we might want to copy and adapt our own solution.
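A minimal sketch of a langchain-free string prompt template, assuming only str.format-style placeholders are needed; the class name is hypothetical:

class SimplePromptTemplate:
    """Tiny replacement for langchain's string prompt template."""

    def __init__(self, template: str):
        self.template = template

    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)

prompt = SimplePromptTemplate("Annotate movie reviews as either: {label_options}")
print(prompt.format(label_options="positive, negative"))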

Naming of sampling strategies

Uniform always samples one class out of all classes, whereas stratified samples one example per class. This might be confusing and should be improved. The sketch below contrasts the two behaviours.
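A minimal sketch contrasting the two strategies, assuming a label -> examples mapping; the function names are hypothetical:

import random

def sample_uniform(examples_by_label, k):
    """Pick a single random class, then draw k examples from it."""
    label = random.choice(list(examples_by_label))
    return random.sample(examples_by_label[label], k)

def sample_stratified(examples_by_label, k):
    """Draw k examples from every class."""
    return [ex for exs in examples_by_label.values() for ex in random.sample(exs, k)]

examples_by_label = {
    "positive": ["This movie is great!", "Loved it."],
    "negative": ["This movie is bad!", "Waste of time."],
}
print(sample_uniform(examples_by_label, 1))     # one example from one random class
print(sample_stratified(examples_by_label, 1))  # one example per class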

More log infos

  • Log the prompt used for generation (based on the first example or similar)
  • Log some generated output (maybe not every sample, but every 10th or so; see the sketch below)
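A minimal sketch of the proposed logging, with dummy data standing in for the real prompt and generation loop:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("fabricator")

# Dummy stand-ins for the formatted prompt and the generated samples.
formatted_prompt = "Annotate movie reviews as either: positive, negative\n\ntext: {text}\nlabel: "
generated_samples = [f"sample {i}" for i in range(25)]

logger.info("Prompt used for generation:\n%s", formatted_prompt)
for i, sample in enumerate(generated_samples):
    if i % 10 == 0:
        logger.info("Generated sample %d: %s", i, sample)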

Make it possible to use all prompts with text inputs rather than label IDs

We can annotate single- and multi-label datasets by passing an id2label mapping or a list of labels. However, we should use labels in natural language rather than label IDs. This involves transforming all label IDs into their corresponding natural language form. A function doing this for you on a flattened dataset would be excellent. This also involves a reverse function to convert these labels back to their IDs. Further, we can extend relatively short abbreviations into more expressive descriptions.

  • Function to convert a dataset with label IDs to their natural language form (see the sketch after this list).
  • Function to postprocess generated dataset with natural language labels back to its IDs.
  • Function to convert flattened NER datasets which are split into tokens and tags into strings and spans.
  • Function to convert strings and spans back into tokens and tags.
  • Function to calculate offsets in strings for postprocessing, also relevant for QA.
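A minimal sketch of the first two functions for a flat label column; the function names are hypothetical, and the column is rebuilt explicitly to avoid feature-type clashes:

from datasets import Dataset

def convert_ids_to_labels(dataset: Dataset, column: str, id2label: dict) -> Dataset:
    """Replace integer label IDs with their natural language form."""
    labels = [id2label[label_id] for label_id in dataset[column]]
    return dataset.remove_columns(column).add_column(column, labels)

def convert_labels_to_ids(dataset: Dataset, column: str, id2label: dict) -> Dataset:
    """Reverse operation for postprocessing a generated dataset."""
    label2id = {label: label_id for label_id, label in id2label.items()}
    ids = [label2id[label] for label in dataset[column]]
    return dataset.remove_columns(column).add_column(column, ids)

dataset = Dataset.from_dict({"text": ["This movie is great!", "This movie is bad!"], "label": [1, 0]})
id2label = {0: "negative", 1: "positive"}
dataset = convert_ids_to_labels(dataset, "label", id2label)
print(dataset["label"])  # ['positive', 'negative']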

Ensure robust and fault tolerant generation

Due to the costs involved in using OpenAI, ensure that the library code does not crash during generation. Generated examples should be saved immediately. Give the user feedback if something fails (API calls, sampling, etc.). A rough retry sketch follows below.
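A rough sketch of fault-tolerant generation with per-prompt retries, assuming a per-example API call; the function name and retry policy are hypothetical:

import logging

logger = logging.getLogger("fabricator")

def generate_robustly(prompts, call_api, max_retries=3):
    """Call the API for each prompt, retrying on failure and keeping partial results."""
    generated = []
    for prompt in prompts:
        for attempt in range(max_retries):
            try:
                generated.append(call_api(prompt))
                # generated examples collected so far could be written to disk here
                # so nothing is lost if a later call crashes
                break  # success, move on to the next prompt
            except Exception as exc:
                logger.warning("Attempt %d failed for prompt %r: %s", attempt + 1, prompt, exc)
        else:
            logger.error("Giving up on prompt %r after %d attempts", prompt, max_retries)
    return generated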

Sampling method that covers all labels

Currently, examples are randomly sampled and we do not guarantee that all labels are covered. For text classification this is easy, but for sequence labeling tasks, i.e. multi-label multi-class, we have to apply some heuristics (see the sketch below).
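A minimal sketch of a greedy coverage heuristic for sequence labeling, assuming each example carries the set of labels it contains; the function name is hypothetical:

def sample_covering_all_labels(examples, label_sets, all_labels):
    """Greedily pick examples until every label is covered at least once."""
    covered, chosen = set(), []
    for example, labels in zip(examples, label_sets):
        new_labels = labels - covered
        if new_labels:
            chosen.append(example)
            covered |= new_labels
        if covered >= set(all_labels):
            break
    return chosen

examples = ["Alex went to Los Angeles", "Apple hired Alex", "Nothing here"]
label_sets = [{"PER", "LOC"}, {"ORG", "PER"}, set()]
print(sample_covering_all_labels(examples, label_sets, ["PER", "LOC", "ORG"]))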

Naming convention for DataPoints / Prompts

We should somehow unify the naming of data points and prompts.
The initial idea is to remove any task-related information from data points such that ExtractiveQADataPoint becomes something like MultiLabelDataPoint. The prompts are task-specific, i.e. QAPrompt, NERPrompt, etc., and we might drop the entire idea of the Annotation / Generation classes below and refactor the templates such that they either annotate a target variable or create unlabeled data in the style of the target variable.

Currently, we have for DataPoints:

  • BaseDataPoint
  • SingleLabelDataPoint
  • SingleLabelClassificationDataPoint
  • ExtractiveQADataPoint
  • TextDataPoint

For PromptTemplates we have:

  • BasePrompt
  • AnnotationPrompt
  • GenerationPrompt
  • TextGenerationPrompt
  • QuestionAnnotationPrompt
  • AnswerAnnotationPrompt
  • ContextAnnotationPrompt
  • NamedEntityAnnotationPrompt

AnnotationPrompt setting: I have unlabeled data points and part of each data point should be annotated. In this case I have a question and an answer for which I have no context, so the missing variable in the data point is annotated. The name makes more sense if you think about text classification: I pass in an unlabeled data point (e.g. only the text) and want to annotate the missing variable (the sentiment of the text).
In contrast, the GenerationPrompt is purely for creating unlabeled data. The input is then something like "Here are 2 text examples: [...]. Generate more texts in this style." So I generate unlabeled data points and do not annotate a missing variable of an unlabeled data point.

"so the missing variable in the data point is annotated" -> the missing variable is thus generated!

How to properly differentiate between generating unlabeled data and annotating unlabeled data?

Currently, the valid options in our repo are: (1) set an input_variable but no target_variable, set output_format == "text", and pass no unlabeled_data into the generate function; (2) set input_variable + target_variable with any output_format of your choice, and pass unlabeled_data to the generate function.

Regarding (2), that looks intuitive to me: fill in everything and get your unlabeled data annotated. But (1) requires type checks at various points in the code. One idea might be to split the tasks to make this easier to understand.

Large Git Pack Files

Cloning the repo downloads about 50 MB of data due to some pack files in the .git folder. This issue has an approach to delete them from the git history.

Initial version for pypi package

The initial version requires:

  • release 0.1 including the eval scripts
  • a working version without these scripts so that the repo stays clean
