
fabricator's People

Contributors

deathreaper0965, eltociear, fhamborg, hallerpatrick, julian-risch, michelbartels, whoisjones


fabricator's Issues

Refactorings for submission

We should clean up and unify all contributions made so far and refactor the main concepts such that they fulfill the following desiderata. The criterion is that it is easy to understand what to do when I want to perform any of the following:

  • I want to generate texts about a certain topic
  • I have unlabeled texts and want to classify them into predefined categories, e.g. for text classification
  • I have unlabeled text (+ an optional label) and want to generate related texts, e.g. for NLI, summarization, or QA
  • I have tokens and want to annotate each token with a label, e.g. for named entity recognition

This way I can create any dataset I want (generate texts / tokens from scratch and annotate them).

The following adjustments need to be made (or at least checked to see how they currently work):

  • Generation: The minimal input to DatasetGenerator is a prompt template + task description. No fewshot examples or unlabeled data are required (e.g. Write me news articles.)
  • Generation: Include label options to generate texts for certain classes (e.g. Generate me a question about class x, for classes in X). It needs to be clear how to control the generated label distribution. I observed it's better not to let the LLM choose what to generate.
  • Generation: Add fewshot examples: This needs to be combinable with the above, in the sense that our repo automatically iterates over all classes in the label options or in the fewshot examples, such that the prompt is "generate me a question about class x. Here are y examples of class x." (see the sketch after this list)
  • Annotation: Provide prompt + unlabeled text. We can think of this as plain annotation by instruction-tuned models (e.g. Generate a question for the given context).
  • Annotation: Prompt + unlabeled text + label options: like the above, but classes will automatically be included in the prompt.
  • Annotation: Prompt + unlabeled text + fewshot examples: must be combined with the above. One now needs to include the label column from the fewshot dataset.
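A minimal sketch of the pure-generation scenario with label options, assuming the current interface where DatasetGenerator wraps a haystack PromptNode and BasePrompt is importable from fabricator.prompts (the renamed ai_dataset_generator.prompts); the generate() parameter names (prompt_template, max_prompt_calls) are assumptions:

import os

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

# Only a task description and label options, no fewshot examples or unlabeled data.
prompt = BasePrompt(
    task_description="Generate a movie review that is {label_options}.",
    label_options=["positive", "negative"],
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)
generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)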

Guided Generation for syntax-dependent tasks

Tools like Guidance help during text generation without necessarily improving the prompt itself.

A new paper called "Efficient Guided Generation for Large Language Models" does the same but with a cheaper runtime. One can provide a regex with which the model is guided during text generation. This might help with syntax-heavy tasks, e.g. NER token label lists ([0, 1, 2, 3]) or inline tagging (Alex B-PER is O going O to O Los B-LOC Angeles I-LOC). A rough sketch of such a regex constraint follows below.

The algorithm requires access to the generation process itself, so this would only work with self-hosted models.
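As an illustration of the kind of constraint one could pass to a guided-generation backend (the regex below and the validation with Python's re module are only a sketch, not the paper's API):

import re

# Constrain the model to emit a bracketed list of integer tag ids such as [0, 1, 2, 3].
TAG_LIST_PATTERN = r"\[\d+(,\s*\d+)*\]"

# Sanity check that typical target outputs match the constraint.
assert re.fullmatch(TAG_LIST_PATTERN, "[0, 1, 2, 3]")
assert not re.fullmatch(TAG_LIST_PATTERN, "0 1 2 3")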

improve logo + fix width in readme

GitHub markdown can't take a fixed width; if we use regular HTML tags, we need some logic to switch between light and dark mode.
Improve the logo at some point; I can ask a friend to design a proper one in the future.

Idea on how to structure generation / annotation

Instead of having a unified generation function as we have now, we might want to adjust our repo in the future so that users can pick different approaches, such as:

For Generation:
ZEROGEN: efficient zero-shot learning via dataset generation (paper)
PROGEN: progressive dataset generation via in-context feedback (paper)

For Annotation:
CALIBRATION: prompt-based zero-shot learning with calibration (paper)
...

Lastly, we should keep the possibility for users to generate datasets on their own, defining their own sampling strategy, sample information criterion, etc. A rough structural sketch follows below.
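A rough sketch of what such an approach-based API could look like; all class and method names below are hypothetical and do not exist in the repo yet:

from abc import ABC, abstractmethod

class GenerationApproach(ABC):
    """Base class for pluggable dataset generation approaches."""

    @abstractmethod
    def generate(self, prompt_template, fewshot_dataset=None):
        ...

class ZeroGenApproach(GenerationApproach):
    """Zero-shot dataset generation in the spirit of ZEROGEN."""

    def generate(self, prompt_template, fewshot_dataset=None):
        raise NotImplementedError

class ProGenApproach(GenerationApproach):
    """Progressive generation with in-context feedback in the spirit of PROGEN."""

    def generate(self, prompt_template, fewshot_dataset=None):
        raise NotImplementedError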

Improve fewshot sampling naming convention

Currently we support:
sampling strategy = uniform, stratified and None
fewshot examples per class = int
fewshot sampling column = None (needs to be set when I want to generate data for a column different from the column to sample fewshot examples from, e.g. generate new movie reviews but sample one positive and one negative example)

I will complete this issue soon with all use cases we have so that we can improve the naming. A rough sketch of how these knobs currently appear in a call follows below.
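A rough sketch of the movie-review use case from above, assuming the current BasePrompt / DatasetGenerator interface; the exact generate() parameter names (fewshot_sampling_strategy, fewshot_examples_per_class, fewshot_sampling_column) are assumptions that this issue may rename:

import os

from datasets import Dataset
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"],
})

# Generate new review texts, but sample one positive and one negative fewshot example.
prompt = BasePrompt(
    task_description="Generate a {label_options} movie review.",
    label_options=["positive", "negative"],
    generate_data_for_column="text",
)

generator = DatasetGenerator(
    PromptNode(model_name_or_path="gpt-3.5-turbo", api_key=os.environ.get("OPENAI_API_KEY"))
)

generated_dataset = generator.generate(
    prompt_template=prompt,
    fewshot_dataset=fewshot_examples,
    fewshot_sampling_strategy="stratified",  # or "uniform" or None
    fewshot_examples_per_class=1,
    fewshot_sampling_column="label",
)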

Cannot import DatasetGenerator

In a tutorial, we see the line:

from fabricator import DatasetGenerator

I can import fabricator after installing fabricator. However, no amount of path manipulation can get DatasetGenerator to be recognized for me.

Is Python 3.10 a firm requirement? It looks like I tried 3.9.

Supporting other open-source LLMs

Hi team,
Thanks for such a great library. It looks very promising.
I am wondering if it's possible to use an open-source model, instead of gpt-3.5-turbo. Is this possible?
Thanks
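Not a maintainer answer, but a minimal sketch of what this could look like, assuming fabricator accepts any haystack PromptNode and that the chosen Hugging Face model is supported by haystack's inference extras:

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator

# Swap the OpenAI model for a locally hosted Hugging Face model.
prompt_node = PromptNode(
    model_name_or_path="google/flan-t5-base",
    max_length=256,
)
generator = DatasetGenerator(prompt_node)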

Remove spacy dependency

Currently we use spacy in token classification datasets (more precisely NER datasets) to convert a sequence of BIO tags into spans in order to prompt the LLM in natural language.

Goal: Write our own conversion functions for BIO tags -> spans and spans -> BIO tags by searching substrings in the text. It is important to keep the tokenization of the original dataset, which is currently an issue. This way, we can remove the spacy dependency entirely. A rough sketch of the BIO tags -> spans direction follows below.
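A minimal sketch of a spacy-free BIO tags -> spans conversion on token level (the function name is hypothetical; the spans -> BIO direction and character offsets are not covered here):

def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (span_text, label) pairs."""
    spans, start, label = [], None, None

    def close(end):
        spans.append((" ".join(tokens[start:end]), label))

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                close(i)
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue
        else:  # "O" or an inconsistent "I-" tag closes the current span
            if start is not None:
                close(i)
            start, label = None, None
    if start is not None:
        close(len(tags))
    return spans

assert bio_to_spans(
    ["Alex", "is", "going", "to", "Los", "Angeles"],
    ["B-PER", "O", "O", "O", "B-LOC", "I-LOC"],
) == [("Alex", "PER"), ("Los Angeles", "LOC")]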

Custom Prefixes for data column and few shot column for prompt

Currently, generate_data_for_column and fewshot_example_columns are used as the prefixes for the prompt.

E.g.:

from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"]
})

prompt_template = BasePrompt(
    task_description="Annotate movie reviews as either: {label_options}",
    label_options=["positive", "negative"],
    generate_data_for_column="label",
    fewshot_example_columns="text",
)

This produces the output:

Annotate movie reviews as either: positive, negative

text: This movie is great!
label: positive

text: This movie is bad!
label: negative

text: {text}
label: 

With text: and label: as the prefixes.

Proposal/Motivation

What if I use a custom fine-tuned model that does not work well with text and label as prefixes in the prompt, but was trained with sentence and prediction?

For more flexibility, those prefixes should be optionally configurable. For example:

from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt

fewshot_examples = Dataset.from_dict({
    "text": ["This movie is great!", "This movie is bad!"],
    "label": ["positive", "negative"]
})

prompt_template = BasePrompt(
    task_description="Annotate movie reviews as either: {label_options}",
    label_options=["positive", "negative"],
    generate_data_for_column=("label", "sentence"),  # Second tuple item contains the new prefix string
    fewshot_example_columns=("text", "prediction"),  # Second tuple item contains the new prefix string
)

This produces the output:

Annotate movie reviews as either: positive, negative

sentence: This movie is great!
prediction: positive

sentence: This movie is bad!
prediction: negative

sentence: {text}
prediction: 

The default behaviour could stay the same: the column name is used as the prefix. If it is a tuple (or another structure), the second item is used instead.

Split haystack dependencies

Instead of always installing both haystack and haystack[inference], we should install the latter only if required. This is the case if anyone wants to use their own resources for model hosting instead of APIs such as the ones from OpenAI. To do so, we need to adapt the tests and probably re-implement the dry_run logic from @HallerPatrick.

Reduce dependencies

  • Langchain can be replaced at some point by writing our own string prompt template (see the sketch below).
  • torch currently needs to be included so that tests with local models pass.
  • spacy provides a convenient solution to convert spans to BIO tags and back; we might want to copy and adapt our own solution.
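A minimal sketch of a langchain-free string prompt template, assuming only str.format-style placeholders are needed; the class name is hypothetical:

class SimplePromptTemplate:
    """Tiny replacement for langchain's string prompt template."""

    def __init__(self, template: str):
        self.template = template

    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)

prompt = SimplePromptTemplate("Annotate movie reviews as either: {label_options}")
print(prompt.format(label_options="positive, negative"))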

Naming of sampling strategies

Uniform always samples one class out of all classes, whereas stratified samples one example per class. This might be confusing and should be improved. The sketch below contrasts the two behaviours.
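A minimal sketch contrasting the two strategies, assuming a label -> examples mapping; the function names are hypothetical:

import random

def sample_uniform(examples_by_label, k):
    """Pick a single random class, then draw k examples from it."""
    label = random.choice(list(examples_by_label))
    return random.sample(examples_by_label[label], k)

def sample_stratified(examples_by_label, k):
    """Draw k examples from every class."""
    return [ex for exs in examples_by_label.values() for ex in random.sample(exs, k)]

examples_by_label = {
    "positive": ["This movie is great!", "Loved it."],
    "negative": ["This movie is bad!", "Waste of time."],
}
print(sample_uniform(examples_by_label, 1))     # one example from one random class
print(sample_stratified(examples_by_label, 1))  # one example per class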

More log infos

  • Log the prompt used for generation (based on the first example or similar)
  • Log some generated output (maybe not every sample, but every 10th or so; see the sketch below)
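A minimal sketch of the proposed logging, with dummy data standing in for the real prompt and generation loop:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("fabricator")

# Dummy stand-ins for the formatted prompt and the generated samples.
formatted_prompt = "Annotate movie reviews as either: positive, negative\n\ntext: {text}\nlabel: "
generated_samples = [f"sample {i}" for i in range(25)]

logger.info("Prompt used for generation:\n%s", formatted_prompt)
for i, sample in enumerate(generated_samples):
    if i % 10 == 0:
        logger.info("Generated sample %d: %s", i, sample)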

Make it possible to use all prompts with text inputs rather than label IDs

We can annotate single- and multi-label datasets by passing an id2label mapping or a list of labels. However, we should use labels in natural language rather than label IDs. This involves transforming all label IDs into their corresponding natural language form. A function doing this for you on a flattened dataset would be excellent. This also involves a reverse function to convert these labels back to their IDs. Further, we can extend relatively short abbreviations into more expressive descriptions.

  • Function to convert a dataset with label IDs to their natural language form (see the sketch after this list).
  • Function to postprocess generated dataset with natural language labels back to its IDs.
  • Function to convert flattened NER datasets which are split into tokens and tags into strings and spans.
  • Function to convert strings and spans back into tokens and tags.
  • Function to calculate offsets in strings for postprocessing, also relevant for QA.
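A minimal sketch of the first two functions for a flat label column; the function names are hypothetical, and the column is rebuilt explicitly to avoid feature-type clashes:

from datasets import Dataset

def convert_ids_to_labels(dataset: Dataset, column: str, id2label: dict) -> Dataset:
    """Replace integer label IDs with their natural language form."""
    labels = [id2label[label_id] for label_id in dataset[column]]
    return dataset.remove_columns(column).add_column(column, labels)

def convert_labels_to_ids(dataset: Dataset, column: str, id2label: dict) -> Dataset:
    """Reverse operation for postprocessing a generated dataset."""
    label2id = {label: label_id for label_id, label in id2label.items()}
    ids = [label2id[label] for label in dataset[column]]
    return dataset.remove_columns(column).add_column(column, ids)

dataset = Dataset.from_dict({"text": ["This movie is great!", "This movie is bad!"], "label": [1, 0]})
id2label = {0: "negative", 1: "positive"}
dataset = convert_ids_to_labels(dataset, "label", id2label)
print(dataset["label"])  # ['positive', 'negative']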

Ensure robust and fault tolerant generation

Due to the costs involved in using OpenAI, ensure that the library code does not crash during generation. Generated examples should be saved immediately. Give the user feedback if something fails (API calls, sampling, etc.). A rough retry sketch follows below.
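A rough sketch of fault-tolerant generation with per-prompt retries, assuming a per-example API call; the function name and retry policy are hypothetical:

import logging

logger = logging.getLogger("fabricator")

def generate_robustly(prompts, call_api, max_retries=3):
    """Call the API for each prompt, retrying on failure and keeping partial results."""
    generated = []
    for prompt in prompts:
        for attempt in range(max_retries):
            try:
                generated.append(call_api(prompt))
                # generated examples collected so far could be written to disk here
                # so nothing is lost if a later call crashes
                break  # success, move on to the next prompt
            except Exception as exc:
                logger.warning("Attempt %d failed for prompt %r: %s", attempt + 1, prompt, exc)
        else:
            logger.error("Giving up on prompt %r after %d attempts", prompt, max_retries)
    return generated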

Sampling method that covers all labels

Currently, examples are randomly sampled and we do not guarantee that all labels are covered. For text classification this is easy, but for sequence labeling tasks, i.e. multi-label multi-class, we have to apply some heuristics (see the sketch below).
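A minimal sketch of a greedy coverage heuristic for sequence labeling, assuming each example carries the set of labels it contains; the function name is hypothetical:

def sample_covering_all_labels(examples, label_sets, all_labels):
    """Greedily pick examples until every label is covered at least once."""
    covered, chosen = set(), []
    for example, labels in zip(examples, label_sets):
        new_labels = labels - covered
        if new_labels:
            chosen.append(example)
            covered |= new_labels
        if covered >= set(all_labels):
            break
    return chosen

examples = ["Alex went to Los Angeles", "Apple hired Alex", "Nothing here"]
label_sets = [{"PER", "LOC"}, {"ORG", "PER"}, set()]
print(sample_covering_all_labels(examples, label_sets, ["PER", "LOC", "ORG"]))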

Naming convention for DataPoints / Prompts

We should somehow unify the naming of data points and prompts.
The initial idea is to remove any task-related information from data points such that ExtractiveQADataPoint becomes something like MultiLabelDataPoint. The prompts are task-specific, i.e. QAPrompt, NERPrompt, etc., and we might drop the entire idea of the Annotation / Generation classes below and refactor the templates such that they either annotate a target variable or create unlabeled data in the style of the target variable.

Currently, we have for DataPoints:

  • BaseDataPoint
  • SingleLabelDataPoint
  • SingleLabelClassificationDataPoint
  • ExtractiveQADataPoint
  • TextDataPoint

For PromptTemplates we have:

  • BasePrompt
  • AnnotationPrompt
  • GenerationPrompt
  • TextGenerationPrompt
  • QuestionAnnotationPrompt
  • AnswerAnnotationPrompt
  • ContextAnnotationPrompt
  • NamedEntityAnnotationPrompt

AnnotationPrompt setting: I have unlabeled data points and part of each data point should be annotated. In this case I have a question and an answer for which I have no context, so the missing variable in the data point is annotated. The name makes more sense if you think about text classification: I pass in an unlabeled data point (e.g. only the text) and want to annotate the missing variable (the sentiment of the text).
In contrast, the GenerationPrompt is purely for creating unlabeled data. The input is then something like "Here are 2 text examples: [...]. Generate more texts in this style." So I generate unlabeled data points and do not annotate a missing variable of an unlabeled data point.

"so the missing variable in the data point is annotated" -> the missing variable is thus generated!

How to properly differentiate between generating unlabeled data and annotating unlabeled data?

Currently, the valid options in our repo are: (1) set an input_variable but no target_variable, set output_format == "text", and pass no unlabeled_data into the generate function; (2) set input_variable + target_variable with any output_format of your choice, and pass unlabeled_data to the generate function.

Regarding (2), that looks intuitive to me: fill in everything and get your unlabeled data annotated. But (1) requires type checks at various points in the code. One idea might be to split the tasks to make this easier to understand.

Large Git Pack Files

Cloning the repo downloads about 50 MB of data due to some pack files in the .git folder. This issue has an approach to delete them from the git history.

Initial version for pypi package

The initial version requires:

  • release 0.1 including the eval scripts
  • a working version without these scripts so that the repo stays clean
