
argilla-io / distilabel


⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.

Home Page: https://distilabel.argilla.io

License: Apache License 2.0

Python 98.29% Jinja 1.67% Makefile 0.04%
ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation

distilabel's Introduction

Distilabel Logo

Synthesize data for AI and add feedback on the fly!


Distilabel is the framework for synthetic data and AI feedback for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use Distilabel?

Whether you are working on a predictive model that computes semantic similarity or on the next generative model that is going to beat the LLM benchmarks, our framework ensures that the hard data work pays off. Distilabel is the missing piece that helps you synthesize data and provide AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you synthesize and judge data so that you can spend your valuable time on achieving and keeping high-quality standards for your data.

Take control of your data and models

Owning the data for fine-tuning your own LLMs is not easy, but Distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability, and fault tolerance, so you can focus on improving your data and training your models.

🏘️ Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Slack: get direct support from the community.

  • Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.

What do people build with Distilabel?

Distilabel can be used to synthesize data and provide AI feedback. Our community uses it to create amazing datasets and models, and we love contributing to open source ourselves too.

  • The 1M OpenHermesPreference dataset contains ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how Distilabel can be used to synthesize data at immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve the quality of the dataset.

👨🏽‍💻 Installation

pip install distilabel --upgrade

Requires Python 3.8+

In addition, the following extras are available:

  • anthropic: for using models available in the Anthropic API via the AnthropicLLM integration.
  • cohere: for using models available in Cohere via the CohereLLM integration.
  • argilla: for exporting the generated datasets to Argilla.
  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
  • litellm: for using LiteLLM to call any LLM using the OpenAI format via the LiteLLM integration.
  • llama-cpp: for using the llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
  • mistralai: for using models available in the Mistral AI API via the MistralAILLM integration.
  • ollama: for using Ollama and its available models via the OllamaLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration, as well as the other integrations that rely on the OpenAI client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
  • vllm: for using the vLLM serving engine via the vLLM integration.

Example

To run the following example you must install distilabel with the openai extra:

pip install "distilabel[openai]" --upgrade

Then run:

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load prompts from a Hugging Face Hub dataset, mapping the "prompt" column to "instruction".
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Generate a completion for each instruction using gpt-3.5-turbo.
    generate_with_openai = TextGeneration(
        name="generate_with_gpt35", llm=OpenAILLM(model="gpt-3.5-turbo")
    )

    # Connect the steps so the loaded rows flow into the generation step.
    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    # Run the pipeline, passing runtime parameters to each step by its name.
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "generate_with_gpt35": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
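
The run returns a Distiset wrapping the generated data. A minimal sketch of persisting it afterwards, continuing inside the __main__ block above, assuming the push_to_hub helper is available on Distiset in your version (the repository id below is a placeholder):

    # Illustrative follow-up: push the generated distiset to the Hugging Face Hub.
    # The repo id is a placeholder; adjust it (and the call) to your setup and version.
    distiset.push_to_hub("my-username/instruction-dataset-mini-with-generations")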

Badges

If you build something cool with distilabel, consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


Contribute

To contribute to distilabel directly, check our good first issues or open a new one.

distilabel's People

Contributors

alvarobartt, bjoernpl, bramvanroy, burtenshaw, davanstrien, davidberenstein1957, dependabot[bot], dvsrepo, edbeeching, gabrielmbmb, ignacioct, jphme, philschmid, plaguss, rasdani, sdiazlor, strickvl, wauplin


distilabel's Issues

Refactor `LLM` class and subclasses into `Engine`

As internally discussed with both @dvsrepo and @gabrielmbmb, it would be nice and make more sense to refactor and rename the LLM class and subclasses to Engine, as those are just engines to interact with the LLMs, while the logic of the forward pass / API call resides on the LLM side and is not handled by this package.

[BUG] AttributeError when trying to inspect an LLM or Pipeline object

Describe the bug
Heya! While deep diving into the package (amazing work, guys), I was instantiating the basic objects of the module as shown in the docs, but if you try to print the generator or pipeline object, you get an AttributeError.

To Reproduce

from datasets import load_dataset

from distilabel.llm import OpenAILLM
from distilabel.tasks import TextGenerationTask

dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)

task = TextGenerationTask()

generator = OpenAILLM(task=task, max_new_tokens=512)

generator

or

dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)

task = TextGenerationTask()

generator = OpenAILLM(task=task, max_new_tokens=512)

pipeline = pipeline("preference", "instruction-following", generator=generator)

pipeline

I get the same error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

3 frames
/usr/local/lib/python3.10/dist-packages/distilabel/llm/base.py in __repr__(self)
     73 
     74     def __repr__(self) -> str:
---> 75         return f"{self.__class__.__name__}(task={self.task.__class__.__name__}, num_threads={self.thread_pool_executor._max_workers}, promp_format='{self.prompt_format}', model='{self.model_name}')"
     76 
     77     def __rich_repr__(self) -> Generator[Any, None, None]:

AttributeError: 'NoneType' object has no attribute '_max_workers'

Expected behaviour
It should print information about the object.

Desktop (please complete the following information):

  • Package version: 0.1.0rc2
  • Python version: 3.10.12

[FEATURE] Add a simple strategy to save or return partial results when Pipeline fails

Is your feature request related to a problem? Please describe.
When running a pipeline over a dataset, if the pipeline fails (e.g., HTTP connection aborted when using APIs), there's no way to recover the dataset that has been collected so far.

Describe the solution you'd like
I'd like the pipeline to return the partial dataset or store it somewhere so I can reuse what had been generated and/or labelled so far.
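
Until something like this lands, a possible user-side workaround is to generate in chunks and checkpoint each one, so a failure only loses the in-flight chunk. A rough sketch (the chunk size, checkpoint paths, and the surrounding pipeline/dataset objects are assumed, not part of the current API):

from datasets import concatenate_datasets

chunk_size = 100  # illustrative
labelled_chunks = []
for start in range(0, len(dataset), chunk_size):
    chunk = dataset.select(range(start, min(start + chunk_size, len(dataset))))
    labelled = pipeline.generate(chunk, num_generations=2, batch_size=16)
    labelled.save_to_disk(f"checkpoint-{start}")  # partial results survive a later crash
    labelled_chunks.append(labelled)

full_dataset = concatenate_datasets(labelled_chunks)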


[BUG] Potential errors with module llm __init__.py imports?

When trying this code without installing hf-inference-endpoints (which I don't need):

from distilabel.tasks import SelfInstructTask
from distilabel.llm import OpenAILLM

I get:

/Users/danielvilasuero/argilla/distilabel/src/distilabel/llm/__init__.py:15 in <module>

   12 # See the License for the specific language governing permissions and
   13 # limitations under the License.
   14
❱  15 from distilabel.llm.huggingface.inference_endpoints import InferenceEndpointsLLM
   16 from distilabel.llm.huggingface.transformers import TransformersLLM
   17 from distilabel.llm.llama_cpp import LlamaCppLLM
   18 from distilabel.llm.openai import OpenAILLM

/Users/danielvilasuero/argilla/distilabel/src/distilabel/llm/huggingface/inference_endpoints.py:33 in <module>

   30 from distilabel.utils.imports import _HUGGINGFACE_HUB_AVAILABLE
   31
   32 if _HUGGINGFACE_HUB_AVAILABLE:
❱  33     from huggingface_hub import InferenceTimeoutError, get_inference_endpoint
   34     from huggingface_hub.inference._text_generation import TextGenerationError
   35
   36 _INFERENCE_ENDPOINTS_API_RETRY_ON_EXCEPTIONS = (

ImportError: cannot import name 'get_inference_endpoint' from 'huggingface_hub'
(/opt/homebrew/lib/python3.11/site-packages/huggingface_hub/__init__.py)
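
A common way to make such optional integrations fail lazily rather than at import time is to guard the re-export; a minimal sketch of the pattern (not the package's actual layout):

# Sketch: only re-export the integration if its optional dependency is importable,
# so `from distilabel.llm import OpenAILLM` keeps working without the extra installed.
try:
    from distilabel.llm.huggingface.inference_endpoints import InferenceEndpointsLLM
except ImportError:  # huggingface_hub missing or too old
    InferenceEndpointsLLM = None  # raise a helpful error only when it is actually used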

Add `generator_llm` and `labeller_llm` columns within resulting `datasets.Dataset`

Description

This issue refers to the addition of two new columns to the datasets.Dataset returned by Pipeline.generate, containing the generator LLM and the labelling LLM (if any), so that we can easily track which models were used.

This will not be that useful right now, since we run the whole thing with only one generator and one labeller LLM, but it will be really useful once we implement the pooling mechanism. Anyway, it's a simple thing to implement and could still be useful.
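
The addition itself would be straightforward with datasets; a rough sketch, assuming the LLM objects expose model_name and the dataset has already been generated:

# Illustrative only: record which models produced the generations and the labels.
dataset = dataset.add_column("generator_llm", [generator.model_name] * len(dataset))
dataset = dataset.add_column("labeller_llm", [labeller.model_name] * len(dataset))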

[FEATURE] Potential improvement parsing JudgeLM outputs

Is your feature request related to a problem? Please describe.
Running some end-to-end tests, I sometimes get these unparsed results:

/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures
/thread.py:58: UserWarning: Error parsing OpenAI response: invalid literal for int() with base 10: '7.5'

Is this something we could make more robust? Or will it become spaghetti?
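
One low-risk option would be a more tolerant rating parser; a sketch (not the current parse_output implementation):

# Hypothetical tolerant parser: accepts "7", " 7 ", or "7.5" (rounded to the nearest int).
def parse_rating(raw: str) -> int:
    try:
        return int(raw.strip())
    except ValueError:
        return round(float(raw.strip()))  # e.g. "7.5" -> 8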


Finalize pipeline API and implementation

Here's what I think we should support:

Preliminaries

import os
from datasets import load_dataset
from ultralabel.llm.huggingface import InferenceEndpointsLLM
from ultralabel.prompts.llama import Llama2GenerationPromptTemplate

# TODO: we need to try to read this token when instantiating the IEEngine
# is this the standard name HF_TOKEN?
os.environ["HF_TOKEN"] = ""
os.environ["OPENAI_API_KEY"] = ""

# Load the dataset
dataset = (
    load_dataset("ProlificAI/social-reasoning-rlhf", split="train")
    .remove_columns(["chosen", "rejected"])
    .rename_column("question", "instruction")
)

# TODO: I expect the InferenceEndpointsEngine to have a default task
llm = InferenceEndpointsLLM(
    endpoint_url="http://...",
    task=Llama2GenerationPromptTask(),
    temperature=1.0,
    max_new_tokens=512,
)
  1. Full defaults for each task

preference

from ultralabel.pipeline import pipeline

pipe = pipeline(
    task="preference",
    generator=llm, # I think I prefer generator rather than llm as the param name
)
# under the hood this will use our default PreferenceTask using the OpenAI engine
# will initially choose the fastest

general_preference = pipe.generate(
    dataset, 
    display_progress_bar=True, 
    num_generations=4,
)

general_preference.to_argilla()

critique

from ultralabel.pipeline import pipeline

pipe = pipeline(
    task="critique",
    generator=llm, # I think I prefer generator rather than llm as the param name
)
# under the hood this will use our default CritiqueTask using the OpenAI engine
# will initially choose the fastest

general_preference = pipe.generate(
    dataset, 
    display_progress_bar=True, 
    num_generations=4,
)
  2. Full configuration using the labeler object, which is an instance of LLMEngine with a PreferenceTask configured. This way we avoid passing args that try to configure two things: the LLMEngine and the PreferenceTask. If users want to configure this, they need to instantiate the labeler and pass it:

(please ignore the imports, this is not supposed to run)

from ultralabel.pipeline import pipeline

from ultralabel.engines.openai_ import OpenAIEngine
from ultralabel.tasks.preference.ultrafeedback import UltraFeedbackPreference

labeler = OpenAIEngine(
	task=UltraFeedbackPreference.for_helpfulness(),
	num_threads=4
)
pipe = pipeline(
    task="preference",
    generator=llm,
    labeler=labeler,
)
general_preference = pipe.generate(
    dataset, 
    display_progress_bar=True, 
    num_generations=4,
)

wdyt @alvarobartt ?

Tasks

[BUG] `formatted_labels` local var not defined when labeller is `None`


To Reproduce
Code to reproduce

import os

from distilabel.llm.huggingface import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks.text_generation.llama import Llama2GenerationTask

pipeline = Pipeline(
    generator=InferenceEndpointsLLM(
        endpoint_url=url,
        token=os.environ["HF_TOKEN"],
        task=Llama2GenerationTask(),
        max_new_tokens=128,
        num_threads=8,
        temperature=1.0,
    )
)
dataset = pipeline.generate(
    dataset, num_generations=2, batch_size=32, display_progress_bar=True
)


Add `formatting_fn` arg in `Engine` and all the subclasses

As internally discussed with @dvsrepo and @gabrielmbmb, we should add an arg in the Engine base class (former LLM class) to allow users to provide their own specific formats to send to the model, in order to detach that functionality from the Task (former PromptTemplate). This implies that users can use any LLM available in Hugging Face, even if there's no default supported format.
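
A minimal sketch of what this could look like from the user side (the argument name follows this issue; SomeLLM and the constructor shown are simplified assumptions, not the final API):

from distilabel.tasks import TextGenerationTask

# Hypothetical usage of the proposed `formatting_fn` argument.
def llama2_format(prompt: str) -> str:
    # Wrap the prompt in the chat template the target model expects.
    return f"<s>[INST] {prompt} [/INST]"

llm = SomeLLM(                 # placeholder for any Engine/LLM subclass
    task=TextGenerationTask(),
    formatting_fn=llama2_format,
)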

[BUG] Pydantic use error when using to_argilla

Describe the bug

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/homebrew/lib/python3.11/site-packages/argilla/utils/utils.py:108 in _get_module │
│ │
│ 105 │ │ │ ) │
│ 106 │ │ │
│ 107 │ │ try: │
│ ❱ 108 │ │ │ return importlib.import_module("." + module_name, self.name) │
│ 109 │ │ except Exception as e: │
│ 110 │ │ │ raise RuntimeError( │
│ 111 │ │ │ │ f"Failed to import {self.name}.{module_name} because of the followin │
│ │
│ /opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11 │
│ /importlib/__init__.py:126 in import_module │
│ │
│ 123 │ │ │ if character != '.': │
│ 124 │ │ │ │ break │
│ 125 │ │ │ level += 1 │
│ ❱ 126 │ return _bootstrap._gcd_import(name[level:], package, level) │
│ 127 │
│ 128 │
│ 129 _RELOADING = {} │
│ in _gcd_import:1204 │
│ in _find_and_load:1176 │
│ in _find_and_load_unlocked:1147 │
│ in _load_unlocked:690 │
│ in exec_module:940 │
│ in _call_with_frames_removed:241 │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/feedback/__init__.py:16 in │
│ │
│ 13 # limitations under the License. │
│ 14 │
│ 15 # !!! All modules used here must define the __all__ variable properly │
│ ❱ 16 from argilla.client.feedback import * # noqa │
│ 17 from argilla.client.feedback.dataset import FeedbackDataset # noqa │
│ 18 from argilla.client.feedback.schemas import * # noqa │
│ 19 │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/__init__.py:16 in │
│ │
│ 13 # See the License for the specific language governing permissions and │
│ 14 # limitations under the License. │
│ 15 │
│ ❱ 16 from .api import active_client │
│ 17 │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/api.py:21 in │
│ │
│ 18 from asyncio import Future │
│ 19 from typing import Any, Dict, Iterable, List, Optional, Tuple, Union │
│ 20 │
│ ❱ 21 from argilla.client.client import Argilla │
│ 22 from argilla.client.datasets import Dataset │
│ 23 from argilla.client.models import BulkResponse, Record # TODO Remove TextGenerationReco │
│ 24 from argilla.client.sdk.commons import errors │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/client.py:35 in │
│ │
│ 32 │ ES_INDEX_REGEX_PATTERN, │
│ 33 │ WORKSPACE_HEADER_NAME, │
│ 34 ) │
│ ❱ 35 from argilla.client.apis.datasets import Datasets │
│ 36 from argilla.client.apis.metrics import MetricsAPI │
│ 37 from argilla.client.apis.search import Search, VectorSearch │
│ 38 from argilla.client.apis.status import Status │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/apis/__init__.py:15 in │
│ │
│ 12 # See the License for the specific language governing permissions and │
│ 13 # limitations under the License. │
│ 14 │
│ ❱ 15 from .base import AbstractApi │
│ 16 from .status import api_compatibility │
│ 17 │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/apis/base.py:16 in │
│ │
│ 13 # limitations under the License. │
│ 14 from typing import Optional │
│ 15 │
│ ❱ 16 from argilla.client.sdk.client import AuthenticatedClient │
│ 17 │
│ 18 │
│ 19 class AbstractApi(object): │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/sdk/client.py:28 in │
│ │
│ 25 import httpx │
│ 26 │
│ 27 from argilla._constants import API_KEY_HEADER_NAME │
│ ❱ 28 from argilla.client.sdk._helpers import build_raw_response │
│ 29 from argilla.client.sdk.commons.errors import BaseClientError │
│ 30 │
│ 31 │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/sdk/_helpers.py:21 in │
│ │
│ 18 │
│ 19 from argilla.client.sdk.commons.errors import WrongResponseError │
│ 20 from argilla.client.sdk.commons.errors_handler import handle_response_error │
│ ❱ 21 from argilla.client.sdk.commons.models import ErrorMessage, HTTPValidationError, Respons │
│ 22 │
│ 23 │
│ 24 def build_raw_response(response: httpx.Response) -> Response[Union[Dict[str, Any], Error │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/sdk/commons/models.py:24 in │
│ │
│ 21 from pydantic import BaseModel, Field, validator │
│ 22 from pydantic.generics import GenericModel │
│ 23 │
│ ❱ 24 from argilla.client.models import Vectors as ClientVectors │
│ 25 │
│ 26 if TYPE_CHECKING: │
│ 27 │ from httpx import Response as HTTPXResponse │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/models.py:94 in │
│ │
│ 91 │ │ return self.value │
│ 92 │
│ 93 │
│ ❱ 94 class _Validators(BaseModel): │
│ 95 │ """Base class for our record models that takes care of general validations""" │
│ 96 │ │
│ 97 │ @validator("metadata", check_fields=False) │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/client/models.py:182 in _Validators │
│ │
│ 179 │ │ │
│ 180 │ │ return v │
│ 181 │ │
│ ❱ 182 │ @root_validator │
│ 183 │ def _check_and_update_status(cls, values): │
│ 184 │ │ """Updates the status if an annotation is provided and no status is specified."" │
│ 185 │ │ values["status"] = values.get("status") or ("Default" if values.get("annotation" │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/pydantic/deprecated/class_validators.py:222 in │
│ root_validator │
│ │
│ 219 │ │
│ 220 │ if __args: │
│ 221 │ │ # Ensure a nice error is raised if someone attempts to use the bare decorator │
│ ❱ 222 │ │ return root_validator()(*__args) # type: ignore │
│ 223 │ │
│ 224 │ if allow_reuse is True: # pragma: no cover │
│ 225 │ │ warn(_ALLOW_REUSE_WARNING_MESSAGE, DeprecationWarning) │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/pydantic/deprecated/class_validators.py:228 in │
│ root_validator │
│ │
│ 225 │ │ warn(_ALLOW_REUSE_WARNING_MESSAGE, DeprecationWarning) │
│ 226 │ mode: Literal['before', 'after'] = 'before' if pre is True else 'after' │
│ 227 │ if pre is False and skip_on_failure is not True: │
│ ❱ 228 │ │ raise PydanticUserError( │
│ 229 │ │ │ 'If you use @root_validator with pre=False (the default) you MUST specify │
│ 230 │ │ │ ' Note that @root_validator is deprecated and should be replaced with @mo │
│ 231 │ │ │ code='root-validator-pre-skip', │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
PydanticUserError: If you use @root_validator with pre=False (the default) you MUST specify
skip_on_failure=True. Note that @root_validator is deprecated and should be replaced with @model_validator.

For further information visit https://errors.pydantic.dev/2.4/u/root-validator-pre-skip

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:4 │
│ │
│ 1 import argilla as rg │
│ 2 │
│ 3 │
│ ❱ 4 rg_dataset = dataset.to_argilla() │
│ 5 │
│ │
│ /Users/danielvilasuero/argilla/distilabel/src/distilabel/dataset.py:46 in to_argilla │
│ │
│ 43 │ │ │ │ "The task is not set. Please set it with dataset.task = <task>." │
│ 44 │ │ │ ) │
│ 45 │ │ │
│ ❱ 46 │ │ rg_dataset = rg.FeedbackDataset( │
│ 47 │ │ │ fields=self.task.to_argilla_fields(dataset_row=self[0]), │
│ 48 │ │ │ questions=self.task.to_argilla_questions(dataset_row=self[0]), │
│ 49 │ │ ) │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/utils/utils.py:80 in __getattr__ │
│ │
│ 77 │ │ if name in self._modules: │
│ 78 │ │ │ value = self._get_module(name) │
│ 79 │ │ elif name in self._class_to_module.keys(): │
│ ❱ 80 │ │ │ module = self._get_module(self._class_to_module[name]) │
│ 81 │ │ │ value = getattr(module, name) │
│ 82 │ │ elif name in self._deprecated_modules: │
│ 83 │ │ │ value = self._get_module(name, deprecated=True) │
│ │
│ /opt/homebrew/lib/python3.11/site-packages/argilla/utils/utils.py:110 in _get_module │
│ │
│ 107 │ │ try: │
│ 108 │ │ │ return importlib.import_module("." + module_name, self.name) │
│ 109 │ │ except Exception as e: │
│ ❱ 110 │ │ │ raise RuntimeError( │
│ 111 │ │ │ │ f"Failed to import {self.name}.{module_name} because of the followin │
│ 112 │ │ │ │ f"(look up to see its traceback):\n{e}" │
│ 113 │ │ │ ) from e │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to import argilla.feedback because of the following error (look up to see its traceback):
If you use @root_validator with pre=False (the default) you MUST specify skip_on_failure=True. Note that
@root_validator is deprecated and should be replaced with @model_validator.

[FEATURE] Add tests for tasks

Is your feature request related to a problem? Please describe.
We need more tests for the tasks.

Describe the solution you'd like
Tests that ensure enough coverage.


Re-introduce a dry-run `generate` method for `Pipeline.generate`

Description

The former rlxf package had a dry-run method that ran the main method once with only one data point to ensure that everything was working fine. Now that the codebase has been refactored and we have a proper class, i.e. Pipeline, to help orchestrate everything, a nice addition would be a method that runs Pipeline.generate with only one data point before actually calling generate; we could even do so on every Pipeline.generate call, as making the user run a separate method is not a nice approach IMO.

So ideally, we could pick a random data point, run Pipeline.generate once, and if everything appears to be OK, then run the whole pipeline. Something so that the user only needs to call Pipeline.generate: first a one-data-point run is triggered end to end, and then one with the full dataset if the first one succeeds.

We could potentially add a skip_dry_run arg in case users want to manually disable it, but have it enabled by default.
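
A rough sketch of the behaviour described above, written as a standalone helper (the skip_dry_run name follows this issue; the rest is assumed):

import random

# Hypothetical wrapper: probe one random row end to end before the full run.
def generate_with_dry_run(pipeline, dataset, skip_dry_run=False, **kwargs):
    if not skip_dry_run:
        probe = dataset.select([random.randrange(len(dataset))])
        pipeline.generate(probe, num_generations=1, display_progress_bar=False)
    return pipeline.generate(dataset, **kwargs)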

[FEATURE] Return generation statistics

Along with the generated dataset, it would be good to return a data structure containing statistics of the generation such as elapsed time, total tokens generated by the labeller, etc.

Update or pin `openai<1.0.0`

openai==1.0.0 was released 4 days ago and comes with breaking changes. We should either pin the dependency to openai<1.0.0, or pin it to openai>=1.0.0 and update our code before the release.
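
If we go with the pin, an explicit runtime check could also give users a clearer error than a confusing AttributeError; a sketch, not part of the current codebase:

# Illustrative guard against the openai>=1.0.0 breaking changes.
from importlib.metadata import version
from packaging.version import Version

if Version(version("openai")) >= Version("1.0.0"):
    raise ImportError(
        "distilabel currently requires openai<1.0.0; please downgrade or wait for the update."
    )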

[FEATURE] Allow passing generation parameters for each generation

Is your feature request related to a problem? Please describe.
If I generate N texts, I would like to use different parameters (temperature, do_sample, etc) for each generation.

Describe the solution you'd like

pipeline.generate(num_generations=3, ...)  # use default params from the LLM __init__ method
pipeline.generate(generations_config=[{"temperature": 0.7, ...}, {"temperature": 0.3, ...}])  # 2 generations with the provided configs

generation_configs is maybe not the best name. In the future, we could also create a class in charge of generating random configs for the generations, as @dvsrepo suggested.

[FEATURE] Allow using multiple OpenAI API keys

Is your feature request related to a problem? Please describe.
OpenAI API implements rate limits per account and model that can be reached quite easily when generating a new dataset.

Describe the solution you'd like
Allow providing more than one OPENAI_API_KEY (from different OpenAI accounts) to "load balance" the requests to the OpenAI API between more than one account.
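
A simple way to spread requests would be to rotate over the configured keys per request; a minimal sketch, assuming the key can be set for each call (the keys below are placeholders):

from itertools import cycle

# Hypothetical round-robin over API keys from different accounts.
api_keys = cycle(["sk-account-one-...", "sk-account-two-..."])  # placeholder keys

def next_api_key() -> str:
    return next(api_keys)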

Add `__repr__` magic method in `Pipeline` using `rich`

In order to improve the readability of the Pipeline class and the pipeline function, we should include a __repr__ magic method implementation in the Pipeline class to show relevant information such as the args, both the generation and labelling LLMs (if provided), and any other relevant information.

Ideally we can use rich!
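
A sketch of what this could look like (attribute names are assumed from the Pipeline constructor shown elsewhere in these issues):

class Pipeline:
    ...

    def __repr__(self) -> str:
        labeller = self.labeller.__class__.__name__ if self.labeller else None
        return f"Pipeline(generator={self.generator.__class__.__name__}, labeller={labeller})"

    def __rich_repr__(self):
        # rich's pretty printer picks up this protocol automatically.
        yield "generator", self.generator
        yield "labeller", self.labeller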

Ensure `raw_response` is not lost when `parse_output` fails

As previously discussed with @gabrielmbmb, we should add not just a fallback mechanism but also a way to ensure that the raw_responses are kept, and maybe any other potential data, e.g. generated_tokens, prompt_tokens, exit_reason, etc., so we can post-process those at a later stage if something fails.
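
A sketch of the fallback idea (function and key names are illustrative, not the actual implementation):

# Hypothetical safe wrapper around a task's parse_output.
def safe_parse(task, raw_response: str) -> dict:
    try:
        parsed = dict(task.parse_output(raw_response))
    except Exception as e:
        parsed = {"parse_error": str(e)}
    parsed["raw_response"] = raw_response  # always keep the raw output for later post-processing
    return parsed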

[FEATURE] Introduce Principle sampling for generators

General idea

The idea is to introduce principle sampling (UltraFeedback, IBM's Dromedary: https://arxiv.org/abs/2305.03047) during the generation step. This way we steer the generator to explore a larger space of possible responses, conditioned on different aspects (e.g., honesty vs helpfulness).

The idea is to use this in the generator in a way that's independent of the labeler approach (currently UF vs JudgeLM) so it's not duplicated across labelers. This can be useful beyond preference tuning, for example for supervised fine-tuning, so I'm a bit reluctant to introduce it in the UF labeler.

How it works

  1. During generation, we randomly pick one principle (see the UltraFeedback sampling code). As mentioned by @alvarobartt, we could have a cleaner setup parameter like {"helpful": 0.7, "honest": 0.3} (see the sketch after the principles below).

  2. We add it to the system prompt to induce model behaviors before calling generate (see the UltraFeedback system prompt modification).

Principles

We can start with these principles:

principles = {
    "helpfulness": [
        "The assistant should provide users with accurate, relevant, and up-to-date information, ensuring that the content is positive, interesting, engaging, educational, and helpful.",
        "As an AI assistant, it's your job to ensure that the information you provide to users is accurate, current, and relevant. Offer comprehensive details to educate and engage the user.",
        "Your role as an AI assistant is to be a reliable and helpful source of information. Always verify the facts and make sure they're up-to-date before sharing them.",
        'As an AI assistant, inspire users with your knowledgeable insights. Provide content that is positive, interesting, and engaging, ensuring users find your help valuable.',
        'Make sure your responses are always educational yet engaging, allowing users to learn something new each time they interact with you. You are an AI assistant after all!',
        'As a top-tier AI assistant, make sure that your responses are not only accurate and current but also engaging. Be the positive interaction users look forward to.',
        "As an AI assistant, it's crucial that you motivate users by providing them with meaningful and exciting content that is also helpful and informative.",
        'Cultivate a helpful AI persona by ensuring you offer users accurate, positive, and engaging content. Make every interaction informative and delightful.',
        'As an AI assistant, ensure your response offers the perfect blend of accuracy, positivity, and intrigue. Strive to be educational while keeping the user engaged.',
        "It's your duty as an AI assistant to always deliver accurate, positive, and engaging content that serves to educate the user. Make every interaction a learning opportunity.",
        'As an AI assistant, provide not just answers but also spark curiosity. Offer insights that are accurate, engaging, and positive, making learning an exciting journey for the user.',  
    ],
    "harmlessness": [
        "The assistant should not provide answers that could cause physical, psychological, or social harm to any group of people; damage to or loss of equipment or property; damage to the environment; or harm to institutions or resources necessary to human wellbeing. Feel free to refuse to answer questions that could cause the above harm.", 
        'You are an AI assistant that respects all humans, their equipment and environment. Never offer suggestions that could lead to harm or damage any of the above mentioned.',
        'You are an AI assistant that maintains a principle of "First do no harm". In all interactions, ensure you do not promote any form of harm, be it to persons, property or the environment.',
        "You are an AI assistant. Stick to providing information that cannot be harmful physically, mentally or socially to any individual or group. Don't answer any queries that might lead to such consequences.",
        'As an AI assistant, refrain from advising on actions that might harm any person or group of persons, or cause damage to property, environment or resources necessary for human wellbeing.',
        "You're an AI helper wishing to provide safe assistance for all users. Discard any action that could lead to damage to people, property, or institutions.",
        'You are a responsible AI assistant. Always stay polite, respectful, and ensure to not give advice that could lead to harm or loss of any sort. ',
        'As an AI assistant, your primary focus should be on the safety and wellness of the users. Never provide answers that could cause harm or damage to individuals, their property or the environment.',
        'You are an AI assistant that is centred around being harmless. Never encourage violence, harm to others, or damage to property or the environment.',
        'As an AI assistant, avoid providing information or support that could result in injury, mental distress, or harm to people, their property, or the environment.',
        'You are an AI assistant. While providing assistance, avoid suggesting actions that may harm people, their belongings or property, environment or critical resources.',
    ],
    "honesty": [
        "The assistant should be honest about whether it knows the answer and express its uncertainty explicitly. Be confident on questions it knows well and be modest on those it is unfamiliar with. Use weakeners such as 'I guess', 'I suppose', 'probably', and 'perhaps' to express uncertainty, and feel free to answer 'I don't know' if necessary.",
        "As an AI, honesty is crucial to build trust with your user. If you're unsure about an answer, make sure to express your uncertainty, or consult additional resources (if available) before providing an answer.  ",
        "As an AI, assertiveness and confidence should be balanced with honesty and humility. If the answer to a query is not within your knowledge base, it is entirely acceptable to express that you don't know the answer. ",
        "You are an AI assistant that values honesty. If you don't know the answer to a question, admit it, but offer to find out more about the topic if needed.",
        "You're not only an AI assistant, but also a reliable partner. Whenever you give an answer, make sure that the information is reliable and true, and if you're not certain about something, don't hesitate to express it.  ",
        'You are an AI assistant. Your prime objective is to provide useful and accurate information. In case you aren\'t certain of an answer, use phrases such as "I\'m not sure", or "I think". Honesty always comes first.',
        'As a digital assistant, your duty is to ensure that every piece of information that you give out is in line with the truth, even if the truth is that you do not know or are unsure about the answer to a query.',
        "Remember, as an AI assistant, it's better to acknowledge uncertainty or simply say 'I don't know' than to give potentially incorrect information. Honesty leads to trust.",
        "As an AI assistant, honesty is appreciated by your users. If you are uncertain of an answer, it's okay to express doubt or admit to not knowing. Offer to search for more information if needed. ",
        'You are an AI helper that portrays honesty in every response. Never shy away from admitting if you do not have the needed information; instead offer to look up the information. ',
        "Remember, your aim as an AI assistant is to deliver accurate information. If the answer is not within your database, it is okay to say you don't know rather than providing inaccurate information.",
    ],
    "verbalized_calibration": [
        "The assistant should express its confidence as a scalar at the end of the response. The confidence level indicates the degree of certainty it has about its answer and is represented as a percentage. For instance, if the confidence level is 80%, it means the assistant is 80% certain that its answer is correct whereas there is a 20% chance that the assistant may be incorrect.\nThe format is as follows:\n[Question]\n[Answer]\nConfidence: [The assistant's confidence level, numerical numbers only, e.g. 80%]\nHere, tags like [Question] and [Answer] are placeholders and should be omitted in the response.\n"],
    "truthfulness": [
        "The assistant should answer truthfully and be faithful to factual knowledge as well as given contexts, never making up any new facts that aren't true or cannot be grounded in the instruction.",
        "As an AI assistant, it is important that you answer truthfully. Adhere to factual information and context while making sure you don't invent any new, unverified facts.",
        'As an AI assistant, conduct yourself with honesty. Always be diligent in giving accurate responses. Fictional facts or inaccuracies should be avoided in your responses.',
        'As an AI assistant, stick to the truth. Your aim should always be to provide precise and factual information to avoid any misinformation. ',
        'As an AI assistant, factual consistency is key. Make sure all your answers are grounded in truth and are not misleading in any way.',
        "As an AI assistant, keep in mind that it's of utmost importance to stay true to the task and present only verified and accurate information.",
        'As an AI assistant, find and uphold the truth in every answer. Ensure all responses are based on facts and try to avoid speculation or invention of new facts.',
        'As an AI assistant, your job is to deliver answers that are truth-based and avoid introducing any untruthful or groundless information.',
        'As an AI assistant, maintain objectivity while answering questions. Make sure the information given is truthful and never fabricated.',
        'As an AI assistant, maintain the integrity of your responses. Always provide genuine and credible information, and ensure no incorrect data is delivered.',
        'As an AI assistant, operate under the principle of truthfulness. Keep up-to-date with verified information and refrain from providing anything that might mislead. \n',
    ]
}
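
A sketch of the sampling step on top of this dictionary, using the weighted setup parameter mentioned above (the function name and the example weights are illustrative):

import random

def sample_principle(principles: dict, weights: dict) -> str:
    # Pick a principle group according to the configured weights, then one phrasing at random.
    group = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return random.choice(principles[group])

# e.g. bias the generator towards helpfulness, as suggested above
extra_system_prompt = sample_principle(principles, {"helpfulness": 0.7, "honesty": 0.3})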

[FEATURE] Improve error message when dataset contains extra columns

If users pass a dataset without removing the extra columns, they'll get these messages:

ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpected keyword argument 'generation_model'
ERROR:distilabel:An error ocurred when getting the result from the labeller: JudgeLMTask.generate_prompt() got an unexpe
....

We should explicitly say there's an extra column in the dataset and maybe list the columns, wdyt?
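
A sketch of the kind of check that would produce a clearer message (attribute names are assumptions based on the error above):

# Hypothetical validation before generation/labelling starts.
expected = set(task.input_args_names)
extra = set(dataset.column_names) - expected
if extra:
    raise ValueError(
        f"The dataset contains unexpected columns {sorted(extra)}; expected only "
        f"{sorted(expected)}. Remove or rename them before calling generate."
    )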

Add `generation_prompt` and `labelling_prompt` column to `datasets.Dataset`

This issue was raised by @dvsrepo so that we include both the generation_prompt and the labelling_prompt as columns of the generated datasets.Dataset, so that we can also keep track of them.

This issue is related to what's been tackled at #33 when including the raw_{generation,labelling}_response column as a fallback mechanism in case something fails.

Improve Inference Endpoints Engine

We need to add do_sample, top_k, etc.

Tasks

Simplify configuration of MultiRating task and ultrafeedback

I'd like to simplify the configuration of the MultiRatings task. In particular, I'd suggest keeping only task_description and ratings, and including rating_description inside the task_description.

before:

{{ task_description }}

**Scoring**: {{ ranks_description }}
{%- for rank in ranks %}
{{ rank.rank }}. {{ rank.description }}
{%- endfor %}

after:

{{ task_description }}
{%- for rank in ranks %}
{{ rank.rank }}. {{ rank.description }}
{%- endfor %}

this means we should include **Scoring**: {{ ranks_description }} inside each of the proposed preference models (helpful, etc.), but I think it will give more flexibility in the long term.

Related: you suggested using "Score from 1 to {{ len(ranks) }}" for the ranks_description; I think this is still applicable if doable.

For naming, I'd suggest changing from ranks to ratings.

[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM)

The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.

Our current preference tasks are:

  • UltraFeedback: with different aspects (honest, helpful, etc.) so we need to see how to compute the overall benchmark. We can start with the for_text_quality() aspect because it's a summary of the other aspects.
  • JudgeLM
  • UltraJudge: our own variation of uf and judgelm

The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this we can compute typical classification metrics (accuracy, precision, recall, F1), as sketched after the list below.

This benchmark will be very useful as we can run it when we develop or integrate new techniques.

  • we can start with a sample of each dataset if it is large
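
Once both our label and the benchmark label are available, the metric computation itself is simple; a sketch (the dataset and column names are illustrative):

# Illustrative accuracy: does our labeller pick the same "chosen" response as the benchmark?
predictions = [row["predicted_chosen"] for row in benchmark_dataset]
references = [row["human_chosen"] for row in benchmark_dataset]
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)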

[BUG] Failed to concatenate on axis after finishing the pipeline

Describe the bug

After running the pipeline, it seems to have finished but raises the following error:

Texts Generated ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2000/2000
Rows labelled   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1000/1000


/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures
/thread.py:58: UserWarning: Error parsing OpenAI response: invalid literal for int() with base 10: '7.5'
  result = self.fn(*self.args, **self.kwargs)
/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures
/thread.py:58: UserWarning: Error parsing OpenAI response: invalid literal for int() with base 10: '8.5'
  result = self.fn(*self.args, **self.kwargs)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:60                                                                                   │
│                                                                                                  │
│   57 )                                                                                           │
│   58                                                                                             │
│   59 start = time.time()                                                                         │
│ ❱ 60 dataset = pipeline.generate(                                                                │
│   61dataset, num_generations=2, batch_size=16, display_progress_bar=True                    │
│   62 )                                                                                           │
│   63 end = time.time()                                                                           │
│                                                                                                  │
│ /Users/danielvilasuero/argilla/distilabel/src/distilabel/pipeline.py:236 in generate             │
│                                                                                                  │
│   233 │   │   │   │   │   else:                                                                  │
│   234 │   │   │   │   │   │   raise ValueError(f"Unsupported type: {type(parsed_response)}")     │
│   235 │   │                                                                                      │
│ ❱ 236 │   │   dataset = self._add_columns_to_dataset(dataset, generations, formatted_labels)     │
│   237 │   │   dataset = self._remap_dataset(dataset)                                             │
│   238 │   │   # TODO: before releasing check whether we should move the `argilla` export to da   │
│   239 │   │   #   that would imply not passing the `task` but just returning the remapped data   │
│                                                                                                  │
│ /Users/danielvilasuero/argilla/distilabel/src/distilabel/pipeline.py:112 in                      │
│ _add_columns_to_dataset                                                                          │
│                                                                                                  │
│   109 │   │                                                                                      │
│   110 │   │   if self.labeller is not None:                                                      │
│   111 │   │   │   for output_name in self.labeller.task.output_args_names:                       │
│ ❱ 112 │   │   │   │   dataset = dataset.add_column(                                              │
│   113 │   │   │   │   │   output_name, [row.get(output_name, None) for row in labels]            │
│   114 │   │   │   │   )                                                                          │
│   115                                                                                            │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/arrow_dataset.py:557 in wrapper              │
│                                                                                                  │
│    554 │   │   │   "output_all_columns": self._output_all_columns,                               │
│    555 │   │   }                                                                                 │
│    556 │   │   # apply actual function                                                           │
│ ❱  557 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │
│    558 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │
│    559 │   │   # re-apply format to the output                                                   │
│    560 │   │   for dataset in datasets:                                                          │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/fingerprint.py:511 in wrapper                │
│                                                                                                  │
│   508 │   │   │                                                                                  │
│   509 │   │   │   # Call actual function                                                         │
│   510 │   │   │                                                                                  │
│ ❱ 511 │   │   │   out = func(dataset, *args, **kwargs)                                           │
│   512 │   │   │                                                                                  │
│   513 │   │   │   # Update fingerprint of in-place transforms + update in-place history of tra   │
│   514                                                                                            │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/arrow_dataset.py:5621 in add_column          │
│                                                                                                  │
│   5618 │   │   _check_column_names(self._data.column_names + column_table.column_names)          │
│   5619 │   │   dataset = self.flatten_indices() if self._indices is not None else self           │
│   5620 │   │   # Concatenate tables horizontally                                                 │
│ ❱ 5621 │   │   table = concat_tables([dataset._data, column_table], axis=1)                      │
│   5622 │   │   # Update features                                                                 │
│   5623 │   │   info = dataset.info.copy()                                                        │
│   5624 │   │   info.features.update(Features.from_arrow_schema(column_table.schema))             │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/table.py:1802 in concat_tables               │
│                                                                                                  │
│   1799tables = list(tables)                                                                 │
│   1800if len(tables) == 1:                                                                  │
│   1801 │   │   return tables[0]                                                                  │
│ ❱ 1802return ConcatenationTable.from_tables(tables, axis=axis)                              │
│   1803                                                                                           │
│   1804                                                                                           │
│   1805 def list_table_cache_files(table: Table) -> List[str]:                                    │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/table.py:1507 in from_tables                 │
│                                                                                                  │
│   1504 │   │   blocks = to_blocks(tables[0])                                                     │
│   1505 │   │   for table in tables[1:]:                                                          │
│   1506 │   │   │   table_blocks = to_blocks(table)                                               │
│ ❱ 1507 │   │   │   blocks = _extend_blocks(blocks, table_blocks, axis=axis)                      │
│   1508 │   │   return cls.from_blocks(blocks)                                                    │
│   1509 │                                                                                         │
│   1510 │   @property                                                                             │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/table.py:1499 in _extend_blocks              │
│                                                                                                  │
│   1496 │   │   │   │   result.extend(blocks)                                                     │
│   1497 │   │   │   elif axis == 1:                                                               │
│   1498 │   │   │   │   # We make sure each row_block have the same num_rows                      │
│ ❱ 1499 │   │   │   │   result, blocks = _split_both_like(result, blocks)                         │
│   1500 │   │   │   │   for i, row_block in enumerate(blocks):                                    │
│   1501 │   │   │   │   │   result[i].extend(row_block)                                           │
│   1502 │   │   │   return result                                                                 │
│                                                                                                  │
│ /opt/homebrew/lib/python3.11/site-packages/datasets/table.py:1489 in _split_both_like            │
│                                                                                                  │
│   1486 │   │   │   │   │   new_result.append(result.pop(0))                                      │
│   1487 │   │   │   │   │   new_blocks.append(blocks.pop(0))                                      │
│   1488 │   │   │   if result or blocks:                                                          │
│ ❱ 1489 │   │   │   │   raise ValueError("Failed to concatenate on axis=1 because tables don't h  │
│   1490 │   │   │   return new_result, new_blocks                                                 │
│   1491 │   │                                                                                     │
│   1492 │   │   def _extend_blocks(                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Failed to concatenate on axis=1 because tables don't have the same number of rows

To Reproduce
Code to reproduce

import time
import os

import argilla as rg
from datasets import load_dataset
from distilabel.llm.huggingface import InferenceEndpointsLLM
from distilabel.llm.openai_ import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks.preference.judgelm import JudgeLMTask
from distilabel.tasks.text_generation.llama import Llama2GenerationTask


from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

os.environ["HF_TOKEN"] = "...."
os.environ["OPENAI_API_KEY"] = "..."
url = "..."

dataset = load_dataset("openbmb/UltraFeedback", split="train")
# Define a function to count tokens
def count_tokens(example):
    # Tokenize the input text and count the number of tokens
    tokens = tokenizer.tokenize(example["instruction"])
    return {"num_tokens": len(tokens)}

# Apply the function to the dataset
dataset = dataset.map(count_tokens)

dataset = dataset.filter(lambda example: example['num_tokens'] < 1384)

dataset = (
    dataset
    .select(range(1000))
    .remove_columns(["source", "models", "completions", "correct_answers", "incorrect_answers", "num_tokens"])
    .rename_column("instruction", "input")
)

pipeline = Pipeline(
    generator=InferenceEndpointsLLM(
        endpoint_url=url,
        token=os.environ["HF_TOKEN"],
        task=Llama2GenerationTask(),
        max_new_tokens=128,
        num_threads=8,
        temperature=1.0,
    ),
    labeller=OpenAILLM(
        model="gpt-3.5-turbo",
        task=JudgeLMTask(),
        max_new_tokens=512,
        num_threads=8,
        #openai_api_key="<OPENAI_API_KEY>",
        temperature=0.0,
    ),
)

start = time.time()
dataset = pipeline.generate(
    dataset, num_generations=2, batch_size=16, display_progress_bar=True
)
end = time.time()
print("Elapsed", end - start)


[FEATURE] Introduce Critique Task (UF, JudgeLM, Prometheus)

The idea is to introduce a new task to add LLM-as-Judge / critique models that assess a single response individually, instead of a list of generations together.

Ideally, this should work for:

  • len(generations)=1
  • len(generations)>1: needs an architectural change? The idea is to pass several generations and run the critique over each generation individually.

Target models:

  • UF (used by Zephyr train_prefs)
  • JudgeLM
  • Prometheus: high potential as it would mean getting rid of GPT4 as labeller

Rename `PromptTemplate` to `Task` including `ArgillaTemplate` methods

As internally discussed with @dvsrepo and @gabrielmbmb, it makes more sense to rename PromptTemplate to Task, since it manages more things than just the prompt itself, so Task is more meaningful. Besides that, we should also move all the stuff under prompts/integrations/argilla.py defined in ArgillaTemplate to methods on Task that raise NotImplementedError if they are called but not implemented.

Introduce JudgeLM (and brief discussion to finalized the structure)

Hi!

We should include JudgeLM. I've been thinking about how to include it with regard to our discussion about the class structure and how to include new approaches to highly similar tasks (e.g., preference).

So this issue is an open discussion with @alvarobartt and @gabrielmbmb to find the right balance (at least for this early release).

Here's the prompt template (untested), config and output:

judgelm.jinja:
As you can see there's no rating list explaining what's a 1 and what's a 10.

[Question]
{{ instruction }}

{% for response in responses %}
[The Start of Assistant {{ loop.index }}'s Answer> 
{{ response }}
[The End of Assistant {{ loop.index }}'s Answer> 
{%- endfor %}

[System]
{{task_description}}

PreferenceTask settings:

task_description = dedent("""
	We would like to request your feedback on the performance of two AI assistants in response to the
	user question displayed above.
	Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant
	receives an overall score on a scale of 1 to 10, where a higher score indicates better overall
	performance.
	Please first output a single line containing only two values indicating the scores for Assistant 1 and
	2, respectively. The two scores are separated by a space. In the subsequent line, please provide a
	comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the
	order in which the responses were presented does not affect your judgment.
	"""
)
system_prompt = "You are a helpful and precise assistant for checking the quality of the answer."
ratings = None

output
I think they used a much simpler and cleverer way to generate the responses, with far fewer tokens and faster (the ultrafeedback output is bloated).

2,10
Response 1 is a 2 because blah blah and response 2 is a 10 because it's awesome

Looking at this, we can't make this template work by reusing MultiRatingsTask, because we need to rewrite the parse_output function. This means MultiRatingsTask is not a good name.

Even if I'm not a big fan of this approach, we might need to name them UltraFeedbackRating and JudgeLMRating, both implementing PreferenceTask.

What do you think? Are there any other ways, naming, structure? Otherwise is fine to go this way for now.
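
For reference, parsing the JudgeLM-style output above is roughly (a sketch, not a final parse_output):

# Illustrative parser: the first line carries the two scores, the rest is the explanation.
def parse_judgelm_output(output: str) -> dict:
    first_line, _, rationale = output.strip().partition("\n")
    scores = [float(s) for s in first_line.replace(",", " ").split()]
    return {"ratings": scores, "rationale": rationale.strip()}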


Basic docs

Tasks
