salesforce / dialogstudio

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI

License: Apache License 2.0


News!

  • 🎉 [AI Agent] March 18, 2024: Update on xLAM for AI agents. Check xLAM for the latest data and models relevant to AI agents!
  • 🎉 [Dataset Viewer] March 17, 2024: Update for dataset viewer issues on HuggingFace. Please refer to this repo to view each dataset; we provide 5 converted examples along with 5 original examples under each data folder. For example, ShareGPT contains two files: converted_examples.json and original_example.json.
  • [Upload models] Aug 18, 2023: We uploaded version 1.0 models (dialogstudio-t5-base-v1.0, dialogstudio-t5-large-v1.0, dialogstudio-t5-3b-v1.0) trained on a few selected DialogStudio datasets and more than 1000 general tasks.
  • [Version 1.0.1] Aug 1, 2023: We resolved minor issues in a few dialogues, added prompts for selected knowledge-grounded datasets, removed the requirement for a HuggingFace login, and updated the SODA and ShareGPT datasets.
  • [Initial Release] July 2023: We're thrilled to announce the initial release of the largest unified dialog dataset collection. The full list of all available datasets is here.

Introduction

DialogStudio is a large collection of unified dialog datasets. The figure below summarizes the general statistics of DialogStudio. DialogStudio unifies each dataset while preserving its original information, which supports research on both individual datasets and Large Language Model (LLM) training. The full list of all available datasets is here.

The data are downloadable through HuggingFace, as described in Loading Data. We also provide examples for each dataset in this repo. For more granular, category-specific details, please refer to the individual folders for each category within the DialogStudio collection, e.g., the MULTIWOZ2_2 dataset under the task-oriented-dialogues category.



DialogStudio evaluates dialogue quality based on six critical criteria, namely Understanding, Relevance, Correctness, Coherence, Completeness, and Overall Quality. Each criterion is scored on a scale of 1 to 5, with the highest scores reserved for exceptional dialogues.
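
The six criteria and the 1-to-5 scale can be illustrated with a small helper that validates score records and averages them per criterion. This is a minimal sketch only: the names `CRITERIA`, `validate_scores`, and `average_by_criterion`, and the dictionary layout of a score record, are assumptions made for illustration and are not taken from the DialogStudio evaluation script.

```python
from statistics import mean

# The six criteria named above. The dict-per-dialogue layout is an assumption.
CRITERIA = (
    "Understanding",
    "Relevance",
    "Correctness",
    "Coherence",
    "Completeness",
    "Overall Quality",
)


def validate_scores(scores):
    """Check that a record has all six criteria, each scored from 1 to 5."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        value = scores[criterion]
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} score {value} is outside 1..5")
    return scores


def average_by_criterion(records):
    """Average each criterion over a list of validated score records."""
    validated = [validate_scores(r) for r in records]
    return {c: mean(r[c] for r in validated) for c in CRITERIA}
```

For example, passing two records to `average_by_criterion` returns one mean per criterion, and a record with a score of 0 or 6 raises `ValueError` before any averaging happens.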

Given the vast number of datasets incorporated into DialogStudio, we utilized 'gpt-3.5-turbo' to assess 33 distinct datasets. The corresponding script used for this evaluation can be accessed through the link.

The results of our dialogue quality assessment are presented below. We intend to release evaluation scores for individually selected dialogues in the upcoming period.



Loading Data

You can load any dataset in DialogStudio from the HuggingFace hub by specifying the {dataset_name}, which is exactly the dataset folder name. All available datasets are described in dataset content.

Below is one example to load the MULTIWOZ2_2 dataset under the task-oriented-dialogues category:

Load the dataset

from datasets import load_dataset

dataset = load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')

Here is the output structure of MultiWOZ 2.2:

DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 8437
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
})
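
Once a dataset is loaded, a quick sanity check is to compare the row count of each split against the structure shown above. The sketch below assumes only that the returned object behaves like a mapping from split name to a sized collection, which is the case for `DatasetDict`; the helper names `summarize_splits` and `load_dialogstudio` are hypothetical. Depending on the installed version of `datasets`, `load_dataset` may need additional arguments.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: number_of_rows} for a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset_dict.items()}


def load_dialogstudio(dataset_name):
    """Load one DialogStudio dataset by its folder name, e.g. 'MULTIWOZ2_2'."""
    # Imported lazily so summarize_splits works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset('Salesforce/dialogstudio', dataset_name)


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    dataset = load_dialogstudio('MULTIWOZ2_2')
    print(summarize_splits(dataset))
    # Each record is a dict keyed by the feature names listed above.
    print(sorted(dataset['train'][0].keys()))
```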

Datasets

The datasets are split into several categories in this GitHub repository and on the HuggingFace hub. You can check the table of datasets for more information, and you can open each folder to see a few examples.

Model

We've rolled out version 1.0 of models (dialogstudio-t5-base-v1.0, dialogstudio-t5-large-v1.0, dialogstudio-t5-3b-v1.0) trained on a few selected DialogStudio datasets. Check each Model Card for more details.

Below is an example of running the model on CPU:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-base-v1.0")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-base-v1.0")

input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
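
To try several prompts, the tokenize, generate, and decode steps above can be wrapped in a function. The following is a sketch rather than part of the released code: `generate_response` and `load_model_and_tokenizer` are hypothetical helper names, while the `transformers` calls are the same ones used in the example above.

```python
def generate_response(model, tokenizer, input_text, max_new_tokens=256):
    """Tokenize input_text, generate with the model, and decode the first output."""
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def load_model_and_tokenizer(name="Salesforce/dialogstudio-t5-base-v1.0"):
    """Load a DialogStudio checkpoint and its tokenizer from the HuggingFace hub."""
    # Imported lazily so generate_response can be tested without `transformers`.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return model, tokenizer


if __name__ == "__main__":
    # Downloads the checkpoint on first use; requires `transformers` and `torch`.
    model, tokenizer = load_model_and_tokenizer()
    question = (
        "Answer the following yes/no question by reasoning step-by-step. "
        "Can you write 200 words in a single tweet?"
    )
    print(generate_response(model, tokenizer, question))
```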

License

Licensing for this project is structured as follows:

  1. For all the modified datasets in DialogStudio:
    • A portion of these datasets is under the Apache License 2.0.
    • Some retain their original licenses even after modification.
    • For a few datasets that lacked a license, we have cited the relevant papers.
  2. Original dataset licenses: For reference, we also put the originally available licenses for each dataset into their respective dataset folders.
  3. Code: Our codebase is under the Apache License 2.0.

For detailed licensing information, please refer to the specific licenses accompanying the original datasets. It is important to familiarize yourself with these terms as we do not assume responsibility for licensing issues.

Acknowledgement

We sincerely thank all dataset authors who have contributed to the Conversational AI field. Despite careful efforts, inaccuracies in our citations or references may occur. If you spot any errors or omissions, please raise an issue or submit a pull request to help us improve. Thank you!

Citation

The data and code in this repository were mostly developed for or derived from the paper below. If you use datasets from DialogStudio, we kindly request that you cite both the original work and our own work (accepted by EACL 2024 Findings as a long paper).

@article{zhang2023dialogstudio,
  title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},
  author={Zhang, Jianguo and Qian, Kun and Liu, Zhiwei and Heinecke, Shelby and Meng, Rui and Liu, Ye and Yu, Zhou and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint arXiv:2307.10172},
  year={2023}
}

Contribution

We enthusiastically invite contributions from the community! Join us in our shared mission to propel the field of conversational AI forward!


dialogstudio's Issues

questions about the language of the datasets

Hi there,

I wonder if you can provide details about the languages of the datasets so that I can filter out a specific language.

For example, I only want those datasets in Chinese, so how can I do that?

I will appreciate your help!

Does the diversity of model outputs have an impact on the metric score?

Hello, excellent work! I'm curious whether the diversity of model outputs affects the metric scores. Specifically, the baselines in the paper generate responses from the context in zero-shot and few-shot settings, but the model outputs differ across runs. Does this affect the stability of the evaluation results?

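External knowledge setup for multi-turn dialogues

The paper states: "We use the format Instruction \n user utterance system response ... user utterance \n supported knowledge to train the model."
Does this mean that only one turn's response is predicted at a time, so that the EXTERNAL KNOWLEDGE for each turn is fixed?
If responses for multiple turns are predicted at once, the EXTERNAL KNOWLEDGE presumably differs from turn to turn. How should that case be handled?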
