hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
License: Apache License 2.0
Is only the prompt scored, or the prompt plus the response?
Does the method support other languages, such as Chinese?
Have you conducted ablation experiments with three factors: Complexity, Quality, and Diversity? Which one has the greatest impact on performance improvement?
Hello, thank you for your great work!
Regarding the EVOL COMPLEXITY method in the paper, where ChatGPT ranks and scores the complexity of samples: I have recently observed that many LLMs tend to assign scores in descending order regardless of content. For example, a sample sequence ABCD tends to be scored from high to low, and when the order is changed (e.g., to CDAB), the descending trend persists.
Have you observed a similar phenomenon, and if so, have you made any corresponding adjustments in your experiments?
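One common mitigation for this kind of position bias is to score several random orderings of the same candidates and average each sample's scores across orderings. The sketch below assumes a hypothetical `score_fn` that takes a batch of samples and returns one score per sample (a stand-in for the actual LLM call, which is not shown here):

```python
import random
from collections import defaultdict

def debiased_scores(samples, score_fn, n_permutations=4, seed=0):
    """Average scores over several random orderings so that any
    position-dependent drift (e.g. always scoring high-to-low)
    washes out across permutations."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_permutations):
        order = list(range(len(samples)))
        rng.shuffle(order)
        shuffled = [samples[i] for i in order]
        scores = score_fn(shuffled)  # hypothetical: one call, one score per sample
        for pos, idx in enumerate(order):
            totals[idx] += scores[pos]
    return [totals[i] / n_permutations for i in range(len(samples))]
```

A position-independent `score_fn` should come back unchanged, which is an easy sanity check for the index bookkeeping.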
First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.
I wanted to ask if you can share any details on how you trained your scorer. Was it simple next-token prediction on the collected data samples? 2k each?
The id2score function returns different values in mistral_scorer.py and llama_scorer.py, for example:

def id2score(self):
    # Mistral tokenizer IDs for the digit tokens "1" through "6"
    id2score = {
        28740: "1",
        28750: "2",
        28770: "3",
        28781: "4",
        28782: "5",
        28784: "6",
    }
    return id2score
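For context, a mapping like this is typically used to turn the model's next-token logits over the six digit tokens into a single scalar score. The sketch below illustrates that idea and is not the repo's exact code; the token IDs come from the snippet above, and the scores are kept as integers here so they can be averaged:

```python
import math

# Mistral tokenizer IDs for the digit tokens "1".."6" (from mistral_scorer.py)
ID2SCORE = {28740: 1, 28750: 2, 28770: 3, 28781: 4, 28782: 5, 28784: 6}

def expected_score(next_token_logits):
    """Softmax over the six digit-token logits, then take the
    probability-weighted mean score. `next_token_logits` maps
    token id -> logit for the first generated token."""
    logits = [next_token_logits[t] for t in ID2SCORE]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * s for p, s in zip(probs, ID2SCORE.values()))
```

With uniform logits this yields the midpoint 3.5; a strongly peaked logit pulls the score toward that digit.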
Hi, thanks for your great work! I notice that you used ChatGPT for the scorer, but there seems to be no way for us to insert our own token. Does this mean we cannot use this scorer with an arbitrary model?
Moreover, do you think it can be used as an evaluation metric for LLM output? Thanks.
May I ask where the preference data used for your dialogue model during the DPO process comes from? Is there a plan to open-source it? Thank you.
This is very interesting work! Thanks for publishing the deita-complexity-scorer-data and deita-quality-scorer-data datasets.
According to Tables 14 and 18 in this work (the prompts for ranking and scoring), capturing the small differences among EVOL variants is important. Has the EVOL process of the instruction dataset been released?
I found 9,481 training samples in deita-complexity-scorer-data and 9,276 training samples in deita-quality-scorer-data, but I cannot find the EVOL process for each instruction.
Question 1: Has the EVOL process (the relationship from M=1 to M=5) of the instruction dataset been released?
Question 2: Did deita-complexity-scorer-data and deita-quality-scorer-data undergo Elimination Evolving as described in WizardLM?
Thanks!
If I use pip install vllm directly, there seem to be version incompatibility issues.
Thanks!
Hi,
First of all, thank you for sharing your wonderful work!
I was searching for efficient ways of mining instructions used in instruction-tuning LLMs.
While reading the manuscript and investigating the open-sourced 6k & 10k datasets you provide,
I could not intuitively understand why the SFT (6k) + DPO (10k) training method improves performance on
multiple-choice question answering tasks such as ARC-Challenge and MMLU.
In the dataset, the instances are conversations between humans and GPT that contain no obvious signal for solving multiple-choice QA problems.
Do you have any idea why it worked?
Hello,
While the paper and the code both say that cosine distance is used to promote diversity, it seems that the current implementation computes cosine similarity instead of distance:
deita/src/deita/selection/filter/base.py
Lines 33 to 35 in 983e98f
If cosine similarity is used, it enforces data similarity rather than diversity. Any clarification would be much appreciated!
Best,
Sang
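For reference, the two quantities are related by distance = 1 - similarity, so a similarity threshold and a distance threshold are interchangeable as long as the comparison direction matches. A minimal sketch of a diversity filter with hypothetical helper names (not the repo's implementation):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_diverse(candidate, selected, threshold=0.9):
    """Keep a candidate only if its nearest already-selected neighbor
    is dissimilar enough. Since distance = 1 - similarity,
    'distance > 1 - threshold' is the same test as 'similarity < threshold'."""
    if not selected:
        return True
    nearest_sim = max(cosine_similarity(candidate, s) for s in selected)
    return (1.0 - nearest_sim) > (1.0 - threshold)
```

The point of the question above is exactly this direction flip: comparing similarity (not distance) against a threshold with the wrong inequality would admit near-duplicates instead of filtering them.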
Thank you for your patience.
Hi,
Thanks for your interesting and great work. I want to know whether the 6k dataset is a subset of the 10k dataset.
Hi,
First of all, thank you for your work and the great repo!
As stated in the title, could you please provide the original data pool used in your paper, especially
Best regards
Hi the team,
Great work! I wonder whether it would be possible to publish your training script for the scoring model? My local version is not working. Thanks!
Hello, it appears that the scorer models on the Hub are 7B models rather than the 13B specified in the model card.
from transformers import AutoModelForCausalLM

model_name = "hkust-nlp/deita-quality-scorer"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Count trainable parameters and report the total in billions
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_params_in_billions = num_params / 1_000_000_000
print(f"Number of parameters in the {model_name} model: {num_params_in_billions} B")  # 6.73B
Do you have the 13b versions available?
It seems each sample in the deita dataset consists of many turns and is very long (>10k tokens). Your paper mentions that the max input length for SFT is 2048. Does that mean most of the text of each training sample is truncated and discarded?
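As a quick way to quantify this concern, one can tokenize each sample, cap the counts at 2048, and measure the overall discarded fraction. A minimal sketch; the per-sample token counts are assumed to come from whatever tokenizer you use, which is not shown here:

```python
def truncation_stats(token_counts, max_len=2048):
    """Given per-sample token counts, return the per-sample counts
    that survive a hard max_len cut, plus the overall fraction of
    tokens that would be discarded."""
    kept = [min(n, max_len) for n in token_counts]
    total = sum(token_counts)
    discarded = total - sum(kept)
    return kept, (discarded / total if total else 0.0)
```

For samples much longer than the cap, the discarded fraction quickly dominates, which is exactly what the question is getting at.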
Dear Authors,
Thank you for your great work! I'm trying to reproduce the reported MT-Bench scores with the released code and data.
Trying to reproduce:
DEITA-7B-v1.0 (6K) --> mt-bench: 7.22
DEITA-7B-v1.0-sft --> mt-bench: 7.32
Data I used:
hkust-nlp/deita-6k-v0
hkust-nlp/deita-10k-v0
Code I used:
https://github.com/hkust-nlp/deita/blob/main/examples/train/sft.sh
The scores I got for both 6k and 10k are around 7.06 (vs. the reported 7.22 and 7.32). The difference seems larger than regular SFT and MT-Bench evaluation variability.
Any suggestions to resolve the discrepancy would be appreciated.
Thanks!
Hi,
I cannot find the computational cost of the algorithm (selection of the data samples) in the paper. Do you have the complexity of the algorithm and any tables with runtimes, showing how long it takes to run the selection algorithm (based on quality, complexity, and diversity) given X instances?
Thanks.
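For what it's worth, a score-first greedy filter of this shape costs roughly O(N * k * d) similarity checks for N candidates, k selected samples, and embedding dimension d. A minimal sketch under that assumption (hypothetical names, not the repo's exact implementation):

```python
import numpy as np

def greedy_select(embeddings, scores, k, sim_threshold=0.9):
    """Score-first greedy selection sketch: visit samples from highest
    to lowest score and keep one only if it is not too similar to
    anything already kept. Worst case is O(N * k * d) dot products."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]          # highest score first
    selected = []
    for i in order:
        if len(selected) == k:
            break
        if not selected or np.max(norms[selected] @ norms[i]) < sim_threshold:
            selected.append(int(i))
    return selected
```

So the wall-clock cost is dominated by the embedding pass plus, for large k, the nearest-neighbor checks; actual runtimes for a given X would have to come from the authors.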
As illustrated in the left part of Figure 1, we then ask ChatGPT to rank and score these 6 samples
(prompt in Appendix E.2), obtaining the complexity scores c corresponding to the instructions. We
emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these
samples represent different evolution stages of the same original sample and such a scoring scheme
helps ChatGPT capture the small complexity differences among them, yielding complexity scores with
finer-grained differentiation among samples.
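The "all six variants in one prompt" scheme can be sketched roughly as follows; the wording below is a paraphrase for illustration, not the paper's actual prompt (that is given in Appendix E.2):

```python
def build_rank_prompt(variants):
    """Pack all evolution stages of the same instruction into one
    prompt so the model ranks their relative complexity, rather
    than scoring each variant in isolation."""
    lines = ["Rank the following instructions by complexity and give each a score from 1 to 6:"]
    for i, v in enumerate(variants, start=1):
        lines.append(f"[{i}] {v}")
    return "\n".join(lines)
```

Because all variants share one context, the model sees their differences directly instead of scoring each against an implicit, drifting baseline.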
So is the score used for filtering the highest complexity score or the original score? Is Evol used to make the score more reliable?
Hi, thanks for your work! I have some questions about your experiment in training the complexity scorer.
"'Pool=50K' denotes the data selection procedure is conducted in a 50K-sized subset due to the cost of using ChatGPT to annotate the entire pool."
1. Is the data used for "EVOL COMPLEXITY (Pool=50K)" sampled from the 50K subset, while that for "EVOL COMPLEXITY" is sampled from the original data pool?
2. How do you sample the data from the original data pool?
Looking forward to your reply!