
deita's People

Contributors

jxhe, vpeterv, winglian, zeng-wh


deita's Issues

[Question] Regarding the order bias in sample scoring.

Hello, thank you for your great work!

Regarding the EVOL COMPLEXITY method in the paper, where ChatGPT ranks and scores the complexity of samples: I have recently observed that many LLMs tend to score samples in descending order. For example, a sample sequence like ABCD tends to be scored from high to low, and when the order is changed (e.g., CDAB), the scoring trend remains similar.

Have you observed a similar phenomenon, and if so, have you made any corresponding adjustments in your experiments?
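One way to probe this kind of positional bias (a minimal sketch — the `score_batch` function below is a hypothetical stand-in for a single LLM scoring call, not part of the deita codebase) is to score the same samples under several random orderings and collect per-sample scores across permutations:

```python
import random

def check_order_bias(samples, score_batch, n_perms=4, seed=0):
    """Score the same samples under several orderings and collect
    per-sample scores, so positional bias becomes visible."""
    rng = random.Random(seed)
    per_sample = {s: [] for s in samples}
    for _ in range(n_perms):
        order = samples[:]
        rng.shuffle(order)
        scores = score_batch(order)          # one LLM call per permutation
        for sample, score in zip(order, scores):
            per_sample[sample].append(score)
    return per_sample

# Toy scorer that always returns descending scores by position,
# i.e. exactly the bias described above.
biased = lambda batch: list(range(len(batch), 0, -1))
result = check_order_bias(list("ABCD"), biased)
```

If a sample's collected scores vary strongly with its position across permutations, the scorer is position-biased; averaging scores over a few permutations is one cheap mitigation.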

How did you train the complexity & quality scorer

First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.

I wanted to ask whether you can share any details on how you trained your scorers. Was it simple next-token prediction on the collected data samples, 2k each?

What is the significance of the id2score function?

The id2score function returns different values in mistral_scorer.py and llama_scorer.py, for example:

    def id2score(self):
        # Maps tokenizer token ids to the score labels "1".."6"
        id2score = {
                28740: "1",
                28750: "2",
                28770: "3",
                28781: "4",
                28782: "5",
                28784: "6"
                }
        return id2score
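For context, those keys are the Mistral tokenizer's token ids for the digit tokens "1" through "6"; Llama's tokenizer assigns different ids to the same digits, which is why the two files differ. A common way such a map is used (a sketch of the usual pattern, assumed from the snippet above rather than taken from the exact deita code) is to softmax the logits of those six tokens at the answer position and take the probability-weighted average as a continuous score:

```python
import math

def expected_score(logits, id2score):
    """logits: dict mapping token id -> logit at the score position.
    Returns the probability-weighted score, a float in [1, 6]."""
    ids = list(id2score)
    exps = [math.exp(logits[i]) for i in ids]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * int(id2score[i]) for p, i in zip(probs, ids))

id2score = {28740: "1", 28750: "2", 28770: "3",
            28781: "4", 28782: "5", 28784: "6"}

# Illustrative logits where most mass falls on the token for "5":
logits = {28740: -2.0, 28750: -1.0, 28770: 0.0,
          28781: 1.0, 28782: 4.0, 28784: 0.5}
score = expected_score(logits, id2score)  # close to 5
```

This yields a finer-grained ranking than taking the argmax digit alone.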

Some questions about running the scorer with an arbitrary model

Hi, thanks for your great work! I notice that you used ChatGPT as the scorer, but there seems to be no place for us to insert our own API token. Does this mean we cannot use this scorer with an arbitrary model?

Moreover, do you think it can be used as an evaluation metric for LLM output? Thanks.

Data for deita's DPO + SFT

May I ask where the preference data used during the DPO stage of your dialogue model comes from? Do you plan to open-source it? Thank you.

Has the EVOL process of the instruction dataset been released?

This is a very interesting work! Thanks for publishing the datasets deita-complexity-scorer-data and deita-quality-scorer-data.

According to Table 14 and Table 18 in this work (the prompts for ranking and scoring), capturing the small differences among EVOL variants is important. Has the EVOL process of the instruction dataset been released?
I found 9481 training samples in deita-complexity-scorer-data and 9276 training samples in deita-quality-scorer-data, but I cannot find the EVOL process of each instruction.

Question 1: Has the EVOL process (the relationship from M=1 to M=5) of the instruction dataset been released?
Question 2: Did deita-complexity-scorer-data and deita-quality-scorer-data undergo Elimination Evolving as described in WizardLM?

Thanks!

Questions about performance improvement in Open LLM leaderboard

Hi,
First of all, thank you for sharing your wonderful work!

I was searching for efficient ways of mining instructions for instruction-tuning LLMs.
While reading the manuscript and investigating the open-sourced 6k & 10k datasets you provide,
I could not intuitively understand why the SFT (6k) + DPO (10k) training method increases performance on
multiple-choice question-answering tasks such as ARC-Challenge and MMLU.

In the dataset, the instances are conversations between humans and GPT, which contain no clue about solving multiple-choice QA problems.

Do you have any ideas why it worked?

Cosine distance computation

Hello,

While the paper and the code both say that cosine distance is used to promote diversity, it seems that the current implementation computes cosine similarity instead of distance:

matrix_norm = matrix / matrix.norm(dim=1)[:, None]
matrix_2_norm = matrix_2 / matrix_2.norm(dim=1)[:, None]
return torch.mm(matrix_norm, matrix_2_norm.t())

If cosine similarity is used, it would enforce data similarity rather than diversity. Any clarification would be much appreciated!

Best,
Sang
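For reference, cosine distance is simply one minus cosine similarity, so if distance is intended the returned matrix would need one extra step (a numpy sketch of the distinction, not the repository's torch code):

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-normalize, then take pairwise dot products (as in the snippet above).
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def cosine_distance(a, b):
    # Distance = 1 - similarity: 0 for identical directions, 1 for orthogonal.
    return 1.0 - cosine_similarity(a, b)

x = np.array([[1.0, 0.0], [0.0, 1.0]])
d = cosine_distance(x, x)
# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
```

A diversity filter should then keep a candidate only when its distance to already-selected samples exceeds a threshold (or, equivalently, invert the comparison if similarity is used directly).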

Could you please publish the original data pool?

Hi,
First of all, thank you for your work and the great repo!

As stated in the title, could you please provide the original data pool used in your paper, especially $X_{sota}$? I have tried to obtain the dataset following the references in the paper, but I cannot find versions of the ShareGPT and UltraChat Huggingface datasets that match the statistics stated in the paper. I would greatly appreciate it if you could provide the dataset or explain how to filter the two datasets out of existing Huggingface datasets.

Best regards

Scorer models on hub are 7b not 13b

Hello, it appears that the scorer models on the hub are 7B models rather than the 13B specified in the model card.

from transformers import AutoModelForCausalLM

model_name = "hkust-nlp/deita-quality-scorer"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Count trainable parameters and convert to billions
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_params_in_billions = num_params / 1_000_000_000

print(f"Number of parameters in the {model_name} model: {num_params_in_billions} B")  # 6.73B

Do you have the 13B versions available?

The length of samples

It seems each sample in the deita dataset consists of many turns and is very long (>10k tokens). Your paper mentions that the max input length for SFT is 2048. Does that mean most of the text of each training sample is truncated and discarded?
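For illustration, if preprocessing simply truncates at the maximum length (an assumption about typical SFT pipelines, not a statement about the actual deita code), everything past token 2048 of a conversation is dropped unless long conversations are instead split into multiple training samples:

```python
MAX_LEN = 2048

def truncate(token_ids, max_len=MAX_LEN):
    # Keep only the first max_len tokens; the rest never reach the model.
    return token_ids[:max_len]

sample = list(range(10_000))          # stand-in for a >10k-token conversation
kept = truncate(sample)
dropped = len(sample) - len(kept)     # tokens discarded by truncation
```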

reproduce mt-bench score

Dear Authors,

Thank you for your great work! I'm trying to reproduce the reported MT-Bench scores with the released code and data.

Trying to reproduce:
DEITA-7B-v1.0 (6K) --> mt-bench: 7.22
DEITA-7B-v1.0-sft --> mt-bench: 7.32

Data I used:
hkust-nlp/deita-6k-v0
hkust-nlp/deita-10k-v0

Code I used:
https://github.com/hkust-nlp/deita/blob/main/examples/train/sft.sh

The scores I got for both 6k and 10k are around 7.06 (vs. 7.22 and 7.32). The difference seems larger than typical SFT and MT-Bench evaluation variance.

Any suggestions to resolve the discrepancy would be appreciated.

Thanks!

Computational cost of the algorithm

Hi,

I cannot find the computational cost of the algorithm (selection of the data samples) in the paper. Do you have the complexity of the algorithm and any tables with runtimes, showing how long it takes to run the selection algorithm (based on quality, complexity, and diversity) given X instances?

Thanks.
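As a rough reference, a score-first greedy selection with a nearest-neighbour distance check (a simplified sketch in the spirit of the paper's strategy, not the released implementation) costs O(N·k) distance computations for a pool of N items and k selections, on top of an O(N log N) sort:

```python
import numpy as np

def greedy_select(embeddings, scores, k, threshold=0.1):
    """Walk the pool in descending score order; keep a sample only if its
    cosine distance to every already-selected sample exceeds `threshold`."""
    order = np.argsort(-scores)                 # best-scored first
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = []
    for i in order:
        if selected:
            sims = norms[selected] @ norms[i]
            if 1.0 - sims.max() < threshold:    # too close to something kept
                continue
        selected.append(int(i))
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))   # toy embeddings
scr = rng.random(100)             # toy combined quality/complexity scores
picked = greedy_select(emb, scr, k=10)
```

The embedding step (one forward pass per pool sample) typically dominates the wall-clock time; the selection loop itself is cheap at these scales.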

Question about which score to ultimately use for the filtering process.

As illustrated in the left part of Figure 1, we then ask ChatGPT to rank and score these 6 samples
(prompt in Appendix E.2), obtaining the complexity scores c corresponding to the instructions. We
emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these
samples represent different evolution stages of the same original sample and such a scoring scheme
helps ChatGPT capture the small complexity differences among them, which leads to complexity
scores to achieve finer-grained complexity differentiation among samples.

So is the score used for filtering the highest complexity score or the original sample's score? Is Evol used to make the score more reliable?

Questions about the "Pool=50K" in your paper.

Hi, thanks for your work! I have some questions about your experiment in training the complexity scorer.

"Pool=50K" denotes the data selection procedure is conducted in a 50K-sized subset due to the cost of using ChatGPT to annotate the entire pool.

1. Is the data used for "EVOL COMPLEXITY (Pool=50K)" sampled from the 50K subset, while that for "EVOL COMPLEXITY" is sampled from the original data pool?
2. How do you sample the data from the original data pool?
Looking forward to your reply!
