
deita's People

Contributors

jxhe, vpeterv, winglian, zeng-wh


deita's Issues

[Question] Regarding the order bias in sample scoring.

Hello, thank you for your great work!

Regarding the EVOL COMPLEXITY method in the paper, where ChatGPT ranks and scores the complexity of samples: I have recently observed that many LLMs tend to score samples in descending order. For example, a sample sequence like ABCD tends to be scored from high to low, and when the order is changed (e.g., CDAB), the scoring trend remains similar.

Have you observed a similar phenomenon, and if so, have you made any corresponding adjustments in your experiments?
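One way to probe this kind of positional bias (a minimal sketch — the `score_batch` function below is a hypothetical stand-in for a single LLM scoring call, not part of the deita codebase) is to score the same samples under several random orderings and collect per-sample scores across permutations:

```python
import random

def check_order_bias(samples, score_batch, n_perms=4, seed=0):
    """Score the same samples under several orderings and collect
    per-sample scores, so positional bias becomes visible."""
    rng = random.Random(seed)
    per_sample = {s: [] for s in samples}
    for _ in range(n_perms):
        order = samples[:]
        rng.shuffle(order)
        scores = score_batch(order)          # one LLM call per permutation
        for sample, score in zip(order, scores):
            per_sample[sample].append(score)
    return per_sample

# Toy scorer that always returns descending scores by position,
# i.e. exactly the bias described above.
biased = lambda batch: list(range(len(batch), 0, -1))
result = check_order_bias(list("ABCD"), biased)
```

If a sample's collected scores vary strongly with its position across permutations, the scorer is position-biased; averaging scores over a few permutations is one cheap mitigation.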

How did you train the complexity & quality scorer

First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.

I wanted to ask whether you can share any details on how you trained your scorers. Was it simple next-token prediction on the collected data samples, 2k each?

What is the significance of the id2score function?

The id2score function returns different values in mistral_scorer.py and llama_scorer.py, for example:

    def id2score(self):
        # Maps tokenizer token ids to the score labels "1".."6"
        id2score = {
                28740: "1",
                28750: "2",
                28770: "3",
                28781: "4",
                28782: "5",
                28784: "6"
                }
        return id2score
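For context, those keys are the Mistral tokenizer's token ids for the digit tokens "1" through "6"; Llama's tokenizer assigns different ids to the same digits, which is why the two files differ. A common way such a map is used (a sketch of the usual pattern, assumed from the snippet above rather than taken from the exact deita code) is to softmax the logits of those six tokens at the answer position and take the probability-weighted average as a continuous score:

```python
import math

def expected_score(logits, id2score):
    """logits: dict mapping token id -> logit at the score position.
    Returns the probability-weighted score, a float in [1, 6]."""
    ids = list(id2score)
    exps = [math.exp(logits[i]) for i in ids]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * int(id2score[i]) for p, i in zip(probs, ids))

id2score = {28740: "1", 28750: "2", 28770: "3",
            28781: "4", 28782: "5", 28784: "6"}

# Illustrative logits where most mass falls on the token for "5":
logits = {28740: -2.0, 28750: -1.0, 28770: 0.0,
          28781: 1.0, 28782: 4.0, 28784: 0.5}
score = expected_score(logits, id2score)  # close to 5
```

This yields a finer-grained ranking than taking the argmax digit alone.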

Some questions about running the scorer with an arbitrary model

Hi, thanks for your great work! I notice that you used ChatGPT as the scorer, but there seems to be no place for us to insert our own API token. Does this mean we cannot use this scorer with an arbitrary model?

Moreover, do you think it can be used as an evaluation metric for LLM output? Thanks.

Data for deita's DPO + SFT

May I ask where the preference data used during the DPO stage of your dialogue model comes from? Do you plan to open-source it? Thank you.

Has the EVOL process of the instruction dataset been released?

This is a very interesting work! Thanks for publishing the datasets deita-complexity-scorer-data and deita-quality-scorer-data.

According to Table 14 and Table 18 in this work (the prompts for ranking and scoring), capturing the small differences among EVOL variants is important. Has the EVOL process of the instruction dataset been released?
I found 9481 training samples in deita-complexity-scorer-data and 9276 training samples in deita-quality-scorer-data, but I cannot find the EVOL process of each instruction.

Question 1: Has the EVOL process (the relationship from M=1 to M=5) of the instruction dataset been released?
Question 2: Did deita-complexity-scorer-data and deita-quality-scorer-data undergo Elimination Evolving as described in WizardLM?

Thanks!

Questions about performance improvement in Open LLM leaderboard

Hi,
First of all, thank you for sharing your wonderful work!

I was searching for efficient ways of mining instructions for instruction-tuning LLMs.
While reading the manuscript and investigating the open-sourced 6k & 10k datasets you provide,
I could not intuitively understand why the SFT (6k) + DPO (10k) training method increases performance on
multiple-choice question-answering tasks such as ARC-Challenge and MMLU.

In the dataset, the instances are conversations between humans and GPT, which contain no clue about solving multiple-choice QA problems.

Do you have any ideas why it worked?

Cosine distance computation

Hello,

While the paper and the code both say that cosine distance is used to promote diversity, it seems that the current implementation computes cosine similarity instead of distance:

matrix_norm = matrix / matrix.norm(dim=1)[:, None]
matrix_2_norm = matrix_2 / matrix_2.norm(dim=1)[:, None]
return torch.mm(matrix_norm, matrix_2_norm.t())

If cosine similarity is used, it would enforce data similarity rather than diversity. Any clarification would be much appreciated!

Best,
Sang
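For reference, cosine distance is simply one minus cosine similarity, so if distance is intended the returned matrix would need one extra step (a numpy sketch of the distinction, not the repository's torch code):

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-normalize, then take pairwise dot products (as in the snippet above).
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def cosine_distance(a, b):
    # Distance = 1 - similarity: 0 for identical directions, 1 for orthogonal.
    return 1.0 - cosine_similarity(a, b)

x = np.array([[1.0, 0.0], [0.0, 1.0]])
d = cosine_distance(x, x)
# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
```

A diversity filter should then keep a candidate only when its distance to already-selected samples exceeds a threshold (or, equivalently, invert the comparison if similarity is used directly).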

Could you please publish the original data pool?

Hi,
First of all, thank you for your work and the great repo!

As stated in the title, could you please provide the original data pool used in your paper, especially $X_{sota}$? I have tried to obtain the dataset following the references in the paper, but I cannot find versions of the ShareGPT and UltraChat Huggingface datasets that match the statistics stated in the paper. I would greatly appreciate it if you could provide the dataset or explain how to filter the two datasets out of existing Huggingface datasets.

Best regards

Scorer models on hub are 7b not 13b

Hello, it appears that the scorer models on the hub are 7B models rather than the 13B specified in the model card.

from transformers import AutoModelForCausalLM

model_name = "hkust-nlp/deita-quality-scorer"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Count trainable parameters and convert to billions
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_params_in_billions = num_params / 1_000_000_000

print(f"Number of parameters in the {model_name} model: {num_params_in_billions} B")  # 6.73B

Do you have the 13B versions available?

The length of samples

It seems each sample in the deita dataset consists of many turns and is very long (>10k tokens). Your paper mentions that the max input length for SFT is 2048. Does that mean most of the text of each training sample is truncated and discarded?
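For illustration, if preprocessing simply truncates at the maximum length (an assumption about typical SFT pipelines, not a statement about the actual deita code), everything past token 2048 of a conversation is dropped unless long conversations are instead split into multiple training samples:

```python
MAX_LEN = 2048

def truncate(token_ids, max_len=MAX_LEN):
    # Keep only the first max_len tokens; the rest never reach the model.
    return token_ids[:max_len]

sample = list(range(10_000))          # stand-in for a >10k-token conversation
kept = truncate(sample)
dropped = len(sample) - len(kept)     # tokens discarded by truncation
```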

reproduce mt-bench score

Dear Authors,

Thank you for your great work! I'm trying to reproduce the reported MT-Bench scores with the released code and data.

Trying to reproduce:
DEITA-7B-v1.0 (6K) --> mt-bench: 7.22
DEITA-7B-v1.0-sft --> mt-bench: 7.32

Data I used:
hkust-nlp/deita-6k-v0
hkust-nlp/deita-10k-v0

Code I used:
https://github.com/hkust-nlp/deita/blob/main/examples/train/sft.sh

The scores I got for both 6k and 10k are around 7.06 (vs. 7.22 and 7.32). The difference seems larger than typical SFT and MT-Bench evaluation variance.

Any suggestions to resolve the discrepancy would be appreciated.

Thanks!

Computational cost of the algorithm

Hi,

I cannot find the computational cost of the algorithm (selection of the data samples) in the paper. Do you have the complexity of the algorithm and any tables with runtimes, showing how long it takes to run the selection algorithm (based on quality, complexity, and diversity) given X instances?

Thanks.
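As a rough reference, a score-first greedy selection with a nearest-neighbour distance check (a simplified sketch in the spirit of the paper's strategy, not the released implementation) costs O(N·k) distance computations for a pool of N items and k selections, on top of an O(N log N) sort:

```python
import numpy as np

def greedy_select(embeddings, scores, k, threshold=0.1):
    """Walk the pool in descending score order; keep a sample only if its
    cosine distance to every already-selected sample exceeds `threshold`."""
    order = np.argsort(-scores)                 # best-scored first
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = []
    for i in order:
        if selected:
            sims = norms[selected] @ norms[i]
            if 1.0 - sims.max() < threshold:    # too close to something kept
                continue
        selected.append(int(i))
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))   # toy embeddings
scr = rng.random(100)             # toy combined quality/complexity scores
picked = greedy_select(emb, scr, k=10)
```

The embedding step (one forward pass per pool sample) typically dominates the wall-clock time; the selection loop itself is cheap at these scales.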

Question about which score to ultimately use for the filtering process.

As illustrated in the left part of Figure 1, we then ask ChatGPT to rank and score these 6 samples
(prompt in Appendix E.2), obtaining the complexity scores c corresponding to the instructions. We
emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these
samples represent different evolution stages of the same original sample and such a scoring scheme
helps ChatGPT capture the small complexity differences among them, which leads to complexity
scores to achieve finer-grained complexity differentiation among samples.

So is the score used for filtering the highest complexity score or the original sample's score? Is Evol used to make the score more reliable?

Questions about the "Pool=50K" in your paper.

Hi, thanks for your work! I have some questions about your experiment in training the complexity scorer.

"Pool=50K" denotes the data selection procedure is conducted in a 50K-sized subset due to the cost of using ChatGPT to annotate the entire pool.

1. Is the data used for "EVOL COMPLEXITY (Pool=50K)" sampled from the 50K subset, while that for "EVOL COMPLEXITY" is sampled from the original data pool?
2. How do you sample the data from the original data pool?
Looking forward to your reply!
