Hello, I see ShareGPT's dataset is listed on the readme, but the download for alpaca f

ShareGPT Dataset? about alpaca-cot HOT 8 OPEN

teknium1 commented on September 21, 2024

ShareGPT Dataset?

from alpaca-cot.

Comments (8)

teknium1 commented on September 21, 2024 1

Sorry, I found the dataset on your huggingface.

I looked over it though, and the dataset format might be concerning.

I may be ignorant, but if trained on the ShareGPT alpaca-format dataset, the model may not coherently learn.

ex.

        "instruction": "I have an interview about product speccing with the company Weekend Health. Give me an example of a question they might ask with regards about a new feature",
        "input": "",
        "output": "\"Can you walk us through the process you would use to gather requirements and create a specification for a new feature that our users have been requesting, such as a personalized nutrition plan feature?\""
    },
    {
        "instruction": "Give me a few more",
        "input": "",
        "output": "* \"How would you approach creating a specification for a new feature that integrates with third-party fitness trackers and allows for seamless data tracking and analysis for our users?\"\n* \"Can you give an example of a time when you had to navigate competing stakeholder"
    },

These 2 sequences will likely not be related to eachother during training, making it much more irratic than the way vicuna's original dataset in their format would learn to be

from alpaca-cot.

teknium1 commented on September 21, 2024 1

It would be wise, imo, to alter the vicuna pipeline being used to simply throw away the sequences that get split off, or perhaps, if needed, throw out all convos that are too long, maybe make a 2k ctx length version and a 4k one - since 4k llama models have started to appear (although are not working well at all rn, they will soon)

from alpaca-cot.

dumpmemory commented on September 21, 2024

did u check the _context.json version ?

from alpaca-cot.

float-trip commented on September 21, 2024

Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including this part, which chunks long conversations based on token count.

So rather than throwing out data after hitting the context window, we have a fair amount of chats in sharegpt_context.json that start in the middle of things with the first prompt being something like "[HM]: continue". Not sure if training on this is harmful or helpful.

from alpaca-cot.

teknium1 commented on September 21, 2024

I also think since a lot of datasets are doing this that it likely has something to do with the vicuna "random stopping" issues

from alpaca-cot.

dkqkxx commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

from alpaca-cot.

teknium1 commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

Can you link any of those?

from alpaca-cot.

dkqkxx commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

Can you link any of those?

https://paratranz.cn/projects/6725
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

from alpaca-cot.

ShareGPT Dataset? about alpaca-cot HOT 8 OPEN

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent