Coder Social home page Coder Social logo

ShareGPT Dataset? about alpaca-cot HOT 8 OPEN

teknium1 avatar teknium1 commented on September 21, 2024
ShareGPT Dataset?

from alpaca-cot.

Comments (8)

teknium1 avatar teknium1 commented on September 21, 2024 1

Sorry, I found the dataset on your huggingface.

I looked over it though, and the dataset format might be concerning.

I may be ignorant, but if trained on the ShareGPT alpaca-format dataset, the model may not coherently learn.

ex.

        "instruction": "I have an interview about product speccing with the company Weekend Health. Give me an example of a question they might ask with regards about a new feature",
        "input": "",
        "output": "\"Can you walk us through the process you would use to gather requirements and create a specification for a new feature that our users have been requesting, such as a personalized nutrition plan feature?\""
    },
    {
        "instruction": "Give me a few more",
        "input": "",
        "output": "* \"How would you approach creating a specification for a new feature that integrates with third-party fitness trackers and allows for seamless data tracking and analysis for our users?\"\n* \"Can you give an example of a time when you had to navigate competing stakeholder"
    },

These 2 sequences will likely not be related to eachother during training, making it much more irratic than the way vicuna's original dataset in their format would learn to be

from alpaca-cot.

teknium1 avatar teknium1 commented on September 21, 2024 1

It would be wise, imo, to alter the vicuna pipeline being used to simply throw away the sequences that get split off, or perhaps, if needed, throw out all convos that are too long, maybe make a 2k ctx length version and a 4k one - since 4k llama models have started to appear (although are not working well at all rn, they will soon)

from alpaca-cot.

dumpmemory avatar dumpmemory commented on September 21, 2024

did u check the _context.json version ?

from alpaca-cot.

float-trip avatar float-trip commented on September 21, 2024

Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including this part, which chunks long conversations based on token count.

So rather than throwing out data after hitting the context window, we have a fair amount of chats in sharegpt_context.json that start in the middle of things with the first prompt being something like "[HM]: continue". Not sure if training on this is harmful or helpful.

from alpaca-cot.

teknium1 avatar teknium1 commented on September 21, 2024

I also think since a lot of datasets are doing this that it likely has something to do with the vicuna "random stopping" issues

from alpaca-cot.

dkqkxx avatar dkqkxx commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

from alpaca-cot.

teknium1 avatar teknium1 commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

Can you link any of those?

from alpaca-cot.

dkqkxx avatar dkqkxx commented on September 21, 2024

At present, there are some works to clean the ShareGPT dataset, and we will continue to pay attention to it.

Can you link any of those?

https://paratranz.cn/projects/6725
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

from alpaca-cot.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.