Comments (8)
Sorry, I found the dataset on your Hugging Face page.
I looked over it, though, and the dataset format might be a concern.
I may be ignorant, but if a model is trained on the ShareGPT dataset in Alpaca format, it may not learn to hold a coherent multi-turn conversation.
ex.
{
"instruction": "I have an interview about product speccing with the company Weekend Health. Give me an example of a question they might ask with regards about a new feature",
"input": "",
"output": "\"Can you walk us through the process you would use to gather requirements and create a specification for a new feature that our users have been requesting, such as a personalized nutrition plan feature?\""
},
{
"instruction": "Give me a few more",
"input": "",
"output": "* \"How would you approach creating a specification for a new feature that integrates with third-party fitness trackers and allows for seamless data tracking and analysis for our users?\"\n* \"Can you give an example of a time when you had to navigate competing stakeholder"
},
These two sequences will likely not be related to each other during training, which makes learning much more erratic than it would be with Vicuna's original dataset in its own format.
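To make the concern concrete, here is a minimal sketch of one way to keep follow-up turns tied to their conversation when flattening to Alpaca-style records. It is not the pipeline used by this repo or by Vicuna. The function name `conversation_to_records` is hypothetical, and the `[HM]:` / `[AI]:` history prefixes are an assumption about how earlier turns could be stored in `input`.

```python
from typing import Dict, List, Tuple


def conversation_to_records(turns: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Flatten (human, assistant) turn pairs into Alpaca-style records.

    Hypothetical helper: every record carries all earlier turns in `input`,
    so a follow-up such as "Give me a few more" is not trained in isolation.
    """
    records: List[Dict[str, str]] = []
    history: List[str] = []
    for human, assistant in turns:
        records.append(
            {
                "instruction": human,
                # Empty for the first turn, prior dialogue for later turns.
                "input": "\n".join(history),
                "output": assistant,
            }
        )
        # Assumed role prefixes; adjust to whatever the dataset really uses.
        history.append(f"[HM]: {human}")
        history.append(f"[AI]: {assistant}")
    return records
```

With the two records from the example, the second one would then carry the interview question and its answer in `input` instead of an empty string.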
from alpaca-cot.
It would be wise, imo, to alter the Vicuna pipeline being used so that it simply throws away the sequences that get split off, or, if needed, throws out all conversations that are too long. Maybe make a 2k context-length version and a 4k one, since 4k LLaMA models have started to appear (they are not working well at all right now, but they will soon).
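A rough sketch of the two options, under some assumptions: conversations are lists of turns with `from` and `value` keys (the usual ShareGPT layout, not verified against this repo's files), and tokens are counted by whitespace splitting as a stand-in for the real tokenizer. The function names are made up for illustration.

```python
from typing import Callable, Dict, List

Conversation = List[Dict[str, str]]  # assumed turn shape: {"from": ..., "value": ...}


def count_tokens_whitespace(text: str) -> int:
    """Stand-in token counter; swap in the model's tokenizer for real use."""
    return len(text.split())


def keep_first_chunk(
    conv: Conversation,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> Conversation:
    """Keep whole turns from the start until the budget is hit; discard the rest."""
    kept: Conversation = []
    used = 0
    for turn in conv:
        n = count_tokens(turn["value"])
        if used + n > max_tokens:
            break  # the split-off tail is thrown away, not emitted as a new sample
        kept.append(turn)
        used += n
    return kept


def drop_long_conversations(
    conversations: List[Conversation],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> List[Conversation]:
    """Discard any conversation whose total token count exceeds the budget."""
    return [
        conv
        for conv in conversations
        if sum(count_tokens(turn["value"]) for turn in conv) <= max_tokens
    ]
```

Running either function once with `max_tokens=2048` and once with `max_tokens=4096` would give the 2k and 4k variants.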
from alpaca-cot.
Did you check the _context.json version?
from alpaca-cot.
Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including this part, which chunks long conversations based on token count.
So rather than throwing out data after hitting the context window, we have a fair amount of chats in sharegpt_context.json that start in the middle of things with the first prompt being something like "[HM]: continue". Not sure if training on this is harmful or helpful.
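For anyone who wants to measure how common this is, here is a small sketch that flags first prompts that look like a bare continuation request. The phrase list and the function name are assumptions, not something taken from the dataset tooling, and the `[HM]:` prefix is handled only because it shows up in the example above.

```python
import re

# Matches prompts that consist only of a continuation request,
# optionally prefixed with "[HM]:". The phrase list is a guess.
_CONTINUE_ONLY = re.compile(
    r"^(?:\[HM\]:\s*)?(?:please\s+)?(?:continue|go on|keep going)\b[\s.!]*$",
    re.IGNORECASE,
)


def looks_like_continuation(first_prompt: str) -> bool:
    """Return True if the opening prompt is just a 'continue'-style request."""
    return _CONTINUE_ONLY.match(first_prompt.strip()) is not None
```

Counting how many chats in the file satisfy `looks_like_continuation` on their first prompt would at least show how much of the data starts mid-conversation.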
from alpaca-cot.
I also think, since a lot of datasets are doing this, that it likely has something to do with the Vicuna "random stopping" issues.
from alpaca-cot.
At present, there are some efforts to clean the ShareGPT dataset, and we will continue to follow them.
from alpaca-cot.
Can you link any of those?
from alpaca-cot.
https://paratranz.cn/projects/6725
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
from alpaca-cot.