
Comments (6)

dwzhu-pku avatar dwzhu-pku commented on August 25, 2024 1

Hi @abacaj! Thank you for raising this interesting point!
It's really worth noting. I will look into this in a few days.

from pose.

abacaj avatar abacaj commented on August 25, 2024 1

Thanks! I did a few more runs, and using better data really seems to help (especially for much longer contexts).


dwzhu-pku avatar dwzhu-pku commented on August 25, 2024 1

Hello @abacaj,

Thank you once again for your interest in our project. I appreciate your suggestion and have run the LongChat topic-retrieval tests. I am pleased to share my findings with you.

1) Comparison between LLaMA-7B-PoSE-YaRN-128k and NousResearch/Yarn-Llama-2-7b-128k

As outlined in our paper, our LLaMA-7B-PoSE-YaRN-128k model is based on LLaMA-1 and was trained for 1,000 steps with a context length of 2k and a global batch size of 64. NousResearch/Yarn-Llama-2-7b-128k, on the other hand, is based on Llama-2 and was trained for 600 steps with a context length of 64k and the same global batch size. Our model is therefore relatively under-trained: we released it primarily to showcase PoSE's potential for extending the context length dramatically, so given the limited training, its suboptimal performance is understandable. We anticipate that increasing the training context length, for example using a training context length of 16k to achieve a window expansion to 128k, could yield significant performance improvements.
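For readers unfamiliar with the method, the core trick that lets PoSE train on a 2k window while targeting 128k is manipulating position ids rather than feeding real long sequences. Below is a minimal, illustrative sketch of that idea; it is not the repo's actual implementation, and splitting into exactly two chunks with a single uniform skip is a simplifying assumption:

```python
import random

def pose_position_ids(train_len=2048, target_len=128 * 1024):
    """Illustrative PoSE-style position ids: split the short training
    window into two chunks and shift the second chunk by a random skip
    so the ids sample the full target position range."""
    split = random.randint(1, train_len - 1)   # random chunk boundary
    max_skip = target_len - train_len          # room left for the skip
    skip = random.randint(0, max_skip)         # random skip bias
    first = list(range(split))                 # chunk 1: positions 0..split-1
    second = [split + skip + i for i in range(train_len - split)]
    return first + second

ids = pose_position_ids()
```

The ids stay strictly increasing and never exceed the target window, so the model sees relative distances up to ~128k while attending over only 2k tokens per step.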

2) Why later checkpoints like step-1000 perform much worse than earlier checkpoints

I have also encountered a similar issue. When using LongChat's eval script for topic-retrieval evaluation, I found that the PoSE-based model outperformed the fully fine-tuned model, which performed poorly. Upon closer examination of the latter's responses, I found that most predictions began in a specific way (I set max new tokens to 150 here):

Below is a record of our previous conversation on 5 different topics. You are the USER, and I am the ASSISTANT. At the beginning of each topic, the USER will say 'I would like to discuss the topic of <TOPIC>'. Memorize each <TOPIC>. At the end of the record, I will ask you to retrieve the first topic. Now the record start. USER: I would like to discuss the topic of ...

This is exactly the beginning of each prompt in evaluation/topics/testcases. It seems the fully fine-tuned model does not know to return the first topic directly; instead, it echoes the entire starting content. This makes it impossible for ChatGPT in auto_topic_eval to judge the answer as correct, even though it does contain the first topic. When I inserted a simple piece of logic to extract the topic from that content, I found that the fully fine-tuned model's predictions were actually quite accurate.
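A minimal sketch of such extraction logic, assuming the echoed prompt follows the template quoted above (`extract_first_topic` is a hypothetical helper, not code from the eval script):

```python
import re

def extract_first_topic(prediction: str) -> str:
    """If the model echoes the prompt instead of answering, pull the
    first <TOPIC> out of the echoed record; otherwise return the
    prediction unchanged. (Hypothetical helper for illustration.)"""
    m = re.search(r"I would like to discuss the topic of (.+?)[.\n]", prediction)
    return m.group(1).strip() if m else prediction.strip()
```

Running the extracted topic, rather than the raw echo, through the judge lets the fully fine-tuned model's correct retrievals be counted.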

Given that the PoSE-trained model did not exhibit this issue, one possible explanation is that as the number of training tokens from the dataset I used increased, the model increasingly tended to output "Below is a record of xxx" rather than directly providing the first topic. This aligns with your conclusion that "better data can improve the model's performance in topic retrieval," and I believe my findings offer a plausible explanation for this interesting phenomenon.

3) Longchat topic retrieval vs passkey retrieval

Thank you very much for introducing me to the LongChat topic retrieval task. It presents a greater challenge than passkey retrieval and is a better measure of a model's long-context capabilities. However, I found that it is not perfect either: its output is difficult to parse, it relies heavily on ChatGPT, and it is less convenient to run. Passkey retrieval is more user-friendly in this respect. In future work, I will consider incorporating both evaluation methods.
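To illustrate why passkey retrieval is easier to score: the answer is a literal string, so a simple substring check replaces the ChatGPT judge. A toy sketch, where the prompt wording is illustrative rather than the paper's exact template:

```python
import random

def make_passkey_prompt(n_filler=200, passkey=None):
    """Toy passkey-retrieval example: hide a number in filler text.
    The answer can be checked by exact string match, no LLM judge needed."""
    passkey = passkey or str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. " * n_filler
    prompt = (f"{filler}The pass key is {passkey}. Remember it. "
              f"{filler}What is the pass key?")
    return prompt, passkey

def score(prediction: str, passkey: str) -> bool:
    """Deterministic scoring: did the model reproduce the passkey?"""
    return passkey in prediction
```

The trade-off is that this only tests verbatim recall of a short token span, whereas topic retrieval requires understanding conversational structure.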

I hope this information is helpful 🙂


abacaj avatar abacaj commented on August 25, 2024 1

Appreciate the reply! Will keep experimenting with PoSE (it is a much cheaper method to extend context).


abacaj avatar abacaj commented on August 25, 2024

I verified the LongChat benchmark against NousResearch/Yarn-Llama-2-7b-128k.
That model scores 74% on the same test (6k~> 10_topics.jsonl).



abacaj avatar abacaj commented on August 25, 2024

I trained TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T using your code for PoSE and noticed something strange, or maybe interesting: later checkpoints like step-1000 perform much worse than earlier checkpoints.

Running the same test against the step-100 checkpoint gives me 72%; running it against step-1000 gives me 56%.

