
Comments (6)

dwzhu-pku avatar dwzhu-pku commented on August 25, 2024 1

Hi @abacaj! Thank you for raising this interesting point!
It's really worth noting. I will look into this in a few days.

from pose.

abacaj avatar abacaj commented on August 25, 2024 1

Thanks! I did a few more runs, and using better data really seems to help (especially for much longer contexts).


dwzhu-pku avatar dwzhu-pku commented on August 25, 2024 1

Hello @abacaj,

Thank you once again for your interest in our project. I appreciate your suggestion and have run the LongChat topic-retrieval tests. I am pleased to share my findings with you.

1) Comparison between LLaMA-7B-PoSE-YaRN-128k and NousResearch/Yarn-Llama-2-7b-128k

As outlined in our paper, our LLaMA-7B-PoSE-YaRN-128k model is based on LLaMA-1 and was trained for 1,000 steps with a context length of 2k and a global batch size of 64. NousResearch/Yarn-Llama-2-7b-128k, on the other hand, is based on Llama-2 and was trained for 600 steps with a context length of 64k and the same global batch size. Our model is therefore relatively under-trained: we released it primarily to showcase PoSE's potential for extending the context length dramatically, so given the limited training, its suboptimal performance is understandable. We anticipate that increasing the training context length, for example using a training context length of 16k to achieve a window expansion to 128k, could yield significant performance improvements.
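For readers unfamiliar with the method, the core trick that lets PoSE train on a 2k window while targeting 128k is manipulating position ids rather than feeding real long sequences. Below is a minimal, illustrative sketch of that idea; it is not the repo's actual implementation, and splitting into exactly two chunks with a single uniform skip is a simplifying assumption:

```python
import random

def pose_position_ids(train_len=2048, target_len=128 * 1024):
    """Illustrative PoSE-style position ids: split the short training
    window into two chunks and shift the second chunk by a random skip
    so the ids sample the full target position range."""
    split = random.randint(1, train_len - 1)   # random chunk boundary
    max_skip = target_len - train_len          # room left for the skip
    skip = random.randint(0, max_skip)         # random skip bias
    first = list(range(split))                 # chunk 1: positions 0..split-1
    second = [split + skip + i for i in range(train_len - split)]
    return first + second

ids = pose_position_ids()
```

The ids stay strictly increasing and never exceed the target window, so the model sees relative distances up to ~128k while attending over only 2k tokens per step.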

2) Why later checkpoints like step-1000 perform much worse than earlier checkpoints

I have also encountered a similar issue. When using LongChat's eval script for topic-retrieval evaluation, I found that the PoSE-based model outperformed the fully fine-tuned model, which performed poorly. Upon closer examination of the latter's responses, I found that most predictions began in a specific way (I set max new tokens to 150 here):

Below is a record of our previous conversation on 5 different topics. You are the USER, and I am the ASSISTANT. At the beginning of each topic, the USER will say 'I would like to discuss the topic of <TOPIC>'. Memorize each <TOPIC>. At the end of the record, I will ask you to retrieve the first topic. Now the record start. USER: I would like to discuss the topic of ...

This is exactly the beginning of each prompt in evaluation/topics/testcases. It seems the fully fine-tuned model does not know to return the first topic directly; instead, it echoes the entire starting content. This makes it impossible for ChatGPT in auto_topic_eval to judge the answer as correct, even though it does contain the first topic. When I inserted a simple piece of logic to extract the topic from that content, I found that the fully fine-tuned model's predictions were actually quite accurate.
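A minimal sketch of such extraction logic, assuming the echoed prompt follows the template quoted above (`extract_first_topic` is a hypothetical helper, not code from the eval script):

```python
import re

def extract_first_topic(prediction: str) -> str:
    """If the model echoes the prompt instead of answering, pull the
    first <TOPIC> out of the echoed record; otherwise return the
    prediction unchanged. (Hypothetical helper for illustration.)"""
    m = re.search(r"I would like to discuss the topic of (.+?)[.\n]", prediction)
    return m.group(1).strip() if m else prediction.strip()
```

Running the extracted topic, rather than the raw echo, through the judge lets the fully fine-tuned model's correct retrievals be counted.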

Given that the PoSE-trained model did not exhibit this issue, one possible explanation is that as the number of training tokens from the dataset I used increased, the model increasingly tended to output "Below is a record of xxx" rather than directly providing the first topic. This aligns with your conclusion that "better data can improve the model's performance in topic retrieval," and I believe my findings offer a plausible explanation for this interesting phenomenon.

3) Longchat topic retrieval vs passkey retrieval

Thank you very much for introducing me to the LongChat topic retrieval task. It presents a greater challenge than passkey retrieval and is a better measure of a model's long-context capabilities. However, I found that it is not perfect either: its output is difficult to parse, it relies heavily on ChatGPT, and it is less convenient to run. Passkey retrieval is more user-friendly in this respect. In future work, I will consider incorporating both evaluation methods.
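To illustrate why passkey retrieval is easier to score: the answer is a literal string, so a simple substring check replaces the ChatGPT judge. A toy sketch, where the prompt wording is illustrative rather than the paper's exact template:

```python
import random

def make_passkey_prompt(n_filler=200, passkey=None):
    """Toy passkey-retrieval example: hide a number in filler text.
    The answer can be checked by exact string match, no LLM judge needed."""
    passkey = passkey or str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. " * n_filler
    prompt = (f"{filler}The pass key is {passkey}. Remember it. "
              f"{filler}What is the pass key?")
    return prompt, passkey

def score(prediction: str, passkey: str) -> bool:
    """Deterministic scoring: did the model reproduce the passkey?"""
    return passkey in prediction
```

The trade-off is that this only tests verbatim recall of a short token span, whereas topic retrieval requires understanding conversational structure.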

I hope this information is helpful 🙂


abacaj avatar abacaj commented on August 25, 2024 1

Appreciate the reply! Will keep experimenting with PoSE (it is a much cheaper method to extend context).


abacaj avatar abacaj commented on August 25, 2024

I verified the LongChat benchmark against NousResearch/Yarn-Llama-2-7b-128k.
That model scores 74% on the same test (6k~> 10_topics.jsonl).



abacaj avatar abacaj commented on August 25, 2024

I trained TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T using your code for PoSE and noticed something strange, or maybe interesting: later checkpoints like step-1000 perform much worse than earlier checkpoints.

Running the same test against the step-100 checkpoint gives me 72%; running it against step-1000 gives me 56%.

