
Training process about pose HOT 7 CLOSED

eyuansu62 commented on August 25, 2024
Training process

from pose.

Comments (7)

eyuansu62 commented on August 25, 2024

Thanks for the reply. So it is actually done by directly modifying the position indices. Then it can be viewed as paying only the training cost of a 2048 window while genuinely exposing the model to positions across the 128K range. Is that the right understanding?


dwzhu-pku commented on August 25, 2024

Yes.


eyuansu62 commented on August 25, 2024

Very nice work, and the code is well written too!


wasiahmad commented on August 25, 2024

Thank you @dwzhu-pku for clarification.


dwzhu-pku commented on August 25, 2024

Hello, and thank you for the question! (The original exchange was in Chinese; for the benefit of other readers, both the question and the reply are given in English here.)

Question: Hello! I understand that if you want to extend to a length of 4096, you would split the 4096-length example into two 2048-length examples, effectively creating two examples (of course, with some terms to link the two examples). However, during training, each individual sample is still 2048 tokens long and has never truly seen 4096. How do you ensure it can extend to an unlimited length?

Reply:

Our approach does not split a 4096-length example into two 2048-length examples for training. Intuitively, when a model receives inputs longer than its original context window, two things change: first, positions appear that were never encountered during pretraining; second, the attention mechanism has to handle more tokens. Our current strategy mainly addresses the first issue, i.e., adapting the model to these new positions. We do this by modifying the tokens' position encodings within a fixed context window, thereby simulating long inputs.

For example, to extend a RoPE-based model from 2048 to 4096, we want it to adapt to the relative positions 0 through 4095. For the first training example, we divide the 2048 context window into 512 / 1536: the position encodings of the first part are $[0, 511]$, and those of the second part are changed to $[2560, 4095]$. The relative position range covered by this example is therefore $[0, 511] \cup [2049, 4095]$. For the second example, we divide the 2048 context window into 1024 / 1024: the position encodings of the first part are $[0, 1023]$, and those of the second part are changed to $[1537, 2560]$. The relative position range covered by the second example is $[0, 2560]$.

Regarding your phrase "ensuring extension to unlimited length": that claim is somewhat vague. In reality, we cannot prove that the model extends to infinite length without any issues; we can only demonstrate, through a series of experiments, that certain capabilities are well preserved. Specifically, our current results show good performance on both language modeling and passkey retrieval at context lengths around 32k, which indicates that at this length the model retains solid language-modeling ability and can attend effectively to every token in the window. At longer lengths such as 128k, the language modeling task shows that language-modeling ability remains largely intact, and results on standard benchmarks show no significant degradation on ordinary natural language understanding tasks.
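The position-id manipulation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; `pose_position_ids` and its parameters are hypothetical names.

```python
import random

def pose_position_ids(window=2048, target=4096, split=None, bias=None):
    """Position ids for one PoSE-style training example: the fixed window
    is cut into two chunks, and the second chunk's ids are shifted by a
    skipping bias so that all ids stay within [0, target)."""
    if split is None:
        split = random.randint(1, window - 1)      # length of the first chunk
    if bias is None:
        bias = random.randint(0, target - window)  # size of the skip
    first = list(range(split))                     # chunk 1 keeps ids [0, split)
    second = [split + bias + i for i in range(window - split)]
    return first + second

# The reply's first example: split 512 / 1536 with a skip of 2048,
# giving ids [0, 511] for chunk 1 and [2560, 4095] for chunk 2.
ids = pose_position_ids(split=512, bias=2048)
assert ids[511] == 511 and ids[512] == 2560 and ids[-1] == 4095
```

These ids would then be fed to the model in place of the usual contiguous `0..2047` positions, while the input tokens themselves still fill only a 2048-token window.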


wasiahmad commented on August 25, 2024

@dwzhu-pku I have a couple of questions.

For the first training example, we divide the 2048 context window into 512 / 1536. For the second example, we divide the 2048 context window into 1024 / 1024.

How do you make those decisions? In other words, when you decide to split the sequence into N chunks, how do you determine the lengths?

This way, the relative position range covered by the first training example is [0, 511] U [2049, 4095].

You mentioned "The position encodings for the first part are [0, 511], and for the second part, we change them to [2560, 4095]". Then how does it cover the position range [2049, 4095]? Do you refer to min and max relative distances (2560-511=2049 and 4095-0=4095) between tokens given the two parts?


dwzhu-pku commented on August 25, 2024

hi @wasiahmad, thank you for looking at our work!

Q1: In other words, when you decide to split the sequence into N chunks, how do you determine the lengths?

We randomly divide the original context window into $N$ chunks. Take $N = 2$ for example: the length $l_0$ of chunk $c_0$ is randomly sampled from $[1, 2047]$, and the length $l_1$ of chunk $c_1$ is $2048 - l_0$. Because this paper is just a pilot attempt at decoupling the training length from the target length, we have only experimented with this simple strategy, to verify the effectiveness of our positional skip-wise training. However, I do believe that designing more sophisticated strategies beyond random sampling, both for the bias terms and for the chunk lengths, could further improve training efficiency.
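The random split can be sketched as follows (a hypothetical helper, not the repository's code), generalized to $N$ chunks by cutting the window at $N-1$ random points:

```python
import random

def random_chunk_lengths(window=2048, n_chunks=2):
    """Cut `window` at n_chunks - 1 distinct random points, yielding
    n_chunks positive lengths that sum to `window`."""
    cuts = sorted(random.sample(range(1, window), n_chunks - 1))
    bounds = [0] + cuts + [window]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# N = 2 reproduces the scheme above: l0 ~ U[1, 2047], l1 = 2048 - l0.
l0, l1 = random_chunk_lengths(2048, 2)
assert 1 <= l0 <= 2047 and l0 + l1 == 2048
```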

Q2: You mentioned "The position encodings for the first part are [0, 511], and for the second part, we change them to [2560, 4095]". Then how does it cover the position range [2049, 4095]? Do you refer to min and max relative distances (2560-511=2049 and 4095-0=4095) between tokens given the two parts?

Yes, exactly. We focus on relative position because it is what RoPE intrinsically encodes :-)
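The min/max computation above can be checked by brute force. This is illustrative only, enumerating every (query, key) pair across the two chunks of the first example:

```python
# First example from the thread: chunk 1 has position ids [0, 511],
# chunk 2 has position ids [2560, 4095].
first = range(0, 512)
second = range(2560, 4096)

# Relative distances between a query in chunk 2 and a key in chunk 1.
cross = {q - k for q in second for k in first}
assert min(cross) == 2560 - 511 == 2049
assert max(cross) == 4095 - 0 == 4095
assert cross == set(range(2049, 4096))  # every distance in between occurs too
```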

