
Comments (8)

fyabc commented on August 25, 2024

In the current training of the Qwen-7B-chat model, the following label-mask strategy is used:

  1. Special tokens such as <|im_start|> and <|im_end|> are not masked;
  2. The system content and the content of each query turn are masked;
  3. The role strings in each turn ("system\n", "user\n", "assistant\n") are masked.

Therefore, for the example above, the masked tokens are:
001 - 002 (system role)
003 - 004 (system content)
008 - 009 (query role)
010 - 013 (query content)
017 - 018 (answer role)
026 - 027 (query role)
028 - 031 (query content)
035 - 036 (answer role)

from qwen.
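The strategy above can be sketched in plain Python: segment the tokenized conversation, keep the labels for spans that are trained on, and set masked positions to -100 (the index that PyTorch's CrossEntropyLoss ignores by default). The token ids and the `build_labels` helper below are hypothetical illustrations, not Qwen's actual training code.

```python
# Hedged sketch of ChatML-style SFT label masking: loss is kept on
# <|im_start|>/<|im_end|> and the assistant answer, and suppressed on
# role strings and system/query content by writing IGNORE_INDEX.
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_labels(segments):
    """segments: list of (token_ids, train_on) pairs.
    Returns (input_ids, labels); masked positions get IGNORE_INDEX."""
    input_ids, labels = [], []
    for token_ids, train_on in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if train_on else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy ids (hypothetical): 1 = <|im_start|>, 2 = <|im_end|>, rest are content.
segments = [
    ([1], True),        # <|im_start|>      special token: loss kept
    ([10], False),      # "system\n"        role string: masked
    ([11, 12], False),  # system content    masked
    ([2], True),        # <|im_end|>
    ([1], True),        # <|im_start|>
    ([13], False),      # "user\n"          masked
    ([14, 15], False),  # query content     masked
    ([2], True),        # <|im_end|>
    ([1], True),        # <|im_start|>
    ([16], False),      # "assistant\n"     masked
    ([17, 18], True),   # assistant answer: loss kept
    ([2], True),        # <|im_end|>
]
input_ids, labels = build_labels(segments)
```

Only the assistant answer and the special tokens contribute to the loss; everything else is present in `input_ids` (so the model conditions on it) but carries no gradient.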

logicwong commented on August 25, 2024

@Sanster Note that in the code, FlashSelfAttention's causal flag defaults to True, so the valid tokens earlier in the sequence do not attend to the padding tokens that follow them; the padding therefore does not affect the results.

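This point can be checked numerically: with a causal mask, the attention output at position i depends only on positions j <= i, so changing the right-padding region cannot change the outputs of the valid prefix, even without an explicit padding attention_mask. A minimal NumPy sketch (not the FlashAttention implementation):

```python
import numpy as np

def causal_attention(q, k, v):
    # Scaled dot-product attention with a causal (lower-triangular) mask:
    # position i may only attend to positions j <= i.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly-future mask
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d, valid = 6, 4, 3            # 6 positions; the last 3 act as "padding"
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
out_a = causal_attention(q, k, v)

# Overwrite the padding region with completely different values.
k2, v2 = k.copy(), v.copy()
k2[valid:] = rng.normal(size=(T - valid, d))
v2[valid:] = rng.normal(size=(T - valid, d))
out_b = causal_attention(q, k2, v2)

# Outputs at the valid (non-padding) positions are unchanged.
assert np.allclose(out_a[:valid], out_b[:valid])
```

The only remaining effect of unmasked right-padding is on the padding positions' own outputs, which the label mask excludes from the loss anyway.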

Sanster commented on August 25, 2024

Thanks for the reply. One more question, in case you have run experiments on it: training uses flash attention without an attention_mask (code line; this is normal, since apart from MPT it seems none of the open-source models trained with flash attention use an attention_mask). In SFT training, queries of different lengths are batched together with padding. Does leaving that padding without an attention_mask have much impact on quality?


meta-tabchen commented on August 25, 2024

In the current training of the Qwen-7B-chat model, the following label-mask strategy is used:

  1. Special tokens such as <|im_start|> and <|im_end|> are not masked;
  2. The system content and the content of each query turn are masked;
  3. The role strings in each turn ("system\n", "user\n", "assistant\n") are masked.

Therefore, for the example above, the masked tokens are: 001 - 002 (system role) 003 - 004 (system content) 008 - 009 (query role) 010 - 013 (query content) 017 - 018 (answer role) 026 - 027 (query role) 028 - 031 (query content) 035 - 036 (answer role)

One question: I notice that the query is also masked here, which I take to mean the query loss is optimized as well. Is there a particular consideration behind that? Does it improve over vicuna's approach (which does not optimize the query loss)?


fyabc commented on August 25, 2024

@meta-tabchen "Masking" a span here means not optimizing the loss on that span, so Qwen does not optimize the loss on the query part; this is the same as vicuna.


meta-tabchen commented on August 25, 2024

@meta-tabchen "Masking" a span here means not optimizing the loss on that span, so Qwen does not optimize the loss on the query part; this is the same as vicuna.

Got it; sorry, I misunderstood.


xyfZzz commented on August 25, 2024
<|im_start|>assistant

Why is <|im_start|> not masked? That is, why does the loss on <|im_start|> need to be optimized? My understanding is that at inference time, <|im_start|>assistant is appended to the input and the model predicts what follows, so the model should never need to predict the <|im_start|> token itself.


cizhenshi commented on August 25, 2024
<|im_start|>assistant

Why is <|im_start|> not masked? That is, why does the loss on <|im_start|> need to be optimized? My understanding is that at inference time, <|im_start|>assistant is appended to the input and the model predicts what follows, so the model should never need to predict the <|im_start|> token itself.

Same question.

