[BUG] When the model's forward function receives an attention_mask, if attention_mask[i, 0]==0, the logits output for sequence i are all NaN (about qwen, CLOSED)

leileqiTHU commented on August 26, 2024
[BUG] When the model's forward function receives an attention_mask, if attention_mask[i, 0]==0, the logits output for sequence i are all NaN

from qwen.

Comments (6)

jklj077 commented on August 26, 2024

Hi, I'm not sure I understand your use case. May I know what results you were expecting? You have literally prevented the initial position from attending to itself, so it should be expected that the model does not know what the next token would be.

leileqiTHU commented on August 26, 2024

Yeah, sorry, I may not have made it clear.

I was trying to call model.forward directly rather than model.generate, in order to observe the model's behavior in the forward pass.
My inputs have different lengths, so I have to pad them to the same length. I used left padding, prepending <endoftext> pad tokens. In my opinion, those pad tokens should not be attended to, and this is what attention_mask is for: setting those positions to 0 so that the model won't attend to the pad tokens in the forward pass.
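The left-padding setup described above can be sketched in plain Python. This is only an illustration: `PAD_ID`, `left_pad`, and the token ids are hypothetical placeholders, not values taken from the Qwen tokenizer.

```python
# Hypothetical pad id standing in for the <endoftext> token (an assumption).
PAD_ID = 0


def left_pad(batch, pad_id=PAD_ID):
    """Left-pad variable-length id lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Pad tokens go on the left; real tokens keep their order on the right.
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks a pad position, 1 marks a real token.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


input_ids, attention_mask = left_pad([[11, 12, 13], [21]])
print(input_ids)       # [[11, 12, 13], [0, 0, 21]]
print(attention_mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, any sequence shorter than the longest one has attention_mask[i][0] == 0, which is the condition reported in this issue.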
However, I got all-NaN logits, which confused me. When I did not pass the attention_mask parameter, there were no NaN values in the logits, which is what I expected. So I inferred that the problem might lie in the attention_mask. To narrow it down, I tried different attention masks and finally found that if the first position is set to 0 (in which case the model won't attend to the first token, which is a pad token), the logits returned by model.forward are all NaN.
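One way a mask like this can lead to NaN, shown as a framework-free sketch: if the causal mask and the padding mask together leave a query with no visible key, and masked scores are set to negative infinity, the softmax over that row is undefined. The helper names `softmax` and `masked_row` are hypothetical, and this is not taken from the Qwen source; implementations that use a large finite negative value instead of negative infinity behave differently.

```python
import math

NEG_INF = float("-inf")


def softmax(scores):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_row(raw_scores, query_pos, attention_mask):
    """Keep a score only if the key is causal-visible and not padding."""
    return [
        s if (k <= query_pos and attention_mask[k] == 1) else NEG_INF
        for k, s in enumerate(raw_scores)
    ]


attention_mask = [0, 1, 1]   # position 0 is a left-pad token
raw = [0.3, 0.1, -0.2]       # arbitrary example scores

# Query 0 can only see key 0 (causal), and key 0 is padding: every score is -inf.
row0 = softmax(masked_row(raw, 0, attention_mask))
# Query 1 can see keys 0 and 1; key 0 is padding, so only key 1 remains.
row1 = softmax(masked_row(raw, 1, attention_mask))

print(row0)  # [nan, nan, nan]
print(row1)  # [0.0, 1.0, 0.0]
```

In this sketch only the pad position's own row is NaN; the row for the first real token is well defined.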

Also, I tried the Qwen1.5-7B-Chat model, and it does not have this problem: even if I set the attention_mask of the first position to 0, the output is still free of NaN values. So I suspect that this may be a problem with Qwen-7B-Chat.

But I may also be making a mistake; please let me know if I am.

leileqiTHU commented on August 26, 2024

And if the masked tokens in the left positions cannot know what the next token should be because they are prevented from attending to themselves, why are the logits of the other, unmasked positions (the right positions) also NaN? Did I get something wrong?
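A possible explanation for the question above, offered as an assumption that nobody in this thread confirms: if a pad position's hidden state has already become NaN in one layer, a later layer that assigns it an attention weight of exactly 0 still computes 0 * NaN, which is NaN under IEEE 754 arithmetic. The function `attend` below is a hypothetical one-dimensional stand-in for the weighted sum over value vectors.

```python
import math


def attend(weights, values):
    """Weighted sum of values, as in the last step of attention (1-D stand-in)."""
    return sum(w * v for w, v in zip(weights, values))


# The pad position's value is NaN; the real token gives it zero weight.
contaminated = attend([0.0, 1.0], [float("nan"), 2.0])
# Same weights, but the pad position's value is finite.
clean = attend([0.0, 1.0], [5.0, 2.0])

print(math.isnan(contaminated))  # True
print(clean)                     # 2.0
```

Under this assumption, a zero attention weight does not shield an unmasked position from a NaN value, so NaN could spread to every position in the sequence after one more layer.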

jklj077 commented on August 26, 2024

Hi, after reading through your comments, and if I understood correctly, Qwen1.5 was working as you would expect. I would suggest just using Qwen1.5.

P.S.: Investigating the original issue is more complicated than it appeared. Was flash attention enabled? Were you following the instructions in README to do batch inference?

cageyoko commented on August 26, 2024

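I ran into the same problem:

  1. Without flash-attn, batch size = 1 gives normal results; with batch size > 1, samples that contain padding produce NaN outputs after passing through the model.
  2. Installing flash-attn fixed it.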

jklj077 commented on August 26, 2024

Hi, Qwen1.0 models and code will not be updated anymore. Please try Qwen2.0 instead.
