Traceback (most recent call last): File "finetune.py", line 93, in main()

finetune raises an error when --per_device_train_batch_size is greater than 1, about mymusise/chatglm-tuning

Comments (15)

zhangzuizui commented on July 17, 2024 6

In tokenize_dataset_rows.py, add padding="max_length" to tokenizer.encode(), then rerun the preprocessing.

Once the lengths are padded to match, this error appears when batch_size > 1:

...
...
File "/opt/tiger/ChatGLM-Tuning/modeling_chatglm.py", line 268, in attention_fn
    attention_scores.masked_fill_(attention_mask, -10000.0)
RuntimeError: The expanded size of the tensor (512) must match the existing size (2) at non-singleton dimension 2.  Target sizes: [2, 32, 512, 512].  Tensor sizes: [2, 512]

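I also think there is one point that makes the current code incorrect, and it is the line in my traceback, modeling_chatglm.py line 268. In the current data processing, attention_mask is set to True everywhere, so once attention_scores has been computed, every score is filled with -10000.0. That is clearly wrong and means the model is effectively trained for nothing. I do not know why, with this setting and batch_size == 1, a loss can still be computed and the model still gradually converges. Still, based on my past few days of experience fine-tuning chatglm, I believe training under this setting is flawed.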
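One plausible reason a loss can still be computed with an all-True mask: when every score is replaced by the same constant, softmax does not fail, it just returns uniform weights, so each token averages over all positions. The sketch below only illustrates that arithmetic in plain Python; the `softmax` helper and the sample scores are made up for the illustration and are not taken from modeling_chatglm.py.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


MASK_FILL = -10000.0  # the fill value that appears in the traceback

raw_scores = [2.0, -1.0, 0.5, 3.0]  # made-up attention scores for one query

# Every position masked: all scores collapse to the same constant,
# so the attention weights become uniform instead of invalid.
all_masked = [MASK_FILL for _ in raw_scores]
uniform = softmax(all_masked)

# Only the last position masked: it gets (numerically) zero weight
# and the remaining positions keep their relative ordering.
partly_masked = raw_scores[:3] + [MASK_FILL]
weights = softmax(partly_masked)
```

With uniform weights the model can no longer choose what to attend to, which is consistent with the suspicion that training under this setting is flawed even though it runs.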

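Some references that I think help with this problem:

The transformers bloom source: bloom performs the masked_fill on attention_scores at this line.
How bloom's attention_mask is prepared: typically the tokens of the prompts are set to 0 and the tokens of the targets are set to 1; see the peft causal LM example.
bloom takes the attention_mask passed into the model, which has the same shape as input_ids and holds 0/1 values, and processes it further into the shape shown in my traceback and into bool dtype (see the corresponding source). In the resulting attention_mask, everything outside the targets is True, so it is masked out of attention_scores. The targets part becomes a matrix whose lower-left triangle (including the diagonal) is False and whose upper-right is True, so that each target token can only "see" the tokens before it.
See glm's tokenization source: it does the same thing I just described bloom doing inside the model. The difference is that glm builds the final attention_mask up front, while bloom defers part of the processing to forward.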
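To make the shape requirement concrete, here is a minimal PyTorch sketch of a mask in the layout the traceback expects, [batch, 1, seq_len, seq_len] with True meaning "blocked": prompt tokens are visible from every position and target tokens are causal. The function name `build_prefix_causal_mask` and its `context_lengths` argument are hypothetical, padding is ignored for brevity, and this is not the code from bloom or glm.

```python
import torch


def build_prefix_causal_mask(context_lengths, seq_len):
    """Return a bool mask of shape [batch, 1, seq_len, seq_len]; True = blocked."""
    masks = []
    for context_len in context_lengths:
        # Start from a causal (lower-triangular) pattern of allowed positions.
        allowed = torch.tril(torch.ones(seq_len, seq_len))
        # Every query may attend to the whole prompt (bidirectional prefix).
        allowed[:, :context_len] = 1
        masks.append(allowed < 0.5)
    # Singleton head dimension so the mask broadcasts over attention heads.
    return torch.stack(masks).unsqueeze(1)


batch, heads, seq_len = 2, 4, 6
mask = build_prefix_causal_mask(context_lengths=[3, 2], seq_len=seq_len)
scores = torch.zeros(batch, heads, seq_len, seq_len)
scores.masked_fill_(mask, -10000.0)  # broadcasts cleanly over the head dimension

# A flat [batch, seq_len] mask cannot be broadcast to the score tensor here,
# which is the same kind of failure as the RuntimeError in the traceback.
flat_mask = torch.ones(batch, seq_len, dtype=torch.bool)
flat_mask_failed = False
try:
    torch.zeros(batch, heads, seq_len, seq_len).masked_fill_(flat_mask, -10000.0)
except RuntimeError:
    flat_mask_failed = True
```

Because each sample gets its own context length, this layout does not depend on which sentence happens to come first in the batch.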

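I have tried the format described above and found that the model gets damaged by training (with the same lora parameters as yours): the generated text is no longer fluent, although you can tell that new knowledge has been injected.

Another problem I see in the current data processing: after encode, chatglm's tokenizer appends the token_id of [gMASK] and the token_id of bos to input_ids. In glm's setup, [gMASK] marks the place where the generated sentence is to be filled in. Concatenating the prompts and targets of an example beforehand and feeding the whole thing to the tokenizer is clearly incorrect, or at least leaves a gap from the model's pretraining setup.

I hope my reply helps your future work.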

from chatglm-tuning.

zhangzuizui commented on July 17, 2024 4

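The batch size > 1 problem has been updated: input_ids is now preprocessed. The earlier failure was because glm uses only the first sentence in the batch when generating position_ids (source), so even with padding added, if a later sentence is longer than the first one the shapes do not match.

So here I added a preprocessing step that puts the longest sentence first in the batch. I suspect this may not be the best approach, though; ping me if I have it wrong.

See this change for details: f7ba507#diff-a5ff12394959301eb25d323d44fc987a79336e66a44eb3e4def7ae3515f35430R35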
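A rough sketch of the "longest sentence first" idea, in plain Python. This is not the code from the linked commit: the function name `collate_longest_first`, the feature dict layout, and the choice of -100 for label padding are assumptions made for illustration.

```python
def collate_longest_first(features, pad_token_id, label_pad_id=-100):
    """Sort a batch by input length (longest first) and right-pad every row.

    Each feature is assumed to be a dict with "input_ids" and "labels"
    lists of equal length.
    """
    ordered = sorted(features, key=lambda f: len(f["input_ids"]), reverse=True)
    max_len = len(ordered[0]["input_ids"])
    input_ids, labels = [], []
    for f in ordered:
        pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * pad)
        # Labels follow the same permutation and are padded with an ignore value.
        labels.append(f["labels"] + [label_pad_id] * pad)
    return {"input_ids": input_ids, "labels": labels}


features = [
    {"input_ids": [11, 12, 13], "labels": [-100, 12, 13]},
    {"input_ids": [21, 22, 23, 24, 25], "labels": [-100, -100, 23, 24, 25]},
    {"input_ids": [31], "labels": [31]},
]
batch = collate_longest_first(features, pad_token_id=0)
```

Sorting only guarantees that whatever is derived from the first row is long enough for the rest of the batch; it does not give each row its own mask or position ids.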

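I got this repo's approach running today. The results are roughly as expected, though the model repeats itself quite badly.

This author modified the mask-generation code in the original model and processes the data into the format I described above. However, the code uses a custom framework and is somewhat hard to read. Focus on get_masks and get_position_ids.
Also, I recommend making changes against the original implementation, and in particular keeping the staticmethod decorator on get_masks; otherwise problems arise when training with deepspeed with mixed precision set to bf16.

Here is my current processing code for reference; pair it with the mask and position_ids generation from the repo I mentioned above.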

import torch
from transformers import BatchEncoding


def preprocess_function(self, examples):
    inputs = examples[self.text_column]
    targets = examples[self.label_column]
    # Prompts keep the special tokens the tokenizer appends at the end.
    prompt_ids = self.tokenizer(
        inputs,
        max_length=self.max_input_length,
        truncation=True,
    ).input_ids
    # Targets are encoded without special tokens; one slot is reserved for eos.
    target_ids = self.tokenizer(
        targets,
        max_length=self.max_length - self.max_input_length - 1,
        truncation=True,
        add_special_tokens=False,
    ).input_ids
    input_ids = []
    labels = []
    for i in range(len(prompt_ids)):
        cur_input_ids = (
            prompt_ids[i] + target_ids[i] + [self.tokenizer.eos_token_id]
        )
        # Everything from the start through [gMASK] is -100 (ignored by the loss).
        cur_labels = (
            [-100] * (len(prompt_ids[i]) - 1)
            + [self.tokenizer.bos_token_id]
            + target_ids[i]
            + [self.tokenizer.eos_token_id]
        )
        assert len(cur_input_ids) == len(cur_labels)
        cur_input_ids += [self.tokenizer.pad_token_id] * (
            self.max_length - len(cur_input_ids)
        )
        # Pad labels with -100 rather than pad_token_id so that padded
        # positions do not contribute to the loss.
        cur_labels += [-100] * (self.max_length - len(cur_labels))
        input_ids.append(torch.tensor(cur_input_ids, dtype=torch.long))
        labels.append(torch.tensor(cur_labels, dtype=torch.long))

    model_inputs = {
        "input_ids": input_ids,
        "labels": labels
    }
    return BatchEncoding(model_inputs)

jzsues commented on July 17, 2024 2

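Hello, I have a question: in this code, is the last element of prompt_ids [gMASK]? Also, get_masks and get_position_ids are both computed from the first sentence in the batch. Is that correct when the batch size is greater than 1?

@sijeh With the get_masks and get_position_ids above, there is indeed a problem when the batch size is greater than 1. I have pushed an update; see the latest code: https://github.com/mymusise/ChatGLM-Tuning/blob/update_mask/finetune.py#L58

Looking at the GLM code, the context_length and seq_len used for the mask and for the positions should be different.

Below is some pseudocode. I ran it to check, and the generated position ids and attention mask should be correct. For reference.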

import torch


def get_attention_mask(tokenizer, input_ids):
    # input_ids: 1-D LongTensor holding a single sequence.
    seq = input_ids.tolist()
    # The context covers everything up to and including the bos token.
    context_len = seq.index(tokenizer.bos_token_id) + 1
    seq_len = len(seq)
    attention_mask = torch.ones((seq_len, seq_len))
    attention_mask.tril_()
    # Tokens before bos are visible from every position.
    attention_mask[..., :context_len - 1] = 1
    attention_mask.unsqueeze_(0)
    # True marks the positions that get masked out.
    attention_mask = (attention_mask < 0.5).bool()
    return attention_mask


def get_position_ids(tokenizer, input_ids, position_encoding_2d=True):
    # MASK and gMASK are assumed to be the integer token ids of [MASK] and
    # [gMASK], defined elsewhere.
    seq = input_ids.tolist()
    seq_len = seq.index(tokenizer.bos_token_id)  # number of tokens before bos
    context_len = len(seq)                       # full sequence length
    cond1 = torch.where(input_ids == MASK)[0]
    cond2 = torch.where(input_ids == gMASK)[0]

    if len(cond1) == 0 and len(cond2) == 0:
        raise ValueError('you have to add either [MASK] or [gMASK] in your input')
    mask_position = cond1[0] if len(cond1) > 0 else cond2[0]

    if position_encoding_2d:
        # First row: positions freeze at the mask position after the context.
        position_ids = torch.arange(context_len, dtype=torch.long)
        position_ids[seq_len:] = mask_position
        # Second row: 0 inside the context, then 1, 2, ... for generated tokens.
        block_position_ids = torch.cat((
            torch.zeros(seq_len, dtype=torch.long),
            torch.arange(context_len - seq_len, dtype=torch.long) + 1
        ))
        position_ids = torch.stack((position_ids, block_position_ids), dim=0)
    else:
        position_ids = torch.arange(context_len, dtype=torch.long)
        position_ids[context_len - 1:] = mask_position

    return position_ids


# Fragment of a dataset class method. IGNORE_INDEX is assumed to be -100,
# the default ignore_index of torch.nn.CrossEntropyLoss.
prompt = self.tokenizer.encode(
    self.pairs[index]['prompt'],
    max_length=self.max_seq_length,
    truncation=True,
    add_special_tokens=True,
)
completion = self.tokenizer.encode(
    self.pairs[index]['completion'],
    max_length=self.max_seq_length,
    truncation=True,
    add_special_tokens=False,
)

input_ids = prompt + completion + [self.tokenizer.eos_token_id]
labels = [IGNORE_INDEX] * (len(prompt) - 1) + [self.tokenizer.bos_token_id] + completion + [self.tokenizer.eos_token_id]
# Truncate to the maximum length (slicing leaves shorter lists unchanged).
input_ids = input_ids[:self.max_seq_length]
labels = labels[:self.max_seq_length]
input_ids = torch.tensor(input_ids, dtype=torch.long)
labels = torch.tensor(labels, dtype=torch.long)
attention_mask = get_attention_mask(self.tokenizer, input_ids)
position_ids = get_position_ids(self.tokenizer, input_ids, self.position_encoding_2d)

mymusise commented on July 17, 2024 1

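The batch size > 1 problem has been updated: input_ids is now preprocessed. The earlier failure was because glm uses only the first sentence in the batch when generating position_ids (source), so even with padding added, if a later sentence is longer than the first one the shapes do not match.

So here I added a preprocessing step that puts the longest sentence first in the batch. I suspect this may not be the best approach, though; ping me if I have it wrong.

See this change for details: f7ba507#diff-a5ff12394959301eb25d323d44fc987a79336e66a44eb3e4def7ae3515f35430R35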

mymusise commented on July 17, 2024 1

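Hello, I have a question: in this code, is the last element of prompt_ids [gMASK]? Also, get_masks and get_position_ids are both computed from the first sentence in the batch. Is that correct when the batch size is greater than 1?

@sijeh With the get_masks and get_position_ids above, there is indeed a problem when the batch size is greater than 1. I have pushed an update; see the latest code:
https://github.com/mymusise/ChatGLM-Tuning/blob/update_mask/finetune.py#L58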

mymusise commented on July 17, 2024 1

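I took a look; the gmask condition is reversed, it should be

@zhangzuizui Thanks for the review! I also have doubts about the gmask=True part.

Comparing against the use_gmask check in the original model code, the input sequences should all hit use_gmask=True (there is no MASK token). But the position_ids this produces really do differ from the original RoPE method, and I am not sure whether this is a customization that ChatGLM-6B made here.

Also, I just looked at the ChatGLM paper, and it seems their 130B model has already dropped 2D position_ids and uses only one dimension. I don't know which one 6B uses.

For reference, here is section B.3 of the original paper: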

A two-dimensional absolute position encoding method is proposed in vanilla GLM for modeling
both intra- and inter-span position information. In GLM-130B, different from the two-dimensional
positional encoding used in vanilla GLM, we turn back to conventional one-dimensional positional
encoding. However, we originally thought that two-dimensional form cannot be directly applied to
RoPE. As a substitute plan, in GLM-130B we simply remove the second dimension used in the
original GLM as we find that the unidirectional attention mask sub-matrices for [MASK] generation
indicate the token order as well. This observation results in our transforming GLM-130B’s positional
encoding into a one-dimensional one according to the following strategies:
• For sequences corrupted by short spans, we discard the second-dimensional position encoding.
• For sequences corrupted by a long span at the end, we change the positional ids to one-dimensional
0, 1, · · · , s-1, and generated tokens will just prolong the first-dimensional positional encoding
from the last context token s-1.

mymusise commented on July 17, 2024

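This is a known issue and it is on the TODO list. It will be fixed within the next couple of days, and I will let you know once it is done 🚀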

OedoSoldier commented on July 17, 2024

In tokenize_dataset_rows.py, add padding="max_length" to tokenizer.encode(), then rerun the preprocessing.

OedoSoldier commented on July 17, 2024

Looking at modeling_chatglm.py, you can see that the correct prompt format should be:

prompt = ""
for i, (old_query, response) in enumerate(history):
    prompt += "[Round {}]\n问：{}\n答：{}\n".format(i, old_query, response)
prompt += "[Round {}]\n问：{}\n答：".format(len(history), query)
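The fragment above can be wrapped into a small helper to see exactly what string it produces. `build_prompt` is a hypothetical name chosen for this sketch; the body is the same loop as in the snippet, and the sample query and history are made up.

```python
def build_prompt(query, history=()):
    """Build a ChatGLM-style multi-round prompt from a query and past turns."""
    prompt = ""
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n问：{}\n答：{}\n".format(i, old_query, response)
    prompt += "[Round {}]\n问：{}\n答：".format(len(history), query)
    return prompt


single_turn = build_prompt("What is LoRA?")
multi_turn = build_prompt("What is LoRA?", history=[("Hi", "Hello!")])
```

For single-turn instruction data this would mean using `build_prompt(instruction)` as the prompt and the answer as the completion, assuming one wants the fine-tuning data to match this format.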

archwolf118 commented on July 17, 2024

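Actually I have the same question: shouldn't the data be built in the "[Round {}]\n问：{}\n答：{}\n" format? Has anyone gotten training to run with data in this format, and how were the results? Thanks.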

mymusise commented on July 17, 2024

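attention_mask is set to True everywhere, so once attention_scores has been computed, every score is filled with -10000.0

@zhangzuizui Setting the mask to all True follows what minimal-llama does. I also tried not passing attention_mask and letting the model generate it internally, but then the loss converged very quickly and the inference output became garbage. (As for why the current setup works, I still need to dig into it 🤣)

Of course it may be related to the gMASK you mentioned. Placing gMASK at the very end, as is done now, is probably also a problem. I will debug further and see how glm's get_masks handles it.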

mymusise commented on July 17, 2024

Looking at modeling_chatglm.py, you can see that the correct prompt format should be:

prompt = ""
for i, (old_query, response) in enumerate(history):
    prompt += "[Round {}]\n问：{}\n答：{}\n".format(i, old_query, response)
prompt += "[Round {}]\n问：{}\n答：".format(len(history), query)

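@OedoSoldier I noticed this too. For a Chinese instruction dataset, this prompt format is probably the best choice.

But if the goal is fine-tuning, I think the instruction dataset's own prompt format should be fine?

That said, I have not trained with this format to compare. If anyone has tried it, let me know.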

sijeh commented on July 17, 2024

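Hello, I have a question: in this code (the preprocess_function quoted above), is the last element of prompt_ids [gMASK]? Also, get_masks and get_position_ids are both computed from the first sentence in the batch. Is that correct when the batch size is greater than 1?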

zhangzuizui commented on July 17, 2024

@sijeh With the get_masks and get_position_ids above, there is indeed a problem when the batch size is greater than 1. I have pushed an update; see the latest code: https://github.com/mymusise/ChatGLM-Tuning/blob/update_mask/finetune.py#L58

Isn't the position_ids generation in this change wrong?
Suppose the data is: token1, token2, [gMASK], token3, token4
Then position_ids should be: [[0, 1, 2, 2, 2], [0, 0, 0, 1, 2]]
But currently it is [[0, 1, 2, 3, 4], [0, 0, 0, 1, 2]]
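The expected value in this example can be reproduced with a few lines of plain Python. This is only a sketch of the two-row scheme as described in the comment, not the model's own get_position_ids; the function name `glm_2d_position_ids` and its arguments are hypothetical.

```python
def glm_2d_position_ids(total_len, context_len, mask_position):
    """Two-row position ids: [positions, block positions].

    total_len      length of the whole sequence
    context_len    number of leading context tokens (here: through [gMASK])
    mask_position  index of the [gMASK] token
    """
    generated = total_len - context_len
    # Row 1: 0..context_len-1, then frozen at the mask position.
    position_ids = list(range(context_len)) + [mask_position] * generated
    # Row 2: 0 for the context, then 1, 2, ... for the generated tokens.
    block_position_ids = [0] * context_len + list(range(1, generated + 1))
    return [position_ids, block_position_ids]


# token1, token2, [gMASK], token3, token4
expected = glm_2d_position_ids(total_len=5, context_len=3, mask_position=2)
```

The "current" output quoted above differs only in the first row, which keeps counting (3, 4) instead of staying at the mask position.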

I took a look; the gmask condition is reversed, it should be

mymusise commented on July 17, 2024

The batch size > 1 problem is fixed; reopen if there are further issues.
