
Comments (5)

yyqi17 commented on September 11, 2024

Here is the main replacement code that got DeepSpeed single-machine multi-GPU training working after my changes (it replaces trainer=LoRATrainer and everything after it):

import deepspeed

# Hand the model, its trainable parameters, the dataset and the collate function
# to DeepSpeed; the engine builds the optimizer and the distributed dataloader
# from the conf dict.
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(config=conf,
                                                                    model=model,
                                                                    model_parameters=model.parameters(),
                                                                    training_data=train_dataset,
                                                                    collate_fn=coll_fn)
model_engine.train()
for i_epoch in range(global_args.num_train_epochs):
    for micro_step, batch in enumerate(train_dataloader):
        # Move each micro-batch onto this process's GPU.
        input_ids = batch["input_ids"].to(model_engine.local_rank)
        labels = batch["labels"].to(model_engine.local_rank)

        outputs = model_engine.forward(input_ids=input_ids, labels=labels)
        loss = outputs[0]

        # DeepSpeed handles gradient accumulation, clipping and the optimizer step.
        model_engine.backward(loss)
        model_engine.step()

    # Save the adapter weights once per epoch. If the engine does not forward
    # save_pretrained to the wrapped model, use model_engine.module.save_pretrained instead.
    save_dir = f'{global_args.output_dir}/{i_epoch}'
    model_engine.save_pretrained(save_dir)

Notes:

  1. Using the original DataCollatorForChatGLM directly as coll_fn causes problems here; coll_fn needs to be a standalone function (similar to DataCollatorForChatGLM.__call__; see the sketch after this list).
  2. The model is loaded by reusing the loading code from the official script.
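
For reference, a minimal sketch of such a standalone collate function could look like the following (all names and padding values are assumptions, not the repo's actual implementation); it pads every example in the batch to the longest sequence and returns the input_ids/labels tensors the training loop above expects:

import torch

# Sketch of a standalone coll_fn (hypothetical details): each example is assumed
# to be a dict with "input_ids" and "labels" as lists of token ids.
def coll_fn(batch, pad_token_id=0, label_pad_id=-100):
    # pad_token_id should come from the tokenizer; 0 is only a placeholder here
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels = [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(list(ex["input_ids"]) + [pad_token_id] * pad)
        labels.append(list(ex["labels"]) + [label_pad_id] * pad)  # -100 is ignored by the loss
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }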

Finally, just change the python launcher in train.sh to deepspeed and it runs.


shuxueslpi commented on September 11, 2024

There are still a few issues for now; I'm also debugging and will update as soon as possible.


yyqi17 commented on September 11, 2024

(Quoting the DeepSpeed replacement code from the first comment.)

Is this conf the lora_config?

No, conf is the DeepSpeed configuration, for example something like this:

# DeepSpeed runtime configuration (not the LoRA config).
conf = {
    "train_micro_batch_size_per_gpu": args.per_device_train_batch_size,
    "gradient_accumulation_steps": args.gradient_accumulation_steps,
    "gradient_clipping": 1.0,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": args.learning_rate,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": args.weight_decay
        }
    },
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": args.zero_stage,
        # Offload optimizer states to CPU to save GPU memory.
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True
    },
}
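
The args.* values referenced above are not shown in this thread; a hypothetical argparse setup supplying them (names assumed to match the dict above) could look like this:

import argparse

# Hypothetical command-line arguments backing the conf dict above (all names are assumptions).
parser = argparse.ArgumentParser()
parser.add_argument("--per_device_train_batch_size", type=int, default=1)
parser.add_argument("--gradient_accumulation_steps", type=int, default=16)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--weight_decay", type=float, default=0.0)
parser.add_argument("--zero_stage", type=int, default=2, choices=[0, 1, 2, 3])
parser.add_argument("--local_rank", type=int, default=-1)  # injected by the deepspeed launcher
args = parser.parse_args()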


WellWang-S commented on September 11, 2024

(Quoting the DeepSpeed replacement code from the first comment.)

Multi-GPU training throws an error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices. Have you run into this?


yyqi17 commented on September 11, 2024

(Quoting the DeepSpeed replacement code and the RuntimeError question above.)

When I hit this error it came from the model-loading part, i.e. the model = xxxModel() code before this block. You might want to check whether model_device_map is correct.
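
For what it's worth, a rough sketch of loading the model with the whole network pinned to the local rank's GPU (assuming a transformers-style from_pretrained load as in the official script; exact arguments may differ) looks like this. Letting an automatic device_map shard the model across several GPUs while DeepSpeed is doing data parallelism is one common cause of that error:

import os
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: put one full copy of the model on this process's GPU instead of
# sharding it across devices with an automatic device_map.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

q_config = BitsAndBytesConfig(load_in_4bit=True,
                              bnb_4bit_quant_type="nf4",
                              bnb_4bit_use_double_quant=True,
                              bnb_4bit_compute_dtype=torch.float16)

model = AutoModel.from_pretrained("THUDM/chatglm-6b",          # model path as in the official script
                                  trust_remote_code=True,
                                  quantization_config=q_config,
                                  device_map={"": local_rank})  # one full copy per rank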

