Comments (2)
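Can this be used for distributed training across multiple nodes with multiple GPUs? I am using 4 nodes with two GPUs each, but only one node works properly; the GPUs on the other nodes are idle.
Thanks!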
It should be supported out of the box. The following is from llama2.c:
To run on a single GPU small debug run, example:
$ python train.py --compile=False --eval_iters=10 --batch_size=8
To run with DDP on 4 gpus on 1 node, example:
$ torchrun --standalone --nproc_per_node=4 train.py
To run with DDP across 2 nodes (the commands below use 8 gpus per node; set --nproc_per_node to the number of gpus on each node), example:
- Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
- Run on the worker node:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
(If your cluster does not have an InfiniBand interconnect, prepend NCCL_IB_DISABLE=1 to the command.)
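For the setup in the question (4 nodes, 2 GPUs each), the same pattern has to be started once on every node, with only --node_rank changing from node to node. A minimal sketch that prints the four commands is below. The helper names `torchrun_commands` and `describe_process` are made up for illustration, and 10.0.0.1 is a placeholder for the master node's address, not a value from this thread. `describe_process` reads the RANK, WORLD_SIZE and LOCAL_RANK variables that torchrun exports to each worker, which can help confirm that every node actually joined the job.

```python
import os


def torchrun_commands(nnodes, nproc_per_node, master_addr, master_port, script="train.py"):
    """Build one torchrun command per node; only --node_rank differs."""
    return [
        f"torchrun --nproc_per_node={nproc_per_node} --nnodes={nnodes} "
        f"--node_rank={rank} --master_addr={master_addr} "
        f"--master_port={master_port} {script}"
        for rank in range(nnodes)
    ]


def describe_process(env=os.environ):
    """Summarize the variables torchrun exports to each worker process."""
    rank = int(env.get("RANK", -1))
    if rank == -1:
        return "not launched by torchrun (single-process run)"
    return f"rank {rank} of {env['WORLD_SIZE']}, local rank {env['LOCAL_RANK']}"


if __name__ == "__main__":
    # Assumed setup from the question: 4 nodes with 2 GPUs each.
    # 10.0.0.1 is a placeholder for the master node's IP address.
    for cmd in torchrun_commands(4, 2, "10.0.0.1", 1234):
        print(cmd)
    print(describe_process())
```

With 4 nodes and 2 processes per node, the world size should come out as 8. If a process on any node reports a smaller world size, or the job hangs at startup, it is worth checking that the command was started on every node and that each node can reach the master address and port.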
from baby-llama2-chinese.
OK, thanks!