Coder Social home page Coder Social logo

yangjianxin1 / firefly-llama2-chinese Goto Github PK

View Code? Open in Web Editor NEW
373.0 11.0 25.0 2.07 MB

Firefly中文LLaMA-2大模型,支持增量预训练Baichuan2、Llama2、Llama、Falcon、Qwen、Baichuan、InternLM、Bloom等大模型

Python 100.00%
firefly llama llama-2 llama2 llm baichuan baichuan-13b bloom chatglm falcon

firefly-llama2-chinese's Issues

AttributeError: 'NoneType' object has no attribute 'get'

torchrun --nproc_per_node=1 train.py --train_args_file train_args/Glm.yaml
Traceback (most recent call last):
File "/home/yierde/anaconda3/envs/tn/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
worker_ids = self._start_workers(worker_group)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
self._pcontext = start_processes(
^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/init.py", line 207, in start_processes
redirs = to_map(redirects, nprocs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yierde/anaconda3/envs/tn/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
map[i] = val_or_map.get(i, Std.NONE)
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

怎么解决它? 我没办法跑起来。。。

关于训练细节

您好,请问在指令微调时,验证集是怎么构建的?大概多大?

数据处理模块问题,train_texts长度为1

您好,咨询以下数据处理模块的问题。我的pt数据路径下共有5个txt文件,在加载阶段也都是可以正常加载,如下所示:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:13<00:00, 2.68s/it]
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:120 - Total num of training text: 5
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:123 - Start tokenizing data ...
0%| | 0/1 [00:00<?, ?it/s]
可以看到加载完数据后开始tokenizing的时候,总数据才是1。我看了下component的dataset.py,在load_datasets里打印了train_texts的长度,结果就是1,也就是说把数据全部添加到train_texts列表的时候,多了一层[],这样的话,在for i in tqdm(range(0, len(train_texts), self.tokenize_batch)) tokenizing时步长是不是不对?而且我测试的时候这一步经常会爆内存(内存320G),看起来是循环出问题了。不清楚为什么会多一层列表,是哪个环节读取txt的时候出的问题呢。
如果我把改为load_datasets里的代码改为train_texts = train_texts[0],可以看到tqdm显示数据总量似乎正确,但这样是会正常处理数据吗?希望可以得到解答。

微调模型保存问题

@yangjianxin1 ,您好, 在调用torchrun --nproc_per_node={num_gpus} train.py --train_args_file train_args/llama2-13b-ext.yaml这个命令之后,全量模型微调运行结束,为什么没有保存微调后的模型在output文件夹下?output文件夹下只有一些训练参数文件,没有模型文件?

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Error operation not supported at line 351 in file /home/tim/git/bitsandbytes/csrc/pythonInterface.c
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31779) of binary: /root/miniconda3/envs/chatglm_ft/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/chatglm_ft/bin/torchrun", line 8, in
sys.exit(main())
File "/root/miniconda3/envs/chatglm_ft/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/chatglm_ft/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/chatglm_ft/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/chatglm_ft/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/chatglm_ft/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

CUDA:11.7
CentOS Linux release 7.7.1908 (Core)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.