neuronx-nemo-megatron's Issues
Issue with Llama conversion for new release
I noticed that in the latest release, llama_module.py was replaced with falcon_module.py. And then, in test_llama.sh, you rely on megatron_gpt_pretraining.py (which relies on MegatronGPTModel instead of llama_module.py).
The problem is that MegatronGPTModel in turn relies on transformer.py (instead of llama_module.py), and there, for SwiGLU, the two separate MLP layers (dense_h_to_4h and dense_h_to_4h_2) have been replaced with a single one, twice as large:
self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
    hidden_size,
    2*ffn_hidden_size if self.glu_activation_family else ffn_hidden_size,
    gather_output=False,
    init_method=init_method,
    skip_bias_add=True,
    resume_from_checkpoint=resume_from_checkpoint,
    use_cpu_initialization=use_cpu_initialization,
    bias=bias,
    sequence_parallel_enabled=sequence_parallel,
    no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
    gradient_accumulation_fusion=gradient_accumulation_fusion,
    transfer_with_static_ring=transfer_with_static_ring,
)
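(For context: with a single fused projection like this, the *GLU activation is usually computed by splitting the doubled output in two. A minimal sketch of that idea, not the repo's exact code:)

import torch
import torch.nn.functional as F

def swiglu_from_fused(fused_out):
    # fused_out: [..., 2 * ffn_hidden_size] produced by the single dense_h_to_4h.
    # Split into the "gate" and "up" halves, then apply SwiGLU: silu(gate) * up.
    gate, up = torch.chunk(fused_out, 2, dim=-1)
    return F.silu(gate) * up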
While in llama_module.py you had:
self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
    hidden_size,
    ffn_hidden_size,  # NOTE: When using geglu, divide ffn dim by 2/3 to keep overall params the same.
    gather_output=False,
    init_method=init_method,
    skip_bias_add=True,
    use_cpu_initialization=use_cpu_initialization,
    bias=bias,
    sequence_parallel_enabled=sequence_parallel,
    no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
    gradient_accumulation_fusion=gradient_accumulation_fusion,
    transfer_with_static_ring=transfer_with_static_ring,
)
if activation in ['geglu', 'reglu', 'swiglu']:
    # Separate linear layer for *GLU activations.
    # Source: https://github.com/huggingface/transformers/blob/bee361c6f1f7704f8c688895f2f86f6e5ff84727/src/transformers/models/t5/modeling_t5.py#L292
    self.dense_h_to_4h_2 = tensor_parallel.ColumnParallelLinear(
But then the checkpoint conversion script for llama would have to change as well; it is currently:
translation = {
    "model.language_model.embedding.word_embeddings.weight": (1, "model.embed_tokens.weight", 0, 0),
    # a['model']['language_model']['word_embeddings']['weight']
    "input_layernorm.weight": (0, "input_layernorm.weight", None, 0),
    "self_attention.query_key_value.weight": (1, "self_attn.query_key_value.weight", 0, 0),
    "self_attention.dense.weight": (1, "self_attn.o_proj.weight", 1, 0),
    "post_attention_layernorm.weight": (0, "post_attention_layernorm.weight", None, 0),
    "self_attention.core_attention.rotary_emb.inv_freq": (0, "self_attn.rotary_emb.inv_freq", None, 0),
    "mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj.weight", 0, 0),
    "mlp.dense_h_to_4h_2.weight": (1, "mlp.up_proj.weight", 0, 0),
    "mlp.dense_4h_to_h.weight": (1, "mlp.down_proj.weight", 1, 0),
    "model.language_model.encoder.final_layernorm.weight": (0, "model.norm.weight", None, 0),
    "model.language_model.output_layer.weight": (1, "lm_head.weight", 0, 0),
}
This currently causes a crash when I try to load a checkpoint converted from HF Llama, since the model expects dense_h_to_4h to be a concatenation of gate_proj and up_proj (from the HF checkpoint):
RuntimeError: Error(s) in loading state_dict for MegatronGPTModel:
    size mismatch for model.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1376, 4096]) from checkpoint, the shape in current model is torch.Size([2752, 4096]).
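For reference, if the fused layout is kept, the conversion script would presumably need to concatenate the two HF projections into a single tensor. A rough sketch of that idea (the helper name and state-dict keys below are illustrative, and tensor-parallel sharding of the fused weight is ignored):

import torch

def fuse_gate_up(hf_state, layer_idx):
    # Hypothetical helper: stack gate_proj on top of up_proj along the output
    # dimension to match a dense_h_to_4h of shape [2 * ffn_hidden_size, hidden_size].
    gate = hf_state[f"model.layers.{layer_idx}.mlp.gate_proj.weight"]  # [ffn, hidden]
    up = hf_state[f"model.layers.{layer_idx}.mlp.up_proj.weight"]      # [ffn, hidden]
    # Whether gate or up comes first, and how the fused tensor must be split
    # across tensor-parallel ranks, depends on how transformer.py consumes it.
    return torch.cat([gate, up], dim=0)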
Typo in nemo.collections.nlp.parts.serialization.py
Hi,
In serialization.py, the save method will use SimpleSaver (when save_xser is true):
class SimpleSaver:
    def __init__(self):
        pass

    def add_save_task(self, data, path):
        torch.save(data, path)
However, on line 105 it tries to instantiate SimplerSaver (instead of SimpleSaver):
if saver is None:
    saver = SimplerSaver()
This is causing a crash:
saver = SimplerSaver()
NameError: name 'SimplerSaver' is not defined
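Presumably the fix is just to instantiate the class that is actually defined, e.g.:

if saver is None:
    saver = SimpleSaver()  # was: SimplerSaver(), which is undefined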
Validation crashes when continuing training on a previous checkpoint
Background:
When val_check_interval is set to 1, validation loss works as expected. However, when one loads a previous checkpoint (by setting +model.load_xser=True and +model.resume_from_checkpoint=/shared/model/mp_rank_00/megatron_bigcode--step\=154-consumed_samples\=156672.0.ckpt), MegatronGPTModel.validation_epoch_end() crashes on line 805. This is the error message:
    trainer.fit(model)
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self._run(model, ckpt_path=self.ckpt_path)
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 805, in validation_epoch_end
    averaged_loss = torch.stack(outputs).mean()
RuntimeError: stack expects a non-empty TensorList
Basically, outputs is an empty list. But from what I understood, in that particular case parallel_state.is_pipeline_last_stage() on line 802 should return False.
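As a stopgap, one could imagine guarding the reduction so that ranks with no collected outputs skip the stack. A minimal sketch (an assumed helper, not the actual fix in megatron_gpt_model.py):

import torch

def averaged_validation_loss(outputs, is_last_pipeline_stage):
    # Hypothetical guard: only reduce when this rank actually collected
    # validation outputs; otherwise return a placeholder zero loss.
    if is_last_pipeline_stage and len(outputs) > 0:
        return torch.stack(outputs).mean()
    return torch.tensor(0.0)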
HF->nemo checkpoint conversion script for llama2 models
Hi, while the code for training llama2-architecture models (without GQA?) has been released, I couldn't find any scripts for converting a pre-trained llama2's HF weights to a nemo checkpoint. If that is correct, we can currently only train a llama2 from scratch instead of fine-tuning a pre-trained one.
Is there a plan to release a llama2 conversion script?