
aws-neuron / neuronx-nemo-megatron

Python 57.31% C++ 3.67% Cuda 4.11% C 0.02% Shell 0.61% Makefile 0.04% CSS 0.04% HTML 0.07% Dockerfile 0.05% Groovy 0.02% JavaScript 0.01% TeX 0.24% Ruby 0.07% Jupyter Notebook 33.74%

neuronx-nemo-megatron's People

Contributors

5cp, amazon-auto, amithrm, aroparasaws, aws-mesharma, aws-sadaf, dependabot[bot], hahtk, micwade-aws, ptoulme-aws


neuronx-nemo-megatron's Issues

Issue with Llama conversion for new release

I noticed that in the latest release, llama_module.py was replaced with falcon_module.py, and test_llama.sh now relies on megatron_gpt_pretraining.py (which uses MegatronGPTModel instead of llama_module.py).

The problem is that MegatronGPTModel eventually goes through transformer.py (instead of llama_module.py), and there, for SwiGLU, the two separate MLP layers (dense_h_to_4h and dense_h_to_4h_2) have been replaced with a single one that is twice as large:

        self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
            hidden_size,
            2*ffn_hidden_size if self.glu_activation_family else ffn_hidden_size,
            gather_output=False,
            init_method=init_method,
            skip_bias_add=True,
            resume_from_checkpoint=resume_from_checkpoint,
            use_cpu_initialization=use_cpu_initialization,
            bias=bias,
            sequence_parallel_enabled=sequence_parallel,
            no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
            gradient_accumulation_fusion=gradient_accumulation_fusion,
            transfer_with_static_ring=transfer_with_static_ring,
        )
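
For context, with this fused layout the forward pass typically splits the doubled projection back into a gate half and an up half. A minimal sketch of the idea (plain nn.Linear standing in for ColumnParallelLinear, hypothetical sizes, not the repository's actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    hidden_size, ffn_hidden_size = 4096, 11008   # example sizes, full (non-sharded) dims

    dense_h_to_4h = nn.Linear(hidden_size, 2 * ffn_hidden_size, bias=False)  # fused gate + up
    dense_4h_to_h = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    x = torch.randn(2, hidden_size)
    gate, up = torch.chunk(dense_h_to_4h(x), 2, dim=-1)   # split the doubled projection in half
    y = dense_4h_to_h(F.silu(gate) * up)                  # SwiGLU: silu(gate) * up, project back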

Whereas in llama_module.py you had:

        self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
            hidden_size,
            ffn_hidden_size,  # NOTE: When using geglu, divide ffn dim by 2/3 to keep overall params the same.
            gather_output=False,
            init_method=init_method,
            skip_bias_add=True,
            use_cpu_initialization=use_cpu_initialization,
            bias=bias,
            sequence_parallel_enabled=sequence_parallel,
            no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
            gradient_accumulation_fusion=gradient_accumulation_fusion,
            transfer_with_static_ring=transfer_with_static_ring,
        )

        if activation in ['geglu', 'reglu', 'swiglu']:
            # Separate linear layer for *GLU activations.
            # Source: https://github.com/huggingface/transformers/blob/bee361c6f1f7704f8c688895f2f86f6e5ff84727/src/transformers/models/t5/modeling_t5.py#L292
            self.dense_h_to_4h_2 = tensor_parallel.ColumnParallelLinear(

But then the checkpoint conversion script for Llama would have to change as well; it currently contains:

translation = {
        "model.language_model.embedding.word_embeddings.weight": (1, "model.embed_tokens.weight", 0, 0),
        # a['model']['language_model']['word_embeddings']['weight']
        "input_layernorm.weight": (0, "input_layernorm.weight", None, 0),
        "self_attention.query_key_value.weight": (1, "self_attn.query_key_value.weight", 0, 0),
        "self_attention.dense.weight": (1, "self_attn.o_proj.weight", 1, 0),
        "post_attention_layernorm.weight": (0, "post_attention_layernorm.weight", None, 0),
        "self_attention.core_attention.rotary_emb.inv_freq": (0, "self_attn.rotary_emb.inv_freq", None, 0),
        "mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj.weight", 0, 0),
        "mlp.dense_h_to_4h_2.weight": (1, "mlp.up_proj.weight", 0, 0),
        "mlp.dense_4h_to_h.weight": (1, "mlp.down_proj.weight", 1, 0),
        "model.language_model.encoder.final_layernorm.weight": (0, "model.norm.weight", None, 0),
        "model.language_model.output_layer.weight": (1, "lm_head.weight", 0, 0),
    }

This is currently causing a crash when I try to load a checkpoint converted from HF Llama, since the new model expects dense_h_to_4h to be a concatenation of gate_proj and up_proj (from the HF checkpoint):

RuntimeError: Error(s) in loading state_dict for MegatronGPTModel:
        size mismatch for model.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1376, 4096]) from checkpoint, the shape in current model is torch.Size([2752, 4096]).
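
For reference, one way the HF-to-NeMo conversion could be adapted to the fused layout is to concatenate gate_proj and up_proj along the output dimension when building dense_h_to_4h. This is only a minimal sketch, not the repository's script; the helper name and sharding details are assumptions:

    import torch

    def fuse_gate_up(gate_proj_weight, up_proj_weight, tp_rank, tp_size):
        """Hypothetical helper: build the fused dense_h_to_4h weight for one TP rank
        from the HF mlp.gate_proj / mlp.up_proj weights, each of shape
        [ffn_hidden_size, hidden_size]."""
        # Shard each projection across tensor-parallel ranks along the output dim,
        # then stack the gate shard on top of the up shard so the result matches the
        # [2 * ffn_hidden_size / tp_size, hidden_size] shape the new model expects
        # (e.g. [2752, 4096] in the error above, versus [1376, 4096] per projection).
        gate_shard = torch.chunk(gate_proj_weight, tp_size, dim=0)[tp_rank]
        up_shard = torch.chunk(up_proj_weight, tp_size, dim=0)[tp_rank]
        return torch.cat([gate_shard, up_shard], dim=0)

Whether the gate half comes first, or the two halves are interleaved, depends on how transformer.py splits the fused output in its forward pass, so the exact layout would need to be checked against that code.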

Typo in nemo.collections.nlp.parts.serialization.py

Hi,

In serialization.py, the save method will use SimpleSaver (when save_xser is true):

class SimpleSaver:

    def __init__(self):
        pass

    def add_save_task(self, data, path):
        torch.save(data, path)

However, on line 105, it tries to instantiate SimplerSaver instead of SimpleSaver:

  if saver is None:
      saver = SimplerSaver()

This is causing a crash:

    saver = SimplerSaver()
NameError: name 'SimplerSaver' is not defined
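
The fix appears to be a one-character class-name correction (a sketch, assuming the surrounding code is exactly as quoted above):

    # serialization.py, around line 105 (as quoted in this issue)
    if saver is None:
        saver = SimpleSaver()  # was: SimplerSaver(), which raises the NameError above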

Validation crashes when continuing training on a previous checkpoint

Background:

When val_check_interval is set to 1, validation loss works as expected. However, when one loads a previous checkpoint (by setting +model.load_xser=True and +model.resume_from_checkpoint=/shared/model/mp_rank_00/megatron_bigcode--step\=154-consumed_samples\=156672.0.ckpt), MegatronGPTModel.validation_epoch_end() crashes on line 805.

This is the error message:

averaged_loss = torch.stack(outputs).mean()
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 805, in validation_epoch_end
RuntimeError: stack expects a non-empty TensorList
    self._evaluation_epoch_end(self._outputs)
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    trainer.fit(model)
    self._outputs = self.epoch_loop.run(self._data_fetcher)
    self._run(model, ckpt_path=self.ckpt_path)
Basically, outputs is an empty list. But from what I understand, in that particular case parallel_state.is_pipeline_last_stage() on line 802 should return False.
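
A minimal guard along these lines would avoid the crash (a sketch only, assuming validation_epoch_end around lines 802-805 looks like the traceback suggests; it is not the repository's actual fix):

    # Hypothetical guard inside MegatronGPTModel.validation_epoch_end (sketch only)
    if parallel_state.is_pipeline_last_stage() and len(outputs) > 0:
        averaged_loss = torch.stack(outputs).mean()
    else:
        # Non-last pipeline stages (or ranks that produced no outputs after resuming
        # from a checkpoint) skip the reduction instead of calling torch.stack([]).
        averaged_loss = torch.tensor(0.0)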

HF->nemo checkpoint conversion script for llama2 models

Hi, while the code for training llama2-architecture models (without GQA?) has been released, I couldn't find any script for converting a pre-trained llama2's HF weights to a NeMo checkpoint. If that is the case, at the moment we can only train a llama2 from scratch instead of fine-tuning a pre-trained one.
Is there a plan to release a llama2 conversion script?
