
aws-neuron / neuronx-nemo-megatron

Python 57.31% C++ 3.67% Cuda 4.11% C 0.02% Shell 0.61% Makefile 0.04% CSS 0.04% HTML 0.07% Dockerfile 0.05% Groovy 0.02% JavaScript 0.01% TeX 0.24% Ruby 0.07% Jupyter Notebook 33.74%

neuronx-nemo-megatron's People

Contributors

5cp, amazon-auto, amithrm, aroparasaws, aws-mesharma, aws-sadaf, dependabot[bot], hahtk, micwade-aws, ptoulme-aws


neuronx-nemo-megatron's Issues

Issue with Llama conversion for new release

I noticed that in the latest release, llama_module.py was replaced with falcon_module.py, and test_llama.sh now relies on megatron_gpt_pretraining.py (which uses MegatronGPTModel instead of llama_module.py).

The problem is that MegatronGPTModel eventually goes through transformer.py (instead of llama_module.py), and there, for SwiGLU, the two separate MLP layers (dense_h_to_4h and dense_h_to_4h_2) have been replaced with a single one that is twice as large:

        self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
            hidden_size,
            2*ffn_hidden_size if self.glu_activation_family else ffn_hidden_size,
            gather_output=False,
            init_method=init_method,
            skip_bias_add=True,
            resume_from_checkpoint=resume_from_checkpoint,
            use_cpu_initialization=use_cpu_initialization,
            bias=bias,
            sequence_parallel_enabled=sequence_parallel,
            no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
            gradient_accumulation_fusion=gradient_accumulation_fusion,
            transfer_with_static_ring=transfer_with_static_ring,
        )
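
For context, with this fused layout the forward pass typically splits the doubled projection back into a gate half and an up half. A minimal sketch of the idea (plain nn.Linear standing in for ColumnParallelLinear, hypothetical sizes, not the repository's actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    hidden_size, ffn_hidden_size = 4096, 11008   # example sizes, full (non-sharded) dims

    dense_h_to_4h = nn.Linear(hidden_size, 2 * ffn_hidden_size, bias=False)  # fused gate + up
    dense_4h_to_h = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    x = torch.randn(2, hidden_size)
    gate, up = torch.chunk(dense_h_to_4h(x), 2, dim=-1)   # split the doubled projection in half
    y = dense_4h_to_h(F.silu(gate) * up)                  # SwiGLU: silu(gate) * up, project back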

Whereas in llama_module.py you had:

        self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear(
            hidden_size,
            ffn_hidden_size,  # NOTE: When using geglu, divide ffn dim by 2/3 to keep overall params the same.
            gather_output=False,
            init_method=init_method,
            skip_bias_add=True,
            use_cpu_initialization=use_cpu_initialization,
            bias=bias,
            sequence_parallel_enabled=sequence_parallel,
            no_async_tensor_model_parallel_allreduce=no_async_tensor_model_parallel_allreduce,
            gradient_accumulation_fusion=gradient_accumulation_fusion,
            transfer_with_static_ring=transfer_with_static_ring,
        )

        if activation in ['geglu', 'reglu', 'swiglu']:
            # Separate linear layer for *GLU activations.
            # Source: https://github.com/huggingface/transformers/blob/bee361c6f1f7704f8c688895f2f86f6e5ff84727/src/transformers/models/t5/modeling_t5.py#L292
            self.dense_h_to_4h_2 = tensor_parallel.ColumnParallelLinear(

But then the checkpoint conversion script for Llama would have to change as well; it currently contains:

translation = {
        "model.language_model.embedding.word_embeddings.weight": (1, "model.embed_tokens.weight", 0, 0),
        # a['model']['language_model']['word_embeddings']['weight']
        "input_layernorm.weight": (0, "input_layernorm.weight", None, 0),
        "self_attention.query_key_value.weight": (1, "self_attn.query_key_value.weight", 0, 0),
        "self_attention.dense.weight": (1, "self_attn.o_proj.weight", 1, 0),
        "post_attention_layernorm.weight": (0, "post_attention_layernorm.weight", None, 0),
        "self_attention.core_attention.rotary_emb.inv_freq": (0, "self_attn.rotary_emb.inv_freq", None, 0),
        "mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj.weight", 0, 0),
        "mlp.dense_h_to_4h_2.weight": (1, "mlp.up_proj.weight", 0, 0),
        "mlp.dense_4h_to_h.weight": (1, "mlp.down_proj.weight", 1, 0),
        "model.language_model.encoder.final_layernorm.weight": (0, "model.norm.weight", None, 0),
        "model.language_model.output_layer.weight": (1, "lm_head.weight", 0, 0),
    }

This is currently causing a crash when I try to load a checkpoint converted from HF Llama, since the new model expects dense_h_to_4h to be a concatenation of gate_proj and up_proj (from the HF checkpoint):

RuntimeError: Error(s) in loading state_dict for MegatronGPTModel:
        size mismatch for model.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1376, 4096]) from checkpoint, the shape in current model is torch.Size([2752, 4096]).
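
For reference, one way the HF-to-NeMo conversion could be adapted to the fused layout is to concatenate gate_proj and up_proj along the output dimension when building dense_h_to_4h. This is only a minimal sketch, not the repository's script; the helper name and sharding details are assumptions:

    import torch

    def fuse_gate_up(gate_proj_weight, up_proj_weight, tp_rank, tp_size):
        """Hypothetical helper: build the fused dense_h_to_4h weight for one TP rank
        from the HF mlp.gate_proj / mlp.up_proj weights, each of shape
        [ffn_hidden_size, hidden_size]."""
        # Shard each projection across tensor-parallel ranks along the output dim,
        # then stack the gate shard on top of the up shard so the result matches the
        # [2 * ffn_hidden_size / tp_size, hidden_size] shape the new model expects
        # (e.g. [2752, 4096] in the error above, versus [1376, 4096] per projection).
        gate_shard = torch.chunk(gate_proj_weight, tp_size, dim=0)[tp_rank]
        up_shard = torch.chunk(up_proj_weight, tp_size, dim=0)[tp_rank]
        return torch.cat([gate_shard, up_shard], dim=0)

Whether the gate half comes first, or the two halves are interleaved, depends on how transformer.py splits the fused output in its forward pass, so the exact layout would need to be checked against that code.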

Typo in nemo.collections.nlp.parts.serialization.py

Hi,

In serialization.py, the save method will use SimpleSaver (when save_xser is true):

class SimpleSaver:

    def __init__(self):
        pass

    def add_save_task(self, data, path):
        torch.save(data, path)

However, on line 105, it tries to instantiate SimplerSaver instead of SimpleSaver:

  if saver is None:
      saver = SimplerSaver()

This is causing a crash:

    saver = SimplerSaver()
NameError: name 'SimplerSaver' is not defined
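
The fix appears to be a one-character class-name correction (a sketch, assuming the surrounding code is exactly as quoted above):

    # serialization.py, around line 105 (as quoted in this issue)
    if saver is None:
        saver = SimpleSaver()  # was: SimplerSaver(), which raises the NameError above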

Validation crashes when continuing training on a previous checkpoint

Background:

When val_check_interval is set to 1, validation loss works as expected. However, when one loads a previous checkpoint (by setting +model.load_xser=True and +model.resume_from_checkpoint=/shared/model/mp_rank_00/megatron_bigcode--step\=154-consumed_samples\=156672.0.ckpt), MegatronGPTModel.validation_epoch_end() crashes on line 805.

This is the error message:

averaged_loss = torch.stack(outputs).mean()
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 805, in validation_epoch_end
RuntimeError: stack expects a non-empty TensorList
    self._evaluation_epoch_end(self._outputs)
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
  File "/shared/home/env/aws_neuron_venv_pytorch_py39v2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    trainer.fit(model)
    self._outputs = self.epoch_loop.run(self._data_fetcher)
    self._run(model, ckpt_path=self.ckpt_path)
Basically, outputs is an empty list. But from what I understand, in that particular case parallel_state.is_pipeline_last_stage() on line 802 should return False.
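
A minimal guard along these lines would avoid the crash (a sketch only, assuming validation_epoch_end around lines 802-805 looks like the traceback suggests; it is not the repository's actual fix):

    # Hypothetical guard inside MegatronGPTModel.validation_epoch_end (sketch only)
    if parallel_state.is_pipeline_last_stage() and len(outputs) > 0:
        averaged_loss = torch.stack(outputs).mean()
    else:
        # Non-last pipeline stages (or ranks that produced no outputs after resuming
        # from a checkpoint) skip the reduction instead of calling torch.stack([]).
        averaged_loss = torch.tensor(0.0)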

HF->nemo checkpoint conversion script for llama2 models

Hi, while the code for training llama2-architecture models (without GQA?) has been released, I couldn't find any script for converting a pre-trained llama2's HF weights to a NeMo checkpoint. If that is the case, at the moment we can only train a llama2 from scratch instead of fine-tuning a pre-trained one.
Is there a plan to release a llama2 conversion script?
