hazyresearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
License: Apache License 2.0
Hi!
Congrats on the really great work. I'll definitely be trying Based out and referencing your work here in future :)
Was really happy to see you found the Eval Harness useful! I wanted to see if you were interested in or needed any help upstreaming the custom evals you created to the main harness--it'd be great to have these more easily reproducible so future work can compare to the evaluations you report! I'd be happy to help on this front.
Hello,
I would like to extend my sincere appreciation for the outstanding work you have done. While going through the paper, I came across a column labeled 'params' in Table 3:
Based on my understanding, these values appear to be denoted in millions (M). Could you please confirm whether this interpretation is correct?
Additionally, I am curious whether there is any possibility of gaining access to the code and scripts used for pre-training the model on the hg38 dataset and fine-tuning it on genomic benchmarks?
Thanks.
Hi,
Thanks for releasing this! Would you mind adding a license to the code and larger weights?
Thanks!
I am having a go at running inference and evaluation for this model, and running into a TypeError in GPTLMHeadModel:
In [1]: import torch
...: from transformers import AutoTokenizer
...: from based.models.gpt import GPTLMHeadModel
...:
...: tokenizer = AutoTokenizer.from_pretrained("gpt2")
   ...: model = GPTLMHeadModel.from_pretrained_hf("hazyresearch/based-360m").to("cuda", dtype=torch.float16)
tokenizer_config.json: 100%|███████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 260kB/s]
config.json: 100%|██████████████████████████████████████████████████████| 665/665 [00:00<00:00, 8.64MB/s]
vocab.json: 100%|███████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 12.1MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 8.99MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 17.8MB/s]
config.json: 100%|██████████████████████████████████████████████████| 2.86k/2.86k [00:00<00:00, 36.7MB/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 6
3 from based.models.gpt import GPTLMHeadModel
5 tokenizer = AutoTokenizer.from_pretrained("gpt2")
----> 6 model = GPTLMHeadModel.from_pretrained_hf("hazyresearch/based-360m").to("cuda", dtype=torch.float16)
File /based/models/gpt.py:468, in GPTPreTrainedModel.from_pretrained_hf(cls, pretrained_model_name, device, **kwargs)
466 config_data = load_config_hf(pretrained_model_name)
467 config = GPT2Config(**config_data)
--> 468 model = cls(config, device=device, **kwargs)
469 state_dict = load_state_dict_hf(pretrained_model_name, device=device)
471 # remove the 'model.' prefix from the keys
File /based/models/gpt.py:741, in GPTLMHeadModel.__init__(self, config, process_group, device, dtype)
739 super().__init__(config)
740 self.process_group = process_group
--> 741 self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
742 self.tie_word_embeddings = getattr(config, "tie_word_embeddings", True)
743 lm_head_bias = getattr(config, "lm_head_bias", False)
File /based/models/gpt.py:585, in GPTModel.__init__(self, config, process_group, device, dtype)
569 self.embeddings = ParallelGPT2Embeddings(
570 config.hidden_size,
571 vocab_size,
(...)
575 **factory_kwargs,
576 )
578 # We change the order of dropout, residual and layer norm:
579 # Instead of LN -> Attn / MLP -> Dropout -> Add, we do:
580 # Dropout -> Add -> LN -> Attn / MLP, returning both the residual branch (output of Add) and
581 # the main branch (output of MLP). The model definition is unchanged, but the mapping of the
582 # nn.Dropout probabilities are changed.
583 # This is for performance reason: we can fuse dropout + add + layer_norm.
584 self.layers = nn.ModuleList(
--> 585 [
586 create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
587 for i in range(config.num_hidden_layers)
588 ]
589 )
590 self.fused_dropout_add_ln = getattr(config, "fused_dropout_add_ln", False)
591 if self.fused_dropout_add_ln:
File /based/models/gpt.py:586, in <listcomp>(.0)
569 self.embeddings = ParallelGPT2Embeddings(
570 config.hidden_size,
571 vocab_size,
(...)
575 **factory_kwargs,
576 )
578 # We change the order of dropout, residual and layer norm:
579 # Instead of LN -> Attn / MLP -> Dropout -> Add, we do:
580 # Dropout -> Add -> LN -> Attn / MLP, returning both the residual branch (output of Add) and
581 # the main branch (output of MLP). The model definition is unchanged, but the mapping of the
582 # nn.Dropout probabilities are changed.
583 # This is for performance reason: we can fuse dropout + add + layer_norm.
584 self.layers = nn.ModuleList(
585 [
--> 586 create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
587 for i in range(config.num_hidden_layers)
588 ]
589 )
590 self.fused_dropout_add_ln = getattr(config, "fused_dropout_add_ln", False)
591 if self.fused_dropout_add_ln:
File /based/models/gpt.py:371, in create_block(config, layer_idx, process_group, device, dtype, **kwargs)
369 mlp_cls = create_mlp_cls(config, layer_idx, process_group=process_group, **factory_kwargs)
370 use_rms_norm = getattr(config, "rms_norm", False)
--> 371 norm_cls = partial(
372 nn.LayerNorm if not use_rms_norm else RMSNorm,
373 eps=config.layer_norm_epsilon,
374 **factory_kwargs,
375 )
376 # TD [2022-07-30]: Force residual in fp32, seems to make fp16 training more stable
377 residual_in_fp32 = getattr(config, "residual_in_fp32", False)
TypeError: the first argument must be callable
For reproducibility, I have been running this in a Docker container:
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y \
apt-utils \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install \
torch==2.1.2 \
torchvision==0.16.2 \
torchaudio==2.1.2 \
--index-url https://download.pytorch.org/whl/cu118 # due to observed causal-conv1d dependency
RUN pip install \
jupyter==1.0.0 \
hydra-core==1.3.2
COPY . .
RUN pip install .
Any idea what could be going wrong here?
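For what it's worth, functools.partial raises exactly this message when its first argument is None. Here is a minimal repro of my hedged guess at the cause (assumptions on my part: the based-360m config sets rms_norm=True, and an optional fused RMSNorm import failed and fell back to None):

from functools import partial
import torch.nn as nn

# Hypothetical scenario: a guarded optional import left RMSNorm = None while
# the model config requests RMS norm.
RMSNorm = None
use_rms_norm = True

# Same shape of expression as the line at /based/models/gpt.py:371; with
# RMSNorm = None this raises: TypeError: the first argument must be callable
norm_cls = partial(nn.LayerNorm if not use_rms_norm else RMSNorm, eps=1e-5)

If that guess is right, the fix would be installing whatever optional CUDA extension provides the fused RMSNorm, rather than changing the model code.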
Hi,
Is it possible to apply Based to an existing model?
Thanks!
Thank you for sharing your code along with your paper. This makes things reproducible and is extremely appreciated.
The second-order Taylor approximation of the exponential should be 1 + q·k + (q·k)^2/2.
However, in my understanding this code actually computes 1 + q·k + (q^2)·(k^2)/2, which is not equivalent. Am I correct? If not, could you point me to the part of the code that computes the Taylor expansion?
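For concreteness, here is a minimal sketch (my own illustration, not the repo's kernels) of the outer-product feature map under which the (q·k)^2 term is exact; if the code instead computed an elementwise q^2·k^2, the two would indeed differ as described:

import torch

def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    # phi(x) = [1, x, flatten(x outer x) / sqrt(2)], so that
    # phi(q) @ phi(k) == 1 + q@k + (q@k)**2 / 2 exactly, since
    # sum_{ij} q_i q_j k_i k_j == (q@k)**2.
    x2 = torch.einsum("...i,...j->...ij", x, x).flatten(-2) / (2 ** 0.5)
    return torch.cat([torch.ones_like(x[..., :1]), x, x2], dim=-1)

q, k = torch.randn(16), torch.randn(16)
lhs = taylor_feature_map(q) @ taylor_feature_map(k)
rhs = 1 + q @ k + (q @ k) ** 2 / 2
assert torch.allclose(lhs, rhs, atol=1e-4)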
This is very interesting work! Sadly, right now the code is extremely complicated and bloated. Are there plans for a nanoGPT-style version, or a simple Jupyter notebook that builds the model and trains it step by step? This would be very useful to people who want to try your approach. The main issue right now is that extracting the model from this code is very difficult, which will likely limit the impact of this work.
Hi all, just a heads up: I filed an issue with huggingface/transformers requesting model support for BASED via their library.
My engagement over the past few days has been part of an exploratory analysis of BASED for my employer. I hope it isn't too much of an intrusion, and please feel free to reach out if you have any questions or would like to coordinate.
Hi, thank you for your nice work. I have a question about training. If I want to train your model on the pile-uncopyrighted dataset (just uncopyrighted pile), how should I prepare or pre-process the dataset?
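To make the question concrete, here is a minimal sketch of the kind of preparation I have in mind (assumptions: the monology/pile-uncopyrighted dataset on the HF Hub with the original Pile's "text"/"meta" columns, and a GPT-2 tokenizer; this is not the repo's actual preprocessing pipeline):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
seq_len = 2048

# Stream the dataset to avoid downloading everything up front.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

def tokenize_and_pack(batch):
    # Tokenize each document, append an EOS separator, then concatenate and
    # slice into fixed-length training sequences; leftover tokens at the end
    # of each batch are dropped for simplicity.
    ids = tokenizer(batch["text"])["input_ids"]
    flat = [tok for doc in ids for tok in doc + [tokenizer.eos_token_id]]
    n_full = (len(flat) // seq_len) * seq_len
    return {"input_ids": [flat[i : i + seq_len] for i in range(0, n_full, seq_len)]}

# Drop the raw columns so the number of output rows is free to change.
packed = ds.map(tokenize_and_pack, batched=True, remove_columns=["text", "meta"])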
Hi, I want to measure how long the prefill phase takes.
I want to test it with the Llama2-7B model and compare Based against FlashAttention.
How can I do that?
I want to test with different sequence lengths and batch sizes.
It seems that I need to modify based_inference.py. However, I am not even able to install it, because there is no "test_build_utils" library.
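For context, a minimal sketch of the kind of measurement I mean (my own illustration, not based_inference.py; assumes access to the gated meta-llama/Llama-2-7b-hf checkpoint):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated; requires HF access approval
    torch_dtype=torch.float16,
).to("cuda")
model.eval()

@torch.no_grad()
def time_prefill_ms(batch_size: int, seq_len: int, n_iters: int = 10) -> float:
    # Prefill = a single forward pass over the full prompt, no decode loop.
    input_ids = torch.randint(
        0, model.config.vocab_size, (batch_size, seq_len), device="cuda"
    )
    model(input_ids)  # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters  # ms per prefill

for seq_len in (512, 1024, 2048, 4096):
    print(seq_len, time_prefill_ms(batch_size=1, seq_len=seq_len))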