Coder Social home page Coder Social logo

royesha / genslm Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ramanathanlab/genslm

0.0 0.0 0.0 23.55 MB

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

License: MIT License

Shell 4.26% Python 94.97% Makefile 0.63% Dockerfile 0.14%

genslm's Introduction

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

genslm_header

Preprint

Available here: https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2

Table of Contents

  1. Installation
  2. Usage
  3. Contributing
  4. License
  5. Citations

Installation

To install genslm on most systems:

pip install git+https://github.com/ramanathanlab/genslm

GenSLMs were trained on the Polaris and Perlmutter supercomputers. For installation on these systems, please see INSTALL.md.

Usage

Our pre-trained models and datasets can be downloaded from this Globus Endpoint.

Use GenSLMs to compute sequence embeddings for downsteam tasks, generate synthetic sequences, or easily extend them to your own application.

Compute embeddings Open in Colab

import torch
import numpy as np
from torch.utils.data import DataLoader
from genslm import GenSLM, SequenceDataset

model = GenSLM("genslm_25M_patric", model_cache_dir="/content/gdrive/MyDrive")
model.eval()

# Input data is a list of gene sequences
sequences = [
    "ATGAAAGTAACCGTTGTTGGAGCAGGTGCAGTTGGTGCAAGTTGCGCAGAATATATTGCA",
    "ATTAAAGATTTCGCATCTGAAGTTGTTTTGTTAGACATTAAAGAAGGTTATGCCGAAGGT",
]

dataset = SequenceDataset(sequences, model.seq_length, model.tokenizer)
dataloader = DataLoader(dataset)

# Compute averaged-embeddings for each input sequence
embeddings = []
with torch.no_grad():
    for batch in dataloader:
        outputs = model(batch["input_ids"], batch["attention_mask"], output_hidden_states=True)
        # outputs.hidden_states shape: (layers, batch_size, sequence_length, hidden_size)
        emb = outputs.hidden_states[0].detach().cpu().numpy()
        # Compute average over sequence length
        emb = np.mean(emb, axis=1)
        embeddings.append(emb)

# Concatenate embeddings into an array of shape (num_sequences, hidden_size)
embeddings = np.concatenate(embeddings)
embeddings.shape
>>> (2, 512)

Generate synthetic sequences Open in Colab

from genslm import GenSLM

model = GenSLM("genslm_25M_patric", model_cache_dir="/content/gdrive/MyDrive")
model.eval()

# Prompt the language model with a start codon
prompt = model.tokenizer.encode("ATG", return_tensors="pt")

tokens = model.model.generate(
    prompt,
    max_length=10,  # Increase this to generate longer sequences
    min_length=10,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,  # Change the number of sequences to generate
    remove_invalid_values=True,
    use_cache=True,
    pad_token_id=model.tokenizer.encode("[PAD]")[0],
    temperature=1.0,
)

sequences = model.tokenizer.batch_decode(tokens, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)

>>> ATG GTT ATT TCA TCT GAT TTA CCA ACT
>>> ATG TTC ATT CTT CCG GCA CTT ATC GAA

Diffusion Model

A novel hierarchical language model with two levels: the top level uses a diffusion model to capture global context and longer-range interactions across the entire genome sequence; the bottom level uses a transformer for codon-level modeling, guided by the top-level diffusion model. This model enables us to prospectively model SARS-CoV-2 evolution by leveraging its generative capabilities.

Please refer to this codebase for diffusion model usage: https://github.com/da03/hierarchical_diffusion_LM

High Performance Computing

We have a CLI tool to make it easier to launch training jobs on various HPC platforms. You can specify which system you would like to submit to by specifiying the -T, --template option. We currently have templates for polaris and perlmutter. By default, submitted jobs will output results to the directory where the submit command was run, you can use the -w option to specifiy a different workdir. Please run python -m genslm.hpc.submit --help for more information. See config.py for documentation on the yaml options, and note that config.yaml paths MUST be absolute.

module load conda/2022-07-19
conda activate genslm
python -m genslm.hpc.submit -T polaris -a gpu_hack -q debug -t 00:10:00 -n 1 -j test-job-0 -v "-c config.yaml" 

Module specific arguments are passed verbatim by the -v flag, args must be inside quotes.

For additional commands, please see COMMANDS.md.

Contributing

Please report bugs, enhancement requests, or questions through the Issue Tracker.

If you are looking to contribute, please see CONTRIBUTING.md.

License

genslm has a MIT license, as seen in the LICENSE.md file.

Citations

If you use our models in your research, please cite this paper:

@article{zvyagin2022genslms,
  title={GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.},
  author={Zvyagin, Max T and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Bohorquez, Cindy Orozco and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and others},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

genslm's People

Contributors

maxzvyagin avatar braceal avatar kphippe avatar hengma1001 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.