nrel / evoprotgrad

Directed evolution of proteins in sequence space with gradients

Home Page: https://nrel.github.io/EvoProtGrad/

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 6.46% Python 93.54%

directed-evolution huggingface mcmc protein-engineering protein-language-model

evoprotgrad's Introduction

EvoProtGrad

A Python package for directed evolution on a protein sequence with gradient-based discrete Markov chain Monte Carlo (MCMC). Users can compose custom models that map sequence to function with pretrained models, including protein language models (PLMs), to guide and constrain the search. The package natively integrates with 🤗 HuggingFace and supports PLMs from transformers.

Our MCMC sampler identifies promising amino acids to mutate via model gradients taken with respect to the input (i.e., sensitivity analysis). Users can compose their own custom target function for MCMC by leveraging the Product of Experts MCMC paradigm. Each model is an "expert" that contributes its own knowledge about the protein's fitness landscape to the overall target function. The sampler is designed to be more efficient and effective than brute force and random search while retaining most of their generality and flexibility.
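
To illustrate how input gradients can point to promising mutations, here is a minimal sketch. It assumes a first-order Taylor estimate of the score change for each single-site substitution, which is one common way to build gradient-informed proposals for discrete MCMC. It is not taken from EvoProtGrad's source: `mutation_proposal`, the 20-letter alphabet size, and the random stand-in gradient are all hypothetical.

```python
import numpy as np

def mutation_proposal(one_hot, grad):
    """Turn input gradients into a distribution over single-site mutations.

    one_hot: (L, V) one-hot encoding of the current sequence
    grad:    (L, V) gradient of the combined log-score w.r.t. one_hot
    """
    # First-order estimate of the score change when position i is switched
    # to residue a: grad[i, a] - grad[i, current residue at i].
    current = (grad * one_hot).sum(axis=1, keepdims=True)
    delta = grad - current
    # Exclude "mutations" that keep the current residue.
    logits = np.where(one_hot == 1, -np.inf, delta)
    # Numerically stable softmax over all (position, residue) pairs.
    logits = logits - logits[np.isfinite(logits)].max()
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: random sequence and a random stand-in for a model gradient.
rng = np.random.default_rng(0)
L, V = 5, 20
seq = rng.integers(0, V, size=L)
one_hot = np.eye(V)[seq]
grad = rng.normal(size=(L, V))

probs = mutation_proposal(one_hot, grad)
flat_index = rng.choice(L * V, p=probs.ravel())
pos, aa = np.unravel_index(flat_index, probs.shape)
```

Here `pos` and `aa` are the sampled position and replacement residue index. A full sampler would still need an accept/reject step to correct for the approximation.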

See our publication and our documentation for more details.

Installation

EvoProtGrad is available on PyPI and can be installed with pip:

pip install evo_prot_grad

For the bleeding edge version, and/or if you wish to run tests or register a new expert model with EvoProtGrad, please clone this repo and install in editable mode as follows:

git clone https://github.com/NREL/EvoProtGrad.git
cd EvoProtGrad
pip install -e .

Run tests

Test the code by running python3 -m unittest.

Basic Usage

See demo.ipynb to get started right away in a Jupyter notebook, or follow the steps below.

Create a ProtBERT expert from a pretrained HuggingFace protein language model (PLM) using evo_prot_grad.get_expert:

import evo_prot_grad

prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'pseudolikelihood_ratio', temperature = 1.0)

The default BERT-style PLM in EvoProtGrad is Rostlab/prot_bert. Normally, we would also need to specify the model and tokenizer, but when using a default PLM expert these are pulled automatically from the HuggingFace Hub. The temperature parameter rescales the expert scores and can be used to trade off the importance of different experts. The pseudolikelihood_ratio strategy computes the ratio of the "pseudo" log-likelihoods of the wild type and mutant sequences. The score is only a pseudo-likelihood because a masked language model does not define an exact sequence likelihood.
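
As a rough illustration of a pseudolikelihood ratio, the sketch below sums the per-position log-probabilities a model assigns to each sequence and takes the difference. It assumes the score is the mutant value minus the wild-type value (the sign convention is a guess, not confirmed by the text above), and the function names and toy probabilities are invented for the example.

```python
import numpy as np

def pseudo_log_likelihood(log_probs, token_ids):
    """Sum the log-probability assigned to the observed residue at each position.

    log_probs: (L, V) per-position log-probabilities from a language model
    token_ids: (L,) integer residue indices of the sequence being scored
    """
    positions = np.arange(len(token_ids))
    return float(log_probs[positions, token_ids].sum())

def pseudolikelihood_ratio(mut_log_probs, mut_ids, wt_log_probs, wt_ids):
    """Log of the likelihood ratio, i.e. a difference of pseudo-log-likelihoods.

    Assumption: mutant minus wild type, so positive means "mutant preferred".
    """
    return (pseudo_log_likelihood(mut_log_probs, mut_ids)
            - pseudo_log_likelihood(wt_log_probs, wt_ids))

# Toy 4-residue sequence over a 3-letter alphabet (made-up probabilities).
wt_ids = np.array([0, 1, 2, 0])
mut_ids = np.array([0, 1, 1, 0])  # differs from wild type at position 2
wt_log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                                [0.1, 0.8, 0.1],
                                [0.3, 0.3, 0.4],
                                [0.5, 0.25, 0.25]]))
mut_log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                                 [0.1, 0.8, 0.1],
                                 [0.2, 0.6, 0.2],
                                 [0.5, 0.25, 0.25]]))

score = pseudolikelihood_ratio(mut_log_probs, mut_ids, wt_log_probs, wt_ids)
```

In this toy case only position 2 contributes, so the score reduces to log(0.6 / 0.4).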

Then, create an instance of DirectedEvolution and run the search, returning a list of the best variant per Markov chain (as measured by the prot_bert expert):

variants, scores = evo_prot_grad.DirectedEvolution(
                   wt_fasta = 'test/gfp.fasta',    # path to wild type fasta file
                   output = 'best',                # which variants to return: 'best', 'last', or 'all'
                   experts = [prot_bert_expert],   # list of experts to compose
                   parallel_chains = 1,            # number of parallel chains to run
                   n_steps = 20,                   # number of MCMC steps per chain
                   max_mutations = 10,             # maximum number of mutations per variant
                   verbose = True                  # print debug info to command line
)()

We provide a few experts in evo_prot_grad/experts that you can use out of the box, such as:

Protein Language Models (PLMs)

bert, BERT-style PLMs, default: Rostlab/prot_bert
causallm, CausalLM-style PLMs, default: lightonai/RITA_s
esm, ESM-style PLMs, default: facebook/esm2_t6_8M_UR50D

Potts models

evcouplings

and a generic expert for supervised downstream regression models

onehot_downstream_regression
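
To make the idea of composing experts concrete, here is a minimal sketch of a product of experts in log space followed by a plain Metropolis acceptance test. It assumes each expert returns a log-score and that `temperature` divides that score; both are illustrative assumptions rather than EvoProtGrad's documented behavior, and the function names are hypothetical.

```python
import math

def product_of_experts_log_score(expert_scores, temperatures):
    """Combine expert log-scores; a product of experts is a sum in log space.

    Assumption: each score is divided by its temperature, so a larger
    temperature gives that expert less influence.
    """
    return sum(s / t for s, t in zip(expert_scores, temperatures))

def accept_probability(current_log_score, proposed_log_score):
    """Metropolis acceptance probability for a symmetric proposal.

    Gradient-informed proposals are generally not symmetric, so a real
    sampler would add a proposal-ratio correction term here.
    """
    diff = proposed_log_score - current_log_score
    return 1.0 if diff >= 0 else math.exp(diff)

# Made-up scores from two experts for the current and a proposed variant.
current = product_of_experts_log_score([-1.0, 0.5], temperatures=[1.0, 2.0])
proposed = product_of_experts_log_score([-0.5, 0.3], temperatures=[1.0, 2.0])
p_accept = accept_probability(current, proposed)
```

Because the proposed variant has the higher combined score here, it would always be accepted under this simplified rule.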

Citation

If you use EvoProtGrad in your research, please cite the following publication:

@article{emami2023plug,
  title={Plug \& play directed evolution of proteins with gradient-based discrete MCMC},
  author={Emami, Patrick and Perreault, Aidan and Law, Jeffrey and Biagioni, David and {St. John}, Peter},
  journal={Machine Learning: Science and Technology},
  volume={4},
  number={2},
  pages={025014},
  year={2023},
  publisher={IOP Publishing}
}

evoprotgrad's Issues

Weight of Sequence

Hi,

Is it possible to get the weight of a protein sequence using EvoProtGrad?

Using Masked Marginal Score for ESM-2 as a Scoring Method

In the paper "Language models enable zero-shot prediction of the effects of mutations on protein function", the ESM folks introduce the "Masked Marginal Scoring" method to compute effects of mutations on function and show that it performs significantly better than the Log Likelihood Ratio (LLR) method. If I am not mistaken, LLR is used for EvoProtGrad currently. Could the masked marginal scoring code from the ESM GitHub (where they use ESM-1v) be adapted to ESM-2 and used in EvoProtGrad as a scoring method? The masked marginal score is defined as

$$ \sum_{i \in M} \log p(x_i = x_i^{mt} | x_{-M}) - \log p(x_i = x_i^{wt} | x_{-M}) $$

in the paper above, in Appendix A at the bottom of page 18, where $x_{-M}$ denotes the sequence with masking at all positions in $M$, the set of positions where mutations occur. That is, they introduce masks at the mutated positions (all at once) and compute the score for a mutation by considering its probability relative to the wildtype amino acid. This might significantly improve the scoring and could be a nice alternative scoring strategy.
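
For reference, the formula above can be sketched in a few lines of NumPy. This assumes the per-position log-probabilities have already been obtained from a model run on the sequence with all positions in $M$ masked; the function name and the toy numbers are made up, and no ESM code is called.

```python
import numpy as np

def masked_marginal_score(masked_log_probs, wt_ids, mt_ids):
    """Sum over mutated positions of log p(mutant) - log p(wild type).

    masked_log_probs: (L, V) log-probabilities from a model evaluated on the
                      sequence with every mutated position masked
    wt_ids, mt_ids:   (L,) residue indices of the wild-type and mutant sequences
    """
    wt_ids = np.asarray(wt_ids)
    mt_ids = np.asarray(mt_ids)
    mutated = np.flatnonzero(wt_ids != mt_ids)  # the set M
    diff = (masked_log_probs[mutated, mt_ids[mutated]]
            - masked_log_probs[mutated, wt_ids[mutated]])
    return float(diff.sum())

# Toy 3-residue sequence over a 2-letter alphabet, mutated at positions 1 and 2.
masked_log_probs = np.log(np.array([[0.5, 0.5],
                                    [0.25, 0.75],
                                    [0.9, 0.1]]))
mm_score = masked_marginal_score(masked_log_probs, wt_ids=[0, 0, 0], mt_ids=[0, 1, 1])
```

In the toy example the two mutated positions contribute log(0.75 / 0.25) and log(0.1 / 0.9).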

is it possible to get the importance score of the protein sequence?

I was just wondering whether it is possible to get the importance score of a protein sequence using the EvoProtGrad model. For instance, the https://huggingface.co/datasets/waylandy/phosformer_curated dataset contains kinase enzymes, and I want to rank them by importance score.

Furthermore, I found in this notebook (https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing) that scores are generated for different variants of a protein sequence. But what is the score of the original protein sequence? If the score of the original sequence can be measured, can it then be compared with the scores of the variants?

Error in demo.ipynb

When running the second cell:

# HuggingFace ESM2 8M model
esm2_expert = evo_prot_grad.get_expert('esm', temperature = 1.0, device = 'cuda')

# Supervised fluorescence regression model
gfp_expert = evo_prot_grad.get_expert(
                        'onehot_downstream_regression',
                        temperature = 1.0,
                        model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',trust_remote_code=True),
                        device = 'cuda')

variants, scores = evo_prot_grad.DirectedEvolution(
                        wt_fasta = 'test/gfp.fasta',
                        output = 'all',
                        experts = [esm2_expert, gfp_expert],
                        parallel_chains = 16,
                        n_steps = 1000,              
                        max_mutations = 15,
                        verbose = False
)()

I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/__init__.py:54, in get_expert(expert_name, temperature, model, tokenizer, device, use_without_wildtype)
     53     expert_mod = importlib.import_module(f"evo_prot_grad.experts.{expert_name}_expert")
---> 54     return expert_mod.build(
     55         temperature = temperature,
     56         model = model,
     57         tokenizer = tokenizer,
     58         device = device,
     59         use_without_wildtype = use_without_wildtype
     60     )
     61 except:

File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/experts/esm_expert.py:65, in build(**kwargs)
     64 """Builds a Esm2Expert."""
---> 65 return EsmExpert(**kwargs)

File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/experts/esm_expert.py:38, in EsmExpert.__init__(self, temperature, model, tokenizer, device, use_without_wildtype)
     37     raise ValueError("EsmExpert requires both `model` and `tokenizer` to be specified.")
---> 38 super().__init__(
     39     temperature,
     40     model,
     41     tokenizer.get_vocab(),
     42     device,
     43     use_without_wildtype)
...
     60     )
     61 except:
---> 62     raise ValueError(f"Expert {expert_name} not found in evo_prot_grad.experts.")

ValueError: Expert esm not found in evo_prot_grad.experts.
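
One possible reading of this traceback is that the bare `except:` around the import and `build` call turns an unrelated `AssertionError` into a misleading "not found" message. The sketch below reproduces that pattern with hypothetical stand-in functions (none of these names come from EvoProtGrad) and shows a narrower handler that lets the underlying error surface.

```python
def build_expert(name):
    # Hypothetical constructor that fails an internal check for one expert.
    if name == "esm":
        raise AssertionError("internal check failed (made-up message)")
    return f"{name}-expert"

def get_expert_masking(name):
    """Mimics the pattern in the traceback: every failure becomes 'not found'."""
    try:
        return build_expert(name)
    except:  # noqa: E722 - deliberately broad to reproduce the behavior
        raise ValueError(f"Expert {name} not found.")

def get_expert_narrow(name):
    """Only translate genuine lookup failures; let other errors propagate."""
    try:
        return build_expert(name)
    except ModuleNotFoundError as err:
        raise ValueError(f"Expert {name} not found.") from err

try:
    get_expert_masking("esm")
except ValueError as err:
    masked_error = err  # the final message says "not found"

try:
    get_expert_narrow("esm")
except AssertionError as err:
    surfaced_error = err  # the underlying failure is reported directly
```

With the narrow handler, the assertion failure is what the user sees, which may make the actual cause easier to diagnose.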
