
nrel / evoprotgrad


Directed evolution of proteins in sequence space with gradients

Home Page: https://nrel.github.io/EvoProtGrad/

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 6.46% Python 93.54%
directed-evolution huggingface mcmc protein-engineering protein-language-model

evoprotgrad's Issues

Error in demo.ipynb

When running the second cell:

# HuggingFace ESM2 8M model
esm2_expert = evo_prot_grad.get_expert('esm', temperature = 1.0, device = 'cuda')

# Supervised fluorescence regression model
gfp_expert = evo_prot_grad.get_expert(
                        'onehot_downstream_regression',
                        temperature = 1.0,
                        model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',trust_remote_code=True),
                        device = 'cuda')

variants, scores = evo_prot_grad.DirectedEvolution(
                        wt_fasta = 'test/gfp.fasta',
                        output = 'all',
                        experts = [esm2_expert, gfp_expert],
                        parallel_chains = 16,
                        n_steps = 1000,              
                        max_mutations = 15,
                        verbose = False
)()

I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/__init__.py:54, in get_expert(expert_name, temperature, model, tokenizer, device, use_without_wildtype)
     53     expert_mod = importlib.import_module(f"evo_prot_grad.experts.{expert_name}_expert")
---> 54     return expert_mod.build(
     55         temperature = temperature,
     56         model = model,
     57         tokenizer = tokenizer,
     58         device = device,
     59         use_without_wildtype = use_without_wildtype
     60     )
     61 except:

File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/experts/esm_expert.py:65, in build(**kwargs)
     64 """Builds a Esm2Expert."""
---> 65 return EsmExpert(**kwargs)

File ~/vscode_projects/proteins/EvoProtGrad/evo_prot_grad/experts/esm_expert.py:38, in EsmExpert.__init__(self, temperature, model, tokenizer, device, use_without_wildtype)
     37     raise ValueError("EsmExpert requires both `model` and `tokenizer` to be specified.")
---> 38 super().__init__(
     39     temperature,
     40     model,
     41     tokenizer.get_vocab(),
     42     device,
     43     use_without_wildtype)
...
     60     )
     61 except:
---> 62     raise ValueError(f"Expert {expert_name} not found in evo_prot_grad.experts.")

ValueError: Expert esm not found in evo_prot_grad.experts.
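
The traceback above opens with an AssertionError but ends with a ValueError, which suggests that the bare `except:` in `get_expert` replaces whatever was raised while the expert was being built. The following is a minimal, self-contained sketch of that pattern, not EvoProtGrad code: `build_expert`, `get_expert_masked`, `get_expert_chained`, and `error_type` are hypothetical stand-ins, and the failing check is an assumption made only to reproduce the behaviour.

```python
def build_expert(name):
    """Hypothetical stand-in for an expert constructor that fails internally."""
    # Assumption: some internal check fails for this expert, as the
    # AssertionError at the top of the traceback hints.
    if name == "esm":
        raise AssertionError("internal check failed while building the expert")
    return f"{name} expert"


def get_expert_masked(name):
    """Mimics the pattern in the traceback: a bare except hides the real cause."""
    try:
        return build_expert(name)
    except:  # deliberately broad, to reproduce the masking
        raise ValueError(f"Expert {name} not found.")


def get_expert_chained(name):
    """Alternative: translate only import failures, let other errors surface."""
    try:
        return build_expert(name)
    except ImportError as err:
        raise ValueError(f"Expert {name} not found.") from err


def error_type(fn, name):
    """Return the class name of the exception raised by fn(name), or None."""
    try:
        fn(name)
    except Exception as err:
        return type(err).__name__
    return None
```

With the broad handler, the caller only sees "not found" even though the module was found and the failure happened later. With the narrower handler, the original AssertionError reaches the caller and points at the actual problem.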

Is it possible to get the importance score of a protein sequence?

I was wondering whether it is possible to get an importance score for a protein sequence using the EvoProtGrad model. For instance, the dataset at https://huggingface.co/datasets/waylandy/phosformer_curated contains kinase enzymes, and I would like to rank those kinases by importance score.

Furthermore, I found in this notebook (https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing) that scores are generated for different variants of a protein sequence. But what is the score of the original protein sequence? If the score of the original sequence can be measured, can it then be compared with the scores of the variants?

Using Masked Marginal Score for ESM-2 as a Scoring Method

In the paper "Language models enable zero-shot prediction of the effects of mutations on protein function", the ESM authors introduce the "masked marginal" scoring method for computing the effects of mutations on function and show that it performs significantly better than the log-likelihood ratio (LLR) method. If I am not mistaken, EvoProtGrad currently uses LLR. Could the masked marginal scoring code in the ESM GitHub repository (where it is used with ESM-1v) be adapted to ESM-2 and used in EvoProtGrad as a scoring method? The masked marginal score is defined as

$$ \sum_{i \in M} \left[ \log p(x_i = x_i^{mt} \mid x_{-M}) - \log p(x_i = x_i^{wt} \mid x_{-M}) \right] $$

in Appendix A of the paper above (bottom of page 18), where $x_{-M}$ denotes the sequence with every position in $M$, the set of mutated positions, masked. That is, masks are introduced at all mutated positions at once, and the score for each mutation is the log-probability of the mutant amino acid relative to that of the wild-type amino acid. This might significantly improve the scoring and could be a nice alternative scoring strategy.
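
To make the definition concrete, here is a minimal sketch of the summation only. It assumes the per-position log-probabilities have already been obtained from a masked language model run on the sequence with all mutated positions masked. The function name `masked_marginal_score`, the `log_probs` layout, and the toy numbers are all hypothetical and do not come from EvoProtGrad or the ESM repository.

```python
import math


def masked_marginal_score(log_probs, wt_seq, mt_seq):
    """Sum over mutated positions of log p(mutant) - log p(wild type).

    log_probs[i] maps each amino acid to its log-probability at position i,
    assumed to be predicted from the sequence with ALL mutated positions masked.
    """
    if len(wt_seq) != len(mt_seq):
        raise ValueError("wild-type and mutant sequences must have equal length")
    # M: the set of positions where the mutant differs from the wild type.
    mutated = [i for i, (w, m) in enumerate(zip(wt_seq, mt_seq)) if w != m]
    return sum(log_probs[i][mt_seq[i]] - log_probs[i][wt_seq[i]] for i in mutated)


# Toy two-letter alphabet with made-up probabilities, for illustration only.
toy_log_probs = [
    {"A": math.log(0.7), "G": math.log(0.3)},
    {"A": math.log(0.2), "G": math.log(0.8)},
    {"A": math.log(0.5), "G": math.log(0.5)},
]

# Single mutation A -> G at position 1: log(0.8) - log(0.2) = log(4).
score = masked_marginal_score(toy_log_probs, "AAA", "AGA")
```

Under this definition the wild type scored against itself has no mutated positions, so its score is zero and positive variant scores indicate substitutions the model prefers over the wild type.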

Weight of Sequence

Hi,

Is it possible to get the weight of a protein sequence using EvoProtGrad?
