
ProteinGym

Overview

ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants curated to enable thorough comparisons of various mutation effect predictors in different regimes. Both the DMS assays and clinical variants are divided into 1) a substitution benchmark which currently consists of the experimental characterisation of ~2.7M missense variants across 217 DMS assays and 2,525 clinical proteins, and 2) an indel benchmark that includes ∼300k mutants across 74 DMS assays and 1,555 clinical proteins.

Each processed file in each benchmark corresponds to a single DMS assay or clinical protein, and contains the following variables (see the short parsing sketch after this list):

  • mutant (str): describes the set of substitutions to apply to the reference sequence to obtain the mutated sequence (e.g., A1P:D2N implies the amino acid 'A' at position 1 should be replaced by 'P', and 'D' at position 2 should be replaced by 'N'). Present in the ProteinGym substitution benchmark only (not indels).
  • mutated_sequence (str): represents the full amino acid sequence for the mutated protein.
  • DMS_score (float): corresponds to the experimental measurement in the DMS assay. Across all assays, the higher the DMS_score value, the higher the fitness of the mutated protein. This column is not present in the clinical files, since clinical variants are classified as benign/pathogenic and do not have continuous scores.
  • DMS_score_bin (int): indicates whether the DMS_score is above the fitness cutoff: 1 is fit (pathogenic for clinical variants), 0 is not fit (benign for clinical variants).
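For illustration, below is a minimal sketch (not code from this repo) of loading a processed substitution file and re-deriving a mutated sequence from its mutant string; the file path is a placeholder.

```python
import pandas as pd

def apply_mutant(reference: str, mutant: str) -> str:
    """Apply a colon-separated substitution set, e.g. 'A1P:D2N' (1-indexed)."""
    seq = list(reference)
    for sub in mutant.split(":"):
        wt, pos, mut = sub[0], int(sub[1:-1]), sub[-1]
        assert seq[pos - 1] == wt, f"expected {wt} at position {pos}, found {seq[pos - 1]}"
        seq[pos - 1] = mut
    return "".join(seq)

# Placeholder path: any processed substitution assay file from the benchmark
df = pd.read_csv("DMS_ProteinGym_substitutions/SOME_ASSAY.csv")
print(df[["mutant", "mutated_sequence", "DMS_score", "DMS_score_bin"]].head())
```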

Additionally, we provide two reference files for each benchmark that give further details on each assay and contain in particular:

  • The UniProt_ID of the corresponding protein, along with taxon and MSA depth category
  • The target sequence (target_seq) used in the assay
  • For the assays, details on how the DMS_score was created from the raw files and how it was binarized

To download the benchmarks, please see DMS benchmark - Substitutions and DMS benchmark - Indels in the "Resources" section below.

Fitness prediction performance

The benchmarks folder provides detailed performance files for all baselines on the DMS and clinical benchmarks.

We report the following metrics (see the illustrative sketch after this list):

  • For DMS benchmarks in the zero-shot setting: Spearman, NDCG, AUC, MCC and Top-K recall
  • For DMS benchmarks in the supervised setting: Spearman and MSE
  • For clinical benchmarks: AUC
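As a rough illustration (not the repo's exact implementation), the zero-shot per-assay metrics could be computed along the following lines; thresholding model scores at the median for MCC is an assumption made here for the sketch.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import matthews_corrcoef, ndcg_score, roc_auc_score

def per_assay_metrics(y_pred, dms_score, dms_score_bin, top_frac=0.1):
    """Zero-shot metrics for one assay, given model scores and DMS labels."""
    y_pred, dms_score = np.asarray(y_pred), np.asarray(dms_score)
    k = max(1, int(len(y_pred) * top_frac))  # e.g. top 10% of mutants
    pred_top = set(np.argsort(-y_pred)[:k])
    true_top = set(np.argsort(-dms_score)[:k])
    relevance = dms_score - dms_score.min()  # NDCG requires non-negative gains
    return {
        "Spearman": spearmanr(y_pred, dms_score).correlation,
        "NDCG": ndcg_score(relevance[None, :], y_pred[None, :]),
        "AUC": roc_auc_score(dms_score_bin, y_pred),
        # MCC needs hard calls; the median threshold is an assumption
        "MCC": matthews_corrcoef(dms_score_bin, (y_pred > np.median(y_pred)).astype(int)),
        "Top-K recall": len(pred_top & true_top) / k,
    }
```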

Metrics are aggregated as follows:

  1. Aggregating by UniProt ID (to avoid biasing results towards proteins for which several DMS assays are available in ProteinGym)
  2. Aggregating by different functional categories, and taking the mean across those categories.

These files are named DMS_substitutions_Spearman_DMS_level.csv, DMS_substitutions_Spearman_Uniprot_level.csv and DMS_substitutions_Spearman_Uniprot_Selection_Type_level.csv, respectively, for these successive aggregation steps.
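For intuition, the two aggregation steps can be sketched from the DMS-level file roughly as follows; the column names (UniProt_ID, Selection_Type, DMS_id) are assumptions about the file layout.

```python
import pandas as pd

dms_level = pd.read_csv("DMS_substitutions_Spearman_DMS_level.csv")
# Assumed metadata columns; everything else is one Spearman column per model
meta = ("DMS_id", "UniProt_ID", "Selection_Type")
model_cols = [c for c in dms_level.columns if c not in meta]

# Step 1: average assays belonging to the same protein (UniProt ID)
uniprot_level = (dms_level.groupby(["UniProt_ID", "Selection_Type"])[model_cols]
                 .mean().reset_index())

# Step 2: average proteins within each functional category, then across categories
category_level = uniprot_level.groupby("Selection_Type")[model_cols].mean()
final_scores = category_level.mean()  # one aggregated Spearman per model
print(final_scores.sort_values(ascending=False).head())
```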

For other deep dives (performance split by taxa, MSA depth, mutational depth and more), see the file benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv (and similarly for DMS_indels, clinical_substitutions, clinical_indels, and their supervised counterparts). These are also the files hosted on the website.

We also include, as on the website, a bootstrapped standard error of these aggregated metrics to reflect the variance in the final numbers with respect to the individual assays.
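A minimal sketch of such a bootstrap over assays, assuming a DMS-level dataframe and an `aggregate` function implementing the aggregation steps above:

```python
import numpy as np
import pandas as pd

def bootstrap_se(dms_level: pd.DataFrame, aggregate, n_boot: int = 1000, seed: int = 0):
    """Standard error of the aggregated metric under resampling of assays."""
    rng = np.random.default_rng(seed)
    estimates = [
        # Resample assays (rows) with replacement, then re-aggregate
        aggregate(dms_level.sample(frac=1.0, replace=True,
                                   random_state=int(rng.integers(1 << 31))))
        for _ in range(n_boot)
    ]
    return pd.DataFrame(estimates).std()  # one standard error per model column
```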

To calculate the DMS substitution benchmark metrics:

  1. Download the model scores from the website
  2. Run ./scripts/scoring_DMS_zero_shot/performance_substitutions.sh

And for indels, follow step #1 and run ./scripts/scoring_DMS_zero_shot/performance_substitutions_indels.sh.

ProteinGym benchmarks - Leaderboard

The full ProteinGym benchmark performance files are also accessible via our dedicated website: https://www.proteingym.org/. The website includes leaderboards for the substitution and indel benchmarks, as well as detailed DMS-level performance files for all baselines. The current version of the substitution benchmark includes the following baselines:

| Model name | Model type | Reference |
|---|---|---|
| Site Independent | Alignment-based model | Hopf, T.A., Ingraham, J., Poelwijk, F.J., Schärfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135. |
| EVmutation | Alignment-based model | Hopf, T.A., Ingraham, J., Poelwijk, F.J., Schärfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135. |
| WaveNet | Alignment-based model | Shin, J., Riesselman, A.J., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., & Marks, D.S. (2021). Protein design and variant prediction using autoregressive generative models. Nature Communications, 12. |
| DeepSequence | Alignment-based model | Riesselman, A.J., Ingraham, J., & Marks, D.S. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822. |
| GEMME | Alignment-based model | Laine, É., Karami, Y., & Carbone, A. (2019). GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects. Molecular Biology and Evolution, 36, 2604-2619. |
| EVE | Alignment-based model | Frazer, J., Notin, P., Dias, M., Gomez, A.N., Min, J.K., Brock, K.P., Gal, Y., & Marks, D.S. (2021). Disease variant prediction with deep generative models of evolutionary data. Nature. |
| UniRep | Protein language model | Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., & Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 1-8. |
| ESM-1b | Protein language model | Original model: Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., & Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118. Extensions: Brandes, N., Goldman, G., Wang, C.H., et al. (2023). Genome-wide prediction of disease variant effects with a deep protein language model. Nature Genetics, 55, 1512-1522. |
| ESM-1v | Protein language model | Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., & Rives, A. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS. |
| VESPA | Protein language model | Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Bernhofer, M., Erckert, K., & Rost, B. (2021). Embeddings from protein language models predict conservation and variant effects. Human Genetics, 141, 1629-1647. |
| RITA | Protein language model | Hesslow, D., Zanichelli, N., Notin, P., Poli, I., & Marks, D.S. (2022). RITA: a Study on Scaling Up Generative Protein Sequence Models. arXiv, abs/2205.05789. |
| ProtGPT2 | Protein language model | Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13. |
| ProGen2 | Protein language model | Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., & Madani, A. (2022). ProGen2: Exploring the Boundaries of Protein Language Models. arXiv, abs/2206.13517. |
| MSA Transformer | Hybrid | Rao, R., Liu, J., Verkuil, R., Meier, J., Canny, J.F., Abbeel, P., Sercu, T., & Rives, A. (2021). MSA Transformer. ICML. |
| Tranception | Hybrid | Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A.N., Marks, D.S., & Gal, Y. (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. ICML. |
| TranceptEVE | Hybrid | Notin, P., Van Niekerk, L., Kollasch, A., Ritter, D., Gal, Y., & Marks, D.S. (2022). TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction. NeurIPS, LMRL workshop. |
| CARP | Protein language model | Yang, K.K., Fusi, N., & Lu, A.X. (2022). Convolutions are competitive with transformers for protein sequence pretraining. |
| MIF | Inverse folding | Yang, K.K., Yeh, H., & Zanichelli, N. (2022). Masked Inverse Folding with Sequence Transfer for Protein Representation Learning. |

Except for the WaveNet model (which only uses alignments to retrieve a set of homologous protein sequences to train on, but then trains on non-aligned sequences), alignment-based methods are unable to score indels given the fixed coordinate system they are trained on. Similarly, the masked-marginals scoring procedure for ESM-1v and MSA Transformer requires the mutated position to exist in the wild-type sequence. All the other model architectures listed above (e.g., Tranception, RITA, ProGen2) are included in the indel benchmark.

For clinical baselines, we used dbNSFP 4.4a as detailed in the manuscript appendix (and in proteingym/clinical_benchmark_notebooks/clinical_subs_processing.ipynb).

Resources

To download and unzip the data, run the following commands for each of the data sources you would like to download, as listed in the table below. For example, you can download & unzip the zero-shot predictions for all baselines for all DMS substitution assays as follows:

curl -o zero_shot_substitutions_scores.zip https://marks.hms.harvard.edu/proteingym/zero_shot_substitutions_scores.zip
unzip zero_shot_substitutions_scores.zip && rm zero_shot_substitutions_scores.zip
| Data | Size (unzipped) | Link |
|---|---|---|
| DMS benchmark - Substitutions | 1.1GB | https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_substitutions.zip |
| DMS benchmark - Indels | 200MB | https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_indels.zip |
| Zero-shot DMS Model scores - Substitutions | 44.1GB | https://marks.hms.harvard.edu/proteingym/zero_shot_substitutions_scores.zip |
| Zero-shot DMS Model scores - Indels | 9.6GB | https://marks.hms.harvard.edu/proteingym/zero_shot_indels_scores.zip |
| Supervised DMS Model performance - Substitutions | 2.7MB | https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip |
| Supervised DMS Model performance - Indels | 0.9MB | https://marks.hms.harvard.edu/proteingym/DMS_supervised_indels_scores.zip |
| Multiple Sequence Alignments (MSAs) for DMS assays | 5.2GB | https://marks.hms.harvard.edu/proteingym/DMS_msa_files.zip |
| Redundancy-based sequence weights for DMS assays | 200MB | https://marks.hms.harvard.edu/proteingym/DMS_msa_weights.zip |
| Predicted 3D structures from inverse-folding models | 84MB | https://marks.hms.harvard.edu/proteingym/ProteinGym_AF2_structures.zip |
| Clinical benchmark - Substitutions | 123MB | https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_substitutions.zip |
| Clinical benchmark - Indels | 2.8MB | https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_indels.zip |
| Clinical MSAs | 17.8GB | https://marks.hms.harvard.edu/proteingym/clinical_msa_files.zip |
| Clinical MSA weights | 250MB | https://marks.hms.harvard.edu/proteingym/clinical_msa_weights.zip |
| Clinical Model scores - Substitutions | 0.9GB | https://marks.hms.harvard.edu/proteingym/zero_shot_clinical_substitutions_scores.zip |
| Clinical Model scores - Indels | 0.7GB | https://marks.hms.harvard.edu/proteingym/zero_shot_clinical_indels_scores.zip |
| CV folds - Substitutions - Singles | 50MB | https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip |
| CV folds - Substitutions - Multiples | 81MB | https://marks.hms.harvard.edu/proteingym/cv_folds_multiples_substitutions.zip |
| CV folds - Indels | 19MB | https://marks.hms.harvard.edu/proteingym/cv_folds_indels.zip |

We also host the raw DMS assays (before preprocessing):

| Data | Size (unzipped) | Link |
|---|---|---|
| DMS benchmark: Substitutions (raw) | 500MB | https://marks.hms.harvard.edu/proteingym/substitutions_raw_DMS.zip |
| DMS benchmark: Indels (raw) | 450MB | https://marks.hms.harvard.edu/proteingym/indels_raw_DMS.zip |
| Clinical benchmark: Substitutions (raw) | 58MB | https://marks.hms.harvard.edu/proteingym/substitutions_raw_clinical.zip |
| Clinical benchmark: Indels (raw) | 12.4MB | https://marks.hms.harvard.edu/proteingym/indels_raw_clinical.zip |

How to contribute?

New assays

If you would like to suggest new assays to be part of ProteinGym, please raise an issue on this repository with a `new_assay` label. The criteria we typically consider for inclusion are as follows:

  1. The corresponding raw dataset needs to be publicly available
  2. The assay needs to be protein-related (i.e., excluding UTR, tRNA, promoter, etc.)
  3. The dataset needs to have a sufficient number of measurements
  4. The assay needs to have a sufficiently high dynamic range
  5. The assay has to be relevant to fitness prediction

New baselines

If you would like new baselines to be included in ProteinGym (i.e., website, performance files, detailed scoring files), please follow these steps:

  1. Submit a PR to our repo with two things:
    • A new subfolder under proteingym/baselines named after your new model. This subfolder should include a Python scoring script similar to this script (a minimal sketch is shown after this list), as well as all code dependencies required for the scoring script to run properly
    • An example bash script (e.g., under scripts/scoring_DMS_zero_shot) with all relevant hyperparameters for scoring, similar to this script
  2. Raise an issue with a 'new model' label, providing instructions on how to download the relevant model checkpoints for scoring, and reporting the performance of your model on the relevant benchmark using our performance scripts (e.g., for zero-shot DMS benchmarks). Please note that our DMS performance scripts correct for various biases (e.g., number of assays per protein family and function groupings), so the resulting aggregated performance is not the same as the arithmetic average across assays.
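For orientation only, a hypothetical skeleton of such a scoring script is sketched below; the model class, checkpoint handling, and argument names are placeholders rather than the repo's actual API.

```python
import argparse
import pandas as pd

class DummyModel:
    """Placeholder: replace with your model's loading and scoring logic."""
    @classmethod
    def load(cls, checkpoint_path: str) -> "DummyModel":
        return cls()
    def score(self, sequence: str) -> float:
        return float(len(sequence))  # dummy score

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--DMS_data_folder", required=True)
    parser.add_argument("--DMS_id", required=True)
    parser.add_argument("--checkpoint", required=True)
    parser.add_argument("--output_scores_folder", required=True)
    args = parser.parse_args()

    df = pd.read_csv(f"{args.DMS_data_folder}/{args.DMS_id}.csv")
    model = DummyModel.load(args.checkpoint)
    # One score per mutated sequence; higher should mean fitter, matching DMS_score
    df["model_score"] = [model.score(seq) for seq in df["mutated_sequence"]]
    df.to_csv(f"{args.output_scores_folder}/{args.DMS_id}.csv", index=False)

if __name__ == "__main__":
    main()
```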

At this point we are only considering new baselines satisfying the following conditions:

  1. The model is able to score all mutants in the relevant benchmark (to ensure all models are compared on exactly the same set of mutants);
  2. The corresponding model is open source (we should be able to reproduce scores if needed).

At this stage, we are only considering requests for which all model scores for all mutants in a given benchmark (substitution or indel) are provided by the requester; but we are planning on regularly scoring new baselines ourselves for methods with wide adoption by the community and/or suggestions with many upvotes.

Notes

12 December 2023: The code for training and evaluating supervised models is currently shared in https://github.com/OATML-Markslab/ProteinNPT. We are in the process of integrating the code into this repo.

Instructions

If you would like to compute all performance metrics for the various benchmarks, please follow these steps:

  1. Download locally all relevant files as per the instructions above (see Resources)
  2. Update the paths for all files downloaded in the prior step in the config script
  3. If adding a new model, adjust the config.json file accordingly and add the model scores to the relevant path (e.g., DMS_output_score_folder_subs)
  4. If focusing on DMS benchmarks, run the merge script (sketched conceptually below). This will create a single file for each DMS assay, with scores for all model baselines
  5. Run the relevant performance script (e.g., scripts/scoring_DMS_zero_shot/performance_substitutions.sh)
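Conceptually, the merge step does something along these lines (the file layout, paths, and per-model score column name are assumptions, not the repo's exact implementation):

```python
import glob
import os
import pandas as pd

# Placeholder paths: one processed assay file, plus one score file per baseline
assay_file = "DMS_ProteinGym_substitutions/SOME_ASSAY.csv"
merged = pd.read_csv(assay_file)

for score_file in glob.glob("model_scores/*/SOME_ASSAY.csv"):
    model_name = os.path.basename(os.path.dirname(score_file))  # folder = model name
    scores = pd.read_csv(score_file)[["mutated_sequence", "model_score"]]
    scores = scores.rename(columns={"model_score": model_name})
    # Join each baseline's scores onto the assay by mutated sequence
    merged = merged.merge(scores, on="mutated_sequence", how="left")

merged.to_csv("merged_scores/SOME_ASSAY.csv", index=False)
```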

Acknowledgements

Our codebase leveraged code from the following repositories to compute baselines:

| Model | Repo |
|---|---|
| UniRep | https://github.com/churchlab/UniRep |
| UniRep | https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data |
| EVE | https://github.com/OATML-Markslab/EVE |
| GEMME | https://hub.docker.com/r/elodielaine/gemme |
| ESM | https://github.com/facebookresearch/esm |
| EVmutation | https://github.com/debbiemarkslab/EVcouplings |
| ProGen2 | https://github.com/salesforce/progen |
| HMMER | https://github.com/EddyRivasLab/hmmer |
| MSA Transformer | https://github.com/rmrao/msa-transformer |
| ProtGPT2 | https://huggingface.co/nferruz/ProtGPT2 |
| ProteinMPNN | https://github.com/dauparas/ProteinMPNN |
| RITA | https://github.com/lightonai/RITA |
| Tranception | https://github.com/OATML-Markslab/Tranception |
| VESPA | https://github.com/Rostlab/VESPA |
| CARP | https://github.com/microsoft/protein-sequence-models |
| MIF | https://github.com/microsoft/protein-sequence-models |
| ProtSSN | https://github.com/tyang816/ProtSSN |

We would like to thank the GEMME team for providing model scores on an earlier version of the benchmark (ProteinGym v0.1), and the ProtSSN team for integrating their model in the ProteinGym repo.

Special thanks to the teams of experimentalists who developed and performed the assays that ProteinGym is built on. If you are using ProteinGym in your work, please consider citing the corresponding papers. To facilitate this, we have prepared a file (assays.bib) containing the BibTeX entries for all these papers.

License

This project is available under the MIT license found in the LICENSE file in this GitHub repository.

Reference

If you use ProteinGym in your work, please cite the following paper:

@inproceedings{NEURIPS2023_cac723e5,
 author = {Notin, Pascal and Kollasch, Aaron and Ritter, Daniel and van Niekerk, Lood and Paul, Steffanie and Spinner, Han and Rollins, Nathan and Shaw, Ada and Orenbuch, Rose and Weitzman, Ruben and Frazer, Jonathan and Dias, Mafalda and Franceschi, Dinko and Gal, Yarin and Marks, Debora},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {64331--64379},
 publisher = {Curran Associates, Inc.},
 title = {ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}

proteingym's People

Contributors

danieldritter, loodvn, pascalnotin, tyang816


proteingym's Issues

wild-type fitness values

Can I get the fitness values for the wild-type sequences? (the 'target_seq' in the reference file)
Thank you in advance!

Predictions for Each Mutant in the Random Cross-Validation Scheme

Thank you very much for providing this amazing resource!

I would appreciate your help:

  1. Would you please direct me to the ProteinNPT predictions for each mutant in the random CV scheme?
  2. Were uncertainty quantifications calculated across the CV schemes? If so, would you please also provide those predictions per mutant?

By the way, the link provided in the README for downloading all the baseline scores on the DMS substitutions is dead, though I'm not sure if this zip would contain the data I'm looking for.

Thank you in advance,
Benji

How to use TranceptEVE, the best model?

Hi, thank you for the great work!

TranceptEVE L is the best model according to the benchmark website, but I cannot figure out how to use it on my own data. Is it available anywhere? There are quite a lot of details in the TranceptEVE paper, which makes it non-trivial to reimplement. Thank you!

Possible missing data of benchmarking supervised performance

Hi, thank you for such fundamental work for the computational bio community!

I am interested in evaluating model performance in the supervised setting.
I downloaded DMS_supervised_substitutions_scores.csv and ran the script provided in scripts/scoring_DMS_supervised/performance_substitutions.sh, setting --input_scoring_file to DMS_supervised_substitutions_scores.csv and --DMS_reference_file_path to DMS_substitutions.csv. However, an error occurred at

old_ids = ref_df["Old_DMS_ID"].unique()

Am I passing the wrong reference file, or does this script need to be updated for the current reference file?

Thanks!

Clarification on Scoring + MSA Transformer Request

Hi ProteinGym team,

Thank you for providing both a supervised and an unsupervised benchmark to the community. This resource makes it 100x easier to benchmark and compare models. The community was in dire need of such a tool.

However, I have a few questions:

  1. Can you please describe how the scores in this file were calculated for the single-mutant supervised splits:
    https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv

Specifically the columns labeled: Spearman_fitness, Std_dev_Spearman_fitness, num_obs_Spearman_fitness, standardized_Spearman_fitness

My guess is that Spearman_fitness is the mean across the 5 (in two cases 4) splits, computed on the test set only for a DMS_id, and not otherwise manipulated. Std_dev_Spearman_fitness is the standard deviation across those 5 splits. num_obs_Spearman_fitness is the number of observed datapoints for that Spearman correlation, which I assume should always equal the number of datapoints in the test fold, correct? I'm confused because it looks like you are using Spearman during training to check your validation set, so I wanted to make sure it wasn't that value either. standardized_Spearman_fitness is when you standardize (values between 0 and 1) the training set for each fold before training, which affects the final Spearman correlation, correct? And the Spearman values reported on proteingym.org and in the ProteinNPT paper are the non-manipulated values, not the standardized ones.

Can you also please comment on whether you calculate Spearman differently for ProteinNPT? I'm unclear why there would ever be NaNs in the experimental values.

  2. Would it be possible to download the 1TB of MSA Transformer embeddings for the singles only? Or perhaps you saved the mean-pooled embeddings, since they would have a much smaller dimension and probably only take up a few GBs? I know that may be a big ask, but I have the space to download them and it would save a lot of time and effort :)

Take care,
Bryce

Matching clinical, MSA and DMS protein and variant IDs

Hi,

I am trying to identify proteins across the DMS, clinical and MSA datasets (only substitutions for now).
The purpose is to train models on DMS assays or MSA data and to evaluate them with clinical scores.

  1. When looking at the reference file "clinical_substitutions.csv", the proteins are identified with something like NP_689699.3, while in "DMS_substitutions.csv" the proteins are identified by their "uniprot_id", e.g. A0A140D2T1_ZIKV. How do I translate one name into the other?

Similarly: I'd like to identify variants across MSA, DMS and clinical datasets.

  2. Is there currently a way to ensure that variants in the clinical dataset are not present in the DMS and MSA datasets? This is to ensure that there is no circularity between the training and testing sets, and to get an idea of the quality of the scores reported in the benchmark.

I hope I haven't missed anything obvious.

Thanks for the great work

Antoine

Target sequence clarification

Hi there,

Thanks so much for building this resource!

Regarding the target_seq column in DMS_substitutions.csv, could you clarify whether this is the actual full protein sequence that was experimentally tested, or whether (at least in some cases) it corresponds to a specific subregion of a protein that was experimented on (while the full protein was used in the assay)?

For instance, the sequence for TCRG1_MOUSE is GATAVSEWTEYKTADGKTYYYNNRTLESTWEKPQELK, but this sequence is a substring of the isoforms in UniProt: https://www.uniprot.org/uniprotkb/Q8CGF7/entry#sequences. However, I see that there is a PDB file for this shorter sequence specifically, which implies that this is the full tested protein (edit: in the ProteinGym database).

Many of the target sequences also do not start with methionine, which perhaps is due to post-translational modifications, but because of this I was less sure whether they correspond to independent proteins or not.

Could you clarify this point?

Thanks!

Gavin

Differences in set of assays marked ProteinGym v0.1 vs Tranception paper

I noticed some minor differences between the set of assays marked ProteinGym version 0.1 in the reference files vs the set of assays listed in the Tranception paper:

  • For the indels benchmark, the rubisco dataset from "Highly active rubiscos discovered by systematic interrogation of natural sequence diversity" by Davidi et al. seems to have been removed (not listed as being part of either version 0.1 or version 1). Is there any particular reason for its exclusion from the latest version of ProteinGym?
  • For the substitutions benchmark, there are 78 unique "jo" values marked version 0.1, but only 73 unique "jo" values in the reference file from the Tranception publication. As a specific example, "A0A247D711_LISMN_Stadelmann_2021" is currently marked version 0.1, but I don't believe it was mentioned in the Tranception publication. Not a big deal, but I just thought it would be good to let you know!

P.S. Thank you for all your hard work in putting these data together and making it available for the protein engineering community!

Recorded performance for TranceptEVE M/L on B1LPA6_ECOSM_Russ_2020_indels may be incorrect

I noticed that the reported Spearman correlations of TranceptEVE_M and TranceptEVE_L on B1LPA6_ECOSM_Russ_2020_indels are quite high (>0.8), both relative to other models and to the performance recorded in old model score files (~0.42-0.43).

Digging a bit deeper, I found that the model score files for these models are in a different format than other similar files and have many more rows than the number of datapoints:

> wc -l zero_shot_indels_scores/TranceptEVE/TranceptEVE_S/B1LPA6_ECOSM_Russ_2020_indels.csv
3075 zero_shot_indels_scores/TranceptEVE/TranceptEVE_S/B1LPA6_ECOSM_Russ_2020_indels.csv
> wc -l zero_shot_indels_scores/TranceptEVE/TranceptEVE_M/B1LPA6_ECOSM_Russ_2020_indels.csv
126235 zero_shot_indels_scores/TranceptEVE/TranceptEVE_M/B1LPA6_ECOSM_Russ_2020_indels.csv
> wc -l zero_shot_indels_scores/TranceptEVE/TranceptEVE_L/B1LPA6_ECOSM_Russ_2020_indels.csv
126235 zero_shot_indels_scores/TranceptEVE/TranceptEVE_L/B1LPA6_ECOSM_Russ_2020_indels.csv

> head -n 2 zero_shot_indels_scores/TranceptEVE/TranceptEVE_S/B1LPA6_ECOSM_Russ_2020_indels.csv
mutated_sequence,avg_score_L_to_R,avg_score_R_to_L,avg_score
AAAAERIGEIRRRIDEIDRTLIALWQERAALSQEVGATRMASGGTRLVLSREREILERFRSELGDGTQLALLLLRAGRGPLLNKINPHSARIAFLGPKGSYSHLAARQYAARHFEQFIESGCAKFADIFNQVETGQADYAVVPIENTSSGAINDVYDLLQHTSLSIVGEMTLTIDHCLLVSGTTDLSTINTVYSHPQPFQQCSKFLNRYPHWKIEYTESTSAAMEKVAQAKSPHVAALGSEAGGTLYGLQVLERIEANQRQNFTRFVVLARKAINVSDQVPAKTTLLMATGQQAGALVEALLVLRNHSLIMTRLESRPIHGNPWEEMFYLDIQANLESAEMQKALKELGEITRSMKVLGCYPSENVVPVDPT,-0.8981293658040228,-0.46491348272552413,-0.6815214242647735
> head -n 2 zero_shot_indels_scores/TranceptEVE/TranceptEVE_M/B1LPA6_ECOSM_Russ_2020_indels.csv
mutated_sequence,avg_score,DMS_score
AAAAERIGEIRRRIDEIDRTLIALWQERAALSQEVGATRMASGGTRLVLSREREILERFRSELGDGTQLALLLLRAGRGPLLNKINPHSARIAFLGPKGSYSHLAARQYAARHFEQFIESGCAKFADIFNQVETGQADYAVVPIENTSSGAINDVYDLLQHTSLSIVGEMTLTIDHCLLVSGTTDLSTINTVYSHPQPFQQCSKFLNRYPHWKIEYTESTSAAMEKVAQAKSPHVAALGSEAGGTLYGLQVLERIEANQRQNFTRFVVLARKAINVSDQVPAKTTLLMATGQQAGALVEALLVLRNHSLIMTRLESRPIHGNPWEEMFYLDIQANLESAEMQKALKELGEITRSMKVLGCYPSENVVPVDPT,-1.792856163113236,0.02
> head -n 2 zero_shot_indels_scores/TranceptEVE/TranceptEVE_L/B1LPA6_ECOSM_Russ_2020_indels.csv
mutated_sequence,avg_score,DMS_score
AAAAERIGEIRRRIDEIDRTLIALWQERAALSQEVGATRMASGGTRLVLSREREILERFRSELGDGTQLALLLLRAGRGPLLNKINPHSARIAFLGPKGSYSHLAARQYAARHFEQFIESGCAKFADIFNQVETGQADYAVVPIENTSSGAINDVYDLLQHTSLSIVGEMTLTIDHCLLVSGTTDLSTINTVYSHPQPFQQCSKFLNRYPHWKIEYTESTSAAMEKVAQAKSPHVAALGSEAGGTLYGLQVLERIEANQRQNFTRFVVLARKAINVSDQVPAKTTLLMATGQQAGALVEALLVLRNHSLIMTRLESRPIHGNPWEEMFYLDIQANLESAEMQKALKELGEITRSMKVLGCYPSENVVPVDPT,-1.4938242659797245,0.02

Fitness Directionality in Raw DMS Files

First off, thanks for putting this collection of datasets together! It has been extremely helpful.

I wanted to clarify the expected direction of the relationship between phenotype and fitness in the raw DMS files. For DMS studies that originally had a negative correlation between phenotype and fitness (i.e. a "-1" in the "raw_DMS_directionality" column of the ProteinGym reference files), has the phenotype been adjusted in the raw files to make the relationship positive? I believe this is the case, as higher values in the "DMS_score" column correlate with higher values in the "DMS_score_bin" column, but the section giving raw file download instructions states that the files downloaded are "the raw, unprocessed DMS files...". Some clarification would be greatly appreciated!

How to normalize the DMS_score

I noticed that there are some assays with very large labels, so I am wondering how to normalize the DMS_score.
Thanks.


Some questions about the benchmark

Hi there:

In the ProteinGym substitutions benchmark, I see three data-splitting methods evaluated separately. Is this substitutions benchmark trained on the single-mutation scanning dataset (~690k), or is it trained on the combined single- and multi-point mutation data (~2.7M)? If they are combined and trained together, I don't see the multi-mutation CV data file cv_folds_multiples_substitutions containing the fold_random_5, fold_modulo_5 and fold_contiguous_5 fold columns that the singles CV dataset has.

In addition, I would also like to know whether a separate model is trained for each assay when building the benchmark. If so, how should one choose a suitable model for predicting proteins outside the dataset?

Thanks

Discrepancy between DMS level values and reported averages

Hello,
I have two questions regarding the performance numbers in ProteinGym:

  1. I opened https://github.com/OATML-Markslab/ProteinGym/blob/main/Detailed_performance_files/Substitutions/Spearman/all_models_substitutions_Spearman_DMS_level.csv. Then, I calculated average Spearman values for TranceptEVE_L and EVE_ensemble using both pandas and a spreadsheet program. The averages I obtain do not agree with the values reported in Summary_performance_all_models_substitutions_Spearman.csv. I was wondering if you exclude some DMS experiments from the average when you report the values in the summary file.
  2. I was wondering if all data points reported in the experiments (including multiple-point measurements such as D170G:R187H in the YAP1_HUMAN_Araya_2012.csv experiment) are scored by the computational methods. I know that it may be difficult to answer that question for the other methods reported in ProteinGym, but I am hoping that you have the answer for the methods developed in your lab (all Tranception, TranceptEVE or EVE variants).

Thanks in advance,
Mustafa

Structure mismatch with mutated sequence.

Hi, thank you for your great effort in building such a complete benchmark!

I am confused that, for some datasets with long wild-type sequences, the structures you provide do not have the same length as the mutated sequences, which means some structure-based baselines cannot work correctly (they assume the structure and the sequence have the same length). As an example, the length of the sequence in dataset "A0A140D2T1_ZIKV_Sourisseau_2019" is 3423, but "A0A140D2T1_ZIKV.pdb" only offers 3D coordinates for 504 residues. I wonder how you addressed this problem?

Thank you in advance, and looking forward to your reply!

[new model] Our model achieves a 0.473 Spearman correlation on the DMS substitution benchmark

Hi ProteinGym Team,
Thank you for updating the ProteinGym benchmark dataset. We have evaluated our model (ProtSSN, https://github.com/tyang816/ProtSSN) on the updated ProteinGym DMS substitution benchmark. Our ensemble version obtained an overall Spearman correlation of 0.473. We kindly ask you to double-check this result and, if confirmed accurate, update our performance on the leaderboard.

For your convenience, we have provided the mutant-specific scores by email. Please let us know if more information or documents are required from us. Thank you!

MSA creation protocol

Hi Pascal,

I was wondering what protocol was used to create MSAs. Is it identical to the one used in EVE?

Best regards,
Floris

fold_rand_multiples CV split has overlapping folds in some datasets

For example, the sequence EVPFKVVAQFPYKSDYEDDLNFEKDQEIIVTSVEDAEVYFGEYQDSNGDVIEGIFPKSFVAVQG from BBC1_YEAST_Rocklin_2023_1TG0 occurs in two folds:

>>> df=pd.read_csv("proteingym_cv_folds_multiples_substitutions/BBC1_YEAST_Rocklin_2023_1TG0.csv")
>>> df[df["mutated_sequence"] == "EVPFKVVAQFPYKSDYEDDLNFEKDQEIIVTSVEDAEVYFGEYQDSNGDVIEGIFPKSFVAVQG"].drop(columns="mutated_sequence")
         mutant  DMS_score  DMS_score_bin  mutation_depth  fold_rand_multiples
813   F10F:W38V  -1.400813              0               2                    3
1737       W38V  -1.623593              0               1                    4

In the corresponding dataset, I believe this sequence only shows up once and the corresponding DMS_score is the average DMS_score.

All substitution scores link in README results in 404

Dear ProteinGym team,

Once again thanks for your amazing work putting this resource together!
In the README, you direct users to download this file that should contain all the scores for the different DMS datasets:

https://marks.hms.harvard.edu/proteingym/scores_all_models_proteingym_substitutions.zip

Unfortunately, following this link results in a 404 error. If you have an updated overview file, it would be much appreciated.

Kind regards,
Stephan Heijl

selection type categorization of DMS datasets

For the assay-type-stratified evaluation of mutational effect prediction models, I noticed that the two datasets SPIKE_SARS2_Starr_2020_expression and SPIKE_SARS2_Starr_2020_binding are both categorized as "Binding" in DMS_substitutions_Spearman_DMS_level.csv. Were these DMS datasets both intentionally classified under the "Binding" assay type, instead of "Expression" and "Binding" respectively?

Possible inconsistencies with DMS ID, DOI, and selection type

Thanks for the excellent resource and making all the data so easily accessible. While combing through the csv files, we noticed a few possible inconsistencies I wanted to ask about.

DMS ID

In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. That is also the convention in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv they are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.

DOI

The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the following entries increment the final digit incorrectly, e.g. 10.1038/s41586-023-06328-7, 10.1038/s41586-023-06328-8.

Selection type

In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip the targets column has the value fitness or fitness_unsupervised_prediction for all rows. Some of these assays have other selection types in reference_files/DMS_substitutions.csv.

Bugs in `performance_DMS_benchmarks.py`?

Hi, great work!
I have downloaded zero_shot_substitutions_scores.zip, but unfortunately I could not test the results directly.

  1. I found input_score_name in config.json, but it is never used in performance_DMS_benchmarks.py. I think this parameter is needed for evaluation, but it isn't actually used, which confuses me.
  2. I hit an error at line 297: there is a groupby operation there, but it raises an error on the dataframe. I would like to know how to run this script correctly.

Thanks a lot.

Partial AlphaFold structures are non-full-length

Hi, great work! This provides a very useful benchmark for representation learning of proteins.

However, in my attempts to test this dataset with my models, I realized that some of the provided structures may be incomplete, which of course is confusing only for structure-based models.

E.g., the AlphaFold structure corresponding to A0A140D2T1_ZIKV_Sourisseau_2019 appears to be missing a portion of the structure (residues 1-504 are given, but the DMS data includes N729), which results in some mutations not corresponding to the structure. This appears to be due to mis-numbered residues in the structure: the first amino acid in the structure is actually the 291st amino acid of the sequence. I don't know if other proteins have a similar problem.

If you can provide a complete (or correctly numbered) structural dataset, it may help all users to test their models on the same structural data to promote fairness in benchmarking. Thanks!
