lhatsk / alphalink

AlphaLink: Integrating crosslinking MS data into OpenFold

License: Apache License 2.0

alphalink's Introduction

Figure: AlphaLink prediction (teal) of T1064 with simulated crosslinks (blue)

AlphaLink

AlphaLink predicts protein structures using deep learning, given a sequence and a set of experimental contacts. It extends OpenFold with crosslinking MS data or other experimental distance restraints by explicitly incorporating them into the OpenFold architecture. The experimental distance restraints may be represented in one of two forms:

  1. As contacts/upper bound distance restraints
  2. As distance distributions (distograms) (flag --distograms)

For (1), we trained our network with a 10 Angstrom Ca-Ca distance threshold and show robust rejection of experimental noise and false restraints. The distogram representation (2) allows the user to input longer restraints, for example corresponding to crosslinkers with spacers like BS3 or DSS, or to NMR PRE distance restraints.

Installation

Please refer to the OpenFold GitHub for installation instructions of the required packages. AlphaLink requires the same packages, since it builds on top of OpenFold.

Crosslinking data

Crosslinking MS data can be included either as a PyTorch dictionary with the NumPy arrays 'xl' and 'grouping' (shape LxLx1, where L is the length of the protein) or as a space-separated file with the following format:

residueFrom residueTo FDR

128 163 0.05
147 77 0.05
147 41 0.05

residueFrom and residueTo are the residues crosslinked to each other (sequence numbering starts at 1). FDR is between 0 and 1. CSV format is not supported for distograms.
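
As an illustration of the file format above, the following minimal sketch parses such rows into 0-based residue pairs. The function name parse_crosslinks and its validation rules are assumptions made for this example; it is not AlphaLink's own loader.

```python
def parse_crosslinks(lines, seq_len):
    """Parse 'residueFrom residueTo FDR' rows (hypothetical helper, not AlphaLink code)."""
    restraints = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        res_from, res_to, fdr = int(fields[0]), int(fields[1]), float(fields[2])
        for res in (res_from, res_to):
            # The file uses 1-based sequence numbering.
            if not 1 <= res <= seq_len:
                raise ValueError(f"line {line_no}: residue {res} outside 1..{seq_len}")
        if not 0.0 <= fdr <= 1.0:
            raise ValueError(f"line {line_no}: FDR {fdr} outside [0, 1]")
        # Convert to 0-based indices for array-style access.
        restraints.append((res_from - 1, res_to - 1, fdr))
    return restraints


example = ["128 163 0.05", "147 77 0.05", "147 41 0.05"]
pairs = parse_crosslinks(example, seq_len=200)
```

Running a check like this before prediction can catch off-by-one numbering or an FDR outside the 0 to 1 range early.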

The software may then be run with models based on upper bound distance thresholds or using generalized distograms. Distograms have shape LxLx128 with the following binning: numpy.arange(2.3125,42,0.3125) plus a catch-all bin at the end for distances >= 42 Angstrom, and no group embedding. The probabilities should sum to 1. To use distograms, you have to run predict_with_crosslinks.py with the --distograms flag.
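
To make the bin count concrete, here is a small sketch of the binning. It assumes that the values from numpy.arange are the lower edges of 127 regular bins and that distances below 2.3125 Angstrom fall into the first bin; neither convention is stated in this README, and distance_to_bin is a hypothetical helper.

```python
import numpy as np

BIN_WIDTH = 0.3125
bin_starts = np.arange(2.3125, 42, BIN_WIDTH)  # 127 values: 2.3125 ... 41.6875
NUM_BINS = len(bin_starts) + 1                 # plus the catch-all bin -> 128


def distance_to_bin(distance):
    """Map a distance in Angstrom to a bin index (assumed lower-edge convention)."""
    if distance >= 42.0:
        return NUM_BINS - 1  # catch-all bin
    # Count how many bin starts are <= distance, then convert to a 0-based index.
    index = int(np.searchsorted(bin_starts, distance, side="right")) - 1
    return max(index, 0)  # assumption: very short distances go into the first bin
```

Under these assumptions, 127 regular bins plus the catch-all bin give the 128 channels of the LxLx128 distogram.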

Distograms can also be given as a space-separated file with the following format:

residueFrom residueTo 1..128

128 163 0.05 0.05 0.05 0.05 ...
147 77 0.01 0.015 0.05 0.05 ...
147 41 0.04 0.1 0.05 0.052 ...

residueFrom and residueTo are the residues crosslinked to each other (sequence numbering starts at 1). Columns 3-130 contain the probability for each bin in numpy.arange(2.3125,42,0.3125), i.e. the probability of each bin in a distogram going from 2.3125 to 42 Angstrom. The 128th bin is a catch-all bin for distances >= 42 Angstrom. Each restraint can have a different distribution; any uncertainty has to be encoded in the distribution. There is no additional FDR parameter.
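
A row of this file can be sanity-checked with a sketch like the following. The helper name parse_distogram_row and the tolerance are assumptions for illustration, not part of AlphaLink.

```python
def parse_distogram_row(line, num_bins=128, tol=1e-6):
    """Parse 'residueFrom residueTo p1 ... p128' (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 2 + num_bins:
        raise ValueError(f"expected {2 + num_bins} columns, got {len(fields)}")
    res_from, res_to = int(fields[0]), int(fields[1])
    probs = [float(value) for value in fields[2:]]
    if any(p < 0.0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError(f"probabilities sum to {sum(probs)}, expected 1")
    # Residues are 1-based in the file; return 0-based indices.
    return res_from - 1, res_to - 1, probs


# A uniform distribution over all 128 bins (1/128 per bin).
row = "128 163 " + " ".join(["0.0078125"] * 128)
i, j, probs = parse_distogram_row(row)
```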

Distance distributions for AlphaLink can be automatically generated from restraint lists with the script preprocessing_distributions.py.

     python preprocessing_distributions.py --infile restraints.csv

Where restraints.csv is a comma-separated file containing residueFrom,residueTo,meanDistance,standard deviation, distribution type (normal/log-normal). For example:

12,135,15.0,5.0,normal

This specifies a restraint between residues 12 and 135, imposed as a normal distribution with a mean distance of 15 Angstrom and a standard deviation of 5 Angstrom.
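
To illustrate what binning such a restraint involves, the sketch below discretizes a normal distribution into 128 bins. It is not the implementation in preprocessing_distributions.py: the lower-edge bin convention, the folding of probability mass below 2.3125 Angstrom into the first bin, and the function names are all assumptions, and only the normal case is shown.

```python
import math

import numpy as np


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def normal_to_distogram(mean, sd):
    """Bin a normal distribution into a 128-bin distogram (illustrative sketch)."""
    lower = np.arange(2.3125, 42, 0.3125)  # assumed lower edges of 127 bins
    upper = lower + 0.3125
    probs = np.zeros(128)
    for k in range(127):
        probs[k] = normal_cdf(upper[k], mean, sd) - normal_cdf(lower[k], mean, sd)
    # Assumption: mass below the first edge is added to the first bin.
    probs[0] += normal_cdf(lower[0], mean, sd)
    # Catch-all bin: everything at or beyond 42 Angstrom.
    probs[127] = 1.0 - normal_cdf(42.0, mean, sd)
    return probs / probs.sum()  # guard against rounding error


probs = normal_to_distogram(mean=15.0, sd=5.0)
```

With the example restraint (mean 15, standard deviation 5), most of the mass lands in the bins around 15 Angstrom and the catch-all bin stays close to zero.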

preprocessing_distributions.py will generate a restraint list with distance distributions binned in 128-bin distograms that can be given to AlphaLink when run with the --distograms flag:

     python predict_with_crosslinks.py 7K3N_A.fasta restraints.csv --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_distogram.pt --uniref90_database_path uniref90.fasta --mgnify_database_path mgy_clusters.fa --pdb70_database_path pdb70/pdb70 --uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08 --jackhmmer_binary_path $CONDA_PREFIX/bin/jackhmmer --hhblits_binary_path $CONDA_PREFIX/bin/hhblits --hhsearch_binary_path $CONDA_PREFIX/bin/hhsearch --kalign_binary_path $CONDA_PREFIX/bin/kalign

MSA subsampling

MSAs can be subsampled to a given Neff with --neff.

Usage

AlphaLink expects a FASTA file containing a single sequence, the crosslinking MS residue pairs, and databases for template/MSA search (see also OpenFold Inference).

python predict_with_crosslinks.py 7K3N_A.fasta photoL.csv --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt --uniref90_database_path uniref90.fasta --mgnify_database_path mgy_clusters.fa --pdb70_database_path pdb70/pdb70 --uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08 --jackhmmer_binary_path $CONDA_PREFIX/bin/jackhmmer --hhblits_binary_path $CONDA_PREFIX/bin/hhblits --hhsearch_binary_path $CONDA_PREFIX/bin/hhsearch --kalign_binary_path $CONDA_PREFIX/bin/kalign

MSA generation can be skipped if there are precomputed alignments:

python predict_with_crosslinks.py 7K3N_A.fasta photoL.csv --use_precomputed_alignments msa/ --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt  --uniref90_database_path uniref90.fasta --mgnify_database_path mgy_clusters.fa --pdb70_database_path pdb70/pdb70 --uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08 

Alternatively, precomputed features (pickle) can be supplied with --features.

Network weights

The network weights can be downloaded here:

https://www.dropbox.com/s/8npy4d6q86eqpfn/finetuning_model_5_ptm_CACA_10A.pt.gz?dl=0
https://www.dropbox.com/s/5jmb8pxmt5rr751/finetuning_model_5_ptm_distogram.pt.gz?dl=0

They need to be unpacked (gunzip).
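
Where the gunzip command is not available, the same unpacking can be sketched with the Python standard library. The helper name gunzip_file is an assumption for this example.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst=None):
    """Decompress a .gz file next to the original (hypothetical helper)."""
    src = Path(src)
    # "model.pt.gz" -> "model.pt" when no destination is given.
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream in chunks, no full load into memory
    return dst


# Example (requires the downloaded file to exist):
# gunzip_file("finetuning_model_5_ptm_CACA_10A.pt.gz")
```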

AlphaLink IHM model deposition alphalink-ihm-template

Models generated with AlphaLink using experimental restraints can be published as integrative/hybrid models in PDB-Dev using this script. Requires python-ihm.

The script takes a .csv file with the crosslinking MS restraints, a UniProt accession code, and a system name, and generates a PDB-Dev compliant file for deposition. It takes an mmCIF file as input.

First, generate an mmcif file from the .pdb output of AlphaLink using Maxit.

Then, edit the make_ihm script to include authors, publication, system name, entity source, deposition database and details as you need.

Then you can run with

python make_ihm.py

Reproducibility instructions

We eliminated all non-determinism (MSA masking), since with low Neff targets, different MSA masking can have a big effect.

The models generated for the AlphaLink paper are deposited in ModelArchive and PDB-Dev. The restraints used in the modeling are available as supplementary tables to the AlphaLink paper.

Copyright notice

While AlphaFold's and, by extension, OpenFold's source code is licensed under the permissive Apache Licence, Version 2.0, DeepMind's pretrained parameters fall under the CC BY 4.0 license, a copy of which is downloaded to openfold/resources/params by the installation script. Note that the latter replaces the original, more restrictive CC BY-NC 4.0 license as of January 2022.

Citing this work

Cite the AlphaLink paper: "Protein structure prediction with in-cell photo-crosslinking mass spectrometry and deep learning", Nat. Biotech. XXX doi:10.1038/s41587-023-01704-z.

Any work that cites AlphaLink should also cite AlphaFold and OpenFold.

alphalink's People

Contributors

grandrea, lhatsk, samuelmurail

alphalink's Issues

Request for training

Hello, thank you for great research.

I have already installed AlphaLink in my workspace by following the OpenFold page.
I would now like to train the model as described in your paper.
Could you provide a training script that I can follow?

In the paper, you trained the model by fine-tuning.
I would like to follow your script.
If there are any scripts to follow, please share them with us.

Thank you for reading.

Running "predict_with_crosslinks.py" with "restraints.csv --distograms" gives the error

Dear AlphaLink developers,

I am trying to run "predict_with_crosslinks.py" as follows:

# Running
predict_with_crosslinks.py $FASTA_FILE restraints.csv --distograms $UNIREF90_PATH $MGNIFY_PATH $PDB70_PATH $MMCIF_PATH $UNICLUST30_PATH --features features.pkl --checkpoint_path $ALPHALINK_WEIGHTS

where "restraints.csv" looks like:

55,236,35.0,0.5,normal
236,311,26.0,1.5,normal

$ALPHALINK_WEIGHTS corresponds to <'PATH'>finetuning_model_5_ptm_CACA_10A.pt.
'features.pkl' is an output file after AlphaFold2 run (with my protein).

I receive the following error message:

Traceback (most recent call last):
  File "<'PATH'>/predict_with_crosslinks.py", line 550, in <module>
    main(args)
  File "<'PATH'>/predict_with_crosslinks.py", line 367, in main
    model, output_directory = load_models_from_command_line(args, config)
  File "<'PATH'>/predict_with_crosslinks.py", line 270, in load_models_from_command_line
    model.load_state_dict(sd)
  File "<'PATH'>/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AlphaFold:
        size mismatch for xl_embedder.linear.weight: copying a param with shape torch.Size([128, 1]) from checkpoint, the shape in current model is torch.Size([128, 128]).

Could you please clarify where my mistake is?
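
One plausible reading of the shapes in this error, offered as a sketch rather than a confirmed diagnosis: the checkpoint was saved with a restraint embedder for 1 input channel (the LxLx1 contact representation), while --distograms builds a model that expects 128 channels (LxLx128). The example commands in the README above pair --distograms with finetuning_model_5_ptm_distogram.pt. The helper names below are hypothetical and only restate that pairing.

```python
# Pairing taken from the example commands in the README; treat it as an assumption.
CHECKPOINT_FOR_MODE = {
    False: "finetuning_model_5_ptm_CACA_10A.pt",   # contacts / upper bound restraints
    True: "finetuning_model_5_ptm_distogram.pt",   # run with --distograms
}


def expected_restraint_channels(distograms):
    """Input channels of the restraint embedder: 128 bins or a single contact channel."""
    return 128 if distograms else 1


def checkpoint_matches_mode(xl_embedder_weight_shape, distograms):
    """Compare the checkpoint's xl_embedder.linear.weight shape with the chosen mode."""
    return xl_embedder_weight_shape[1] == expected_restraint_channels(distograms)


# Shapes from the error message: checkpoint (128, 1), model built with --distograms.
print(checkpoint_matches_mode((128, 1), distograms=True))   # False
print(checkpoint_matches_mode((128, 1), distograms=False))  # True
```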

problem with model loading

Hi AlphaLink developers!

I am trying to use AlphaLink. I've downloaded the model via your dropbox link and unpacked it with gunzip. So I start the prediction like this:

 python predict_with_crosslinks.py ./test/test/input.fasta ./test/test/restraints.txt --distograms --checkpoint_path ./alphalink/resources/finetuning_model_5_ptm_CACA_10A.pt --uniref90_database_path /resources/alphafold2/uniref90/uniref90.fasta --mgnify_database_path /resources/alphafold2/mgnify/mgy_clusters.fa --pdb70_database_path /resources/alphafold2/pdb70/ --uniclust30_database_path /resources/alphafold2/uniclust30/uniclust30_2018_08/

The traceback I've got:

  File "/users/user/alphalink/AlphaLink/predict_with_crosslinks.py", line 571, in <module>
    main(args)
  File "/users/user/alphalink/AlphaLink/predict_with_crosslinks.py", line 376, in main
    model, output_directory = load_models_from_command_line(args, config)
  File "/users/user/alphalink/AlphaLink/predict_with_crosslinks.py", line 271, in load_models_from_command_line
    model.load_state_dict(sd)
  File "/software/f2021/software/pytorch/1.10.0-foss-2021a-cuda-11.3.1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AlphaFold:
        size mismatch for xl_embedder.linear.weight: copying a param with shape torch.Size([128, 1]) from checkpoint, the shape in current model is torch.Size([128, 128]).

Do you have any ideas about what went wrong?:)

Best regards,
Julia

Installation unclear

Does this sentence:

AlphaLink requires the same packages, since it builds on top of OpenFold.

mean openfold must also be installed? Or does it mean just follow the example of how openfold is installed. It's not clear.

Run "python preprocessing_distributions.py --infile restraints.csv" but get an error

Hi,

When I tried to get distance distributions from restraint lists by running "python preprocessing_distributions.py --infile restraints.csv", I received an error like below

$ python preprocessing_distributions.py --infile restraints.csv
Traceback (most recent call last):
  File "/lscratch/14291792/preprocessing_distributions.py", line 50, in <module>
    for line in restraints:
TypeError: iteration over a 0-d array

The restraints.csv is a test file and only has one line
12,135,15.0,5.0,normal

Could you let me know how to solve this problem?

Really appreciate!

Xiang
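
This error message is typical of NumPy text loaders, which collapse a one-line file into a 0-d array. The sketch below reproduces that behaviour and shows one common workaround. It assumes the script reads the restraints with a loader such as numpy.genfromtxt, which is not confirmed here.

```python
import io

import numpy as np

# A one-line restraint file, as in the report above.
single_line = io.StringIO("12,135,15.0,5.0,normal")

# With mixed column types and a single row, genfromtxt returns a 0-d structured array.
restraints = np.genfromtxt(single_line, delimiter=",", dtype=None, encoding="utf-8")

# Iterating over `restraints` directly would raise
# "TypeError: iteration over a 0-d array"; atleast_1d restores a row axis.
rows = np.atleast_1d(restraints)
for row in rows:
    res_from, res_to, mean, sd, kind = row.tolist()
```

Under the same assumption, adding a second restraint line to the file avoids the collapse without changing any code.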

Over-weight of crosslinking data

Hi,

How can we address the over-weighting of crosslinking data? I noticed that if there are many crosslinking restraints for one sequence, the final models look over-constrained and some well-folded domains look unstructured.

Thanks.
Yan

Save pkl files

Hi,

is there a way to save the model pkl files?

I added the --save_outputs flag, but they're neither being saved while running the distogram nor the 10A distance mode.

Problem with Crosslinking data input

When I was reproducing the results of CDK in the test_set, you provided input data in the form of crosslink data in both CSV and PT file formats. I noticed that in the PT file, the xl_array contains duplicated entries for residueTo and residueFrom. Can you explain why these entries are duplicated in reverse order?
Additionally, could you clarify the information represented by the grouping_array?
Furthermore, the results I inferred from these inputs do not match the PDB file located at test_set/CDK/predictions/CDK_neff10_1h01_xl_model_5_ptm.pdb, specifically in terms of RMSD and TM-score.

this is my call script:
python predict_with_crosslinks.py test_set/CDK/fasta/CDK.fasta test_set/CDK/crosslinks/1h01_xl.pt --features test_set/CDK/features/CDK_neff10.pkl --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt --uniref90_database_path /xxx/uniref90.fasta --mgnify_database_path /xxx/mgnify/mgy_clusters_2022_05.fa --pdb70_database_path /xxx/pdb70 --uniclust30_database_path /xxx/uniref30/

hands-on protocol for contacts_to_distograms

Hi,

Could you please share a hands-on protocol on how we can generate distogram with contact information? as a beginner, it seems hard for me to use the scripts (contacts_to_distograms.py) to build the distogram.

Thank you so much!
Yan

FDR Description in Arg parse is potentially wrong

In the file: predict_with_crosslinks.py
The description for the following code is possibly wrong. Number of CPUs definitely cannot be floating point. What is fdr and what does it mean?
parser.add_argument( "--fdr", type=float, default=0.05, help="""Number of CPUs with which to run alignment tools"""

Request for an example folder

Hi,

Is it possible to create a folder with sample files (including "7K3N_A.fasta", "restraints.csv", and "photoL.csv") as mentioned in the provided example? This would help us better understand the work and run the tool.

As shown in the examples from https://github.com/lhatsk/AlphaLink#readme:
python predict_with_crosslinks.py 7K3N_A.fasta restraints.csv...
python predict_with_crosslinks.py 7K3N_A.fasta photoL.csv ...

Alphalink install failed

I have been trying to install virtual env using environment.yml

And get the following error:

Collecting deepspeed==0.5.10 (from -r /vast/scratch/users/iskander.j/AlphaLink/condaenv.s_3tgy1b.requirements.txt (line 2))
  Using cached deepspeed-0.5.10.tar.gz (515 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'

Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/dllogger.git /vast/scratch/users/iskander.j/tmp/pip-req-build-b5e98t7e
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/vast/scratch/users/iskander.j/tmp/pip-install-16f1fgl7/deepspeed_58e11f6d38c1437fb3136539611b056b/setup.py", line 27, in <module>
          import torch
        File "/home/users/allstaff/iskander.j/.local/lib/python3.7/site-packages/torch/__init__.py", line 217, in <module>
          _load_global_deps()
        File "/home/users/allstaff/iskander.j/.local/lib/python3.7/site-packages/torch/__init__.py", line 177, in _load_global_deps
          raise err
        File "/home/users/allstaff/iskander.j/.local/lib/python3.7/site-packages/torch/__init__.py", line 172, in _load_global_deps
          ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
        File "/stornext/System/data/apps/rc-tools/rc-tools-1.0/bin/tools/envs/alphalink/lib/python3.7/ctypes/__init__.py", line 364, in __init__
          self._handle = _dlopen(self._name, mode)
      OSError: /home/users/allstaff/iskander.j/.local/lib/python3.7/site-packages/torch/lib/libtorch_global_deps.so: cannot open shared object file: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
                                                                                                                           failed

CondaEnvException: Pip failed

When I installed OpenFold from the OpenFold GitHub, I got errors due to deprecated simtk version.

No MSA output - precomputed alignments called automatically

Hello,

I am able to "run" AlphaLink successfully (i.e., generate pkl and pdb outputs from a fasta and crosslinks file), but when I check the generated 'alignments' folder, I get a subfolder with the name of the input fasta and then nothing else. So, no MSA has been generated in that folder nor anywhere else as far as I can tell. When checking my slurm outputs, I noticed that the --use_precomputed_alignments flag was automatically being called, even though this flag was not in the original script. This flag was pointing to the aforementioned 'alignments' folder that gets created for the outputs...which is empty.

Am I doing something wrong? Here is what one of my scripts looks like; I used the example on the GitHub page:

python $HOME/AlphaLink/predict_with_crosslinks.py \
$FASTAS/BLAH.fasta \
$CROSSLINKS/BLAH.csv \
--checkpoint_path $HOME/AlphaLink/finetuning_model_5_ptm_CACA_10A.pt \
--uniref90_database_path $SOURCE/uniref90/uniref90.fasta \
--mgnify_database_path $SOURCE/mgnify/mgy_clusters_2022_05.fa \
--pdb70_database_path $SOURCE/pdb70/pdb70_hhm.ffdata \
--uniclust30_database_path $SOURCE/uniref30/uniref30.fasta \
--output_dir AlphaLink_Outputs/Batch_Testing/TEST \
--neff 10

As you can see this is when I subsample neff. I can double-check the slurm output when --neff flag is not used, but the result is the same - no MSA data. Here is the slurm output that refers to the precomputed msas flag:

Using precomputed alignments for sp|BLAH|BLAH at AlphaLink_Outputs/Batch_Testing/TEST/alignments...

Andrea recommended that I try adding more flags for jackhmmer, hhblits, etc., but this did not help the issue.

Thank you,

Anthony

Issue with multimer

Hello,

I have been trying to use AlphaLink with distance constraints between different subunits.
From what I understand from the code, it doesn't seem possible to add such constraints.

Am I right?

Cheers,
Samuel

Inter-subunit crosslinking data

Hi, great work !
I am wondering if it is possible to leverage intermolecular crosslinking data as distance restraint in alphalink?
Further, would it be possible to use ambiguous distance restraints, like in NMR structure calculation and haddock, generated from homo-oligomer crosslinking data ? Or translate pair representation / MSA coevolution information into explicit distance restraint ?
