
aqlaboratory / openfold


Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2

License: Apache License 2.0

Python 93.17% Shell 2.84% Jupyter Notebook 3.18% Dockerfile 0.13% C 0.01% C++ 0.17% Cuda 0.49%
alphafold2 protein-structure pytorch

openfold's People

Contributors

alquraishi, atgctg, awaelchli, bozhang-hpc, brianloyal, cclauss, christinaflo, controny, decarboxy, dependabot[bot], dingquanyu, ericmjl, gahdritz, jnwei, jonathanking, josemduarte, kiddozhu, lilleswing, ljarosch, luwei0917, marta-sd, mattwthompson, nikitos9000, nz99, sachinkadyan7, sauravmaheshkar, sdvillal, timodonnell, vaclavhanzl, zrqiao


openfold's Issues

Training duration & NaNs during training

First of all, great work!

I'm wondering what training times I can expect for a single target. I'm currently at 1 min/iteration (one sample per iteration), which seems too slow (V100s with FP16 and DeepSpeed enabled, crop size 256). The official implementation takes around 20 s for a comparable sample on a single GPU (about 16 s on an A100). I haven't tested how much overhead DeepSpeed introduces. Gradient accumulation should help reduce this; a sketch follows below.
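A minimal sketch of turning on gradient accumulation, assuming the training script builds a standard PyTorch Lightning Trainer (the flag below is stock Lightning, not something specific to this repo):

import pytorch_lightning as pl

# Sketch only: accumulate_grad_batches averages gradients over N batches
# before each optimizer step, giving an effective batch size of
# N x (per-GPU batch size) without extra memory per step.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    accumulate_grad_batches=4,
)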

Is it actually possible to train with batch size > 1 on a single GPU? I'm assuming it would work with fixed_size=True. I vaguely remember that they did some dimensionality juggling with the template/recycling dimensions, which might interfere.

Thanks!

About the evaluation in CASP14

Hello everyone. I am doing some evaluation of the inference pipeline. I am wondering how to evaluate the resulting PDB files (e.g., with TM-score) for those targets for which CASP14 does not provide a reference PDB file.
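For reference, a minimal sketch of scoring a predicted structure against a reference with TM-align, assuming the TMalign binary is installed and on PATH (the file paths are placeholders):

import re
import subprocess

def tm_scores(model_pdb: str, reference_pdb: str):
    # Run TM-align on the two structures and parse the TM-scores it prints.
    # TM-align reports two values, one normalized by each chain's length;
    # the one normalized by the reference is the usual choice.
    result = subprocess.run(
        ["TMalign", model_pdb, reference_pdb],
        capture_output=True, text=True, check=True,
    )
    return [float(s) for s in re.findall(r"TM-score=\s*([0-9.]+)", result.stdout)]

# Example with placeholder paths:
# print(tm_scores("prediction.pdb", "reference.pdb"))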

pdb files do not exist in mmcif dir

Hi,
Thanks for your help last time.
However, now I have another error.
After running the training from ProteinNet input:

python /data/openfold/train_openfold.py /data/af_databases/pdb_mmcif/mmcif_files/ /home/ubuntu/ProteinNet_parsed/ProteinNet_lc/ /data/af_databases/pdb_mmcif/mmcif_files/ /home/ubuntu/OF_train_from_Protein_Net/try_1_Dec29_2021/ 2021-10-10 --template_release_dates_cache_path /data/af_databases/pdb_mmcif/mmcif_cache.json --precision 16 --replace_sampler_ddp=True --deepspeed_config /data/deepspeed_config.json --default_root_dir /home/ubuntu/OF_train_from_Protein_Net/try_1_Dec29_2021/ --gpus 1 --seed 44

I got this error:
###############

Epoch 0: 0%| | 0/50939 [00:00<?, ?it/s]Traceback (most recent call last):
File "/data/openfold/train_openfold.py", line 336, in
main(args)
File "/data/openfold/train_openfold.py", line 196, in main
ckpt_path=ckpt_path,
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
self._dispatch()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
self.training_type_plugin.start_training(self)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in run_stage
return self._run_train()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1312, in _run_train
self.fit_loop.run()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 141, in on_run_start
self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_idx + 1)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/utilities.py", line 121, in _update_dataloader_iter
dataloader_iter = enumerate(data_fetcher, batch_idx)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 199, in iter
self.prefetching(self.prefetch_batches)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 258, in prefetching
self._fetch_next_batch()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
batch = next(self.dataloader_iter)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 536, in next
return self.request_next_batch(self.loader_iters)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 548, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 92, in apply_to_collection
return function(data, *args, **kwargs)
File "/data/openfold/openfold/data/data_modules.py", line 350, in _batch_prop_gen
for batch in iterator:
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/openfold/openfold/data/data_modules.py", line 178, in getitem
chain_id=chain_id,
File "/data/openfold/openfold/data/data_pipeline.py", line 577, in process_pdb
with open(pdb_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/af_databases/pdb_mmcif/mmcif_files/4l6v_9.pdb'

Epoch 0: 0%| | 0/50939 [00:00<?, ?it/s]

Thanks,
Oz

trained parameters

Hi, will you be releasing the parameters under a non-academic license as well? Or do we have to train the model from scratch?

Error while installing dependencies

When I run

scripts/install_third_party_dependencies.sh

it fails at the last step

gzip: tests/test_data/sample_feats.pickle.gz: No such file or directory

This is because the file sample_feats.pickle.gz is not downloaded.

More components of the model should be TorchScript-compatible

As it stands, only the attention primitives Attention and GlobalAttention are TorchScript-ed (or, for that matter, TorchScript-able) during inference. For better runtimes and memory allocation, more of the network's modules, especially in the Evoformer, should be made compatible with TorchScript. In my estimation, the biggest hurdle before this goal is the inference-time chunking functionality, which currently makes heavy use of function pointers not supported by TorchScript.
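To illustrate the kind of compilation involved, here is a minimal sketch (a toy module, not the repo's Attention class) of scripting a module with torch.jit.script:

import torch

class ToyAttention(torch.nn.Module):
    # Toy stand-in for an attention primitive, small enough to script cleanly.
    def __init__(self, d: int):
        super().__init__()
        self.scale = d ** -0.5
        self.q = torch.nn.Linear(d, d)
        self.k = torch.nn.Linear(d, d)
        self.v = torch.nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.q(x) @ self.k(x).transpose(-1, -2) * self.scale, dim=-1)
        return a @ self.v(x)

# torch.jit.script compiles the module ahead of time, removing Python overhead
# at inference; modules whose forward passes accept Python callables (as the
# chunking code does) cannot be scripted this way without refactoring.
scripted = torch.jit.script(ToyAttention(32))
print(scripted(torch.randn(4, 16, 32)).shape)  # torch.Size([4, 16, 32])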

Colab not working! Error when importing `data_pipeline`

Great work with reproducing the original code and creating an open-source PyTorch implementation ☕️☕️☕️☕️


When I try to run the attached Colab notebook, in the "Search against genetic databases" subsection, importing data_pipeline from openfold.data raises an ImportError, viz.

ImportError: cannot import name 'MultipleChainsError' from 'openfold.data.templates' (/opt/conda/lib/python3.7/site-packages/openfold/data/templates.py)

The full traceback is attached below:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-8051d602620b> in <module>()
     27 from openfold.data import feature_pipeline
     28 from openfold.data import parsers
---> 29 from openfold.data import data_pipeline
     30 from openfold.data.tools import jackhmmer
     31 from openfold.model import model

/opt/conda/lib/python3.7/site-packages/openfold/data/data_pipeline.py in <module>()
     20 import numpy as np
     21 
---> 22 from openfold.data import templates, parsers, mmcif_parsing
     23 from openfold.data.tools import jackhmmer, hhblits, hhsearch
     24 from openfold.data.tools.utils import to_date

/opt/conda/lib/python3.7/site-packages/openfold/data/templates.py in <module>()
     26 import numpy as np
     27 
---> 28 from openfold.data import parsers, mmcif_parsing
     29 from openfold.data.tools import kalign
     30 from openfold.data.tools.utils import to_date

/opt/conda/lib/python3.7/site-packages/openfold/data/mmcif_parsing.py in <module>()
     27 import numpy as np
     28 
---> 29 from openfold.data.templates import MultipleChainsError
     30 import openfold.np.residue_constants as residue_constants
     31 

ImportError: cannot import name 'MultipleChainsError' from 'openfold.data.templates' (/opt/conda/lib/python3.7/site-packages/openfold/data/templates.py)

Interestingly enough, if I add from openfold.data.templates import MultipleChainsError, I run into a circular ImportError; the error trace is attached below.

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-71256580fa0c> in <module>()
     27 from openfold.data import feature_pipeline
     28 from openfold.data import parsers
---> 29 from openfold.data.templates import MultipleChainsError
     30 from openfold.data import data_pipeline
     31 from openfold.data.tools import jackhmmer

/opt/conda/lib/python3.7/site-packages/openfold/data/templates.py in <module>()
     26 import numpy as np
     27 
---> 28 from openfold.data import parsers, mmcif_parsing
     29 from openfold.data.tools import kalign
     30 from openfold.data.tools.utils import to_date

/opt/conda/lib/python3.7/site-packages/openfold/data/mmcif_parsing.py in <module>()
     27 import numpy as np
     28 
---> 29 from openfold.data.templates import MultipleChainsError
     30 import openfold.np.residue_constants as residue_constants
     31 

ImportError: cannot import name 'MultipleChainsError' from 'openfold.data.templates' (/opt/conda/lib/python3.7/site-packages/openfold/data/templates.py)

Issue in prep_mmseqs_dbs.sh

Hi, I am running the script prep_mmseqs_dbs.sh. I have already corrected the script, changing tar2exprofiledb to tsv2exprofiledb.

However, the script extracts the files and then returns the following error:

uniclust30_2018_08/uniclust30_2018_08_a3m.ffdata
uniclust30_2018_08/uniclust30_2018_08_a3m.ffindex
uniclust30_2018_08/uniclust30_2018_08_hhm.ffdata
uniclust30_2018_08/uniclust30_2018_08_hhm.ffindex
uniclust30_2018_08/uniclust30_2018_08_cs219.ffdata
uniclust30_2018_08/uniclust30_2018_08_cs219.ffindex
uniclust30_2018_08/uniclust30_2018_08.cs219
uniclust30_2018_08/uniclust30_2018_08.cs219.sizes
uniclust30_2018_08/uniclust30_2018_08_a3m_db
uniclust30_2018_08/uniclust30_2018_08_a3m_db.index
uniclust30_2018_08/uniclust30_2018_08_hhm_db
uniclust30_2018_08/uniclust30_2018_08_hhm_db.index
uniclust30_2018_08/uniclust30_2018_08_md5sum
../../scripts/prep_mmseqs_dbs.sh: line 33: mmseqs: command not found

I do have mmseqs installed.

Could anyone help me?

No checkpoints saved after validation epoch ends

Checkpoints are not saved after the validation epoch ends, even though checkpoint_best_val is active.

The validation loss is also not shown during validation; maybe this is connected, since the checkpoint callback is supposed to track val_loss (a sketch of such a callback follows below).
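For comparison, a minimal sketch of a checkpoint callback wired to a validation metric, assuming the LightningModule logs a metric literally named "val_loss" (the actual key used by the repo may differ):

from pytorch_lightning.callbacks import ModelCheckpoint

# Sketch: keep the top-3 checkpoints ranked by the logged validation loss.
# If no metric with this name is logged in validation_step, the callback has
# nothing to monitor, which would match the behavior described above.
checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=3,
    filename="openfold-{epoch:02d}-{val_loss:.3f}",
)
# trainer = pl.Trainer(callbacks=[checkpoint_cb], ...)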

Severe memory fragmentation

In some cases, especially for larger crop sizes, intermediate tensors during training grow so large that PyTorch OOMs despite having allocated as little as 60% of available GPU memory. It would be good to carefully profile the network to identify the worst-offending modules and come up with clean ways to prevent such degenerate tensor allocation.

OOM with bfloat16, no speed-up

New issue based on: #34

Turning on bfloat16 in DeepSpeed doesn't seem to have the desired effect. The model parameter size remains unchanged, and I'm hitting OOM in validation, which works fine in FP16.

Training with bfloat16 in pytorch-lightning fails:

File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy
log_p = torch.nn.functional.logsigmoid(logits)
RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'

Is support still missing in DeepSpeed? microsoft/DeepSpeed#974

Tested on A100 with torch 1.10.1+cu113

Sampling recycling iterations in validation

I was a bit surprised that the number of recycling iterations is sampled during validation. This makes different validation epochs less comparable and the progress curve less smooth. I think eval should mimic predict in this respect.

max_iters = self.config.common.max_recycling_iters
if(stage_cfg.supervised):
    clamp_prob = self.config.supervised.clamp_prob
    keyed_probs.append(
        ("use_clamped_fape", [1 - clamp_prob, clamp_prob])
    )

if(self.stage == "train" and self.config.supervised.uniform_recycling):
    recycling_probs = [
        1. / (max_iters + 1) for _ in range(max_iters + 1)
    ]
    keyed_probs.append(
        ("no_recycling_iters", recycling_probs)
    )
else:
    recycling_probs = [
        0. for _ in range(max_iters + 1)
    ]
    recycling_probs[-1] = 1.
    keyed_probs.append(
        ("no_recycling_iters", recycling_probs)
    )

time consumed by precompute_alignments

I ran precompute_alignments.py to precompute alignments for 184,700 proteins before training the model, because I want to use the same data as AlphaFold. However, it took ~4 h to finish the alignment for a single protein (1yxq), so I would like to know whether I am running it correctly. Also, are there any precomputed alignments available for download that would save me this alignment time?

script error

When I use script_preset_(model_module), the code errors with:

RuntimeError:
'Tensor' object has no attribute or method 'new_ones'.:
File "openfold/openfold/model/msa.py", line 118
    if mask is None:
        # [*, N_seq, N_res]
        mask = m.new_ones(
               ~~~~~~~~~~ <--- HERE
            m.shape[:-3] + (n_seq, n_res),
        )
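A TorchScript-friendly sketch of the same allocation (an untested suggestion, not a patch taken from the repo) would avoid Tensor.new_ones and spell out the dtype and device explicitly:

import torch

def make_mask(m: torch.Tensor, n_seq: int, n_res: int) -> torch.Tensor:
    # torch.ones with an explicit size list, dtype and device is scriptable,
    # whereas the Tensor.new_ones call above is rejected by TorchScript.
    size = list(m.shape[:-3]) + [n_seq, n_res]
    return torch.ones(size, dtype=m.dtype, device=m.device)

m = torch.zeros(1, 4, 8, 16)
print(make_mask(m, 4, 16).shape)  # torch.Size([1, 4, 16])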

Some questions about running inference

When I use OpenFold to run inference on proteins, some can be inferred fine, but others report errors. The likely reason: one of the template hits returned by the search is obsolete, and the CIF file for the obsolete entry is not present in the template_mmcif_dir directory.

Below is an example of a failing protein:
5IZB_A
Traceback (most recent call last):
File "run_pretrained_openfold.py", line 253, in
main(args)
File "run_pretrained_openfold.py", line 118, in main
fasta_path=fasta_path, alignment_dir=local_alignment_dir
File "/home/jsr/openfold/openfold/data/data_pipeline.py", line 420, in process_fasta
self.template_featurizer,
File "/home/jsr/openfold/openfold/data/data_pipeline.py", line 55, in make_template_features
hits=hits_cat,
File "/home/jsr/openfold/openfold/data/templates.py", line 1059, in get_templates
kalign_binary_path=self._kalign_binary_path,
File "/home/jsr/openfold/openfold/data/templates.py", line 827, in _process_single_hit
with open(cif_path, "r") as cif_file:
FileNotFoundError: [Errno 2] No such file or directory: '/public/database/alphafold2_database/mmcif/mmcif_files/4zai.cif'

What should template_mmcif_dir be?

For training with the ColabFold pipeline (and templates with HHsearch), there is a template_mmcif_dir path.
Should it be something like data/pdb_mmcif/mmcif_files/, or one of the other precomputed folders?

Invalid Command: tar2exprofiledb

In openfold/scripts/prep_mmseqs_dbs.sh, I guess it should be mmseqs tsv2exprofiledb, not mmseqs tar2exprofiledb.

Also a bug at line 26: tar --extract --verbose --file="${DOWNLOAD_DIR}/${f}" \
I think it should be tar --extract --verbose --file="${f}" \

Dockerfile

Hey epic work!
Could you post a Dockerfile for training/inference?
Thanks!

About the memory

Well done! I am wondering: with 4 TITAN 2080 cards with 12 GB each, can I train this model, or will I hit out-of-memory errors?

New entries in obsolete.dat will throw errors.

Traceback (most recent call last):
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 224, in <module>
    main(args, template_pipeline_runner)
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 116, in main
    feature_dict = template_pipeline_runner.run(a3m_dir, fasta_file_path)
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 80, in run
    alignment_dir=a3m_dir,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/data_pipeline.py", line 360, in process_fasta
    hits=hits_cat,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/templates.py", line 1058, in get_templates
    kalign_binary_path=self._kalign_binary_path,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/templates.py", line 828, in _process_single_hit
    with open(cif_path, "r") as cif_file:
FileNotFoundError: [Errno 2] No such file or directory: '/databases/pdb_mmcif/mmcif_files/6ek0.cif'

ISSUE: New entries added to obsolete.dat will cause failures because the corresponding replacement structures will not be found in the pre-downloaded pdb_mmcif files.

Acquiring MSAs

Thanks so much for an excellent repo!

I'm trying to weigh all of the options for acquiring MSAs in order to train the model. I could either 1) use trRosetta's MSAs, 2) use ProteinNet's MSAs, or 3) make MSAs myself using MMseqs2. Do you know how these options compare and how long option 3) would take?

Thanks!

"Module 'Attention' has no attribute 'linear_g' : "

Hi,
When I try to train the model by running this command:

python /data/openfold/train_openfold.py /home/ubuntu/train_mmcif_Dec29_2021/ //home/ubuntu/ProteinNet_parsed/ProteinNet_MSA/ /data/af_databases/pdb_mmcif/mmcif_files/ /home/ubuntu/OF_train_from_ProteinNet_try1_Dec29_20210/ 2021-10-10 --template_release_dates_cache_path /data/af_databases/pdb_mmcif/mmcif_cache.json --precision 16 --replace_sampler_ddp=True --deepspeed_config_path /data/deepspeed_config.json --resume_from_ckpt ckpt_dir/ --gpus 1 --precision 16 --seed 44

I get this error:

"Module 'Attention' has no attribute 'linear_g' : "

I'm running it from the conda env (openfold_venv)

Thanks
Oz

issue in prep_mmseqs_dbs.sh

I noticed a small bug in prep_mmseqs_dbs.sh: the script fails because the mmseqs_dbs directory does not exist. I made a branch to open a pull request but got a permission-denied error. I also updated the README to fix the instructions for running this script (download_mmseqs_databases.sh -> download_mmseqs_dbs.sh, prep_mmseqs_databases.sh -> prep_mmseqs_dbs.sh). Here are the changes I propose to prep_mmseqs_dbs.sh:

#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory 
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e

DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/mmseqs_dbs"

mkdir --parents "${ROOT_DIR}"

for f in $(ls ${DOWNLOAD_DIR}/*.tar.gz)
do
  tar --extract --verbose --file="${f}" \
      --directory="${ROOT_DIR}"
  rm "${f}"
  BASENAME="$(basename {f%%.*})"
  DB_NAME="${BASENAME}_db"
  OLD_PWD=$(pwd)
  cd "${ROOT_DIR}"
  mmseqs tsv2exprofiledb "${BASENAME}" "${DB_NAME}"
  mmseqs createindex "${DB_NAME}" "${DOWNLOAD_DIR}/tmp/"
  cd "${OLD_PWD}"
done

confidence per residue

[EDIT: I can see 'plddt' is part of the output, closing issue, will reopen if it's not the per-residue confidence score]

Thank you for this amazing repo!
Is there any suggested way to output the per-residue confidence score that AlphaFold produces?
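For reference, a minimal sketch of reading the per-residue score out of the inference output, assuming (as the edit above suggests) that the output dictionary exposes it under a 'plddt' key with one value per residue:

import torch

# Placeholder standing in for the dictionary returned by model inference;
# the random tensor just mimics a 128-residue chain with pLDDT in [0, 100].
outputs = {"plddt": torch.rand(128) * 100}

plddt = outputs["plddt"]
print(f"mean pLDDT: {plddt.mean().item():.1f}")
print(f"lowest-confidence residue: {int(plddt.argmin()) + 1} ({plddt.min().item():.1f})")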

OOM in validation

I get a CUDA OOM error when I add my validation set, even though I can predict the same structures just fine with run_pretrained_openfold.py.

Are you limiting your validation set to a certain size? I assume the problem comes from the additional features needed to compute the loss.

I had to make a few changes to get validation working:
val needs to be changed to eval in data_modules, e.g.:
https://github.com/aqlaboratory/openfold/blob/main/openfold/data/data_modules.py#L153

The third argument "unclamped" no longer exists:
https://github.com/aqlaboratory/openfold/blob/main/openfold/data/data_modules.py#L188

Validation also needs to be switched to _output_raw=True.

Clamped fape loss in validation

Currently, the FAPE loss is clamped in 90% of cases during validation. I'm wondering if this should be made deterministic (always clamp or never clamp) to make validation runs more comparable.

training data

In section 1.2.5 of the AlphaFold 2 supplement, there are some filters that are applied to the training data. Does the latest code not include this part?

ddp error

When I use strategy='ddp', train_openfold.py errors as follows:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 4983 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

about self.cached_weights in train_openfold.py

I ran train_openfold.py, and when it reached the validation set, I got an error saying the 'OpenFoldWrapper' object has no 'cached_weights' attribute. Can you help me figure out what is wrong?

Traceback (most recent call last):
File "train_openfold.py", line 370, in
main(args)
File "train_openfold.py", line 233, in main
ckpt_path=ckpt_path,
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 739, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 683, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
self.training_type_plugin.start_training(self)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
return self._run_train()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
self.fit_loop.run()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 146, in run
self.on_advance_end()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
self._run_validation()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
self.val_loop.run()
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 219, in validation_step
return self.model.validation_step(*args, **kwargs)
File "train_openfold.py", line 108, in validation_step
if(self.cached_weights is None):
File "/public/tools/anaconda3/envs/openfold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1178, in getattr
type(self).name, name))
AttributeError: 'OpenFoldWrapper' object has no attribute 'cached_weights'

chunk_layer is memory-inefficient

The chunk_layer function in openfold/utils/tensor_utils.py, which implements the "chunking" procedure described in subsection 1.11.8 of the AlphaFold 2 supplement, relies on a memory-expensive expand/reshape operation at the top to standardize the batch dimensions of input tensors. This operation can be a bottleneck during inference, so some optimization here would do wonders.
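For readers unfamiliar with the procedure, a stripped-down sketch of the chunking idea (not the repo's chunk_layer itself, and with the batch-dimension standardization omitted):

import torch

def chunked_apply(fn, x: torch.Tensor, chunk_size: int, dim: int = 0) -> torch.Tensor:
    # Run `fn` over slices of `x` along `dim` so that only one slice's worth
    # of intermediate activations is alive at a time. The real chunk_layer
    # additionally broadcasts/flattens arbitrary leading batch dims, which is
    # where the expensive expand/reshape mentioned above comes from.
    outputs = [fn(chunk) for chunk in torch.split(x, chunk_size, dim=dim)]
    return torch.cat(outputs, dim=dim)

# Example: a quadratic-memory pairwise op applied 64 rows at a time.
x = torch.randn(512, 256)
y = chunked_apply(lambda c: torch.softmax(c @ x.t(), dim=-1), x, chunk_size=64)
print(y.shape)  # torch.Size([512, 512])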

mmseqs tsv2exprofiledb issue with colabfold_envdb_202108

I was having issues with the prep_mmseqs_db.sh script, so I tried running the steps individually and I'm having an issue with running mmseqs tsv2exprofiledb with the colabfold_envdb_202108 database.

First, I downloaded these databases using the download_mmseqs_dbs.sh script and then ran the tar command according to the example in prep_mmseqs_dbs.sh, such that I had a directory with the following files:

colabfold_envdb_202108.tsv           
colabfold_envdb_202108_seq.tsv  
colabfold_envdb_202108_aln.tsv 
colabfold_envdb_202108_h.tsv 
uniref30_2103.md5sums
uniref30_2103.tsv
uniref30_2103_h.tsv
uniref30_2103_aln.tsv  
uniref30_2103_seq.tsv

I then used mmseqs tsv2exprofiledb mmseqs_dbs/uniref30_2103 /mmseqs/uniref30_2103_db which seemed to complete without error (though there is no .idx file, which is supposed to be the output of this command, I believe), generating the following files:

uniref30_2103_db.dbtype    
uniref30_2103_db_seq_tmp
uniref30_2103_db.index     
uniref30_2103_db_seq_tmp.index.0
uniref30_2103_db.sh
uniref30_2103_db_h
uniref30_2103_db.0
uniref30_2103_db.1               
uniref30_2103_db_h.dbtype
uniref30_2103_db_h.index

However, when I tried to do the same with the colabfold_envdb_202108 database, it seemed to start correctly, but then was killed after a minute or two. The following files were generated:

colabfold_envdb_202108_db.sh   
colabfold_envdb_202108_db_h     
colabfold_envdb_202108_db_h.index.0  

I used nohup and this is the extent of the output from that command:

tsv2exprofiledb /mmseqs_dbs/colabfold_envdb_202108 /mmseqs_dbs/colabfold_envdb_202108_db

MMseqs Version: 4f046dd1979ec87b440656ff13b12e5c525b8374
Verbosity       3

Killed

I'm wondering if I'm using an instance with insufficient RAM. Do you have an idea of the amount of RAM needed for the idx files?

Purpose of rc.MAP_HHBLITS_AATYPE_TO_OUR_AATYPE

Hi,

Thanks for a great repo!

I'm confused why the template's amino acids and the msa's amino acids are modified again using rc.MAP_HHBLITS_AATYPE_TO_OUR_AATYPE.

It seems like we read the amino acids from the pdb structure and convert them to ids using HHBLITS_AA_TO_ID. I'm wondering why we need to modify them again?

training speed is about 2x slower than JAX trainable version (Uni-Fold)

device: 1 A100 with 40GB memory
cuda: 11.3
Compared with https://github.com/dptech-corp/Uni-Fold, using the model_2 setting and the same data (only one sample, using DummyDataLoader in OpenFold).

Following issue #19, I disabled clear_cache_between_blocks and DeepSpeed CPU offloading.
The commit I used is c4d9f57

speed per example:

           FP32      FP16
openfold   24.5 s    17 s
Uni-Fold   13.25 s   8.9 s

Is that expected? Are there any tricks to get a further speed-up?

Option to run in "de-novo" mode

Most of the scripts at https://github.com/sokrypton/ColabFold have a ton of additional flexibility that comes in handy when running AF on de novo sequences (for which you usually can't generate an MSA) or when doing protein design.

Can this codebase also be leveraged to:

  • run predictions without MSA input or template structures, so just the raw sequence input
  • increase the number of recycling iterations, since this has been shown to recover some of the accuracy lost by not having an MSA (see the config sketch after this list)
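A hedged sketch of the second point, assuming the packaged config helper and the attribute path below exist as named (they are inferred from config snippets quoted elsewhere on this page, not verified against the current code):

from openfold.config import model_config

# Assumed names: "model_1" as a preset and data.common.max_recycling_iters as
# the knob controlling how many recycling passes the data pipeline prepares.
config = model_config("model_1")
config.data.common.max_recycling_iters = 12  # AlphaFold 2's default is 3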

Low-memory attention a little slow

I've implemented low-memory attention (9670958) using an algorithm from a recent preprint (https://arxiv.org/pdf/2112.05682.pdf), enhanced a little bit with the ability to add multiple biases + batch dimensions. Lacking the JAX map & scan used in the original implementation, which I've had to replace with for loops, ours is quite a bit slower (exact figures depend heavily on the choice of chunk sizes, but it seems to be in the ballpark of 2x slower than our own standard Attention implementation). It would be nice to speed it up a little.
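For context, a simplified sketch of the chunking idea (queries processed block by block so the full attention matrix is never materialized; the preprint additionally chunks the key/value dimension with a running softmax, which is omitted here):

import torch

def chunked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, q_chunk: int = 1024) -> torch.Tensor:
    # Process the query dimension in chunks; peak memory scales with
    # q_chunk * n_keys instead of n_queries * n_keys.
    scale = q.shape[-1] ** -0.5
    out = []
    for q_blk in torch.split(q, q_chunk, dim=-2):
        scores = torch.einsum("...qd,...kd->...qk", q_blk * scale, k)
        out.append(torch.einsum("...qk,...kd->...qd", scores.softmax(dim=-1), v))
    return torch.cat(out, dim=-2)

q = k = v = torch.randn(2, 4096, 64)
print(chunked_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])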

What are the different npz files?

After downloading DeepMind's pretrained parameters, there are 5 models, and for each model there is a .npz file and a _ptm.npz file. May I know what the 5 different models are and what the corresponding _ptm.npz files mean?

Multimer

Hi,
Thanks for this great work.
Just wondering, is there a way to do complex (multimer) prediction as in alphafold multimer ?

Thanks
Oz

Question about loss weight

Hi all! Firstly, thanks for your work and effort! I noticed that in the config file, the weight for each loss is different from that in the AF2 paper. For example, the weight for the angle loss is 1 instead of 0.3. Some of them, such as the violation loss and the experimentally-resolved loss, have a weight of 0. Is there any reason the weights are set up this way? For instance, for losses that have been assigned a weight of 0 in the implementation, are they still under testing? Thanks!

about self.cached_weights

Why do I get this error when I specify a validation set during training: AttributeError: 'OpenFoldWrapper' object has no attribute 'cached_weights'

Checkpointing Issue

Thanks for such a great repo! I get the following issue when running the model (but only when I use GPUs). I'm using torch checkpointing, not deepspeed. I saw an issue similar to this, but it seemed to be deepspeed-specific so I thought I'd repost.

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 2000 with name module.model.evoformer.blocks.19.pair_transition.linear_2.bias [For reference I have 20 blocks ] has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

how to extract the embeddings for each Protein Sequence

First of all, great work!

As you know, for each protein sequence, Evolutionary Scale Modeling (ESM) generates an embedding of size #amino_acids x 1280. I was wondering if we could get similar information from OpenFold as well. Do you think it is possible to extract such an embedding from the inner layers of OpenFold?
Could you give some guidance on how to extract this information from OpenFold?
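A hedged sketch of what reading a per-residue embedding out of an inference run might look like, assuming the model's output dictionary exposes the Evoformer single representation under a key such as "single" (the key name and the channel size below are assumptions for illustration, not confirmed here):

import torch

# Placeholder for the dict returned by the model's forward pass: for a chain
# of length L, the Evoformer single representation would be [L, c_s], the
# closest analogue to ESM's per-residue [L, 1280] embedding matrix.
outputs = {"single": torch.randn(220, 384)}

per_residue_embedding = outputs["single"]               # [L, c_s]
sequence_embedding = per_residue_embedding.mean(dim=0)  # one vector per protein
print(per_residue_embedding.shape, sequence_embedding.shape)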

Thanks!

Data parsing Bug for dataloader.

Hi, I've processed some data for training but hit the following dataloader bug:

File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 92, in apply_to_collection return function(data, *args, **kwargs) File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__ data = self._next_data() File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data return self._process_data(data) File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data data.reraise() File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise raise exception TypeError: Caught TypeError in DataLoader worker process 0. Original Traceback (most recent call last): File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop data = fetcher.fetch(index) File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch return self.collate_fn(data) File "/share/home/openfold/openfold-main/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/utilities/auto_restart.py", line 474, in _capture_metadata_collate data = default_collate(samples) File "/share/home/openfold/openfold-main/openfold/data/data_modules.py", line 297, in __call__ prot, self.stage File "/share/home/openfold/openfold-main/openfold/data/feature_pipeline.py", line 116, in process_features mode=mode, File "/share/home/openfold/openfold-main/openfold/data/feature_pipeline.py", line 93, in np_example_to_features cfg[mode], File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 187, in process_tensors_from_config lambda x: wrap_ensemble_fn(tensors, x), torch.arange(num_recycling + 1) File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 201, in map_fn ensembles = [fun(elem) for elem in x] File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 201, in <listcomp> ensembles = [fun(elem) for elem in x] File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 187, in <lambda> lambda x: wrap_ensemble_fn(tensors, x), torch.arange(num_recycling + 1) File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 168, in wrap_ensemble_fn return fn(d) File "/share/home/openfold/openfold-main/openfold/data/data_transforms.py", line 76, in <lambda> return lambda x: f(x, *args, **kwargs) File "/share/home/openfold/openfold-main/openfold/data/input_pipeline.py", line 196, in compose x = f(x) File "/share/home/openfold/openfold-main/openfold/data/data_transforms.py", line 76, in <lambda> return lambda x: f(x, *args, **kwargs) File "/share/home/openfold/openfold-main/openfold/data/data_transforms.py", line 180, in sample_msa num_seq = protein["msa"].shape[0] TypeError: 'function' object is not subscriptable

I have uploaded one of the data samples below. Is something wrong with my MSA-generation pipeline, or with the dataloader?
5E0Y.zip

Frequent loss is NaN & Training Hangs

Thank you for sharing your code!

I am trying to train OpenFold, but the problem of the loss becoming NaN persists, and the whole training run hangs when this happens.

I downloaded the code in early December and trained on 8 V100 cards with a training dataset of 1000 samples. When training reached the 26th sample of the 2nd epoch, there were many warnings that the loss was NaN and the training was interrupted.
I read your suggestion to "Replace training_step in train_openfold.py" in issue #19; after making that change, when training the first sample I got this:

WARNING:root:loss is NaN. Returning 0 loss...

Training still hangs.

I ran a recent commit again and retrained with the same dataset, and the same problem occurred again, on the same sample. Like this:

[screenshot: training log showing repeated NaN-loss warnings]

I changed the way the mapping is generated in data_modules.py so that the dataset is loaded in a fixed order, and I checked the samples where the loss is NaN but found no abnormalities.

This is very strange: with your first version of the code there has been no NaN loss so far, but with the versions committed after December this problem keeps occurring. Even if I change my training dataset or the learning rate in the DeepSpeed config file, the situation does not improve.

Is there a workaround for this situation?
