samsledje / d-script

A structure-aware interpretable deep learning model for sequence-based prediction of protein-protein interactions

Home Page: http://dscript.csail.mit.edu

License: MIT License

Languages: Python 39.84%, Shell 1.25%, Jupyter Notebook 58.90%

d-script's Introduction

D-SCRIPT

[Figure: D-SCRIPT architecture]

D-SCRIPT is a deep learning method for predicting a physical interaction between two proteins given just their sequences. It generalizes well to new species and is robust to limitations in training data size. Its design reflects the intuition that for two proteins to physically interact, a subset of amino acids from each protein should be in contact with the other. The intermediate stages of D-SCRIPT directly implement this intuition, with the penultimate stage in D-SCRIPT being a rough estimate of the inter-protein contact map of the protein dimer. This structurally motivated design enhances the interpretability of the results and, since structure is more conserved evolutionarily than sequence, improves generalizability across species.
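As a rough illustration (a sketch, not the documented API): the call pattern below mirrors the model.map_predict usage visible in the predict command's tracebacks elsewhere on this page, and the 6165-dimensional input embedding is an assumption based on the Bepler & Berger language model.

import torch

# Hypothetical usage sketch: load a saved D-SCRIPT model and score one pair.
model = torch.load("human_v1.sav", map_location="cpu")
model.eval()

# Stand-in language-model embeddings for two proteins of lengths 300 and 250;
# the 6165 embedding dimension is an assumption (Bepler & Berger LM output).
z0 = torch.randn(1, 300, 6165)
z1 = torch.randn(1, 250, 6165)

with torch.no_grad():
    cm, p = model.map_predict(z0, z1)  # cm: (1 x 300 x 250) contact map, p: interaction probability
print(float(p), cm.squeeze().shape)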

d-script's People

Contributors: kapil-devkota, samsledje

d-script's Issues

Some questions about calculating the contact map (label)

Dear authors, I would like to understand how to generate the real contact map, but I ran into some problems while reading the code. I am not sure about the meanings of glider_map and glider_matrix. Could you point me to some more detailed material? Thank you very much!

e. coli dataset

Maybe it's right in front of me, but I can't find the E. coli dataset. Could it be added with the rest? I'm interested in playing with it!

how to compare TT3D contact map to AlphaFold multimer prediction?

Could you please suggest a method to compare the contact map to an AlphaFold prediction?

If I understand correctly, the default TT3D output contact map is a matrix of size length(protein1) x length(protein2), with a score reflecting how likely each amino acid of protein1 is to interact with each amino acid of protein2. The default output of an AlphaFold prediction of the interaction would be a structure file (.pdb) with fixed distances between each pair of amino acids.

These two outputs are clearly related, but I'm not sure how I could mathematically test their similarity.
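One possible recipe (a sketch under assumptions, not the authors' protocol): reduce the AlphaFold structure to a binary inter-chain contact map with a distance cutoff, then correlate it with the TT3D score matrix. Biopython, chain IDs A/B, the 8 Å C-alpha cutoff, Spearman correlation, and both file names are assumptions here.

import numpy as np
from scipy.stats import spearmanr
from Bio.PDB import PDBParser

# Build the "true" inter-chain contact map from the AlphaFold .pdb file.
structure = PDBParser(QUIET=True).get_structure("dimer", "complex.pdb")[0]
ca_a = np.array([r["CA"].coord for r in structure["A"] if "CA" in r])
ca_b = np.array([r["CA"].coord for r in structure["B"] if "CA" in r])
dist = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
true_contacts = (dist < 8.0).astype(float)  # 8 Angstrom C-alpha cutoff (a common convention)

# Compare against the predicted score matrix (assumed to have matching shape).
pred = np.load("tt3d_cmap.npy")  # placeholder dump of the TT3D contact map
rho, _ = spearmanr(pred.ravel(), true_contacts.ravel())
print(f"Spearman rho between predicted scores and true contacts: {rho:.3f}")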

The API docs are incomplete for prediction

Hi, I have been trying to use the API but I haven't figured it out; it seems the documentation for prediction is incomplete. If I want to do the equivalent of dscript predict [-h] --pairs PAIRS --model MODEL [--seqs SEQS] [--embeddings EMBEDDINGS] [-o OUTFILE] [-d DEVICE] [--thresh THRESH] with the API, what should I write? I found dscript.commands.predict.PredictionArguments(cmd, device, embeddings, outfile, seqs, model, thresh, func), but I am unsure what cmd and func are. Would you mind showing me an example? Thank you.
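For what it's worth, a hedged sketch of one way to drive it programmatically: the CLI dispatch just calls args.func(args), so an argparse.Namespace with the same fields passed to dscript.commands.predict.main should be equivalent. The field names below mirror the CLI flags and are not guaranteed to be complete.

from argparse import Namespace
import dscript.commands.predict as predict

args = Namespace(
    pairs="pairs.tsv",          # --pairs
    model="human_v1.sav",       # --model
    seqs="seqs.fasta",          # --seqs
    embeddings=None,            # --embeddings
    outfile="predictions",      # -o
    device=-1,                  # -d (-1 for CPU, assuming the CLI's convention)
    thresh=0.5,                 # --thresh
)
predict.main(args)  # cmd/func are only used by the argument parser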

prediction step error: Unable to create dataset (name already exists)

Hi Sam,
I've been using your available human-PPI-trained model to test some predictions on our data, but I keep getting this error, seen at the tail of the Slurm outfile:

100%|##########| 14399/14399 [49:31<00:00,  4.85it/s]
  0%|          | 11647/45233277 [3:18:54<12871:51:27,  1.02s/it]
# Using CPU
# Loading Embeddings...
# Making Predictions...
Traceback (most recent call last):
  File "/home/igwill/.local/bin/dscript", line 8, in <module>
    sys.exit(main())
  File "/home/igwill/.local/lib/python3.7/site-packages/dscript/__main__.py", line 54, in main
    args.func(args)
  File "/home/igwill/.local/lib/python3.7/site-packages/dscript/commands/predict.py", line 150, in main
    cmap_file.create_dataset(f"{n0}x{n1}", data=cm.squeeze().cpu().numpy())
  File "/share/apps/anaconda3-97/lib/python3.7/site-packages/h5py/_hl/group.py", line 148, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/share/apps/anaconda3-97/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 137, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 87, in h5py.h5d.create
ValueError: Unable to create dataset (name already exists)

I've tried both setting and not setting --outfile to a name, and running in an otherwise empty directory.

This is how I'm running D-SCRIPT for this step (just on the CPU for this test prediction run):
dscript predict --pairs "/home/igwill/PPI/Dscript/prediction/hypo_cross_ppis_SPTM.tsv" --embeddings "/home/igwill/PPI/Dscript/ppi_allseqs_embed" --model "/home/igwill/PPI/Dscript/prediction/human_v1.sav"

I see files being made by D-SCRIPT: *.log, .cmaps.h5 (the culprit?), .tsv, .positive.tsv.

Am I mis-setting something here?

Thanks
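One plausible cause (a guess, not a confirmed diagnosis): the pairs file lists some protein pair more than once, so predict.py tries to create the same "{n0}x{n1}" dataset in the .cmaps.h5 file twice. A sketch that de-duplicates the pairs file first:

import csv

# De-duplicate a tab-separated pairs file so each (n0, n1) appears once.
seen, rows = set(), []
with open("hypo_cross_ppis_SPTM.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            rows.append(row)

with open("pairs_dedup.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)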

Cannot reproduce the paper results using default training hyperparameters

The pretrained model human_v1.sav has good performance as demonstrated in the paper. However, I failed to train it using the default hyperparameters in this repo.

[Figure: training metrics; 'bb' = using Bepler & Berger's LM as embeddings]

The recall score is especially low and even shows a decreasing trend. I wonder whether the hyperparameters are unsuitable or the code was broken in recent commits.

ModuleNotFoundError when trying to load pretrained TT3D model

Hi!
I'm trying to use the pretrained TT3D model to evaluate a test set of mine. However, after downloading it from https://d-script.readthedocs.io/en/stable/data.html, I get the following error when using evaluate:

Traceback (most recent call last):
  File "dscript/commands/evaluate.py", line 283, in <module>
    main(parser.parse_args())
  File "dscript/commands/evaluate.py", line 200, in main
    model = torch.load(model_path).cuda()
  File "/home/sarahn/miniconda3/envs/dscript/lib/python3.7/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/sarahn/miniconda3/envs/dscript/lib/python3.7/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/home/sarahn/miniconda3/envs/dscript/lib/python3.7/site-packages/torch/serialization.py", line 1039, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'dscript'

Do you have any idea how this could be fixed? Has anybody been able to successfully use the pretrained TT3D model? I have installed dscript with pip.
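One hedged workaround: torch.load unpickles custom classes by importing them, so the dscript package (ideally a version matching the one that saved the model) must be importable in the running interpreter. If the pip install still fails, pointing sys.path at a D-SCRIPT source checkout is one option; the path below is a placeholder.

import sys
sys.path.insert(0, "/path/to/D-SCRIPT")  # placeholder: repo root containing the dscript/ package

import torch
model = torch.load("tt3d_v1.sav", map_location="cpu")  # unpickling can now resolve dscript.*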

Data request: pre-computed PPIs

Hi Sam,

Hope you are doing well. I find D-SCRIPT very easy to install and use.

Could you explain how to generate the results in Table 1, H. sapiens (5-fold cross validation)? Especially how these relate to human_test.tsv and human_train.tsv in d-script/data/pairs/?

It would be very handy if you could share the predicted PPIs (D-SCRIPT result files) for Table 1, so that we don't have to rerun the tool on the same protein pairs. So far I have gotten D-SCRIPT running very well on a single server, but not yet on HPC.

Many thanks,
CS

How to train D-SCRIPT end-to-end?

I see in dscript/commands/train.py that training takes the precomputed embedded features and only trains the projection and CNN modules.
How can I train the language model (the BiLSTM) together with the projection and CNN modules?
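A minimal sketch of the general PyTorch pattern (stand-in modules, not the project's training code): unfreeze the LM's parameters and hand both parameter sets to one optimizer, computing embeddings inside the training loop with gradients enabled. Note this is far more memory-hungry than training on precomputed embeddings.

import itertools
import torch

lm = torch.nn.LSTM(20, 64, batch_first=True)  # stand-in for the BiLSTM language model
head = torch.nn.Linear(64, 1)                 # stand-in for the projection + CNN modules

for p in lm.parameters():
    p.requires_grad = True  # unfreeze the LM

# One optimizer over both parameter sets trains them end-to-end.
optimizer = torch.optim.Adam(
    itertools.chain(lm.parameters(), head.parameters()), lr=1e-4
)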

proteinA x proteinB skipped (The size of tensor a (2581) must match the size of tensor b (2000))

Hi! I tried to use this great tool to predict interactions for specific protein sequences, and I got the above message in the log file. What does this message mean, and does it affect the predicted interaction values?
I installed with conda and predicted using a trained model; the prediction command is shown below.
dscript predict --pairs ./pair.tsv --seqs ./sequence.fasta --model ./topsy_turvy_v1.sav

What is the purpose of the argument dumb-embed-switch?

As I understand it now, if set to False, it allows the interaction model in train.py to train using embedding.FullyConnectedEmbed to extract embeddings inside the run_training_from_args function. However, I'm confused because in predict.py you don't use this model for the embeddings. What am I missing? How can there be two different embedding models?

I have no idea how to run the model

When I try to run dscript's main.py, PyCharm tells me:
"usage: main.py [-h] [-v] [-c] {train,eval,embed,predict} ...
main.py: error: the following arguments are required: cmd".
I am new to Python; can you please tell me how I can run the model?

In the .fasta file, multiple proteins share the same sequence.

Awesome work!
When I checked the .fasta files in the data folder, I found that different proteins may share the same residue sequence. However, some of these proteins don't appear in the .tsv files. Do you merge proteins that share the same sequence into one in the .tsv file?
Thanks.

Prediction Quality of the human_v1.sav model

Hello, I was testing the human_v1.sav model on the first 100 positive pairs from each of human_test.tsv and human_train.tsv, as used in training (200 pairs total). In the predictions, I noticed that only around ~40 of the 100 pairs in each set came out positive, whereas in the original data all of them are positive:

Predicted Positives:

9606.ENSP00000364459 9606.ENSP00000360117 0.9836567640304565
9606.ENSP00000469581 9606.ENSP00000386920 0.9592902660369873
9606.ENSP00000248150 9606.ENSP00000393333 0.8590458035469055
9606.ENSP00000405785 9606.ENSP00000360568 0.9667615294456482
9606.ENSP00000421422 9606.ENSP00000363322 0.7510783076286316
9606.ENSP00000428962 9606.ENSP00000456034 0.9630085825920105
9606.ENSP00000356794 9606.ENSP00000228140 0.9728716015815735
9606.ENSP00000287156 9606.ENSP00000408292 0.7241741418838501
9606.ENSP00000366249 9606.ENSP00000379931 0.9771170020103455
9606.ENSP00000315397 9606.ENSP00000361928 0.9851889610290527
9606.ENSP00000394441 9606.ENSP00000426176 0.9927392601966858
9606.ENSP00000391383 9606.ENSP00000471874 0.9158253073692322
9606.ENSP00000355779 9606.ENSP00000359095 0.8252230882644653
9606.ENSP00000390667 9606.ENSP00000377446 0.9509118795394897
9606.ENSP00000340879 9606.ENSP00000436512 0.8199784755706787
9606.ENSP00000272418 9606.ENSP00000362208 0.8711791038513184
9606.ENSP00000453523 9606.ENSP00000366249 0.9753308296203613
9606.ENSP00000430545 9606.ENSP00000333401 0.9889339208602905
9606.ENSP00000388690 9606.ENSP00000449765 0.8419162034988403
9606.ENSP00000433170 9606.ENSP00000396716 0.9459835886955261
9606.ENSP00000462938 9606.ENSP00000400467 0.9080663323402405
9606.ENSP00000425126 9606.ENSP00000439565 0.863395631313324
9606.ENSP00000463568 9606.ENSP00000361626 0.552105724811554
9606.ENSP00000466020 9606.ENSP00000347168 0.9441595077514648
9606.ENSP00000422772 9606.ENSP00000433170 0.8190745711326599
9606.ENSP00000290541 9606.ENSP00000440327 0.9485920071601868
9606.ENSP00000367806 9606.ENSP00000361076 0.9093227386474609
9606.ENSP00000439565 9606.ENSP00000400082 0.9004131555557251
9606.ENSP00000346148 9606.ENSP00000366488 0.6718304753303528
9606.ENSP00000399076 9606.ENSP00000390331 0.8431304097175598
9606.ENSP00000437062 9606.ENSP00000455856 0.9130898714065552
9606.ENSP00000466634 9606.ENSP00000455282 0.771989107131958
9606.ENSP00000354739 9606.ENSP00000380465 0.9060049057006836
9606.ENSP00000429968 9606.ENSP00000391383 0.7879292368888855
9606.ENSP00000449409 9606.ENSP00000416350 0.610442578792572
9606.ENSP00000266735 9606.ENSP00000423803 0.863314151763916
9606.ENSP00000422772 9606.ENSP00000429374 0.8409360647201538
9606.ENSP00000305682 9606.ENSP00000355694 0.9833050966262817
9606.ENSP00000386752 9606.ENSP00000357823 0.9631509780883789
9606.ENSP00000400666 9606.ENSP00000420277 0.6354408264160156
9606.ENSP00000363018 9606.ENSP00000345156 0.5488811731338501
9606.ENSP00000449391 9606.ENSP00000243067 0.6863182187080383
9606.ENSP00000265339 9606.ENSP00000293968 0.945186972618103
9606.ENSP00000428581 9606.ENSP00000342676 0.9357559680938721
9606.ENSP00000298288 9606.ENSP00000464594 0.9879041314125061
9606.ENSP00000299166 9606.ENSP00000425882 0.9502236843109131
9606.ENSP00000392850 9606.ENSP00000433170 0.9410211443901062
9606.ENSP00000441332 9606.ENSP00000390867 0.9402053356170654
9606.ENSP00000467146 9606.ENSP00000290541 0.8331601619720459
9606.ENSP00000421422 9606.ENSP00000390790 0.9625501036643982
9606.ENSP00000296102 9606.ENSP00000347969 0.8511738777160645
9606.ENSP00000278572 9606.ENSP00000368927 0.9590868353843689
9606.ENSP00000430545 9606.ENSP00000484903 0.9736945033073425
9606.ENSP00000392850 9606.ENSP00000434837 0.8842238783836365
9606.ENSP00000053468 9606.ENSP00000400467 0.946972668170929
9606.ENSP00000470098 9606.ENSP00000466063 0.9452704787254333
9606.ENSP00000452940 9606.ENSP00000418695 0.621618390083313
9606.ENSP00000274242 9606.ENSP00000379888 0.8498455882072449
9606.ENSP00000466142 9606.ENSP00000265339 0.9736203551292419
9606.ENSP00000356720 9606.ENSP00000418933 0.9557900428771973
9606.ENSP00000471874 9606.ENSP00000228140 0.8821477890014648
9606.ENSP00000386920 9606.ENSP00000296102 0.935010552406311
9606.ENSP00000249299 9606.ENSP00000430021 0.6841226816177368
9606.ENSP00000435333 9606.ENSP00000351021 0.8266493678092957
9606.ENSP00000369741 9606.ENSP00000378081 0.8568494915962219
9606.ENSP00000413875 9606.ENSP00000424359 0.9197615385055542
9606.ENSP00000368801 9606.ENSP00000400591 0.9503082036972046
9606.ENSP00000483260 9606.ENSP00000377865 0.8625007271766663
9606.ENSP00000428962 9606.ENSP00000419084 0.9733706116676331
9606.ENSP00000323046 9606.ENSP00000469431 0.9690800905227661
9606.ENSP00000290541 9606.ENSP00000378739 0.9431149959564209
9606.ENSP00000354728 9606.ENSP00000376018 0.9871731996536255
9606.ENSP00000388126 9606.ENSP00000410758 0.8001493215560913
9606.ENSP00000418346 9606.ENSP00000429374 0.9086166620254517
9606.ENSP00000346012 9606.ENSP00000434837 0.9208306670188904
9606.ENSP00000463107 9606.ENSP00000386541 0.8896101117134094
9606.ENSP00000363339 9606.ENSP00000481635 0.6550856828689575
9606.ENSP00000246115 9606.ENSP00000386054 0.8669542670249939
9606.ENSP00000262629 9606.ENSP00000376395 0.6916682720184326
9606.ENSP00000300151 9606.ENSP00000480129 0.9504024982452393

All Predicted:

9606.ENSP00000409077 9606.ENSP00000470819 0.004183880519121885
9606.ENSP00000263904 9606.ENSP00000472680 0.004181481432169676
9606.ENSP00000364459 9606.ENSP00000360117 0.9836567640304565
9606.ENSP00000422403 9606.ENSP00000400591 0.3727874755859375
9606.ENSP00000388332 9606.ENSP00000346080 0.00473413523286581
9606.ENSP00000469581 9606.ENSP00000386920 0.9592902660369873
9606.ENSP00000248150 9606.ENSP00000393333 0.8590458035469055
9606.ENSP00000466266 9606.ENSP00000363315 0.03303053602576256
9606.ENSP00000405785 9606.ENSP00000360568 0.9667615294456482
9606.ENSP00000362930 9606.ENSP00000429312 0.16951589286327362
9606.ENSP00000421422 9606.ENSP00000363322 0.7510783076286316
9606.ENSP00000462683 9606.ENSP00000353015 0.004504330921918154
9606.ENSP00000387127 9606.ENSP00000394483 0.004192233085632324
9606.ENSP00000315476 9606.ENSP00000367498 0.004174979869276285
9606.ENSP00000428962 9606.ENSP00000456034 0.9630085825920105
9606.ENSP00000356794 9606.ENSP00000228140 0.9728716015815735
9606.ENSP00000287156 9606.ENSP00000408292 0.7241741418838501
9606.ENSP00000400467 9606.ENSP00000406561 0.06780070811510086
9606.ENSP00000455282 9606.ENSP00000452473 0.012177504599094391
9606.ENSP00000359862 9606.ENSP00000322594 0.040320705622434616
9606.ENSP00000409852 9606.ENSP00000362994 0.004177952650934458
9606.ENSP00000366249 9606.ENSP00000379931 0.9771170020103455
9606.ENSP00000315397 9606.ENSP00000361928 0.9851889610290527
9606.ENSP00000201031 9606.ENSP00000482258 0.004201050847768784
9606.ENSP00000440658 9606.ENSP00000435311 0.004511235747486353
9606.ENSP00000394441 9606.ENSP00000426176 0.9927392601966858
9606.ENSP00000294179 9606.ENSP00000478403 0.004298059269785881
9606.ENSP00000391383 9606.ENSP00000471874 0.9158253073692322
9606.ENSP00000448229 9606.ENSP00000228140 0.3032079339027405
9606.ENSP00000355779 9606.ENSP00000359095 0.8252230882644653
9606.ENSP00000419084 9606.ENSP00000463395 0.004182545933872461
9606.ENSP00000452636 9606.ENSP00000429323 0.0048737130127847195
9606.ENSP00000390667 9606.ENSP00000377446 0.9509118795394897
9606.ENSP00000340879 9606.ENSP00000436512 0.8199784755706787
9606.ENSP00000413869 9606.ENSP00000484726 0.00418407516553998
9606.ENSP00000272418 9606.ENSP00000362208 0.8711791038513184
9606.ENSP00000453523 9606.ENSP00000366249 0.9753308296203613
9606.ENSP00000485510 9606.ENSP00000358854 0.1316414326429367
9606.ENSP00000436313 9606.ENSP00000406751 0.3218730390071869
9606.ENSP00000417619 9606.ENSP00000360117 0.13994590938091278
9606.ENSP00000360429 9606.ENSP00000347915 0.004184618126600981
9606.ENSP00000476760 9606.ENSP00000473330 0.004178718663752079
9606.ENSP00000430545 9606.ENSP00000333401 0.9889339208602905
9606.ENSP00000388690 9606.ENSP00000449765 0.8419162034988403
9606.ENSP00000452473 9606.ENSP00000356541 0.004236438311636448
9606.ENSP00000433170 9606.ENSP00000396716 0.9459835886955261
9606.ENSP00000402423 9606.ENSP00000426111 0.0041794609278440475
9606.ENSP00000357887 9606.ENSP00000362166 0.004217073321342468
9606.ENSP00000458237 9606.ENSP00000304127 0.00913697574287653
9606.ENSP00000462938 9606.ENSP00000400467 0.9080663323402405
9606.ENSP00000306033 9606.ENSP00000466593 0.004175788722932339
9606.ENSP00000456246 9606.ENSP00000356903 0.007225535809993744
9606.ENSP00000425126 9606.ENSP00000439565 0.863395631313324
9606.ENSP00000400666 9606.ENSP00000463889 0.004179061856120825
9606.ENSP00000273156 9606.ENSP00000372210 0.15635330975055695
9606.ENSP00000464883 9606.ENSP00000405454 0.017727287486195564
9606.ENSP00000376256 9606.ENSP00000362815 0.004175873938947916
9606.ENSP00000431517 9606.ENSP00000468749 0.004177439026534557
9606.ENSP00000391730 9606.ENSP00000453880 0.004175685811787844
9606.ENSP00000463568 9606.ENSP00000361626 0.552105724811554
9606.ENSP00000460400 9606.ENSP00000381963 0.0041790003888309
9606.ENSP00000275603 9606.ENSP00000365944 0.004179549869149923
9606.ENSP00000466020 9606.ENSP00000347168 0.9441595077514648
9606.ENSP00000380196 9606.ENSP00000448030 0.004200428258627653
9606.ENSP00000422772 9606.ENSP00000433170 0.8190745711326599
9606.ENSP00000290541 9606.ENSP00000440327 0.9485920071601868
9606.ENSP00000367806 9606.ENSP00000361076 0.9093227386474609
9606.ENSP00000439565 9606.ENSP00000400082 0.9004131555557251
9606.ENSP00000346148 9606.ENSP00000366488 0.6718304753303528
9606.ENSP00000451345 9606.ENSP00000319578 0.02991567924618721
9606.ENSP00000399076 9606.ENSP00000390331 0.8431304097175598
9606.ENSP00000350625 9606.ENSP00000408482 0.004186746198683977
9606.ENSP00000437062 9606.ENSP00000455856 0.9130898714065552
9606.ENSP00000466634 9606.ENSP00000455282 0.771989107131958
9606.ENSP00000264639 9606.ENSP00000401724 0.00619113864377141
9606.ENSP00000405965 9606.ENSP00000452486 0.004188647493720055
9606.ENSP00000354739 9606.ENSP00000380465 0.9060049057006836
9606.ENSP00000429968 9606.ENSP00000391383 0.7879292368888855
9606.ENSP00000449409 9606.ENSP00000416350 0.610442578792572
9606.ENSP00000467831 9606.ENSP00000400508 0.004181521479040384
9606.ENSP00000455392 9606.ENSP00000376246 0.004175132606178522
9606.ENSP00000266735 9606.ENSP00000423803 0.863314151763916
9606.ENSP00000340677 9606.ENSP00000480077 0.004175209905952215
9606.ENSP00000460084 9606.ENSP00000415430 0.00417515030130744
9606.ENSP00000422772 9606.ENSP00000429374 0.8409360647201538
9606.ENSP00000305682 9606.ENSP00000355694 0.9833050966262817
9606.ENSP00000386752 9606.ENSP00000357823 0.9631509780883789
9606.ENSP00000364320 9606.ENSP00000357857 0.004175729118287563
9606.ENSP00000400666 9606.ENSP00000420277 0.6354408264160156
9606.ENSP00000325421 9606.ENSP00000364204 0.004290678072720766
9606.ENSP00000236273 9606.ENSP00000424616 0.004183924291282892
9606.ENSP00000302176 9606.ENSP00000340817 0.00417567603290081
9606.ENSP00000361076 9606.ENSP00000279259 0.031076163053512573
9606.ENSP00000412530 9606.ENSP00000279259 0.008396787568926811
9606.ENSP00000386792 9606.ENSP00000370208 0.004656988196074963
9606.ENSP00000439818 9606.ENSP00000466103 0.004299112595617771
9606.ENSP00000363018 9606.ENSP00000345156 0.5488811731338501
9606.ENSP00000449391 9606.ENSP00000243067 0.6863182187080383
9606.ENSP00000285018 9606.ENSP00000418389 0.004180514719337225
9606.ENSP00000397912 9606.ENSP00000389381 0.08798135071992874
9606.ENSP00000265339 9606.ENSP00000293968 0.945186972618103
9606.ENSP00000259727 9606.ENSP00000419851 0.076852947473526
9606.ENSP00000480334 9606.ENSP00000395401 0.01999049074947834
9606.ENSP00000225728 9606.ENSP00000255764 0.1401696503162384
9606.ENSP00000345963 9606.ENSP00000478877 0.36111822724342346
9606.ENSP00000468663 9606.ENSP00000326500 0.08016742020845413
9606.ENSP00000428581 9606.ENSP00000342676 0.9357559680938721
9606.ENSP00000362937 9606.ENSP00000401470 0.00417680200189352
9606.ENSP00000296792 9606.ENSP00000451345 0.004193365573883057
9606.ENSP00000261597 9606.ENSP00000476760 0.004182123113423586
9606.ENSP00000298288 9606.ENSP00000464594 0.9879041314125061
9606.ENSP00000299166 9606.ENSP00000425882 0.9502236843109131
9606.ENSP00000392850 9606.ENSP00000433170 0.9410211443901062
9606.ENSP00000441332 9606.ENSP00000390867 0.9402053356170654
9606.ENSP00000252440 9606.ENSP00000337853 0.004177470691502094
9606.ENSP00000415973 9606.ENSP00000361626 0.07606470584869385
9606.ENSP00000446013 9606.ENSP00000380998 0.004191002808511257
9606.ENSP00000467146 9606.ENSP00000290541 0.8331601619720459
9606.ENSP00000452636 9606.ENSP00000431800 0.004373032134026289
9606.ENSP00000421422 9606.ENSP00000390790 0.9625501036643982
9606.ENSP00000386500 9606.ENSP00000429419 0.004411790519952774
9606.ENSP00000296102 9606.ENSP00000347969 0.8511738777160645
9606.ENSP00000400211 9606.ENSP00000324944 0.004174448549747467
9606.ENSP00000337853 9606.ENSP00000485133 0.0870395228266716
9606.ENSP00000278572 9606.ENSP00000368927 0.9590868353843689
9606.ENSP00000430545 9606.ENSP00000484903 0.9736945033073425
9606.ENSP00000392850 9606.ENSP00000434837 0.8842238783836365
9606.ENSP00000053468 9606.ENSP00000400467 0.946972668170929
9606.ENSP00000400727 9606.ENSP00000451252 0.11200016736984253
9606.ENSP00000348838 9606.ENSP00000356479 0.05801907554268837
9606.ENSP00000470098 9606.ENSP00000466063 0.9452704787254333
9606.ENSP00000457239 9606.ENSP00000427514 0.004312561824917793
9606.ENSP00000350098 9606.ENSP00000424048 0.004175967071205378
9606.ENSP00000324124 9606.ENSP00000432154 0.1811545193195343
9606.ENSP00000427365 9606.ENSP00000434816 0.011811366304755211
9606.ENSP00000480549 9606.ENSP00000314458 0.004189439117908478
9606.ENSP00000438284 9606.ENSP00000453560 0.45608770847320557
9606.ENSP00000452940 9606.ENSP00000418695 0.621618390083313
9606.ENSP00000274242 9606.ENSP00000379888 0.8498455882072449
9606.ENSP00000466142 9606.ENSP00000265339 0.9736203551292419
9606.ENSP00000422772 9606.ENSP00000471185 0.06107671186327934
9606.ENSP00000369162 9606.ENSP00000481646 0.2004823088645935
9606.ENSP00000296555 9606.ENSP00000293968 0.006630090065300465
9606.ENSP00000358300 9606.ENSP00000358320 0.004874304868280888
9606.ENSP00000356720 9606.ENSP00000418933 0.9557900428771973
9606.ENSP00000448615 9606.ENSP00000439952 0.004177016206085682
9606.ENSP00000471874 9606.ENSP00000228140 0.8821477890014648
9606.ENSP00000386920 9606.ENSP00000296102 0.935010552406311
9606.ENSP00000482557 9606.ENSP00000480472 0.0042853280901908875
9606.ENSP00000429384 9606.ENSP00000362105 0.004279043525457382
9606.ENSP00000249299 9606.ENSP00000430021 0.6841226816177368
9606.ENSP00000463202 9606.ENSP00000478887 0.004179484210908413
9606.ENSP00000273047 9606.ENSP00000310406 0.004185174126178026
9606.ENSP00000435333 9606.ENSP00000351021 0.8266493678092957
9606.ENSP00000383941 9606.ENSP00000403891 0.0041751619428396225
9606.ENSP00000453972 9606.ENSP00000394014 0.06416654586791992
9606.ENSP00000464599 9606.ENSP00000451345 0.12590646743774414
9606.ENSP00000466791 9606.ENSP00000400467 0.0045409612357616425
9606.ENSP00000437996 9606.ENSP00000382966 0.004204932600259781
9606.ENSP00000358854 9606.ENSP00000432154 0.0042976695112884045
9606.ENSP00000293860 9606.ENSP00000361171 0.00426383875310421
9606.ENSP00000369741 9606.ENSP00000378081 0.8568494915962219
9606.ENSP00000413875 9606.ENSP00000424359 0.9197615385055542
9606.ENSP00000261015 9606.ENSP00000265245 0.008206214755773544
9606.ENSP00000368801 9606.ENSP00000400591 0.9503082036972046
9606.ENSP00000483260 9606.ENSP00000377865 0.8625007271766663
9606.ENSP00000266735 9606.ENSP00000475218 0.05284871533513069
9606.ENSP00000323099 9606.ENSP00000465569 0.00418142369017005
9606.ENSP00000428962 9606.ENSP00000419084 0.9733706116676331
9606.ENSP00000323046 9606.ENSP00000469431 0.9690800905227661
9606.ENSP00000386469 9606.ENSP00000419782 0.004180506803095341
9606.ENSP00000259895 9606.ENSP00000358487 0.004174997564405203
9606.ENSP00000290541 9606.ENSP00000378739 0.9431149959564209
9606.ENSP00000354728 9606.ENSP00000376018 0.9871731996536255
9606.ENSP00000428884 9606.ENSP00000341268 0.00417519174516201
9606.ENSP00000388126 9606.ENSP00000410758 0.8001493215560913
9606.ENSP00000435766 9606.ENSP00000416120 0.07131560891866684
9606.ENSP00000362797 9606.ENSP00000420946 0.019188711419701576
9606.ENSP00000261416 9606.ENSP00000455545 0.004202495329082012
9606.ENSP00000261015 9606.ENSP00000400542 0.2885924279689789
9606.ENSP00000418346 9606.ENSP00000429374 0.9086166620254517
9606.ENSP00000346012 9606.ENSP00000434837 0.9208306670188904
9606.ENSP00000479931 9606.ENSP00000384046 0.004173992667347193
9606.ENSP00000463107 9606.ENSP00000386541 0.8896101117134094
9606.ENSP00000360876 9606.ENSP00000353483 0.004179186653345823
9606.ENSP00000363339 9606.ENSP00000481635 0.6550856828689575
9606.ENSP00000402142 9606.ENSP00000388246 0.004798989277333021
9606.ENSP00000482258 9606.ENSP00000458430 0.004259957931935787
9606.ENSP00000246115 9606.ENSP00000386054 0.8669542670249939
9606.ENSP00000452595 9606.ENSP00000475909 0.00420505041256547
9606.ENSP00000262629 9606.ENSP00000376395 0.6916682720184326
9606.ENSP00000300151 9606.ENSP00000480129 0.9504024982452393
9606.ENSP00000418641 9606.ENSP00000419084 0.06059009209275246
9606.ENSP00000438450 9606.ENSP00000429901 0.008720356971025467
9606.ENSP00000353202 9606.ENSP00000362659 0.00427031796425581
9606.ENSP00000391114 9606.ENSP00000436218 0.008428175002336502
9606.ENSP00000415508 9606.ENSP00000384123 0.03511521965265274
9606.ENSP00000352333 9606.ENSP00000476100 0.004223955329507589
9606.ENSP00000209884 9606.ENSP00000216225 0.004192270804196596

Original:
9606.ENSP00000409077 9606.ENSP00000470819 1
9606.ENSP00000263904 9606.ENSP00000472680 1
9606.ENSP00000364459 9606.ENSP00000360117 1
9606.ENSP00000422403 9606.ENSP00000400591 1
9606.ENSP00000388332 9606.ENSP00000346080 1
9606.ENSP00000469581 9606.ENSP00000386920 1
9606.ENSP00000248150 9606.ENSP00000393333 1
9606.ENSP00000466266 9606.ENSP00000363315 1
9606.ENSP00000405785 9606.ENSP00000360568 1
9606.ENSP00000362930 9606.ENSP00000429312 1
9606.ENSP00000421422 9606.ENSP00000363322 1
9606.ENSP00000462683 9606.ENSP00000353015 1
9606.ENSP00000387127 9606.ENSP00000394483 1
9606.ENSP00000315476 9606.ENSP00000367498 1
9606.ENSP00000428962 9606.ENSP00000456034 1
9606.ENSP00000356794 9606.ENSP00000228140 1
9606.ENSP00000287156 9606.ENSP00000408292 1
9606.ENSP00000400467 9606.ENSP00000406561 1
9606.ENSP00000455282 9606.ENSP00000452473 1
9606.ENSP00000359862 9606.ENSP00000322594 1
9606.ENSP00000409852 9606.ENSP00000362994 1
9606.ENSP00000366249 9606.ENSP00000379931 1
9606.ENSP00000315397 9606.ENSP00000361928 1
9606.ENSP00000201031 9606.ENSP00000482258 1
9606.ENSP00000440658 9606.ENSP00000435311 1
9606.ENSP00000394441 9606.ENSP00000426176 1
9606.ENSP00000294179 9606.ENSP00000478403 1
9606.ENSP00000391383 9606.ENSP00000471874 1
9606.ENSP00000448229 9606.ENSP00000228140 1
9606.ENSP00000355779 9606.ENSP00000359095 1
9606.ENSP00000419084 9606.ENSP00000463395 1
9606.ENSP00000452636 9606.ENSP00000429323 1
9606.ENSP00000390667 9606.ENSP00000377446 1
9606.ENSP00000340879 9606.ENSP00000436512 1
9606.ENSP00000413869 9606.ENSP00000484726 1
9606.ENSP00000272418 9606.ENSP00000362208 1
9606.ENSP00000453523 9606.ENSP00000366249 1
9606.ENSP00000485510 9606.ENSP00000358854 1
9606.ENSP00000436313 9606.ENSP00000406751 1
9606.ENSP00000417619 9606.ENSP00000360117 1
9606.ENSP00000360429 9606.ENSP00000347915 1
9606.ENSP00000476760 9606.ENSP00000473330 1
9606.ENSP00000430545 9606.ENSP00000333401 1
9606.ENSP00000388690 9606.ENSP00000449765 1
9606.ENSP00000452473 9606.ENSP00000356541 1
9606.ENSP00000433170 9606.ENSP00000396716 1
9606.ENSP00000402423 9606.ENSP00000426111 1
9606.ENSP00000357887 9606.ENSP00000362166 1
9606.ENSP00000458237 9606.ENSP00000304127 1
9606.ENSP00000462938 9606.ENSP00000400467 1
9606.ENSP00000306033 9606.ENSP00000466593 1
9606.ENSP00000456246 9606.ENSP00000356903 1
9606.ENSP00000425126 9606.ENSP00000439565 1
9606.ENSP00000400666 9606.ENSP00000463889 1
9606.ENSP00000273156 9606.ENSP00000372210 1
9606.ENSP00000464883 9606.ENSP00000405454 1
9606.ENSP00000376256 9606.ENSP00000362815 1
9606.ENSP00000431517 9606.ENSP00000468749 1
9606.ENSP00000391730 9606.ENSP00000453880 1
9606.ENSP00000463568 9606.ENSP00000361626 1
9606.ENSP00000460400 9606.ENSP00000381963 1
9606.ENSP00000275603 9606.ENSP00000365944 1
9606.ENSP00000466020 9606.ENSP00000347168 1
9606.ENSP00000380196 9606.ENSP00000448030 1
9606.ENSP00000422772 9606.ENSP00000433170 1
9606.ENSP00000290541 9606.ENSP00000440327 1
9606.ENSP00000367806 9606.ENSP00000361076 1
9606.ENSP00000439565 9606.ENSP00000400082 1
9606.ENSP00000346148 9606.ENSP00000366488 1
9606.ENSP00000451345 9606.ENSP00000319578 1
9606.ENSP00000399076 9606.ENSP00000390331 1
9606.ENSP00000350625 9606.ENSP00000408482 1
9606.ENSP00000437062 9606.ENSP00000455856 1
9606.ENSP00000466634 9606.ENSP00000455282 1
9606.ENSP00000264639 9606.ENSP00000401724 1
9606.ENSP00000405965 9606.ENSP00000452486 1
9606.ENSP00000354739 9606.ENSP00000380465 1
9606.ENSP00000429968 9606.ENSP00000391383 1
9606.ENSP00000449409 9606.ENSP00000416350 1
9606.ENSP00000467831 9606.ENSP00000400508 1
9606.ENSP00000455392 9606.ENSP00000376246 1
9606.ENSP00000266735 9606.ENSP00000423803 1
9606.ENSP00000340677 9606.ENSP00000480077 1
9606.ENSP00000460084 9606.ENSP00000415430 1
9606.ENSP00000422772 9606.ENSP00000429374 1
9606.ENSP00000305682 9606.ENSP00000355694 1
9606.ENSP00000386752 9606.ENSP00000357823 1
9606.ENSP00000364320 9606.ENSP00000357857 1
9606.ENSP00000400666 9606.ENSP00000420277 1
9606.ENSP00000325421 9606.ENSP00000364204 1
9606.ENSP00000236273 9606.ENSP00000424616 1
9606.ENSP00000302176 9606.ENSP00000340817 1
9606.ENSP00000361076 9606.ENSP00000279259 1
9606.ENSP00000412530 9606.ENSP00000279259 1
9606.ENSP00000386792 9606.ENSP00000370208 1
9606.ENSP00000439818 9606.ENSP00000466103 1
9606.ENSP00000363018 9606.ENSP00000345156 1
9606.ENSP00000449391 9606.ENSP00000243067 1
9606.ENSP00000285018 9606.ENSP00000418389 1

9606.ENSP00000397912 9606.ENSP00000389381 1
9606.ENSP00000265339 9606.ENSP00000293968 1
9606.ENSP00000259727 9606.ENSP00000419851 1
9606.ENSP00000480334 9606.ENSP00000395401 1
9606.ENSP00000225728 9606.ENSP00000255764 1
9606.ENSP00000345963 9606.ENSP00000478877 1
9606.ENSP00000468663 9606.ENSP00000326500 1
9606.ENSP00000428581 9606.ENSP00000342676 1
9606.ENSP00000362937 9606.ENSP00000401470 1
9606.ENSP00000296792 9606.ENSP00000451345 1
9606.ENSP00000261597 9606.ENSP00000476760 1
9606.ENSP00000298288 9606.ENSP00000464594 1
9606.ENSP00000299166 9606.ENSP00000425882 1
9606.ENSP00000392850 9606.ENSP00000433170 1
9606.ENSP00000441332 9606.ENSP00000390867 1
9606.ENSP00000252440 9606.ENSP00000337853 1
9606.ENSP00000415973 9606.ENSP00000361626 1
9606.ENSP00000446013 9606.ENSP00000380998 1
9606.ENSP00000467146 9606.ENSP00000290541 1
9606.ENSP00000452636 9606.ENSP00000431800 1
9606.ENSP00000421422 9606.ENSP00000390790 1
9606.ENSP00000386500 9606.ENSP00000429419 1
9606.ENSP00000296102 9606.ENSP00000347969 1
9606.ENSP00000400211 9606.ENSP00000324944 1
9606.ENSP00000337853 9606.ENSP00000485133 1
9606.ENSP00000278572 9606.ENSP00000368927 1
9606.ENSP00000430545 9606.ENSP00000484903 1
9606.ENSP00000392850 9606.ENSP00000434837 1
9606.ENSP00000053468 9606.ENSP00000400467 1
9606.ENSP00000400727 9606.ENSP00000451252 1
9606.ENSP00000348838 9606.ENSP00000356479 1
9606.ENSP00000470098 9606.ENSP00000466063 1
9606.ENSP00000457239 9606.ENSP00000427514 1
9606.ENSP00000350098 9606.ENSP00000424048 1
9606.ENSP00000324124 9606.ENSP00000432154 1
9606.ENSP00000427365 9606.ENSP00000434816 1
9606.ENSP00000480549 9606.ENSP00000314458 1
9606.ENSP00000438284 9606.ENSP00000453560 1
9606.ENSP00000452940 9606.ENSP00000418695 1
9606.ENSP00000274242 9606.ENSP00000379888 1
9606.ENSP00000466142 9606.ENSP00000265339 1
9606.ENSP00000422772 9606.ENSP00000471185 1
9606.ENSP00000369162 9606.ENSP00000481646 1
9606.ENSP00000296555 9606.ENSP00000293968 1
9606.ENSP00000358300 9606.ENSP00000358320 1
9606.ENSP00000356720 9606.ENSP00000418933 1
9606.ENSP00000448615 9606.ENSP00000439952 1
9606.ENSP00000471874 9606.ENSP00000228140 1
9606.ENSP00000386920 9606.ENSP00000296102 1
9606.ENSP00000482557 9606.ENSP00000480472 1
9606.ENSP00000429384 9606.ENSP00000362105 1
9606.ENSP00000249299 9606.ENSP00000430021 1
9606.ENSP00000463202 9606.ENSP00000478887 1
9606.ENSP00000273047 9606.ENSP00000310406 1
9606.ENSP00000435333 9606.ENSP00000351021 1
9606.ENSP00000383941 9606.ENSP00000403891 1
9606.ENSP00000453972 9606.ENSP00000394014 1
9606.ENSP00000464599 9606.ENSP00000451345 1
9606.ENSP00000466791 9606.ENSP00000400467 1
9606.ENSP00000437996 9606.ENSP00000382966 1
9606.ENSP00000358854 9606.ENSP00000432154 1
9606.ENSP00000293860 9606.ENSP00000361171 1
9606.ENSP00000369741 9606.ENSP00000378081 1
9606.ENSP00000413875 9606.ENSP00000424359 1
9606.ENSP00000261015 9606.ENSP00000265245 1
9606.ENSP00000368801 9606.ENSP00000400591 1
9606.ENSP00000483260 9606.ENSP00000377865 1
9606.ENSP00000266735 9606.ENSP00000475218 1
9606.ENSP00000323099 9606.ENSP00000465569 1
9606.ENSP00000428962 9606.ENSP00000419084 1
9606.ENSP00000323046 9606.ENSP00000469431 1
9606.ENSP00000386469 9606.ENSP00000419782 1
9606.ENSP00000259895 9606.ENSP00000358487 1
9606.ENSP00000290541 9606.ENSP00000378739 1
9606.ENSP00000354728 9606.ENSP00000376018 1
9606.ENSP00000428884 9606.ENSP00000341268 1
9606.ENSP00000388126 9606.ENSP00000410758 1
9606.ENSP00000435766 9606.ENSP00000416120 1
9606.ENSP00000362797 9606.ENSP00000420946 1
9606.ENSP00000261416 9606.ENSP00000455545 1
9606.ENSP00000261015 9606.ENSP00000400542 1
9606.ENSP00000418346 9606.ENSP00000429374 1
9606.ENSP00000346012 9606.ENSP00000434837 1
9606.ENSP00000479931 9606.ENSP00000384046 1
9606.ENSP00000463107 9606.ENSP00000386541 1
9606.ENSP00000360876 9606.ENSP00000353483 1
9606.ENSP00000363339 9606.ENSP00000481635 1
9606.ENSP00000402142 9606.ENSP00000388246 1
9606.ENSP00000482258 9606.ENSP00000458430 1
9606.ENSP00000246115 9606.ENSP00000386054 1
9606.ENSP00000452595 9606.ENSP00000475909 1
9606.ENSP00000262629 9606.ENSP00000376395 1
9606.ENSP00000300151 9606.ENSP00000480129 1
9606.ENSP00000418641 9606.ENSP00000419084 1
9606.ENSP00000438450 9606.ENSP00000429901 1
9606.ENSP00000353202 9606.ENSP00000362659 1
9606.ENSP00000391114 9606.ENSP00000436218 1
9606.ENSP00000415508 9606.ENSP00000384123 1
9606.ENSP00000352333 9606.ENSP00000476100 1
9606.ENSP00000209884 9606.ENSP00000216225 1

I was just wondering: is this a normal result for a prediction, given how these data points were used in the training?

Use multiple workers to load embeddings

Loading embeddings is a slow start-up step that could easily be parallelized: divide the list of embeddings among workers and have them all read into the same dictionary.
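A sketch of that idea, assuming one HDF5 file with one dataset per protein (file name and worker count are placeholders). Separate read-only handles per process sidestep h5py's global lock, which would largely serialize plain threads:

from concurrent.futures import ProcessPoolExecutor
import h5py

PATH, N_WORKERS = "embeddings.h5", 8  # placeholders

def load_chunk(names):
    # Each worker opens its own read-only handle and loads its share of keys.
    with h5py.File(PATH, "r") as f:
        return {n: f[n][:] for n in names}

if __name__ == "__main__":
    with h5py.File(PATH, "r") as f:
        keys = list(f.keys())
    chunks = [keys[i::N_WORKERS] for i in range(N_WORKERS)]
    embeddings = {}
    with ProcessPoolExecutor(N_WORKERS) as ex:
        for part in ex.map(load_chunk, chunks):
            embeddings.update(part)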

Queries about sequence identity in the dataset

Hi, I really like your work and idea. I have a question about the sequence identity in your paper:

"Next, we removed PPIs with high sequence redundancy to other PPIs, following the precedent of previous approaches. Specifically, we clustered proteins at the 40% similarity threshold using CD-HIT, and a PPI (A-B) was considered sequence redundant (and excluded) if we had already selected another PPI (C-D) such that the protein pairs (A, C) and (B, D) each shared a CD-HIT cluster"

When considering three proteins A, B, and C with interactions A-B and A-C, how do you assess this redundancy? Do you assess redundancy for the pairs (A, A) and (B, C), or focus on (A, C) and (B, A) instead? I think the two are totally different.
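For what it's worth, one common reading of the quoted rule treats the pair of clusters as unordered, so with A-B kept first, A-C becomes redundant whenever B and C share a cluster. A sketch (the unordered-key choice is an assumption, not something the paper confirms):

def deduplicate(pairs, cluster):
    # Keep a pair unless an already-kept pair matches it cluster-wise,
    # in either orientation (unordered cluster-pair key - an assumption).
    kept, seen = [], set()
    for a, b in pairs:
        key = frozenset((cluster[a], cluster[b]))
        if key not in seen:
            seen.add(key)
            kept.append((a, b))
    return kept

# Example matching the question: A-B is kept; A-C is then redundant because
# (A, A) trivially share a cluster and (B, C) share cluster 2.
print(deduplicate([("A", "B"), ("A", "C")], {"A": 1, "B": 2, "C": 2}))  # [('A', 'B')]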

maybe a typo bug

In contact.py:

    def forward(self, z0, z1):
        """
        :param z0: Projection module embedding :math:`(b \\times N \\times d)`
        :type z0: torch.Tensor
        :param z1: Projection module embedding :math:`(b \\times M \\times d)`
        :type z1: torch.Tensor
        :return: Predicted contact map :math:`(b \\times N \\times M)`
        :rtype: torch.Tensor
        """
        B = self.broadcast(z0, z1)
        return self.predict(C)

should it be B?
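Presumably the intended final line uses the broadcast tensor, i.e.:

    B = self.broadcast(z0, z1)
    return self.predict(B)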

Optimization Parameter Causes Exceptions In Predictions

https://github.com/samsledje/D-SCRIPT/blob/5cefd93122fc9f2b9b4e176a0fa74fca719919dd/dscript/models/interaction.py#L112C72-L112C72

When the sequence length is > 2000, the parameter used for broadcasting downstream causes multiplication errors. These errors are silenced by the try/except block in commands/predict.py. I think the choice of 2000 was to limit the length of the amino-acid sequence due to memory constraints? I propose either expanding the parameter's length or emitting a warning beforehand, since tracing the error message is non-trivial.
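Until that changes, a pre-filter along these lines (a sketch: Biopython parsing, file names borrowed from the earlier issue, and the 2000-residue cap assumed from the error message) at least makes the skipped pairs explicit:

from Bio import SeqIO
import csv

MAX_LEN = 2000  # assumed cap, based on the "size of tensor b (2000)" message
lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse("sequence.fasta", "fasta")}

with open("pair.tsv") as fin, open("pair_filtered.tsv", "w", newline="") as fout:
    writer = csv.writer(fout, delimiter="\t")
    for p0, p1, *rest in csv.reader(fin, delimiter="\t"):
        if lengths.get(p0, 0) <= MAX_LEN and lengths.get(p1, 0) <= MAX_LEN:
            writer.writerow([p0, p1, *rest])
        else:
            print(f"skipping {p0} x {p1}: sequence longer than {MAX_LEN} aa")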

I am confused about the scale of the output file

When I run the command "dscript embed --seqs /Users/chenwenqi/Desktop/paper/2021-08-12/D-SCRIPT-main/data/seqs/yeast.fasta --outfile yeast.h5", the terminal shows:

5664 Sequences Loaded

Approximate Storage Required (varies by average sequence length): ~45.312GB

Storing to yeast.h5...

Is it normal that embedding 5664 sequences needs about 45.312 GB?
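The number is plausible under some rough arithmetic, assuming the Bepler & Berger embeddings are 6165-dimensional float32 values per residue (an assumption worth checking against the docs): each protein costs length x 6165 x 4 bytes, so ~45.3 GB over 5664 sequences implies an average length of roughly 325 residues.

D_EMBED, BYTES_PER_FLOAT = 6165, 4   # assumed embedding width and float32 size
total_bytes = 45.312e9
n_seqs = 5664
avg_len = total_bytes / (n_seqs * D_EMBED * BYTES_PER_FLOAT)
print(f"implied average sequence length: {avg_len:.0f} residues")  # ~324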

TT3D - error at beginning of making predictions

I have run TT non-3D with no problems, but TT3D has generated this error.

output log file printed text

[2023-09-10-10:42:43] Using CPU
[2023-09-10-10:42:43] Loading model from /cluster/jb_lab/twaksman001/Bioinformatics/Interaction Prediction/D-SCRIPT/tt3d_v1.sav
[2023-09-10-10:42:43] Loading pairs from /cluster/jb_lab/twaksman001/Bioinformatics/Interaction Prediction/D-SCRIPT/Mp Effectors At Proteome TT3D/Pairs_MpEffectors_AtProteome_TT3D_1.txt
[2023-09-10-10:42:43] Generating Embeddings...
[2023-09-10-11:35:02] Loading FoldSeek 3Di sequences...
[2023-09-10-11:35:02] Making Predictions...

error log file printed text

(end of generating embeddings)
100%|██████████| 2370/2370 [52:18<00:00, 1.32s/it]

0%| | 0/163229 [00:00<?, ?it/s]
0%| | 0/163229 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cluster/jb_lab/twaksman001/conda/envs/bio1/bin/dscript", line 8, in <module>
    sys.exit(main())
  File "/cluster/jb_lab/twaksman001/conda/envs/bio1/lib/python3.9/site-packages/dscript/__main__.py", line 77, in main
    args.func(args)
  File "/cluster/jb_lab/twaksman001/conda/envs/bio1/lib/python3.9/site-packages/dscript/commands/predict.py", line 221, in main
    cm, p = model.map_predict(
  File "/cluster/jb_lab/twaksman001/conda/envs/bio1/lib/python3.9/site-packages/dscript/models/interaction.py", line 219, in map_predict
    (self.xx[:N] + 1 - ((N + 1) / 2)) / (-1 * ((N + 1) / 2))
  File "/cluster/jb_lab/twaksman001/conda/envs/bio1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'ModelInteraction' object has no attribute 'xx'

predictions error with nonTT3D and TT3D

Hi,

I obtained the following errors when using topsy_turvy_v1.sav. Embeddings can be generated, only predictions were halted.

File "~\miniforge3\envs\dscript\lib\site-packages\dscript\commands\predict.py", line 232, in main                      
cm, p = model.map_predict(p0, p1)  
File "~\miniforge3\envs\dscript\lib\site-packages\dscript\models\interaction.py", line 219, in map_predict             
(self.xx[:N] + 1 - ((N + 1) / 2)) / (-1 * ((N + 1) / 2))  
miniforge3\envs\dscript\lib\site-packages\torch\nn\modules\module.py", line 1186, in __getattr__               
type(self).__name__, name))                                                                                                   
AttributeError: 'ModelInteraction' object has no attribute 'xx'  

With TT3D model:

  File "~\miniforge3\envs\dscript\lib\site-packages\dscript\commands\predict.py", line 132, in main
    "A TT3D model has been provided, but no foldseek_fasta has been provided"
ValueError: A TT3D model has been provided, but no foldseek_fasta has been provided

Same protein pairs but have different labels

Hi Sam,

Thanks for your awesome work!

When I checked the tsv files in the data/pairs folder, I found some protein pairs that have the same protein IDs but different ground-truth labels (0 and 1), which really confused me. Here I attach samples found in human_train.tsv.

0 1 2
9606.ENSP00000296792 9606.ENSP00000254803 0
9606.ENSP00000284274 9606.ENSP00000352551 1
9606.ENSP00000321449 9606.ENSP00000369741 1
9606.ENSP00000340879 9606.ENSP00000369411 0
9606.ENSP00000282074 9606.ENSP00000389810 1
9606.ENSP00000389810 9606.ENSP00000282074 0
9606.ENSP00000360709 9606.ENSP00000303754 0
9606.ENSP00000254803 9606.ENSP00000296792 1
9606.ENSP00000301452 9606.ENSP00000313681 1
9606.ENSP00000367824 9606.ENSP00000467145 0
9606.ENSP00000335027 9606.ENSP00000348632 0
9606.ENSP00000368345 9606.ENSP00000482258 0
9606.ENSP00000378338 9606.ENSP00000422241 0
9606.ENSP00000300026 9606.ENSP00000316881 1
9606.ENSP00000217402 9606.ENSP00000324810 1
9606.ENSP00000230413 9606.ENSP00000320567 1
9606.ENSP00000468611 9606.ENSP00000378650 1
9606.ENSP00000384962 9606.ENSP00000365243 1
9606.ENSP00000342889 9606.ENSP00000460439 0
9606.ENSP00000388669 9606.ENSP00000367444 1
9606.ENSP00000296255 9606.ENSP00000347969 1
9606.ENSP00000354813 9606.ENSP00000443273 0
9606.ENSP00000443273 9606.ENSP00000354813 1
9606.ENSP00000422241 9606.ENSP00000378338 1
9606.ENSP00000369741 9606.ENSP00000378294 1
9606.ENSP00000347969 9606.ENSP00000296255 0
9606.ENSP00000358487 9606.ENSP00000408901 0
9606.ENSP00000316244 9606.ENSP00000248150 0
9606.ENSP00000303754 9606.ENSP00000360709 1
9606.ENSP00000316881 9606.ENSP00000300026 0
9606.ENSP00000369411 9606.ENSP00000340879 1
9606.ENSP00000229471 9606.ENSP00000288699 0
9606.ENSP00000320184 9606.ENSP00000223324 0
9606.ENSP00000305480 9606.ENSP00000055077 0
9606.ENSP00000315017 9606.ENSP00000308275 0
9606.ENSP00000367444 9606.ENSP00000388669 0
9606.ENSP00000223324 9606.ENSP00000320184 1
9606.ENSP00000443273 9606.ENSP00000346196 1
9606.ENSP00000460439 9606.ENSP00000342889 1
9606.ENSP00000324810 9606.ENSP00000217402 0
9606.ENSP00000380336 9606.ENSP00000263694 0
9606.ENSP00000453361 9606.ENSP00000385018 1
9606.ENSP00000352551 9606.ENSP00000284274 0
9606.ENSP00000230050 9606.ENSP00000363018 1
9606.ENSP00000343867 9606.ENSP00000468569 0
9606.ENSP00000250003 9606.ENSP00000242728 0
9606.ENSP00000385018 9606.ENSP00000453361 0
9606.ENSP00000378650 9606.ENSP00000468611 0
9606.ENSP00000408901 9606.ENSP00000358487 1
9606.ENSP00000348632 9606.ENSP00000335027 1
9606.ENSP00000339527 9606.ENSP00000366711 1
9606.ENSP00000248150 9606.ENSP00000316244 1
9606.ENSP00000482258 9606.ENSP00000368345 1
9606.ENSP00000320567 9606.ENSP00000230413 0
9606.ENSP00000242728 9606.ENSP00000250003 1
9606.ENSP00000288699 9606.ENSP00000229471 1
9606.ENSP00000324173 9606.ENSP00000288602 0
9606.ENSP00000386717 9606.ENSP00000461082 0
9606.ENSP00000308022 9606.ENSP00000362930 0
9606.ENSP00000288602 9606.ENSP00000324173 1
9606.ENSP00000365243 9606.ENSP00000384962 0
9606.ENSP00000313681 9606.ENSP00000301452 0
9606.ENSP00000278572 9606.ENSP00000378081 1
9606.ENSP00000369162 9606.ENSP00000375633 0
9606.ENSP00000263694 9606.ENSP00000380336 1
9606.ENSP00000376500 9606.ENSP00000385009 0
9606.ENSP00000308275 9606.ENSP00000315017 1
9606.ENSP00000366711 9606.ENSP00000339527 0
9606.ENSP00000448250 9606.ENSP00000285667 1
9606.ENSP00000482114 9606.ENSP00000330945 1
9606.ENSP00000378294 9606.ENSP00000369741 0
9606.ENSP00000429717 9606.ENSP00000294179 1
9606.ENSP00000270586 9606.ENSP00000264639 0
9606.ENSP00000428581 9606.ENSP00000320567 1
9606.ENSP00000320567 9606.ENSP00000428581 0
9606.ENSP00000461082 9606.ENSP00000386717 1
9606.ENSP00000467145 9606.ENSP00000367824 1
9606.ENSP00000385009 9606.ENSP00000376500 1
9606.ENSP00000055077 9606.ENSP00000305480 1
9606.ENSP00000330945 9606.ENSP00000482114 0
9606.ENSP00000264639 9606.ENSP00000270586 1
9606.ENSP00000362930 9606.ENSP00000308022 1
9606.ENSP00000346196 9606.ENSP00000443273 0
9606.ENSP00000363018 9606.ENSP00000230050 0
9606.ENSP00000378081 9606.ENSP00000278572 0
9606.ENSP00000375633 9606.ENSP00000369162 1
9606.ENSP00000468569 9606.ENSP00000343867 1
9606.ENSP00000363746 9606.ENSP00000388126 0
9606.ENSP00000388126 9606.ENSP00000363746 1
9606.ENSP00000285667 9606.ENSP00000448250 0
9606.ENSP00000294179 9606.ENSP00000429717 0
9606.ENSP00000369741 9606.ENSP00000321449 0

Also, similar cases happen in the yeast_test.tsv:

0 1 2
4932.YBR159W 4932.YPL220W 1
4932.YMR230W 4932.YHR021C 0
4932.YJL069C 4932.YPR137W 1
4932.YJL033W 4932.YLL011W 1
4932.YHR021C 4932.YMR230W 1
4932.YLL011W 4932.YJL033W 0
4932.YPL220W 4932.YBR159W 0
4932.YPR137W 4932.YJL069C 0

Could you please check these protein pairs and verify their ground truth? Thank you very much!

Using D-SCRIPT in slurm

I am now trying to run the D-SCRIPT embedding task on a cluster managed by Slurm. I downloaded and renamed the pre-trained language model dscript_lm_v1.pt and put it in D-SCRIPT/dscript, the same directory as pretrained.py. However, the program still thinks there is no such model file and tries to download a model from the web (and since the compute node has no network access, the run fails).
Is there any solution for running model embedding & prediction under Slurm?

Running with n_jobs = -1 throttles all CPU on the machine

Current behavior (here) by default uses n_jobs=-1 to load sequence embeddings in parallel. As a result, any other computation running on the same machine slows down drastically. This should be changed to a sensible default (16? 32?), and an n_jobs flag should be added that can be set by the user.
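A sketch of the suggested change (the cap of 16 is a placeholder):

import os

def resolve_n_jobs(n_jobs=None, cap=16):
    # Default to min(cap, available cores) instead of -1 (all cores);
    # an explicit positive n_jobs from a CLI flag overrides the default.
    if n_jobs is None or n_jobs < 1:
        return min(cap, os.cpu_count() or 1)
    return n_jobs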

Pre-trained models do not work with "novel" proteins

Hello,

I am running predict function of d-script with the following inputs taken from the download page:

Code: dscript predict --pairs human_test.tsv --seqs human.fasta --model human_v1.sav

It works. However, when I try to run it with my own added coronavirus sequences, it gives me an error:

Code: dscript predict --pairs covid_test.tsv --seqs human_extend.fasta --model lm_v1.sav

Error:

  File "/home/msalnik/.local/lib/python3.8/site-packages/dscript/commands/predict.py", line 80, in main
    model = torch.load(modelPath).cpu()
  File "/home/msalnik/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 667, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/msalnik/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/msalnik/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/msalnik/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 187, in _apply
    self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
  File "/home/msalnik/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LSTM' object has no attribute '_flat_weights_names'

This is what the covid_test.tsv looks like inside:
[screenshot omitted]

The human_extend.fasta is human.fasta with these sequences added:
[screenshot omitted]

Just a couple of notes:

  • I installed dscript as a conda environment, as detailed in the installation guide.
  • I have PyTorch 1.5, which is the recommended version; looking into the issue, I saw that the PyTorch version is often responsible for the _flat_weights_names error.

Are there any steps that I am missing for adding novel sequences not in the existing database (human.fasta)?

Add multi-GPU support for prediction (and embedding?)

A single process splits candidate interactions across the allocated GPUs, and all workers access the same loaded embeddings in memory.

  • Currently we can't run more than a certain number of different jobs in parallel, because each loads all the embeddings into its own memory, which fills up system memory quickly.
  • For large numbers of predictions, running serially on a single GPU is a waste of time and system resources.
  • Fix: the command line allocates which GPUs may be used, and subprocesses run on each one after the embeddings are loaded into shared memory (see the sketch below).
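A rough sketch of that design, assuming the embeddings fit in host RAM and GPUs 0 and 1 are available; load_embeddings and score_pairs are placeholders, not dscript functions, and the 6165 embedding dimension is an assumption:

import torch
import torch.multiprocessing as mp

def load_embeddings(path):
    # Placeholder: in practice, read per-protein tensors from the HDF5 file.
    return {"p0": torch.randn(1, 100, 6165), "p1": torch.randn(1, 120, 6165)}

def score_pairs(shard, embeddings, device):
    # Placeholder: in practice, run model.map_predict on each pair.
    for n0, n1 in shard:
        z0, z1 = embeddings[n0].to(device), embeddings[n1].to(device)
        print(n0, n1, z0.shape, z1.shape)

def worker(rank, gpu_ids, pairs, embeddings):
    device = torch.device(f"cuda:{gpu_ids[rank]}")
    score_pairs(pairs[rank::len(gpu_ids)], embeddings, device)  # disjoint shard per GPU

if __name__ == "__main__":
    gpu_ids = [0, 1]                               # GPUs allocated on the command line
    embeddings = load_embeddings("embeddings.h5")
    for t in embeddings.values():
        t.share_memory_()                          # one host copy shared by all workers
    pairs = [("p0", "p1")]
    mp.spawn(worker, args=(gpu_ids, pairs, embeddings), nprocs=len(gpu_ids))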

How can I use command line to run the model in remote serve?

Now I have installed this model on a remote server; the path is "/home/XXXX/XXXX/XXXX/XXXX/D-SCRIPT-main".
When I input "dscript embed --seqs ~//home/XXXX/XXXX/XXXX/XXXX/D-SCRIPT-main/data/seqs/yeast.fasta --outfile yeast.h5", the terminal tells me "Command 'dscript' not found, did you mean:".
I have no idea how to run the code on the remote server, but I succeeded in running the code on my computer with the same command line.
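(A guess: the pip-installed dscript console script may simply not be on the remote machine's PATH, e.g. because a different environment is active there. Activating the environment where dscript was installed, or invoking the package as a module with python3 -m dscript, as done in a later issue below, may work around this.)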

'collections.OrderedDict' object has no attribute cuda

Hi @samsledje, I tried doing the prediction through the command line this time, but I'm having an issue that I was hoping you could help me with. I am running dscript predict --pairs ecoli_test.tsv --model dscript_human_v2.pt --seqs ecoli.fasta --device 0, but I'm getting:

[2023-05-11-10:07:28] Loading model from dscript_human_v2.pt
Traceback (most recent call last):
 File "/home/jlw742/pytorch/bin/dscript", line 33, in <module>
   sys.exit(load_entry_point('dscript==0.2.2', 'console_scripts', 'dscript')())
 File "/home/jlw742/pytorch/lib/python3.9/site-packages/dscript-0.2.2-py3.9.egg/dscript/__main__.py", line 65, in main
   args.func(args)
 File "/home/jlw742/pytorch/lib/python3.9/site-packages/dscript-0.2.2-py3.9.egg/dscript/commands/predict.py", line 105, in main
   model = torch.load(modelPath).cuda()
AttributeError: 'collections.OrderedDict' object has no attribute 'cuda' 

I did some research, and apparently the issue is that the model's weights were saved as a state_dict, but they are now being loaded with the torch.load() method instead of model.load_state_dict().
I downloaded the dscript_human_v2.pt weights and am using the ecoli files from your repo.
Would you mind helping me with this? Thank you very much!
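The usual remedy in such cases is to build the architecture first and then load the weights; a hedged sketch, where build_architecture is a placeholder for the matching D-SCRIPT model class and hyperparameters:

import torch

def build_architecture():
    raise NotImplementedError("instantiate the matching dscript model class here")

obj = torch.load("dscript_human_v2.pt", map_location="cpu")
if isinstance(obj, dict):           # a state_dict rather than a pickled model
    model = build_architecture()    # placeholder
    model.load_state_dict(obj)
else:
    model = obj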

'SkipLSTM' object has no attribute 'map_predict'

python3 -m dscript predict --pairs pairs_known.tsv --seqs experimentally_known/all.fasta --model models/lm_v1.sav --outfile known_predictions.txt
# Using CPU
# Generating Embeddings...
  0%|                                                                                                                                                                               | 0/6 [00:00<?, ?it/s]Downloading model lm_v1 from http://cb.csail.mit.edu/cb/dscript/data/models/dscript_lm_v1.pt...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [02:00<00:00, 20.02s/it]
# Making Predictions...
  0%|                                                                                                                                                                               | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.7.12_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/[email protected]/3.7.12_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/dscript-0.1.6-py3.7.egg/dscript/__main__.py", line 58, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/dscript-0.1.6-py3.7.egg/dscript/__main__.py", line 54, in main
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/dscript-0.1.6-py3.7.egg/dscript/commands/predict.py", line 146, in main
    cm, p = model.map_predict(p0, p1)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 591, in __getattr__
    type(self).__name__, name))
AttributeError: 'SkipLSTM' object has no attribute 'map_predict'

Confusions about embedding dimensions

Hello,

In train.py, line 255 (cm, ph = model.map_predict(z_a, z_b)), the dimensions of z_a and z_b should be N x d.
In interaction.py, line 156, the map_predict function states that z0 and z1 should have dimension b x N x d.
At line 168, z0 and z1 are passed to cpred, which calls the cmap function in contact.py.
In contact.py, all embeddings are treated as 3-D tensors (b x N x d). If I understand correctly, the actual embeddings fed to contact.py have dimension N x d, which may cause an error; for example, line 43 in contact.py (z0 = z0.transpose(1, 2)) cannot operate on a 2-D tensor.

Could this be a bug, or am I misunderstanding something? Thank you!
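If the embeddings reaching contact.py really are N x d, a hedged fix is to add the batch dimension before calling map_predict:

import torch

z_a = torch.randn(300, 100)   # stand-in (N x d) embedding
z_b = torch.randn(250, 100)   # stand-in (M x d) embedding
z_a, z_b = z_a.unsqueeze(0), z_b.unsqueeze(0)   # now (1 x N x d) and (1 x M x d), as map_predict expects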

Using d-script with complexes

Hi, I have read your work (D-SCRIPT translates genome to phenome with sequence-based...) and enjoyed it greatly. I have a question I was hoping you could answer: if I want to use your method to predict the interaction between a protein complex and a protein, or between two protein complexes, how do you suggest I accomplish this with D-SCRIPT? I was thinking of concatenating the sequences of the complex into one long protein, but the start and end of that sequence would be artificial. Please let me know what approach you suggest.

TT3D - files?

Is there a pretrained model for TT3D available?

What is the foldseek fasta file?


Test dataset construction

Hello,

I have been doing some troubleshooting/validation of predictions we've made using D-SCRIPT. As part of this, I built some 10-negative:1-positive evaluation test datasets and used the provided human-trained model.

My understanding from the publication was that these were created by (1) filtering the known-positive PPIs down to a non-redundant list (the 40% clustering step); (2) taking that set of proteins and randomly pairing them among themselves (avoiding recreating known positives) to produce a 10-fold random-negative PPI list; (3) combining the positives and 10x negatives, after which you can run the evaluation predictions.

This should mean that the 10x random-negative PPIs only ever use proteins that were already in the filtered known-positive protein list, right? While trying to check my own work, I went to the D-SCRIPT test datasets available here on GitHub, and it looks like the random negatives include many proteins that are not represented in the positive set.

For example, from yeast_test.tsv, I am seeing 5,664 unique protein names (and this lines up with the number of fasta entries, I see 5,664 during the embed step).
If I subset yeast_test.tsv to only include positives (column3==1), I find only 2,367 unique protein names.
The remaining 3,297 are only found among the negative interactions (column3==0). I've seen similar patterns in fly_test, human_test, and human_train. My understanding was that there should be no proteins only ever used in the random-negative PPIs?

Generating the random negatives from only proteins known to interact indeed drops prediction performance, which is what I'm trying to figure out regarding my own data. The false-positive rate is high, around 4%, compared to about 0.9% on the default data. To give you an idea of what I'm troubleshooting here:

run description | positive PPIs | negative PPIs | AUPR
--- | --- | --- | ---
yeast data as published | 5,000 | 50,000, as published | 0.405
subset of published | 567 | 5,670, subset of published negatives | 0.402
subset of positives, re-made randoms from all proteins | 567 | 5,670, made from any proteins in yeast_test.tsv | 0.382
subset of published, re-made randoms | 567 | 5,670, made from proteins only in this 567-PPI subset | 0.188
subset & re-made, mmseqs2 cluster filter | 567 | 5,670, made from the positive 567 and non-redundant | 0.203

The subsets of 567 are random positive PPIs taken from yeast_test.tsv to match the sample size of our custom dataset. In row 2 I simply sub-sampled directly from the 50,000 negative PPIs.
In row 3 I made the random 10x negative set from any proteins available in the full yeast_test.tsv, but with my own code.
In rows 4-5, instead of pulling a 10x pool of negative PPIs from proteins in the full yeast_test.tsv (which includes proteins not represented in the 567-positive subset), I made them myself using only the proteins present in the 567-positive input set. In this case, the 567 positive PPIs included 787 unique protein names, all of which were incorporated into the 5,670 random negatives.

Could you clarify what I've missed about making these evaluation test data sets? Or where do the "other" proteins come from that appear in the random-negative PPIs? Possibly, the negative protein pool was just the whole proteome?

Thank you!
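For reference, a sketch of steps (2)-(3) as described above; whether the published sets instead drew negatives from the whole proteome is exactly the open question here:

import random

def make_negatives(positives, ratio=10, seed=0):
    # Pair proteins from the positive set uniformly at random, rejecting
    # known positives and duplicates, until a 10:1 negative list is built.
    rng = random.Random(seed)
    proteins = sorted({p for pair in positives for p in pair})
    known = {frozenset(p) for p in positives}
    negatives = set()
    while len(negatives) < ratio * len(positives):
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in known and pair not in negatives:
            negatives.add(pair)
    return [tuple(p) for p in negatives]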

prediction error

Hello,
Thank you for the useful tool and detailed documentation.
I ran the prediction using the human model downloaded from https://d-script.readthedocs.io/en/main/data.html (the pairs file and fasta files were from GitHub) and got the error below:

Making Predictions...
  0%|          | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/siat/hujc/anaconda3/envs/dscript/bin/dscript", line 8, in <module>
    sys.exit(main())
  File "/siat/hujc/anaconda3/envs/dscript/lib/python3.7/site-packages/dscript/__main__.py", line 56, in main
    args.func(args)
  File "/siat/hujc/anaconda3/envs/dscript/lib/python3.7/site-packages/dscript/commands/predict.py", line 160, in main
    cm, p = model.map_predict(p0, p1)
  File "/siat/hujc/anaconda3/envs/dscript/lib/python3.7/site-packages/dscript/models/interaction.py", line 171, in map_predict
    if self.do_w:
  File "/siat/hujc/anaconda3/envs/dscript/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'ModelInteraction' object has no attribute 'do_w'

Both PyTorch 1.5 and 1.7 have been tried, but I still get the same error. Could you please help figure this out?
