muhaochen / seq_ppi

This is the repository for PIPR. This repository contains the source code and links to some datasets used in the ISMB/ECCB-2019 paper "Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN".

Home Page: http://dx.doi.org/10.1093/bioinformatics/btz328

License: Apache License 2.0

Python 91.79% Shell 8.21%

seq_ppi's Issues

Basic run question

Hi,

I am interested in using your program to test for some cross-species PPIs in a pair of non-model organisms and have a couple very basic questions about running PIPR.

At least for a first pass, I’d like to use the multispecies training set you provide. I see the binary interactions (.actions) named in run.sh and their sequences (.dictionary) named in crnn.py. I also see that _yeast_wvctc_rcnn.txt is an output file from testing the yeast dataset for PPIs using the multispecies training set.
But how do I actually then bring in my own set of new protein sequences to test for possible PPIs using that training set? I don’t see a place to specify what my sequences of interest are.
Is it line 68 of crnn.py?:

# ds_file, label_index, rst_file, use_emb, hidden_dim
ds_file = '../../../yest/preprocessed/Supp-AB.tsv'

That is not a file I can find. Is that where my data would go? What would the correct format be: something like column 1 = ID, column 2 = amino acid sequence?

And a very basic question, running python run.sh from ‘binary/model_multi_species/’ gets me a syntax error:

  File "run.sh", line 1
    cd rcnn
       ^
SyntaxError: invalid syntax

I can edit run.sh to start with python ./lasagna/crnn.py, or try to run just a single line (with a path edit to find rcnn.py), e.g.:
CUDA_VISIBLE_DEVICES=4 python ./lasagna/rcnn.py ../../../multi_species/preprocessed/CeleganDrosophilaEcoli.actions.tsv -1 results/all_wvctc_rcnn_25_5.txt 3 25 150
But I saw that you have told people multiple times to just stick to running run.sh and not mess with things, so I figure I must be doing something wrong, haha.

Thank you!
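For context, run.sh is a shell script (its first line is `cd rcnn`), so the Python interpreter cannot parse it; it would normally be started with `bash run.sh`. The sketch below shows one hedged way to assemble a single run from Python instead. It assumes the six positional values in the command quoted above map to the five names in the quoted comment plus a trailing epoch count. The name `n_epochs` for that last value and the helper `build_run` are assumptions for illustration, not part of the repository.

```python
import os
import sys


def build_run(script, ds_file, label_index, rst_file, use_emb, hidden_dim,
              n_epochs, gpu="0"):
    """Return (argv, env) for one training run; hypothetical helper."""
    # Copy the current environment and pin the run to one GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    # Positional order is assumed from the quoted comment and command line.
    argv = [sys.executable, script, ds_file, str(label_index), rst_file,
            str(use_emb), str(hidden_dim), str(n_epochs)]
    return argv, env


argv, env = build_run(
    "./lasagna/rcnn.py",
    "../../../multi_species/preprocessed/CeleganDrosophilaEcoli.actions.tsv",
    -1, "results/all_wvctc_rcnn_25_5.txt", 3, 25, 150, gpu="4")
print(" ".join(argv[1:]))
# To actually launch it: subprocess.run(argv, env=env, check=True)
```

Building the argument list separately from launching it makes it easy to inspect the exact command before any training starts.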

where is the default_onehot.txt?

Thanks for the source code you provided; I learned a lot from it. According to rcnn.py, the 'default_onehot.txt' file should be in the 'embedding' folder, but I couldn't find it there. In addition, were the amino acid word vectors obtained from the yeast database or from the database of all species?

Why not add your results to your code?

Can you help me understand the calculation procedure of autocovariance (AC) vector?

It's a great paper. I will definitely cite your paper if I use the mentioned metrics.

I am confused about a few points in how the protein-level AC vector is calculated (before it is concatenated with that of the other protein, P2).

From formulas (1) and (2) of your paper, given a max gap G (e.g., G=3) and the physicochemical features (i = 1,...,14):
-We have M=42 variables AC(i,g), one for each gap (g) and each feature (i).
-If we sum over all amino acids j at this point, as in formula (1), there is no mean, std, min, or max to compute, because each AC(i,g) has only one value.

-Assume instead that we have not summed yet: each AC(i,g) variable then has multiple values (one per amino acid).
-We standardize these 42 variables (by mean and std) and then min-max scale across all the values of each variable, if I understand correctly.
-The result is 42 arrays of within-variable scaled values.
-But I'm confused about how the final 42-variable vector of the protein is calculated from these arrays. By averaging or by summing, as the formula suggests?

Sorry for such a technically detailed and awkward question.
Thank you very much. I would appreciate it if you could clarify my confusion.
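One common reading of autocovariance descriptors is that any normalization is applied to the per-residue property values first, and the centered products are then averaged along the sequence, so each (feature, gap) pair yields exactly one number per protein. The sketch below follows the widely used definition AC(i, g) = 1/(L-g) * sum over j of (X[j][i] - mean_i) * (X[j+g][i] - mean_i), which may differ in detail from formulas (1) and (2) of the paper. The function name `ac_vector` and the toy feature values are assumptions for illustration.

```python
def ac_vector(features, max_gap):
    """features: per-residue rows, each a list of (already normalized)
    property values. Returns n_props * max_gap autocovariance values."""
    length = len(features)
    n_props = len(features[0])
    if length <= max_gap:
        raise ValueError("sequence must be longer than max_gap")
    # Mean of each property over the whole sequence.
    means = [sum(row[i] for row in features) / length for i in range(n_props)]
    out = []
    for i in range(n_props):
        for g in range(1, max_gap + 1):
            # Average product of centered values that are g positions apart.
            total = sum(
                (features[j][i] - means[i]) * (features[j + g][i] - means[i])
                for j in range(length - g))
            out.append(total / (length - g))
    return out


# Toy protein: 5 residues, 2 properties (the question above uses 14).
toy = [[0.1, 1.0], [0.4, 0.0], [0.3, 1.0], [0.9, 0.0], [0.2, 1.0]]
vec = ac_vector(toy, max_gap=3)
print(len(vec))  # 2 properties * 3 gaps = 6 values
```

Under this reading, 14 properties and G=3 give the 42-value vector directly, with the averaging over residues built into each entry.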

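Hello, what requirements does your model have on the length of the input sequence?

Hello Muhao,
Your model is very well designed!
Does the preprocessing step in your model, 'Pre-trained amino acid embeddings', convert sequences of arbitrary length to a fixed length, or does it take the longest sequence length in the training set as the model's maximum input length?

Thank you!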
Wu Shiauthie
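A common way to feed variable-length sequences to a model with a fixed input size is to truncate long sequences and zero-pad short ones after the per-residue embedding lookup. The sketch below illustrates that general idea only. The toy embedding table, the `max_len` value, and the function name `embed_fixed_length` are assumptions and do not describe what the repository actually does.

```python
def embed_fixed_length(seq, table, max_len):
    """Map residues to vectors, then truncate or zero-pad to max_len rows."""
    dim = len(next(iter(table.values())))
    zero = [0.0] * dim
    # Unknown residues fall back to the zero vector (an assumption).
    rows = [list(table.get(res, zero)) for res in seq[:max_len]]
    # Pad with zero rows until the sequence has exactly max_len rows.
    rows.extend(list(zero) for _ in range(max_len - len(rows)))
    return rows


toy_table = {"A": [1.0, 0.0], "C": [0.0, 1.0]}  # toy 2-dimensional embedding
short = embed_fixed_length("AC", toy_table, max_len=4)
long_ = embed_fixed_length("ACACAC", toy_table, max_len=4)
print(len(short), len(long_))  # both have 4 rows
```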

FileNotFoundError while running cnn.py for binary PPI prediction

FileNotFoundError: [Errno 2] No such file or directory: '../../../sun/preprocessed/Supp-AB.tsv'
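Errors like this usually mean either that a relative path is being resolved against a different working directory than the script expects, or that the data file was never downloaded. A minimal sketch for checking where a relative path actually points follows. `resolve_from` is a hypothetical helper, and the path is simply the one from the error message.

```python
from pathlib import Path


def resolve_from(base_dir, relative_path):
    """Return (absolute_path, exists) for relative_path as seen from base_dir."""
    # resolve() collapses the '..' segments; the file does not need to exist.
    target = (Path(base_dir) / relative_path).resolve()
    return target, target.exists()


target, found = resolve_from(Path.cwd(), "../../../sun/preprocessed/Supp-AB.tsv")
print(target, "exists" if found else "is missing")
```

Running this from the directory in which the script is launched shows the absolute location the script would look in, which makes a wrong working directory easy to spot.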

Licensing

Hello,

I'm working on implementing several PPI prediction algorithms in a unified framework. However, many of the GitHub repositories I've come across from papers do not have any license associated with them. Could you please add a license to your code so I know how it may be used? Preferably MIT, BSD 3-Clause, or some other permissive open source license, or a statement that the code can be used for any purpose without restrictions.

Thanks.

What should I do if I want to reproduce the cross-validation corr mentioned in the paper as 0.873?

Thank you for your work and code. I have some questions about reproducing your results. I want to repeat your result on the binding affinity part. I downloaded the SKEMPI dataset and used seq_ppi/regression/model/run.sh for cross-validation after changing ds_file to the SKEMPI dataset. The results I obtained did not correspond to those in the paper. After reading the code and the paper, it seems that either the folder for the regression task (and the entire GitHub repository) is not the code used in the paper, or I missed something. For example, the paper mentions 10-fold cross-validation, while the code uses 5-fold. **What should I do if I want to reproduce the cross-validation corr mentioned in the paper as 0.873?** Can you help me with this problem?
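The number of folds changes how much data each fold trains on, so a 5-fold run is not expected to match a 10-fold figure exactly. As a hedged illustration (not the repository's code), a k-fold split with a configurable `k` can be sketched as follows. The function name `kfold_indices` and the fixed seed are assumptions.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for repeatable splits
    # Deal the shuffled indices into k folds, round-robin.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(kfold_indices(20, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 folds: 18 train, 2 test
```

With `k=5` the same 20 samples would give 16 training and 4 test samples per fold, which is one reason reported metrics can shift between the two settings.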

How to reproduce your work

When I ran your code, I found that some of the code had bugs and that some of the shell scripts seem never to have been run before.
So how did you get your results? Maybe I am wrong, but this is very confusing.
Could you provide more detailed documentation showing your workflow and how to get this work running?
By the way, it would be better if you could also provide a list describing the function of every file in your project.

The accuracy of yeast dataset

I used rcnn.py in seq_ppi/binary/model/lasagna, but the average accuracy I get on the yeast dataset is around 95.6%, which differs from the 97.02% reported in the article. Do any parameter settings in the program need to be modified? Thank you very much!

Errors in running the RCNN network on yeast dataset.

There are some minor errors in seq2tensor.py related to reading embeddings from files. I can send a pull request if that's fine.

Basic run: embeddings missing

I tried to run the default cnn.py model/script, with the "old" package version, and get the following error (a file seems to be missing):

`C:\Users\Dan Ofer\Desktop\Stuff\Datasets\seq_ppi-master\binary\model\cnn\cnn.py in <module>()
     79 n_epochs = int(n_epochs)
     80
---> 81 seq2t = s2t(emb_files[use_emb])
     82
     83 max_data = -1

C:\Users\Dan Ofer\Desktop\Stuff\Datasets\seq_ppi-master\embeddings\seq2tensor.py in __init__(self, filename)
      6         self.t2v = {}
      7         self.dim = None
----> 8         for line in open(filename):
      9             line = line.strip().split('\t')
     10             t = line[0]

FileNotFoundError: [Errno 2] No such file or directory: '../../../embeddings/default_onehot.txt'`

Saving training results

Hello, I have been trying to set up and use your model, but I cannot find any existing code that saves the trained model once training is done. The only output file I can see in rcnn.py is the one with the results on accuracy, precision, etc., written by this code at the end:
with open(rst_file, 'w') as fp:
    fp.write('acc=' + str(accuracy) + '\tprec=' + str(prec) + '\trecall=' + str(recall) + '\tspec=' + str(spec) + '\tf1=' + str(f1) + '\tmcc=' + str(mcc))

I cannot see any code that creates an H5 file with the saved model data.

Have I missed something that allows the results of training to be saved and reused? (If so, apologies.)

Value Error: all the input arrays must have same number of dimensions

Cannot find the file '../../../mtb/preprocessed/SKEMPI_seq.txt'

For the regression problem, I cannot find the processed file '../../../mtb/preprocessed/SKEMPI_seq.txt'.
Following the README, I cannot download the normalized SKEMPI dataset.
Please tell me where I can get it.
