muhaochen / seq_ppi

This is the repository for PIPR. This repository contains the source code and links to some datasets used in the ISMB/ECCB-2019 paper "Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN".

Home Page: http://dx.doi.org/10.1093/bioinformatics/btz328

License: Apache License 2.0

Python 91.79% Shell 8.21%

seq_ppi's People

Contributors

chelseaju, muhaochen

seq_ppi's Issues

Basic run question

Hi,

I am interested in using your program to test for cross-species PPIs in a pair of non-model organisms, and I have a couple of very basic questions about running PIPR.

At least for a first pass, I’d like to use the multispecies training set you provide. I see the binary interactions (.actions) named in run.sh and their sequences (.dictionary) named in crnn.py. I also see that _yeast_wvctc_rcnn.txt is an output file from testing the yeast dataset for PPIs using the multispecies training set.
But how do I then bring in my own set of new protein sequences to test for possible PPIs using that training set? I don’t see a place to specify my sequences of interest.
Is it line 68 of crnn.py?

# ds_file, label_index, rst_file, use_emb, hidden_dim
ds_file = '../../../yest/preprocessed/Supp-AB.tsv'

I can't find that file. Is that where my data would go? What would the correct format be: something like column 1 ID, column 2 amino acid sequence?

And a very basic question: running python run.sh from ‘binary/model_multi_species/’ gives me a syntax error:

  File "run.sh", line 1
    cd rcnn
       ^
SyntaxError: invalid syntax

I can edit run.sh to start with python ./lasagna/crnn.py, or try to run just a single line (with a path edit to find rcnn.py), e.g.
CUDA_VISIBLE_DEVICES=4 python ./lasagna/rcnn.py ../../../multi_species/preprocessed/CeleganDrosophilaEcoli.actions.tsv -1 results/all_wvctc_rcnn_25_5.txt 3 25 150
But I saw that you have told people several times to just stick to running run.sh and not mess with things, so I figure I must be doing something wrong haha.

Thank you!
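
The SyntaxError above arises because run.sh is a shell script, so the Python interpreter tries to parse `cd rcnn` as Python code. A minimal sketch of picking the interpreter from the file extension follows. The helper name `launcher_for` is hypothetical and is not part of the repository.

```python
import sys


def launcher_for(script_path):
    """Return the command list that should start the given script.

    Hypothetical helper for illustration: shell scripts go to bash,
    Python scripts go to the current Python interpreter.
    """
    if script_path.endswith(".sh"):
        return ["bash", script_path]
    if script_path.endswith(".py"):
        return [sys.executable, script_path]
    raise ValueError("unsupported script type: " + script_path)


if __name__ == "__main__":
    # run.sh starts with `cd rcnn`, which is shell syntax, not Python.
    print(launcher_for("run.sh"))
```

In practice this means invoking the script as `bash run.sh` from the directory that contains it, rather than `python run.sh`.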

Where is default_onehot.txt?

Thanks for the source code you provided; I learned a lot from it. According to rcnn.py, the 'default_onehot.txt' file should be in the 'embeddings' folder, but I can't find it there. In addition, are the amino acid word vectors obtained from the yeast database or from the database of all species?

Can you help me understand the calculation procedure of the autocovariance (AC) vector?

It's a great paper. I will definitely cite your paper if I use the mentioned metrics.

I have some confusion about how the protein-level AC vector is calculated (before it is concatenated with that of the other protein, P2).

From formulas (1) and (2) of your paper, given a max gap G (e.g., G = 3) and physicochemical features (i = 1, ..., 14):
- We have M = 42 variables AC(i,g), one for each gap (g) and each feature (i).
- If we sum over the amino acids j at this point, as in formula (1), there is no mean, std, min, or max to compute, since there is only one value for each AC(i,g).

- Assume instead that we have not summed yet: each AC(i,g) variable then has multiple values (one per amino acid).
- We standardize these 42 variables (by mean and std) and then min-max scale across all values of each variable, if I understand correctly.
- The result is 42 arrays of within-variable scaled values.
- But I'm confused about how to calculate the final 42-variable vector of the protein from these arrays. By averaging or summing, as the formula suggests?

Sorry for such a technically detailed and awkward question.
Thank you very much. I would appreciate it if you could clarify my confusion.
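
A commonly used formulation of the auto covariance descriptor may help separate the two steps in question: the physicochemical property values are normalized once per property, before any AC term is computed, and AC(i, g) is then an average over sequence positions, which gives a single number per property and gap. The sketch below assumes that formulation. It is not taken from this repository and is not confirmed to match formulas (1) and (2) of the paper; the names `ac_vector` and `features` are hypothetical.

```python
def ac_vector(features, max_gap):
    """Auto covariance descriptor for one protein (illustrative sketch).

    features: list of per-residue rows, one value per property, assumed
              to be already normalized per property (an assumption).
    max_gap:  largest gap G; gaps 1..G are used.
    Returns a flat list of length (number of properties) * max_gap.
    """
    n = len(features)
    if n <= max_gap:
        raise ValueError("sequence must be longer than max_gap")
    n_props = len(features[0])
    result = []
    for i in range(n_props):
        column = [row[i] for row in features]
        mean = sum(column) / n
        for gap in range(1, max_gap + 1):
            # Average the product of deviations over all position pairs
            # that are exactly `gap` residues apart.
            total = sum(
                (column[j] - mean) * (column[j + gap] - mean)
                for j in range(n - gap)
            )
            result.append(total / (n - gap))
    return result


if __name__ == "__main__":
    # Toy protein: 4 residues, 1 property (made-up values).
    toy = [[1.0], [2.0], [3.0], [4.0]]
    print(ac_vector(toy, max_gap=1))
```

Under this reading, 14 properties and G = 3 give 42 values, one per (property, gap) pair, so no further averaging or scaling across residues is needed after the sum.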

Licensing

Hello,

I'm trying to implement several PPI prediction algorithms in a unified framework, but many of the GitHub repositories I've come across from papers have no license associated with them. Could you please add a license to your code so I know how it can be used? Preferably MIT, BSD 3-Clause, or some other permissive open-source license, or a statement that the code can be used for any purpose without restrictions.

Thanks.

What should I do if I want to reproduce the cross-validation corr of 0.873 reported in the paper?

Thank you for your work and code. I have some questions about reproducing your results on the binding affinity task. I downloaded the SKEMPI dataset and ran seq_ppi/regression/model/run.sh for cross-validation after changing ds_file to point to the SKEMPI dataset. The results I obtained do not match those in the paper. After reading the code and the paper, it seems that either the code in the regression folder (and in the repository as a whole) is not the code used in the paper, or I have missed something. For example, the paper mentions 10-fold cross-validation, while the code uses 5-fold. **What should I do if I want to reproduce the cross-validation corr of 0.873 reported in the paper?** Can you help me with this problem?
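
When comparing runs, it can help to make the number of folds an explicit parameter and to compute the correlation the same way each time. The sketch below is a generic k-fold split plus a Pearson correlation in plain Python. It is not the repository's code, and `k_fold_indices` and `pearson` are hypothetical names.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    folds = k_fold_indices(n_samples=20, k=10)
    for test_fold in folds:
        train = [i for fold in folds if fold is not test_fold for i in fold]
        # Train on `train`, predict on `test_fold`, then score with pearson().
        print(len(train), len(test_fold))
```

Switching between 5-fold and 10-fold is then a one-argument change, which makes it easier to see how much of a gap in the reported correlation comes from the fold count alone.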

How to reproduce your work

When I ran your code, I found that some of it has bugs, and some of the shell scripts seem never to have been run before.
So how did you get your results? Maybe I am wrong, but this is very confusing.
Could you provide more detailed documentation of your workflow and of how to get the code working?
By the way, it would help if you could provide a list describing what each file in the project does.

The accuracy of yeast dataset

I used rcnn.py in seq_ppi/binary/model/lasagna, but the average accuracy I get on the yeast dataset is around 95.6%, which differs from the 97.02% reported in the article. Do any parameter settings in the program need to be modified? Thank you very much!

Basic run: embeddings missing

I tried to run the default cnn.py model/script with the "old" package version and got the following error (a file seems to be missing):

`C:\Users\Dan Ofer\Desktop\Stuff\Datasets\seq_ppi-master\binary\model\cnn\cnn.py in <module>()
     79 n_epochs = int(n_epochs)
     80
---> 81 seq2t = s2t(emb_files[use_emb])
     82
     83 max_data = -1

C:\Users\Dan Ofer\Desktop\Stuff\Datasets\seq_ppi-master\embeddings\seq2tensor.py in __init__(self, filename)
      6 self.t2v = {}
      7 self.dim = None
----> 8 for line in open(filename):
      9 line = line.strip().split('\t')
     10 t = line[0]

FileNotFoundError: [Errno 2] No such file or directory: '../../../embeddings/default_onehot.txt'`
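
The traceback shows that seq2tensor.py reads the embedding file line by line, splits each line on a tab, and takes the first field as the token. A minimal sketch for generating a one-hot file in that spirit follows. The layout of the second field (space-separated numbers), the 20-letter amino-acid alphabet, and the function name `write_onehot` are all assumptions; check them against seq2tensor.py before relying on the output.

```python
import os
import tempfile

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: 20 standard residues


def write_onehot(path, alphabet=AMINO_ACIDS):
    """Write one line per token: token, a tab, then a one-hot vector.

    Assumption: vector components are separated by single spaces.
    """
    with open(path, "w") as fp:
        for index, token in enumerate(alphabet):
            vector = ["1" if i == index else "0" for i in range(len(alphabet))]
            fp.write(token + "\t" + " ".join(vector) + "\n")


if __name__ == "__main__":
    # Write to a temporary directory so nothing in the repo is overwritten.
    out_path = os.path.join(tempfile.mkdtemp(), "default_onehot.txt")
    write_onehot(out_path)
    # Read the file back the same way the loader in the traceback does.
    with open(out_path) as fp:
        for line in fp:
            fields = line.strip().split("\t")
            print(fields[0], fields[1][:9])
```

A file produced this way is only a stand-in for the missing one; whether it reproduces the authors' original default_onehot.txt cannot be told from the traceback alone.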

Saving training results

Hello, I have been trying to set up and use your model, but I cannot find any existing code that saves the trained model once training is done. The only output file I can see in rcnn.py is the one with the results for accuracy, precision, etc., written by this code at the end:
with open(rst_file, 'w') as fp:
    fp.write('acc=' + str(accuracy) + '\tprec=' + str(prec) + '\trecall=' + str(recall) + '\tspec=' + str(spec) + '\tf1=' + str(f1) + '\tmcc=' + str(mcc))

I cannot see any code that creates an H5 file with the saved model data.

Have I missed something that allows the results of training to be saved and reused? (If so, apologies.)
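
For reference, the result file written by the snippet quoted above is a single tab-separated line of name=value pairs. A small sketch for reading such a line back into a dictionary follows. `parse_metrics` is a hypothetical helper, not part of the repository, and it does not address saving the trained model itself.

```python
def parse_metrics(line):
    """Parse a tab-separated line of name=value pairs into a dict of floats."""
    metrics = {}
    for field in line.strip().split("\t"):
        name, _, value = field.partition("=")
        metrics[name] = float(value)
    return metrics


if __name__ == "__main__":
    # Made-up values, only to show the layout of the line.
    example = "acc=0.5\tprec=0.25\trecall=0.75"
    print(parse_metrics(example))
```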
