dppi's Introduction

DPPI

A convolutional neural network to predict protein-protein interactions (PPIs).

Main Command: th main.lua -dataset myTrain -learningRate 0.01 -momentum 0.9 -string first-run -device 1 -top_rand -batchSize 2 -saveModel

==> Input parameters:

    -dataset: Name of the training data (e.g. myTrain)

    -string: A suffix that is added to the result file
    
    -device: GPU number

==> Necessary input files before running the command:

    -Training data: It is in .dat format.
     The name of this file should be the name of your training data followed by '_labels' (e.g. myTrain_labels.dat).

     The .dat file is made using a script called 'convert_csv_to_dat.lua'.

    -Validation data: Same as the training data. The name of this file is the training-data name followed by '_valid'
     (e.g. myTrain_valid_labels.dat).

    As with the training data, you can make the .dat file using convert_csv_to_dat.lua.

    -Cropped profiles of proteins: In .t7 format. This file is made using a script called 'create_crop.lua'.

    -Number of crops per profile: In .t7 format. This file is also made by 'create_crop.lua'.
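Before running the Main Command, it can help to verify that all four files listed above are in place. A minimal pre-flight check (a sketch only; the filenames follow the naming conventions in this README, with the crop size 512 taken from the examples):

```python
import os

def check_inputs(dataset, crop_size=512):
    """Return the list of files main.lua expects but that are missing."""
    required = [
        f"{dataset}_labels.dat",                   # training labels (convert_csv_to_dat.lua)
        f"{dataset}_valid_labels.dat",             # validation labels
        f"{dataset}_profile_crop_{crop_size}.t7",  # cropped profiles (crop script)
        f"{dataset}_number_crop_{crop_size}.t7",   # number of crops per profile
    ]
    return [name for name in required if not os.path.exists(name)]

missing = check_inputs("myTrain")
if missing:
    print("Missing input files:", ", ".join(missing))
```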

====================================================

convert_csv_to_dat.lua: This script converts a CSV file to a .dat file.

Command:

th convert_csv_to_dat.lua -dataset myTrain
th convert_csv_to_dat.lua -dataset myTrain_valid

==> Input parameters:

    -dataset: name of the dataset in csv format without suffix (e.g. myTrain).
    
    This file contains three columns: the first and second columns are two proteins, and the third column is either 1 or 0,
    
    indicating whether the two proteins interact (e.g. myTrain.csv and myTrain_valid.csv). 
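For illustration, a CSV in the expected three-column format could be generated like this (the protein IDs are made-up placeholders; this assumes a plain comma-separated file with no header row):

```python
import csv

# Each row: protein A, protein B, label (1 = interact, 0 = do not interact).
pairs = [
    ("P12345", "Q67890", 1),
    ("P12345", "O11111", 0),
]

with open("myTrain.csv", "w", newline="") as f:
    csv.writer(f).writerows(pairs)
```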

==> Output:

    dataset in dat format (e.g. myTrain_labels.dat or myTrain_valid_labels.dat) 

====================================================

creat_crop.lua: This script makes the cropped profiles.

Command:

th creat_crop.lua -dataset myTrain

==> Input parameters:

    -dataset: name of your training data (e.g. ‘myTrain’).

==> Output:

    1) Cropped profiles: In .t7 format. The name of this file is the input name followed by '_profile_crop_512' (e.g. myTrain_profile_crop_512.t7).

    2) Number of crops per profile: In .t7 format. The name of this file is the input name followed by '_number_crop_512' (e.g. myTrain_number_crop_512.t7)

==> Necessary input files before running the command:

    -You should have one file and one folder, both named after -dataset:
    
    1) A file with the suffix '.node' (e.g. myTrain.node). This file has one column listing the names of all proteins in the training and validation data. 

    2) A profile folder with the same name as -dataset (e.g. myTrain). This folder contains the profiles of all proteins in the training and validation data. 
    
    The names of the profile files inside this folder match the protein names in the '.node' file 
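The layout described above can be sketched as follows (the protein names are placeholders, and the empty profile files stand in for real PSSM profiles):

```python
import os

dataset = "myTrain"
proteins = ["P12345", "Q67890", "O11111"]  # hypothetical protein IDs

# 1) <dataset>.node: one protein name per line.
with open(dataset + ".node", "w") as f:
    f.write("\n".join(proteins) + "\n")

# 2) A folder named <dataset>, with one profile file per protein,
#    named exactly as in the .node file.
os.makedirs(dataset, exist_ok=True)
for name in proteins:
    open(os.path.join(dataset, name), "w").close()  # placeholder profile
```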

====================================================

Please remember: before running the Main Command you need to change the data directory and work directory

in the main.lua file at lines 5 and 6, replacing '$HOME' with your own data directory and work directory.

dppi's People

Contributors

hashemifar


dppi's Issues

Segment fault

Hello, I am running your code: th main.lua -dataset myTrain -learningRate 0.01 -momentum 0.9 -string first-run -device 1 -top_rand -batchSize 2 -saveModel (I modified the paths in main.lua and used your data) and a segmentation fault occurred. I have not been able to solve this problem despite long debugging. The main errors are these three:

[1] [475221.681741] python[35692]: segfault at 20 ip 00007ff51a25a760 sp 00007ffe58687d00 error 4 in python3.7[7ff51a179000+1e1000]
[2] [544549.003939] a.out[24933]: segfault at 6023e0 ip 0000000000401600 sp 00007ffd7fad55a0 error 7 in a.out[400000+3000]
[3] [992757.284087] luajit[50534]: segfault at 18 ip 00000000004687d3 sp 00007ffd37779388 error 4 in luajit[400000+99000]

I want to know whether there is a problem with my environment or with the way I run the code. Thank you!

example file for training set

It is not immediately clear what the README file means by "Training data: It is in dat format. Training data contains three column where first and second columns are two proteins". Similarly, it is unclear what it means by "Cropped profiles of proteins: It is in t7 format. This file is made using a script called 'create_crop.lua'." Is it possible to provide small example files for the training data and the expected output?

PSSM profiles

Hi,
I am trying to recreate your results. I have a query:
are the PSSM profiles provided in myTrain folder real or just toy examples?
Because the dimensions don't match for protein Q08999 for example.
Thanks!
Regards,
Saby

from protein sequence to training data

Hi, We would like to try out your code to predict PPI; however, we are having trouble understanding the input format. Given two lists of protein sequences (positive and negative sets), how do we convert the primary protein sequences to the format you have under the folder myTrain/? Thanks!

Average Pooling to encode protein profile

Hi, in the paper, the protein profiles P are converted to vectors as
o = Pool(ReLU(Batch(Conv(P))))

In the Supplement, it seems you are using average pooling with some window size l_p. In the code, it seems that l_p is 4. Then you have to "flatten" all the average-pool vectors. How can this produce the final vector of size 1×d, where d is the number of filters, as in the Supplement?
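The dimension count in question can be made concrete with a toy NumPy example (shapes chosen only for illustration): non-overlapping average pooling with window l_p over a length-n output of d filters, followed by flattening, yields (n/l_p)*d values rather than d.

```python
import numpy as np

n, d, l_p = 16, 8, 4                 # toy sequence length, number of filters, pool window
conv_out = np.arange(n * d, dtype=float).reshape(n, d)

# Non-overlapping average pooling with window l_p along the sequence axis.
pooled = conv_out.reshape(n // l_p, l_p, d).mean(axis=1)  # shape (n/l_p, d) = (4, 8)

flat = pooled.reshape(-1)            # 32 values after flattening, not d = 8
```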

Code Request

Hello, I am a graduate student studying in a related field, and your model algorithm has given me a great inspiration. If possible, could you take a look at the source code?

Unclear how invariance to input profiles is achieved

From the paper,
o1 = ReLU(Batch([W1 W2] R1))
o2 = ReLU(Batch([W2 W1] R2))

If I reverse(_r) the order of R1 and R2, I get,
o2_r = ReLU(Batch([W1 W2] R2))
o1_r = ReLU(Batch([W2 W1] R1))

The Hadamard product q = o1 . o2 doesn't appear to be the same as q_r = o2_r . o1_r
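The asymmetry described here is easy to probe numerically. A quick sketch with random matrices (dimensions arbitrary; batch normalization omitted for simplicity), following the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                          # arbitrary small dimensions
W1 = rng.standard_normal((d, k))
W2 = rng.standard_normal((d, k))
R1 = rng.standard_normal(2 * k)      # [W1 W2] is d x 2k, so R is length 2k
R2 = rng.standard_normal(2 * k)

relu = lambda x: np.maximum(x, 0.0)
W12 = np.hstack([W1, W2])            # [W1 W2]
W21 = np.hstack([W2, W1])            # [W2 W1]

# Original order of the pair:
q = relu(W12 @ R1) * relu(W21 @ R2)

# Reversed order of the pair:
q_r = relu(W12 @ R2) * relu(W21 @ R1)

# For random weights, q and q_r generally differ elementwise.
```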

Could you please explain how the network is invariant to the order in which the protein sequences within each pair are input?

Saby

Main Command: th main.lua -dataset myTrain -learningRate 0.01 -momentum 0.9 -string first-run -device 1 -top_rand -batchSize 2 -saveModel

I get an error when using this argument: -top_rand.

This argument is not being accepted by the main.lua program.

Could you please let me know the significance of this argument?
