oubounyt / deepromoter

This project forked from egochao/deepromoter

License: MIT License

DeePromoter

PyTorch implementation of DeePromoter, an active sequence detection model for promoters (DNA subsequences that regulate transcription initiation of a gene by controlling the binding of RNA polymerase).

Updates

  • 2021-07-08 : Finished the training and testing scripts for DeePromoter

Requirements

  • Install torch==1.9 from https://pytorch.org

  • Install the other Python dependencies with

    pip3 install -r requirements.txt

Dataset

The currently supported dataset is:

  • EPDnew : A collection of experimentally validated promoters for selected model organisms. Evidence comes from TSS mapping in high-throughput experiments such as CAGE and oligo-capping

Preprocessing

The Human and Mouse datasets have already been processed and stored in ./data

Procedure for creating the negative dataset, as described in the paper:

  • Step 1: Split each positive (promoter) sequence into N parts (N = 20 in the paper)

  • Step 2: Randomly choose M of those parts to keep unchanged, and randomly re-initialize the rest

  • Step 3: At every training step, mix the positive batch with the negative batch and train on the combined batch
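The three steps above can be sketched in plain Python. This is a minimal illustration, not the code used in this repository: the function names `make_negative` and `mixed_batch`, the A/C/G/T alphabet, and the value of `n_keep` in the usage note are assumptions made for the example.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet for the randomly re-initialized parts


def make_negative(seq, n_parts=20, n_keep=8, rng=None):
    """Build one negative sample from a positive sequence (hypothetical helper)."""
    rng = rng or random.Random()
    part_len = len(seq) // n_parts
    # Step 1: split into n_parts chunks; the last chunk absorbs any remainder.
    parts = [seq[i * part_len:(i + 1) * part_len] for i in range(n_parts - 1)]
    parts.append(seq[(n_parts - 1) * part_len:])
    # Step 2: keep n_keep chunks as they are, replace the others with random bases.
    keep = set(rng.sample(range(n_parts), n_keep))
    out = []
    for idx, part in enumerate(parts):
        if idx in keep:
            out.append(part)
        else:
            out.append("".join(rng.choice(ALPHABET) for _ in part))
    return "".join(out)


def mixed_batch(positives, n_keep, rng=None):
    """Step 3: pair every positive (label 1) with a derived negative (label 0)."""
    rng = rng or random.Random()
    batch = [(s, 1) for s in positives]
    batch += [(make_negative(s, n_keep=n_keep, rng=rng), 0) for s in positives]
    rng.shuffle(batch)
    return batch
```

For example, `mixed_batch(positive_sequences, n_keep=8)` returns a shuffled list of `(sequence, label)` pairs twice as long as the input. The value 8 is only a placeholder here; use the M given in the paper.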

Training

    python3 train.py -d data/human/nonTATA/hs_pos_nonTATA.txt --experiment_name human_nonTATA

Early stopping is implemented: training stops automatically when the Matthews correlation coefficient (MCC) saturates
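As a rough illustration of what MCC-based early stopping can look like, the sketch below computes the MCC from binary labels and stops once it has not improved for a number of consecutive evaluations. The `EarlyStopper` class, its `patience` and `min_delta` parameters, and the reading of "saturated" as "no improvement for `patience` evaluations" are assumptions for this example; the criterion in train.py may differ.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (0/1); returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


class EarlyStopper:
    """Hypothetical helper: stop after `patience` evaluations without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, mcc):
        if mcc > self.best + self.min_delta:
            # New best score: remember it and reset the counter.
            self.best = mcc
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In a training loop, one would call `should_stop(matthews_corrcoef(labels, predictions))` after each validation pass and break out of the loop when it returns `True`.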

The results are saved to ./output/experiment_name

To continue training from a checkpoint, pass the path to the weight file with the -w or --weight flag

Inference

Prepare your dataset in txt format, with one DNA sequence (length 300) per line
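Before running inference, it can help to check that a file matches this format. The sketch below is a hypothetical pre-check, not part of this repository: the function name `check_sequences`, the restriction to the letters A, C, G and T, and the choice to skip blank lines are assumptions.

```python
VALID_BASES = set("ACGT")  # assumed alphabet; adjust if your data uses other symbols


def check_sequences(lines, expected_len=300):
    """Return the cleaned sequences, or raise ValueError naming the first bad line."""
    seqs = []
    for lineno, line in enumerate(lines, start=1):
        seq = line.strip().upper()
        if not seq:
            continue  # ignore blank lines
        if len(seq) != expected_len:
            raise ValueError(
                f"line {lineno}: length {len(seq)}, expected {expected_len}"
            )
        bad = set(seq) - VALID_BASES
        if bad:
            raise ValueError(
                f"line {lineno}: unexpected characters {sorted(bad)}"
            )
        seqs.append(seq)
    return seqs
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, for example inside `with open(path) as f:`.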

Run inference with

    python3 test.py -d data/human/nonTATA/hs_pos_nonTATA.txt -w path_to_weight

The output is saved to infer_results.txt in the main folder

Implementation Issues

Negative sampling

  1. In addition to the negative sampling described in the paper (see Preprocessing), I added a random dataset to help the model generalize.

Parallel convolution

  1. The authors used grid search to find the optimal parameters for the network. I used the final set of parameters from the paper: kernel sizes = [27, 14, 7] and max pooling with kernel size 6.
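To get a feel for these parameters, the sketch below computes the length of each parallel branch for a 300-base input. It assumes stride 1 and no padding for the convolutions, and a pooling stride equal to the pooling kernel (the PyTorch defaults for Conv1d and MaxPool1d); the model in this repository may use different settings, and the function names are made up for the example.

```python
def out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer (dilation 1, floor mode)."""
    return (length + 2 * padding - kernel) // stride + 1


def branch_lengths(seq_len=300, kernel_sizes=(27, 14, 7), pool_kernel=6):
    """Map each conv kernel size to (length after conv, length after max pooling)."""
    result = {}
    for k in kernel_sizes:
        after_conv = out_len(seq_len, k)
        after_pool = out_len(after_conv, pool_kernel, stride=pool_kernel)
        result[k] = (after_conv, after_pool)
    return result


print(branch_lengths())
# {27: (274, 45), 14: (287, 47), 7: (294, 49)}
```

Under these assumptions the three branches produce different lengths (45, 47 and 49 positions), which matters if their outputs are concatenated before the next layer.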

References

  1. DeePromoter paper

Contributors

  • dieubat

  • egochao
