Geoplus

Introduction

Geographic representation of transcripts as vectors (Geo2vec) describes the landscape of an entire transcript through either grids (of equal width) or regions (of unequal width). In this work, we explore different strategies for applying Geo2vec encoding to enhance classical models in various N6-methyladenosine (m6A) RNA methylation prediction tasks. Specifically, for single-nucleotide m6A prediction, we constructed a multi-modal model, GepSe (short for Geography plus Sequences), that accepts both sequence and geographic encoding inputs. Since Illumina-based sequencing methods do not reveal which specific isoform transcript carries a detected modification, we also developed an isoform-aware deep neural network under the multiple instance learning framework (i-GepSe). MeRIP-seq (m6A-seq) is a widely used approach that has generated many tissue-specific m6A datasets. Given its low resolution, we previously developed WeakRM, a weakly supervised learning framework capable of building predictors from MeRIP-seq data and providing predictions at ~50-nt resolution. Using Geo2vec encoding, we constructed ti-GepSe, which further enhances prediction performance. Oxford Nanopore direct RNA sequencing provides a new, feasible solution for detecting RNA modifications with simplified experimental procedures. We also explored strategies to detect m6A using Nanopore signal-derived features and Geo2vec encoding.
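
To make the grid idea concrete, here is a minimal Python sketch. It is not the Geo2vec implementation (Geo2vec is an R package, and its actual encodings may differ). It only illustrates splitting a transcript into a fixed number of equal-width grids and marking the grid that contains a site. The function name grid_one_hot, the 0-based position convention, and the default of 10 grids are assumptions made for illustration.

```python
def grid_one_hot(site_pos, tx_length, n_grids=10):
    """Return a one-hot list marking the equal-width grid that holds a site.

    site_pos is a 0-based position on the transcript (hypothetical convention).
    """
    if tx_length <= 0 or n_grids <= 0:
        raise ValueError("tx_length and n_grids must be positive")
    if not 0 <= site_pos < tx_length:
        raise ValueError("site_pos must lie within the transcript")
    # Integer arithmetic maps the relative position to an index in [0, n_grids - 1].
    grid_index = site_pos * n_grids // tx_length
    encoding = [0.0] * n_grids
    encoding[grid_index] = 1.0
    return encoding


if __name__ == "__main__":
    # A site at position 250 on a 1000-nt transcript falls in the third of ten grids.
    print(grid_one_hot(250, 1000))
```

Because the grid count is fixed, transcripts of different lengths map to vectors of the same size, which is what lets such an encoding be fed to a model alongside the sequence input.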

Requirements

  • Python 3.x (3.8.8)
  • TensorFlow 2.3.2
  • NumPy 1.18.5
  • scikit-learn 0.24.1
  • Argparse 1.4.0
  • prettytable 2.1.0
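
As a convenience, the listed versions can be compared against the current environment with a short script. This is a sketch that is not part of the repository; the function name check_versions is hypothetical, and argparse is left out because it ships with the Python standard library.

```python
from importlib import metadata

# Versions copied from the Requirements list.
REQUIRED = {
    "tensorflow": "2.3.2",
    "numpy": "1.18.5",
    "scikit-learn": "0.24.1",
    "prettytable": "2.1.0",
}


def check_versions(required=REQUIRED):
    """Map each package name to (installed_version_or_None, listed_version)."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions().items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: installed={installed}, listed={wanted} ({status})")
```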

Installation

Please clone this repository as follows:

git clone https://github.com/daiyun02211/Geoplus.git
cd ./Geoplus

Please also see the R package Geo2vec for feature extraction. To install Geo2vec from GitHub, use the following commands in the R console:

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("daiyun02211/Geo2vec")

Usage

Single-nucleotide m6A predictor (GepSe and i-GepSe)

An example of raw data (genomic coordinates formatted as a .bed file) can be found in Examples/base_predictor. Two preprocessing steps are required:

  1. step1_generate_encoding.R generates the sequence encoding and the Geo2vec encoding using the R package Geo2vec;
  2. step2_rds2npy.py converts the generated encodings to a Python format suitable for modeling.

Python code for GepSe and i-GepSe can be found in Scripts/GepSe, and saved weights can be found in Weights/base_predictor:

python Scripts/GepSe/main.py --mode infer --data_dir ./Examples/base_predictor/processed/ --geo_enc chunkTX --tx long --cp_dir ./Weights/base_predictor/tech_shared/GepSe/ --save_dir ./
python Scripts/GepSe/main.py --mode infer --data_dir ./Examples/base_predictor/processed/ --geo_enc chunkTX --tx all --cp_dir ./Weights/base_predictor/tech_shared/iGepSe/ --save_dir ./

Optional arguments are provided to ease usage:

  • --mode: One of three modes: train, eval, or infer;
  • --data_dir: The directory where the processed data is stored;
  • --geo_enc: The Geo2vec encoding type, which should match the encoding generated during preprocessing;
  • --tx: The transcript selection approach: None (sequence only), long (sequence + encoding of the longest transcript), or all (sequence + encoding of all mapped transcripts);
  • --cp_dir: The directory where the trained network weights (checkpoints) are stored.

Further arguments can be listed with:

python Scripts/GepSe/main.py -h
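
For readers who want to see how the flags above fit together, here is a minimal argparse sketch. It is a hypothetical parser that mirrors the documented options and is not the actual Scripts/GepSe/main.py. The defaults, the help strings, and the representation of None as a string are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; not the real main.py."""
    parser = argparse.ArgumentParser(description="Sketch of the GepSe command line")
    parser.add_argument("--mode", choices=["train", "eval", "infer"], default="infer")
    parser.add_argument("--data_dir", required=True, help="Processed data directory")
    parser.add_argument("--geo_enc", default="chunkTX", help="Geo2vec encoding type")
    # 'None' is kept as a plain string here; how the real script encodes it is unknown.
    parser.add_argument("--tx", choices=["None", "long", "all"], default="long")
    parser.add_argument("--cp_dir", help="Directory holding saved checkpoints")
    parser.add_argument("--save_dir", default="./", help="Directory for outputs")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(
        ["--mode", "infer", "--data_dir", "./Examples/base_predictor/processed/", "--tx", "all"]
    )
    print(args.mode, args.geo_enc, args.tx)
```

Restricting --mode and --tx with choices makes a mistyped value fail at parse time rather than later during model construction.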

Tissue-specific m6A predictor (ti-GepSe)

Please note that ti-GepSe was trained on MeRIP-seq data with an instance length of 50. Therefore, the default ti-GepSe provides predictions at up to 50-nt resolution. Python code for ti-GepSe can be found in Scripts/tiGepSe, and saved weights can be found in Weights/tissue_predictor:

python Scripts/tiGepSe/main.py --mode infer --data_dir ./Examples/tissue_predictor/ --tissue Lung --len 50 --cp_dir ./Weights/tissue_predictor/ --save_dir ./

Optional arguments are provided to ease usage:

  • --mode: One of three modes: train, eval, or infer;
  • --data_dir: The directory where the tissue data is stored. input_dir is generated automatically from --data_dir and --tissue;
  • --tissue: One of the 25 tissue types;
  • --len: The instance length. MeRIP-seq peak data is divided into instances for modeling with multiple instance learning;
  • --cp_dir: The directory where the trained network weights (checkpoints) are stored. cp_path is generated automatically from --cp_dir and --tissue.

Further arguments can be listed with:

python Scripts/tiGepSe/main.py -h
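
The role of the instance length can be illustrated with a short sketch. It assumes consecutive, non-overlapping windows and picks the highest-scoring instance as the localized prediction. Both choices are simplifications for illustration: the actual ti-GepSe preprocessing and aggregation may differ (for example in stride, padding, or pooling), and the function names are hypothetical.

```python
def split_into_instances(sequence, instance_len=50):
    """Cut a peak sequence into consecutive, non-overlapping instances.

    The last instance may be shorter than instance_len; padding is left out.
    """
    if instance_len <= 0:
        raise ValueError("instance_len must be positive")
    return [
        sequence[start:start + instance_len]
        for start in range(0, len(sequence), instance_len)
    ]


def top_instance(instance_scores):
    """Return the index of the highest-scoring instance in a bag."""
    return max(range(len(instance_scores)), key=instance_scores.__getitem__)


if __name__ == "__main__":
    peak = "ACGU" * 30  # a made-up 120-nt peak sequence
    bag = split_into_instances(peak, instance_len=50)
    print([len(instance) for instance in bag])
    # Made-up scores standing in for per-instance model outputs.
    print(top_instance([0.1, 0.8, 0.3]))
```

Since each instance spans at most 50 nt, identifying the top-scoring instance narrows a peak-level label down to a window of that size, which is why the default model resolves predictions to about 50 nt.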
