novoalab / epinano Goto Github PK

View Code? Open in Web Editor NEW

109.0 7.0 31.0 5.24 GB

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)

License: GNU General Public License v2.0

Python 73.51% Shell 10.90% R 12.71% Jupyter Notebook 1.48% Dockerfile 1.40%

m6a-rna-modifications nanopore rna-sequencing direct-rna m6a

epinano's Introduction

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads

Upgrades
About EpiNano
Getting the code
Running EpiNano
Further Documentation
Citing this work
License
Contact

Upgrades

EpiNano 1.2 - current version

Includes pretrained m6A models derived from sequences base-called with Guppy v 3.1.5.
Pretrained models can also be used to detect other RNA modifications (tested for pseudouridine, other modifications: not tested).
This version of EpiNano allows to make predictions using two different strategies: EpiNano-Error and EpiNano-SVM.
This version now includes modules for visualizing your RNA modification predictions (EpiNano_Plot)

EpiNano-Error can only be run in pairwise mode (e.g. WT and KO or KD). It combines the different types of base-calling errors that appear in a given dataset (mismatches, deletions, insertions) as well as alterations in per-base-calling qualities. RNA modification predictions are based on the differences in error patterns observed in two matched samples. This strategy can be used with FASTQ data base-called with any given base-calling algorithm version.

EpiNano-SVM can be run using either pre-trained models for a given RNA modification, or by building your own models. However, we should note that using a matched control (e.g. KO or KD) is still highly recommended, due to the noisy nature of direct RNA sequencing reads, which are 'error'-rich. Moreover, in addition to SVM models trained with "raw" base-calling 'error' features (same as in EpiNano 1.0 and 1.1), in EpiNano 1.2 we now provide SVM models trained with features that capture differences between samples (i.e. difference in mismatch, rather than absolute mismatch frequency), which we find have improved performance.

EpiNano 1.1 - a slimmer version of version 1.0, written in python3 is available here.

This version is the one currently implemented in MasterOfPores, a workflow to analyze direct RNA sequencing data.
The major differences with EpiNano 1.0 are (i) it is faster (ii) Uses python3 instead of python2 (iii) Does not extract current intensity in the feature table, as this feature was not used to train the final models.
Includes pre-trained m6A models base-called with Albacore version 2.1.7.
Works both with Guppy and Albacore basecalled data, but the SVM predictions will be only accurate if your data has been base-called using Albacore 2.1.7.
Regardless of the basecallers used, EpiNano can be used as a toolkit to extract per k-mer base-calling 'errors' (mismatch, insertion, deletion, quality), which are a proxy of RNA modifications present in a given dataset. We recommend running EpiNano in paired mode, i.e. computing the features in two datasets (WT-KO) to then accurately predict the RNA modified sites (i.e. those showing largest differences in their base-calling 'error' features).

EpiNano 1.0 - original code used in Liu, Begik et al., Nature Comm 2019, which is available here.

Includes pre-trained m6A models base-called with Albacore version 2.1.7.
It extracts both base-calling 'errors' (mismatch, insertion, seletion, per-base quality) as well as current intensity values
Current intensity information is extracted from the base-called Albacore FAST5 files.
Does not have models trained with Guppy base-called datasets.

About EpiNano

EpiNano is a tool to identify RNA modifications present in direct RNA sequencing reads.

EpiNano will extract a set of 'features' from direct RNA sequencing reads, which will be in turn used to predict whether the 'error' is caused by the presence of an RNA modification or not. Features directly extracted and derived include:

current intensity and duration
read quality
base quality scores
mismatch frequency
deletion frequency
insertion frequency
sumErr

These features can be organized in per base and per kmer formats

Modes of Running EpiNano

In EpiNano 1.2, we introduce delta-features, features capturing difference between modified and un-modified sites and sum_err, a metric computed by combining different types of errors and even base quality scores. These new metrics represent our attempt to steer around the limitation related to the fact that different types of RNA base modifications tend to introduce different types of sequencing errors.

EpiNano version 1.2 can predict RNA-modified sites in two different ways:

EpiNano-Error

Base-calling algorithm independent.
Applicable to any given RNA modification that affects the base-calling features.

EpiNano-SVM

Base-calling algorithm dependent (data must be base-called with Guppy 3.1.5)
Can use both base-calling error features as well as current signals features
It can be used to train your own models as well as be applied to datasets for which a pre-trained model is available (m6A)
The available m6A SVM models has been trained and tested upon a set of 'unmodified' and 'modified' sequences containing m6A at known sites or A.
We also offer SVM models trained with delta features, i.e., features capturing difference between modified and un-modified samples. These models can be applied to detect other RNA modifications apart from m6A (tested on pseudouridine).

Considerations when using EpiNano

EpiNano relies on the use of base-calling 'errors' to detect RNA modifications; however, direct RNA sequencing base-calling produces a significant amount of 'errors' in unmodified sequences. Therefore, to obtain higher confidence m6A-modified sites, we recommend to sequence both modified and unmodified datasets (e.g. treated with demethylase, or comparing a wild-type vs knockout/knockdown). Coupling a "control" (KD/KO) is not required in earlier Epinano versions, but is highly recommended.
You can use EpiNano as a feature extractor to predict RNA modifications based on alterations in base-called features (i.e., EpiNano-Error, as used here), as well as use the pre-trained SVMs to detect m6A RNA modifications (i.e., EpiNano-SVM, as used here).
EpiNano does not have per-read resolution. We are currently working on an improved version of EpiNano to obtain predictions at per-read level.
The performance of the algorithm is dependent on the stoichiometry of the site (i.e. sites with very low stoichiometry will be often missed by the algorithm)
Pre-trained models to predict m6A sites are included in each release. Please note that if you use pre-trained m6A models, your data should be base-called with the SAME base-calling algorithm and version (i.e. Guppy 3.1.5 if you use EpiNano 1.2, and Albacore 2.1.7 if you use EpiNano 1.0 or 1.1).
If you are using a different base-calling algorithm version, we recommend you to use EpiNano-Error rather than EpiNano-SVM.

Pre-requisites

The following softwares and packages were used by EpiNano

Software/Packages	Version
java openjdk	1.8.0
minimap2	2.14-r886
samtools	0.1.19
sam2tsv	a779a30d6af485d9cd669aa3752465132cf21eec
python	3.6.7
h5py	2.8.0
numpy	1.15.4
pandas	0.23.4
scikit-learn	0.20.2
nanopolish	0.12.4
dask	2.5.2
biopython	1.76
pysam	0.15.3+
R	3.6.0 (2019-04-26)
R packages:
forcats	0.4.0
optparse	1.6.6
stringr	1.4.0
dplyr	1.0.1
purrr	0.3.2
readr	1.3.1
tidyr	0.8.3
tibble	3.0.3
tidyverse	1.2.1
ggrepel	0.8.1
car	3.0-3
ggplot2	3.1.1
reshape2	1.4.3
outliers	0.14

Getting the code

To download the latest version of EpiNano , you just need to clone the repo:

git clone [email protected]:enovoa/EpiNano.git

You can choose to download EpiNano 1.1 HERE
You can choose to download EpiNano 1.0 HERE

Running EpiNano

a) Running EpiNano 1.2

To train models and assess prediction accuracies, please refer to commands in test_data/train_models/train_test.sh.

To make predictions with pre-trained models, please refer to commands in /test_data/make_predictions.

We will also update in wiki with specific examples of using different Epinano components.

Below is a simple introduction of programs' usage information.

STEP 1. Extract base-calling error features

Epinano_Variants, outputs a feature table sample.per.site.var.csv, which contains base-calling ‘error’ information for each reference position. Please note that by default, the feature table sample.per_site.5mer.csv that was generated by default in EpiNano 1.1 (which contains the same base-called features organized in 5-mer windows) is not generated by default any more. If you want to generate this file, please use the script Slide_Variants.py.

python Epinano_Variants.py -h
usage: Epinano_Variants.py [-h] [-r REFERENCE] [-b BAM]
                           [-c NUMBER_CPUS]

optional arguments:
  -h, --help            show this help message and exit
  -c NUMBER_CPUS, --number_cpus NUMBER_CPUS
                        number of CPUs

Required Arguments:
  -r REFERENCE, --reference REFERENCE
                        reference file indexed with samtools faidx and with
                        sequence dictionary created using picard
                        CreateSequenceDictionary
  -b BAM, --bam BAM     bam file; if given; no need to offer reads file;
                        mapping will be skipped

Example:

python $EPINANO_HOME/Epinano_Variants.py -c 6 -r reference.fasta -b sample.reads.bam

Note 1: it is possible to organize the variants in any kmer length using $EPINANO_HOME/misc/Slide_Variants.py.

Note 2: the users should split the computations for each reference sequences if the reference genome is large.

Note 3: UPDATE in EpiNano 1.2.2: We have now released an improved version of EpiNano_variants.py that requires less resources, so can be used with whole genomes. You can find this script here, and use it instead of EpiNano_Variants.py.

STEP 2. Extract current intensity values

This is optional. Users who are interested in exploring the electric signals including current intensities and duration time can rely on this to extract the relevant information. User can also use the extracted features to train SVM models.

Epinano_Current uses Nanopolish to extract current signal level information and then collapses it on a single position basis.

Note 1: Please add the /path/to/nanopolish to environmental $PATH variable, otherwise the script will fail.

$ sh Epinano_Current.sh -h

Epinano_Current.sh [-h] [-b bam -r reads -f genome/transriptome reference -d fast5dir -t threads -m bam_file_mapping_type]

        it runs nanopolish eventalign; aggreagets current intensity values associated with single positions

        nanopolish, samtools have to be installed and added to environmental paths!!

        -h [this help text]
        -b [bam file]
        -d [fast5 folder]
        -r [fastq reads file]
        -f [reference file used to produce bam file]
        -t [number of threads]
        -m [t: reads mapping performed using reference transcriptome; g: reads mapping performed with reference genome]

STEP 3. Predict RNA modifications

EpiNano offers two alternative methods to predict RNA modifications: i) EpiNano-Error uses of the variants/error frequencies computed above ii) EpiNano-SVM uses an SVM algorithm to train models and predict modifications.

a) Predicting RNA modifications using EpiNano-Error:

Epinano_DiffErr.R fits a linear regression model between paired unmodified and modified samples, and then detects outliers, i.e., observations with large residuals, which tend to be underlined by base modifications.

Note 1: different types of RNA base modification show distinct biases toward the spefic types of errors. Thus, offered Epinano_sumErr.py to combine mismatches, indels and even quality scores. Just like the independent types of errors, the combined error is internally performed when running Epinano_ErrDiff.R.

$ Rscript Epinano_DiffErr.R -h

Usage:
        DiffErr.R v0.1 compares a given feature between two samples. It predicts potential modified sites mainly through two methods:
                1. Compute deviance of selected featuers between samples and then calculate z-scores. Outliers or potential modified sites will then
                be determined based on user-defined threshold. Note that this is not suited for 'curlcake' data because they are highly modified (~25% of the bases 
		in the RNA molecules are modified bases).
                2. Fit a linear regression model between two samples.
                        1) detect residuals outliers of the given linear model.
                        2) compute z-scores of residuals for each observation and determine outliers using user-defined z-score threshold.
                Examples:
                        1 compare sum_err between two samples
                        Rscript Epinano_DiffErr.R -k ko.csv -w wt.csv -t 3 -o Test -c 30 -f sum_err  -d 0.1
                        2 same as above, but generate plots, one for each reference.
                        Rscript Epinano_DiffErr.R -k ko.csv -w wt.csv -t 3 -o Test -c 30 -f sum_err  -d 0.1 -p

Options:
        -c COVERAGE, --coverage=COVERAGE
                minimum coverage/depth; default: 30

        -t THRESHOLD, --threshold=THRESHOLD
                minimum z-score (i.e., number of standard deviation from mean) to determine modified sites; default: 3

        -d DEVIANCE, --deviance=DEVIANCE
                minimum deviance of selected feature between two samples; default: 0.1

        -f FEATURE, --feature=FEATURE
                the feature (column name(s) in from input file) to use to predict modifications

        -k KO_SAMPLE, --ko_sample=KO_SAMPLE
                knockout/unmodified sample

        -w WT_SAMPLE, --wt_sample=WT_SAMPLE
                wildtype/modified sample

        -o OUT_PREFIX, --out_prefix=OUT_PREFIX
                output prefix

        -p, --plot
                whether or not generate plots;  default: no plots will be generated because Epinano_Plot.R can do the job

        -h, --help
                Show this help message and exit

b) Predicting RNA modifications using EpiNano-SVM:

EpiNano 1.2 includes pre-trained models (found in $EPINANO_HOME/models/), which have been trained using synthetic molecules (curlcakes) with and without introduced m6A modifications. However, the user can train their own models using EpiNano_Predict, employing the features generated with EpiNano_Variants.py and/or EpiNano_Current.py as shown in the previous steps. The relevant commands can be found in test_data/train_models/.

$ python Epinano_Predict.py -h

Command:  Epinano_Predict.py -h
usage: Epinano_Predict.py [-h] [-k KERNEL] [-o OUT_PREFIX] [-a] [-M MODEL]
                          [-t TRAIN] [-mc MODIFICATION_STATUS_COLUMN] -p
                          PREDICT -cl COLUMNS

optional arguments:
  -h, --help            show this help message and exit
  -k KERNEL, --kernel KERNEL
                        kernel used for training SVM, choose any one from
                        'linear', 'poly', 'rbf', 'sigmoid'; if no choice made,
                        all 4 kernels will be used
  -o OUT_PREFIX, --out_prefix OUT_PREFIX
                        output file prefix
  -a, --accuracy_estimation
                        '-a' performs accuracy estimation with known modified
                        status from --predict file; only feasible when there
                        is prior knolwdge of modificaton status indiciated by
                        --modification_status_column
  -M MODEL, --model MODEL
                        pre-trained model that can ben used for prediction
  -t TRAIN, --train TRAIN
                        file name of feature table used for training; can be
                        gzipped
  -mc MODIFICATION_STATUS_COLUMN, --modification_status_column MODIFICATION_STATUS_COLUMN
                        column number from (input file1, i.e, traing file)
                        that contains modification status information

required arguments:
  -p PREDICT, --predict PREDICT
                        file name of feature table used for making predictions
                        or testing accuracy; can be gzipped. when this file is
                        the same the one used for training, half of the data
                        will be chosen for training.
  -cl COLUMNS, --columns COLUMNS
                        comma separated column number(s) that contain features
                        used for training and prediction

Example:

python $EPINANO_HOME/Epinano_Predict.py 
	--train ko_wt_combined.per_site_raw_feature.rrach.5mer.csv 
	--predict ko_wt_combined.per_site_raw_feature.rrach.5mer.csv --accuracy_estimation 
	--out_prefix train_and_test 
	--columns 8,13,23 
	--modification_status_column 26

While the user can choose to train the algorithm with one sample (--train) and test it on another independent sample (--predict), it is also possible to use the same input file both for training and testing the model, as depicted in the example above. In this scenario, Epinano_Predict will train the models with 50% of the input data, and make predictions with the remaining 50% of the data.

In the above command, ‘--columns’ denotes the column numbers of features that are used for training models (in this case, corresponding to ‘q3’, ’mis3’ and ‘del3’), while ‘--modification_status_column’ indicates the prior knowledge of the modification statuses, i.e., the labels ‘mod’ and ‘unm’. Switching on --accuracy_estimation will report the accuracy of the trained model(s). Unless ‘--kernel’ is used, Epinano_Predict will train models with multiple kernels.

With the trained models, the user can make predictions of modifications.

python $EPINANO_HOME/Epinano_Predict.py 
	--model q3.mis3.del3.MODEL.linear.model.dump 
	--predict some_sample.per_site.5mer.csv 
	--columns 8,13,23  
	--out_prefix some_sample.modification

In the command above, we employ a previously trained model ‘q3.mis3.del3.MODEL.linear.model.dump’ that will predict m6A modifications in RRACH k-mers on a dataset that is specified with ‘--predict’. Please remember to filt your dataset before or after making predicitons to keep only RRACH k-mers.

b) Running EpiNano 1.1

Build feature table (on which predictions will be made)

For step-by-step instructions to build a feature table, please take a look at the Wiki
To train SVM and perform predictions:

This step includes SVM training, prediction and performance assessment using single and multiple features.
$ python3 SVM.py -h

Command:  scripts/SVM.py -h
usage: SVM.py [-h] [-k KERNEL] [-o OUT_PREFIX] [-a] [-M MODEL] [-t TRAIN]
              [-mc MODIFICATION_STATUS_COLUMN] -p PREDICT -cl COLUMNS

optional arguments:
  -h, --help            show this help message and exit
  -k KERNEL, --kernel KERNEL
                        kernel used for training SVM, choose any one from
                        'linear', 'poly', 'rbf', 'sigmoid'; if no choice made,
                        all 4 kernels will be used
  -o OUT_PREFIX, --out_prefix OUT_PREFIX
                        ouput file prefix
  -a, --accuracy_estimation
                        -a perform accuracy estimation with known modified
                        status from --predict file
  -M MODEL, --model MODEL
                        pre-trained model that can ben used for prediction; if
                        this is not available SVM model will be trained and
                        dumped; there can be multiple models, which should be
                        in the same order as kernels applied
  -t TRAIN, --train TRAIN
                        file name of feature table used for training
  -mc MODIFICATION_STATUS_COLUMN, --modification_status_column MODIFICATION_STATUS_COLUMN
                        column number from (input file1, i.e, traing file)
                        that contains modification status information

required arguments:
  -p PREDICT, --predict PREDICT
                        file name of feature table used for making predictions
                        or testing accuracy. when this file is the same the
                        one used for training, half of the data will be chosen
                        for training.
  -cl COLUMNS, --columns COLUMNS
                        comma seperated column number(s) that contain features
                        used for training and prediciton

Examples

With the example svm input files from example/svm_input folder:

training with the example feature tables

	# the command below will train models using all quality scores are all positions from sample1 and make prediction on sample2
	python3 SVM.py -a -t sample1.csv -p sample2.csv -cl 1-5 -mc 11 -o test

	# while this command will do the same thing except choosing a 'linear' kernel for SVM training
	python3 SVM.py -a -k linear -cl 1-5 -t sample1.csv -p sample2.csv -mc 11

	# this 3rd command uses base quality and mismatch frequencies of the centred bases for SVM training
	python3 SVM.py -t sample1.csv -p sample2.csv -cl 3,7 -mc 11

predict modifications

	#use previously trained model and epinano-scripts-generated site-wise feature table to make predictions
	python3 SVM.py -a -M M6A.mis3.del3.q3.poly.dump -p test.csv -cl 7,12,22 -mc 28 -o pretrained.prediction

Further Documentation

Please check HERE for additional information on usage on EpiNano 1.2

Please check the Wiki for additional information on usage on EpiNano 1.1

Citing this work

If you find this work useful, please cite:

Huanle Liu*, Oguzhan Begik*, MorghanC Lucas, Jose Miguel Ramirez, Christopher E. Mason, David Wiener, Schraga Schwartz, John S. Mattick, Martin A. Smith and Eva Maria Novoa. Accurate detection of m6A RNA modifications in native RNA sequences . Nature Communications 2019, 10:4079.

Link to paper: https://www.nature.com/articles/s41467-019-11713-9

License

See LICENSE.md for details

Contact

Please read the Wiki before opening an issue. Also, please go over other issues that may have been previously resolved (check out "closed" issues). If you still have doubts/concerns/suggestions, please open a new Issue. Thanks!

epinano's People

Contributors

Stargazers

Watchers

epinano's Issues

Modification Status Column

Hi,

When trying to predict modification status using the pre-trained model on a direct RNA library if the modification status is not known what should one input as the required "modification status" column -mc?

Epinano 1.1: Backtick(`) in deletion values

The deletion column contains backticks . Was it implemented for any specific reason.

Detection of 2′-hydroxyl modifications in direct RNA-seq data

Hello, I would like to use your software for prediction of RNA's native structure, similar to SHAPE-MaP presented during London Calling 2019. Recently, generated two sets of data (SHAPE +ve and -ve), indexed files, aligned reads with minimap2 and processed with nanopolish eventalign. Results look good but your program seems to be such more advanced i.e. single base resolution. I believe there is a need to train a model for detection of SHAPE adducts binding at the 2′-hydroxyl position. Do you think EpiNano will be able to recognize modification on various bases or can identify certain/single base modification (e.g. 'A') only? I also would be interested to apply your program with per-read resolution as our RNA may have heterogenic tertiary structures. Do you have any idea when this update may be implemented into EpiNano?

Thank you very much for your help.

Regards
s-t-c

Unable to run fast5ToEventTbl.py

Hi, I'm trying to use fast5ToEventTbl.py to call my ONT RNA Meth, but for some reason I am unable to run correctly fast5ToEventTbl.py.

I tried to run it with that command:

python2 ./fast5ToEventTbl.py ./my_data/xxx1.fast5 > ./my_data/output.event.tbl

And I got that error for every fast5 in my test folder:
Traceback (most recent call last): File "./scripts/main/fast5ToEventTbl.py", line 44, in <module> main() File "./scripts/main/fast5ToEventTbl.py", line 30, in main base_call = get_base_call_path(f5) File "./scripts/main/fast5ToEventTbl.py", line 22, in get_base_call_path for i in list (h5['Analyses/'].keys()): File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "/lustre/nas1/wenyh/ONT/software/anaconda2/lib/python2.7/site-packages/h5py/_hl/group.py", line 177, in __getitem__ oid = h5o.open(self.id, self._e(name), lapl=self._lapl) File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "h5py/h5o.pyx", line 190, in h5py.h5o.open KeyError: "Unable to open object (object 'Analyses' doesn't exist)"
I am on CentOS release 6.5 (Final) and I believe all the dependencies are up to date.

I also tried to use other set of fast5 files coming from another run but the error remained.
I thank you in advance for any help. @Huanle @enovoa

Hygine

Feature Extractions from Million Fast5

Untar results in millions of Fast5, each corresponds to one sequence. How do you deal with numerous fast5 efficiently. Do you pack them all in one.

Hello

How much coverage do we need for a direct RNA sequencing experiment that would be optimum for identification of m6a sites ?

Yeast: True positive m6A modified RRACH sites

For the validation of m6A site prediction in yeast, you used 363 true positive sites. Would you be able to deposit this information on github. At least, I can't see that in supplemental file from the publication.

Thanks

Epinano 1.1 Strand Information

For per_site_var.5mer.csv or tsv.per.site.var.csv, how to find the correct strand. Can you add the corresponding read name and the strand it mapped to genome.

minimap2 step: input and reference files

In the example command for minimap2, you use the output of NanoFilt instead of the U>T converted fastq, is that correct?
Also, I assume you use a transcriptome fasta as input for minimap2, right?
Thank you, best, Sophia

implement insertion frequency computed from samtools mpileup

General Questions Regarding Using EpiNano

Hi @Huanle ,
Thanks for updating EpiNano and making it more user-friendly. I am a bit confused about how to implement EpiNano.
From my understanding, we extract per.site.var.slide features from fastq files and then we can use this file with your trained model to predict modifications? so we do not need fast5s at all?
But in the previous version, you had feature extraction from Fast5 files, which extracts features from the event table in fast5s! what was that for? because in your new instructions I do not see any command that needs Fast5s as the input file!

Thanks,
Vahid.

GFF for SK1 fasta

I am unable to find the exact fasta sequence file for the SK1 strain you recently added. Can you provide the source? I would like to get the GFF file for the transcriptome.

Thanks!

Homogenise python versions

Dear developers,
python2 will be soon be discontinued for python3. What about moving the scripts that are working with python2 to python3?
Best,

Luca

SVM features, sequencing depth and modification probability

Hi Eva,

I am currently using Nanopore to sequence RNA from patients' samples and try to investigate RNA m6a methylation in cancers using Epinano. I have read the publication of Epinano on Nature
Communications...impressive work done!

I have some questions while using Epinano, and hope you guys can share
some experience with me. I understand that the SVM performs best with
features of quality, mismatch and deletion at each base. There is a model
file (EpiNano/models/SVM100/model2.1-mis3.del3.q3.poly.dump). I assume
this dump file only considers the central base, and the order of features
is mismatch, deletion and quality? I have successfully used Epinano to
predict RNA methylation, but I have some concerns that there might be a
lot of false negatives. I only have ~1 M reads per sample, and I believe
sequencing depth can be a huge impact on the prediction. I am going to do
a QC on these results by filtering out bases with few supporting reads and
low probabilities (0.95 cut-off?). What's your recommendations on the
minimum depth and probability of the base to be consider as m6a
modifications? Do you have any other suggestions on QC of the
results? Many thanks!

Calculation of Insertion Frequency

Hi. Can I check with you how the insertion frequency is calculated? Thank you.

How to get the correct RNA modifications for reverse transcripts?

Hi,
I use the genome as mapping reference with the following options "-ax splice -uf -k14". Some reads will mapping to the reverse strand, what should I do for those reverse transcripts to get the right modification?
Thanks!

which input fast5

What sort of fast5 files is expected as an input for the
python2 fast5ToEventTbl.py input.fast5 > output.event.tbl
step?
I was trying to use both single and multi read fast5 files, generated from a MinION direct RNA seq run, both without (directly from the sequencing run w/o live base calling) and with basecalling (after guppy) information inside, but they all fail with an error, e.g.:

Traceback (most recent call last):
  File "fast5ToEventTbl.py", line 45, in <module>
    main()

  File "fast5ToEventTbl.py", line 29, in main
    read_id = get_readid(f5)

  File "fast5ToEventTbl.py", line 16, in get_readid
    k = h5['Raw/Reads/'].keys()[0]

  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper

  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper

  File "lib/python2.7/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)

  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper

  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper

  File "h5py/h5o.pyx", line 190, in h5py.h5o.open

KeyError: 'Unable to open object (component not found)'

I looked at the fast5ToEventTbl.py script and see that it goes for the Analyses section of the fast5 file. I do have this section, and I also have a section inside this called Basecall_1D_000, e.g.:

GROUP "Analyses" {
     
 GROUP "Basecall_1D_000" {
         
GROUP "BaseCalled_template" {
            
DATASET "Fastq" {
             
  DATATYPE  H5T_STRING {
                 
 STRSIZE H5T_VARIABLE;
                 
 STRPAD H5T_STR_NULLTERM;
                 
 CSET H5T_CSET_UTF8;
                  
CTYPE H5T_C_S1;
              
 }
              
 DATASPACE  SCALAR
              
 DATA {
              
 (0): "@ffaca199-6d26-4d6f-96f3-9180aed3de4d runid=7fa1e1733fdb76c8e61f4b217a6bf781257cd601 sampleid=3T3L1_0hours read=22926 ch=129 start_time=2019-05-03T20:54:16Z
           
CUGCAGUUGGUUGCGAAGUCUUCCAUCCCUCUAGACGAAGCCUUGGGAUGAUGAGAGACGGACAUGACAAACUAGAGGAGUGUGUCCGAAGCAUUCAGAUGGCCUGGUGUGUGGGGCUCCUAAAUUGGUUCCAGUGGGAUAUGGAAUUAAAAGCAAGGAAGCUUCAAAUACAGUGUGGUUGAAGAUGAUAAGGUUGGAACGGAUAUGGAAGAGCAAAUUGCUUUGAGGACUACGUCUACAGUCCAUUGGAUGUGGCUGCUUCUUAACAAGAUUUAAAUCAUCUUAAGUCAGUGCACUUAAAUAAAAGCUUGAAAGAUAGU
         
  +
          
 8:79;?761/074*6@E52-&&*('+3<+/)89B0';?@9::7:88F<<C===4?6;B*@BD6?<60)+,<>33;9?/#-?=::558>JH8=7-$#.AD380166;;0$&99&<=61*.388;0417805,61A528A:;43@@9:6:::982+&).-,*75&)20:1836/A47/.)76)=GEC9DF=EG=;96-?;-3/=;.$#+7GFG=B=:-%&804/*89,7:6.*417'$()277:91/$$(98:=-,21--'+,#-.ABBABA>>9=32(-&&'$'455A44491>'(1*345'25789+++.60/3;&%$%#

One of the fast5 files from the "modified" folder in the /examples folder worked successfully. I believe those are single read fast5 files, is that right? How did you generate them?

Many thanks for your help over again, best, Sophia

PS: I will be happy to share a single read fast5 file of mine, I'll send you an email.

Example unm_example_fast5/mod_example_fast5 can not be mapped to human genome

Hi:

I converted the example fast5 data (mod_example_fast5) to fastq format, and mapped to human GRCh38 reference genome. But the resulted BAM file shows that nothing has mapped (BAM file FLAG with 4)

Curcake SRR8767348; ignoring read without sequence

For one the curlcake sample from Liu_et_Nat_Com_2019 (SRR8767348.fastq.gz), the feature extraction from fastq produced weird results. Unable to understand. Could some please help!

[M::mm_idx_gen::0.0380.26] collected minimizers
[M::mm_idx_gen::0.0430.35] sorted minimizers
[M::main::0.0430.35] loaded/built the index for 4 target sequence(s)
[M::mm_mapopt_update::0.0450.34] mid_occ = 3
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4
[M::mm_idx_stat::0.0450.35] distinct minimizers: 1864 (99.09% are singletons); average occurrences: 1.009; average spacing: 5.316
[M::worker_pipeline::83.7522.98] mapped 396018 sequences
[M::worker_pipeline::165.269*1.99] mapped 348987 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax map-ont /projects/ke-lab/kesara/ONT/downloads/Liu_et_Nat_Com_2019/curlcake/reference/GSE124309_FASTA_sequences_of_Curlcakes.fa /projects/ke-lab/kesara/ONT/results/epinano/SRR8767348.U2T.fastq
[M::main] Real time: 165.282 sec; CPU: 329.645 sec; Peak RSS: 2.254 GB
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.6490
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.8810
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.48744
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.94999
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.205713
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.261486
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.272770
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.311546
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.373700
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.389571
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.526071
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.532636
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.576609
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.667801
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.716665
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.52678
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.217676
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.716054
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.210763
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.513356
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.744910
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.10752
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.218171
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.463532
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.548731
[INFO][Sam2Tsv]Count: 5,728 Elapsed: 11 seconds(0.10%) Remains: 3 hours(99.90%) Last: cc6m_2244_t7_ecorv:10
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.269086
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.159458
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.116329
[INFO][Sam2Tsv]Count: 12,016 Elapsed: 22 seconds(1.20%) Remains: 30 minutes(98.80%) Last: cc6m_2244_t7_ecorv:120
[INFO][Sam2Tsv]Count: 19,229 Elapsed: 33 seconds(6.76%) Remains: 7 minutes(93.24%) Last: cc6m_2244_t7_ecorv:676
[INFO][Sam2Tsv]Count: 21,801 Elapsed: 44 seconds(8.54%) Remains: 7 minutes(91.46%) Last: cc6m_2244_t7_ecorv:854
[INFO][Sam2Tsv]Count: 28,585 Elapsed: 55 seconds(11.28%) Remains: 7 minutes(88.72%) Last: cc6m_2244_t7_ecorv:1,128
[INFO][Sam2Tsv]Count: 35,531 Elapsed: 1 minute(14.17%) Remains: 6 minutes(85.83%) Last: cc6m_2244_t7_ecorv:1,417
[INFO][Sam2Tsv]Count: 47,470 Elapsed: 1 minute(17.57%) Remains: 6 minutes(82.43%) Last: cc6m_2244_t7_ecorv:1,757
[INFO][Sam2Tsv]Count: 49,954 Elapsed: 1 minute(18.05%) Remains: 6 minutes(81.95%) Last: cc6m_2244_t7_ecorv:1,805
[INFO][Sam2Tsv]Count: 63,706 Elapsed: 1 minute(0.06%) Remains: 1 day(99.94%) Last: cc6m_2459_t7_ecorv:6
[INFO][Sam2Tsv]Count: 66,561 Elapsed: 1 minute(0.08%) Remains: 1 day(99.92%) Last: cc6m_2459_t7_ecorv:8
[INFO][Sam2Tsv]Count: 69,608 Elapsed: 2 minutes(0.09%) Remains: 1 day(99.91%) Last: cc6m_2459_t7_ecorv:9
[INFO][Sam2Tsv]Count: 74,088 Elapsed: 2 minutes(0.09%) Remains: 1 day(99.91%) Last: cc6m_2459_t7_ecorv:9
[INFO][Sam2Tsv]Count: 74,844 Elapsed: 2 minutes(0.09%) Remains: 1 day(99.91%) Last: cc6m_2459_t7_ecorv:9
[INFO][Sam2Tsv]Count: 75,057 Elapsed: 2 minutes(0.10%) Remains: 1 day(99.90%) Last: cc6m_2459_t7_ecorv:10
[INFO][Sam2Tsv]Count: 75,547 Elapsed: 2 minutes(0.10%) Remains: 1 day(99.90%) Last: cc6m_2459_t7_ecorv:10
[INFO][Sam2Tsv]Count: 75,935 Elapsed: 3 minutes(0.10%) Remains: 2 days(99.90%) Last: cc6m_2459_t7_ecorv:10
[INFO][Sam2Tsv]Count: 77,641 Elapsed: 3 minutes(0.10%) Remains: 2 days(99.90%) Last: cc6m_2459_t7_ecorv:10
[INFO][Sam2Tsv]Count: 82,926 Elapsed: 3 minutes(0.11%) Remains: 2 days(99.89%) Last: cc6m_2459_t7_ecorv:11
[INFO][Sam2Tsv]Count: 86,991 Elapsed: 3 minutes(0.14%) Remains: 1 day(99.86%) Last: cc6m_2459_t7_ecorv:14
[INFO][Sam2Tsv]Count: 87,087 Elapsed: 3 minutes(0.15%) Remains: 1 day(99.85%) Last: cc6m_2459_t7_ecorv:15
[INFO][Sam2Tsv]Count: 87,252 Elapsed: 3 minutes(0.15%) Remains: 1 day(99.85%) Last: cc6m_2459_t7_ecorv:15
[INFO][Sam2Tsv]Count: 87,500 Elapsed: 4 minutes(0.15%) Remains: 1 day(99.85%) Last: cc6m_2459_t7_ecorv:15
[INFO][Sam2Tsv]Count: 88,780 Elapsed: 4 minutes(0.15%) Remains: 2 days(99.85%) Last: cc6m_2459_t7_ecorv:15
[INFO][Sam2Tsv]Count: 94,268 Elapsed: 4 minutes(0.16%) Remains: 1 day(99.84%) Last: cc6m_2459_t7_ecorv:16
[INFO][Sam2Tsv]Count: 97,359 Elapsed: 4 minutes(0.16%) Remains: 2 days(99.84%) Last: cc6m_2459_t7_ecorv:16
[INFO][Sam2Tsv]Count: 102,909 Elapsed: 4 minutes(0.16%) Remains: 2 days(99.84%) Last: cc6m_2459_t7_ecorv:16
[INFO][Sam2Tsv]Count: 108,371 Elapsed: 5 minutes(0.16%) Remains: 2 days(99.84%) Last: cc6m_2459_t7_ecorv:16
[INFO][Sam2Tsv]Count: 111,018 Elapsed: 5 minutes(0.17%) Remains: 2 days(99.83%) Last: cc6m_2459_t7_ecorv:17
[INFO][Sam2Tsv]Count: 116,979 Elapsed: 5 minutes(0.17%) Remains: 2 days(99.83%) Last: cc6m_2459_t7_ecorv:17
[INFO][Sam2Tsv]Count: 122,575 Elapsed: 5 minutes(0.29%) Remains: 1 day(99.71%) Last: cc6m_2459_t7_ecorv:29
[INFO][Sam2Tsv]Count: 127,852 Elapsed: 5 minutes(0.48%) Remains: 20 hours(99.52%) Last: cc6m_2459_t7_ecorv:48
[INFO][Sam2Tsv]Count: 133,454 Elapsed: 5 minutes(1.72%) Remains: 5 hours(98.28%) Last: cc6m_2459_t7_ecorv:172
[INFO][Sam2Tsv]Count: 139,083 Elapsed: 6 minutes(2.98%) Remains: 3 hours(97.02%) Last: cc6m_2459_t7_ecorv:298
[INFO][Sam2Tsv]Count: 144,749 Elapsed: 6 minutes(4.21%) Remains: 2 hours(95.79%) Last: cc6m_2459_t7_ecorv:421
[INFO][Sam2Tsv]Count: 151,006 Elapsed: 6 minutes(5.95%) Remains: 1 hour(94.05%) Last: cc6m_2459_t7_ecorv:595
[INFO][Sam2Tsv]Count: 154,925 Elapsed: 6 minutes(6.84%) Remains: 1 hour(93.16%) Last: cc6m_2459_t7_ecorv:684
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.117494
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.286166
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.3889
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.369619
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.415095
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.439087
[INFO][Sam2Tsv]Count: 157,227 Elapsed: 6 minutes(7.30%) Remains: 1 hour(92.70%) Last: cc6m_2459_t7_ecorv:730
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.674372
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.87373
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.89719
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.98935
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.170353
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.224054
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.248600
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.280525
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.377275
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.519807
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.592519
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.622145
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.639466
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.691410
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.16925
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.139264
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.215124
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.291455
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.412762
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.317610
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.439003
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.678649
[INFO][Sam2Tsv]Count: 158,510 Elapsed: 7 minutes(7.65%) Remains: 1 hour(92.35%) Last: cc6m_2459_t7_ecorv:765
[WARN][Sam2Tsv]Ignoring read without sequence: SRR8767348.98271
[INFO][Sam2Tsv]Count: 159,277 Elapsed: 7 minutes(7.72%) Remains: 1 hour(92.28%) Last: cc6m_2459_t7_ecorv:772
[INFO][Sam2Tsv]Count: 163,214 Elapsed: 7 minutes(8.68%) Remains: 1 hour(91.32%) Last: cc6m_2459_t7_ecorv:868
[INFO][Sam2Tsv]Count: 169,601 Elapsed: 7 minutes(10.12%) Remains: 1 hour(89.88%) Last: cc6m_2459_t7_ecorv:1,012
[INFO][Sam2Tsv]Count: 177,563 Elapsed: 7 minutes(11.88%) Remains: 58 minutes(88.12%) Last: cc6m_2459_t7_ecorv:1,188
[INFO][Sam2Tsv]Count: 186,905 Elapsed: 8 minutes(13.70%) Remains: 51 minutes(86.30%) Last: cc6m_2459_t7_ecorv:1,370
[INFO][Sam2Tsv]Count: 196,915 Elapsed: 8 minutes(15.27%) Remains: 46 minutes(84.73%) Last: cc6m_2459_t7_ecorv:1,527
[INFO][Sam2Tsv]Count: 209,441 Elapsed: 8 minutes(17.26%) Remains: 40 minutes(82.74%) Last: cc6m_2459_t7_ecorv:1,726
[INFO][Sam2Tsv]Count: 226,472 Elapsed: 8 minutes(20.34%) Remains: 33 minutes(79.66%) Last: cc6m_2459_t7_ecorv:2,034
[INFO][Sam2Tsv]Count: 240,011 Elapsed: 8 minutes(22.47%) Remains: 30 minutes(77.53%) Last: cc6m_2595_t7_ecorv:5
[INFO][Sam2Tsv]Count: 245,567 Elapsed: 9 minutes(22.50%) Remains: 31 minutes(77.50%) Last: cc6m_2595_t7_ecorv:8
[INFO][Sam2Tsv]Count: 251,220 Elapsed: 9 minutes(22.52%) Remains: 31 minutes(77.48%) Last: cc6m_2595_t7_ecorv:10
[INFO][Sam2Tsv]Count: 254,828 Elapsed: 9 minutes(22.52%) Remains: 32 minutes(77.48%) Last: cc6m_2595_t7_ecorv:10
[INFO][Sam2Tsv]Count: 260,504 Elapsed: 9 minutes(22.52%) Remains: 32 minutes(77.48%) Last: cc6m_2595_t7_ecorv:10
[INFO][Sam2Tsv]Count: 265,000 Elapsed: 9 minutes(22.52%) Remains: 33 minutes(77.48%) Last: cc6m_2595_t7_ecorv:10
[INFO][Sam2Tsv]Count: 270,525 Elapsed: 9 minutes(22.53%) Remains: 34 minutes(77.47%) Last: cc6m_2595_t7_ecorv:11
[INFO][Sam2Tsv]Count: 275,950 Elapsed: 10 minutes(22.53%) Remains: 34 minutes(77.47%) Last: cc6m_2595_t7_ecorv:11
[INFO][Sam2Tsv]Count: 279,381 Elapsed: 10 minutes(22.53%) Remains: 35 minutes(77.47%) Last: cc6m_2595_t7_ecorv:11
[INFO][Sam2Tsv]Count: 285,038 Elapsed: 10 minutes(22.56%) Remains: 36 minutes(77.44%) Last: cc6m_2595_t7_ecorv:14
[INFO][Sam2Tsv]Count: 290,857 Elapsed: 10 minutes(22.78%) Remains: 36 minutes(77.22%) Last: cc6m_2595_t7_ecorv:36
[INFO][Sam2Tsv]Count: 295,898 Elapsed: 10 minutes(24.16%) Remains: 34 minutes(75.84%) Last: cc6m_2595_t7_ecorv:174
[INFO][Sam2Tsv]Count: 301,757 Elapsed: 11 minutes(25.74%) Remains: 31 minutes(74.26%) Last: cc6m_2595_t7_ecorv:332
[INFO][Sam2Tsv]Count: 307,762 Elapsed: 11 minutes(27.27%) Remains: 30 minutes(72.73%) Last: cc6m_2595_t7_ecorv:485
[INFO][Sam2Tsv]Count: 314,040 Elapsed: 11 minutes(28.77%) Remains: 28 minutes(71.23%) Last: cc6m_2595_t7_ecorv:635
[INFO][Sam2Tsv]Count: 320,957 Elapsed: 11 minutes(30.39%) Remains: 26 minutes(69.61%) Last: cc6m_2595_t7_ecorv:797
[INFO][Sam2Tsv]Count: 325,295 Elapsed: 11 minutes(31.31%) Remains: 26 minutes(68.69%) Last: cc6m_2595_t7_ecorv:889
[INFO][Sam2Tsv]Count: 333,474 Elapsed: 12 minutes(32.80%) Remains: 24 minutes(67.20%) Last: cc6m_2595_t7_ecorv:1,038
[INFO][Sam2Tsv]Count: 342,027 Elapsed: 12 minutes(34.40%) Remains: 23 minutes(65.60%) Last: cc6m_2595_t7_ecorv:1,198
[INFO][Sam2Tsv]Count: 350,753 Elapsed: 12 minutes(35.76%) Remains: 22 minutes(64.24%) Last: cc6m_2595_t7_ecorv:1,334
[INFO][Sam2Tsv]Count: 359,807 Elapsed: 12 minutes(37.43%) Remains: 21 minutes(62.57%) Last: cc6m_2595_t7_ecorv:1,501
[INFO][Sam2Tsv]Count: 369,442 Elapsed: 12 minutes(38.62%) Remains: 20 minutes(61.38%) Last: cc6m_2595_t7_ecorv:1,620
[INFO][Sam2Tsv]Count: 380,514 Elapsed: 12 minutes(40.22%) Remains: 19 minutes(59.78%) Last: cc6m_2595_t7_ecorv:1,780
[INFO][Sam2Tsv]Count: 395,411 Elapsed: 13 minutes(42.44%) Remains: 17 minutes(57.56%) Last: cc6m_2595_t7_ecorv:2,002
[INFO][Sam2Tsv]Count: 406,515 Elapsed: 13 minutes(43.68%) Remains: 17 minutes(56.32%) Last: cc6m_2595_t7_ecorv:2,126
[INFO][Sam2Tsv]Count: 427,821 Elapsed: 13 minutes(47.08%) Remains: 15 minutes(52.92%) Last: cc6m_2709_t7_ecorv:8
[INFO][Sam2Tsv]Count: 432,891 Elapsed: 13 minutes(47.10%) Remains: 15 minutes(52.90%) Last: cc6m_2709_t7_ecorv:10
[INFO][Sam2Tsv]Count: 437,761 Elapsed: 13 minutes(47.10%) Remains: 15 minutes(52.90%) Last: cc6m_2709_t7_ecorv:10
[INFO][Sam2Tsv]Count: 442,793 Elapsed: 14 minutes(47.11%) Remains: 15 minutes(52.89%) Last: cc6m_2709_t7_ecorv:11
[INFO][Sam2Tsv]Count: 447,936 Elapsed: 14 minutes(47.12%) Remains: 16 minutes(52.88%) Last: cc6m_2709_t7_ecorv:12
[INFO][Sam2Tsv]Count: 453,185 Elapsed: 14 minutes(47.14%) Remains: 16 minutes(52.86%) Last: cc6m_2709_t7_ecorv:14
[INFO][Sam2Tsv]Count: 458,257 Elapsed: 14 minutes(47.14%) Remains: 16 minutes(52.86%) Last: cc6m_2709_t7_ecorv:14
[INFO][Sam2Tsv]Count: 463,289 Elapsed: 14 minutes(47.18%) Remains: 16 minutes(52.82%) Last: cc6m_2709_t7_ecorv:18
[INFO][Sam2Tsv]Count: 468,680 Elapsed: 15 minutes(47.18%) Remains: 16 minutes(52.82%) Last: cc6m_2709_t7_ecorv:18
[INFO][Sam2Tsv]Count: 473,745 Elapsed: 15 minutes(47.18%) Remains: 17 minutes(52.82%) Last: cc6m_2709_t7_ecorv:18
[INFO][Sam2Tsv]Count: 478,580 Elapsed: 15 minutes(47.20%) Remains: 17 minutes(52.80%) Last: cc6m_2709_t7_ecorv:20
[INFO][Sam2Tsv]Count: 483,621 Elapsed: 15 minutes(47.26%) Remains: 17 minutes(52.74%) Last: cc6m_2709_t7_ecorv:26
[INFO][Sam2Tsv]Count: 488,529 Elapsed: 15 minutes(47.37%) Remains: 17 minutes(52.63%) Last: cc6m_2709_t7_ecorv:37
[INFO][Sam2Tsv]Count: 493,220 Elapsed: 16 minutes(47.97%) Remains: 17 minutes(52.03%) Last: cc6m_2709_t7_ecorv:97
[INFO][Sam2Tsv]Count: 498,707 Elapsed: 16 minutes(49.44%) Remains: 16 minutes(50.56%) Last: cc6m_2709_t7_ecorv:244
[INFO][Sam2Tsv]Count: 504,196 Elapsed: 16 minutes(50.93%) Remains: 15 minutes(49.07%) Last: cc6m_2709_t7_ecorv:392
[INFO][Sam2Tsv]Count: 509,455 Elapsed: 16 minutes(52.17%) Remains: 15 minutes(47.83%) Last: cc6m_2709_t7_ecorv:516
[INFO][Sam2Tsv]Count: 515,568 Elapsed: 16 minutes(53.67%) Remains: 14 minutes(46.33%) Last: cc6m_2709_t7_ecorv:666
[INFO][Sam2Tsv]Count: 519,924 Elapsed: 16 minutes(54.76%) Remains: 14 minutes(45.24%) Last: cc6m_2709_t7_ecorv:775
[INFO][Sam2Tsv]Count: 526,384 Elapsed: 17 minutes(56.48%) Remains: 13 minutes(43.52%) Last: cc6m_2709_t7_ecorv:947
[INFO][Sam2Tsv]Count: 533,537 Elapsed: 17 minutes(58.19%) Remains: 12 minutes(41.81%) Last: cc6m_2709_t7_ecorv:1,118
[INFO][Sam2Tsv]Count: 538,545 Elapsed: 17 minutes(59.36%) Remains: 11 minutes(40.64%) Last: cc6m_2709_t7_ecorv:1,235
[INFO][Sam2Tsv]Count: 546,542 Elapsed: 17 minutes(61.37%) Remains: 11 minutes(38.63%) Last: cc6m_2709_t7_ecorv:1,436
[INFO][Sam2Tsv]Count: 555,744 Elapsed: 17 minutes(63.39%) Remains: 10 minutes(36.61%) Last: cc6m_2709_t7_ecorv:1,638
[INFO][Sam2Tsv]Count: 560,945 Elapsed: 18 minutes(64.39%) Remains: 9 minutes(35.61%) Last: cc6m_2709_t7_ecorv:1,738
[INFO][Sam2Tsv]Count: 572,867 Elapsed: 18 minutes(66.60%) Remains: 9 minutes(33.40%) Last: cc6m_2709_t7_ecorv:1,959
[INFO][Sam2Tsv]Count: 586,593 Elapsed: 18 minutes(68.71%) Remains: 8 minutes(31.29%) Last: cc6m_2709_t7_ecorv:2,170
[INFO][Sam2Tsv]. Completed. N=745,740. That took:19 minutes

plot_ROC function in SVM.py

I dont see in the code if plot_ROC function is being called. Can you please clarify that. Dont see that "roc_curve" and "auc" function.

How to get my own train file

Hi, I just use your scripts to deal with my ONT-Met dataset a few weeks ago. Fortunately, I got some results, such as a file named per_site.var.current.csv. But what makes me confused is that I have no idea how to get my own train file by using per_site.var.current.csv file. Or something I was wrong about understanding the SVM.py script. Thanks. I really need your help.

which fast5 for fast5ToEventTbl.py?

Hi,
I got fast5 files generated with gupy from sequencing company. After I got that fast5 files, I basecalled fast5 files generated with albacore. The command is listed belloe:
read_fast5_basecaller.py --input ./multi_to_single/"$i"/ --recursive --worker_threads 10 --flowcell FLO-MIN106 --kit SQK-RNA001 --save_path $output/output_basecall_single/${i} -o fastq,fast5 -q 0 -n 0
And I got that error for every fast5 in my folder './multi_to_single/"$i"/"$j".fast5 "$j" extraction failed'.

Error encountered with trimming fastQ files

Hi,

I'm trying to trim the raw fastq with NanoFilt and am using Ubuntu, pip3 and python 3.6.8. When I enter the command, I receive this error: cat: invalid option -- 'q'. Any suggestions as to how to overcome this?

Thanks!

Which model and Which column of each model has the modification status?

Hi @Huanle and @enovoa ,

Which result (per-site or per-read) and which model in the models' folder do you recommend to use? Also, I really do not know which column in your models has the modification status. In the code example for mis3,del3,q3 you selected column 28, does this column is the same for all models in the model folder?
Can you also please explain some about the prediction output. I used the model in the example code and the result looks like this:
#Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5,prediction,dist,ProbM,ProbU CCTGG,137039625:137039626:137039627:137039628:137039629,chr9,166.0:166.0:166.0:166.0:166.0,23.741999999999997,27.06,20.511999999999997,28.066999999999997,24.686999999999998,0.0,0.0,0.0,0.012,0.006,0.0,0.0,0.0,0.0,0.0,0.018000000000000002,0.0,0.0,0.006,0.0,unm,21.354076018509286,3.00000089999998e-14,0.9999999999999699 TAAGA,124403220:124403221:124403222:124403223:124403224,chr10,1.0:1.0:1.0:1.0:1.0,16.0,8.0,5.0,3.0,5.0,0.0,0.0,0.0,1.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,mod,-0.8762640257834018,0.7606523413528626,0.2393476586471375
So, there is also prediction for k-mers that do not have Adenin, and also for the k-mers more than one A? if a k-mer with more than one A predicted as modified, does that mean all the As in the k-mer are modified?

Many thanks,
Vahid.

Current Intensity Extraction

In the new release of epinano, did you remove the option to extract current intensity extraction for five positions of kmer. Having current intensity data is still useful in addition to quality, mismatch, deletion and insertions.

SK1 Reference Genome

Hi @Huanle and @tleonardi ,
Can you please provide me the SK1 strain reference genome you have used?
The ones I can find at SGD and NCBI for SK1 have about 500 scaffolds and they are not assembled to chromosomes.

Thanks,
Vahid.

some problems in per_read_var.py script

In the step of generating per_read variants. I run the example command
python per_read_var.py all_fastq_pass.h5t3.U2T.sort.bam.tsv > all_fastq_pass.h5t3.U2T.per_read.var.csv

I got three files in the output folder, which are the following three files 'all_fastq_pass.h5t3.U2T.per_read.var.csv'、'all_fastq_pass.h5t3.U2T.sort.bam.tsv.del.csv' and 'all_fastq_pass.h5t3.U2T.sort.bam.tsv_per_rd_var.del.adjust.csv'.
the all_fastq_pass.h5t3.U2T.per_read.var.csv is empty.

In the per_read_var.py script, there are only two outputs as follows

del_tmp = prefix + '.del.csv' (all_fastq_pass.h5t3.U2T.sort.bam.tsv.del.csv')
print >>tmp_fh,','.join ([inf[0],inf[1],ref_base,inf[2],inf[3],rd_base,str (qualities[k]),mis[k],ins[k],str(dels[k])])

adjusted_file = prefix+'_per_rd_var.del.adjust.csv'
(all_fastq_pass.h5t3.U2T.sort.bam.tsv_per_rd_var.del.adjust.csv')
print >>adj_fh, ",".join (["#REF",'REF_POS','REF_BASE','READ_NAME','READ_POSITION','READ_BASE','BASE_QUALITY','MISMATCH','INSERTION','DELETION'])
print >>adj_fh, ','.join(map (str, ary))

So my question is whether ‘all_fastq_pass.h5t3.U2T.sort.bam.tsv_per_rd_var.del.adjust.csv’ is the real per_read variants information. And the command line of generating per_read variants should be 'python per_read_var.py all_fastq_pass.h5t3.U2T.sort.bam.tsv' instead of 'python per_read_var.py all_fastq_pass.h5t3.U2T.sort.bam.tsv > all_fastq_pass.h5t3.U2T.per_read.var.csv'.

Some questions about using Epinano

which fast5 for fast5ToEventTbl.py?
Actually, The fast5 files generated with gupy from sequencing company. After I got that fast5 files, I basecalled fast5 files generated with albacore. The command is listed belloe:
read_fast5_basecaller.py --input ./multi_to_single/"$i"/ --recursive --worker_threads 10 --flowcell FLO-MIN106 --kit SQK-RNA001 --save_path $output/output_basecall_single/${i} -o fastq,fast5 -q 0 -n 0
And I got the same error for every fast5 in my folder './multi_to_single/"$i"/"$j".fast5 "$j" extraction failed'.
Which reference sequences should be used to RNA mapping, genome reference or transcriptome reference ? For species without a reference genome, can I mapping RNA to an assembled transcriptome?
In the step of calling variants for each single read-to-reference alignment, reads mapped to reverse strand of reference seqeucne will be flipped. Through this step, can I only get the m6A modification for the forward transcripts and the information of the reverse transcripts will be lose?
In the step of generating per_read variants. I run the example command
python per_read_var.py all_fastq_pass.h5t3.U2T.sort.bam.tsv > all_fastq_pass.h5t3.U2T.per_read.var.csv

In the per_read_var.py file, there are only two outputs as follows

del_tmp = prefix + '.del.csv' (all_fastq_pass.h5t3.U2T.sort.bam.tsv.del.csv')
print >>tmp_fh,','.join ([inf[0],inf[1],ref_base,inf[2],inf[3],rd_base,str (qualities[k]),mis[k],ins[k],str(dels[k])])

Error when running fast5ToEventTbl.py

Hi @Huanle
I got this error when I run this command on test data:
python ~/Software/EpiNano/scripts/main/fast5ToEventTbl.py ~/Downloads/multifast5_1.fast5
Traceback (most recent call last):
File "/home/vakbari/Software/EpiNano/scripts/main/fast5ToEventTbl.py", line 45, in
main()
File "/home/vakbari/Software/EpiNano/scripts/main/fast5ToEventTbl.py", line 29, in main
read_id = get_readid(f5)
File "/home/vakbari/Software/EpiNano/scripts/main/fast5ToEventTbl.py", line 16, in get_readid
k = h5['Raw/Reads/'].keys()[0]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/projects/vakbari_prj/anaconda3/envs/epinano/lib/python2.7/site-packages/h5py/_hl/group.py", line 177, in getitem
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (component not found)'

I have installed h5py, numpy , pandas and sklearn before running the script.
Do you know what make this error?
Many thanks,
Vahid.

How to detection of RNA modifications for the reverse transcripts?

Hi,
In the step of calling variants for each single read-to-reference alignment, reads mapped to reverse strand of reference seqeucne will be flipped. Through this step, can I only get the m6A modification for the forward transcripts and the information of the reverse transcripts will be lose?

Regarding Epinano command issue

I used guppy based calling software to get fast5 and fastq. I used new release of epinano 1.1.1 version so i don't understand why it give this like error. please suggest me to solve this problem.
Commad: SVM.py -a -t infected.tsv.per.site.var.per_site_var.5mer.csv -p infected.tsv.per.site.var.per_site_var.5mer.csv -cl 1-5 -mc 11 -o infect_test

Colunms-used: 1-5 output: infect_test.#Kmer.Window.Ref.Coverage.q1.SVM
Traceback (most recent call last):
File "/home/aclab/apps/EpiNano-epinano1.1.1/scripts/SVM.py", line 153, in
model_fit = model.fit (X_train, y_train)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1.0:1.0:1.0:1.0:1.0'

In advance thank you.

Empty .csv files

When running the TSV_to_Variants_Freq.py3 with a 100GB .tsv file, the process kills after about 1 hour with over 200GB of available memory remaining. I am pretty sure that it is a technical issue (i.e. RAM or other hardware issues) as I am not running it on a server put on a workbench. The created .csv files are empty.

However, I wanted to open the discussion in case of future occurrences and to possibly solve other related issues.

SVM.py index out of bounds

Dear all,

when trying to use Epinano (all versions) with the built model on the test sample your code returns the error:

Commad:  SVM.py -a -M M6A.mis3.del3.q3.poly.dump -p sample1.csv -cl 7,12,22 -mc 28 -o pretrained.prediction
Traceback (most recent call last):
  File "SVM.py", line 95, in <module>
    names = list (predict_df.columns[cols])
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 3940, in __getitem__
    result = getitem(key)
IndexError: index 11 is out of bounds for axis 0 with size 11

Kinds regards

Which reference sequences should be used to RNA mapping

Hi,
Which reference sequences should be used to RNA mapping, genome reference or transcriptome reference ? For species without a reference genome, can I mapping RNA to an assembled transcriptome?

Could you provide us with a train file you had trained

hi, could you provide me with a train file which was trained for normal ONT-Met dataset if I could not get my own train file？

per_site_var input

Hi @Huanle ,

As you said in the readme file to increase the speed we can split the sam2tsv file based on the reference as input for per_site_var.py. From my understanding, you meant we can split based on the chromosomes, for example for each chromosome we separate the sam2tsv result (chr1.tsv, chr2.tsv, etc)? If yes, Can we further split each chromosome into smaller files (chr1_1.tsv, chr1_2, etc)?

Thanks,
Vahid.

Does the method suitable for other species?

Hi huanle, I am very interested in the software developed by your group. I want to use the software to predict the methylation status for my own data, however, I do not have the training dataset. Could I use your model for my data produced by direct RNA sequencing without PCR and reverse transcription.

Nanopolish software which developed for predicting the DNA methylation status was suitable for all the other species. I have an expectation for this matter.

Looking foward for your reply.

checking docs

Hiya. I'm just going through the documentation, prior to installing and trying out, just to get straight what's needed, following through the pipeline.
Can I just confirm that step 7 in the docs, instead of reading...

python2 slide_per_site_var.py mod.ref.per_site.var.csv > mod.per_site.var.sliding.win.csv
python2 slide_per_site_var.py unm.ref.per_site.var.csv > unm.per_site.var.sliding.win.csv

..should actually read....

python2 slide_per_site_var.py mod.per_site.var.csv > mod.per_site.var.sliding.win.csv
python2 slide_per_site_var.py unm..per_site.var.csv > unm.per_site.var.sliding.win.csv

I reckon so, otherwise I've missed something. many thanks

adding a splitting step for TSV

Hi,
to speed up the pipeline we can think to split the TSV produced by sam2tsv in overlapping windows.

About the SK1 reference genome

Hi Huanle, Thank you very much for kindly providing the SK1 reference genome to me, though I only matched about 900 sites of the 1308 m6A sites to be A or T, at least it got work. Really thank you for your help!

Ying

It seems that there is a small flaw in "per_site_var.py"

Hi:
Thank you for your code.It is very useful.
But I found a bug in 41th lines of per_site_var.py:
if (ary[-2] != ary[5]):
instead of
if (ary[-2] != ary[4]):

Because the 5th column of the Sam2Tsv output file is read-base instead of the 4th column.

The sequence of your cloned fragments downstream of T7 promoter

Hi @Huanle,
In the manuscript, you have provided the link to the raw fast5 files for the cloned fragments you used to make reads for the development of your method. But, I could not find the reference (i.e. the sequence of fragments you cloned). I would appreciate it if you could kindly provide me the sequences of those fragments.

Many thanks,
Vahid.

Empty per-read and per-site file

Hi @Huanle

After running per_read_var.py and per_site_var.py scripts the results file are empty?
I exactly followed all the previous step and everything was fine but these two scripts did not give any results.
Thanks,
Vahid.

per_read_var.stats.py script

I cannot find the per_read_var.stats.py script that is used in step #5 of your workflow. Is it supposed to be in the /scripts subfolder of your repository? Or has it been renamed?
Thanks and best, Sophia

CSV files for predicting and training

Hello, I see that fast5 file is not used in the feature extraction process. Have you used the current information of nanopore in this way？

SVM.py gives error

Command: ~/EpiNano/scripts/SVM.py -a -t mod.tsv.per.site.var.csv -p unm.tsv.per.site.var.csv -cl 1-5 -mc 11 -o test

gives the error message "IndexError: single positional indexer is out-of-bounds" and I'm unclear on how to fix this issue

Question about the type of input file and reference sequences

Hello，
I am confused that must EpiNano software have both 'unmodified' and 'modified' sequences as input files to detect m6A RNA modifications ? Does this mean that I can't detect native RNA m6A modifications in the case that the 'unmodified' sequences were not available？
Which reference fasta sequences should be used to RNA mapping, genome reference or transcriptome reference ?
Are the modified RNA bases results transcript isoform-specfic modification information or genome site-specfic modification information ?

missing shabang in different scripts

Hi,
I see that some scripts are missing of the shabang.

Best,
Luca

Memory requirements of TSV_to_Variants_Freq.py3

Hi,
I've been trying to run Epinano (v1.1) following the procedure outlined in the wiki.
I'm running TSV_to_Variants_Freq.py3 as a batch job on LSF, but I've noticed that it uses a lot of memory. In my last try it used over 80GB before getting killed by the scheduler.
The input file is 48GB and I'm running the script with -t 2.
Is the use of so much RAM expected? Do you have guidelines on the amount needed so that I can reserve it with the scheduler?
Thanks!

Examples are not working

Hi,
I just wanted to try the examples for running the SVM.py stated in the Readme but I encounter an error. According to the shebang python3 should be used. I tried running it with the python3.6
python3.6 EpiNano/scripts/main/SVM.py -a -t EpiNano/examples/svm_input/sample1.csv -p EpiNano/examples/svm_input/sample2.csv -cl 1-5 -mc 11 -o test

The output is:
Commad: EpiNano/scripts/main/SVM.py -a -t EpiNano/examples/svm_input/sample1.csv -p EpiNano/examples/svm_input/sample2.csv -cl 1-5 -mc 11 -o test Colunms-used: 1-5 output: test.q1.q2.q3.q4.q5.SVM Traceback (most recent call last): File "EpiNano/scripts/main/SVM.py", line 118, in <module> X_train, _, y_train, _, indices_train, _ = train_test_split(X,Y.values.ravel(), indices, test_size=0, random_state= 100) File "/usr/local/lib64/python3.6/site-packages/sklearn/model_selection/_split.py", line 2100, in train_test_split default_test_size=0.25) File "/usr/local/lib64/python3.6/site-packages/sklearn/model_selection/_split.py", line 1734, in _validate_shuffle_split '(0, 1) range'.format(test_size, n_samples)) ValueError: test_size=0 should be either positive and smaller than the number of samples 19953 or a float in the (0, 1) range

Is there anything I can do to solve the issue?
Thank you
Best Alex

novoalab / epinano Goto Github PK

epinano's Introduction

Table of Contents

Upgrades

About EpiNano

Modes of Running EpiNano

Considerations when using EpiNano

Pre-requisites

Getting the code

Running EpiNano

a) Running EpiNano 1.2

STEP 1. Extract base-calling error features

STEP 2. Extract current intensity values

STEP 3. Predict RNA modifications

a) Predicting RNA modifications using EpiNano-Error:

b) Predicting RNA modifications using EpiNano-SVM:

b) Running EpiNano 1.1

Examples

Further Documentation

Citing this work

License

Contact

epinano's People

Contributors

Stargazers

Watchers

Forkers

epinano's Issues

Recommend Projects

Recommend Topics

Recommend Org