
remora's Introduction

[Oxford Nanopore Technologies]

Remora

Remora models predict methylation/modified base status separate from basecalling. The Remora repository is focused on the preparation of modified base training data and the training of modified base models. Some functionality for running Remora models and investigating raw signal is also provided. For production modified base calling use Dorado. For recommended modified base downstream processing use modkit. For more advanced modified base data preparation from "randomers", see the Betta release community note and reach out to customer support to inquire about access ([email protected]).

Installation

Install from pypi:

pip install ont-remora

Install from github source for development:

git clone git@github.com:nanoporetech/remora.git
pip install -e remora/[tests]

It is recommended that Remora be installed in a virtual environment. For example python3 -m venv venv; source venv/bin/activate.

See help for any Remora sub-command with the -h flag.

Getting Started

Remora models predict modified bases anchored to canonical basecalls or reference sequence from a nanopore read.

The Remora training/prediction input unit (referred to as a chunk) consists of:

  1. Section of normalized signal
  2. Canonical bases attributed to the section of signal
  3. Mapping between these two

Chunks have a fixed signal length defined at data preparation/model training time. These values are saved with the Remora model to extract chunks in the same manner at inference. A fixed position within the chunk is defined as the "focus position" around which the fixed signal chunk is extracted. By default, this position is the center of the "focus base" being interrogated by the model.

The canonical bases and mapping to signal (a.k.a. "move table") are combined for input into the neural network in several steps. First each base is expanded to the k-mer surrounding that base (as defined by the --kmer-context-bases hyper-parameter). Then each k-mer is expanded according to the move table. Finally each k-mer is one-hot encoded for input into the neural network. This procedure is depicted in the figure below.

Neural network sequence encoding
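To make this encoding concrete, the following minimal Python sketch walks through the three steps for a toy chunk. This is an illustration only, not the Remora implementation; the function name and example inputs are invented.

import numpy as np

def encode_chunk(seq, seq_to_sig_map, kmer_context_bases=(1, 1)):
    # seq: canonical bases for the chunk, e.g. "ACGT"
    # seq_to_sig_map: signal index assigned to each base (len(seq) + 1 entries)
    # kmer_context_bases: bases before/after each focus base in its k-mer
    before, after = kmer_context_bases
    kmer_len = before + 1 + after
    padded = "N" * before + seq + "N" * after
    # 1. expand each base to the k-mer surrounding it
    kmers = [padded[i:i + kmer_len] for i in range(len(seq))]
    # 2. expand each k-mer according to the move table (signal points per base)
    sig_kmers = []
    for base_idx, kmer in enumerate(kmers):
        n_sig = seq_to_sig_map[base_idx + 1] - seq_to_sig_map[base_idx]
        sig_kmers.extend([kmer] * n_sig)
    # 3. one-hot encode each k-mer (padding "N" encodes as all zeros)
    base_to_idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(sig_kmers), kmer_len * 4), dtype=np.float32)
    for i, kmer in enumerate(sig_kmers):
        for j, base in enumerate(kmer):
            if base in base_to_idx:
                encoded[i, j * 4 + base_to_idx[base]] = 1.0
    return encoded

# toy example: 4 bases spanning 10 signal points
print(encode_chunk("ACGT", [0, 2, 5, 7, 10]).shape)  # (10, 12)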

Data Preparation

Remora data preparation begins from a POD5 file containing signal data and a BAM file containing basecalls from that POD5 file. Note that the BAM file must contain the move table (--emit-moves in Dorado) and the MD tag (the default in Dorado with mapping, or the --MD argument for minimap2). If using minimap2 for alignment, use samtools fastq -T "*" [in.bam] | minimap2 -y -ax lr:hq [ref.fa] - | samtools view -b -o [out.bam] in order to transfer the move table tags through the alignment step, since minimap2 does not support SAM/BAM input.
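As a sketch, a minimal workflow producing a compatible BAM might look like the following, assuming Dorado for basecalling (the basecall model name is a placeholder) and minimap2 for alignment; --MD is added here per the MD tag requirement above, and flag names should be checked against your installed versions.

dorado basecaller [basecall_model] reads.pod5 --emit-moves > basecalls.bam
samtools fastq -T "*" basecalls.bam \
  | minimap2 -y --MD -ax lr:hq ref.fa - \
  | samtools view -b -o mappings.bam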

The following example generates training data from canonical (PCR) and modified (M.SssI treatment) samples in the same fashion as the released 5mC CG-context models. Example reads can be found in the Remora repository (see test/data/ directory).

K-mer tables for applicable conditions can be found in the kmer_models repository.

remora \
  dataset prepare \
  can_reads.pod5 \
  can_mappings.bam \
  --output-path can_chunks \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --motif CG 0 \
  --mod-base-control
remora \
  dataset prepare \
  mod_reads.pod5 \
  mod_mappings.bam \
  --output-path mod_chunks \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --motif CG 0 \
  --mod-base m 5mC

The above commands each produce a core Remora dataset stored in the directory defined by --output-path. Core datasets contain memory mapped numpy files for each core array (chunk data) and a JSON format metadata config file. These memory mapped files allow efficient access to very large datasets.

Before Remora 3.0, datasets were stored as numpy array dictionaries. Updating older datasets can be accomplished with the scripts/update_dataset.py script included in the repository.

Composing Datasets

Core datasets (or other composed datasets) can be composed to produce a new dataset. The remora dataset make_config command creates a config file specifying the composition of the new dataset. When reading batches from these combined datasets, the default behavior is to draw chunks randomly from the entire set of chunks. This setting is useful for multiple flowcells of the same condition.

The --dataset-weights argument produces a config which generates batches with a fixed proportion of chunks from each input dataset. This setting is useful when combining different data types, for example control and modified datasets.

The remora dataset merge command is supplied to merge datasets, copying the data into a new core Remora dataset. This may increase efficiency of data access for datasets composed of many core datasets, but only supports the default behavior from the make_config command (sampling over all chunks).

The remora dataset copy command is provided in order to move datasets to a new location. This can be useful when handling config datasets composed of many core datasets. Copying a dataset is especially useful to achieve higher training speeds when core datasets are stored on a network file system (NFS).

Composed dataset config files can also be specified manually. Config files are JSON format files containing a single list, where each element is a list of two items. The first is the path to the dataset and the second is the weight (must be a positive value). The make_config output config file will also contain the dataset hash to ensure the contents of a dataset are unchanged, but this is an optional third field in the config.
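For illustration, a minimal hand-written config combining the two datasets prepared above might look like the following (the paths and equal weights are illustrative, and the optional hash field is omitted):

[
  ["can_chunks", 1],
  ["mod_chunks", 1]
]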

Metadata attributes from each core dataset are checked for compatibility and merged where applicable. Chunk raw data are loaded from each core dataset at specified proportions to construct batches at loading time. In a break from Remora <3.0, datasets allow "infinite iteration", where each core dataset is drawn from indefinitely and independently to supply training chunks. For validation from a fixed set of chunks, finite iteration is also supported.

To generate a dataset config from the datasets created above one can use the following command.

remora \
  dataset make_config \
  train_dataset.jsn \
  can_chunks \
  mod_chunks \
  --dataset-weights 1 1 \
  --log-filename train_dataset.log

Model Training

Models are trained with the remora model train command. For example a model can be trained with the following command.

remora \
  model train \
  train_dataset.jsn \
  --model remora/models/ConvLSTM_w_ref.py \
  --device 0 \
  --chunk-context 50 50 \
  --output-path train_results

This command will produce a "best" model in TorchScript format for use with Bonito or the remora infer and remora validate commands. Models can be exported for use in Dorado with the remora model export train_results/model_best.pt train_results_dorado_model command.
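As a sketch, the export step followed by a Dorado invocation might look like the following. The --modified-bases-models flag name and the placeholder basecall model are assumptions to be checked against your installed Dorado version.

remora \
  model export \
  train_results/model_best.pt \
  train_results_dorado_model

dorado basecaller [basecall_model] reads.pod5 \
  --modified-bases-models train_results_dorado_model \
  > mod_calls.bam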

Model Inference

For testing purposes, inference within Remora is provided. For standard model architectures and inference methods, using the exported Dorado model during basecalling is recommended.

remora \
  infer from_pod5_and_bam \
  can_signal.pod5 \
  can_mappings.bam \
  --model train_results/model_best.pt \
  --out-file can_infer.bam \
  --log-filename can_infer.log \
  --device 0
remora \
  infer from_pod5_and_bam \
  mod_signal.pod5 \
  mod_mappings.bam \
  --model train_results/model_best.pt \
  --out-file mod_infer.bam \
  --log-filename mod_infer.log \
  --device 0

The remora validate from_modbams command is deprecated and will be removed in a future version of Remora. The modkit validate command is now recommended for this purpose.

Reference-anchored Inference

Reference-anchored inference allows users to make per-read per-site modified base calls against the reference sequence to which a read is mapped. This is in contrast to standard Remora model inference where calls are made against the basecalls. This mode can be useful to explore modified bases around which the canonical basecaller does not perform well. This inference mode is toggled by the --reference-anchored argument to the remora infer from_pod5_and_bam command.

The output BAM file from this command will take each mapped read and replace the basecalls with the mapped reference bases. The move table will be transferred to the mapped reference bases and interpolated over reference deletions in the mapping in order to enable extraction of Remora chunks for inference.

Note that this means that the canonical basecalls will show 0 errors over the entire output BAM file. The intended purpose of this output is only to store the modified base status for each read at each applicable base. Any analysis of basecall metrics should not use the output of this command.
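As an example, reference-anchored inference over the modified sample can be run by adding the flag to the earlier inference command (file names as in the previous examples):

remora \
  infer from_pod5_and_bam \
  mod_signal.pod5 \
  mod_mappings.bam \
  --model train_results/model_best.pt \
  --out-file mod_infer_ref_anchored.bam \
  --reference-anchored \
  --device 0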

Pre-trained Models

See the selection of currently released models with remora model list_pretrained. Pre-trained models are stored remotely and can be downloaded with the remora model download command, or they will be downloaded on demand when needed.

Models may be run from Bonito. See Bonito documentation to apply Remora models.

More advanced research models may be supplied via Rerio. These files must be downloaded from Rerio, and the path to the downloaded model then provided to Remora. Note that older ONNX-format models require Remora version < 2.0.

Downloaded or trained models can be inspected with the remora model inspect command to view the metadata attributes of the model.
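For example, to list the available pre-trained models and inspect the metadata of the model trained above (assuming the model path is passed as the positional argument to remora model inspect):

remora model list_pretrained
remora model inspect train_results/model_best.pt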

Python API and Raw Signal Analysis

Raw signal plotting is available via the remora analyze plot ref_region command.

The plot ref_region command is useful for gaining intuition into signal attributes and visualizing signal shifts around modified bases. As an example using the test data, the following command produces the plots below. Note that only a single POD5 file per sample is allowed as input and that the BAM records must contain the mv and MD tags (see the "Data Preparation" section above for details).

remora \
  analyze plot ref_region \
  --pod5-and-bam can_reads.pod5 can_mappings.bam \
  --pod5-and-bam mod_reads.pod5 mod_mappings.bam \
  --ref-regions ref_regions.bed \
  --highlight-ranges mod_gt.bed \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --log-filename log.txt

Plot reference region image (forward strand)

Plot reference region image (reverse strand)

The Remora API for accessing, manipulating, and visualizing nanopore reads, including signal, basecalls, and reference mappings, is described in more detail in the notebooks section of this repository.

Terms and Licence

This is a research release provided under the terms of the Oxford Nanopore Technologies' Public Licence. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. Much as we would like to rectify every issue, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid change by Oxford Nanopore Technologies.

© 2021-2024 Oxford Nanopore Technologies Ltd. Remora is distributed under the terms of the Oxford Nanopore Technologies' Public Licence.


remora's People

Contributors

a-slide, artrand, malton-ont, marcus1487


remora's Issues

training a sequence-specific model

Hi Remora Team,

I’m interested in 5-methyl CpG detection on a ~3 kb human sequence that has a very high (~10%) CpG density. Would you recommend training a new model for this specific sequence using a PCR template with and without in vitro M.Sssi methylation? Would a sequence-specific model have advantages over the currently available pre-trained dna_r9.4.1_e8 model, and what would you recommend for the number of reads to use for training?

Thanks in advance for your advice!

R10.4 settings recommendations

Hello!

I am interested in training a model on R10.4 modified data.

In the README, which uses a R9.4 model as an example, the following suggestions are made for running remora dataset prepare:

--chunk-context 50 50 
--kmer-context-bases 6 6

Would this be different for R10.4?

Thanks!
Paul

Model improvement questions

Greetings,

This is mostly about how to improve the quality of Remora models, and a few other questions will be asked below. I have trained a custom modification Remora model and used it to basecall a modified strand of DNA. The resulting model is pretty poor in that it mistakes natural CG sites for modified ones. I presumed this was due to poor modification efficiency on my end. I used the default 0.0.0 Remora mC model to basecall a methylated control strand and trained a model on it as well. I was surprised to see that my mC model was poor quality as well. I was wondering if you have any suggestions on how to improve the model training itself, as I am unable to train a basic mC model using nearly 100% methylated DNA strands. I'm attaching some IGV pictures visualizing Remora's pre-trained mC model, a trained mC model, and a custom model for our modification, in that order:

Megalodon mod_mappings using a pre-trained remora 5mC model

rem-pre-mC

Megalodon mod_mappings using a 5mC model trained by me

rem-cmC

Megalodon mod_mappings using a 5ahyC model trained by me

Note that IGV shows 5ahyC in blue just like 5hmC

rem-ahyC

The pre-trained model makes me believe that the methylation efficiency is sufficient. It is rather the model training where I could make some improvement. I used the workflow written in the repo's README. The models were prepared on a different ~1 kb substrate of relatively spaced CG motifs (similar amount and spacing as in the pictures). At this moment I have a few questions regarding this type of model training and some unrelated ones:

  1. Do you perhaps have any suggestions on how I could improve the model training process? Some settings to fiddle with? Or is it the substrate that is lacking?
  2. How exactly is the hmC-mC model trained? Is it possible to train a model which could separate hmC and my custom modification as the hmC-mC model does?
  3. Is it possible to train a model using only + strands? I.e. mapping the signals of + strands only, or separating them afterwards? Or rather, is there a way to process a fast5 file to separate the strands, assuming the sequence is not palindromic and is barcoded? This is important since we have difficulty modifying both strands.
  4. What exactly does "accuracy" describe while a Remora model is training? Note that my trained models had >0.99 accuracy.
  5. More importantly what kind of substrate do you recommend for model creation? CG content/length etc.?

remora model with multiple modifications

Greetings!

I'm wondering whether remora models can handle multiple modifications at the same time. For instance a model that can predict 5mC and 6mA simultaneously. It seems not since the {--motif, --mod-bases} pair can only represent one modification per model. Any plans for upgrading remora for handling multiple modifications?

Thank you very much!

R10.4 250bps signal to base mapping

I am trying to prepare a dataset for base modification detection that has been sequenced on a MinION device using an R10.4 flow cell at high accuracy mode 250bps.

Checking the models available at https://github.com/nanoporetech/megalodon/tree/master/megalodon/model_data for preparing the data, I don't think this model is available yet in Megalodon. Should I just wait for the model to be released, or is one of these models in Megalodon compatible?

Alternatively, I am used to the old-fashioned Tombo resquiggle for R9.4 for base-to-signal mapping, and I found that Remora also does something similar. It refers to a kmer_model_filename, but my guess is that this is not open to the public, right?

Best,
Marc

Several issues on remora usage

Hi,
Thanks for this amazing tool. I have several questions and would really appreciate it if you can help.

  1. What's the difference between the pre-trained models dna_r9.4.1_e8 and dna_r9.4.1_e8.1?

  2. What's the relationship of the remora pre-trained models and models in rerio repo? In the latest Megalodon it seems that Megalodon will call the remora model, but in the previous one Megalodon is using rerio model. A little confused here.

  3. Is remora independent from Megalodon and Taiyaki? Will remora replace Megalodon somehow in the future? Can you provide more information on its usages?

  4. Is remora a new methylation calling tool or not? If so, how can I use the remora to call methylations? Any plan on a detailed tutorial like Megalodon?

  5. Do you have any plan to release the training datasets for remora shown on NCM2021?

  6. It seems that by default remora requires ont-pyguppy-client-lib==5.1.9; however, the latest version of Guppy is only 5.0.16 in the community. Is there a delay for the Guppy release? Or is it possible to use remora with an older version of Guppy?

Thank you so much for your help!

Best,
Ziwei

Installation of taiyaki

Hello!

I am attempting to get everything prepared for training remora models.

I have megalodon installed and I am attempting to install taiyaki. I followed what was suggested in the README, namely:
Remora data preparation begins from Taiyaki mapped signal files generally produced from Megalodon containing modified base annotations. This requires installation of Taiyaki via pip install git+https://github.com/nanoporetech/taiyaki.

However, when I run this command in a clean python virtual environment, I encounter the following error:
[screenshot of the error: Screen Shot 2022-03-17 at 4 11 17 PM]

Any thoughts on why this is happening? Are there any specific requirements needed to install taiyaki in this way?

Thanks,
Paul

remora model calls different from cell samples and blood samples

Hello
I have sequenced some blood samples and some cell samples received from Coriell using the Remora modified basecalling models. It seems to me like more unmodified bases are called from the cell samples. In the picture from IGV, the two tracks at the bottom (cells) are far more blue than the three tracks on the top (blood). The same is seen at many other positions.

[IGV screenshot]

Does anyone have experience with this or seen anything similar?

Remora model selection with different accuracy

Hi developers,

Thanks for this nice tool. We are using the Remora models with the corresponding Guppy config files (dna_r9.4.1_450bps_fast/hac/sup.cfg) for 5hmC calling on the same sample, but the proportion results show some differences, especially for the super accuracy one, as below:

dna_r9.4.1_e8 fast 0.0.0 5hmc_5mc CG 0 -- 0.07761384709206084
dna_r9.4.1_e8 hac 0.0.0 5hmc_5mc CG 0 -- 0.07974461372838691
dna_r9.4.1_e8 sup 0.0.0 5hmc_5mc CG 0 -- 0.19680854712480966

Would you please provide some suggestions on Remora model selection for 5hmC and 5mC calling of human DNA samples?

Thank you very much.

Best regards,
Ying

Install ont-remora==2.0.0 failed, due to pod5 install failed

When I try to install 2.0.0, it fails:

pip install ont-remora==2.0.0
Collecting ont-remora==2.0.0
  Using cached ont-remora-2.0.0.tar.gz (76 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 786 kB/s
ERROR: Could not find a version that satisfies the requirement pod5>=0.0.43 (from ont-remora) (from versions: none)
ERROR: No matching distribution found for pod5>=0.0.43

However, when I try to install pod5 directly, it still fails:

model for SQK-LSK109

Dear developers,

Thanks for the nice tool! May I ask you whether I can use the pre-trained model dna_r9.4.1_e8 if my ligation kit is SQK-LSK109 (kit 9?)? If not, do you think it's ok for me to use remora to train my own model? What model or even software for methylation calling do you recommend if my sequencing chemistry is before SQK-LSK110?

Also, I see that Remora only provides models for 5mC. I'm working on plants, and would like to get information on other types of methylation as well. Could we also use Remora to train for other types of modification?

Many thanks!

Accuracy about Remora

Hi, thank you very much for sharing this useful tool.

I tried to use Megalodon to call base modifications, but I find it very slow. Because the 'call base modification' step within Megalodon loads the specified Remora model once I set this, I had the idea that I could run Remora directly if I just want to call base modifications for my sample, and that the accuracy would be similar to Megalodon. Is that right?

Remora default model warning [Bonito]

Hi Marcus and team, thanks for working on the methylation calling and integration with Bonito.

I thought to open an issue here because we consistently get a Remora warning (regardless of Bonito installation on multiple systems / containers) that the version of the model is not available and the default model is used instead, e.g.

bonito basecaller [email protected] --modified-bases 5mC  --reference chm13.mmi fast5/ > basecalls.bam
> loading model [email protected]
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.4) not found. Using default Remora model for dna_r9.4.1_e8.1_fast.
> loaded modified base model to call (alt to C): m=5mC
> loading reference
> outputting aligned bam
...

My question is whether this is intended behaviour? It seems like no models other than the default models are available when checking with Remora:

remora model list_pretrained

[07:33:21] Remora pretrained modified base models:
Pore             Basecall_Model_Type    Basecall_Model_Version    Modified_Bases    Remora_Model_Type      Remora_Model_Version
---------------  ---------------------  ------------------------  ----------------  -------------------  ----------------------
dna_r9.4.1_e8    fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8.1  fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5hmc_5mc          CG                                        0

Multiple GPU device

Hi @marcus1487 ,

I was wondering whether it is possible to specify multiple GPU devices for Remora during training, like --device 0 1 2?

Thanks,
Vahid

Regarding "--motif N 0"

Hi developers,

I am trying out Remora for our dataset, where the modified bases don't appear in a specific context. In this case, can I set --motif N 0 during the data preparation step, and how will it affect the performance?

Thanks,

--Kai

'Remora model list_pretrained'

Hello. Thank you very much for sharing this useful tool.
When I install 'remora' from the GitHub source for development, it succeeds. However, when I run 'remora model list_pretrained', there is an error:

'''
Traceback (most recent call last):
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/bin/remora", line 33, in
sys.exit(load_entry_point('ont-remora', 'console_scripts', 'remora')())
File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/main.py", line 69, in run
cmd_func(args)
File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/parsers.py", line 674, in run_list_pretrained
from remora.model_util import get_pretrained_models
File "/lustre/home/rongqiao/remora1.1.0/remora/src/remora/model_util.py", line 11, in
import onnx
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/init.py", line 11, in
from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors, convert_model_to_external_data
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/external_data_helper.py", line 14, in
from .onnx_pb import TensorProto, ModelProto
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/onnx_pb.py", line 8, in
from .onnx_ml_pb2 import * # noqa
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 33, in
_descriptor.EnumValueDescriptor(
File "/lustre/home/rongqiao/anaconda3/envs/remora1.1.0-env/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in new
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
'''

Is this because I installed the package from the GitHub source?

how to interpret the results from "remora infer from_taiyaki_mapped_signal"

Thanks for the great tool!

Just wondering how the "read_pos" values in the results from "remora infer from_taiyaki_mapped_signal" are chosen - so which positions of a read are shown in the result, please? If only modified positions are shown, why would we have different class_pred values? Also, the read_pos is the relative position within the read but not the position

And for "class_pred", 1 means modified and 0 means unmodified, don't they?

What's the meaning of "label", please?

Thanks!
Jon

About the pre-trained models

I am wondering how the pre-trained models were made, specifically what organism did the training data come from?

As I understand, basecalling accuracy for a sample can be improved by using a model trained on data coming from the same taxonomy as the sample. Does the accuracy of modification-calling also benefit from taxon-specific training data?

Is remora 5hmC/5mC ready for "prime time"?

Is the 5hmC/5mC remora mode quantitative enough for biological inference now? Unfortunately I haven't seen any benchmarking papers/preprints out there, and I haven't seen any data on 5hmC performance aside from its introduction in some of the nanopore conferences.

We know that the regular 5mC model is essentially as good/better than bisulfite 5mC calling. Do you have that information for 5hmC/5mC?

Error while installing remora

Hello Everyone,

I am currently trying to get Remora and the basecaller Bonito onto our HPC. I am using the pip install command but I always get the error:

      ############################
      # Package would be ignored #
      ############################
      Python recognizes 'remora.trained_models' as an importable package, however it is
      included in the distribution as "data".
      This behavior is likely to change in future versions of setuptools (and
      therefore is considered deprecated).
  
      Please make sure that 'remora.trained_models' is included as a package by using
      setuptools' `packages` configuration field or the proper discovery methods
      (for example by using `find_namespace_packages(...)`/`find_namespace:`
      instead of `find_packages(...)`/`find:`).
  
      You can read more about "package discovery" and "data files" on setuptools
      documentation page.
  
  
  !!
  
    check.warn(importable)
  error: command 'icc' failed: No such file or directory
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for ont-remora
Failed to build ont-remora
ERROR: Could not build wheels for ont-remora, which is required to install pyproject.toml-based projects

Maybe this is a known issue or someone can help me out. I am currently using a PyPI mirror since the HPC has no net connection.

I would appreciate any help!

kind regards,

Azlan

Training remora for use with bonito

Hi!

I'm excited to try to train remora for additional base modifications.

The Remora Readme describes how to create training data using megalodon and taiyaki and then states

"This command will produce a final model in ONNX format for use in Bonito, Megalodon or remora infer commands."

I just wanted to be extra sure and ask: If I use megalodon and taiyaki as described above, will a Remora model trained that way be usable with Bonito? Or would I need to then also perform inference using megalodon+remora? Are there separate instructions how to generate a training dataset for remora using bonito?

Data preparation scripts for Remora models with random bases

Hello Remora Team,

In this year's ONT update, Clive mentioned that the newer models that perform better than BS-seq are trained with sequences that contain a modified position with +-30 random bases around that position, if I understand it correctly. Are the scripts to prepare the training data for this kind of input data publicly available? Right now only fully modified and unmodified reads are applicable with the data preparation scripts uploaded here, correct?

Thanks for your help!

Cheers,
Anna

direct download links ?

Hi,

I'd like to download the Remora models directly via wget etc. (likely having proxy problems).
Do you have direct links somewhere? Thanks

remora model download --pore dna_r9.4.1_e8

Traceback (most recent call last):
  File "/home/hpc/davenpor/.local/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/home/hpc/davenpor/.local/lib/python3.7/site-packages/remora/main.py", line 69, in run
    cmd_func(args)
  File "/home/hpc/davenpor/.local/lib/python3.7/site-packages/remora/parsers.py", line 809, in run_download
    model_dl.download(model_url)
  File "/home/hpc/davenpor/.local/lib/python3.7/site-packages/remora/download.py", line 29, in download
    'filename="([^"]+)', req.headers["content-disposition"]
  File "/home/hpc/davenpor/.local/lib/python3.7/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-disposition'

How to draw histogram plots of modified base probabilities using Ml probability tag in modified BAM file?

Hi,

First of all, I'd like to thank you for developing Remora.

I want to compare modified base probabilities between mutant and control groups based on Ml probability tag in modified BAM files that were produced by Remora.

I want to be sure whether the modification difference between the two groups is real or not by looking into the base probabilities; that's the reason why I want to compare the probabilities of the two groups. How can I draw a histogram plot based on 5mC probabilities for both groups?

Could you help me with that?

Best.

How do I use Remora v2 with Guppy?

How do I use the newest Remora release with Guppy? Or do I have to wait for an integrated release?

I noticed that I could possibly use a separate command that utilizes both the bam and pod5 file, but we are still using fast5s.

How to use megalodon with ONNX model?

Dear authors,

I have followed the guide in the readme to make an onnx model.
At the end of the guide it is mentioned that:
"This command will produce a final model in ONNX format for use in Bonito, Megalodon or remora infer commands."

However I cannot find any instructions on how to use megalodon with an onnx model.

The megalodon option --remora-modified-bases seems to be specific to pretrained models, but is there some other option for home-made models? Or is there some other way to convert an ONNX model into a usable model for Megalodon?

With kind regards,
Carlo

Insect 5hmC values anomalous

I am working with an insect genome and trying to call 5mC and optionally 5hmC. When using Megalodon with the --remora-modified-bases dna_r9.4.1_e8 model calling 5mC only, I get around 6% 5mC (too high), while calling 5hmC_5mC I get 0.55% 5mC (about right), but I am getting nearly 60% 5hmC, which seems FAR too high to be realistic. I've never heard of an insect with such incredibly high 5hmC.

  1. Shouldn't the 5mC values match from both calls?
  2. What's going on with the 5hmC? If I can't trust that one, why should I trust the 5hmC levels?

My calls are

  1. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5mc CG 0
  2. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5hmC-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5hmc_5mc CG 0

Question about data preparation for training model

Hi,

I understand that to train a model using Remora you first have to basecall fully unmethylated (PCR) or fully methylated (M.SssI) reads, then merge both results to build a training dataset using taiyaki/misc/merge_mappedsignalfiles.py. However, in my case I need to use only specific genomic positions I know to be always methylated/unmethylated from a BS-seq reference. Is this something I can do with Remora before the merging of basecalls? Or using Taiyaki?

Thanks,

Paul

question about 'remora infer' command

Hello, thank you so much for sharing this very useful tool.
When I use the 'remora infer' command to call modifications, I just set the parameters '--onnx model', '--output-path', '--overwrite' in addition to the required parameters. This command runs successfully and it runs on the CPU. However, it runs so slowly that it shows it will finish after 150 hours.
So, I tried to run the 'remora infer' command on a GPU device. After I installed the 'onnxruntime-gpu' package and ran 'remora infer' with the '--device 0' parameter, there is an error message:

Traceback (most recent call last):
  File "...", line 33, in <module>
    sys.exit(load_entry_point('ont-remora', 'console_scripts', 'remora')())
  File "/.../remora1.1.1/remora/src/remora/main.py", line 69, in run
    cmd_func(args)
  File "/.../remora1.1.1/remora/src/remora/parsers.py", line 793, in run_infer_from_taiyaki_mapped_signal
    from remora.inference import infer
  File "/.../remora1.1.1/remora/src/remora/inference.py", line 17, in <module>
    from remora.model_util import load_model
  File "/.../remora1.1.1/remora/src/remora/model_util.py", line 7, in <module>
    import pkg_resources
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 3260, in <module>
    def _initialize_master_working_set():
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 3234, in _call_aside
    f(*args, **kwargs)
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 3272, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 581, in _build_master
    ws.require(__requires__)
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 909, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/.../python3.8/site-packages/pkg_resources/__init__.py", line 795, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'onnxruntime>=1.7' distribution was not found and is required by ont-remora

I don't think it is a version error, because when I run 'import onnxruntime' and 'onnxruntime.get_device()' in Python, it shows 'GPU'. Like this:
>>> import onnxruntime
>>> onnxruntime.get_device()
'GPU'

I have spent a lot of time trying to solve this problem, but still have no idea. I wonder if you have any advice to help me. Thank you very much.

Any support for 5hmC on the newest Remora models for r9.4.1?

We have a ton of data using R9.4.1 (Kit 10) chemistry (e.g. on the order of hundreds of human samples) and would like to explore 5hmC calling. Is that possible with the new Remora v2 models, or do we have to use the older initial Remora ones? As far as I know, those have substantially lower accuracy?

Various questions about Remora

Hello,

I have a few questions that are not really related to each other.

First: I always assumed when training and running the model on new data, you had to know the reference in each case in order to generate ground-truth sequences. However, I just noticed this paragraph:

"The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object. sig should be a float32 numpy array. seq is a string derived from sig (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions). seq_to_sig_map should be an int32 numpy array of length len(seq) + 1 and elements should be indices within sig array assigned to each base in seq."

Let's say I know the reference sequence for the training data but may not know the reference for some new unseen data. Would it be advisable to do the following process?

Training:

  1. Basecall using Guppy
  2. Generate ground-truth sequences by mapping basecalls to a reference
  3. Use Taiyaki prepare_mapped_reads.py to map signals to ground-truth sequences
  4. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
  5. Remora model train on resulting .npz
  6. Generate .onnx model file

Testing on new data:

  1. Basecall using Guppy
  2. Using Taiyaki prepare_mapped_reads.py (?) to map signals to basecalls (NOT a reference)
  3. Convert Taiyaki .hdf5 into Remora .npz using remora dataset prepare
  4. Run remora infer from_remora_dataset on resulting .npz file with the .onnx model file generated during training.

If the answer to the above question is yes, then what is the best way to map signals to the basecalls? Would I just use the same process (prepare_mapped_reads.py with basecalls.fastq as reference?)

My second question has to do with remora dataset prepare. The default for the --motif parameter is N 0. However, from what I understand, a canonical base (ACTG or any combination) motif/position has to be declared when running Taiyaki's prepare_mapped_reads.py. Generating predictions in any context would be ideal for my situation, but I'm not sure how to get the default here to work. If I try to run prepare_mapped_reads.py with --alphabet ACTG --mod Y N mod_long_name_here, it throws an assertion error saying "Canonical coding for modified base must be a canonical base, got N." If I try running remora dataset prepare with default parameters after successfully running prepare_mapped_reads.py with something that works, like --alphabet ACTG --mod Y A mod_long_name_here, then Remora throws a RemoraError saying "Canonical base within motif does not match canonical equivalent for modified base (A)."

What I'm getting at here is it doesn't seem to be possible to run remora with the default "any context" --motif parameter N 0 because of limitations of the tools used further upstream, such as Taiyaki's prepare_mapped_reads.py. If there is a way to generate a dataset in which the default Remora --motif parameter works, it would be of great help to know how to do that.

Thanks!

Running Megalodon for Remora 5mC_all_context_sup_r1041_e82 model

We are interested in trying out doing methylation calling on data generated from an R10.4.1 flow cell that has already been basecalled using the SUP basecalling model.

From everything we've read, it seems like this is the exact use for the Rerio model: 5mC_all_context_sup_r1041_e82

I had a few questions about the logistics of actually running this model, though. We have successfully downloaded the file and have the .onnx, but I'm not sure what we should be using for the following parameters:

Should we use the --do-not-use-guppy-server command if we want to use the basecalling that has already been done and is in our fast5 files?

If not, what should we specify for our --guppy-config file? My intuition is: dna_r10.4.1_e8.2_260bps_sup.cfg, but when we try this it times out without ever starting. When looking at the logs: "Could not load guppy server configuration state: 'Configurations'"

For this rerio model, should we specify --remora-modified-bases?

If this is of any help, here is basically what we are trying, which does run:

megalodon /SSD/TestData/fast5_pass/ --reference testref.mmi --devices 0 --guppy-server-path /opt/ont/guppy/bin/basecall_server --outputs mod_mappings mods mappings --output-directory /SSD/test_directory/ --processes 30 --remora-model /opt/ont/guppy/data/remora_models_5mc_all_context_sup_r1041_e82.onnx --guppy-config dna_r10.3_450bps_hac.cfg

It seems to be running, but I am not sure if it is appropriate, in particular the guppy-config file.

Thanks in advance

Correct model to use for 5hmC calling

Not sure if this should be a bonito issue or a remora issue.

I run the following:

bonito basecaller [email protected] $input_path --modified-bases 5mC 5hmC --reference $reference > basecalls_with_mods.sam

I then get the following error:

--- Logging error ---
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/dist-packages/remora/model_util.py", line 549, in load_model
      submodels = submodels[modified_bases]
  KeyError: '5hmc_5mc'
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/usr/lib/python3.8/logging/__init__.py", line 1085, in emit
      msg = self.format(record)
    File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
      return fmt.format(record)
    File "/usr/local/lib/python3.8/dist-packages/bonito/mod_util.py", line 25, in format
      self._style._fmt = self.fmt
  AttributeError: 'CustomFormatter' object has no attribute 'fmt'
  Call stack:
    File "/usr/local/bin/bonito", line 8, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.8/dist-packages/bonito/__init__.py", line 34, in main
      args.func(args)
    File "/usr/local/lib/python3.8/dist-packages/bonito/cli/basecaller.py", line 75, in main
      mods_model = load_mods_model(
    File "/usr/local/lib/python3.8/dist-packages/bonito/mod_util.py", line 47, in load_mods_model
      return load_model(
    File "/usr/local/lib/python3.8/dist-packages/remora/model_util.py", line 551, in load_model
      LOGGER.error(
  Message: 'Remora model for modified bases 5hmc_5mc not found for [email protected].'
  Arguments: ()

I checked the pre-trained models which suggest that at least for "v0.0.0" there should be a 5hmC model:

[15:29:15] Remora pretrained modified base models:
Pore              Basecall_Model_Type    Basecall_Model_Version    Modified_Bases    Remora_Model_Type      Remora_Model_Version
----------------  ---------------------  ------------------------  ----------------  -------------------  ----------------------
dna_r10.4_e8.1    fast                   0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    fast                   0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    fast                   v3.3                      5mc               CG                                        0
dna_r10.4_e8.1    fast                   v3.3                      5mc               CG                                        1
dna_r10.4_e8.1    hac                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    hac                    0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    hac                    v3.3                      5mc               CG                                        0
dna_r10.4_e8.1    hac                    v3.3                      5mc               CG                                        1
dna_r10.4_e8.1    sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    sup                    0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    sup                    v3.4                      5mc               CG                                        0
dna_r10.4_e8.1    sup                    v3.4                      5mc               CG                                        1
dna_r10.4.1_e8.2  fast                   0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  fast                   0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  fast                   v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  fast                   v3.5.1                    5mc               CG                                        1
dna_r10.4.1_e8.2  hac                    0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  hac                    0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  hac                    v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  hac                    v3.5.1                    5mc               CG                                        1
dna_r10.4.1_e8.2  sup                    0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  sup                    0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  sup                    v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  sup                    v3.5.1                    5mc               CG                                        1

I changed my call to:

bonito basecaller [email protected] $input_path --modified-bases 5mC 5hmC --reference $reference > basecalls_with_mods.sam

But this then complains about a non-existent model. I am sure I am being stupid?

Installing megalodon on python3.6 breaks remora

I think there is some weird Python incompatibility here. I installed Megalodon on Python 3.6 because I needed to get the pyguppy API for Guppy 5.0.16 (it doesn't exist for Python 3.9). When typing in the following, I get an error:

[billylau@sh02-01n58 /scratch/groups/hanleeji/ONT_test/20211011_PRM_1096] (job 40324222) $ remora model list_pretrained
Traceback (most recent call last):
  File "/home/users/billylau/.local/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/home/users/billylau/.local/lib/python3.6/site-packages/remora/main.py", line 67, in run
    cmd_func(args)
  File "/home/users/billylau/.local/lib/python3.6/site-packages/remora/parsers.py", line 560, in run_list_pretrained
    from remora.model_util import get_pretrained_models
  File "/home/users/billylau/.local/lib/python3.6/site-packages/remora/model_util.py", line 13, in <module>
    import torch
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/__init__.py", line 573, in <module>
    import torch.quantization
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/quantization/__init__.py", line 9, in <module>
    from .quantize_fx import *
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/quantization/quantize_fx.py", line 1, in <module>
    from .fx import Fuser  # noqa: F401
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/quantization/fx/__init__.py", line 1, in <module>
    from .quantize import Quantizer
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/quantization/fx/quantize.py", line 2, in <module>
    from torch._fx import (
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/_fx/__init__.py", line 89, in <module>
    from .symbolic_trace import symbolic_trace, Tracer
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/_fx/symbolic_trace.py", line 9, in <module>
    from .proxy import Proxy, _create_proxy, TracerBase
  File "/home/users/billylau/.local/lib/python3.6/site-packages/torch/_fx/proxy.py", line 7, in <module>
    from typing import Tuple, Dict, Optional, Iterable, NoReturn, Any, Union, Callable
ImportError: cannot import name 'NoReturn'

When I install megalodon from scratch using python 3.9, that command works perfectly. I'm not sure why having a different python version would do this, though. Or maybe the problem is something else -- I noticed that it's trying to do something with torch.

questions: remora models for R9.4 in guppy, bonito, and dorado

Hi,

I am using remora models to call methylation with different basecallers, such as guppy v6.1.7, bonito v0.6.2, and dorado v0.1.1.
The data is sequenced on a Nanopore PromethION with the R9.4.1 pore.

The reason I use multiple callers is that I am confused by those models:

  1. I keep reading that Guppy is the recommendation since it is already integrated with Remora, and I found models like dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast_prom.cfg and dna_r9.4.1_450bps_modbases_5mc_cg_fast_prom.cfg under the Guppy model folder. It seems like Remora can be used to call both 5mC and 5hmC for R9 in Guppy? How come I couldn't find them with remora model list_pretrained?

  2. Bonito seems to load_model from Remora if I don't specify the path of modbase_model and only use --modified-bases 5mC. There is only one available sup model for R9 ([email protected]) in Bonito. I am wondering which Remora model is eventually used (v3.3, v3.5, or whichever is latest).

  3. Dorado stores and downloads its own Remora models, so for sup mode there are currently [email protected] and [email protected]_5mCG@v0. Since the versions are different, I also need to specify the --modified-bases-model separately.

I think my questions are:

  1. Am I using the right Remora model in Guppy? Is Guppy really using a Remora model to call both 5mC and 5hmC in R9.4.1? How come this is not available in other callers or even in the Remora model list? And eventually, when Remora updates the models, do I need to update Guppy to get the newer models?

  2. Both Bonito and Dorado use one [email protected] for canonical basecalling; should I use the latest version of the Remora modbase models or try to find the matched v3.3? Again, which version of Remora is integrated in Guppy, Bonito, and Dorado, respectively?

Thanks for any comments.

what kind of modification sites can be detected?

Thanks for developing this software and algorithm for modification detection. May I ask, with the pre-trained models in Remora, what kinds of methylation sites can be detected? I assume 5mC sites definitely can be detected; what about 6mA sites or RNA methylation?

Error when running with remora with megalodon

Hello,

Very excited to try this tool. I have a couple of ideas for some training but first I would like to try it with megalodon to compare results with previous work.

So far, I have created a py-venv to install Remora and Megalodon together. I am using Guppy 5.0.17 for basecalling.

The install seemed to be OK, but I get this error when launching Megalodon:

[17:51:50] Running Megalodon version 2.5.0
******************** WARNING: "mods" output requested, so "per_read_mods" will be added to outputs. ********************
[17:51:50] Loading guppy basecalling backend
[2022-09-28 17:51:54.560985] [0x00007efff5983700] [info]    Connecting to server as ''
[2022-09-28 17:51:54.563003] [0x00007efff5983700] [info]    Connected to server as ''. Connection id: 78acbf80-dc98-4661-9f48-347d898fe981
Traceback (most recent call last):
  File "/home/prom/.local/bin/megalodon", line 8, in <module>
    sys.exit(_main())
  File "/Nanopore/megalodon/megalodon/__main__.py", line 754, in _main
    megalodon._main(args)
  File "/Nanopore/megalodon/megalodon/megalodon.py", line 1797, in _main
    model_info = backends.ModelInfo(
  File "/Nanopore/megalodon/megalodon/backends.py", line 596, in __init__
    self.pyguppy_load_settings(
  File "/Nanopore/megalodon/megalodon/backends.py", line 1175, in pyguppy_load_settings
    self.pyguppy_set_model_attributes(
  File "/Nanopore/megalodon/megalodon/backends.py", line 1113, in pyguppy_set_model_attributes
    from remora import model_util
  File "/home/prom/.local/lib/python3.8/site-packages/remora/model_util.py", line 15, in <module>
    from torch import nn
  File "/home/prom/.local/lib/python3.8/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/home/prom/.local/lib/python3.8/site-packages/torch/nn/modules/__init__.py", line 1, in <module>
    from .module import Module
  File "/home/prom/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 7, in <module>
    from ..parameter import Parameter
  File "/home/prom/.local/lib/python3.8/site-packages/torch/nn/parameter.py", line 2, in <module>
    from torch._C import _disabled_torch_function_impl
ModuleNotFoundError: No module named 'torch._C'

I have looked for this error on the PyTorch GitHub, but I am not sure the answers there fit my issue.

Any help is appreciated as always !

Best,
Paul

ModuleNotFoundError: No module named 'pandas' when attempting remora model list_pretrained

There seems to be a missing package requirement during install. It's fixed when I run conda install pandas afterward.

billylau@suzuki:~$ conda create -n remora python=3.8
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/billylau/.conda/envs/remora:

The following NEW packages will be INSTALLED:

    _libgcc_mutex:    0.1-main                
    ca-certificates:  2021.10.26-h06a4308_2   
    certifi:          2021.10.8-py38h06a4308_0
    ld_impl_linux-64: 2.35.1-h7274673_9       
    libffi:           3.3-he6710b0_2          
    libgcc-ng:        9.1.0-hdf63c60_0        
    libstdcxx-ng:     9.1.0-hdf63c60_0        
    ncurses:          6.3-h7f8727e_2          
    openssl:          1.1.1l-h7f8727e_0       
    pip:              21.2.4-py38h06a4308_0   
    python:           3.8.12-h12debd9_0       
    readline:         8.1-h27cfd23_0          
    setuptools:       58.0.4-py38h06a4308_0   
    sqlite:           3.36.0-hc218d9a_0       
    tk:               8.6.11-h1ccaba5_0       
    wheel:            0.37.0-pyhd3eb1b0_1     
    xz:               5.2.5-h7b6447c_0        
    zlib:             1.2.11-h7b6447c_3       

Proceed ([y]/n)? y

#
# To activate this environment, use:
# > source activate remora
#
# To deactivate an active environment, use:
# > source deactivate
#

billylau@suzuki:~$ source activate remora
(remora) billylau@suzuki:~$ remora model list_pretrained
remora: command not found
(remora) billylau@suzuki:~$ pip install ont-remora
Collecting ont-remora
  Downloading ont-remora-0.1.1.tar.gz (15.0 MB)
     |████████████████████████████████| 15.0 MB 11.6 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting numpy
  Downloading numpy-1.21.4-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 158.2 MB/s 
Collecting tqdm
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting onnxruntime>=1.7
  Downloading onnxruntime-1.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
     |████████████████████████████████| 4.8 MB 48.1 MB/s 
Collecting thop
  Downloading thop-0.0.31.post2005241907-py3-none-any.whl (8.7 kB)
Collecting onnx
  Downloading onnx-1.10.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.7 MB)
     |████████████████████████████████| 12.7 MB 30.9 MB/s 
Collecting torch
  Downloading torch-1.10.0-cp38-cp38-manylinux1_x86_64.whl (881.9 MB)
     |████████████████████████████████| 881.9 MB 5.5 kB/s 
Collecting scikit-learn
  Using cached scikit_learn-1.0.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (25.9 MB)
Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting protobuf
  Downloading protobuf-3.19.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 134.1 MB/s 
Collecting flatbuffers
  Downloading flatbuffers-2.0-py2.py3-none-any.whl (26 kB)
Collecting typing-extensions>=3.6.2.1
  Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Collecting six
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting scipy>=1.1.0
  Downloading scipy-1.7.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.3 MB)
     |████████████████████████████████| 39.3 MB 52.9 MB/s 
Collecting joblib>=0.11
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Building wheels for collected packages: ont-remora
  Building wheel for ont-remora (PEP 517) ... done
  Created wheel for ont-remora: filename=ont_remora-0.1.1-cp38-cp38-linux_x86_64.whl size=15350084 sha256=13147a2391f69bcb3473967dbe260b507877ab37007a79bdc32a36c710435ae0
  Stored in directory: /home/billylau/.cache/pip/wheels/65/3f/90/2d990ece00be8f22d2aa09b387689ff7bfa2eb4f361ae5d7aa
Successfully built ont-remora
Installing collected packages: typing-extensions, numpy, torch, threadpoolctl, six, scipy, protobuf, joblib, flatbuffers, tqdm, thop, tabulate, scikit-learn, onnxruntime, onnx, ont-remora
Successfully installed flatbuffers-2.0 joblib-1.1.0 numpy-1.21.4 onnx-1.10.2 onnxruntime-1.9.0 ont-remora-0.1.1 protobuf-3.19.1 scikit-learn-1.0.1 scipy-1.7.3 six-1.16.0 tabulate-0.8.9 thop-0.0.31.post2005241907 threadpoolctl-3.0.0 torch-1.10.0 tqdm-4.62.3 typing-extensions-4.0.1
(remora) billylau@suzuki:~$ remora model list_pretrained
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/remora/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/home/billylau/.conda/envs/remora/lib/python3.8/site-packages/remora/main.py", line 67, in run
    cmd_func(args)
  File "/home/billylau/.conda/envs/remora/lib/python3.8/site-packages/remora/parsers.py", line 560, in run_list_pretrained
    from remora.model_util import get_pretrained_models
  File "/home/billylau/.conda/envs/remora/lib/python3.8/site-packages/remora/model_util.py", line 10, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
(remora) billylau@suzuki:~$ conda install pandas
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/billylau/.conda/envs/remora:

The following NEW packages will be INSTALLED:

    blas:            1.0-mkl               
    bottleneck:      1.3.2-py38heb32a55_1  
    intel-openmp:    2021.4.0-h06a4308_3561
    mkl:             2021.4.0-h06a4308_640 
    mkl-service:     2.4.0-py38h7f8727e_0  
    mkl_fft:         1.3.1-py38hd3c417c_0  
    mkl_random:      1.2.2-py38h51133e4_0  
    numexpr:         2.7.3-py38h22e1b3c_1  
    numpy:           1.21.2-py38h20f2e39_0 
    numpy-base:      1.21.2-py38h79a1101_0 
    pandas:          1.3.4-py38h8c16a72_0  
    python-dateutil: 2.8.2-pyhd3eb1b0_0    
    pytz:            2021.3-pyhd3eb1b0_0   
    six:             1.16.0-pyhd3eb1b0_0   

Proceed ([y]/n)? y

mkl-service-2. 100% |##################################################################################################################################################################################| Time: 0:00:00  16.51 MB/s
numpy-base-1.2 100% |##################################################################################################################################################################################| Time: 0:00:00  81.29 MB/s
bottleneck-1.3 100% |##################################################################################################################################################################################| Time: 0:00:00  27.95 MB/s
mkl_fft-1.3.1- 100% |##################################################################################################################################################################################| Time: 0:00:00  29.60 MB/s
mkl_random-1.2 100% |##################################################################################################################################################################################| Time: 0:00:00  31.75 MB/s
numpy-1.21.2-p 100% |##################################################################################################################################################################################| Time: 0:00:00  15.87 MB/s
numexpr-2.7.3- 100% |##################################################################################################################################################################################| Time: 0:00:00  29.95 MB/s
pandas-1.3.4-p 100% |##################################################################################################################################################################################| Time: 0:00:00  29.95 MB/s
(remora) billylau@suzuki:~$ remora model list_pretrained
[10:40:16] Remora pretrained modified base models:
Pore             Basecall_Model_Type    Basecall_Model_Version    Modified_Bases    Remora_Model_Type      Remora_Model_Version
---------------  ---------------------  ------------------------  ----------------  -------------------  ----------------------
dna_r9.4.1_e8    fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8.1  fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5hmc_5mc          CG                                        0
(remora) billylau@suzuki:~$ 
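For anyone hitting the same thing, the workaround shown in the session above boils down to installing pandas into the same environment before re-running the command; either form should work, depending on whether the environment is pip- or conda-managed:

# install the missing dependency alongside ont-remora
pip install pandas
# or, in a conda-managed environment
conda install pandas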

about balanced_batch

Hello, thank you very much for providing tools for detecting RNA modifications. I'm doing some testing with this tool and getting some good results. However, when I trained a model on a sample with a small proportion of modifications, I found that the --balanced-batch parameter was not sampling the training set to give the same number of positive and negative samples, as I had imagined, but was instead balancing the positive and negative samples in the validation set. Reading the code confirmed my conjecture. I do not know much about machine learning; may I ask whether this design helps the training of the model?

quick question

Hi, I'm new to ONT technology analysis, and before I get into it I was wondering if Remora could be used to detect methylated bases in plants (meaning that it also detects other C contexts like CHG and CHH).
I read about DeepSignal-Plant, which was supposedly designed for that kind of context.

I want to know if it suits my type of data (plants).

Thanks

Diego
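For context, the pretrained models listed by remora model list_pretrained are CG-context only, but training data for other cytosine contexts can in principle be prepared by choosing a different --motif (for example all cytosines, which covers CpG, CHG and CHH) and training a custom model. A minimal sketch using the current POD5 + BAM data-preparation interface, with placeholder file names:

# prepare chunks at every cytosine rather than only CG sites; a canonical
# (e.g. PCR) control sample would be prepared the same way with
# --mod-base-control instead of --mod-base
remora dataset prepare plant_native_reads.pod5 plant_native_mappings.bam \
  --output-path plant_all_c_chunks \
  --motif C 0 \
  --mod-base m 5mC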

CRF models are not fully supported.

Hello,

Thanks for developing this software!

I'm trying to run Remora with Megalodon with the following command:

megalodon ecoli_ci_test_fast5 \
  --guppy-config dna_r9.4.1_450bps_fast.cfg \
  --remora-modified-bases dna_r9.4.1_e8 fast 0.0.0 5hmc_5mc CG 0 \
  --outputs basecalls mappings mod_mappings mods \
  --reference /projects/li-lab/Nanopore_compare/nf_input/reference_genome/ecoli/Ecoli_k12_mg1655.fasta \
  --devices 0 \
  --processes 20 \
  --guppy-server-path /projects/li-lab/software/ont-guppy-gpu_5.0.16/bin/guppy_basecall_server \
  --overwrite

When I run it, I get a message saying that CRF models are not fully supported, and it appears to be taking a very long time to run. Also, at the end of the guppy_log, it says:

2022-01-18 17:48:32.768178 [guppy/info] New client connected Client 1 anonymous_client_1 id: 9fcd2103-4b21-4e72-9e42-b0ae103868a9 (connection string = 'dna_r9.4.1_450bps_fast:>timeout_interval=15000>client_name=>alignment_type=auto:::').
2022-01-18 17:48:32.813812 [guppy/info] Client 1 anonymous_client_1 id: 9fcd2103-4b21-4e72-9e42-b0ae103868a9 has disconnected.

I'm using Megalodon version 2.4.1 with PyGuppy and Guppy GPU version 5.0.16. This is similar to an error mentioned in Issue #2, specifically this comment: #2 (comment). Do you know why this error is occurring?

I'm also a little bit confused, since it says on the Remora GitHub that running Remora on GPU resources is experimental with little support. However, Megalodon, which is running the Remora trained model (if I understand it correctly), requires a path to a Guppy basecall server, which greatly benefits from GPU usage. As a result, I am running on GPU resources. Could this be a source of any error?

Any help is greatly appreciated. Thank you!

About Remora and Megalodon

Hello, thank you very much for sharing this useful tool. I would like to know: what is the difference between Remora and Megalodon? Aren't they both tools to call base modifications?

Remora model train error

Hello,
I am trying to use Remora with Megalodon; however, when I run the "model train" command I get this error:

remora model train /data/Nanopore//Remora/remora_train_chunks.npz \
  --model /data/Tools/lib/taiyaki/models/ConvLSTM_w_ref.py \
  --size 96 \
  --epochs 100 \
  --early-stopping 10 \
  --scheduler StepLR \
  --lr-sched-kwargs step_size 10 int \
  --lr-sched-kwargs gamma 0.5 float \
  --output-path /data/Nanopore/Remora/Results/remora_train_results

[04:34:54] Seed selected is 599182259
[04:34:54] Loading dataset from Remora file
[04:34:54] Dataset loaded with labels: Counter({1: 130143, 0: 64236})
[04:34:54] Dataset summary:
               num chunks : 194379
       label distribution : Counter({1: 130143, 0: 64236})
                base_pred : False
                mod_bases : m
           mod_long_names : ('5mC',)
       kmer_context_bases : (4, 4)
            chunk_context : (50, 50)
                   motifs : [('CG', 0)]
 chunk_extract_base_start : False
     chunk_extract_offset : 0
          sig_map_refiner : Loaded 0-mer table with 0 central position.

[04:34:54] Loading model
Traceback (most recent call last):
  File "/home/usr/.conda/envs/remora/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/home/usr/.conda/envs/remora/lib/python3.7/site-packages/remora/main.py", line 69, in run
    cmd_func(args)
  File "/home/usr/.conda/envs/remora/lib/python3.7/site-packages/remora/parsers.py", line 604, in run_model_train
    args.balanced_batch,
  File "/home/usr/.conda/envs/remora/lib/python3.7/site-packages/remora/train_model.py", line 154, in train_model
    model = model_util._load_python_model(copy_model_path, **model_params)
  File "/home/usr/.conda/envs/remora/lib/python3.7/site-packages/remora/model_util.py", line 200, in _load_python_model
    loader.exec_module(netmodule)
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 860, in get_code
  File "<frozen importlib._bootstrap_external>", line 791, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/data/Nanopore/Remora/Results/remora_train_results/model.py", line 8
    <!DOCTYPE html>
    ^
SyntaxError: invalid syntax

I followed the installation instructions and everything went well; I was able to run the previous commands of the tutorial on GitHub.
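The <!DOCTYPE html> at the top of the copied model file suggests the model definition was saved as a rendered web page rather than as raw Python. A quick, hedged check on the path from the command above:

# if this prints HTML rather than Python, re-download the raw .py model file
# (for example by cloning the repository it came from instead of saving the
# file from a browser page)
head -n 3 /data/Tools/lib/taiyaki/models/ConvLSTM_w_ref.py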

How to interpret 'remora model list_pretrained'/model selection?

With remora v1.1.1, I get the following from remora model list_pretrained. How do I interpret this, and how do I select which model to use? For example, what's the difference between 0.0.0 and v3.5.1 when I'm using Kit 14? If I'm feeding it through megalodon, which one do I pick?

(megalodon2.5.0) billylau@suzuki:~$ remora model list_pretrained
[15:35:29] Remora pretrained modified base models:
Pore              Basecall_Model_Type    Basecall_Model_Version    Modified_Bases    Remora_Model_Type      Remora_Model_Version
----------------  ---------------------  ------------------------  ----------------  -------------------  ----------------------
dna_r9.4.1_e8     fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8     fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8     hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8     hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8     sup                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8     sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8.1   fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1   hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1   sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    fast                   0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    fast                   0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    fast                   v3.3                      5mc               CG                                        0
dna_r10.4_e8.1    fast                   v3.3                      5mc               CG                                        1
dna_r10.4_e8.1    hac                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    hac                    0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    hac                    v3.3                      5mc               CG                                        0
dna_r10.4_e8.1    hac                    v3.3                      5mc               CG                                        1
dna_r10.4_e8.1    sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1    sup                    0.0.0                     5mc               CG                                        1
dna_r10.4_e8.1    sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1    sup                    v3.4                      5mc               CG                                        0
dna_r10.4_e8.1    sup                    v3.4                      5mc               CG                                        1
dna_r10.4.1_e8.2  fast                   0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  fast                   0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  fast                   v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  fast                   v3.5.1                    5mc               CG                                        1
dna_r10.4.1_e8.2  hac                    0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  hac                    0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  hac                    v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  hac                    v3.5.1                    5mc               CG                                        1
dna_r10.4.1_e8.2  sup                    0.0.0                     5mc               CG                                        0
dna_r10.4.1_e8.2  sup                    0.0.0                     5mc               CG                                        1
dna_r10.4.1_e8.2  sup                    v3.5.1                    5mc               CG                                        0
dna_r10.4.1_e8.2  sup                    v3.5.1                    5mc               CG                                        1
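For reference, each row of the table maps onto the six values passed to Megalodon's --remora-modified-bases option (pore, basecall model type, basecall model version, modified bases, Remora model type, Remora model version), as in the command quoted in the CRF issue above. A hedged sketch using one of the Kit 14 sup rows; the guppy config name, input directory, reference, and server path below are placeholders:

megalodon fast5s/ \
  --guppy-config dna_r10.4.1_e8.2_400bps_sup.cfg \
  --remora-modified-bases dna_r10.4.1_e8.2 sup v3.5.1 5mc CG 1 \
  --outputs basecalls mods \
  --reference ref.fa \
  --guppy-server-path /path/to/guppy_basecall_server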

Which model to use for 5hmC calling

Hello,

I would like to use bonito/remora to call hydroxymethylation. While the 5mC calling seems to work fine, I get the following error when I try to run bonito basecaller [email protected] fast5/ --modified-bases 5hmC --reference reference.mmi > basecalls_with_5hmC.bam:

> loading model [email protected]
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.3) not found. Using default Remora model for dna_r9.4.1_e8_sup.
--- Logging error ---
Traceback (most recent call last):
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/remora/model_util.py", line 452, in load_model
    submodels = submodels[modified_bases]
KeyError: '5hmc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda3/envs/bonito/lib/python3.7/logging/__init__.py", line 983, in emit
    msg = self.format(record)
  File "/miniconda3/envs/bonito/lib/python3.7/logging/__init__.py", line 829, in format
    return fmt.format(record)
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/bonito/mod_util.py", line 25, in format
    self._style._fmt = self.fmt
AttributeError: 'CustomFormatter' object has no attribute 'fmt'
Call stack:
  File "/miniconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/bonito/cli/basecaller.py", line 59, in main
    args.modified_bases, args.model_directory, args.modified_base_model
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/bonito/mod_util.py", line 52, in load_mods_model
    quiet=True,
  File "/miniconda3/envs/bonito/lib/python3.7/site-packages/remora/model_util.py", line 455, in load_model
    f"Remora model for modified bases {modified_bases} not found "
Message: 'Remora model for modified bases 5hmc not found for [email protected].'
Arguments: ()

When I run remora model list_pretrained, the pore dna_r9.4.1_e8 does appear to have a model available for 5hmc:

Pore             Basecall_Model_Type    Basecall_Model_Version    Modified_Bases    Remora_Model_Type      Remora_Model_Version
---------------  ---------------------  ------------------------  ----------------  -------------------  ----------------------
dna_r9.4.1_e8    fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8    sup                    0.0.0                     5hmc_5mc          CG                                        0
dna_r9.4.1_e8.1  fast                   0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  hac                    0.0.0                     5mc               CG                                        0
dna_r9.4.1_e8.1  sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   fast                   0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   hac                    0.0.0                     5hmc_5mc          CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5mc               CG                                        0
dna_r10.4_e8.1   sup                    0.0.0                     5hmc_5mc          CG                                        0

Thanks in advance for your help.
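Since the pretrained entry for this pore is the joint 5hmc_5mc model rather than a standalone 5hmc one, a hedged workaround sketch (assuming bonito accepts multiple --modified-bases values and joins them to form the Remora model key) would be to request both modifications together:

bonito basecaller [email protected] fast5/ \
  --modified-bases 5mC 5hmC \
  --reference reference.mmi > basecalls_with_5hmC_5mC.bam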

typo in the README

There's a typo in the README:

--mod-bases m
should be "--mod-base m 5mC"

Cheers,
Jon

RNA Support

Remora does not currently support RNA. We are working to add this support back and will post updates here.

some basic question about using remora

Hi,

Thanks for developing this useful tool.
I am currently using
(1) Megalodon + Remora
(2) Bonito + Remora
for 5mC and 5hmC detection in the human genome.

May I ask some basic questions about running this pipeline:

  1. What is the difference between sssI_fast5s and pcr_fast5s in the example?
  2. I just want to confirm that I understood correctly what each command does (see the sketch at the end of this post):
    megalodon or bonito basecaller: annotate the modified bases in the fast5 files
    python taiyaki/misc/merge_mappedsignalfiles.py: map the modified base annotations to the reads
    remora dataset prepare: use the mapped annotations to prepare data for Remora model training
    remora model train: train a Remora model
    remora infer: use the trained model to infer modified bases.

Also, is the above the most recommended way for modified base detection?
I mean, does "bonito + remora" have better accuracy than using bonito alone with a pre-trained Remora model?
(Or does "megalodon + remora" have better accuracy than using megalodon alone with a pre-trained Remora model?)

  3. For the first step, which of megalodon and bonito is more recommended (i.e. higher accuracy)?

Thank you very much.
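A minimal sketch of the Remora-side steps from question 2, using placeholder file names and the current POD5 + BAM data-preparation interface (older releases, as in the pipeline described above, used taiyaki mapped-signal files instead; the exact training and inference arguments also depend on the Remora version):

# prepare chunks from a canonical (PCR) control and a modified (M.SssI) sample
remora dataset prepare pcr_reads.pod5 pcr_mappings.bam \
  --output-path pcr_chunks --motif CG 0 --mod-base-control
remora dataset prepare sssI_reads.pod5 sssI_mappings.bam \
  --output-path sssI_chunks --motif CG 0 --mod-base m 5mC

# the control and modified chunks are then combined and used to train a model;
# the flags below mirror the train command shown in an earlier issue, with
# combined_chunks standing in for the merged training data
remora model train combined_chunks \
  --size 96 \
  --output-path remora_train_results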
