nanoporetech / modkit Goto Github PK

View Code? Open in Web Editor NEW

118.0 14.0 6.0 201.45 MB

A bioinformatics tool for working with modified bases

Home Page: https://nanoporetech.com/

License: Other

Shell 0.64% Rust 99.09% CSS 0.01% Python 0.27%

nanopore modified-bases methylation

modkit's Introduction

Modkit

A bioinformatics tool for working with modified bases from Oxford Nanopore. Specifically for converting modBAM to bedMethyl files using best practices, but also manipulating modBAM files and generating summary statistics. Detailed documentation and quick-start can be found in the online documentation.

Installation

Pre-compiled binaries are provided for Linux from the release page. We recommend the use of these in most circumstances.

Building from source

The provided packages should be used where possible. We understand that some users may wish to compile the software from its source code. To build modkit from source cargo should be used.

git clone https://github.com/nanoporetech/modkit.git
cd modkit
cargo install --path .
# or
cargo install --git https://github.com/nanoporetech/modkit.git

Usage

Modkit comprises a suite of tools for manipulating modified-base data stored in BAM files. Modified base information is stored in the MM and ML tags (see section 1.7 of the SAM tags specification). These tags are produced by contemporary basecallers of data from Oxford Nanopore Technologies sequencing platforms.

Constructing bedMethyl tables

A primary use of modkit is to create summary counts of modified and unmodified bases in an extended bedMethyl format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position.

In its simplest form modkit creates a bedMethyl file using the following:

modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log

No reference sequence is required. A single file (described below) with base count summaries will be created. The final argument here specifies an optional log file output.

The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see filtering modified-base calls.

For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference:

modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta

The program also contains a range of presets which combine several options for ease of use. The traditional preset,

modkit pileup path/to/reads.bam output/path/pileup.bed \
  --ref path/to/reference.fasta \
  --preset traditional

performs three transforms:

restricts output to locations where there is a CG dinucleotide in the reference,
reports only a C and 5mC counts, using procedures to take into account counts of other forms of cytosine modification (notably 5hmC), and
aggregates data across strands. The strand field od the output will be marked as '.' indicating that the strand information has been lost.

Using this option is equivalent to running with the options:

modkit pileup --cpg --ref <reference.fasta> --ignore h --combine-strands

For more information on the individual options see the Advanced Usage help document.

Description of bedMethyl output

Below is a description of the bedMethyl columns generated by modkit pileup. A brief description of the bedMethyl specification can be found on Encode.

Definitions:

N_mod - Number of calls passing filters that were classified as a residue with a specified base modification.
N_canonical - Number of calls passing filters were classified as the canonical base rather than modified. The exact base must be inferred by the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenosine.
N_{other mod} - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with h calls, 1 with a canonical call, and 2 with m calls. In the bedMethyl row for h N_{other_mod} would be 2. In the m row N_{other_mod} would be 3.
N_{valid_cov} - the valid coverage. N_{valid_cov} = N_mod + N_{other_mod} + N_canonical, also used as the score in the bedMethyl
N_diff - Number of reads with a base other than the canonical base for this modification. For example, in a row for h the canonical base is cytosine, if there are 2 reads with C->A substitutions, N_diff will be 2.
N_delete - Number of reads with a deletion at this reference position
N_fail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
N_nocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.

bedMethyl column descriptions

column	name	description	type
1	chrom	name of reference sequence from BAM header	str
2	start position	0-based start position	int
3	end position	0-based exclusive end position	int
4	modified base code	single letter code for modified base	str
5	score	Equal to N_{valid_cov}.	int
6	strand	'+' for positive strand '-' for negative strand, '.' when strands are combined	str
7	start position	included for compatibility	int
8	end position	included for compatibility	int
9	color	included for compatibility, always 255,0,0	str
10	N_{valid_cov}	See definitions above.	int
11	fraction modified	N_mod / N_{valid_cov}	float
12	N_mod	See definitions above.	int
13	N_canonical	See definitions above.	int
14	N_{other_mod}	See definitions above.	int
15	N_delete	See definitions above.	int
16	N_fail	See definitions above.	int
17	N_diff	See definitions above.	int
18	N_nocall	See definitions above.	int

Description of columns in `modkit summary`:

Totals table

The lines of the totals table are prefixed with a # character.

row	name	description	type
1	bases	comma-separated list of canonical bases with modification calls.	str
2	total_reads_used	total number of reads from which base modification calls were extracted	int
3+	count_reads_{base}	total number of reads that contained base modifications for {base}	int
4+	filter_threshold_{base}	filter threshold used for {base}	float

Modification calls table

The modification calls table follows immediately after the totals table.

column	name	description	type
1	base	canonical base with modification call	char
2	code	base modification code, or `-` for canonical	char
3	pass_count	total number of passing (confidence >= threshold) calls for the modification in column 2	int
4	pass_frac	fraction of passing (>= threshold) calls for the modification in column 2	float
5	all_count	total number of calls for the modification code in column 2	int
6	all_frac	fraction of all calls for the modification in column 2	float

Advanced usage examples

For complete usage instructions please see the command-line help of the program or the Advanced usage help documentation. Some more commonly required examples are provided below.

To combine multiple base modification calls into one, for example to combine basecalls for both 5hmC and 5mC into a count for "all cytosine modifications" (with code C) the --combine-mods option can be used:

modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods

In standard usage the --preset traditional option can be used as outlined in the Usage section. By more directly specifying individual options we can perform something similar without loss of information for 5hmC data stored in the input file:

modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta \
    --combine-strands

To produce a bedGraph file for each modification in the BAM file the --bedgraph option can be given. Counts for the positive and negative strands will be put in separate files.

modkit pileup path/to/reads.bam output/directory/path --bedgraph <--prefix string>

The option --prefix [str] parameter allows specification of a prefix to the output file names.

Licence and Copyright

Modkit is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0. If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com

modkit's People

Contributors

Stargazers

Watchers

Forkers

matthiaszepper e-sollier pastvir wdecoster rlowe-altoslabs abhi12122000

modkit's Issues

General range of --edge-filter for kb level sequencing

Hi modkit maker,
For the modkit summary module, do you recommend a rough number for the --edge-filter option when processing common genomic sequencing reads? Is there any literature discussing why or when these edge-trimming is needed?
Thank you,

offset in motif-bed

Hej!
I am using the version 0.1.7 and in motif-bed now you need to specify an offset along with the motif you what to have. I tried to search in the advanced documentation and in the modkit motif-bed --help but I don't find anything about this offset. what it is? There is only one example with a 0, but I don't understand why is a 0.

Thank you so much,
Violeta de Anca

Any chance you could compile for earlier glibc?

Hi I get:

modkit-0.1.2/modkit: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by modkit-0.1.2/modkit)
modkit-0.1.2/modkit: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by modkit-0.1.2/modkit)

On Ubuntu 18.04. Alas it will likely be a few months before we can rid ourselves of this version of Ubuntu on our compute cluster. Any chance you could compile your binaries targeting an earlier glibc?

Possible option to constrain `summary` only to the aligned bases?

We have some cfDNA sample sequenced, and we would like to also check methylation information. However, we noticed a big difference between statistics generated by counting from bed file, and calculated by summary. See below.

Note that in this sample, we used 5mC only model to do base calling.

First, I run pileup with default setting, then I calculate statistic using awk. I also show the log file of pileup run. There is no error or issues showing.

> modkit pileup PAK67493_barcode19_5mC_sorted.bam puc19.pileup.bed --ref neb_puc19.fa --log-filepath puc19.pileup.log 
> awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10; del+=$15; filter+=$16; diff+=$17; nocall+=$18} END{print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " total valid \n" (filter) " filtered, " (diff) " diff nt, " (del) " del, " (nocall) " nocall."}' puc19.pileup.bed
163204 (0.139098) CpG canonical
1010099 (0.860902) 5mCpG modified
0 (0) 5hmCpG modified
1173303 total valid
101811 filtered, 4990799 diff nt, 52618 del, 3183387 nocall.
> cat puc19.pileup.log
[src/logging.rs::54][2023-05-04 12:04:26][DEBUG] command line: modkit pileup PAK67493_barcode19_5mC_sorted.bam puc19.pileup.bed --ref neb_puc19.fa --log-filepath puc19.pileup.log
[src/commands.rs::122][2023-05-04 12:04:26][INFO] sampling 10042 reads from BAM
[src/reads_sampler.rs::190][2023-05-04 12:04:26][DEBUG] sampled 10042 records
[src/commands.rs::700][2023-05-04 12:04:26][WARN] Threshold of 0.6074219 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.
[src/commands.rs::839][2023-05-04 12:04:42][INFO] Done, processed 2452 rows.

Then I run summary, I show output of summary, and the log file. I used --sampling-frac 1.0 option to summarize from all reads.

>modkit summary --log-filepath puc19.summary.log --sampling-frac 1.0 PAK67493_barcode19_5mC_sorted.bam > puc19.summary
> cat puc19.summary
mod_bases       C
count_reads_C   139229
C_calls_unmodified      1114267
C_frac_unmodified       0.39316085211485025
C_filtered_unmodified   172961
C_calls_modified_m      1719858
C_frac_modified_m       0.6068391478851497
C_filtered_modified_m   130193
C_total_mod_calls       2834125
C_total_filtered_mod_calls      303154
total_reads_used        139229
> cat  puc19.summary.log
[src/logging.rs::54][2023-05-04 12:11:40][DEBUG] command line: modkit summary --log-filepath puc19.summary.log --sampling-frac 1.0 PAK67493_barcode19_5mC_sorted.bam
[src/commands.rs::90][2023-05-04 12:11:40][INFO] sampling 100% of reads
[src/reads_sampler.rs::190][2023-05-04 12:11:43][DEBUG] sampled 139229 records
[src/summarize.rs::88][2023-05-04 12:11:43][INFO] calculating threshold at 10% percentile
[src/thresholds.rs::98][2023-05-04 12:11:43][DEBUG] calculating per base thresholds
[src/thresholds.rs::101][2023-05-04 12:11:44][DEBUG] probs per base took 1s
[src/thresholds.rs::111][2023-05-04 12:11:44][DEBUG] filter thresholds took 0s
[src/thresholds.rs::116][2023-05-04 12:11:44][INFO] calculated thresholds: C: 0.6074219
[src/summarize.rs::172][2023-05-04 12:11:45][DEBUG] computing summary took 0s
[src/writers.rs::268][2023-05-04 12:11:45][WARN] this output format will not be default in the next version, the table output (set with --table) will become default and this format will require the --tsv option

However, the calculated statistics are quite different.

Then I recalled that when doing alignment, only half of the bases are aligned in our sample, as we run Short Fragment Mode of nanopore as our sample are cfDNA of short length. Without trimming of adapter (due to the issue associated with MM/ML see #11), almost half of the bases are not mapped (softclip). I suppose during pileup, the methylation calls on those bases do not contributing to Nmod, Ncanonical, and Nothermod (hence counted as valid), rather assigned to Nnocall (?). As the fraction are high in our case, the statistics summary calculated are then very different from what is summerized from BED file.

I wonder whether it is possible to provide an option to restrict summary calculation only on mapped region of the read sequence. However, that might be against the whole purpose of summary as a quick alternative to generate statistics before doing full pileup calculation.

But at least, it is then worth mentioning in the document that the base mapping rate will strongly influence the summary outcome when compared with the statistics calculated from BED file.

Thanks.

How does modkit handle secondary/supplementary alignments?

I'm feeding in MinKNOW/guppy's output (merged with samtools) straight into modkit v0.1.6. How does modkit handle the secondary/supplementary alignments? Here are some basic stats for one file, which indicates this may be a significant number:

$ /mnt/ix1/Resources/tools/samtools/v1.15.1/samtools-1.15.1/samtools flagstat -@ 40 ./cases/P10227_30544.D01.merged.sorted.bam
69326287 + 0 in total (QC-passed reads + QC-failed reads)
53240227 + 0 primary
10567988 + 0 secondary
5518072 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
64771719 + 0 mapped (93.43% : N/A)
48685659 + 0 primary mapped (91.45% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

For megalodon, does it only take in the primary alignments when looking at mod_mappings.bam? Trying to figure out whether this needs to be taken into account or not.

Edit: Seems like mod_mappings.bam from megalodon is looking only at primary alignments:

$ /mnt/ix1/Resources/tools/samtools/v1.15.1/samtools-1.15.1/samtools flagstat -@ 40 mod_mappings.bam 
941703 + 0 in total (QC-passed reads + QC-failed reads)
941703 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
941703 + 0 mapped (100.00% : N/A)
941703 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Workflow suggestions for follow-up metagenome analysis?

Hello modkit team,

Using your tool, I have been able to create bedMethyl files for my reads mapped onto a metagenomic assembly, and get high level methylation information using modkit summary.

As I am new to epigenomic analysis I wanted to solicit your advice regarding next steps.

I would like to accomplish the following:

Retrieve the methylated version of my MAGs
Compute possible methylated variants of each MAG
Ideally, visualize these results

Do you have any recommendations for existing tools or workflows that would allow me to accomplish this? I have MAGs already assembled. Thank you.

`error: unexpected argument '--collapse' found` for modkit 0.1.8

Hi,
I tried to run modkit 0.1.8with the options:
modkit pileup --cpg --ref <reference.fasta> --collapse h --combine-strands

But I got the error:

error: unexpected argument '--collapse' found

  tip: to pass '--collapse' as a value, use '-- --collapse'

I assume that the online document is out of date due to version upgrades?

modkit: command not found

Hi,

I downloaded the modkit repo from the given instructions and added it to the PATH.
Trying to run modkit pileup but throws an error saying command not found.

Inconsistence between per-site result and per-read result

I have different results for per-site results using 2 different methods:
a) Use the modkit pileup to extract per-site results directly
b) Use the modkit extract to extract per-read result -> Acculate the per-read to per-site by reference genome location
Ideally, they should give the same results as the logic is the same.

So I have the following questions:

What is the logic/relationship between your design for modkit pileup and modkit extract? Have you ever compared the results with the per-read and per-site?
Does base calling quality matter for the methylation calling? Do you apply any filter on the base_qual for the modkit pileup per-site result? I know mod_qual is considered to calculate the base modification threshold. Not sure if a similar process is done for the base calling threshold.

Thank you so much for your help!

Output bed for every CpG position

Can modkit (or modbam2bed) output a result at every single CpG position despite having zero coverage? As far as I know modbam2bed can't do it. Like have an NA or . at the lines with missing values?

In theory you could create such a file using a left join with the motif bed, but that's an extra step that is quite annoying...

Log skipped and discarded reads, such as HardClip reads from minimap2

I have a question about how modkit handling HardClip reads in minimap2. The issue is when hardclip is happening, part of the read sequences are removed. However, when I check MM and ML tag, it does not seems to be updated by minimap2 accordingly. Then there is a mismatch between what specified in MM/ML and what is in the remained read seqeuences.

How does modkit handling this?

I briefly checked my bam file. It seems all hardclip reads are of supplementary alignment. I am not sure how general this is the case. Dose minimap mark hardcipped reads as supplementary?

Many thanks for your help.

panic: index out of bounds

Hi,

Great to see Rust tools at ONT :)
I just downloaded the latest binary and ran the command below:

modkit pileup --partition-tag HP --combine-strands --cpg --ref ~/GRCh38.fa alignment.cram calls.bed --region chr10:13107989-13108319

First, I noted that --cpg and --ref were required, and also the name of an output file while I want to partition per haplotype. That seems weird, and I don't know what would have happened to my output if it didn't end up panicking.

modkit outputted
> parsing region chr10:13107989-13108319
and then took a suspiciously long time (2min45s) before running into:

thread 'main' panicked at 'index out of bounds: the len is 196 but the index is 18446744073709551615', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-htslib-0.43.1/src/bam/mod.rs:847:41
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The index position looks like an underflow to me. The same error happened on another comparable interval. Please let me know what I can do to help debug this.

Wouter

ImplicitProbModified

Hi,

I am having similar issues with an empty bedfile, and have explored the closed issues.

I am working on Adaptive Sampling from GridIon and used guppy for basecalling and alignment.

guppy_basecaller \ -i ${input_fast5s} -s ${output_folder} \ -c res_dna_r941_min_modbases_5mC_5hmC_v001.cfg \ --align_ref ${reference_fasta} \ --compress_fastq --bam_out --recursive

and have checked that they are mapped from bamfiles and samtools idxstats. However, the bed files were empty, and I received an error

[src/pileup/subcommand.rs::632][2023-07-12 22:30:19][INFO] Done, processed 0 rows. Processed ~0 reads and skipped ~3901 reads.
[src/read_cache.rs::372][2023-07-12 22:30:19][DEBUG] read bbe24401-4164-4742-a27a-bb2e121ca5a9, Skipped: record bbe24401-4164-4742-a27a-bb2e121ca5a9 has un-allowed mode (ImplicitProbModified), use '--force-allow-implicit' or 'modkit update-tags --mode ambiguous'

bam eg:

3e2975b6-36df-4d6e-95fa-679aa0ea5111 16 chr1 2371141 60 86M9I29M1D54M2D42M1D13M3I98M1I25M2D22M3D18M1D10M1I30M8D17M25S * 0 483 GGAGCTTGCAGTGAGCCAAGATCGTGCCACCGCACTCCAGCCTGGGTGACACAGCGAGACTCTGTCTCCAAACAACAACAACAACCACAACCACAACAACAACAACAACACCATGAAGAAAGATTCCTGGGTTTCAGTGATCCTCGCACCTCAGCCTCCCGAAGTGCTGGGATTTTAGGTGAGCCACTGTGCCTGACTGATAGTCTAATTTTAAAAAAATAAAACAATATTTGGGTAATAGACATTTCTCCAAAGAAGATAAACAAACAGCCAATCAACACAGCGAGAGCCACTCATCGTCATCACTCGTTAGGGAAATGCGAATCAAAATCACTGAGGAACTATCACTTCACACCCACAGCAGCTATTACAAAAAGACACACCAAGTGTTGGGAAGACGAGAGAAATCTGGAACCTTCATGCATTACTGGTGGGACTGCACAGCAATTTTGGAAAAAGCAATAATGTAACTGAAACGAAGTG &,7,'(/..),),/$.)5&(%%'))&,(/+(%2,+/-$,;),-5/4+(,7=446891.962-9>;'01&&**/4(59.9(25808(,0,+3%,'1)4')6'&.)0).-;<,<66'313<6):4&-&$((.,3+5%(()&7:-=1(475:/5>;=:7++6B:'-71*%%&(,6:+<)06,)+?5.+5+<96-<-,21;3,)9;;:0&.11)#.&)/0.)%$##1&$)(6178;0-<+/;61,=,40-84)==0-91(2.+.+18(5+:.65,,2=+'))%'5;/-),'++%&%%)42)>:96/4'8-2:>.:51&%$&/(1%:99/6'%'-2+.(++0'8,:08=.>&,1=@?>6''6)0%.'(1):/2;7'7)1+&,52>@.<=>)0)0792=;';)17>,>7:),&*.%%%%#%$&"%*7':7)+-)9<;62'((''%%'%1$28**8%%#&$##$ AS:i:708 MD:Z:17G24C61A10^C7C1C0T43^GT34T7^G0G0G0T113G19^GG22^ATT18^C38T1^AAATGCTG17 NM:i:44 cm:i:32 de:f:0.0505 ms:i:735 nn:i:0 s1:i:228 s2:i:0 tp:A:P MM:Z:C+h,2,17,12,4,0,2,22,1,10,6,0,0,7;C+m,2,17,12,4,0,2,22,1,10,6,0,0,7; ML:B:C,0,0,0,0,0,0,0,0,0,0,0,0,0,254,254,255,254,255,255,255,255,255,255,110,254,248 qs:i:9 mx:i:3 ch:i:49 rn:i:118 st:Z:2023-07-06T12:28:29Z f5:Z:FAV71569_pass_1c3972e3_f5114067_0.fast5 ns:i:5661 ts:i:45 sm:f:78.563 sd:f:10.0631 sv:Z:med_mad RG:Z:f511406703c9d9eafbc59f28229b17deccce7224_unknown

I have then ran the pile up with --force-allow-implicit and got the output. could you explain the cause for the ImplicitProbModified error?

do you also suggest setting --sampling-frac 1.0 for adaptive sampling, and/or --filter-threshold?

fatal runtime error: stack overflow

Hello!

I am using modkit extract on my modified bam files that were basecalled with guppy v6.5.7. I have two samples, one job completed (I will call it bam1) successfully and for the other sample (I will call bam2), I got a fatal runtime error: stack overflow error.

This is the command I ran: modkit extract bl6_10F_03_methyl.bam bl6_10F_03_methyl_extract.tsv --reference ~/references/mm39.fa --include-bed ~/references/mm39_cg_motifs.bed --threads 8

And it gave me the following output and error:

> specifying include-only BED outputs only mapped sites
> parsing BED at /home/jsakr/references/mm39_cg_motifs.bed
> processed 43845398 BED lines
▹▹▹▹▹ [00:41:08] 0 ~records failed
▹▹▹▹▹ [00:41:08] 116138 ~records skipped
▸▹▹▹▹ [00:41:08] 751607 ~records used
▹▹▹▹▹ [00:41:08] 103214646 rows written
[00:00:00] ----------------------------------------       0/61      contigs
[00:01:45] ##############--------------------------     596/1818    processing chr2 
thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

I tried to run this same command for bam2 on different machines and got the same error. I also tried increase the number of threads to 16. I added --log-filepath and this is the last few lines of the log file:

[src/mod_bam.rs::63][2023-08-11 17:44:41][DEBUG] record b66026b6-18a9-44f1-b72f-f4fbee839b67 has no base modification information, skipping
[src/mod_bam.rs::63][2023-08-11 17:44:42][DEBUG] record ee1287f1-ac56-49a2-afef-5e991fc16723 has no base modification information, skipping
[src/mod_bam.rs::63][2023-08-11 17:44:46][DEBUG] record 616c20a1-4c76-4408-a3d5-71bd8f028a36 has no base modification information, skipping
[src/mod_bam.rs::63][2023-08-11 17:45:20][DEBUG] record 5c46ab57-0bd3-427b-af46-1aa90fc8e6c6 has no base modification information, skipping
[src/mod_bam.rs::63][2023-08-11 17:45:21][DEBUG] record 5c46ab57-0bd3-427b-af46-1aa90fc8e6c6 has no base modification information, skipping

A difference between the two files is that bam1 is much smaller than bam2 (4.2 Gbases vs 14.9 Gbases of data). I am not sure how significant this is. Any suggestions?

Thank you!

Jaz

Install issue: libz-sys fail

Hi,

I've attempted to install modkit on our HPC cluster as per process described, however I'm running into an error.

See below:

error: failed to run custom build command for `libz-sys v1.1.12`

Caused by:
  process didn't exit successfully: `/scratch/prj/ppn_microglia_mod/directrna/scripts/modkit/target/release/build/libz-sys-485e2e67260b5c1e/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-env-changed=LIBZ_SYS_STATIC
  cargo:rerun-if-changed=build.rs
  CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
  CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_TOOLCHAIN_FILE = None
  CMAKE_TOOLCHAIN_FILE = None
  CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
  CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_GENERATOR = None
  CMAKE_GENERATOR = None
  CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
  CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_PREFIX_PATH = None
  CMAKE_PREFIX_PATH = None
  CMAKE_x86_64-unknown-linux-gnu = None
  CMAKE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE = None
  CMAKE = None
  running: cd "/scratch/prj/ppn_microglia_mod/directrna/scripts/modkit/target/release/build/libz-sys-c59a3584a7360b8f/out/build" && CMAKE_PREFIX_PATH="" "cmake" "/users/k19022845/.cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.12/src/zlib-ng" "-DBUILD_SHARED_LIBS=OFF" "-DZLIB_COMPAT=ON" "-DZLIB_ENABLE_TESTS=OFF" "-DWITH_GZFILEOP=ON" "-DCMAKE_INSTALL_PREFIX=/scratch/prj/ppn_microglia_mod/directrna/scripts/modkit/target/release/build/libz-sys-c59a3584a7360b8f/out" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"

  --- stderr
  thread 'main' panicked at '
  failed to execute command: No such file or directory (os error 2)
  is `cmake` not installed?

  build script failed, must exit now', /users/k19022845/.cargo/registry/src/github.com-1ecc6299db9ec823/cmake-0.1.50/src/lib.rs:1098:5
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: failed to compile `mod_kit v0.1.12 (/scratch/prj/ppn_microglia_mod/directrna/scripts/modkit)`, intermediate artifacts can be found at `/scratch/prj/ppn_microglia_mod/directrna/scripts/modkit/target`

Any assistance with this would be greatly appreciated!

modkit pileup bed output file has space delimited columns 10-18 and tab delimited 1-9

Seems there is a bug in the bed output from modkit pileup:

$ modkit --version
mod_kit 0.1.12

$ modkit pileup -n 4 --preset traditional  -r ref.fasta methylated.ref.sorted.bam test.bed

$ head test.bed | awk -F'\t' '{ print $10; }'
38 0.00 0 38 0 0 1 2 2
36 0.00 0 36 0 6 1 0 4
37 0.00 0 37 0 0 4 4 2
29 3.45 1 28 0 0 12 3 3
39 2.56 1 38 0 0 1 4 3
40 0.00 0 40 0 1 2 0 4
41 0.00 0 41 0 1 1 0 4
40 0.00 0 40 0 0 3 0 4
43 0.00 0 43 0 0 4 0 1
44 0.00 0 44 0 0 1 0 3

All columns after 10 are space separated and should be tab delimited?

Per-read support?

Hi,
Thanks for the new tool! I am assuming that the bedMethyl output is the "aggregated" site result from the read level currently. So any chance that you will add the per-read level support? It might be interesting to explore the read heterogeneity at the single-base, single-read level.

feat: add `ref_mod_strand` to `extract` output

As discussed in #33 , add a column to the modkit extract output designating which strand of the reference the base modification is on. This will be . when the input modBAM is unaligned.

Setting dual thresholds with modkit

Continuing from epi2me-labs/modbam2bed#57

We traditionally set thresholds for both unmethylated (<20%) and methylated (>80%). This throws out ambiguous calls (20-80%). This approach is supported for methylation called by megalodon in a recent publication (https://www.nature.com/articles/s41467-021-23778-6).

For modifications like 5mC, there usually are not a lot of ambiguous calls, but calling everything <80% probability as "unmethylated" does not seem prudent.

Previously, to perform this dual thresholding, I would use modbam2bed with the following command:

 modbam2bed -t {threads} \
		-e \
		-m 5mC \
		--cpg \
		-a 0.20 -b 0.80 \
		{input.genome} {input.modbam} > {output}

Since modbam2bed is being deprecated, how would I perform this same type of command with modkit?

How to use the mod_qual in the modkit extract output?

Dear Developers,

Your tool has been very useful. Thanks for making it widely available to the scientific research community.
Context: I'm doing analysis at read level. For modkit extract done at CpG context, 2 consecutive rows gives information about 5mc & 5hmc at the column 11, mod_qual.

Que: If h=0.005859375 & m=0.001953125 at position chr1:146581135 does that automatically make Canonical prob (C) = 0.9921875 ? Which implies that site (chr1:146581135) is most likely is C.

do i have to do extra filtering to reveal the real identity of the genomic site (chr1:146581135)?

Thanks in advance
-S

> dt
                                read_id forward_read_position ref_position
1: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 19290    146581135
2: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 19290    146581135
3: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18593    146581827
4: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18593    146581827
5: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18481    146581939
6: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18481    146581939
7: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18215    146582203
8: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 18215    146582203
9: 1f6c2b71-bb44-4f6f-a164-3ada21d8c7b2                 17843    146582575
   chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start
1:  chr1          +          -              -                    32
2:  chr1          +          -              -                    32
3:  chr1          +          -              -                    32
4:  chr1          +          -              -                    32
5:  chr1          +          -              -                    32
6:  chr1          +          -              -                    32
7:  chr1          +          -              -                    32
8:  chr1          +          -              -                    32
9:  chr1          +          -              -                    32
   fw_soft_clipped_end read_length    mod_qual mod_code base_qual ref_kmer
1:                  11       19415 0.005859375        h        19    ACGTG
2:                  11       19415 0.001953125        m        19    ACGTG
3:                  11       19415 0.017578125        h        16    TCGGA
4:                  11       19415 0.005859375        m        16    TCGGA
5:                  11       19415 0.150390630        h        31    ACgtt
6:                  11       19415 0.009765625        m        31    ACgtt
7:                  11       19415 0.005859375        h        50    TCGGT
8:                  11       19415 0.001953125        m        50    TCGGT
9:                  11       19415 0.017578125        h        45    ACGtt
   query_kmer canonical_base modified_primary_base
1:      CACGT              C                     C
2:      CACGT              C                     C
3:      TCCGA              C                     C
4:      TCCGA              C                     C
5:      AACGT              C                     C
6:      AACGT              C                     C
7:      ACCGA              C                     C
8:      ACCGA              C                     C
9:      AACGT              C                     C
>

Probability threshold too aggressive

Me again,

Today I'm calculating global methylation on a 30X coverage primate genome blood sample, using modkit defaults.
~/Desktop/modkit0.1.3/modkit pileup -t 32 --cpg --ref panTro6.fa --combine-strands total-chimp.modmapped.bam total-chimp.modmapped.bam.modkit.bed. The automatically determined threshold seem to be too aggressive about filtering. On this data set, modkit default chooses threshold 0.9902344.

When I use defaults and summarize the 5mc with awk, I get 88.9% global 5mC, which is rather high. The same calculation from modbam2bed using defaults comes in at 77.8%. I re-ran modkit with modkit --filter-threshold 0.66, I get 79.7%. That's more reasonable. Primate blood methylation really shouldn't be higher than 80%.

Taking a look at the total counts of modified CpG calls I see this:
264,978,433 5mc calls modkit default
28,821,284 lines

276,737,102 5mc calls modkit --filter-threshold 0.66
29,010,994 lines

276,147,783 5mc calls modbam2bed default
29,395,297 lines

So, the total number of calls missing is just a small fraction ~12million calls missing (or just 1% of them) at the 99% threshold level. That seems like it shouldn't be a problem. So how can my methylation be jumping from 79.7% to 88.9%??

I tried increasing the sampling fraction and that didn't help. The number of lines is nearly the same so modkit is calling all the reference CpGs. I believe the threshold sampling algorithm is too strict in kicking out modified bases. Of course I could just manually set my own threshold but that defeats the purpose of using modkit in the first place.

Modkit sample-probs on the bam file gives me this output which I don't know how to interpret.
~/Desktop/modkit0.1.3/modkit sample-probs -t 32 total-chimp.modmapped.bam> sampling 10042 reads from BAM to estimate probability distribution
▹▸▹▹▹ [00:02:05] 1322327 probabilities sampled > sampled 1322327 probabilities
q p
0.10 0.99
0.50 1.00
0.90 1.00

Failed modification calls

Hello,

I had a question about exactly what a "failed" modification call means.

I am running the following modkit command on a modbam that includes both 5mC and 5hmC calls.

~/software/modkit_0.1.5/modkit pileup --log-filepath mod_calls/mod_forebrain.log \
-t24 \
--filter-threshold 0.8 \
--cpg \
--ref ~/references/mm39/mm39.fa \
--only-tabs \
alignments/mod.bam mod_calls/mods.bed

I've noticed that the number of "failed" modification calls (column 16 of pileup output) seems to be affected by changes in the thresholding.

I have tested thresholds 0.8 and 0.1 just to see what happens. It appears that if I change the threshold from 0.8 to 0.1, calls that were formerly counted as "failed" are now counted as either 5mC or 5hmC.

Is this because at the 0.8 threshold, both probabilities for the modifications and the canonical base did not exceeded the threshold (>0.8)? So, for example, the base call would be called "failed" if the probabilities of canonical, 5hmC, and 5mC were something like 0.1, 0.6, 0.3? If neither modification meets the threshold, could it or should it be called canonical? Do the probabilities of the 3 states equal 1?

That brings me to a slightly broader question about how everything is being called. Let's say I am using the 0.8 threshold, do these scenarios below seem correct? Probabilities are listed as canonical, 5hmC, 5mC:

0.9, 0.05, 0.05 - called canonical
0.05, 0.9, 0.05 - called 5hmC
0.05, 0.05, 0.9 - called 5mC
0.20, 0.4, 0.4 - called as failed

Thanks for your help!

modkit summary results validation.

I am comparing modkit summary against the modkit pileup bed file to confirm whether the percentages match up and it appears something is off. I called a fast5 sequenced with mouse brain DNA using dorado using a 5mc_5hmc model so I expect ample quantities of both modifications. I used modkit pileup with the --cpg flag and receive a bedmethyl with the format described in the modkit documentation.

Modkit summary reveals the following:
./dist/modkit summary calls.modmapped.bam
mod_bases C
count_reads_C 4081393
C_calls_modified_m 22817251
C_frac_modified_m 0.4827792093865955
C_filtered_modified_m 0
C_calls_unmodified 17031731
C_frac_unmodified 0.36036618200260717
C_filtered_unmodified 0
C_calls_modified_h 7413308
C_frac_modified_h 0.1568546086107973
C_filtered_modified_h 0
C_total_mod_calls 47262290
C_total_filtered_mod_calls 0
total_reads_used 4081393

However, my when I run awk against the resulting .bed file I see the following:
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed
0.442508 CpG canonical
0.386512 5mCpG modified
0.17098 5hmCpG modified

I don't expect the percentages to match up perfectly because of differences in the mapping in the .bam file and in the resulting .bed file of course, but they should be close. The 5hmC percentage is probably correct, but the canonical C and modified C appear to be switched. Do I have that right? According to the Readme.md, column 13 is canonical, and column 12 is modified (in this case, column 14 "other" is 5hmC count). Is it possible that 'modkit summary' is parsing columns 12 and 13 as swapped?

Error in --include-bed: "Improperly formatted BED line"

Hi,

I'm attempting modkit pileup --include-bed.

$ modkit --version
mod_kit 0.1.12

$ modkit pileup input.bam output.bed --ref hg38.fa
This worked.

But

$ modkit pileup input.bam output.bed --ref hg38.fa --include-bed target.bed
This did not work.

> improperly formatted BED line chrX    55613399        55636192
> improperly formatted BED line chrX    71085850        71152454
> improperly formatted BED line chrX    71229623        71265146
> improperly formatted BED line chrX    71523103        71585892
> improperly formatted BED line chrX    72291637        72317187
> improperly formatted BED line chrX    77494879        77796233
> improperly formatted BED line chrX    101339446       101400796
> improperly formatted BED line chrX    129971106       130068071
> improperly formatted BED line chrX    148490616       149010663
> improperly formatted BED line chrX    154418632       154446516
> processed 0 BED lines
> attempting to sample 10042 reads
> Error! zero reads found in bam index

What is proper format of BED line?
Thanks.

modkit and unmapped reads in bam file from guppy

Hello!

I am running a modified base calling using Guppy without alignment.

I have a BAM file from Guppy
bam_runid_ca07d29b288c9f8881387334d8fc7d195fb11339_0_0.bam.gz
It is gzipped just to be able to add it here. I use it just as a .bam file.

I sort and index it using samtools and then run modkit pileup as follows (including output from modkit)

$ samtools sort -M -o barcode1.sorted.bam -O bam bam_runid_ca07d29b288c9f8881387334d8fc7d195fb11339_0_0.bam
$samtools index barcode1.sorted.bam
$modkit pileup --include-unmapped --no-filtering barcode1.sorted.bam barcode1.bed
> not performing filtering
> Done, processed 0 rows. Processed ~0 reads and skipped zero reads.

At the same time, if I run modkit summary on the original file (not indexed), I get the following

$ modkit summary bam_runid_ca07d29b288c9f8881387334d8fc7d195fb11339_0_0.bam 
> sampling 10042 reads from BAM
> calculating threshold at 10% percentile
> calculated thresholds: C: 0.75390625
# bases             C 
# total_reads_used  3894 
# count_reads_C     3894 
# pass_threshold_C  0.75390625 
 base  code  pass_count  pass_frac    all_count  all_frac 
 C     h     461         0.016623994  1051       0.0341123 
 C     -     26492       0.9553208    28241      0.916618 
 C     m     778         0.028055245  1518       0.049269717

The same output is obtained if I run modkit summary on barcode1.sorted.bam (for which the index exists).

I tried to convert bam to sam and inspect it in a text editor and every line contains MM and ML records.

I accept that barcode1.sorted.bam is empty. I suspect it is because samtools index it by reference positions, which does not exist in the original bam file (it contains only unmapped reads).

Also, please note that it is just simple tutorial data, and the modifications are probably not real. The point is to make the workflow work at this stage.

I use modkit version 0.1.9.

What am I doing wrong here? Or is it a bug/feature in modkit?

Thank you very much in advance.

Q: compared to modbam2bed?

Just curious .. which one is the tool of choice when converting modified BAM file to bedMethyl format?
In our current workflows I use modbam2bed.
modkit seems to perform similar tasks ..

What are the main differences?

Question about cytosine count discrepancy in modified basecalling process

Hello,

I wanted to bring up a question regarding the cytosine count discrepancy that I encountered during the basecalling and modified basecalling process. I would appreciate any insights or clarification.

Here's a summary of the steps I followed:

I used the following command to perform basecalling:
dorado basecaller [email protected] --modified-bases-models [email protected]_5mCG@v0 pod5/ > calls.bam
Next, I aligned the resulting BAM file using the dorado aligner with the following command:
dorado aligner genome.fa calls.bam -t 16 > calls_aligned.bam
I sorted the BAM file with samtools
Finally, I ran the modkit pileup command to process the aligned BAM file and generate a bed file:
modkit pileup --filter-threshold 0.0 calls_aligned_sorted.bam dorado_methyl_no_filt.bed

Now, here's the issue I encountered: The genome of the organism I sequenced contains a total of 19,505,736 cytosines (forward and reverse strands), out of which I expected 5,059,256 to be in a CG context (not necessarily methylated). However, the resulting bed file contains 11,365,547 cytosines.

If we basecall first and then check if bases in CG context are modified, wouldn't we then expect a count closer to 19 million cytosines found, regardless of them being methylated or not?

I suspect that I might be misunderstanding the basecalling or modified basecalling process, and I would greatly appreciate any insights or explanations.

[ERROR] double added base mod calls for base C and read X,potentially a logic error!

I am using modkit v0.1.6 on some bam files generated using the Promethion 22.12.5 release (Guppy v6.4.6), with modified hac basecalling.

I am getting the above error "double added base mod calls for base C and read 1abba18a-5c31-52ca-8b95-0230713aa7c3,potentially a logic error!" on some reads when running modkit as follows:

/modkit pileup --preset traditional --ref /mnt/ix1/Resources/GenomeRef/Homo_sapiens/Ensembl/GRCh38_no_alt/Sequence/WholeGenomeFasta/hs38_naa.fna -t 20 --log-filepath P10227_30544.D01.merged.modkit.log P10227_30544.D01.merged.sorted.bam P10227_30544.D01.merged.modkit.bed

I'm just using merged bams by using samtools merge on all the bam files created by guppy during the run. Any ideas why this is happening?

binaries are not up-to-date

I'm trying to install modkit v0.1.11 from the pre-compiled binaries. I realized from the CHANGELOG.md and CARGO.toml files that it was indeed still v0.1.10. Can you please update the binaries to v0.1.11?

Unable to install on Ubuntu 20 (Illegal instruction in VM)

When attempting to instal modkit on Ubuntu 20 either from the tar or from source we are getting the following error:

Installing mod_kit v0.1.8 (/home/modkit)
Updating crates.io index
Illegal instruction (core dumped)

We tried going back and reinstalling possible prerequisites that could be causing the issue, but to no avail. It is working properly using the exact same commands for us on MacOS.

Thanks for your help,
Jack

Call-mods Error: thread 'main' panicked at 'called `Option::unwrap()` on a `None` value'

Hi,

I’ve been trying to run the call-mods module on a .bam file to ‘filter out’ reads containing lower methylation probabilities prior to visualization yet keep encountering the following error when running this command :
modkit call-mods 221110_ys18_Me_Naked_R10_all_6mAcalled.sorted.bam filt.bam --mod-threshold A:0.8 --mod-threshold a:0.9

thread 'main' panicked at 'called Option::unwrap() on a None value', /home/boegersome/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-htslib-0.40.2/src/bam/header.rs:95:27

I’m admittedly very new to bioinformatics but it's unclear what the error could be from this message. I’ve also noticed that the ‘call-mods’ command works fine when running on a .bam file with little to no methylation signal, but always errors when running on a sufficiently modified sample.

pileup warning "double added base mod calls for base C..."

Hi all. Thank you for developing modkit. I am having the same warning as reported in #18 . I’m running modkit pileup (command at the bottom) using modkit v0.1.9 and I’m getting the next warnings:

[src/read_ids_to_base_mod_probs.rs::56][2023-06-16 11:30:54][DEBUG] double added base mod calls for base C and read e910bc1f-271e-42d1-a18b-b2ef58c48e1f;e5c15025-43f1-4093-b276-859c90e98349,potentially a logic error, please submit an issue.
[src/read_ids_to_base_mod_probs.rs::56][2023-06-16 11:30:54][DEBUG] double added base mod calls for base C and read bb36d4b8-1e66-49dd-a9da-f3b51959ccf5;fa4c1b4d-7e65-42ab-9422-87a077ca76e7,potentially a logic error, please submit an issue.

My bam file has only duplex reads with modifications in both strands as produced by remora duplex_from_pod5_and_bam. Interesting, all the read ids are listed in log file with the aforementioned warning. I also tried with a bam file that is made of simplex and duplex reads and only the ids of duplexes are logged in the file, also with the same warning. In all cases I have previously discarded there secondary, supplementary alignments, etc.

Are duplex reads with modifications supported?

Thanks

modkit pileup --cpg --threads 5 --seed 1234 --ref tmp.fasta --log-filepath tmp.log dx.bam dx.bedmethyl

differnt CpG sites when using different threads parameter

Hi, thanks for this nice tool.
I got totally different number of CpG sites for the same BAM file when using modkit pileup with --preset traditional and default threads and when using the same modkit pileup (--preset traditional) but with thread 50 (-t 50).
The difference is about 1 million sites!
Is there any explanation?
Thanks

Segfault on the cliveome

Whilst running on the cliveome aligned with Dorado, mapped with minimap2 and sorted with samtools I have run into the following segfault. Any chance you can put out a version of your binary compiled with full debug symbols or should I compile from scratch and try and reproduce?

#0  __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:222
#1  0x00005555555dd498 in cram_to_bam.isra ()
#2  0x00005555555e97da in cram_get_bam_seq ()
#3  0x00005555555d4436 in sam_read1 ()
#4  0x00005555555d5548 in sam_readrec ()
#5  0x00005555555c111c in hts_itr_next ()
#6  0x00005555555d9a51 in bam_plp64_auto ()
#7  0x00005555555d9ad1 in bam_plp_auto ()
#8  0x000055555552b03a in <mod_kit::mod_pileup::PileupIter as core::iter::traits::iterator::Iterator>::next ()
#9  0x0000555555528d96 in _ZN7mod_kit8commands12ModBamPileup3run28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$17hb169f6e83b14f0b3E.llvm.16552777208159181741 ()
#10 0x00005555555277fe in rayon::iter::plumbing::Folder::consume_iter ()
#11 0x00005555554f3501 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#12 0x000055555550d9a3 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17h64310596cc7337a5E.llvm.316613399766671356 ()
#13 0x00005555555118e4 in rayon_core::registry::in_worker ()
#14 0x00005555554f35f8 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#15 0x000055555553d629 in _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17hcb306c6da184137aE.llvm.14917117514892184753 ()
#16 0x00005555554bcaa3 in rayon_core::registry::WorkerThread::wait_until_cold ()
#17 0x000055555550da5d in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17h64310596cc7337a5E.llvm.316613399766671356 ()
#18 0x00005555555118e4 in rayon_core::registry::in_worker ()
#19 0x00005555554f35f8 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#20 0x000055555553d629 in _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17hcb306c6da184137aE.llvm.14917117514892184753 ()
#21 0x00005555554bcaa3 in rayon_core::registry::WorkerThread::wait_until_cold ()
#22 0x000055555550da5d in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17h64310596cc7337a5E.llvm.316613399766671356 ()
#23 0x00005555555118e4 in rayon_core::registry::in_worker ()
#24 0x00005555554f35f8 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#25 0x000055555550d9a3 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17h64310596cc7337a5E.llvm.316613399766671356 ()
#26 0x00005555555118e4 in rayon_core::registry::in_worker ()
#27 0x00005555554f35f8 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#28 0x000055555553d629 in _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17hcb306c6da184137aE.llvm.14917117514892184753 ()
#29 0x00005555554bcaa3 in rayon_core::registry::WorkerThread::wait_until_cold ()
#30 0x00005555558ca9a8 in rayon_core::registry::ThreadBuilder::run ()
#31 0x00005555558d12ea in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#32 0x00005555558ce431 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#33 0x0000555555943893 in alloc::boxed::{impl#45}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1987
#34 alloc::boxed::{impl#45}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1987
#35 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#36 0x00007ffff7d0bb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#37 0x00007ffff7d9da00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Error! zero reads in bam index

Hello, when I try the modkit pileup command using bam output from dorado basecaller modified_bases as input to modkit, it fails with the following errors:

modkit pileup -r /data/Homo_sapiens.GRCh38.106_no_alt_chrs.fa -t 150 dorado_0-3-0_dna_r10.4.1_e8.2_400bps_hac_v4.1.0_mod_bases.bam dorado_0-3-0_dna_r10.4.1_e8.2_400bps_hac_v4.1.0_mod_bases.bed
> attempting to sample 10042 reads from BAM
> Error! zero reads in bam index

The bam file was sorted and indexed using samtools. What am I doing wrong?

Error code: Error! didn't sample enough data, try a larger fraction of another seed

Hi!

I'm trying to get a bedMethyl file from some Nanopore modbam files that I generated with Guppy 6.4.6 using cfg: dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup_prom.cfg.

Originally I was running into errors when giving the folder that the files are in, or as the files with a wildcard, so I ended up using samtools merge to merge all of my modbams and then index. This merged modbam file is 125GB. Now my current error is as follows:

modkit pileup 7254_wgs_meth_sorted.bam 7254_wgs_pileup.bed --log-filepath 7254_wgs_pileup.log --combine-mods

from the log file:
[src/logging.rs::54][2023-04-19 10:23:04][DEBUG] command line: modkit pileup 7254_wgs_meth_sorted.bam 7254_wgs_pileup.bed --log-filepath 7254_wgs_pileup.log --combine-mods [src/commands.rs::100][2023-04-19 10:23:04][INFO] sampling 10042 reads from BAM [src/interval_processor.rs::227][2023-04-19 10:23:04][INFO] sampled 0 probabilities [src/bin/main.rs::15][2023-04-19 10:23:04][ERROR] Error! didn't sample enough data, try a larger fraction of another seed

Any reason that it would be sampling 0 probabilities? I get an identical error when I try just a single indexed modbam file from the Guppy output, and no difference whether or not I include --combine-mods'

Thanks! Let me know if I am not using the tool correctly.

Q: `--filter-threshold FRAC` description

I do understand what --filter-percentile is actually doing.

I am not sure though what --filter-threshold is really doing, nor what it (potentially) changes in the results file.
Technically it sets a "threshold", okay.

Can you elaborate a bit on that parameter, describing what it is actually doing, what kind of threshold is set and how this influences the results in the results file(s)? Or if this parameter should be better left untouched in order to let modkit set the threshold?

This is a very basic question and the explanation will probably be quite simple, but somehow I didn't get it yet :-)

Mixed delimiter types in output bed file

I don't know if this is done on purpose, but the output bed file has mixed delimiters (tab and space) which makes it super annoying to parse. I'm using modkit v0.1.6 with --preset traditional as the only setting.

For example, this looks a little suspicious:

$ cat ../../00_samples/cases/P10227_30544.D01.merged.modkit.bed | head
chr1    10468   10469   m       3       .       10468   10469   255,0,0 3 100.00 3 0 0 0 0 0 1
chr1    10470   10471   m       1       .       10470   10471   255,0,0 1 100.00 1 0 0 0 0 0 0
chr1    10483   10484   m       4       .       10483   10484   255,0,0 4 100.00 4 0 0 0 0 0 0
chr1    10488   10489   m       4       .       10488   10489   255,0,0 4 100.00 4 0 0 0 0 0 0
chr1    10492   10493   m       2       .       10492   10493   255,0,0 2 100.00 2 0 0 0 1 0 1
chr1    10496   10497   m       2       .       10496   10497   255,0,0 2 100.00 2 0 0 0 0 0 0
chr1    10524   10525   m       2       .       10524   10525   255,0,0 2 100.00 2 0 0 0 0 0 0
chr1    10541   10542   m       2       .       10541   10542   255,0,0 2 100.00 2 0 0 0 0 0 0
chr1    10562   10563   m       3       .       10562   10563   255,0,0 3 66.67 2 1 0 0 0 0 0
chr1    10570   10571   m       3       .       10570   10571   255,0,0 3 66.67 2 1 0 0 0 0 0

Using verbose cat tells all:

$ cat -veT ../../00_samples/cases/P10227_30544.D01.merged.modkit.bed | head
chr1^I10468^I10469^Im^I3^I.^I10468^I10469^I255,0,0^I3 100.00 3 0 0 0 0 0 1$
chr1^I10470^I10471^Im^I1^I.^I10470^I10471^I255,0,0^I1 100.00 1 0 0 0 0 0 0$
chr1^I10483^I10484^Im^I4^I.^I10483^I10484^I255,0,0^I4 100.00 4 0 0 0 0 0 0$
chr1^I10488^I10489^Im^I4^I.^I10488^I10489^I255,0,0^I4 100.00 4 0 0 0 0 0 0$
chr1^I10492^I10493^Im^I2^I.^I10492^I10493^I255,0,0^I2 100.00 2 0 0 0 1 0 1$
chr1^I10496^I10497^Im^I2^I.^I10496^I10497^I255,0,0^I2 100.00 2 0 0 0 0 0 0$
chr1^I10524^I10525^Im^I2^I.^I10524^I10525^I255,0,0^I2 100.00 2 0 0 0 0 0 0$
chr1^I10541^I10542^Im^I2^I.^I10541^I10542^I255,0,0^I2 100.00 2 0 0 0 0 0 0$
chr1^I10562^I10563^Im^I3^I.^I10562^I10563^I255,0,0^I3 66.67 2 1 0 0 0 0 0$
chr1^I10570^I10571^Im^I3^I.^I10570^I10571^I255,0,0^I3 66.67 2 1 0 0 0 0 0$

unexpected behaviour from multiple modikit commands

Hello. I have been using modkit to process modification calls generated from dorado with the [email protected] model, and obtained unexpected results from several commands. In many cases, the number of supporting reads I can see in the BAM file does not agree with the results I get from modkit commands. I was hoping someone may be able to clarify how modkit behaves in several scenarios.

sample-probs only provides quality score percentile estimates for the canonical base in the thresholds.tsv file. From the documentation, I expected to see estimates for m and h also. I have copied the contents of the threshold.tsv file below. if I use the --hist`` argument, the output does include mandhin theprobabilities.tsv` file, which is confusing.

base percentile threshold
C 10 0.8300781
C 50 0.98046875
C 90 0.9941406

The command I run is
modkit sample-probs -t 20 mod_basecall_aligned.bam --out-dir sample_probs-on-target --hist --only-mapped --include-bed target-regions.bed
I use include-bed as this is from an adaptive sequencing run.

I have been trying to calculate the probability threshold estimates myself for each modified base using the output of extract however in the output, there appear to be reads missing for some modification calls that I can see in the BAM file via IGV. perhaps someone could clarify exactly how the probability thresholds could be calculated using the output from extract. this would also be helpful as I would like to visualize the distribution of probabilities for a given modification. my command is:
modkit extract mod_basecall_aligned.bam base-mod-data.tsv \ --include-bed target-regions.bed \ --mapped \ -t 20
using pileup with the combine-strands arguments appears to include reads in the output file that have been combined those from the + strand, however none from the negative strand. when i count the values in the 6th column of the output I get the output below:
. +
144134 713884

I am confused as to why no - strand only entires exist, as I can see these in the BAM file. Additionally, there are sites where I can see + and - strand aligned reads, however there is only an output for the + strand at that site in the output file. Perhaps someone could advise why this might be happening. My code:
modkit pileup mod_basecall_aligned.bam mod_basecall_aligned.bed \ --cpg \ --threads 40 \ --ref $ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa \ --include-bed target-regions.bed \ --combine-strands

Thanks in advance,

incomplete output using duplex and bed with modkit extract

Hi @ArtRand . I’m running modkit extract using duplex reads with modifications on both strands and using a bed file to restrict output to CpG’s (see below). The bed file was produced with the modkit motif-bed as seen in the documentation.
modkit extract dx.bam mods.tab --ref my_fasta.fa --include-bed my_cg_motifs.bed

However, the command only reports the information of one (e.g. position 155575) of the two cytosine bases (e.g position 155574 and 155575) within the CpG context for both strands. Output using bed:

1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22281   155575  1       +       -       47      10      22303   0.001953125     h       34      CCGAC   GTCGG   C
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22281   155575  1       +       -       47      10      22303   0.001953125     m       34      CCGAC   GTCGG   C
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22245   155611  1       +       -       47      10      22303   0.009765625     h       50      TCGGG   CCCGA   C
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22245   155611  1       +       -       47      10      22303   0.001953125     m       50      TCGGG   CCCGA   C

This is not the case when I run the command without the bed file, both cytosine bases of the CpGs are reported (i.e. position 155574 and 155575) -in the cases where a CpG exists of course. Output not using bed:

1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22282   155574  1       -       -       47      10      22303   0.001953125     h       39      CCCGA   CCCGA   G
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22282   155574  1       -       -       47      10      22303   0.001953125     m       39      CCCGA   CCCGA   G
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22281   155575  1       +       -       47      10      22303   0.001953125     h       34      CCGAC   GTCGG   C
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22281   155575  1       +       -       47      10      22303   0.001953125     m       34      CCGAC   GTCGG   C
1e309936-8d76-45f5-b1ab-1b81a856718e;f313897e-af24-4cb2-affe-befaee892059       22246   155610  1       -       -       47      10      22303   0.017578125     h       39      CTCGG   CTCGG   G

Is there any reason why could this be happening?
I notice that in the modkit extract command example in the documentation the --include-bed is not used but the --include-only option is, which will thrown an error in the latest versions.

Extra question:
I am puzzled about the modifications probabilities conversion. For instance, the last output example (position 155574) the converted probability is 0.001953125. However, if I convert this probability back ( 0.001953125*256), I get 0.5 and the bam encoded is 0 at this modification position. Is it because taking the middle point between the two integers (e.g. [0,1]) is the safest for remapping between the two scale probabilities ranges ( [0,255] -> [0,1] )?

Thank you for any information!

Difficulties to run modkit with the modbases tutorial

Hello, we are starting to apply ONT in our laboratory.
I'd like to apply the modkit to the chr20 data from the GIAB NA24385 sample. I managed to replicate the tutorial using modbam2bed (https://labs.epi2me.io/notebooks/Modified_Base_Tutorial.html), but I would like to apply modkit as indicated by nanopore.

However, I am getting the result: “processed 0 rows and skipped ~331448 reads.”

Has anyone managed to modkit this data or have any suggestions as to what might be wrong?

Please let me know if you need more information.

Thanks in advance,

Marcel

About the output of modkit

Hi, thank you so much for sharing this tool.

It seems that the output bedMethyl file will be the summary at the site level, which means each line corresponds to each site summarized from the reads covering this site. I just would like to know is there any way to create the output where each read for each line.

Because I want to know the status of each CG on each read. Although there is the status of each CG on each read in the modbam file, but there is not the coordinate of each CG, and MM/ML tag only provides the relative location to the sequence, not the position on the reference.

Thank you so much!

[call-mods] `--no-filtering` broken, use `--filter-threshold 0.0` instead

The modkit call-mods command does not properly handle --no-filtering. A work-around is to specify a filter threshold of 0.0 (with --filter-threshold 0.0 or --filter-threshold C:0.0, etc.). Will be fixed ASAP in 0.1.8.

Empty bed file

Hello!
I have been trying to run modkit pileup but I get no output.
Command I ran:

./dist/modkit pileup --log-filepath pileup.log ../mod.bases.basecalling/5mC.CpG.mod.bases.mitochondria.bam pileup.5mC.CpG.mitochondria.bed

Log file output:

[src/commands.rs::133][2023-05-25 03:33:23][INFO] sampling 10042 reads from BAM
[src/reads_sampler.rs::228][2023-05-25 03:33:23][DEBUG] sampled 0 mapped records, sampling unmapped records
[src/reads_sampler.rs::249][2023-05-25 03:33:23][DEBUG] sampled 10042 unmapped records
[src/reads_sampler.rs::256][2023-05-25 03:33:23][DEBUG] sampled 10042 records
[src/commands.rs::675][2023-05-25 03:33:23][WARN] Threshold of 0.6484375 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.
[src/commands.rs::843][2023-05-25 03:33:23][INFO] Done, processed 0 rows. Processed ~0 reads and skipped zero reads.

And I have checked and I have the correct tags in the BAM file:

c5c0d295-80d5-4469-8da8-3a1fd92fd843 4 * 0 0 * * 0 247 GGTAATACTTCGTTCAGTTACGTATTGCTTAGTTTTGCTGGGGGATAAGAGGGCTGTGGGCAAGAGTATGATTGTTGGTAGGATGATCTTTAGCATTGTAGAAAGTTTAGGTTGTGTAGGTGGTCAGAGCCATGTGTTCGTGCAGAAGCTACTAGTATGGCTTAGGCCTGTGCCAGCTTCGCATGTGGATGCAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGCGTCGTGTAAGGAAAAGAGGGGAA '((('&%$''98:;<><<<9887::;+'''&%%')7>?>9222014A=BA=>>;<;<<<=?>;3('-.257:96666200113=?BCBCD?;:8:?=>>>@??>;97760/5587990/)((5;))&''36>?DCC@>777266A?<==@A;10-,+,..))+5;;=>?B??:333465778=;89;<=;87(666=;=?DDDA?==999:?<<37>54CA?<;::;<>821411,+-)(-+* MM:Z:C+h?,1,1,9,9,2,0,0,0,0;C+m?,1,1,9,9,2,0,0,0,0; ML:B:C,1,1,0,13,0,1,0,1,0,4,10,18,9,2,1,0,0,1 qs:i:13 mx:i:4 ch:i:288 rn:i:202 st:Z:2023-02-20T14:35:34Z f5:Z:FAW45780_pass_7fb25cdd_59cf256b_0.fast5 ns:i:4598 ts:i:330 sm:f:69.3264 sd:f:10.2165 sv:Z:med_mad RG:Z:59cf256be47a641beb65be04367cb11db5b86801_2021-05-17_dna_r9.4.1_minion_384_d37a2ab9

When I ran modkit summary it works but I get a warning:

./dist/modkit summary --no-sampling --no-filtering --log-filepath ./summary.log /home/viole/nanopore/mod.bases.basecalling/5mC.CpG.mod.bases.mitochondria.bam >  summary.modifications.stats

double added base mod calls for base C and read eae1c9de-30c3-4b23-bb98-b98bcece579b,potentially a logic error!

This warning is shown multiple times (> 100), but I get the stats for each modification:
bases C
total_reads_used 120725
count_reads_C 120725
base code pass_count pass_frac all_count all_frac
C m 1145242 0.0930746 1145242 0.0930746
C h 722809 0.058743183 722809 0.058743183
C - 10436509 0.8481822 10436509 0.8481822

I don't truly know what I am doing wrong, I should have a bed file with at least those 0.1% of modified bases?
I have also tried to run pileup with the --force-allow-implicit to at least get also the canonical C but I get the same log output (and an empty bed file):

[src/commands.rs::133][2023-05-25 03:34:02][INFO] sampling 10042 reads from BAM
[src/reads_sampler.rs::228][2023-05-25 03:34:02][DEBUG] sampled 0 mapped records, sampling unmapped records
[src/reads_sampler.rs::249][2023-05-25 03:34:03][DEBUG] sampled 10042 unmapped records
[src/reads_sampler.rs::256][2023-05-25 03:34:03][DEBUG] sampled 10042 records
[src/commands.rs::675][2023-05-25 03:34:03][WARN] Threshold of 0.6484375 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.
[src/commands.rs::843][2023-05-25 03:34:03][INFO] Done, processed 0 rows. Processed ~0 reads and skipped zero reads.

Could you pinpoint what I am doing wrong?
Thank you so much!
Violeta de Anca

Comparison between `modkit` and `modbam2bed`.

Following discussion already have in post #2, I did some tests on two of our samples. However, I observe different results among them. Note that for the simplicity,

the base calling is done with 5mC model only, hence no 5hmC is reported.
methylation analysis focuses on CpG only, and the results of both strands are aggregated.
the same cutoff (estimated by modkit) are used for both methods.

On puc19 control samplem both methods detected comparable methylation ratio, however the count statistics are very different. There is 50% difference on # of valid calls. More filtered calls, more nocalls from modbam2bed.

For modbam2bed, I run analysis and summary as follow. Note that 'cpg.acc' bed file is used as it is the one contains aggregated results.

> modbam2bed -m 5mC -t 32 -e --cpg --aggregate -f 0.6074219 neb_puc19.fa PAK67493_barcode19_5mC_sorted.bam > modbam2bed.puc19.CpG.pileup.bed 2>modbam2bed.puc19.CpG.pileup.log
> awk '$4!="nan" {can+=$12; mod+=$13; oth+=$16; total+=$10; filter+=$14; nocall+=$15} END{valid=can+mod+oth; print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " valid\n"(filter) " filtered, NA diff nt, NA del, " (nocall) " nocall, " (total) " total."}' modbam2bed.puc19.CpG.pileup.cpg.acc.bed
229649 (0.128886) CpG canonical
1552155 (0.871114) 5mCpG modified
0 (0) 5hmCpG modified
1781804 valid
151311 filtered, NA diff nt, NA del, 35149 nocall, 1986678 total.

For modkit, I run analysis and summary as follow.

> modkit pileup -t 3 --cpg --ref neb_puc19.fa --combine-strands --filter-threshold 0.6074219 --log-filepath puc19.CpG.pileup.log PAK67493_barcode19_5mC_sorted.bam puc19.CpG.pileup.bed
> awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10; del+=$15; filter+=$16; diff+=$17; nocall+=$18} END{print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " total valid \n" (filter) " filtered, " (diff) " diff nt, " (del) " del, " (nocall) " nocall."}' puc19.CpG.pileup.bed
157106 (0.135189) CpG canonical
1005016 (0.864811) 5mCpG modified
0 (0) 5hmCpG modified
1162122 total valid
100066 filtered, 5921 diff nt, 5647 del, 15012 nocall.

output different context (CG, CHG, CHH) in separate files

Hi,
Is there any plan to enable bed output separately for each context (CG, CHG, CHH) as in modbam2bed? I currently use Bedtools intersect, but it is quite slow and I have multiplexed samples, so it is taking a long time. I think it would save time if this can be done in modkit too. Thanks

Option '--filter-threshold' of `summary`

Thanks for creating this exciting new tool. However, there seems to be a bug in the summary part.

I try to run summary with set filtering threshold, however it seems that it does not skip sampling process and it only processsed sampled reads (see total_reads_used)

modkit summary --filter-threshold 0.7324219 PAK67493_barcode02_sorted.bam
> sampling 10042 reads from BAM
> parsing user defined thresholds
> this output format will not be default in the next version, the table output (set with --table) will become default and this format will require the --tsv option
mod_bases       C
count_reads_C   10042
C_calls_modified_m      5735
C_frac_modified_m       0.1119155413316681
C_filtered_modified_m   1331
C_calls_unmodified      43073
C_frac_unmodified       0.8405471860120209
C_filtered_unmodified   2504
C_calls_modified_h      2436
C_frac_modified_h       0.04753727265631098
C_filtered_modified_h   1822
C_total_mod_calls       51244
C_total_filtered_mod_calls      5657
total_reads_used        10042

Only when I also use --no-filtering option, it processes all the reads.

modkit summary --no-filtering --filter-threshold 0.7324219 PAK67493_barcode02_sorted.bam
> performing no filtering
> parsing user defined thresholds
> this output format will not be default in the next version, the table output (set with --table) will become default and this format will require the --tsv option
mod_bases       C
count_reads_C   112489
C_calls_modified_m      54735
C_frac_modified_m       0.09675349468637974
C_filtered_modified_m   13888
C_calls_unmodified      484487
C_frac_unmodified       0.8564138189480234
C_filtered_unmodified   27845
C_calls_modified_h      26494
C_frac_modified_h       0.04683268636559687
C_filtered_modified_h   19498
C_total_mod_calls       565716
C_total_filtered_mod_calls      61231
total_reads_used        112489

I would assume that when --filter-threshold, no sampling is needed and all reads should be processed. Is this a bug or intended behavior?

Many thanks.

Extract mods option --exclude-bed (-v)

In the extract subcommand there's the option to exclude regions based on a bed file. The argument can either be --exclude-bed or the short version -v. I find it a bit weird that this -v exists since --include-bed does not have a short argument. And the other short arguments are quite general options (number of threads, interval size and help). Maybe a copy-paste remain from another part of the code?
Feel free to close this, since I could very well be wrong here and there's a good reason for this.

modkit settings for rerio flip-flop modified base models

Hello,
Could modkit be used to summarize the mod_mapping BAM file generated from deprecated rerio modified base model, such as this guy:
res_dna_r941_min_modbases-all-context_v001.cfg? I have run a test on a negative sample but didn't expect the 6mA percentage will 15%.

modkit summary -t32 RWPE-1_5mC_6mA_nct/mod_mappings.adjusted.sorted.bam
> sampling 10042 reads from BAM
> calculating threshold at 10% percentile
> calculated thresholds: A: 0.5136719C: 0.5175781
> this output format will not be default in the next version, the table output (set with --table) will become default and this format will require the --tsv option
mod_bases	A,C
count_reads_A	8772
count_reads_C	8772
A_calls_unmodified	14606843
A_frac_unmodified	0.8461740726279702
A_filtered_unmodified	1544135
A_calls_modified_a	2655377
A_frac_modified_a	**0.1538259273720298**
A_filtered_modified_a	333031
A_total_mod_calls	17262220
A_total_filtered_mod_calls	1877166
C_calls_unmodified	6990544
C_frac_unmodified	0.5479038391510863
C_filtered_unmodified	795181
C_calls_modified_m	5768162
C_frac_modified_m	0.4520961608489137
C_filtered_modified_m	364560
C_total_mod_calls	12758706
C_total_filtered_mod_calls	1159741
total_reads_used	8772

Thanks,

Options of 'summary'

For following two options of summray subcommand, it is unclear that it is use for threshold estimation or for the summary itself. I suggest that if it is for the threshold estimation, better change it accordingly (such as those in pileup) to make it clear.

-n, --num-reads <NUM_READS>                  Max number of reads to use, especially recommended when using a large BAM without an index. If an indexed BAM is provided, the reads will be sampled evenly
                                               over the length of the aligned reference. If a region is passed with the --region option, they will be sampled over the genomic region [default: 10042]
  
-f, --sampling-frac <SAMPLING_FRAC>          Instead of using a defined number of reads, specify a fraction of reads to sample, for example 0.1 will sample 1/10th of the reads