bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.

License: MIT License

Rust 79.35% Python 0.18% Shell 0.14% Roff 20.34%
average-nucleotide-identity bioinformatics metagenomics rust

skani's Introduction

skani - accurate, fast nucleotide identity calculation for MAGs, genomes, and databases

Introduction

skani is a program for calculating average nucleotide identity (ANI) and aligned fraction (AF) between DNA sequences (contigs/MAGs/genomes) with ANI > ~80%.

skani uses an approximate mapping method without base-level alignment to estimate ANI. It is orders of magnitude faster than BLAST-based methods and almost as accurate. skani offers:

  1. Accurate ANI calculations for MAGs. skani is accurate for incomplete and medium-quality metagenome-assembled genomes (MAGs). Pure sketching methods (e.g. Mash) may underestimate ANI for incomplete MAGs.

  2. Aligned fraction results. skani outputs the fraction of genome aligned.

  3. Fast computations. Indexing/sketching is ~3x faster than Mash, and querying is about 25x faster than FastANI (but slower than Mash).

  4. Efficient database search. Querying a genome against a preprocessed database of >65000 prokaryotic genomes takes seconds with a single processor and ~6 GB of RAM. Constructing a database from genome sequences takes minutes to an hour.

Updates

v0.2.2 - 2024-07-04

  • Added the --small-genomes preset that is an alias for -c 30 -m 200 --faster-small
  • Fixed some bugs

See the CHANGELOG for skani's full version history.

Install

Option 1: Build from source

Requirements:

  1. The Rust programming language and associated tools such as cargo, assumed to be in PATH.
  2. A C compiler (e.g. GCC)
  3. make

Building takes a few minutes, depending on the number of cores.

git clone https://github.com/bluenote-1577/skani
cd skani

# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo
skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta

# If ~/.cargo doesn't exist use below commands instead
#cargo build --release
#./target/release/skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta

See the Releases page for obtaining specific versions of skani.

Option 2: Conda (source version: 0.2.1)

conda install -c bioconda skani

Option 3: Pre-built x86-64 Linux statically compiled executable

We offer a pre-built statically compiled executable for x86-64 Linux systems. That is, if you're on an x86-64 Linux system, you can just download the binary and run it without installing anything.

For using the latest version of skani:

wget https://github.com/bluenote-1577/skani/releases/download/latest/skani
chmod +x skani
./skani -h

Important: the pre-built binary usually runs slightly slower (3-10%) than a build from source, and it can be drastically slower on some tasks.

Quick start

# compare two genomes for ANI. skani is symmetric, so order does not affect ANI
skani dist genome1.fa genome2.fa 
skani dist genome2.fa genome1.fa 

# compare multiple genomes; every subcommand accepts -t for multi-threading.
skani dist -t 3 -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt

# compare individual fasta records (e.g. contigs)
skani dist --qi -q assembly1.fa --ri -r assembly2.fa  

# construct database and do memory-efficient search
skani sketch genomes_to_search/* -o database
skani search query1.fa query2.fa ... -d database

# use sketch from "skani sketch" output as drop-in replacement
skani dist database/query.fa.sketch database/ref.fa.sketch

# construct similarity matrix/edge list for all genomes in folder
skani triangle genome_folder/* > skani_ani_matrix.txt
skani triangle genome_folder/* -E > skani_ani_edge_list.txt

# we provide a script in this repository for clustering/visualizing distance matrices.
# requires python3, seaborn, scipy/numpy, and matplotlib.
python scripts/clustermap_triangle.py skani_ani_matrix.txt 

Tutorials and manuals

For more information about using the specific skani subcommands, see the guide linked above.

skani tutorials

Some common use cases and parameter settings are outlined in the cookbook.

Pre-sketched databases can be downloaded and quickly searched against. GTDB-R214 is currently supported.

See the advanced usage guide linked above for more information about topics such as:

  • optimizing sensitivity/speed of skani
  • optimizing skani for long-reads or contigs
  • making skani more memory efficient for huge data sets

Output

If the resulting aligned fraction for the two genomes is < 15%, no output is given.

In practice, this means that only results with > ~82% ANI are reliably output (with default parameters). See the skani advanced usage guide for information on how to compare lower ANI genomes.

The default output for search and dist looks like

Ref_file	Query_file	ANI	Align_fraction_ref	Align_fraction_query	Ref_name	Query_name
refs/e.coli-EC590.fasta	refs/e.coli-K12.fasta	99.39	93.95	93.37	NZ_CP016182.2 Escherichia coli strain EC590 chromosome, complete genome	NC_007779.1 Escherichia coli str. K-12 substr. W3110, complete sequence
  • Ref_file: the filename of the reference.
  • Query_file: the filename of the query.
  • ANI: the estimated ANI, in percent.
  • Align_fraction_ref/Align_fraction_query: the percentage of the reference/query covered by alignments.
  • Ref_name/Query_name: the ID of the first record in the reference/query file.

The order of results is dependent on the command and not guaranteed to be deterministic when > 5000 query genomes are present. dist and search try to place the highest ANI results first.
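
Since result order is not guaranteed, downstream scripts may prefer to parse and sort the table themselves. The following is a minimal sketch, assuming the tab-separated layout and column names shown in the example output above; the function name `read_skani_table` is made up for illustration and is not part of skani.

```python
import csv
import io

# Sample copied from the example output above (tab-separated).
SAMPLE = (
    "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\tRef_name\tQuery_name\n"
    "refs/e.coli-EC590.fasta\trefs/e.coli-K12.fasta\t99.39\t93.95\t93.37\t"
    "NZ_CP016182.2 Escherichia coli strain EC590 chromosome, complete genome\t"
    "NC_007779.1 Escherichia coli str. K-12 substr. W3110, complete sequence\n"
)

NUMERIC_COLUMNS = ("ANI", "Align_fraction_ref", "Align_fraction_query")


def read_skani_table(handle):
    """Parse dist/search-style output into a list of dicts with numeric fields."""
    rows = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for key in NUMERIC_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


rows = read_skani_table(io.StringIO(SAMPLE))
# Sort explicitly instead of relying on the order skani happened to write.
rows.sort(key=lambda r: r["ANI"], reverse=True)
for r in rows:
    print(r["Ref_file"], r["Query_file"], r["ANI"])
```

To read a real results file, pass an open file handle (for example `open("all-to-all_results.txt")`) instead of the in-memory sample.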

Citation

Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023). https://doi.org/10.1038/s41592-023-02018-3

Feature requests, issues

skani is actively being developed by me (Jim Shaw). I'm more than happy to accommodate simple feature requests (different types of outputs, etc). Feel free to open an issue with your feature request on the GitHub repository. If you catch any bugs, please open an issue or e-mail me (e-mail on my website).

Calling skani from Rust or Python

Rust API

If you're interested in using skani as a Rust library, check out the minimal example here: https://github.com/bluenote-1577/skani-lib-example. The documentation is currently minimal (https://docs.rs/skani/0.1.0/skani/) and I guarantee no API stability.

Python bindings

If you're interested in calling skani from Python, see pyskani, a Python interface and bindings to skani written by Martin Larralde. Note: I am not personally involved in the pyskani project and do not offer guarantees on the correctness of the outputs.

skani's People

Contributors

althonos, bluenote-1577, fplazaonate

skani's Issues

skani versus bindash

Hello Jim,

In the main paper you compare skani with Mash, but not with bindash. bindash builds on another important MinHash breakthrough, b-bit One Permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html), and is about 1000 times faster than Mash for both sketch and dist. The speedup comes from a theoretical improvement in MinHash: the time complexity drops from O(d*k) to O(d+k) using 2-universal hashing, where d is the number of nonzero elements in the set and k is the number of MinHashes. In practice k is around 10^4 for 99.9% ANI accuracy, while d is often about 3x10^6, which makes k negligible. I understand that MinHash is somewhat biased by set size, but in terms of speed, bindash should be the new benchmark standard rather than the bottom-k sketch behind Mash, which is 20 years old.

Thanks,

Jianshu

Implementation of Thomas Wang's integer hash function

Hi.

I think there is a bug in the implementation of Thomas Wang's integer hash function. Specifically, the first line should be key = (!key).wrapping_add(key << 21); instead of key = !key.wrapping_add(key << 21);. I'm basing this off of Heng Li's C implementation here:
https://gist.github.com/lh3/974ced188be2f90422cc#file-inthash-c

And the Rust port description here:
https://aebou.rbind.io/post/a-rust-glimpse-at-thomas-wang-integer-hash-function/

In my testing, this does change the resulting hashes, though it doesn't seem to materially impact results in my application.

Cheers,
Donovan
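
The difference between the two expressions comes from operator precedence: in Rust, a method call binds more tightly than unary `!`, so `!key.wrapping_add(key << 21)` is evaluated as `!(key.wrapping_add(key << 21))`. The sketch below emulates only that first line in Python with explicit 64-bit masking; the function names are made up for illustration, and it is not a port of the full hash.

```python
MASK = (1 << 64) - 1  # emulate Rust u64 wrapping arithmetic


def first_step_as_written(key):
    """Emulates `!key.wrapping_add(key << 21)`, i.e. NOT applied after the add."""
    return ~((key + (key << 21)) & MASK) & MASK


def first_step_reference(key):
    """Emulates `(!key).wrapping_add(key << 21)`, i.e. `(~key) + (key << 21)` in C."""
    return ((~key & MASK) + ((key << 21) & MASK)) & MASK


for key in (1, 42, 0xDEADBEEF):
    a = first_step_as_written(key)
    b = first_step_reference(key)
    # The two variants differ by (key << 22) modulo 2**64.
    print(hex(key), hex(a), hex(b), hex((b - a) & MASK))
```

Both variants map distinct keys to distinct values after this step, which is consistent with the observation that the hashes change while downstream results are not materially affected.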

skani outputting low number of comparisons

Hello developer! I'm testing skani on a set of MAGs by making all-to-all comparisons. I ran skani with the default parameters as follows:

skani dist -t 20 -q mags/*.fa -r mags/*.fa -o skani_petase_mags/mags_dist.txt

Given that I have 114 MAGs, I'm expecting 6441 rows (pairwise combinations), but I'm getting only 178 comparisons. Looking at the documentation, I see that you preset a minimum aligned fraction of 15%, which, as I understand the program's sparse chaining algorithm, is a necessary step to detect orthologous regions.

When I set --min-af 0 I get more comparisons, but the number is still relatively low (206). Is the likely explanation that most pairs of my MAGs have no aligned fraction at all? Or could it be that those comparisons fall outside the confidence interval of the ANI distribution?

bests,

Valentín.
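
The expected row count depends on how pairs are counted. As a quick check for 114 genomes, the three usual conventions can be computed directly; which of them applies to a given skani command is not stated in this thread, so the sketch only shows the arithmetic.

```python
from math import comb

n_genomes = 114  # number of MAGs mentioned in the post

unordered_pairs = comb(n_genomes, 2)         # each pair once, no self-comparisons
ordered_pairs = n_genomes * (n_genomes - 1)  # both (a, b) and (b, a), no self-comparisons
all_vs_all = n_genomes ** 2                  # every query against every reference

print(unordered_pairs, ordered_pairs, all_vs_all)  # 6441 12882 12996
```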

needletail buffer-redux

Hello Jim,

Would you mind updating either bio or needletail to rely on the newly maintained version of buf-redux (called buffer-redux, here: https://crates.io/crates/buffer-redux/1.0.1)? I get the following warning when compiling with the newest stable Rust:

warning: the following packages contain code that will be rejected by a future version of Rust: buf_redux v0.8.4, partitions v0.2.4

An updated release on crates.io would also be very helpful!

Thanks,

Jianshu

Question on FracMinHash comparison with Minimizer+MinHash

Hi Jim,

I need to compare FracMinHash with Minimizer + MinHash to see how accurately each estimates Jaccard similarity. The key parameter in both is the sketch size. FracMinHash works in the original (sequence) space, while Minimizer + MinHash works in the minimizer space, which is much smaller: the sampling density is only 1/(2+w), where w is the minimizer window size, so a 3000 bp fragment, for example, yields only a few hundred minimizer k-mers.

Consequently, the MinHash sketch over minimizers can only contain a few hundred values (200-500 in practice in FastANI). FracMinHash has no such minimizer pre-sampling step, so with the same sketch size of 200 it cannot be as accurate as Minimizer + MinHash. But a larger sketch size, such as 1000+ in the original sequence space, is not a fair comparison either, because the minimizer step is itself a kind of sketching, like MinHash, that extracts the minimum hash value in a window. That is, we would be comparing two sketching algorithms combined against a single sketching algorithm.

It is very difficult to determine theoretically what the equivalent sketch size of Minimizer + MinHash together is, so that the same size could be used for FracMinHash. I think we first need to prove that the equivalent sketch size of Minimizer + MinHash is bounded within some range; then we can use that sketch size for FracMinHash. Do you have any ideas on this topic? Or is there a theoretical analysis showing that one is better than the other at the same sketch size?

Thank you,

Jianshu
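
One point that bears on the sketch-size question: a FracMinHash sketch has no fixed size, because it keeps every hash below a threshold and therefore grows with the input. The sketch below is a generic illustration under stated assumptions, not skani's implementation: it hashes forward-strand k-mers with BLAKE2b (an arbitrary choice), keeps about 1 in c of them, and compares the sketched Jaccard estimate with the exact one on synthetic sequences.

```python
import hashlib
import random


def kmer_hashes(seq, k):
    """64-bit hashes of all k-mers in seq (forward strand only, for brevity)."""
    return {
        int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
        for i in range(len(seq) - k + 1)
    }


def fracminhash(hashes, c):
    """Keep the hashes below 2**64 / c, i.e. roughly a 1/c fraction of them."""
    threshold = (1 << 64) // c
    return {h for h in hashes if h < threshold}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(20000))
# Copy of seq_a with roughly 2% of positions redrawn at random.
seq_b = "".join(rng.choice("ACGT") if rng.random() < 0.02 else base for base in seq_a)

full_a, full_b = kmer_hashes(seq_a, 15), kmer_hashes(seq_b, 15)
sketch_a, sketch_b = fracminhash(full_a, 10), fracminhash(full_b, 10)

print("exact Jaccard:    %.3f" % jaccard(full_a, full_b))
print("sketched Jaccard: %.3f" % jaccard(sketch_a, sketch_b))
print("sketch sizes:", len(sketch_a), len(sketch_b))
```

Because the sketch size here is about (number of distinct k-mers) / c, a comparison against a fixed-size MinHash sketch has to decide whether to match c, the resulting sketch size, or the memory used.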

Skani accuracy below 85% ANI

Hello Jim,

I ran an all-versus-all test on a collection of genomes we often use to test ANI calculators. It is not large, 300 genomes in total, spanning several genera in the phylum Actinobacteria in the NCBI bacterial genome database (so it covers 75% to 100% ANI well and evenly). This is what I saw for (1) FastANI and (2) skani, each versus orthoANI_usearch. Above 85% ANI, skani is pretty good and correlates well with orthoANI, similar to FastANI. Below 85%, however, skani's variation increases significantly, while FastANI is still good down to 80%; below that its variation increases, but it remains good enough to be trusted until around 76%. I attached the figures below.

Above 85% ANI, Mash actually works pretty well according to the FastANI paper and is much faster than both FastANI and skani, although it reports no aligned fraction. I understand that minimizers with a large sliding window will be problematic, but for now, with s=24, it is still good, with no significant problem in practice.

I am wondering what could come next to further improve FastANI/skani. FastANI's limiting step is finding homology via minimizers, while skani does this with its new seeding and chaining algorithm. I assume this is also skani's limiting step, because MinHash- and FracMinHash-based distance/identity estimation is extremely fast, considering recent cutting-edge MinHash algorithms such as b-bit One Permutation MinHash with optimal densification (https://academic.oup.com/bioinformatics/article/35/4/671/5058094).

[Figures: ANIu vs. FastANI; ANIu vs. skani]

Thanks,

Jianshu

Incorrect version number reported

Hi. It appears skani is not reporting the correct version number:

> cargo build --release
...
Compiling statrs v0.16.0
Compiling bio v1.0.0
Compiling skani v0.2.0 (/srv/home/uqdparks/git/skani)
Finished release [optimized] target(s) in 52.54s
uqdparks@reid:~/git/skani$ ./target/release/skani -V
skani 0.1.4

The compile output indicates this is v0.2.0, as does the Cargo.toml file.

(Feature Request) Host the GTDB r214.1 skani database

I'm following along on the tutorial here:
https://github.com/bluenote-1577/skani/wiki/Tutorial:-setting-up-the-GTDB-genome-database-to-search-against

I noticed that your other tool Sylph has a few databases built here:
https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases

Would it be possible to host the Skani versions of these datasets as well? In particular, the GTDB r214.1.

Regarding your note here:

This workflow is magnitudes faster than GTDB-tk, a classification tool associated to the GTDB. However, GTDB-tk is much more sensitive if your assembled genome has no direct species representative in the database. Furthermore, skani does not put the genome on a tree.

Have you looked into any ANI thresholds that could be used to suggest novel species, genus, family, etc.? For example, could < 95% indicate a new species within the genus? Or is this too speculative?

thread 'main' panicked at src/parse.rs:111:46: -l specified file could not be opened properly. Make sure this file exists. Exiting.: Os { code: 2, kind: NotFound, message: "No such file or directory" }

I'm not sure what's going on but I made sure that my file exists.

Any idea what's throwing this error?

Here's my skani version:

$ skani --version
skani 0.2.1

Here's my genome file paths file:

(VEBA-cluster_env) jovyan@jupyter-jolespin:~/Clustering/Genomes/Prokaryotic$ GENOMES="genomes.filepaths.contamination_lt10.list"
(VEBA-cluster_env) jovyan@jupyter-jolespin:~/Clustering/Genomes/Prokaryotic$ wc -l ${GENOMES}
58235 genomes.filepaths.contamination_lt10.list

Here's my skani command:

(VEBA-cluster_env) jovyan@jupyter-jolespin:~/Clustering/Genomes/Prokaryotic$ export RUST_BACKTRACE=full
(VEBA-cluster_env) jovyan@jupyter-jolespin:~/Clustering/Genomes/Prokaryotic$ skani triangle --sparse -t 16 -l {GENOMES} -o skani_output/skani_output.contamination_lt10.tsv --ci --min-af 15.0 -s 80.0 -c 125 -m 1000 --medium

Here's my error:

thread 'main' panicked at src/parse.rs:111:46:
-l specified file could not be opened properly. Make sure this file exists. Exiting.: Os { code: 2, kind: NotFound, message: "No such file or directory" }
stack backtrace:
   0:     0x55e0e8344540 - std::backtrace_rs::backtrace::libunwind::trace::he43a6a3949163f8c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55e0e8344540 - std::backtrace_rs::backtrace::trace_unsynchronized::h50db52ca99f692e7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55e0e8344540 - std::sys_common::backtrace::_print_fmt::hd37d595f2ceb2d3c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x55e0e8344540 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h678bbcf9da6d7d75
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55e0e826ca5c - core::fmt::rt::Argument::fmt::h3a159adc080a6fc9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:     0x55e0e826ca5c - core::fmt::write::hb8eaf5a8e45a738e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:     0x55e0e831372d - std::io::Write::write_fmt::h9663fe36b2ee08f9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:     0x55e0e8345aee - std::sys_common::backtrace::_print::hcd4834796ee88ad2
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55e0e8345aee - std::sys_common::backtrace::print::h1360e9450e4f922a
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55e0e83456d3 - std::panicking::default_hook::{{closure}}::h2609fa95cd5ab1f4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:     0x55e0e83466e8 - std::panicking::default_hook::h6d75f5747cab6e8d
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:     0x55e0e83466e8 - std::panicking::rust_panic_with_hook::h57e78470c47c84de
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:707:13
  12:     0x55e0e83461c2 - std::panicking::begin_panic_handler::{{closure}}::h3dfd2453cf356ecb
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  13:     0x55e0e8346126 - std::sys_common::backtrace::__rust_end_short_backtrace::hdb177d43678e4d7e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x55e0e8346111 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  15:     0x55e0e821b702 - core::panicking::panic_fmt::hd1e971d8d7c78e0e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  16:     0x55e0e821bb39 - core::result::unwrap_failed::hccb456d39e9c31fc
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  17:     0x55e0e82eb615 - skani::parse::parse_params::hee7ed757d493d092
  18:     0x55e0e830bf96 - skani::main::hf0a8a944fff623ac
  19:     0x55e0e822a9c3 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1d1e56655e7e787d
  20:     0x55e0e82f5d93 - main
  21:     0x7f77ba770d90 - <unknown>
  22:     0x7f77ba770e40 - __libc_start_main
  23:     0x55e0e822a905 - <unknown>
Aborted (core dumped)

[Question] Other uses of k-mer sketches created by skani?

I'm wondering if there are other uses for the k-mer sketches for each genome. For example, let's say you had 1000 genomes and built an index/sketch for each one. Do you know of any methods to determine the relative abundance of a sample by aligning to these sketches instead of the full genomes? I may be thinking of the problem incorrectly but that's why I thought I would reach out to ask if this was a possibility using skani sketches or if there are any tools you know of that can do this with custom databases?

skani for eukaryotic genomes

Hello Jim,

Another question about skani, which happens to be a weakness of FastANI: very abundant k-mers are filtered out during anchoring in FastANI. This is fine for prokaryotes because nearly all genes are single-copy. However, eukaryotic genomes have many multi-copy genes, so FastANI cannot be used for eukaryotic genomes. I am wondering whether the sparse chaining idea here is subject to the same problem.

Thanks,

Jianshu

Thomas Wang's hash function

Hello Jim,

I noticed that the integer hash function is not taken from Heng Li's C implementation in minimap2, but from this website (https://aebou.rbind.io/post/a-rust-glimpse-at-thomas-wang-integer-hash-function/), which just copies the code from the probminhash crate (https://github.com/jean-pierreBoth/probminhash/blob/master/src/invhash.rs), the first Rust implementation I know of (as the author of the blog also mentions). The code is exactly the same, except that the input argument is renamed from key_arg to kmer. Did you develop the Rust version of Thomas Wang's hash function independently, or did it come from there? I ask because in Rust the wrapping add used to avoid overflow is very different from the C version. I think the idea was originally Heng Li's, but the actual Rust implementation should also be acknowledged, since probminhash is released under two licenses:

Licensed under either of

Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0
MIT license LICENSE-MIT or http://opensource.org/licenses/MIT

Thanks,

Jianshu

help info for use as a library

Hello Jim,

Would it be possible to add to the README a detailed description of the function for comparing two genomes with skani, so that users can rely on skani as a library in other tools (in other words, make it modular)? I can see that it is somewhere in dist.rs.

Thanks,

Jianshu

[Feature request] Many-to-many ANI comparisons

I want to try replacing FastANI with Skani in my workflow. However, I need easy access to the many vs. many utility. For example, in FastANI I would do the following: fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o [OUTPUT_FILE]

Can you add this functionality to Skani?

Question on Markers.bin file.

Can I regenerate the markers.bin file in case it got deleted accidentally instead of doing the whole sketching process?

Thank You!!

Question: How does skani handle ambiguous ("N") bases?

Hello,

I'm working on a tool that computes a pairwise distance matrix of many SARS-CoV-2 consensus sequences. However, these sequences almost always contain at least a few non-ATGC ambiguous bases that count toward edit distances like Levenshtein. Would these bases also contribute to a lower ANI in skani triangle? Ultimately, I need my tool to ignore characters other than A, T, G, C, and -, so I figured I'd ask how skani handles this before I try to reinvent the wheel.

Thanks for your help!
--Nick
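
How skani itself treats ambiguous bases is not stated in this thread. For readers who need the behavior described in the question, one common approach in k-mer based tools is to skip every k-mer that contains a character other than A, C, G, or T. The sketch below shows that technique only; the function name is made up for illustration, and nothing here should be read as a description of skani's internals.

```python
VALID = set("ACGT")


def unambiguous_kmers(seq, k):
    """Yield (start, k-mer) for every k-mer made only of A, C, G, T."""
    seq = seq.upper()
    last_bad = -1  # index of the most recent non-ACGT character seen so far
    for i, base in enumerate(seq):
        if base not in VALID:
            last_bad = i
        start = i - k + 1
        # The window seq[start:i + 1] is clean only if it begins after last_bad.
        if start > last_bad:
            yield start, seq[start:i + 1]


print(list(unambiguous_kmers("ACGTNACGT", 3)))
# [(0, 'ACG'), (1, 'CGT'), (5, 'ACG'), (6, 'CGT')]
```

With this approach, gap characters such as `-` are skipped in the same way as `N`, because they are not in the set of valid bases.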

Aligned fraction reported by triangle mode

Hi,

The aligned fraction (AF) between two genomes is not symmetrical. The search and dist modes handle this by reporting the AF with respect to both the reference and the query. When running in triangle mode, only the lower triangle is provided. How should this be interpreted? Is it possible to return the full AF matrix to account for this measure not being symmetric?

Thanks,
Donovan
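
To illustrate why AF depends on the direction of comparison, here is a small worked example. It uses a simplified definition (aligned bases divided by genome length, as a percentage) and made-up genome sizes; skani's exact AF computation may differ.

```python
def aligned_fraction(aligned_bases, genome_length):
    """Percentage of a genome covered by alignments (simplified definition)."""
    return 100.0 * aligned_bases / genome_length


# Hypothetical pair: the same 3 Mbp region aligns in both genomes.
shared_bases = 3_000_000
length_a = 4_000_000
length_b = 6_000_000

af_a = aligned_fraction(shared_bases, length_a)
af_b = aligned_fraction(shared_bases, length_b)
print(af_a, af_b)  # 75.0 50.0

# A full (non-symmetric) AF matrix needs one value per ordered pair.
af_matrix = {("A", "B"): af_a, ("B", "A"): af_b}
```

A lower-triangular matrix has room for only one of these two values per pair, which is why the question of how to interpret it arises.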

`--full-matrix` doesn't work with `--sparse`

I'm experimenting with skani to cluster plasmid/virus genomes at scale (see here). To get diagonal values I started using --full-matrix, but I noticed that those values are not showing up when using --sparse. Is this intentional?

Also, it could be useful if there was a parameter that enables the diagonal in the output. I understand why this is not enabled by default (#31), but there are some cases where it is useful to have it. For example:

curl -L https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plsdb.fna.bz2 \
    | seqkit seq --only-id --upper-case \
    > plsdb.fna

skani triangle -t 16 --sparse -i -m 150 -c 30 -s 70 plsdb.fna > skani_output.tsv

awk 'NR>1 && $3>=95 && ($4>=85 || $5>=85) {
    printf("%s\t%s\t%.4f\n", $6, $7, $3 * ($4 > $5 ? $4 : $5) / 10000)
}' skani_output.tsv > edges.tsv

pyleiden edges.tsv clusters.txt

In the example above, sequences that don't have ANI and AF higher than the thresholds for any other sequence in the FASTA will not be in the network at all and will be missing from the input. Of course there are other ways of including them, but having the diagonal values (in this case, using --sparse) would make things easier.
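
As one of the "other ways" mentioned above, the filtering can be done in a short script that also adds a self-loop for every sequence left without a passing edge. This is a sketch under stated assumptions: the thresholds and weight formula mirror the awk command above, the input rows are made up, and whether the downstream clustering tool accepts self-loops is not verified here.

```python
def build_edges(rows, all_ids, min_ani=95.0, min_af=85.0):
    """Filter skani rows into weighted edges, then add self-loops for unmatched IDs.

    rows: iterable of (ref_name, query_name, ani, af_ref, af_query), values in percent.
    """
    edges, connected = [], set()
    for ref_name, query_name, ani, af_ref, af_query in rows:
        if ani >= min_ani and (af_ref >= min_af or af_query >= min_af):
            weight = ani * max(af_ref, af_query) / 10000  # same formula as the awk script
            edges.append((ref_name, query_name, weight))
            connected.update((ref_name, query_name))
    # Hypothetical workaround: one self-loop per sequence with no passing edge.
    for seq_id in all_ids:
        if seq_id not in connected:
            edges.append((seq_id, seq_id, 1.0))
    return edges


# Made-up rows for illustration only.
rows = [
    ("p1", "p2", 98.0, 90.0, 88.0),  # passes both thresholds
    ("p1", "p3", 93.0, 95.0, 95.0),  # ANI below the threshold
]
edges = build_edges(rows, ["p1", "p2", "p3"])
for edge in edges:
    print("%s\t%s\t%.4f" % edge)
```

The list of all sequence IDs has to come from the FASTA itself, since sequences without any reported comparison never appear in the sparse output.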

skani as library restricted to 1 thread

We have a tool called Galah (https://github.com/wwood/galah) that uses ANI methods to cluster genomes. We implemented skani based on the example (https://github.com/bluenote-1577/skani-lib-example), here: https://github.com/wwood/galah/blob/e9e59a8c9b95f0fcbf034d8d8d505a4c631e69e2/src/skani.rs#L68. Galah also uses rayon, initialised here: https://github.com/wwood/galah/blob/e9e59a8c9b95f0fcbf034d8d8d505a4c631e69e2/src/main.rs#L32.

When we run Galah using finch/fastANI, it uses the provided threads but when we run it with skani, it is restricted to 1 thread. Any ideas?

Rescue_small question and potential bug

It looks like rescue_small rescues small query sketches but not small reference sketches. See this line:

if query_sketch.marker_seeds.len() < 20 && rescue_small{

This is a bug, right?

Also, do you have a rough estimate for the number of bp that corresponds to query_sketch.marker_seeds.len() < 20?
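
A back-of-the-envelope estimate is possible under one assumption that this thread does not confirm: that marker seeds are sampled at a rate of roughly one per m bases, where m is the value passed to -m. Under that assumption, the expected sequence length for a given number of markers is simply the product of the two. The values 1000 and 200 below are the -m values that appear elsewhere on this page; they are used here only as examples.

```python
def approx_bp_for_markers(n_markers, m):
    """Expected length (bp) yielding about n_markers seeds at a 1/m sampling rate."""
    return n_markers * m


# Assumption: marker seeds are sampled at about 1 per m bases (not confirmed here).
for m in (1000, 200):
    print("m =", m, "->", approx_bp_for_markers(20, m), "bp")
```

Because the sampling is random, the actual number of markers for a sequence of that length will vary around 20.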

Do not use full header in the output

Description:

skani currently outputs the full sequence header in its results. When the header is long and contains spaces, this output becomes difficult to read and parse. Displaying only the sequence identifier (the portion before the first white space) would align skani's output with the conventions of most other tools.

Proposed Solution:

Introduce an optional parameter to enable this behavior, allowing users to toggle between displaying the full header and only the sequence identifier.

ANI values not symmetric?

Hi. I was under the impression that the ANI values produced by skani should be symmetric. This is generally the case, but we have found a few counter examples. The most extreme of these is below:

> skani dist -q GCF_009730655.1_ASM973065v1_genomic.fna.gz -r GCF_018799045.1_ASM1879904v1_genomic.fna.gz
[00:00:00.000] (7f195de73800) INFO   skani dist -q GCF_009730655.1_ASM973065v1_genomic.fna.gz -r GCF_018799045.1_ASM1879904v1_genomic.fna.gz
[00:00:00.107] (7f195de73800) INFO   Generating sketch time: 0.10736146
[00:00:00.120] (7f195de73800) INFO   Learned ANI mode detected. ANI may be adjusted according to a regression model trained on MAGs.
Ref_file	Query_file	ANI	Align_fraction_ref	Align_fraction_query	Ref_name	Query_name
GCF_018799045.1_ASM1879904v1_genomic.fna.gz	GCF_009730655.1_ASM973065v1_genomic.fna.gz	93.86	79.68	79.68	NZ_CP076373.1 Shewanella indica strain Colony474 chromosome	NZ_CP046378.1 Shewanella algae strain RQs-106 chromosome, complete genome
[00:00:00.120] (7f195de73800) INFO   ANI calculation time: 0.012946309

> skani dist -r GCF_009730655.1_ASM973065v1_genomic.fna.gz -q GCF_018799045.1_ASM1879904v1_genomic.fna.gz
[00:00:00.000] (7fd2d979a800) INFO   skani dist -r GCF_009730655.1_ASM973065v1_genomic.fna.gz -q GCF_018799045.1_ASM1879904v1_genomic.fna.gz
[00:00:00.106] (7fd2d979a800) INFO   Generating sketch time: 0.106732115
[00:00:00.119] (7fd2d979a800) INFO   Learned ANI mode detected. ANI may be adjusted according to a regression model trained on MAGs.
Ref_file	Query_file	ANI	Align_fraction_ref	Align_fraction_query	Ref_name	Query_name
GCF_009730655.1_ASM973065v1_genomic.fna.gz	GCF_018799045.1_ASM1879904v1_genomic.fna.gz	94.93	79.70	79.70	NZ_CP046378.1 Shewanella algae strain RQs-106 chromosome, complete genome	NZ_CP076373.1 Shewanella indica strain Colony474 chromosome
[00:00:00.119] (7fd2d979a800) INFO   ANI calculation time: 0.012245825

Am I mistaken that the ANI values should be symmetric or is this a weird corner case / bug that breaks this assumption?

Thanks,
Donovan

How to make a Heatmap with skani matrix output

Hello,

I compared MAGs using the following commands:

# 1. Generating a similarity matrix:
 skani triangle REFERENCE_LIST/* -t 5 > skani_matrix.tsv


# 2. Visualizing the clustered heatmap:
python clustermap_triangle.py skani_matrix.tsv 3

However, I am unable to create a heatmap using the output (Aligned fraction matrix written to skani_matrix.af). Any suggestions?
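
For readers who want to load the matrix themselves before plotting, here is a minimal sketch. It assumes a PHYLIP-style lower-triangular layout without diagonal entries: the first line holds the number of genomes, and row i holds a name followed by i tab-separated values. That layout is an assumption and should be checked against the actual file; the sample text and the function name are made up for illustration.

```python
# Hypothetical sample in the assumed layout (tab-separated, no diagonal).
SAMPLE = "3\na.fa\nb.fa\t98.5\nc.fa\t91.2\t90.7\n"


def read_lower_triangle(text, diagonal=100.0):
    """Expand a lower-triangular matrix into names plus a full symmetric matrix."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split("\t")
        names.append(fields[0])
        for j, value in enumerate(fields[1:i + 1]):
            matrix[i][j] = matrix[j][i] = float(value)  # mirror across the diagonal
        matrix[i][i] = diagonal  # self-comparison, filled in by assumption
    return names, matrix


names, matrix = read_lower_triangle(SAMPLE)
for name, row in zip(names, matrix):
    print(name, row)
```

The resulting list of lists can be passed to a plotting library of choice, for example after conversion to a NumPy array.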

Identical sequences have pairwise AF lower than 100%

I'm testing skani to compute the pairwise ANI of all sequences within a FASTA file where each sequence is a genome. I noticed that when a genome is compared to itself, the resulting AF is lower than 100%. For example:

$ ./skani dist --qi --ri -r test.fna -q test.fna -o skani_output.tsv
Ref_file   Query_file   ANI      Align_fraction_ref   Align_fraction_query   Ref_name                  Query_name
--------   ----------   ------   ------------------   --------------------   -----------------------   -----------------------
test.fna   test.fna     100.00   86.91                86.91                  2001200001.a:2001201486   2001200001.a:2001201486

It could be because the sequence is short?

$ seqkit fx2tab -i -n -l test.fna
2001200001.a:2001201486	1230

Reducing -c improved the estimate:

$ ./skani dist -c 1 --qi --ri -r test.fna -q test.fna -o skani_output.tsv
Ref_file   Query_file   ANI      Align_fraction_ref   Align_fraction_query   Ref_name                  Query_name
--------   ----------   ------   ------------------   --------------------   -----------------------   -----------------------
test.fna   test.fna     100.00   98.13                98.13                  2001200001.a:2001201486   2001200001.a:2001201486

When I tried to change -m I got an error:

thread 'main' panicked at 'We currently don't allow c > 10', src/params.rs:153:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[1]    8146 abort      ./skani dist -m 10 --qi --ri -r test.fna -q test.fna -o skani_output.tsv

support for arm64 neon

Hello skani team,

Many thanks for this amazing tool further accelerating ANI calculation. I saw that compiling on macOS arm64 failed because of:

error[E0432]: unresolved import std::arch::x86_64

and all SIMD operations. I think ARM NEON is fully supported on nightly Rust, and adding support would be very useful for new Mac users. We are the FastANI team here at Georgia Tech, and we also develop fast genomic search tools: https://github.com/jianshu93/gsearch

Thanks,

Jianshu

ARM64 support for skani by default

It was brought to my attention in issue #2 that two rust library imports may break ARM64 platforms: jemalloc - https://github.com/gnzlbg/jemallocator and std::arch::x86_64 which is used by skani for SIMD k-mer sketching.

Currently, the no-simd branch seems to work for compiling on ARM64 platforms. This has no SIMD (AVX2) instructions, and sketching will be about 25-30% slower, but otherwise should build fine.

I'll keep the no-simd branch as updated as possible for now. In the future, we will need an automatic way of disabling the std::arch::x86_64 library/jemalloc for platform specific builds.

Questions on AAI versus skaai?

Hello Jim,

Average amino acid identity (AAI), similar to ANI but based on amino acid sequences, is another way to measure genomic distance when ANI is below 70~80%. I noticed that skani does support k-mer processing of faa input (amino acid format, predicted via gene prediction software like prodigal/FragGeneScan) and AAI output, but I do not see benchmarks against blastp-based (i.e. alignment-based) AAI in the paper. Does it correlate well with blastp-based AAI? Since AAI search/alignment is much more expensive than ANI, skani would then be another fast tool for approximating blastp-based AAI.

Thanks,

Jianshu
