dib-lab / kspider

A simple yet powerful sequence clustering tool.

Home Page: https://dib-lab.github.io/kSpider/

License: MIT License

CMake 2.83% C++ 74.31% Shell 0.71% Python 21.56% SWIG 0.58%

dna-sequences dna protein skipmers clustering

kspider's Introduction

@dib-lab/kSpider

➤ Introduction

kSpider is a user-friendly command-line program for sequence clustering. First, it indexes the source sequences using kProcessor. Second, it constructs a pairwise containment matrix in a single pass over the index. Finally, it builds a graph from the pairwise matrix and applies a connected-components algorithm to extract the clusters at a user-defined containment threshold.
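The single-pass idea can be illustrated with a minimal Python sketch. It is not kSpider's or kProcessor's implementation: the function names are hypothetical, the index is a plain dictionary from k-mer to sequence names, and containment is assumed here to be the shared k-mer count divided by the smaller k-mer set.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(sequences, k):
    """Map each k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pairwise_shared(index):
    """Count shared k-mers for every pair in one pass over the index."""
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return shared


def containment(shared_count, size_a, size_b):
    """Assumed definition: shared k-mers over the smaller k-mer set."""
    return shared_count / min(size_a, size_b)


sequences = {"s1": "ACGTACGT", "s2": "ACGTAC", "s3": "TTTTTT"}
sizes = {name: len(kmers(seq, 4)) for name, seq in sequences.items()}
shared = pairwise_shared(build_index(sequences, 4))
for (a, b), count in shared.items():
    print(a, b, count, containment(count, sizes[a], sizes[b]))
```

Pairs that share no k-mer never appear in the result, which is why iterating over the index can be cheaper than comparing every pair directly.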

Documentation is hosted at https://dib-lab.github.io/kSpider

➤ Quick Installation (pip)

pip install kSpider

➤ Manual build / Development

Install dependencies

sudo apt-get install g++ swig cmake python3-dev zlib1g-dev libghc-bzlib-dev python3-distutils libboost-all-dev

git clone https://github.com/dib-lab/kSpider.git
cd kSpider
git submodule update --init --recursive
cmake -Bbuild
cmake --build build
bash build_wrapper.sh

➤ Authors


Mohamed Abuelanin, Tamer Mansour

➤ License

Licensed under MIT License.

kspider's Issues

updated usage example

I tried to follow the usage example outlined at https://dib-lab.github.io/kSpider/, but the instructions no longer work. Specifically, the indexing step seems to have changed from kSpider index_kmers... to kSpider index, and many of the arguments in the example are no longer options of the new command. Would you be willing to provide updated instructions for how to cluster with kSpider? My use case is clustering isoforms in a de novo transcriptome when we have no knowledge of which genes each isoform/contig encodes. All of my transcripts are in a single FASTA file, and I would like to predict which ones encode the same isoforms by clustering.

CI Improvement

Instead of building the whole project for each python version, build once and produce multiple wheels.

Consider serializing the pairwise hashtable

Currently, the hashtable is exported as a CSV file instead of a binary object. Serializing it to a binary file would make clustering and the subsequent steps more efficient.
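As a rough illustration of the binary option, the sketch below packs each pair as a fixed-width record with Python's struct module. The record layout (two 32-bit sequence ids and a 64-bit shared k-mer count) and the function names are assumptions made for this example, not kSpider's format.

```python
import struct

# Assumed record layout: source id, target id, shared k-mer count.
RECORD = struct.Struct("<IIQ")  # little-endian, 16 bytes per record


def write_pairs(path, pairs):
    """Write (id_a, id_b, shared) tuples as fixed-width binary records."""
    with open(path, "wb") as fh:
        for id_a, id_b, shared in pairs:
            fh.write(RECORD.pack(id_a, id_b, shared))


def read_pairs(path):
    """Read the records back as a list of (id_a, id_b, shared) tuples."""
    with open(path, "rb") as fh:
        data = fh.read()
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]
```

Fixed-width records avoid parsing text on reload and allow seeking directly to the n-th pair.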

Test kmer sizes < 15

Add sourmash as a plugin

Since kSpider's command line interface is implemented in Python, we can add sourmash as a plugin.

Implement export modes

Newick Format
Distance Matrix TSV
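A minimal sketch of the distance matrix export, assuming distance is defined as 1 minus containment and that pairs missing from the table have containment 0. The function name and the input layout are hypothetical.

```python
def distance_matrix_tsv(names, containment):
    """Render a square distance matrix as tab-separated text.

    containment maps (name_a, name_b) to a value in [0, 1]; a pair may be
    stored in either order. Distance is assumed to be 1 - containment.
    """
    lines = ["\t".join([""] + list(names))]
    for a in names:
        row = [a]
        for b in names:
            if a == b:
                distance = 0.0
            else:
                c = containment.get((a, b), containment.get((b, a), 0.0))
                distance = 1 - c
            row.append(f"{distance:.3f}")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


print(distance_matrix_tsv(["a", "b"], {("a", "b"): 0.75}), end="")
```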

Is containment threshold enough?

Do we need to set a number_of_shared_kmers threshold alongside the containment % threshold?
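One way to see why a second threshold could matter: a very small sequence can be fully contained in a large one while sharing only a handful of k-mers. The sketch below is illustrative only; the function name, the parameter names, and the containment definition (shared k-mers over the smaller set) are assumptions.

```python
def keep_edge(shared, size_a, size_b, min_containment, min_shared):
    """Keep an edge only if both thresholds are met."""
    containment = shared / min(size_a, size_b)
    return containment >= min_containment and shared >= min_shared


# A 2-k-mer sequence fully contained in a 1000-k-mer sequence:
# containment is 1.0, but only 2 k-mers support the edge.
print(keep_edge(2, 2, 1000, min_containment=0.8, min_shared=10))
print(keep_edge(900, 1000, 1200, min_containment=0.8, min_shared=10))
```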

Add the brute-force mode directly into kSpider

In https://github.com/sourmash-bio/parallel-pairwise I implemented a multithreaded brute-force script to pairwise-compare sourmash sigs. After resolving #20 I will add the brute-force version to kSpider.
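For orientation, a brute-force comparison can be sketched as below. This is not the parallel-pairwise script: sketches are modeled as plain Python sets of hashes and the function name is hypothetical. In CPython, threads give limited speedup for CPU-bound set intersections, so a process pool or native code may be preferable in practice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def brute_force_pairwise(sketches, workers=4):
    """Compare every pair of sketches and report the shared hash count.

    sketches maps a name to a set of hashes. Pairs with no shared hash
    are dropped from the result.
    """
    def compare(pair):
        a, b = pair
        return a, b, len(sketches[a] & sketches[b])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compare, combinations(sorted(sketches), 2))
        return [r for r in results if r[2] > 0]


sketches = {"x": {1, 2, 3}, "y": {2, 3, 4}, "z": {9}}
print(brute_force_pairwise(sketches))
```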

Link kSpider updates with sourmash docs

sourmash-bio/sourmash#2271

Indexing does not serialize correctly when using relative path

For example, kSpider index ../index_dir will not serialize the index correctly.
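The issue does not state the cause. One defensive approach, sketched here with a hypothetical helper and a hypothetical "idx_" naming scheme, is to resolve the argument to an absolute path before deriving any output name from it, since components such as ".." or a trailing "." do not survive naive string handling.

```python
from pathlib import Path


def index_prefix(index_dir):
    """Return an absolute output prefix derived from the index directory.

    The "idx_" prefix is a made-up convention for this example.
    """
    resolved = Path(index_dir).resolve()
    return resolved.parent / f"idx_{resolved.name}"


print(index_prefix("../index_dir"))
```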

Include kmer abundance to allow calculating Bray-Curtis dissimilarity

https://www.wikiwand.com/en/Bray%E2%80%93Curtis_dissimilarity
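For reference, the Bray-Curtis dissimilarity between two samples is 1 - 2C / (S_a + S_b), where C is the sum of the smaller abundance of each shared k-mer and S_a, S_b are the total abundances of each sample. A minimal sketch over k-mer count dictionaries (the function name and input layout are assumptions):

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two k-mer abundance tables."""
    common = counts_a.keys() & counts_b.keys()
    shared = sum(min(counts_a[k], counts_b[k]) for k in common)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1 - 2 * shared / total


a = {"AAA": 3, "AAC": 1}
b = {"AAA": 1, "ACG": 3}
print(bray_curtis(a, b))  # 0.75
```

Presence/absence data alone cannot produce this value, which is why k-mer abundances would need to be stored.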

Allow concurrent construction of pairwise hashtable.

Zoomable circle packing

An idea for clustering visualization

https://www.kaggle.com/code/arthurtok/zoomable-circle-packing-via-d3-js-in-ipython/notebook

Get list of nodes with most undirected edges

@ctb asked:

what’s the simplest way of finding the sketches in GTDB that have the most large overlaps with other sketches? (like, biggest cluster at highest threshold?)
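One simple reading of the question is to rank nodes by degree after filtering edges at a high threshold. The sketch below assumes the pairwise results are available as a dictionary from name pairs to containment; the function name is hypothetical.

```python
from collections import Counter


def top_degree_nodes(edges, threshold, n=5):
    """Return the n nodes with the most edges at or above the threshold."""
    degree = Counter()
    for (a, b), containment in edges.items():
        if containment >= threshold:
            degree[a] += 1
            degree[b] += 1
    return degree.most_common(n)


edges = {
    ("g1", "g2"): 0.95,
    ("g1", "g3"): 0.92,
    ("g2", "g3"): 0.50,
    ("g1", "g4"): 0.99,
}
print(top_degree_nodes(edges, threshold=0.9, n=1))  # [('g1', 3)]
```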

Community detection

https://github.com/vtraag/louvain-igraph

https://github.com/vtraag/leidenalg

https://github.com/kharchenkolab/leidenAlg

https://github.com/barahona-research-group/PyGenStability

Convert the clustering Python code to C++

The clustering script is currently written in Python, which is inefficient for large pairwise matrices. The adopted clustering technique relies on graph connected components.

I can consider working with retworkx, and I hope it's thread safe. I can also consider other C++ graph libs.

Introduce strongly connected components in the clustering

image from Wikipedia
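Containment is asymmetric, so one possible reading of this idea is a directed graph in which an edge a -> b exists when a is sufficiently contained in b. Under that assumption (not confirmed by the issue), a strongly connected component groups nodes that can all reach each other. The sketch below uses Kosaraju's algorithm on a plain adjacency dictionary.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on a dict of node -> list of successors."""
    # First pass: record nodes in order of DFS completion.
    order, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {}
    for node, successors in graph.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(node)

    # Second pass: explore the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return sorted(components)


graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(strongly_connected_components(graph))  # [['a', 'b'], ['c']]
```

Here c is reachable from b but cannot reach it back, so it forms its own component, whereas an undirected connected-components pass would merge all three nodes.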

Add mode to assign new kmers to the nearest cluster
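A minimal sketch of what such a mode might compute, assuming each cluster is summarized by the union of its members' k-mers and "nearest" means the highest containment of the query in that union. Both choices, and the function name, are assumptions for illustration.

```python
def nearest_cluster(query_kmers, cluster_kmers, min_containment=0.0):
    """Return (cluster_name, containment) for the best-matching cluster.

    cluster_kmers maps a cluster name to the set of k-mers it holds.
    Returns (None, 0.0) if no cluster shares k-mers with the query or
    the best match is below min_containment.
    """
    best_name, best_score = None, 0.0
    if not query_kmers:
        return best_name, best_score
    for name, kmers in cluster_kmers.items():
        score = len(query_kmers & kmers) / len(query_kmers)
        if score >= min_containment and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


clusters = {"c1": {"AAA", "AAC", "ACG"}, "c2": {"TTT", "TTG"}}
print(nearest_cluster({"AAA", "AAC", "GGG", "TTT"}, clusters))  # ('c1', 0.5)
```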

Split parsing and indexing/pairwise

This will unify the input data type for kSpider.

Sourmash sigs, FASTA, and FASTQ files will be converted into binary files. The binary files will then be used either for the brute-force comparison or for indexing followed by kSpider's original pairwise comparison.

pairwise matrix validation in CI

Explore other clustering methods

kSpider supports a single method of clustering, which is graph clustering. Each sample/genome represents a node, and each edge represents the containment percentage between the two connected samples. An edge should only be created if its containment is above the user-defined threshold. Connected components of the undirected graph are considered to be clusters.
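The graph clustering described above can be sketched with a union-find structure. This is an illustration rather than kSpider's code: the function name and input layout are hypothetical, and edges exactly at the threshold are kept here, which is a choice made for the example.

```python
def cluster(nodes, edges, threshold):
    """Group nodes into connected components over edges >= threshold.

    edges maps (node_a, node_b) to a containment value.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), containment in edges.items():
        if containment >= threshold:
            parent[find(a)] = find(b)

    components = {}
    for n in nodes:
        components.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in components.values())


nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.8}
print(cluster(nodes, edges, threshold=0.5))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Samples with no edge above the threshold, such as "e" here, end up as singleton clusters.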

I will list here other clustering methods that might work.

Update docs

kSpider now supports new modes and features that need to be documented.

Add support for clustering Sourmash Signatures

Thoughts,

I don't think we need to add the Sourmash API as a dependency; we just need to implement the signatures/zipped_signatures reader and use the hash values as kmers. If the signatures contain counts, we can use them for count-based trimming (i.e. removing singletons). I will assume the signatures are already scaled down, but will also add scaling as an option.
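A dependency-free reader could look roughly like the sketch below. It assumes the uncompressed JSON signature layout, in which a file holds a list of records, each with a "signatures" list whose entries carry "ksize", "mins", and optionally "abundances". These field names should be verified against the sourmash documentation; the function name is hypothetical and zipped collections are not handled.

```python
import json


def read_signature_hashes(path, ksize, min_abundance=1):
    """Collect hash values for one k-mer size from a JSON signature file.

    Hashes with abundance below min_abundance are dropped when the sketch
    stores abundances; sketches without abundances are kept whole.
    """
    with open(path) as fh:
        records = json.load(fh)

    hashes = set()
    for record in records:
        for sketch in record.get("signatures", []):
            if sketch.get("ksize") != ksize:
                continue
            mins = sketch.get("mins", [])
            abundances = sketch.get("abundances")
            if abundances is None:
                hashes.update(mins)
            else:
                hashes.update(h for h, count in zip(mins, abundances)
                              if count >= min_abundance)
    return hashes
```

Setting min_abundance=2 would remove singletons, matching the count-based trimming mentioned above.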

CC @ctb

updated installation instructions

Hello! I recently tried to install kspider and wasn't able to using the installation instructions. On my Mac (M1 running Rosetta), I created a new conda env and then tried to pip install kspider. I got errors from missing dependencies and ended up with issues related to clang. Would you be willing to provide updated installation instructions or a conda environment file with all of the dependencies needed to get kspider to install from pip?