Coder Social home page Coder Social logo

medvedevgroup / eskemap Goto Github PK

View Code? Open in Web Editor NEW
2.0 3.0 1.0 11.54 MB

Sources for sketch-based all-hits read mapping paper

License: GNU General Public License v3.0

C++ 12.18% C 45.49% Makefile 0.59% Jupyter Notebook 3.78% Python 9.69% Roff 1.95% JavaScript 12.10% Cython 1.21% TeX 12.61% Perl 0.06% Shell 0.13% Gnuplot 0.21%

eskemap's Introduction

ESKEMAP - Exact SKEtch-based read MAPping

This directory contains the source code of ESKEMAP and the documentation of experiments we performed in our paper.

Publication

Schulz, T., Medvedev, P.: ESKEMAP: exact sketch-based read mapping. Algorithms for Molecular Biology. (2024)

Description

ESKEMAP is an algorithm for sketch-based, all-hits read mapping of genomic sequences. Using the sketches of a read and a reference genome, it finds all positions inside the reference genome's sketch that score high enough with respect to a similarity measure and a threshold depending on the read's length.

Installation

ESKEMAP can be used with the sketching approach implemented in minimap2. A modified version of its source code is located inside the subdirectory minimap2. It can be compiled using cmake.

cd <corer_directory>/minimap2
make

Afterwards, ESKEMAP can be compiled from the directory src.

cd <corer_directory>/src
make

Usage

./eskemap

displays the command line interface:

eskemap [-hn] [-p PATTERN_FILE] [-s TEXT_FILE] [-k KMER_LEN] [-r HASH_RATIO] [-b BLACKLIST] [-c COM_HASH_WGHT] [-u UNI_HASH_WGHT] [-t HOM_THRES] [-d DECENT] [-i INTERCEPT] [-N NESTING]

Find sketch-based pattern similarity in text.

Required parameters:
   -p   --pattern  Pattern sequences file (FASTA format)
   -s   --text     Text sequence file (FASTA format)

Optional parameters with required argument:
   -k   --ksize             K-mer length to be used for sketches (default 9)
   -w   --windowsize        Window size for minimizer sketching approach (default 10)
   -r   --hashratio         FracMin hash ratio to be used for sketches (default 0.1)
   -b   --blacklist         File containing hashes to ignore for sketch calculation
   -c   --commonhashweight  Weight to reward common hashes (default 1)
   -u   --uniquehashweight  Weight to punish unique hashes (default 1)
   -t   --hom_thres         Homology threshold (default 0)
   -d   --decent            Decent required for dynamic threshold selection
   -i   --intercept         Intercept required for dynamic threshold selection
Optional parameters without argument:
   -n   --normalize  Normalize scores by length
   -N   --nesting    Output nested homologies
   -h   --help       Display this help message

A common program call of ESKEMAP would look like:

eskemap -p reads.fa -s reference.fa -k 15 -w 10 -b highFreqKmers.txt -d 0.8 -i 12 -N

Here, we let ESKEMAP find mappings for all read sequences stored inside reads.fa (in FASTA format) to the sequence stored inside reference.fa (in FASTA format). Minimizer sketches (see Roberts et al. and Schleimer et al. for details) are used with k=15 and window size w=10. All k-mers listed in highFreqKmers.txt are excluded from all sketches. Score thresholds are calculated with respect to a read's length using a line function with decent 0.8 and intercept 12. All mappings including nested ones (-N) are reported.

Experiments

This section documents how the experiments described in our paper can be reproduced.

Requirements

Our experiments are partly documented as a snakemake workflow that allows to rerun those parts using exact program calls. The remaining steps of our experiment including postprocessing for result analysis and plot generation can be deduced/rerun from a Jupyter Notebook. For a reproduction of the whole experiment, both tools need to be installed on your local system. Additionally, the Python package Biopython and a running version of minimap2 and Winnowmap2 are also required.

Accuracy evaluation also requires a running version of BLAST with binaries available from your environment.

We also used the API of Edlib and parasail to implement a small script to calculate read mapping positions on the basis of alignments. After Edlib and parasail are installed on your system, the script can be installed by changing into the subdirectory FindSimSeqs and executing make.

cd FindSimSeqs
make

Data

Human Chromosome Y

The T2T reference assembly of human chromosome Y (Accession number NC_060948.1) was downloaded from NCBI.

It is expected to lie inside simulations/genomes. The subdirectory structure has to be initially created.

Simulated Reads

In order to create the exact same set of reads we used for our experiments, run:

mkdir -p simulations/reads
python3 scripts/simReads.py -dp 10 -lmn 100 -lmx 1000000 -lavg 9000 -ls 7000 -r simulations/genomes/t2thumanChrY.fasta -sr 0.00010909090909090909 -dr 0.0009818181818181818 -ir 0.0009090909090909091 -sd 7361077429744071834 -o simulations/reads/t2thumanChrY_sr0.00010909090909090909_dr0.0009818181818181818_i0.0009090909090909091_sd7361077429744071834_lmn100_lmx1000000_lavg9000_ls7000_dp10.fasta

This will create a subdirectory structure simulations/reads containing the unfiltered, whole set of reads stored as a FASTA file.

For filtering the read sets according to edlib's mapping results, these results first have to be generated. This can be done using the snakemake workflow:

snakemake simulations/edlibMappings/t2thumanChrY_sr0.00010909090909090909_dr0.0009818181818181818_i0.0009090909090909091_sd7361077429744071834_lmn100_lmx1000000_lavg9000_ls7000_dp10_ri0-69400.er

How to create a reads file only containing those reads for which edlib could find only up to 20 different, non-overlapping mapping positions is documented inside the Jupyter Notebook Experiments.ipynb (Section Read Filtering). Rerunning the respective notebook part will create a file ending with ..._dp10_rm20.fasta inside the directory simulations/reads.

Reproduction Workflow

Exact program calls of each program we used for mapping the reads are documented inside a snakemake workflow. The workflow may even be run to reproduce the whole experiment if all dependencies are satisfied (see Requirements) and the necessary input data is provided (see Data).

Before running the workflow a few configuration steps need to be done to ensure that program binaries and the input reference sequence are found. Dummy paths of program binaries for minimap2, Winnowmap2 and meryl the k-mer counter shipped with Winnowmap have to be replaced inside the file config.yaml:

# PLEASE ADJUST THE FOLLOWING PARAMETERS -------------------------------------------------------------------------------------------

#Programs/binaries that shall be used
minimap2Bin: "path/to/minimap2/binary"
merylBin: "path/to/meryl/binary"
WinnowmapBin: "path/to/Winnowmap2/binary"
#Basename of reference file used for mapping (without path and file suffix)
ref: "t2thumanChrY"

#-----------------------------------------------------------------------------------------------------------------------------------

[...]

The name of the reference sequence (human chromosome Y) is expected to be t2thumanChrY. If named differently, this also has to be adapted inside the configuration file.

Afterwards, the workflow can be run by just typing:

snakemake

Executing the workflow a second time will run BLAST on the result files produced by all mapping tools. Its output is required for the mapping accuracy evaluation (see below).

Notebook Analysis

After running the snakemake workflow, all remaining steps of the experiment can be performed by executing the Jupyter Notebook Experiments.ipynb (Section Result Evaluation).

Support

For any questions, feedback or problem, please feel free to file an issue or contact the developers via email and we will get back to you as soon as possible.

Licenses

eskemap's People

Contributors

tischulz1 avatar pesho-ivanov avatar

Stargazers

 avatar  avatar

Watchers

James Cloos avatar Paul Medvedev avatar  avatar

Forkers

schaudge

eskemap's Issues

Error: "Text index consists of several parts! We cannot handle this yet"

Hi Author,

I executed the following command:

./eskemap -p ./data/reads.fasta -s ./data/ref.fasta -k 9 -w 10 -b ./highAbundKmersMiniK15w10Lrgr100BtStrnds.txt -r 0.1 -c 1 -u 1 -t 0.9
However, I encountered the following error message:

ERROR: Text index consists of several parts! We cannot handle this yet

I am unsure about the cause of this error and would greatly appreciate your guidance on how to resolve it. If you need any additional information or details about my setup, please let me know.

Thank you in advance for your time and assistance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.