minda's Introduction

Minda

Note: This tool is under active development.

Minda is a tool for evaluating structural variant (SV) callers that

  • standardizes VCF records for compatibility with both germline and somatic SV callers,
  • benchmarks against a single VCF input file, or
  • benchmarks against an ensemble call set created from multiple VCF input files.

Installation

Clone the repository and install the dependencies via conda:

git clone https://github.com/KolmogorovLab/minda
cd minda
conda env create --name minda --file environment.yml
conda activate minda
./minda.py

Quick Usage

Benchmarking several VCFs against a truth set VCF:

./minda.py truthset --base truthset.vcf --vcfs caller_1.vcf caller_2.vcf caller_3.vcf --out_dir minda_out

Creating an ensemble from several VCFs and benchmarking against the ensemble calls:

./minda.py ensemble --vcfs caller_1.vcf caller_2.vcf caller_3.vcf --out_dir minda_out

Inputs and Parameters

Required

Truthset

--out_dir        path to out directory
--base           path of base VCF
--tsv | --vcfs   tsv file path
                    -OR-
                 vcf file path(s)

Ensemble

--out_dir        path to out directory
--tsv | --vcfs   tsv file path
                    -OR-
                 vcf file path(s)
--min_support |  minimum number of callers required to support an ensemble call
--conditions        -OR-
                 specific conditions to support a call

Optional

--bed            path to bed file for filtering records with BedTool intersect
--filter         filter records by FILTER column; default="['PASS']"
--min_size       filter records by SVLEN in INFO column
--tolerance      maximum allowable bp distance between base and caller breakpoint; default=500
--sample_name    name of sample
--vaf            filter out records below a given VAF threshold
--multimatch     allow more than one record from the same caller VCF to match a single truthset/ensemble record
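
As a rough illustration of what --tolerance controls, the sketch below compares two SVs that are each represented by a start and an end breakpoint. The function names are hypothetical, and the choices that both breakpoints must fall within the tolerance and that the bound is inclusive are assumptions, not a description of Minda's internals.

```python
def breakpoints_match(base, caller, tolerance=500):
    """Compare two (chrom, pos) breakpoints; assumes an inclusive bound."""
    return base[0] == caller[0] and abs(base[1] - caller[1]) <= tolerance


def records_match(base_record, caller_record, tolerance=500):
    """Each record is a (start, end) pair of (chrom, pos) breakpoints.

    Assumption: a match requires both the start and the end to agree.
    """
    return all(
        breakpoints_match(b, c, tolerance)
        for b, c in zip(base_record, caller_record)
    )


base = (("chr1", 1000), ("chr1", 5000))
call = (("chr1", 1200), ("chr1", 4700))
print(records_match(base, call))                 # True
print(records_match(base, call, tolerance=100))  # False
```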

VCF Input

Minda standardizes input VCFs by decomposing every SV into start and end records. Records are handled in one of the following two ways:

  1. For records having a CHROM:POS pattern in the ALT field, the #CHROM and POS fields are considered the start. Minda then searches for the end record matching the ALT field among other records. Alternatively, the MATEID from the INFO field may be used to find the end record. If no end record is found, the details from the ALT field are used to create one.
  2. All other records are treated as start records. The corresponding end record uses the start #CHROM, and its POS is either calculated by adding the absolute value of SVLEN to the start POS or taken from the END integer in the INFO field.

Minda has been tested on VCFs produced by
  • Severus
  • SAVANA
  • nanomonsv
  • Sniffles2
  • cuteSV
  • SVIM
  • GRIPSS
  • manta
  • SvABA.

If you encounter issues with these or other VCF files, please let us know.
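
The two decomposition rules above can be sketched in a few lines of Python. This is a minimal illustration rather than Minda's code: `decompose`, `parse_info` and `BND_ALT` are hypothetical names, rule 1 is reduced to its fallback (building the end from the ALT field, without searching for a mate record or using MATEID), and trying SVLEN before END is an assumption.

```python
import re

# Mate locus inside a breakend ALT such as "N[chr2:321682[" or "]chr2:321681]N".
BND_ALT = re.compile(r"[\[\]](.+):(\d+)[\[\]]")


def parse_info(info):
    """Turn 'SVTYPE=DEL;END=200;PRECISE' into a dict; flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def decompose(chrom, pos, alt, info):
    """Return ((start_chrom, start_pos), (end_chrom, end_pos)) for one record."""
    start = (chrom, pos)
    mate = BND_ALT.search(alt)
    if mate:
        # Rule 1 (fallback only): the ALT field names the end locus.
        return start, (mate.group(1), int(mate.group(2)))
    # Rule 2: same chromosome; SVLEN is tried first (assumption), then END.
    fields = parse_info(info)
    if "SVLEN" in fields:
        return start, (chrom, pos + abs(int(fields["SVLEN"])))
    if "END" in fields:
        return start, (chrom, int(fields["END"]))
    raise ValueError("record has no breakend ALT, SVLEN, or END")


print(decompose("chr1", 100, "N[chr2:321682[", "SVTYPE=BND"))
print(decompose("chr1", 100, "<DEL>", "SVTYPE=DEL;SVLEN=-50"))
```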

TSV Input

The --tsv file has one required column and may contain up to three columns. The columns should be as follows:

  1. VCF paths (required)
  2. caller name
  3. prefix

If a caller name is not provided, the name listed in the source field of the VCF will be used. If more than one VCF with the same caller name is provided, prefixes disambiguate IDs and column names in Minda output files. If the user does not provide prefixes, Minda automatically assigns letter prefixes in ascending alphabetical order (i.e. A, B, C, etc.).

An example of TSV contents:

/path/to/severus_ONT.vcf     Severus     ONT
/path/to/severus_PB.vcf      Severus     PB
/path/to/manta.vcf           manta       ILL
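
A file in this layout could be read as sketched below. The helper names are hypothetical, the prefix_caller label format is inferred from names such as ONT_Severus used elsewhere in this README, and assigning a letter to every row that lacks a prefix is an assumption about the exact rule. The sketch does not read the VCF source field, so a missing caller name becomes a placeholder.

```python
import csv
import io
from string import ascii_uppercase


def read_tsv(handle):
    """Yield (vcf_path, caller_name, prefix); omitted columns become None."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        cells = [cell.strip() or None for cell in row[:3]]
        cells += [None] * (3 - len(cells))
        yield tuple(cells)


def assign_labels(rows):
    """Fill missing prefixes with A, B, C, ... and build prefix_caller labels."""
    letters = iter(ascii_uppercase)
    labelled = []
    for path, caller, prefix in rows:
        if caller is None:
            caller = "unknown"  # placeholder; Minda uses the VCF source field
        if prefix is None:
            prefix = next(letters)  # assumption: one letter per unprefixed row
        labelled.append((path, f"{prefix}_{caller}"))
    return labelled


text = "/path/to/severus_ONT.vcf\tSeverus\tONT\n/path/to/severus_PB.vcf\tSeverus\n"
for path, label in assign_labels(read_tsv(io.StringIO(text))):
    print(label, path)
```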

Specific Conditions

The --conditions parameter lets users define specific conditions that each ensemble call must meet. Input a list in double quotation marks that contains:

  1. a (nested) list of caller names, each name in single quotation marks with prefixes, if necessary
  2. an operator in single quotation marks
  3. a number
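
To make the format concrete, the sketch below evaluates such a list against the set of callers that support one candidate call. It also handles several conditions joined by '&' or '|', evaluated left to right. This is not Minda's implementation: the function names, the set of accepted operators and the left-to-right evaluation are assumptions, and only conditions written with explicit caller names are handled, since `ast.literal_eval` accepts literals only.

```python
import ast
import operator

# Assumed operator set; this README only shows '>=' and '=='.
OPERATORS = {">=": operator.ge, ">": operator.gt, "==": operator.eq,
             "<=": operator.le, "<": operator.lt}


def is_single(condition):
    """True for [names, operator, number]; False for joined conditions."""
    return (len(condition) == 3
            and isinstance(condition[1], str)
            and condition[1] in OPERATORS)


def evaluate(condition, supporting):
    """Check a parsed condition against the set of supporting caller names."""
    if is_single(condition):
        names, op, number = condition
        count = sum(name in supporting for name in names)
        return OPERATORS[op](count, number)
    # Joined form: [cond, '&', cond, '|', cond, ...], left to right.
    result = evaluate(condition[0], supporting)
    for joiner, sub in zip(condition[1::2], condition[2::2]):
        value = evaluate(sub, supporting)
        result = (result and value) if joiner == "&" else (result or value)
    return result


condition = ast.literal_eval(
    "[[['ONT_Severus', 'PB_Severus'], '>=', 1], '&', [['ILL_manta'], '==', 1]]"
)
print(evaluate(condition, {"PB_Severus", "ILL_manta"}))   # True
print(evaluate(condition, {"ONT_Severus", "PB_Severus"}))  # False
```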

For example, using the TSV contents above, to require that both the ONT and PB Severus calls support an ensemble call, specify the following when using --tsv input:

"[['ONT_Severus', 'PB_Severus'], '>=', 2]"

OR when using --vcfs or --tsv input:

"[caller_names[:2], '>=', 2]"

To combine multiple conditions, add '&' or '|' between each condition. For example, to require at least one long-read call and one short-read call to agree, specify for --tsv input:

"[[['ONT_Severus', 'PB_Severus'], '>=', 1], '&', [['ILL_manta'], '==', 1]]"

OR for --vcfs or --tsv input:

"[[caller_names[:2], '>=', 1], '&', [caller_names[2:], '==', 1]]"

VAF Filtering

Note: This requires preprocessing of the VCF files. See scripts.

To run Minda with the --vaf parameter, ensure the VCF files have a VAF value in the INFO field.
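
As an illustration of the kind of check this implies, the sketch below reads a VAF value from an INFO string and compares it with a threshold. The helper names are hypothetical, and treating a record without a VAF value as failing the filter is an assumption rather than documented behavior.

```python
def info_value(info, key):
    """Return the value of key in a ';'-separated INFO string, or None."""
    for item in info.split(";"):
        name, sep, value = item.partition("=")
        if sep and name == key:
            return value
    return None


def passes_vaf(info, threshold):
    """Keep a record only if it carries a VAF at or above the threshold."""
    vaf = info_value(info, "VAF")
    # Assumption: records without a VAF value are filtered out.
    return vaf is not None and float(vaf) >= threshold


print(passes_vaf("SVTYPE=DEL;SVLEN=-120;VAF=0.25", 0.1))  # True
print(passes_vaf("SVTYPE=DEL;SVLEN=-120", 0.1))           # False
```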

Output Files

Both truthset and ensemble output:

  • tp.tsv for each caller
  • fp.tsv for each caller
  • fn.tsv for each caller
  • support.tsv - lists which callers called which truthset/ensemble records
  • results.txt - for each caller, lists the overall precision, recall, F1 scores, as well as the number of TP, FN, FP calls overall and by SVTYPE and SVLEN
  • removed_records.txt - list of caller IDs of records not evaluated after removing singletons and filtering by FILTER, SVLEN, VAF
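
For reference, precision, recall and F1 follow from the TP, FP and FN counts as sketched below. The function names are hypothetical, and returning 0.0 when a denominator is zero (for example, when no records survive filtering) is a convention chosen for this sketch, not necessarily what Minda reports.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and negative counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    f1 = safe_ratio(2 * precision * recall, precision + recall)
    return precision, recall, f1


print(summarize(1, 1, 1))  # (0.5, 0.5, 0.5)
print(summarize(0, 0, 0))  # (0.0, 0.0, 0.0)
```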

ensemble also outputs:

  • ensemble.vcf

License

Minda is distributed under a BSD license. See the LICENSE for details.

Citation

Ayse Keskus, Asher Bryant, Tanveer Ahmad, Byunggil Yoo, Sergey Aganezov, Anton Goretsky, Ataberk Donmez, Lisa A. Lansdon, Isabel Rodriguez, Jimin Park, Yuelin Liu, Xiwen Cui, Joshua Gardner, Brandy McNulty, Samuel Sacco, Jyoti Shetty, Yongmei Zhao, Bao Tran, Giuseppe Narzisi, Adrienne Helland, Daniel E. Cook, Andrew Carroll, Pi-Chuan Chang, Alexey Kolesnikov, Erin K. Molloy, Irina Pushel, Erin Guest, Tomi Pastinen, Kishwar Shafin, Karen H. Miga, Salem Malikic, Chi-Ping Day, Nicolas Robine, Cenk Sahinalp, Michael Dean, Midhat S. Farooqi, Benedict Paten, Mikhail Kolmogorov. "Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads." medRxiv 2024, https://doi.org/10.1101/2024.03.22.24304756.

Credits

Minda is being developed in the Kolmogorov Lab at the National Cancer Institute.

Key contributors:

  • Asher Bryant
  • Ayse Keskus
  • Mikhail Kolmogorov

Contact

If you experience any problems or would like to make a suggestion, please submit an issue. To contact the developer directly, email [email protected].

minda's Issues

Feature request: split minda_results.txt

Hi, thanks for developing this very useful tool!

I have a minor request: would it be possible to split the minda_results.txt into tabular output that's easily readable for downstream analysis? It's a little tricky to parse the file as-is -- it contains multiple tables and the spacing is variable between columns. I think you could split into three tables: overall, sv type results, and sv length results, with an extra column in the latter two for TP/FN/FP.

Error with --vaf Parameter

Hello, I encountered an error related to the --vaf parameter while using Minda.
I would like to inquire about how to configure it.
I've tried values such as 0.1, 0.5, and 0, but they all result in the following error.

[2024-03-26 02:17:27] INFO: DECOMPOSING LP_PAO RECORDS... 
[2024-03-26 02:17:27] INFO: Original number of records: 130 
[2024-03-26 02:17:27] INFO: Number of after filtering by FILTER column: 130 
[2024-03-26 02:17:28] INFO: Number of unique indices: 39 
[2024-03-26 02:17:28] INFO: 38 paired records and 92 unpaired records found... 
[2024-03-26 02:17:28] INFO: Number of paired records paired by ALT column: 19 19 
[2024-03-26 02:17:28] INFO: Number of unpaired records paired by MATE_ID: 0 0 
[2024-03-26 02:17:28] INFO: Number of unpaired records paired by INFO column: 92 92 
[2024-03-26 02:17:28] INFO: Number of singleton records dropped: 0 
[2024-03-26 02:17:29] INFO: Number of decomposed records after pairing: 111 111 
[2024-03-26 02:17:29] INFO: Number of records after VAF filtering: 0 0 
[2024-03-26 02:17:29] INFO: Total number of decomposed records: 0 0 
[2024-03-26 02:17:29] INFO: DECOMPOSING truthset RECORDS... 
[2024-03-26 02:17:29] INFO: Original number of records: 49 
[2024-03-26 02:17:29] INFO: Number of after filtering by FILTER column: 49 
[2024-03-26 02:17:29] INFO: Number of unique indices: 1 
[2024-03-26 02:17:29] INFO: No paired records found... 
[2024-03-26 02:17:29] INFO: Number of paired records paired by ALT column: 0 0 
[2024-03-26 02:17:29] INFO: Number of unpaired records paired by MATE_ID: 0 0 
[2024-03-26 02:17:29] INFO: Number of unpaired records paired by INFO column: 45 45 
[2024-03-26 02:17:29] INFO: Number of singleton records dropped: 4 
[2024-03-26 02:17:29] INFO: Number of decomposed records after pairing: 45 45 
[2024-03-26 02:17:29] INFO: Number of records after VAF filtering: 0 0 
[2024-03-26 02:17:29] INFO: Total number of decomposed records: 0 0 
Traceback (most recent call last):
 File "./minda/minda.py", line 26, in <module>
   main()
 File "./minda/minda.py", line 22, in main
   sys.exit(main())
 File "./minda/minda/main.py", line 211, in main
   args.func(args)
 File "./minda/minda/main.py", line 141, in run
   results = get_results(decomposed_dfs_list, support_df, caller_names, args.out_dir, args.sample_name, max_len, args.tolerance, args.vaf,args.command, args)
 File "./minda/minda/stats.py", line 91, in get_results
   stats_dfs = _get_stats_df(tp_df, fn_df, fp_df, paired_df, base_df, caller_name, max_len, out_dir, sample_name)
 File "./minda/minda/stats.py", line 53, in _get_stats_df
   precision = tp/(tp+fp)
ZeroDivisionError: division by zero

Program operates normally when the --vaf parameter is unspecified.
Benchmark used is from COLO829 new benchmark and the VCF for comparison is generated by Severus.

Best,
Sin-Dian
