millanek / dsuite Goto Github PK

Fast calculation of Patterson's D (ABBA-BABA) and the f4-ratio statistics across many populations/species

C++ 63.33% C 0.19% Makefile 0.17% Python 36.32%

dsuite's Introduction

Dsuite

Publication:
Malinsky, M., Matschiner, M. and Svardal, H. (2021) Dsuite ‐ fast D‐statistics and related admixture evidence from VCF files. Molecular Ecology Resources 21, 584–595. doi: https://doi.org/10.1111/1755-0998.13265
Free to view author link
Malawi cichlid data used in the manuscript: VCF file; SETS.txt file
Simulated 20-species data used in the manuscript: VCF file; SETS.txt file; TREE_FILE.nwk (input tree)

There is also a very detailed tutorial that I prepared with input from @mmatschiner.

A pre-print describing the --ABBAclustering option for analyses of gene-flow among divergent species:
Koppetsch, T., Malinsky, M. and Matschiner, M. (2024) Towards reliable detection of introgression in the presence of among-species rate variation. bioRxiv 2023.05.21.541635; doi: https://doi.org/10.1101/2023.05.21.541635

Quickstart:

Commands:
           Dtrios                  Calculate D (ABBA-BABA) and f4-ratio statistics for all possible trios of populations/species
           DtriosCombine           Combine results from Dtrios runs across genomic regions (e.g. per-chromosome)
           Dinvestigate            Follow up analyses for trios with significantly elevated D:
                                   calculates f_d, f_dM, and d_f in windows along the genome
           Fbranch                 Calculate D and f statistics for branches on a tree that relates the populations/species

Experimental:
           Dquartets               Calculate D (ABBA-BABA) and f4-ratio statistics for all possible quartets of populations/species
                                   (no outgroup specified)

Usage:
a) Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
b) Dsuite Dquartets [OPTIONS] INPUT_FILE.vcf SETS.txt
c) Dsuite Dinvestigate [OPTIONS] INPUT_FILE.vcf.gz SETS.txt test_trios.txt
d) Dsuite Fbranch [OPTIONS] TREE_FILE.nwk FVALS_tree.txt

Input files:

Required files:

A VCF file, which can be compressed with gzip or bgzip. It can contain multiallelic loci and indels, but only biallelic SNPs will be used.
Population/species map SETS.txt: a text file with one individual per row and a tab separating the individual’s name from the name of the species/population it belongs to, as shown below:

Ind1    Species1
Ind2    Species1
Ind3    Species2
Ind4    Species2
Ind5    Species3
Ind6    Outgroup
Ind7    Outgroup
Ind8    xxx
...     ...
IndN    Species_n

If you want some individuals to be ignored, use the xxx keyword. Therefore, you don't have to subset your VCF file if you want to use only a subset of the samples in it.

For Dtrios, at least one individual needs to be specified to be the outgroup by using the Outgroup keyword as shown above.

All species/populations are treated equally in Dquartets - there should not be any outgroup.

Optional files:

A tree in Newick format. The tree should have leaf labels corresponding to the species/population names. Branch lengths can be present but are not used. To use Fbranch, the tree must be rooted using the Outgroup. Valid examples:
(Species2,(Species1,(Species3,Species4)));
(Species2:6.0,(Species1:5.0,(Species3:3.0,Species4:4.0)));
The test_trios.txt file for Dinvestigate. One trio of populations/species per line, separated by a tab in the order P1 P2 P3:

Species1    Species2    Species3
Species1    Species4    Species2
...         ...         ...

Piped VCF input:

It is possible to 'pipe' the genotype data into Dsuite Dtrios or Dsuite Dquartets from another program, such as bcftools, allowing custom filtering of the VCF file. Just use the stdin keyword in place of the VCF file name. It is also necessary to provide the number of lines in the filtered VCF via the -l option to the Dsuite programs. For example, to filter a VCF for overall mimimum depth of at least 1000 across all samples, you would use the following commands:

NUMLINES=$(bcftools view -i 'INFO/DP>1000' INPUT_FILE.vcf.gz | wc -l)  # to get NUMLINES
bcftools view -i 'INFO/DP>1000' INPUT_FILE.vcf.gz | Dsuite Dtrios -l $NUMLINES stdin SETS.txt

Installation

Main program:

To compile you must have a reasonably recent GCC (>=4.9.0) or clang compiler (on mac OS this comes with Command Line Tools) and the zlib compression library (https://www.zlib.net). Both should already be present on most systems.

$ git clone https://github.com/millanek/Dsuite.git
$ cd Dsuite
$ make

The Dsuite executable will be in the Build folder, so to run it type e.g. ./Build/Dsuite; this will show the available commands. To execute e.g. the Dtrios command, type ./Build/Dsuite Dtrios.

[Optional] Installing the python3 Fbranch plotting script

If you want to plot the results of the f-branch calcuation (see below), you will need to install the python script for this using setuptools. You need an internet connection as some python dependencies will be downloaded from pypi.org. It may be necessary to exit python or conda virtual environments for this to work correctly.

$ cd utils
$ python3 setup.py install --user --prefix=

The above should work on both mac and linux. Note that there is no text (not even whitespace) after the = above. If you want to use your own virtual environments, you can alternatively not run setup.py and just install the dependencies with pip or conda.

Commands (v0.5 r53):

Dsuite Dtrios - Calculate the D (ABBA-BABA) and f4-ratio statistics for all possible trios of populations/species

Usage: Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
Calculate the D (ABBA/BABA) and f4-ratio statistics for all trios of species in the dataset (the outgroup being fixed)
The results are as definded in Patterson et al. 2012 (equivalent to Durand et al. 2011 when the Outgroup is fixed for the ancestral allele)
The SETS.txt should have two columns: SAMPLE_ID    SPECIES_ID
The outgroup (can be multiple samples) should be specified by using the keywork Outgroup in place of the SPECIES_ID

Use 'stdin' for the VCF file when piping from another program into Dsuite via standard input
in this case it is necessary to provide the number of lines in the filtered VCF via the -l option
For example, to filter the VCF for overall mimimum depth of at least 1000 across all samples:
NUMLINES=$(bcftools view -i 'INFO/DP>1000' INPUT_FILE.vcf | wc -l)  # to get NUMLINES
bcftools view -i 'INFO/DP>1000' INPUT_FILE.vcf | Dsuite Dtrios -l $NUMLINES stdin SETS.txt

       -h, --help                              display this help and exit
       -k, --JKnum                             (default=20) the number of Jackknife blocks to divide the dataset into; should be at least 20 for the whole dataset
       -j, --JKwindow                          (default=NA) Jackknife block size in number of informative SNPs (as used in v0.2)
                                               when specified, this is used in place of the --JKnum option
       -r, --region=start,length               (optional) only process a subset of the VCF file; both "start" and "length" indicate variant numbers
                                               e.g. --region=20001,10000 will process variants from 20001 to 30000
       -t, --tree=TREE_FILE.nwk                (optional) a file with a tree in the newick format specifying the relationships between populations/species
                                               D and f4-ratio values for trios arranged according to the tree will be output in a file with _tree.txt suffix
       -o, --out-prefix=OUT_FILE_PREFIX        (optional) the prefix for the files where the results should be written
                                               output will be put in OUT_FILE_PREFIX_BBAA.txt, OUT_FILE_PREFIX_Dmin.txt, OUT_FILE_PREFIX_tree.txt etc.
                                               by default, the prefix is taken from the name of the SETS.txt file
       -n, --run-name                          (optional) run-name will be included in the output file name after the PREFIX
       --no-f4-ratio                           (optional) don't calculate the f4-ratio
       -l NUMLINES                             (optional) the number of lines in the VCF input - required if reading the VCF via a unix pipe
       -g, --use-genotype-probabilities        (optional) use probabilities (GP tag) or calculate them from likelihoods (GL or PL tags) using a Hardy-Weinberg prior
                                               the probabilities are used to estimate allele frequencies in each population/species
       -p, --pool-seq=MIN_DEPTH                (optional) VCF contains pool-seq data; i.e., each 'individual' is a population
                                               allele frequencies are then estimated from the AD (Allelic Depth) field, as long as there are MIN_DEPTH reads
                                               e.g MIN_DEPTH=5 may be reasonable; when there are fewer reads, the allele frequency is set to missing
       -c, --no-combine                        (optional) do not output the "_combine.txt" and "_combine_stderr.txt" files
       --ABBAclustering                        (optional) Test whether strong ABBA-informative sites cluster along the genome

Output:

The output files with suffixes BBAA.txt, Dmin.txt, and optionally tree.txt (if the -t option was used) contain the results: the D statistics, Zscore, unadjusted p-values, the f4-ratios, and counts of the BBAA, BABA, and ABBA patterns. Please read the manuscript for more details.

The output files with suffixes combine.txt and combine_stderr.txt are used as input to DtriosCombine. If you don't need to use DtriosCombine, you can safely delete these files.

ABBA-clustering test:

When testing for introgression among highly divergent species and/or in cases where introgression happened a long time ago (rule of thumb - millions of generations), there is a risk that the Dstatistic could show a false signal of introgression due to substitution rate variation among different branches on the phylogeny. Such substitution rate variation can lead to homoplasies appearing as ABBA sites.

Importantly, ABBA sites introduced by introgression would substantially cluster along the genome, while homoplasies would appear one by one. To test this, I have added the --ABBAclustering option. The more significant clustering of ABBA sites the more confidence you can have that a gene-flow event is real and not a false positive caused by homoplasies. Two p-values are produced by this test: the clustering_sensitive value and the clustering_robust-val1 value. As the names suggest, the "sensitive" test has greater power but can produce some false positives due to mutation rate variation along the genome. The "robust" test has lower statistical power, but it is robust to mutation rate variation along the genome.

For additional details please see:
Koppetsch, T., Malinsky, M. and Matschiner, M. (2024) Among-species rate variation produces false signals of introgression. bioRxiv 2023.05.21.541635. doi: https://doi.org/10.1101/2023.05.21.541635

DtriosParallel

We provide a python script for parallel execution at <Dsuite_path>/utils/DtriosParallel. The usage is analogous to Dsuite Dtrios, except that the order of SETS.txt and INPUT_FILE.vcf is swapped in the command line so that the user can optionally provide multiple VCF files (whitespace separated). The script will autmatically call DtriosCombine to combine results of all VCF files into a single set of results files.

$ ./utils/DtriosParallel --help
usage: DtriosParallel [-h] [-k JKNUM] [-j JKWINDOW] [-t TREE] [-n RUN_NAME]
                      [-l NUMLINES] [-g] [-p --pool-seq=MIN_DEPTH] [-c]
                      [--cores CORES] [--keep-intermediate]
                      [--logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                      [--dsuite-path DSUITE_PATH]
                      [--environment-setup ENVIRONMENT_SETUP]
                      SETS.txt INPUT_FILE.vcf [INPUT_FILE.vcf ...]

This python script automates parallelisation of Dsuite Dtrios/ Dsuite
DtriosCombine. The usage is analogous to Dsuite Dtrios but computation is
performed on multiple cores (default: number of available CPUs). ATTENTION:
The order of SETS.txt and INPUT_FILE.vcf is swapped compared to Dsuite Dtrios.
This is so that multiple VCF input files can be provided. All vcf files should
have the same samples, (e.g., different chromosomes of the same callset).
Output_files are placed in the same folder as as the the SETS.txt file and
named DTParallel_<SETS_basename>_<run_name>_combined_BBAA.txt etc. This script
should run on most systems with a standard python installation (tested with
python 2.7 and 3.6).


positional arguments:
  SETS.txt              The SETS.txt should have two columns: SAMPLE_ID
                        SPECIES_ID The outgroup (can be multiple samples)
                        should be specified by using the keyword Outgroup in
                        place of the SPECIES_ID
  INPUT_FILE.vcf        One or more whitespace separated SNP vcf files.

optional arguments:
  -h, --help            show this help message and exit
  -k JKNUM, --JKnum JKNUM
                        (default=20) the number of Jackknife blocks to divide
                        the dataset into; should be at least 20 for the whole
                        dataset
  -j JKWINDOW, --JKwindow JKWINDOW
                        Jackknife block size in number of informative SNPs (as
                        used in v0.2) when specified, this is used in place of
                        the --JKnum option
  -t TREE, --tree TREE  a file with a tree in the newick format specifying the
                        relationships between populations/species D and
                        f4-ratio values for trios arranged according to the
                        tree will be output in a file with _tree.txt suffix
  -n RUN_NAME, --run-name RUN_NAME
                        run-name will be included in the output file name
  -l NUMLINES           (optional) the number of lines (SNPs) in the VCF
                        input(s) - speeds up operation if known. If N
                        INPUT_FILE.vcf files are provided, there must be N
                        comma-separated integers provided without whitespace
                        between them.
  -g, --use-genotype-probabilities
                        (optional) use probabilities (GP tag) or calculate
                        them from likelihoods (GL or PL tags) using a Hardy-
                        Weinberg prior
  -p --pool-seq=MIN_DEPTH, --pool-seq --pool-seq=MIN_DEPTH
                        (default=20) the number of Jackknife blocks to divide
                        the dataset into; should be at least 20 for the whole
                        dataset
  -c, --no-combine      (optional) do not run DtriosCombine to obtain a single
                        combined results file
  --cores CORES         (default=CPU count) Number of Dsuite Dtrios processes
                        run in parallel.
  --keep-intermediate   Keep region-wise Dsuite Dtrios results.
  --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -v {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Minimun level of logging.
  --dsuite-path DSUITE_PATH
                        Explicitly set the path to the directory in which
                        Dsuite is located. By default the script will first
                        check whether Dsuite is accessible from $PATH. If not
                        it will try to locate Dsuite at ../Build/Dsuite.
  --environment-setup ENVIRONMENT_SETUP
                        Command that should be run to setup the environment
                        for Dsuite. E.g., 'module load GCC' or 'conda

Output:

Output are _BBAA, _Dmin files analogus to Dtrios/DtriosCombine and are placed in the same folder as as the the SETS.txt file and named DTParallel_<SETS_basename>_<run_name>_combined_BBAA.txt etc.

DtriosCombine - Combine results from Dtrios runs across genomic regions (e.g. per chromosome)

Usage: Dsuite DtriosCombine [OPTIONS] DminFile1 DminFile2 DminFile3 ....

Combine the BBAA, ABBA, and BABA counts from multiple files (e.g per-chromosome) and output the overall D stats,
p-values and f4-ratio values

       -h, --help                              display this help and exit
       -o, --out-prefix=OUT_FILE_PREFIX        (optional) the prefix for the files where the results should be written
                                               output will be put in OUT_FILE_PREFIX_combined_BBAA.txt, OUT_FILE_PREFIX_combined_Dmin.txt, OUT_FILE_PREFIX_combined_tree.txt etc.
                                               by default, the prefix is "out"
       -n, --run-name                          (optional) run-name will be included in the output file name after the PREFIX
       -t , --tree=TREE_FILE.nwk               (optional) a file with a tree in the newick format specifying the relationships between populations/species
                                               D and f4-ratio values for trios arranged according to the tree will be output in a file with _tree.txt suffix
       -s , --subset=start,length              (optional) only process a subset of the trios

Output:

As for Dtrios, there are output files with suffixes BBAA.txt, Dmin.txt, and optionally tree.txt (if the -t option was used). They contain the overall combined results: the D statistics, Zscore, unadjusted p-values, the f4-ratios, and counts of the BBAA, BABA, and ABBA patterns.

Dinvestigate - Follow up analyses for trios with significantly elevated D: calculates D, f_d and f_dM in windows along the genome

Usage: Dsuite Dinvestigate [OPTIONS] INPUT_FILE.vcf.gz SETS.txt test_trios.txt

Outputs D, f_d (Martin et al. 2014 MBE), f_dM (Malinsky et al., 2015), and d_f (Pfeifer & Kapan, 2019) in genomic windows
The SETS.txt file should have two columns: SAMPLE_ID    POPULATION_ID
The test_trios.txt should contain names of three populations for which the statistics will be calculated:
POP1   POP2    POP3
There can be multiple lines and then the program generates multiple ouput files, named like POP1_POP2_POP3_localFstats_SIZE_STEP.txt

       -h, --help                              display this help and exit
       -w SIZE,STEP --window=SIZE,STEP         (required) D, f_D, f_dM, and d_f statistics for windows containing SIZE useable SNPs, moving by STEP (default: 50,25)
       -n, --run-name                          run-name will be included in the output file name

Fbranch - A heuristic approach designed to aid the interpretation of many correlated f4-ratio results

Usage: Dsuite Fbranch [OPTIONS] TREE_FILE.nwk FVALS_tree.txt
Implements the 'f-branch' type calculations developed by Hannes Svardal for Malinsky et al., 2018, Nat. Ecol. Evo.
Uses the f4-ratio (f_G) values produced by Dsuite Dtrios (or DtriosCombine) with the --tree option; this is the output of Dtrios with the "_tree.txt" suffix
To use  Fbranch, the tree in TREE_FILE.nwk must be rooted with the Outgroup.
Output to stdout

      -p, --pthresh                           (default=0.01) fb scores whose associated p-value is less than 
      -Z, --Zb-matrix                         (optional)  output the equivalent of fb-statistic, but with Z-scores to assess statistical significance
                                              this will be printed below the f-branch matrix
      -h, --help                              display this help and exit

Output:

The f-branch statistic in matrix-like format. Use the plotting function below to display the f-branch statistic.

Plotting Fbranch

The output of Dsuite Fbranch can be plotted with ./utils/dtools.py (see installation instructions above).

usage: dtools.py [-h] [-n RUN_NAME] [--outgroup OUTGROUP] [--use_distances]
                 [--ladderize]
                 fbranch.txt tree.newick

Plot f-branch statistic as produced by Dsuite. Produces .png and .svg files.

positional arguments:
  fbranch               Path to file containing f-branch matrix as produced by
                        Dsuite Fbranch.
  tree                  Path to .newick tree file as given to Dsuite Fbranch.

optional arguments:
  -h, --help            show this help message and exit
  -n RUN_NAME, --run-name RUN_NAME
                        Base file name for output plots. (default: fbranch)
  --outgroup OUTGROUP   Outgroup name in newick file. (default: Outgroup)
  --use_distances       Use actual node distances from newick file when
                        plotting tree. (default: False)
  --ladderize           Ladderize the input tree before plotting. (default:
                        False)

Running dtools.py yields a .png and an .svg file of the f-branch statistic along the input tree. You can edit the .svg file in a vector graphics editor (e.g., inkscape) to your liking. See Malinsky et al. 2018 Fig. 3 and the Dsuite paper for examples and interpretation of f-branch plots.

(experimental) Dsuite Dquartets - Calculate the D (ABBA-BABA) and f4-ratio statistics for all possible quartets of populations/species (no outgroup)

Usage: Dsuite Dquartets [OPTIONS] INPUT_FILE.vcf SETS.txt
Calculate the D (ABBA/BABA) and f4-ratio (f_G) statistics for all quartets of species in the dataset (there is no outgroup)
The results are as definded in Patterson et al. 2012
The SETS.txt should have two columns: SAMPLE_ID    SPECIES_ID

-h, --help                              display this help and exit
-k, --JKnum                             (default=20) the number of Jackknife blocks to divide the dataset into; should be at least 20 for the whole dataset
-j, --JKwindow                          (default=NA) Jackknife block size in number of informative SNPs (as used in v0.2)
                                        when specified, this is used in place of the --JKnum option
-r, --region=start,length               (optional) only process a subset of the VCF file
-t, --tree=TREE_FILE.nwk                (optional) a file with a tree in the newick format specifying the relationships between populations/species
                                        D and f4-ratio values for trios arranged according to the tree will be output in a file with _tree.txt suffix
-o, --out-prefix=OUT_FILE_PREFIX        (optional) the prefix for the files where the results should be written
                                        output will be put in OUT_FILE_PREFIX_BBAA.txt, OUT_FILE_PREFIX_Dmin.txt, OUT_FILE_PREFIX_tree.txt etc.
                                        by default, the prefix is taken from the name of the SETS.txt file
-n, --run-name                          (optional; default=quartets) run-name will be included in the output file name after the PREFIX
--no-f4-ratio                           (optional) don't calculate the f4-ratio
-l NUMLINES                             (optional) the number of lines in the VCF input - required if reading the VCF via a unix pipe

Parallelisation with DtriosParallel

This python script, included in the ./utils/ subfolder, automates parallel runs of Dsuite Dtrios across multiple cores on one computer and automatically combines the results using Dsuite DtriosCombine. The usage is analogous to Dsuite Dtrios (currently with more limited options) but computation is performed on multiple cores (default: number of available CPUs). It should run on most systems with a standard python installation (tested with python 2.7 and 3.6).

DtriosParallel [-h] [--cores CORES] [-k JKNUM] [-j JKWINDOW] [-t TREE]
                      [-n RUN_NAME] [--keep-intermediate]
                      [--logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                      [--dsuite-path DSUITE_PATH]
                      [--environment-setup ENVIRONMENT_SETUP]
                      INPUT_FILE.vcf SETS.txt


positional arguments:
  INPUT_FILE.vcf
  SETS.txt              The SETS.txt should have two columns: SAMPLE_ID
                        SPECIES_ID The outgroup (can be multiple samples)
                        should be specified by using the keyword Outgroup in
                        place of the SPECIES_ID

optional arguments:
  -h, --help            show this help message and exit
  --cores CORES         (default=CPU count) Number of Dsuite Dtrios processes
                        run in parallel.
  -k JKNUM, --JKnum JKNUM
                        (default=20) the number of Jackknife blocks to divide
                        the dataset into; should be at least 20 for the whole
                        dataset
  -j JKWINDOW, --JKwindow JKWINDOW
                        Jackknife block size in number of informative SNPs (as
                        used in v0.2) when specified, this is used in place of
                        the --JKnum option
  -t TREE, --tree TREE  a file with a tree in the newick format specifying the
                        relationships between populations/species D and
                        f4-ratio values for trios arranged according to the
                        tree will be output in a file with _tree.txt suffix
  -n RUN_NAME, --run-name RUN_NAME
                        run-name will be included in the output file name
  --keep-intermediate   Keep region-wise Dsuite Dtrios results.
  --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Minimun level of logging.
  --dsuite-path DSUITE_PATH
                        Explicitly set the path to the directory in which
                        Dsuite is located. By default the script will first
                        check whether Dsuite is accessible from $PATH. If not
                        it will try to locate Dsuite at ../Build/Dsuite.
  --environment-setup ENVIRONMENT_SETUP
                        Command that should be run to setup the environment
                        for Dsuite. E.g., 'module load GCC' or 'conda
                        activate'

Change log:

Selected updates (full update history is accessible on gitHub):
v0.5 r53:   Changed --ABBAclustering and provide propoer documentation for this test  
v0.5 r48:   --KS-test-for-homoplasy now works accurately for sample sizes >= 16,000  
v0.5 r47:   --KS-test-for-homoplasy output now has more accurate p-values (one-sample KS-test for uniformity)
v0.5 r46:   Support for arbitrary ploidy in Dtrios
v0.5 r45:   BUG FIX for r44 where the trio orientation in the "_tree.txt" output files was wrong (P1 and P2 swapped). This is fixed now.
v0.5 r44:   Major update:   - code re-factoring, including proper subsampling for f4-ratio calculations
                            - First implementation of the Kolgomorov-Sminov test for homoplasy (--KS-test-for-homoplasy) in Dtrios; still somewhat experimental and works only in the "_BBAA.txt" output
                            - Very small p-values now don't get rounded to 0 but are bounded at 2.3e-16
v0.4 r43:   First implementation of the pool-seq (-p) option in in Dtrios 
v0.4 r28:   Merged DtriosParallel from https://github.com/feilchenfeldt and refreshed documentation
v0.3 r27:   Added the -o (--out-prefix) option to allow more flexibility in naming output files
v0.3 r25:   Added the Dquartets program - D and f4-ratio calculation without any outgroup, for all quartets of populations/species
v0.3 r24:   Z-scores and site pattern counts (BBAA, ABBA, BABA) are now in the output of Dtrios, DtriosCombine, and Dquartets 
v0.3 r23:   Allow piped (stdin) VCF format input to Dtrios; this facilitates e.g. pre-filtering and/or bcf input using bcftools 
v0.3 r22:   Adding the d_f statistic (Pfeifer & Kapan, 2019) to Dinvestigate
v0.3 r21:   Automatic estimation of Jackknife window size to get a desired number of blocks
            Progress update in %
            f4-ratios are calculated by default by Dtrios 
            Updated documentation
v0.2 r20:   Subset option returns to DtriosCombine
v0.2 r19:   Fixed a bug in Fbranch where P1 and P2 positions where A in P2 positions and B in P1 positions were not considered
v0.2 r18:   Full implementation of D and f4-ratio in line with the Patterson et al. 2012 definitions 
            (affects only analyses where the outgroup allele is not fixed)
v0.2 r15:   Ironed bugs in Dinvestigate, truly useable from this point  
v0.2 r6:    First Fbranch version
v0.1 r1:    First workable Dsuite release 8th May 2019

dsuite's People

Contributors

Stargazers

Watchers

dsuite's Issues

Dtrios do not calculate D and f-ratio for all possible trios

I have a VCF file including 4 populations (one of that is an outgroup). I used Dsuite program (Dtrios option) for the introgression test. Based on Dsutie paper "The first program, Dtrios calculates the sums in Equation and outputs genome-wide statistics including the D, its associated p-value, and the f4-ratio statistic, for all trios of populations or species". Except for the outgroup, I have three populations and I expected the D statistics to be calculated for all 6 possible trios, while it was calculated for only one trio without any error or warning.

Can anyone help me find the problem?

some question about Dtrios and Dqartets

Hi, @millanek
Nowadays,I use Dsuite to calculate D-statistics.I meet some trouble while using the command Dtrios and Dquartets.
While I use the Dtrios to calculate D-statistics,one species occuring in position P2 or P3 introgresses with many other species with significant D-statistic(p<0.01).However,while using Dquartets,that species only occurs in position P4,having no introgression with other species.It looks so strange,could you please tell me why these happen?
Besides,I input my tree while using the command Dquartets ,but i find some trios in the file "out_quartets_tree.txt" are not according to the relationship of my tree.
I sent the similar question to you via gmail,but my account was lost and cannot find it back.
Thank you for your help.
Best wishes!

compile error on ubuntu 18.04

Hi Milan

after struggling with admixtools, your new software looks really useful.

I am having troubling installing, same error on two different systems. Can you spot the error? GCC is 7.5.0 and Zlib is 1.2.11

The error I got after "make" is below.

Thanks

g++ -c -std=c++11 Dsuite.cpp -o Build/Dsuite.o
g++ -c -std=c++11 Dsuite_utils.cpp -o Build/Dsuite_utils.o
g++ -c -std=c++11 D.cpp -o Build/D.o
g++ -c -std=c++11 gzstream.cpp -o Build/gzstream.o
g++ -c -std=c++11 Dmin.cpp -o Build/Dmin.o
g++ -c -std=c++11 Dmin_combine.cpp -o Build/Dmin_combine.o
g++ -c -std=c++11 Dsuite_fBranch.cpp -o Build/Dsuite_fBranch.o
g++ -c -std=c++11 Dquartets.cpp -o Build/Dquartets.o
g++ -c -std=c++11 Dsuite_common.cpp -o Build/Dsuite_common.o
g++ -c -std=c++11 kstest.cpp -o Build/kstest.o
In file included from kstest.cpp:21:0:
kstest.h:29:26: error: ‘int64_t’ was not declared in this scope
double ks_test(std::list<int64_t> s1, std::list<int64_t> s2, std::ostream& output, bool printDebug);
^~~~~~~
kstest.h:29:33: error: template argument 1 is invalid
double ks_test(std::list<int64_t> s1, std::list<int64_t> s2, std::ostream& output, bool printDebug);
^
kstest.h:29:33: error: template argument 2 is invalid
kstest.h:29:49: error: ‘int64_t’ was not declared in this scope
double ks_test(std::list<int64_t> s1, std::list<int64_t> s2, std::ostream& output, bool printDebug);
^~~~~~~
kstest.h:29:56: error: template argument 1 is invalid
double ks_test(std::list<int64_t> s1, std::list<int64_t> s2, std::ostream& output, bool printDebug);
^
kstest.h:29:56: error: template argument 2 is invalid
kstest.h:29:67: error: ‘std::ostream’ has not been declared
double ks_test(std::list<int64_t> s1, std::list<int64_t> s2, std::ostream& output, bool printDebug);
^~~~~~~
In file included from kstest.cpp:22:0:
Dsuite_utils.h: In member function ‘void TrioDinfo::testIfStrongSitesUniformlyDistributed()’:
Dsuite_utils.h:593:94: error: cannot convert ‘std::__cxx11::list’ to ‘int’ for argument ‘1’ to ‘double ks_test(int, int, int&, bool)’
KSpvalForStrongSites = ks_test(uniABBAvals, linearStrongPosList, std::cerr, false);

How do I download example files

The example files could not be downloaded.
I want to download the example files Malawi cichlid data used in the manuscript: VCF file; SETS.txt file
Simulated 20-species data used in the manuscript: VCF file; SETS.txt file; TREE_FILE.nwk (input tree),
which are mentioned in README.md.

However, these files cannot be downloaded. How do I download these files?

The *__combine.txt file format is different from the tutorials.

Hi,

In the thirpart tutorial (https://github.com/mmatschiner/tutorials/tree/master/analysis_of_introgression_with_snp_data), there are 6 columns in samples__combine.txt file, but I got 12 columns(like below) when I use ./Dsuite Dtrios, can you explain the meanings of these 12 columns?
cat T90_T111__combine.txt
G1 G2 G3 54351.5 27674.1 27854.6 71440.7 33485 43411.3 71204.7 43988.2 33408.6

Best,
Kun

Defining the quartets/trios

Dear Dr. Milan Malinsky

I encountered a problem when calculating all the possible D statistics for a large set of populations (45 ingroups and 1 outgroup). After running Dtrios with all the 46 populations, I got 14190 lines of output. So the number of trios is defined as choose(45,3), as the outgroup is fixed.

However, it is not a complete combination of trios. There should be 45 different choices for P3, 44 for P2 and 43 for P1. Thus the total number of possible trios shall be 454443 = 85140. I am not sure if this is the correct way of generating all the possible trios, could you give some suggestions?

Best,
Xueyun

Dsuite: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

Hi @millanek ,
When I try to run the program ./Dsuite , I get this report:

./Dsuite: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./Dsuite)

so how could I fix this problem, thanks in advance.

Best Regards,
Yung-Chien

Can Dinvestigate sliding window by length not SNP number

Hi, @millanek

Can Dinvestigate sliding window by length (such as vcftools, 25kb window, 10kb step)?

Zhigui Bao

unlinked SNPs from target enrichment data for Suite

Hello,

I am using D-suite with SNPs extracted form a target enrichment data set. Have run the analysis using both the full SNPs (~13,000 SNPs) and "unlinked SNPs" (thinned every 100 bp; 1,041 SNPs). From the D-suite publication I gather that for the relatively small number of unlinked SNPs, the proportion of cases where the strongest inferred f-branch signal corresponds the correct simulated gene flow recipient and donor branches is low but using the full sNP dataset would violate the assumption of independence of the loci. I get very different results with both datasets after running D-trios and Fbranch and would like to know your opinion on weather these analyses should be conducted at all with these type of data, and which dataset would be best.

I was also wondering if in that case, it would be best to not infer which branches and involved in geneflow, but instead, show how after the BH correction, there are multiple p-values that look significant. I am working on study investigating ILS and introgression as a source of gene-tree conflict in an recent Andean radiation.

Thank you!!

Patterson is misspelled in the "About" section.

missing data

Dear Millanek,

A question about the data on the vcf.
How are missing info about the genotype handled by the software?

For example:
In my vcf I have positions such as this :

chr1 123385 . G A 267.58 . AC1=0;AC=2;AF1=0;AN=26;DP4=1223,1123,35,10;DP=2521;FQ=-137.987;MQ0F=0;MQ=60;MQSB=1;SF=0,1,2,3,4,5,6,7,8,9,11,12,13;SGB=-0.693147;VDB=0.110291 GT:SP:AD:ADF:PL:DP:ADR 0/0:0:37:21:.,.,.:37:16 0/0:0:235:112:.,.,.:235:123 0/0:0:197:95:.,.,.:197:102 0/0:0:265:150:.,.,.:265:115 0/0:0:199:105:.,.,.:199:94 0/0:0:106:63:.,.,.:106:43 0/0:0:89:44:.,.,.:89:45 0/0:0:182:98:.,.,.:182:84 0/0:0:197:100:.,.,.:197:97 0/0:0:242:131:.,.,.:242:111 . 0/0:0:294:153:.,.,.:294:141 1/1:0:0,45:0,35:255,135,0:45:0,10 0/0:0:303:151:.,.,.:303:152

the fourth genotype from the end is a dot (.), this position is missing against one sample. Is the position chr1 123385 going to be used for the measure of the D statistics? or it is going to be discarded ?

Thanks,

Best Regards,

Nicolò Tellini

mixed ploidy dataset

Hello,
I am wondering if we can apply this method to a dataset including diploid and tetraploid populations.
Should I specify the ploidy level of each population? If yes, how?
Thanks in advance,
Nélida

Urgent help: Installing problem

Hello,
Thank you for this promising tool. I am excited to get all the D-statistics with vcf file alone.

I have been trying to compile it on a conda environment for two weeks now, but this kept failing.
I installed gcc 4.9 (https://anaconda.org/serge-sans-paille/gcc_49) on a new condo environment, but I keep getting this error
./Build/Dsuite: /usr/lib64/libstdc++.so.6: version GLIBCXX_3.4.21' not found (required by /Build/Dsuite).I tried installing newer gcc versions (5.5, 6, 7.2) all via conda, but I keep getting the same error. When I checkedstrings /usr/lib64/libstdc++.so.6 | grep GLIBCXX`
I got

GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH

Can you please assist.
Your urgent response is highly appreciated.
Thanks,

Dsuite installation

Dear Milla,

Encountered a problem while installing D-suite in our Linux server. Here is the error:

Dmin_combine.cpp:139:30: error: use of undeclared identifier 'isnan'
if (!isnan(thisBBAA_localD)) BBAA_local_Ds.push_back(thisBBAA_localD);
^
Dmin_combine.cpp:142:30: error: use of undeclared identifier 'isnan'
if (!isnan(thisBABA_localD)) BABA_local_Ds.push_back(thisBABA_localD);
^
Dmin_combine.cpp:145:30: error: use of undeclared identifier 'isnan'
if (!isnan(thisABBA_localD)) ABBA_local_Ds.push_back(thisABBA_localD);
^
3 errors generated.
Makefile:13: recipe for target 'Build/Dmin_combine.o' failed
make: *** [Build/Dmin_combine.o] Error 1

I have gcc ver. 5.4 and clang ver. 3.8. installed in our network. What do you think is the problem?

Thanks!

Apollo

Three questions about the interpretation of Dsuite results

Hi
I have three simple questions about the results of Dsuite

When based on the results, we realized that there is gene flow between P2 and P3 (Z-score > +3; p-value < 0.05 for example). How can we determine the direction of this gene flow? Can we reach this answer based on the results of the Dsuite or should we get help from other programs?
In the output of Dtrios, does the f4-ratio range from zero to one, and for example, the number 0.035 mean that there is 3.5 percent admixture between P2 and P3?
why the f4-ratio and D static are different for the two below trios

P1	P2	P3	Dstatistic	Z-score	p-value	f4-ratio	BBAA	ABBA	BABA
Bc1	Bc2	Dro2	0.191536	6.08768	1.14559e-09	0.0297122	2.21398e+06	210055	142524
Dro1	Dro2	Bc2	0.313128	11.5109	2.3e-16	0.0289916	2.31764e+06	135908	71090.7

In both cases, isn't the presence of introgression between Bc2 and Dro2 checked?

Specifying the labels in sets file

Heys,

I'm doing some introgression analyses and I have a crucial question. I have several individuals per species, having a total of three species plus the outgroup and I am interested in studying the introgression at the species level. In the sets file, should I specify the individual label or the species label? If I specify the individual label for each sample, I obtain a bigger results matrix with more fine-scaled data, but I'm wondering if I act like this I miss the general pattern. What do you recommend me / what do you do?

Thanks in advance!!

Data filtering

Heys,

I'm using Dsuite to investigate if there was introgression in my dataset and I'm experiencing different results depending on the filtering settings (mainly in the amount of missing data and MAF filterings). I know that Dsuite allows missing data, but what should be the correct filterings for it? Depending on the percentage, the introgressions varies! There is an existing rule on the filtering parameters to use?

Thanks in advance,
Gabriel

pool-seq

Hello,

I was wondering if there was anyway to incorporate pool-seq data into the programme? by using allele counts from the AD field in vcfs resulting from pool-seq? I am aware of the new Poolfstat but that doesn't have the ability to implement f_dM for finding specific introgressed loci using windows across the genome.

'std::length_error' bug?

Dear Milan,

Thank you for publishing this software, which I hope will be very useful for many! I just tried to run it on our cluster after installation with gcc/4.9.2, and with zlib/1.2.8 modules loaded. I run into a potential bug with different datasets, including the settings file and a shortened version of the Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz file associated with the bioRxiv preprint (the first 3,012 sites). However, with the full file Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz, the program worked like a charm.

I always used the following command(s):
Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz sets.txt
Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0_subset3012.vcf.gz sets.txt

The STDOUT/STDERR then looked like this:

Going to calculate 62196 Dmin values
Done permutations
Processed 10000 variants in 35.78secs
GettingCounts 0 calculation 0secs
Processed 20000 variants in 73.1secs
GettingCounts 0 calculation 0.01secs
Processed 30000 variants in 108.12secs
GettingCounts 0 calculation 0secs
(...)
Done processing VCF. Preparing output files...
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

Could this be a bug in the code or a problem with accepting gzipped / bgzipped or uncompressed VCF files or with my installation? Thank you in advance for your help with resolving this issue!

Best wishes,
David

Tree file creates an error

Hi Milan,

Thanks for creating this program. I'm having a problem when I try and use a tree file. I see a previous issue with the same error message (fixed by ensuring species names in the map file and tree file are the same) but no matter what variants on species names or tree file I try I get the same error. Could there be another reason causing this?

The error is:
There are 8 sets (excluding the Outgroup) Going to calculate D and f4-ratio values for 56 trios Out of Range error: map::at species[i]: Species1
My tree file is:
(Outgroup,((Species1(Species2)),(Species4(Species5,Species6,Species8,Species7,Species3))));

And finally my map file is (formatted in two columns but with tab spacing - I think it displays differently here):
bb29 Species1
bb30 Species1
tpl40 Species2
t15 Species2
sbre460 Outgroup
sbre116871wga Outgroup
lh802 xxx
lh806 xxx
S303 xxx
S304 xxx
WNAC11 Species3
WNAC14 Species3
WNAC16 Species3
WNAC19 Species3
WNAC8 Species3
wnac25 Species3
WNAP7 Species4
WNAP8 Species4
WNAP9 Species4
AZ108 Species5
AZ109 Species5
AZ2 Species5
AZ3 Species5
AZ40 Species5
AZ41 Species5
43GE Species6
T19 Species6
19G Species6
36GPC Species6
37GPC Species6
BS10 Species7
BS11 Species7
BS6 Species7
BS7 Species7
BS8 Species7
TtTUS9 Species8
TtTUS3 Species8
TtTUS4 Species8
TtTUS14 Species8

Any help is much appreciated. Thank you!

no p-values calculated for any trios after recent update

Hello,

When using the latest code (the one available on 20.06.2022) for running Dsuite on my dataset it seems that no p-values can be calculated for any of the trios and the analysis can not continue. The code was working at the beginning of this year for the same dataset and I wonder if some of the latest changes in the code are giving this issue, perhaps that of "- Very small p-values now don't get rounded to 0 but are bounded at 2.3e-16".

Here is a copy of the output I get:

There are 51 sets (excluding the Outgroup)
Going to calculate D and f4-ratio values for 20825 trios
Done permutations
The VCF contains 1619 variants
Going to use block size of 79 variants to get 20 Jackknife blocks
Done processing VCF. Preparing output files...

...
p-value could not be calculated for 20825 trios
it looks like there aren't enough ABBA-BABA informative variants for these trios
If this was a run for a subset of the genome (e.g. one chromosome), you may still get p-values for these trios from DtriosCombine

Dsuite/utils/dtools.py:3542: RuntimeWarning: invalid value encountered in double_scalars
  colors = np.concatenate([[[1, 1, 1, 1]], plt.cm.Reds(np.linspace(0., 1 * fmax / max_color_cutoff, 256))])

Using a previous version of the code (previous to to May 2022) worked again so I can run my analysis, but I would like to know if by using the previous code I would be missing some useful new updates. What are exactly the new changes implemented (those from May 7th to May 15) for? Also, would you have any other recommendation on how to solve the issue an be able to use the code with the latest updates?
Thank you

Best wishes,
Juan Manuel

Question about Dquartets

HELLO, thanks for your convinient tool,
i hava some questions when i perform the Dsuite.

i use the Dsuite Dquartets to get the all possible quaitets, but only with the limited ones, also the species should be the outgroup appeared at the inner branch?
another questions about the results: if i wanna prove the branch did not have introgression in the history, P value logically should be very large one and that also means the intrgression is not significant. am i right?

hope for your reply! thanks very much.

yours,
Yuan

Some problems about Dsuit

Dear Milan Malinsky,

Some problems occured when I uesd Dsuit.

First, the parameter -f of Dsuite Dtrios. After the run of Dsuite Dtrios whith -f, the f4, f_d, and f_dM statistics did not appear in the output files. And there is no Z-score in output files.

Second, the output of 'Dsuite Dinvestigate' of 0.2 version is different with the output of 0.1 version. The f_d of version 0.2 is much bigger then D, but f_d is lesser then D in the output of version 0.1. And the windowStart of version 0.1 is scientific counting, this may be corrected in version 0.2. But the output files of version 0.2 has lesser lines then version 0.1. The output of 'Dsuite Dinvestigate' of the two version are completely different. Which one should I believe?

Best regards,
Yang Song

WARNING: Fewer than five blocks to calculate jackknife!!

Could not calculate p-values for the trio: Species2 Species3 Species4
it looks like there aren't enough ABBA-BABA informative variants for this trio.

What does it mean?

Question about the plot of F-branch

Hello,
I have some question about the plot of F-branch.

There are some grey cell in the heatmap, what are they represent? What is the difference between grey cell and white cell?
Some branchs in Y-axis are show in dotted line and without branch name. Are they subsampled from the real branch and used to calculate fG? If they are, which real branch are they sampled from?
For example, in the following plot, is the branch 6 sampled from p and q? The branch 5 is sampled from o? The branch 4 is sampled from n? The branch 3 is sampled from m? What about branch 2? Sampled from m, n, o, p, and q? And how about branch 1?

Error parsing tree when using Fbranch

Hi,

I am getting the following error when running the Fbranch command:

##command
Build/Dsuite Fbranch kivu_tree kivu_list_dsuite_Fbkivu_tree.txt > fbranch.txt

##error
ERROR: The tree string could not be parsed correctly! The remaining unparsed tree string is:
((((((A.elegans,(P.graueri,EdwardSpecies)),((((Y.kamiranzovu,(P.paucidens,Micro)),(N.olivaceous,internalNode0X)),Lit_lig_red),(L.occultidens,M.crebidens))),P.vittatus),A.aeneocolor),(P.nyereri,H.squamipinnis)),A.alluaudi);

I am using the exact same tree in newick format I used to run Dtrios and the kivu_list_dsuite_Fbkivu_tree.txt output file does have f_G values. So the tree seems to work when setting up the possible calculations, but fails in this step.

I have tried removing the outgroup, or even running a small subset of the tree but I continue to get the same parsing error.

Any ideas on what I can try? Thanks so much for the terrific package.

Best

Chad

main.NewickError: Unexisting tree file or Malformed newick tree structure. You may want to check other newick loading flags like 'format' or 'quoted_node_names'.

Dear friend, I run the "python3 dtools.py -n reduced species_sets_with_geneflow_Fbranch.txt Newick " and "python3 dtools.py species_sets_with_geneflow_Fbranch.txt Newick " with the sam error ,and the newick tree was

and the fbrance file was

could you point the detail about how to deal with the bug?
Thank you

When running Dsuite I met a problem

Hi！
I met this: What might be causing this？And I use the vcf file from (https://github.com/mmatschiner/tutorials/blob/master/analysis_of_introgression_with_snp_data/README.md#simulation), It works normally.
There are 8 sets (excluding the Outgroup)
Going to calculate D and f4-ratio values for 56 trios
Done permutations
The VCF contains 7362509 variants
Going to use block size of 368124 variants to get 20 Jackknife blocks
/slurmState/slurmSpool/slurmd/job317505/slurm_script: line 9: 414566 Aborted (core dumped) /data/01/user164/workspace/try/Dsuite/Build/Dsuite Dtrios /data/01/user164/workspace/Eospalax/Dsuite/0607.vcf.gz /data/01/user164/workspace/Eospalax/Dsuite/Dtrios_tab

ABBA pattern more than the BBAA pattern

Hi,
I used the super-matrix and coalescent methods to construct a species tree by 4000 single copy orthologs. When I counted ABBA analysis, I found that ABBA (628,561) pattern more than the BBAA pattern (301,888). What do you think the factors generate conflict between this two sets?

Best wishes,
Zhiyang Zhang

Compile error in Fedora34

I had a problem compiling Dsuite in Fedora34 getting this error:
Dsuite_utils.cpp:74:63: error: ‘numeric_limits’ is not a member of ‘std’
After some googling, I was able to resolve the problem by adding the following two lines to Dsuite_utils.cpp
#include <stdexcept>
#include <limits>
Could you add these two lines there permanently, please?
Tomas

P1 and P2 are inverted across different combination.

Dear millanek,

Thanks for such useful software.
I have one question concerning the outputs.

I am using Dtrios and forcing the output using a newick file in input which is organized as follow:
(Outgroup,(Species1,(Species2,Species3)));

The script runs using a for cycle in which every iteration replace Specie2 whit a new candidate group.
For example:
(Outgroup,(Species1,(Species2A,Species3)));
(Outgroup,(Species1,(Species2B,Species3)));
(Outgroup,(Species1,(Species2C,Species3)));

Nevertheless, in the outputs in "__tree.txt" sometimes Species2 and Species3 are inverted.

For example:

P1 | P2 | P3 | Dstatistic | Z-score | p-value | f4-ratio | BBAA | ABBA | BABA
Species3 | Species2B | Species1 | 0.143933 | 7.99989 | 6.23E-16 | 0.0196811 | 12149.8 | 1063 | 795.5
Species2A | Species3 | Species1 | 0.147353 | 7.04704 | 9.14E-13 | 0.0136954 | 11767.1 | 688.125 | 511.375

This leads to a difficult comparison between the data because for example in the ABBA field (Species3 | Species2B | Species1) the B alleles correspond to the alternative alleles in Species2B | Species1 while in (Species2A | Species3 | Species1) the B alleles (under ABBA) correspond to the alternative alleles in Species3 | Species1 .

So I was wondering does exist a way to force the calculation of the output keeping the populations fixed so that Species3 is always P2 ?

Thank you for your time,
Hope the question is clear.

Best regards,

Nicolò Tellini

Issue with Species map?

Hello Milan,

Thanks for making such an awesome program. I ran through the tutorial and now I am trying to get my own data to work, however I am having some issues with the population/species map. When I have a newick tree it does not seem to work and I get the following error:

There are 14 sets (excluding the Outgroup)
Going to calculate 364 Dmin values
Out of Range error: map::at:  key not found
species[i]: inyo1

However when I remove the newick tree it appears to be wokring. Any thoughts?

Thanks!

No output file generated when a path included in "-n" option of "Dinvestigate"

Hello,
When I use Dsuite (Version: 0.4 r34) Dinvestigate, and set -n out/chr1 . There is no output file generated in the folder "out" or the current directory. Seems if a path included in -n, the output file can not generated scuccessfully. Thus the output file can only generated in the current directory?

Yudong

cannot plot fbranch statistics

This is a fantastic set of tools. However, I am hitting a snag I cannot seem to solve. I can run

.../Dsuite Fbranch -Z Allprunedr2-50thinned100.nex.partsroot2.txt SETS_tree-guided_tree.txt > fbranch1.txt

and I get the file with lots of values and 'nan's
but then when I try the next command:

.../utils/dtools.py fbranch1.txt Allprunedr2-50thinned100.nex.partsroot2.txt

I get the following error:

Traceback (most recent call last):
File "/home/yote/bin/Dsuite/utils/dtools.py", line 3654, in
sys.exit(main())
File "/home/yote/bin/Dsuite/utils/dtools.py", line 3624, in main
fb[fb<0] = 0
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/ops/common.py", line 70, in new_method
return method(self, other)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/arraylike.py", line 48, in lt
return self._cmp_method(other, operator.lt)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/frame.py", line 6934, in _cmp_method
new_data = self._dispatch_frame_op(other, op, axis=axis)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/frame.py", line 6973, in _dispatch_frame_op
bm = self._mgr.apply(array_op, right=right)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/internals/managers.py", line 302, in apply
applied = b.apply(f, **kwargs)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/internals/blocks.py", line 402, in apply
result = func(self.values, **kwargs)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/ops/array_ops.py", line 283, in comparison_op
res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
File "/home/yote/.local/lib/python3.9/site-packages/pandas-1.4.2-py3.9-linux-x86_64.egg/pandas/core/ops/array_ops.py", line 73, in comp_method_OBJECT_ARRAY
result = libops.scalar_compare(x.ravel(), y, op)
File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '<' not supported between instances of 'str' and 'int'

The files are attached below:
Allprunedr2-50thinned100.nex.partsroot2.txt
SETS_tree-guided_tree.txt
fbranch1.txt

Plotting fbranch z-scores

Hi Dsuite team,
I am trying to plot the fbranch statistics as well as the corresponding z-scores using the dtools.py script (as in your Malawi cichlid manuscript).

When I do this just with the fbranch statistic as follows:

Dsuite Fbranch ./collapsed_tree_no_bootstraps_rotated.txt out_combined_tree.txt > fbranch_collapsed.txt"
../Dsuite/utils/dtools.py fbranch_collapsed.txt ./collapsed_tree_no_bootstraps_rotated.txt

It works great and the heatmap is outputted (although without indication of fb statistical significance). However, when I try to use the z-scores :

Dsuite Fbranch -Z ./collapsed_tree_no_bootstraps_rotated.txt out_combined_tree.txt > fbranch_collapsed_z.txt"
tail -n 69 fbranch_collapsed_z.txt > z_scores.txt
../Dsuite/utils/dtools.py z_scores.txt ./collapsed_tree_no_bootstraps_rotated.txt

I get another heatmap scales for the z-score i.e. up to ~16 but I have had little success altering the matrix to show 'significant'/'non-significant'/nan

I'm not sure if the script was intended to add the significance dots as in your paper (apologies if not) but any advice on how you parsed the z-scores correctly to add dots to your paper manuscript would be greatly appreciated,
All the best,
Rishi De-Kayne

collapsed_tree_no_bootstraps_rotated.txt
fbranch_collapsed_z.txt
fbranch_collapsed.txt
z_scores.txt

ERROR: The tree string could not be parsed correctly!

Hello,
When I use Dsuite (Version: 0.4 r34) Fbranch, the tree can not be parsed correctly.

The input tree was as follows. Is this error due to the tree topology? What should I do to fix this error?
(Outgroup,((CA_BIG,CA_THIN),(KG_AGL,(IR_URI,(IR_MOU,(((EU-MOU,(EU-OA,(DE-MRN,AU-MRN))),AM-OA),(((IR-OA,(((((((CN-TAN,(CN-YNS,(CN-TIB,CN-OLA))),CN-STH),CN-HU),CN-WZM),CN_CLB),CN-BYK),EA-OA)),TR-OA),AF-OA)))))));

Best,
Hong Cheng

the Dsuite_fBranch.h:63:84: error: no matching function for call to ‘regex_replace(std::string&, std::regex&, const char [1])’ string treeNoBranchLengths = std::regex_replace(treeString,branchLengths,"");

when I compile，its show the error， how can I slove this.

g++ -c -std=c++11 Dsuite.cpp -o Build/Dsuite.o
In file included from Dsuite.cpp:13:0:
Dsuite_fBranch.h: In constructor ‘Tree::Tree(std::string)’:
Dsuite_fBranch.h:63:84: error: no matching function for call to ‘regex_replace(std::string&, std::regex&, const char [1])’
string treeNoBranchLengths = std::regex_replace(treeString,branchLengths,"");
^
Dsuite_fBranch.h:63:84: note: candidates are:
In file included from /usr/include/c++/4.8.2/regex:62:0,
from Dsuite_utils.h:20,
from Dsuite.cpp:9:
/usr/include/c++/4.8.2/bits/regex.h:2162:5: note: template<class _Out_iter, class _Bi_iter, class _Rx_traits, class _Ch_type> _Out_iter std::regex_replace(_Out_iter, _Bi_iter, _Bi_iter, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
regex_replace(_Out_iter __out, _Bi_iter __first, _Bi_iter __last,
^
/usr/include/c++/4.8.2/bits/regex.h:2162:5: note: template argument deduction/substitution failed:
In file included from Dsuite.cpp:13:0:
Dsuite_fBranch.h:63:84: note: deduced conflicting types for parameter ‘_Bi_iter’ (‘std::basic_regex’ and ‘const char*’)
string treeNoBranchLengths = std::regex_replace(treeString,branchLengths,"");
^
In file included from /usr/include/c++/4.8.2/regex:62:0,
from Dsuite_utils.h:20,
from Dsuite.cpp:9:
/usr/include/c++/4.8.2/bits/regex.h:2182:5: note: template<class _Rx_traits, class _Ch_type> std::basic_string<_Ch_type> std::regex_replace(const std::basic_string<_Ch_type>&, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
regex_replace(const basic_string<_Ch_type>& __s,
^
/usr/include/c++/4.8.2/bits/regex.h:2182:5: note: template argument deduction/substitution failed:
In file included from Dsuite.cpp:13:0:
Dsuite_fBranch.h:63:84: note: mismatched types ‘const std::basic_string<_Ch_type>’ and ‘const char [1]’
string treeNoBranchLengths = std::regex_replace(treeString,branchLengths,"");
^
make: *** [Build/Dsuite.o] Error 1

Can not make a Dsuite

Hello,
I can not make a Dsuite to compile properly.
gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
g++ -c -std=c++11 Dsuite.cpp -o Build/Dsuite.o
In file included from Dsuite.cpp:10:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from Dsuite.cpp:10:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -c -std=c++11 Dsuite_utils.cpp -o Build/Dsuite_utils.o
In file included from Dsuite_utils.cpp:9:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from Dsuite_utils.cpp:9:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -c -std=c++11 D.cpp -o Build/D.o
In file included from D.h:12,
from D.cpp:9:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from D.h:12,
from D.cpp:9:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -c -std=c++11 gzstream.cpp -o Build/gzstream.o
g++ -c -std=c++11 Dmin.cpp -o Build/Dmin.o
In file included from Dmin.h:11,
from Dmin.cpp:9:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from Dmin.h:11,
from Dmin.cpp:9:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -c -std=c++11 Dmin_combine.cpp -o Build/Dmin_combine.o
In file included from Dmin_combine.h:12,
from Dmin_combine.cpp:9:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from Dmin_combine.h:12,
from Dmin_combine.cpp:9:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -c -std=c++11 Dsuite_fBranch.cpp -o Build/Dsuite_fBranch.o
In file included from Dsuite_fBranch.h:13,
from Dsuite_fBranch.cpp:9:
Dsuite_utils.h:75:88: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
75 | er> double median(RandAccessIter begin, RandAccessIter end) throw(median_of_empty_list_exception) {
| ^~~~~

In file included from Dsuite_fBranch.h:13,
from Dsuite_fBranch.cpp:9:
Dsuite_utils.h:381:29: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]
381 | void calculateFinalDs() throw(const char*) {
| ^~~~~
g++ -std=c++11 Build/Dsuite.o Build/Dsuite_utils.o Build/D.o Build/gzstream.o Build/Dmin.o Build/Dmin_combine.o Build/Dsuite_fBranch.o -o Build/Dsuite -lz

It only creates *.o files

What can I do?

Thanks

Dsuite showing error with tree file

Dear Experts,

I'm trying to run Dsuite for the first time and I'm facing below error, although without using -t option and input tree the program is running well .

Dsuite Dtrios GB_snpS.vcf SETS.txt -t astal_species_TRoot.nwk -o treeGB

There are 112 sets (excluding the Outgroup)
Going to calculate D and f4-ratio values for 227920 trios
Out of Range error: map::at
species[i]: Species1
It seems that this species is in the SETS.txt file but can't be found in the tree. Please check the spelling and completeness of your tree file.

I checked the tree for spelling mistake or not but I didn't found anything like that . Also I checked for completeness of the tree it resulted that the The Binary Tree is not complete.
Therefore it is my humble request that it will be an immense help if someone kindly correct me and suggest how to overcome this issue.

Thanks,
Debajyoti

regex issue

Hello,

I am trying to install Dsuite on my university cluster I followed the install instructions and get the following error:

Dsuite_fBranch.h:63:84: error: no matching function for call to ‘regex_replace(std::string&, std::regex&, const char [1])’
         string treeNoBranchLengths = std::regex_replace(treeString,branchLengths,"");
                                                                                    ^
Dsuite_fBranch.h:63:84: note: candidates are:
In file included from /usr/include/c++/4.8.2/regex:62:0,
                 from Dsuite_utils.h:20,
                 from Dsuite.cpp:9:

Please can you advise how I proceed?

all the best.

Error when parsing tree with Fbranch

Hi,

Thanks for creating this package.

I am experiencing a similar issue as posted here (#13), but the solution was not provided:

##Command:
Dsuite Fbranch tree.nwk combined_tree.txt > f_branch.txt
##Error:
ERROR: The tree string could not be parsed correctly! The remaining unparsed tree string is:
(Outgroup,((AG-018,AG-024),((AA-006,AA-019),(AU-015,AA-015))));

I used the same tree file for the Dtrios and DtriosCombine commands and that successfully generated the combined_tree.txt file for the F_branch command. Sub-setting the data and removing the Outgroup didn't make any difference.

Any ideas what could be causing this error?

All the best,
Mike

Mixed haploid diploid datasets

Hi, I've been going over the formulas for the various statistics calculated by Dsuite. Evidently they are all tailored towards biallelic sites. I am therefore wondering if I am violating any assumptions by running these stats on a mixed haploid and diploid species level dataset (there is no way around it, we are sampling different life history states in species that alternate generations). At least two of the species for which we have haploid data appear to be playing a role in hybridization patterns, so I am keen to include them.

In the vcf file, the haploid species are genotyped as such. Dsuite offers no warnings and appears to calculate everything appropriately, and results appear to make biological sense (i.e. elevated D and f-ratios reflecting edges in a network showing shared genetic information at odds with ILS). So how is the haploid data treated, particularly in calculations that appear to explicitly demand biallelic sites (such as the f4-ratio)?

Really appreciate any insight on this before reading too much into results

Trev

different results with v0.3

I used Dsuite Dinvestigate from the version v0.1 and now I'm redoing my analysis with the new version, but I don't get the same results. I don't get the same number of windows and the positions of the windows are also different. I am using exactly the same dataset. What is the difference between the two versions? and why are the results different?

Floating Point exception

Hi,
I have prepared a vcf file and have removed all triallelic sites but am getting the following error message....
Would you know the best workaround

There are 5 sets (excluding the Outgroup)
Going to calculate D and f4-ratio values for 10 trios
Done permutations
Floating point exception (core dumped)

Counts don't contain info for split 1/2 of P3

Hi,

I ran the Dinvestigate, but most of the f_d and f_dm values seem to be 0.
I checked the log file and there were lot of Counts don't contain info for split 1 of P3 and Counts don't contain info for split 2 of P3 followed by P1 not finding the count for P3 allele.

Does this mean P3 allele information is not given so that Dsuite cannot count them in P1 and P2?
I am not too sure why would this happen as the vcf file seems to be in good format...
Could you tell me how I could fix this problem?

Kind regards,
Shane

Fbranch all 0 entries using v0.4

First of all, thanks for this useful and easy-to-use package!

I'm trying to calculate Fbranch statistics (using version 0.4, downloaded today) on a small tree with many significant D-statistics from Dtrios. When I run Fbranch, all of the entries are either 0 or nan. I know that nan's are comparisons that aren't applicable given my tree, but I am surprised that all of the rest of the entries are 0.

When I had previously run Fbranch using version 0.2 (I believe), several of the entries were non-zero, even with the same *_tree.txt input file and same tree.nwk tree. I am unsure about why the difference in version would change the results, even from the same Dtrios output-- and why my results from the new Fbranch are all zeroes. I also tried changing the p-value threshold, but all of the entries remain as 0s no matter what I try.

My command line for Fbranch is:
Dsuite Fbranch tree.nwk sets_021021_tree.txt > sets_021021_fbranch.txt

My tree is:
(Outgroup,(Lnil,((Lmar,Lang),(Lmic,Lsta))));

My sets_021021_tree.txt from Dtrios is:
P1 P2 P3 Dstatistic p-value f_G
Lang Lmar Lmic 0.135053 0 0.19006
Lmar Lang Lnil 0.0984446 0 0.0139214
Lmar Lang Lsta 0.233125 0 0.094233
Lmic Lang Lnil 0.109105 0 0.0159413
Lmic Lsta Lang 0.10392 0 0.13288
Lsta Lang Lnil 0.0922383 0 0.0136548
Lmic Lmar Lnil 0.0138281 7.77156e-16 0.00204724
Lsta Lmic Lmar 0.227573 0 0.248475
Lmar Lsta Lnil 0.00165132 0.312065 0.000271264
Lmic Lsta Lnil 0.0141116 1.49433e-08 0.00231811

And then the output from Fbranch is:
branch branch_descendants Outgroup Lnil Lmar Lang Lmic Lsta
b2 Lnil nan nan nan nan nan nan
b3 Lmar,Lang,Lmic,Lsta nan nan nan nan nan nan
b4 Lmar,Lang nan 0 nan nan nan nan
b5 Lmic,Lsta nan 0 nan nan nan nan
b6 Lmar nan 0 nan nan 0 0
b7 Lang nan 0 nan nan 0 0
b8 Lmic nan 0 0 0 nan nan
b9 Lsta nan 0 0 0 nan nan

Thanks in advance for help with this.

Best,
Jessica

Dsuit install error

Dear milla,
when install Dsuit in my Linux, the error like this:
Dmin.cpp: In function ‘int DminMain(int, char**)’:
Dmin.cpp:83:56: error: no matching function for call to ‘regex_replace(std::string&, std::regex&, const char [1])’
line = std::regex_replace(line,branchLengths,"");
^
I think it may be the difference between VS and g++, ''It's illegal to bind a non-const reference to a temporary. Historically however VS compilers have been less strict about this.''
How can I change this line:
'line = std::regex_replace(line,branchLengths,"") ' ?
Sorry for my ignorance about c++.
Thanks,
Cong

How does it treat heterozygous sites?

I haven't found any comment about that. Thanks.

BBAA.txt for only 1 trio

Hi,
I have been trying to use Dsuite but I have ran into some issues.
So in my vcf file I have 63 indivs belonging to 3 species/populations and 1 outgroup but when I run Dtrios I only get result for 1 trio and not all possible combinations...
Do you have any ideas why that might be?

Thanks so much for your help.

How to calculate Z-score using Dtrios' output?

Hello Milan,

As most published papers using Z-score (>3 or >2) to determine which trio is significant, I wonder if I can get the Z-score from the output file of Dtrios?
Is the "combine_stderr.txt" contains all the D-statistics calculated using Jackknife block under the arrangement that same with "combine.txt"? So, can I use these data to calculate Z-score?

millanek / dsuite Goto Github PK

dsuite's Introduction

Dsuite

Quickstart:

Input files:

Required files:

Optional files:

Piped VCF input:

Installation

Main program:

[Optional] Installing the python3 Fbranch plotting script

Commands (v0.5 r53):

Dsuite Dtrios - Calculate the D (ABBA-BABA) and f4-ratio statistics for all possible trios of populations/species

Output:

ABBA-clustering test:

DtriosParallel

Output:

DtriosCombine - Combine results from Dtrios runs across genomic regions (e.g. per chromosome)

Output:

Dinvestigate - Follow up analyses for trios with significantly elevated D: calculates D, f_d and f_dM in windows along the genome

Fbranch - A heuristic approach designed to aid the interpretation of many correlated f4-ratio results

Output:

Plotting Fbranch

(experimental) Dsuite Dquartets - Calculate the D (ABBA-BABA) and f4-ratio statistics for all possible quartets of populations/species (no outgroup)

Parallelisation with DtriosParallel

Change log:

dsuite's People

Contributors

Stargazers

Watchers

Forkers

dsuite's Issues

Thanks

Recommend Projects

Recommend Topics

Recommend Org