Coder Social home page Coder Social logo

autostreamtree's Introduction

autostreamtree

autoStreamTree is a Python package meant for analysing genomic variants (e.g., SNPs, microhaplotypes) within geospatial network contexts, such as river systems or other dendritic environments. It provides a way to visualize reach- or segment- level contributions given a matrix of pairwise distances.

An input geospatial network (as shapefile, geodatabase, or .gpkg) is first used to compute a minimally representative graph structure for all least-cost paths between your sampling locations. Pairwise genomic distances (e.g., Fst) are then 'fitted' to this graph by calculating the contribution of each graph segment, using the least-squares algorithm from Kalinowsky et al. (2008), similarly to the manner used to fit branch lengths within a neighbor-joining phylogeny.

autoStreamTree automates what was previously a manual process, while also integrating new functionality such as per-locus or per-site fitted distances, which can be used for downstream outlier analysis.

Table of Contents:

  1. Installation
  2. Usage
    1. Interface
    2. Inputs
    3. Genetic distances
    4. Outputs
  3. Example
  4. Downstream analysis
    1. Utility scripts
    2. Incorporating into workflows
  5. References
  6. Contributing Guidelines

1. Installation

Dependencies

autoStreamTree is a Python package which relies upon a number of dependencies:

- pandas
- numpy 
- geopandas
- pyogrio
- matplotlib
- seaborn
- pyproj
- networkx
- scikit-learn
- momepy
- mantel
- pysam
- sortedcontainers

autoStreamTree also uses methodology described in Kalinowski et al. 2008, if you use this software you should cite:

Kalinowski ST, MH Meeuwig, SR Narum, ML Taper (2008) Stream trees: a statistical method for mapping genetic differences between populations of freshwater organisms to the sections of streams that connect them. Canadian Journal of Fisheries and Aquatic Sciences (65:2752-2760)

Installation with conda/mamba

The recommended method of installation is with conda or mamba. First, create
and activate a new conda environment:

conda create -n streamtree python=3.10
conda activate streamtree

Then install the package:

mamba install -c ecoevoinfo -c conda-forge -c bioconda autostreamtree

If you require the development branch, you can then install from GitHub as
well:

git clone https://github.com/tkchafin/autostreamtree.git
cd autostreamtree 
pip install .

You should now be ready to go!

Note for Apple silicon (ARM) Macs

Currently there are some dependencies which can't be easily installed on Macs running the Apple ARM CPU architectures. If you are using an Apple Silicon (M1/M2) Mac there are some extra steps to run x86_64 (Intel-based) applications, which you can do via Rosetta 2. Here are the steps to set up and launch a terminal in Rosetta mode:

  1. Install Rosetta:

    • Open the Terminal app (found in Applications > Utilities).

    • Enter the following command and press Return:

      /usr/sbin/softwareupdate --install-rosetta --agree-to-license
      
    • If Rosetta is already installed, this command will have no effect. If it's not installed, you'll be prompted to agree to the license terms, and the installation will proceed.

  2. Find Terminal in Finder:

    • Open Finder.
    • Navigate to the Applications folder, then Utilities.
    • Find Terminal.
  3. Duplicate Terminal:

    • Right-click (or Ctrl-click) on Terminal.
    • Select Duplicate to create a copy of the Terminal application. This ensures you have a separate instance for Rosetta without altering your default Terminal.
  4. Enable Rosetta for the Duplicated Terminal:

    • Right-click (or Ctrl-click) on the duplicated Terminal.
    • Select Get Info.
    • In the Info window, check the box Open using Rosetta.
  5. Launch the Rosetta Terminal:

    • Open the duplicated Terminal application. This instance of Terminal will run under Rosetta, allowing you to run x86_64 architecture applications.

Once this is done, you can create a conda environment:

conda create -n streamtree
conda activate streamtree 
conda config --env --set subdir osx-64
conda install python=3.10

Then install autostreamtree:

conda install -c ecoevoinfo -c conda-forge -c bioconda autostreamtree

Running with Docker

Docker images are also built and hosted for all releases on DockerHub. To run using the latest release:

docker run --platform linux/amd64 -it tkchafin/autostreamtree:latest

If you require a certain release, you can reference the tag by changing the image name like so:

docker run -it tkchafin/autostreamtree:v1.1.2

2. Usage

Command-line interface

After installation, you can view the up-to-date help menu by calling the script with autostreamtree -h:


autostreamtree

Author: Tyler K Chafin, Biomathematics and Statistics Scotland
Description: Methods for analysing genetic distances in networks.

Mandatory arguments:
    -s, --shp       : Path to shapefile containing cleaned, contiguous stream reaches
                       (can also support geodatabase or GPKG files)
    -i, --input     : Input .tsv file containing sample coordinates
    -v, --vcf       : Input VCF file containing genotypes

General options:
    -o, --out       : Output prefix [default="out"]
    -O, --gdf_out   : Output driver for annotated geodataframe (options "SHP", "GPKG", "GDB")
    -C, --concat    : Concatenate all SNPs ("all"), by locus ("loc"), or not at all ("none")
    -n, --network   : Provide an already optimized network output from a previous run
    --overwrite     : Overwrite an input network (Only relevant with --network)
    -h, --help      : Displays help menu
    -r, --run       : Run which steps? Options:
        ALL         : Run all steps
        GENDIST     : Only calculate genetic distance matrix
        STREAMDIST  : Only compute pairwise stream distances
        DISTANCES   : Only compute GENDIST + STREAMDIST
        IBD         : GENDIST + STREAMDIST + Mantel test
        STREAMTREE  : GENDIST + STREAMDIST + fit StreamTree model
        RUNLOCI     : Run STREAMTREE fitting on each locus
    -p, --pop       : Pool individuals based on an input population map tsv file
        NOTE: The location will be taken as the centroid among individual samples
    -g, --geopop    : Pool individuals having identical coordinates
    -c, --clusterpop: Use DBSCAN algorithm to automatically cluster populations
    --reachid_col   : Attribute name representing primary key in shapefile [default="HYRIV_ID"]
    --length_col    : Attribute name giving length in kilometers [default="LENGTH_KM"]
    --seed          : Seed for RNG

Genetic distance options:
    -d, --dist      : Use which metric of distance? Options:
        Individual-based:
          PDIST     : Uncorrected p-distances [# Differences / Length]
        Frequency models (when using --pop):
          FST       : Weir and Cockerham's Fst formulation (=THETAst)
          LINFST    : [default] Rousset's (1997) Fst [=Fst/(1-Fst)]
          JOST      : Jost's (2008) D
          CHORD     : Cavalli-Sforza and Edwards (1967) chord distance
          --NOTE: Individual-based metrics can also be computed for
                    populations. You can set how these are aggregated w/ --pop_agg
          --NOTE: Multiple loci for PDIST
                  will be reported using the method defined in --loc_agg
    -G, --genmat    : Skip calculation and use the provided labeled .tsv matrix
    --coercemat     : [Boolean] Coerce negative values in input matrix to zero
    --het           : [Boolean] Count partial differences [e.g. ind1=T, ind2=W]
    --global_het    : Estimate Ht using global frequencies (default is averaged over pops)

DBSCAN options (only when --clusterpop):
    --min_samples   : Minimum samples per cluster [default=1]
    --epsilon       : Maximum distance (in km) within a cluster [default=20]

Aggregation options:
    -P, --pop_agg   : Define aggregator function for certain genetic distances in pop samples
    -L, --loc_agg   : Define aggregator function for aggregating locus-wise distances
        All of these can take the following options:
          ARITH     : [default] Use arithmetic mean
          MEDIAN    : Use median distance
          HARM      : Use harmonic mean
          ADJHARM   : Adjusted harmonic mean (see docs)
          GEOM      : Use geometric mean
          MIN       : Use minimum distance
          MAX       : Use maximum distance

IBD options:
    --perm          : Number of permutations for mantel test [def=1000]
    --and_log       : Also perform IBD steps with log geographic distances

StreamTree options (see Kalinowski et al. 2008) :
    -w, --weight    : Desired weighting for least-squares fitting:
        Options:
          FM67      : Fitch and Margoliash (1967) [w = 1/D^2]
          BEYER74   : Beyer et al. (1974) weights [w = 1/D]
          CSE67     : [default] Cavalli-Sforza & Edwards (1967) [w = 1]

The main workflow control argument is -r/--run, which tells which steps of the pipeline you want to run. The options for this are:

  • -r GENDIST: Only calculates a pairwise genetic distance matric for input samples
  • -r STREAMDIST: Only calculates a pairwise matrix of hydrologic (stream) distances
  • -r DISTANCES: Computes both of the above
  • -r IBD: Computes both distance matrices and performes a simple Mantel test for isolation-by-distance
  • -r STREAMTREE: Computes distance matrices and fits the StreamTree model
  • -r ALL: All of the above
  • -r RUNLOCI: All of the above, but fitting StreamTree models to each individual locus

Some of these are included only for convenience -- for example, there are much more comprehensive packages out there for computing genetic distance matrices, and I don't recommend the Mantel test as the sole means of testing for isolation-by-distance -- and in most cases I think -r ALL or -r STREAMTREE will be what you want.

Many of the runtime options relate to specific components of the required inputs, which I will discuss below.

Input files

Depending on the workflow you are running, some of these files may not be required (for example, computing genetic distances only will not require an input geodatabase).

Genotype data

The intended input format for the genotype data is the widely used VCF format. Some examples are provided in the example data/ directory, which were produced by the program ipyrad:

##fileformat=VCFv4.0
##fileDate=2021/03/21
##source=ipyrad_v.0.9.68
##reference=pseudo-reference (most common base at site)
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  63burk03        63cburk01       >
RAD_0   28      loc0_pos27      G       A       13      PASS    .       GT:DP:CATG      0/0:81:0,0,0,81 >
RAD_0   76      loc0_pos75      C       T       13      PASS    .       GT:DP:CATG      0/0:81:81,0,0,0 >
...
...

Parsing is faster with compressed and indexed files (and will be more space efficient), so I recommend keeping these compressed with bgzip and indexed with tabix:

# compress
bgzip test_sub100.vcf 

# index
tabix test_sub100.vcf.gz

This file should be provided using -v/--vcf.

Individual metadata

Individual metadata will be needed in the form of coordinates, which can be passed as a simple tab-delimited text file with header (see example in data/test.coords):

sample  lat     long
63burk03        26.925414       90.396139
63cburk01       26.92383        90.39815
63cburk02       26.92383        90.39815
63cdikc01       27.2676 90.04778
63cdikc02       27.2676 90.04778
63cdikc03       27.2676 90.04778
63cdolk01       26.88353        90.34272
63cdolk03       26.88353        90.34272
63cdort03       26.86776        89.37437
...
...

For population-level analysis, there are currently three ways in which you can define populations for population-wise analysis. One option (-g/--geopop) will group any samples into populations which "snap" to the same stream node (see below).

A third option (-c,--clusterpop) will automatically cluster geographically similar individuals using the DBSCAN algorithm in scikit-learn, using great-circle geographic distances (i.e., this is not informed by stream distances calculated as a part of some workflows). Two relevant options are provided for manipulating the DBSCAN results:

	--min_samples	: Minimum samples per cluster [default=1]
	--epsilon	    : Maximum distance (in km) within a cluster [default=20]

or using an input file providing population assignments for each individual, again as a simple tab-delimited text file (see example data/test.popmap), passed via -p/--pops:

63burk03        burk
63cburk01       burk
63cburk02       burk
63dakp01        dakp
63dakp02        dakp
63dakp03        dakp
63dakp04        dakp
63dakp05        dakp
...
...

Geodatabase

The input stream network can be provided as a shapefile, geodatabase, or GPKG file, all passed uing the -s/--shp option. There are a number of requirements for this file in order for the result to create a valid network. I highly recommend using the existing global stream datasets provided by the HydroLab group at McGill University, specifically the HydroAtlas or free-flowing rivers dataset as these are already properly formatted for use, and the additional variables included will make downstream analysis very easy.

If for some reason you cannot use the HydroRIVERS dataset, you will need to do some things first before loading your shapefile into autoStreamTree. First, you will need to include two variables in the attribute table of your shapefile: 1) REACH_ID (case sensitive) must provide a unique identifier to each stream reach; and 2) LENGTH_KM should give the length of each segment. Next, because sometime large stream layers will have small gaps in between streams, you will need to span any small gaps between streams which should be contiguous, and also dissolve any lines that overlap with one another so that any given section of river is represented by a single line. I will provide a tutorial for doing this in ArcMAP later, but for now there are some scripts in our complementary package that can help with these steps using the ArcPy API: https://github.com/stevemussmann/StreamTree_arcpy. Note that this package will also help you in running the original Stream Tree package on Windows, if you want to do so.

Note that a valid path is required between all sites in order to calculate pairwise stream distances. Thus, if you are analyzing data from multiple drainages which only share an oceanic connection, you will need to augment the shapefile. For example this could be accomplished by adding a vector representing the coastline to create an artificial connection among drainages.

Genetic distances

There are a number of existing (and more comprehensive) packages out there for computing pairwise genetic distance matrices. If you already have this (either at individual or population level), you can provide a labelled matrix as a tab-delimited text file with --genmat. An example of how this should be formatted may be found in data/test.popGenDistMat.txt.

For convenience, a number of options are built-in, which can be selected using the -d,--dist argument:

    -d, --dist      : Use which metric of distance? Options:
        Individual-based:
          PDIST     : Uncorrected p-distances [# Differences / Length]
        Frequency models (when using --pop):
          FST       : Weir and Cockerham's Fst formulation (=THETAst)
          LINFST    : [default] Rousset's (1997) Fst [=Fst/(1-Fst)]
          JOST      : Jost's (2008) D
          CHORD     : Cavalli-Sforza and Edwards (1967) chord distance
          --NOTE: Individual-based metrics can also be computed for
                    populations. You can set how these are aggregated w/ --pop_agg
          --NOTE: Multiple loci for PDIST, and JC69 distances
                  will be reported using the method defined in --loc_agg

For most use cases, I would suggest -d LINFST which will compute a "linearised" version of Weir and Cockerham's Theta-ST.

Optionally, the user can also opt to aggregate individual-based distance measures, either those provided (p-distances) or from an input matrix that is only at the individual level. This can be provided using the --pop_agg argument, with any of the following options available:

Aggregation options:
    -P, --pop_agg   : Define aggregator function for certain genetic distances in pop samples
    -L, --loc_agg   : Define aggregator function for aggregating locus-wise distances
        All of these can take the following options:
          ARITH     : [default] Use arithmetic mean
          MEDIAN    : Use median distance
          HARM      : Use harmonic mean
          ADJHARM   : Adjusted harmonic mean (see docs)
          GEOM      : Use geometric mean
          MIN       : Use minimum distance
          MAX       : Use maximum distance

Another useful feature is the ability to concatenate the input variant data -- this can be done either globally (for example if you want to only compute a global p-distance) with -C all, or if you have phased data used to form "pseudo" microhaplotypes using -C loc, which will group variants using the CHROM field in the input VCF. Note that if using a program such as ipyrad (e.g., for RADseq data), the entries in this field will represent 'independent' loci -- if you have data aligned to a genome and want to define loci in some way you will need to insert this information into the CHROM field as a pre-processing step.

Outputs

If running the full workflow, the first thing autoStreamTree will do upon reading your input geodatabase is to calculate a minimally reduced sub-network which collapses the input river network into continuous reaches (="edges"), with nodes either representing sample localities or junctions. Because the full river network will likely contain many branches and contiguous reaches which do not contain samples, these are removed to speed up computation. The underlying metadata will be preserved, and the final output will consist of an annotated shapefile containing an EDGE_ID attribute which tells you how reaches were dissolved into contiguous edges in the graph, and a FittedD attribute giving the least-squares optimized distances.

The reduced sub-network will be plotted for you in a file called out.subGraph.pdf.

Here, the total cumulative stream length (in km) is plotted along edges (NOTE: Any natural curvature in the river is not preserved in this plot), with sample sites as blue dots and junctions as black dots. A geographically accurate representation, coloring individual streams to designate different dissolved edges, will be provided as out.streamsByEdgeID.pdf.

After fitting genetic distances, autoStreamTree will create several other outputs. First, a table called out.reachToEdgeTable.txt will give a tab-delimited map of how REACH_ID attributes were dissolved into contiguous edges. Second, a tabular and graphical representation of how fitted pairwise distances compare to the raw calculates (or user-provided) pairwise distances: out.obsVersusFittedD.txt and out.obsVersusFittedD.pdf

Finally, the fitted distances per stream edge will be output both as an added column to the original shapefile attribute table (out.streamTree.shp and out.streamTree.txt), and also as a plot showing how distances compare across all streams.

3. Example

If you have just installed autoStreamTree and are running it for the first time, I recommend you first run the example analysis using the provided files in the autostreamtree/data directory. These include all necessary inputs, and a geodatabase. To run the full workflow on the example data, simply use:

# Change directories to the autoStreamTree repository
cd autostreamtree 

# run the analysis
autostreamtree -s autostreamtree/data/test.shp -i autostreamtree/data/test.coords -v autostreamtree/data/test_sub100.vcf.gz -p autostreamtree/data/test.popmap -r ALL --reachid_col "HYRIV_ID" --length_col "LENGTH_KM" -o test

This will produce a number of output text files and plots using the prefix provided with -o.

4. Downstream analysis

autoStreamTree is designed to aid in the downstream analysis of genetic differentiation, adaptation, and ecological aspects of aquatic species, although it is versatile enough to be applicable to any dataset structured as a network.

A key use case involves the examination of locus-wise genetic distances across network segments. This is efficiently done by importing the geodatabase output of autoStreamTree using the -r LOC option into R or a similar analytical platform. An example of analysing variation among fitted locus-wise distances as a function of environmental variables included in the HydroATLAS dataset is provided as a NextFlow pipeline, using forward-selection to reduce features and redundancy analysis (RDA) implemented in ecopy, can be found here.

Utility scripts

autoStreamTree is currently distributed with two utility scripts: networkDimensions.py and streeToDendrogram.py. We expect this list to grow, and if you need help writing any specific utilities using autoStreamTree outputs, please don't hesitate to get in touch!

As with the autoStreamTree CLI, when installed via pip or conda/mamba, these can be accessed via front-ends networkDimensions and streeToDendrogram, both having help menus displayed by calling with -h

networkDimensions

networkDimensions is a simple script to capture the spatial dimensions of your minimized network, including the total length, and the area of a the minimally bounding convex hull.

The options are:

(autostreamtree) tyler@Tylers-MacBook-Pro-2 autostreamtree % networkDimensions -h
usage: networkDimensions [-h] network_file

Calculate total lineage length and bounding box area of a network.

positional arguments:
  network_file  Path to the pickled Networkx graph file.

options:
  -h, --help    show this help message and exit

Simply provide the output .network file created by autoStreamTree.

streeToDendrogram

streeToDendrogram makes it easier to visualise the fitted distances as a dendrogram (rather than a colored spatial network). The options are:

(autostreamtree) tyler@Tylers-MacBook-Pro-2 autostreamtree % streeToDendrogram -h
usage: streeToDendrogram [-h] --shp SHP --points POINTS [--pop POP] [--edge_id EDGE_ID]
                         [--dist DIST] [--out OUT]

Plot a stream tree with fittedD attribute as dendrogram

options:
  -h, --help         show this help message and exit
  --shp SHP          Network as shapefile
  --points POINTS    Sample coordinates
  --pop POP          Population map file (optional)
  --edge_id EDGE_ID  Edge ID attribute.
  --dist DIST        Dist attribute.
  --out OUT          Output prefix

Incorporating into workflows

All utilities from autoStreamTree are intended to be easy to run within an automated pipeline. An example of building these into a SnakeMake workflow can be found in the Open Science Framework repository (doi: 10.17605/OSF.IO/4UXFJ).

For example, in the run_spd.smk file, you will find an example of fitting distances for regional subsets of samples for an empirical ddRADseq dataset, and then performing downstream environmental resistance modelling on each using ResistNet

5. References

Citations for autoStreamTree methods

Below is a full list of citations for the various methods used. Apologies to anyone I missed - please let me know if you notice any discrepancies.

  • Beyer WM, Stein M, Smith T, Ulam S. 1974. A molecular sequence metric and evolutionary trees. Mathematical Biosciences. 19: 9-25.
  • Cavalli-Sforza LL, Edwards AWF. 1967. Phylogenetic analysis: model and estimation procedures. American Journal of Human Genetics. 19: 233-257.
  • Ester M, Kriegel HP, Sander J, Xu X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. IN: Simoudis E, Han J, Fayyad UM. (eds.). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231.
  • Felsenstein J. 2004. Inferring Phylogenies: Chapter 11. Sunderland: Sinauer.
  • Fitch WM, Margloiash E. 1967. Construction of phylogenetic trees. Science. 155: 279-84.
  • Hagberg A, Swart P, S Chult D. 2008. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Lab.(LANL), Los Alamos, NM
  • Hedrick PW. 2005. A standardized genetic differentiation measure. Evolution. 59: 1633–1638
  • Jordahl K. 2014. GeoPandas: Python tools for geographic data. URL: https://github.com/geopandas/geopandas.
  • Jost L. 2008. Gst and its relatives do not measure differentiation. Molecular Ecology. 17: 4015-4026.
  • Kalinowski ST, MH Meeuwig, SR Narum, ML Taper (2008) Stream trees: a statistical method for mapping genetic differences between populations of freshwater organisms to the sections of streams that connect them. Canadian Journal of Fisheries and Aquatic Sciences (65:2752-2760)
  • Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution. 16(2): 111-120.
  • Mantel N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27(2): 209-220.
  • Meirmans PG, Hedrick PW. 2011. Assessing population structure: Fst and related measures. Molecular Ecology Resources. 11: 5-18.
  • Nei M. 1972. Genetic distance between populations. American Naturalist. 106: 283-292.
  • Nei M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York
  • Nei M, Chesser RK. 1983. Estimation of fixation indices and gene diversities. Annals of Human Genetics 47(3): 253-259.
  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. 2011. Scikit-learn: Machine learning in Python. The Journal of machine Learning research. 1(12):2825-30
  • Rossmann LA. DFLOW User's Manual. U.S. Environmental Protection Agency.[For description of zero-adjusted harmonic mean]
  • Rousset F. 1997. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics. 145: 1219-28.
  • Tajima F, Nei M. 1984. Estimation of evolutionary distance between nucleotide sequences. Molecular Biology and Evolution 1:269-285
  • Tamura K, Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution. 10(3):512-526.
  • Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution. 38: 1358-1370.

Other reading

Here are some recommended readings and resources:

  • Comte L, Olden JD. 2018. Fish dispersal in flowing waters: A synthesis of movement- and genetic-based studies. Fish and Fisheries. 19(6): 1063-1077.
  • Comte L, Olden JD. 2018. Evidence for dispersal syndromes in freshwater fishes. Proceedings Royal Society: B. 285(1871):
  • Grill, G., Lehner, B., Thieme, M. et al. 2019. Mapping the world’s free-flowing rivers. Nature. 569:215–221.
  • Linke S, Lehner B, Ouellet Dallaire C. et al. 2019. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Sci Data 6, 283
  • Meffe GK, Vrijenhoek RC. 1988. Conservation genetics in the management of desert fishes. Conservation Biology. 2(2):157-69.
  • Meirmans PG. 2012. The trouble with isolation by distance. Molecular Ecology 21(12): 2839-46.
  • Sere M, Thevenon S, Belem AMG, De Meeus T. 2017. Comparison of different genetic distances to test isolation by distance between populations. 2017. 119(2):55-63.
  • Thomaz AT, Christie MR, Knowles LL. 2016. The architecture of river networks can drive the evolutionary dynamics of aquatic populations. Evolution. 70(3): 731-739.
  • Tonkin JD, Altermatt F, Finn DS, Heino J, Olden JD, Pauls SU, Lytle DA. 2017. The role of dispersal in river network metacommunities: Patterns, processes, and pathways. Freshwater Biology. 61(1): 141-163.
  • Wright S. 1965. Isolation by distance. Genetics. 28: 114-138.

5. Contributing

If you are interested in contributing to autoStreamTree, first of all thank you! Any form of contribution is always welcome, be that bug reports, code review, bug fixes, or even adding new features! For reference, API documentation can be found at https://tkchafin.github.io/autostreamtree/, which will be re-built upon each update to the repository.

If you encounter a problem in using or installing autoStreamTree, please post to the GitHub Issues page. If you do so, please be sure to share the full steps which led to your problem, and the full output (including any error messages). In some instances it may be necessary to see your input files.

If you would like to contribute any changes to the code, just follow these steps!

  1. Fork the repository
  2. Make changes
  3. Submit a pull request

That's it! If you have an idea for a feature that would be helpful for your research, but aren't sure how to implement it, this can be logged using the GitHub Issues page linked above and I'll do my best to help!

autostreamtree's People

Contributors

tkchafin avatar

Stargazers

 avatar

Watchers

 avatar  avatar

autostreamtree's Issues

[JOSS REVIEW] Suggestions for Documentation

This is a part of the JOSS review (openjournals/joss-reviews#6160)

  1. A statement of need

The opening section of the README seems inadequate for a comprehensive statement of need.

`autoStreamTree` is a Python package meant for analysing genomic variant (e.g., SNPs, microhaplotypes) differentiation in geospatial networks such as riverscapes.
An input geospatial network (as shapefile, geodatabase, or .gpkg) is first used to compute a minimally representative graph structure for all least-cost paths between your sampling locations. Pairwise genomic distances (e.g., Fst) are then 'fitted' to this graph by calculating the contribution of each graph segment, using the least-squares algorithm from Kalinowsky et al. (2008).
`autoStreamTree` automates what was previously a manual process, while also integrating new functionality such as per-locus or per-site fitted distances, which can be used for downstream outlier analysis.

I recommend that the authors enhance it by incorporating the statement of need from their paper. An introduction section specifically tailored for the statement of need in the README would be beneficial.

  1. Installation instructions

There appears to be a redundancy in the installation instructions. The content found here:

autostreamtree/README.md

Lines 53 to 60 in e86d8af

Simple installation of the development version may be done simply as:
```
git clone https://github.com/tkchafin/autostreamtree.git
cd autostreamtree
pip install .
```

is repeated later:

autostreamtree/README.md

Lines 78 to 85 in e86d8af

If you require the development branch, you can then install from GitHub as \
well:
```
git clone https://github.com/tkchafin/autostreamtree.git
cd autostreamtree
pip install .
```

I suggest removing the first instance to avoid duplication.

Regarding ARM Mac installations, I am unable to test them personally. However, I noticed the commands for mamba installation:

autostreamtree/README.md

Lines 125 to 128 in e86d8af

conda create -n streamtree
conda activate streamtree
conda config --env --set subdir osx-64
conda install python=3.10 mamba

According to the latest guidelines, installing mamba via conda is not recommended and should only be installed in the base environment.

Additionally, mamba now supports ARM64, so updating the installation instructions for mamba would be appropriate.

  1. Example usage

When attempting the example analysis with this command:

autostreamtree -s autostreamtree/data/test.shp -i autostreamtree/data/test.coords -v autostreamtree/data/test_sub100.vcf.gz -p autostreamtree/data/test.popmap -r ALL --reachid_col "HYRIV_ID" --length_col "LENGTH_KM" -o test

I encountered the following error:

Traceback (most recent call last):
  File "/home/xinhuang/software/mambaforge/envs/autostreamtree-dev/bin/autostreamtree", line 8, in <module>
    sys.exit(main())
  File "/home/xinhuang/software/mambaforge/envs/autostreamtree-dev/lib/python3.10/site-packages/autostreamtree/cli.py", line 28, in main
    np.random.seed(clock_seed)
  File "numpy/random/mtrand.pyx", line 4806, in numpy.random.mtrand.seed
  File "numpy/random/mtrand.pyx", line 250, in numpy.random.mtrand.RandomState.seed
  File "_mt19937.pyx", line 168, in numpy.random._mt19937.MT19937._legacy_seeding
  File "_mt19937.pyx", line 182, in numpy.random._mt19937.MT19937._legacy_seeding
ValueError: Seed must be between 0 and 2**32 - 1

This issue may need to be addressed for successful example replication.

  1. Functionality documentation

I was unable to locate any API documentation. It's important for the authors to add this as it is a requirement of the journal.

  1. Automated tests

Running pytest resulted in several warnings

tests/test_cli.py::test_autostreamtree_outputs
tests/test_cli.py::test_autostreamtree_output_shp
tests/test_cli.py::test_autostreamtree_outputs_gendist
tests/test_cli.py::test_autostreamtree_outputs_sdist
tests/test_functions.py::test_read_network
  /home/xinhuang/software/mambaforge/envs/autostreamtree-dev/lib/python3.10/site-packages/momepy/utils.py:252: UserWarning: Geometry is in a geographic CRS. Results from 'length' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

    gdf_network[length] = gdf_network.geometry.length

tests/test_cli.py::test_autostreamtree_outputs
tests/test_cli.py::test_autostreamtree_outputs
tests/test_cli.py::test_autostreamtree_output_shp
tests/test_cli.py::test_autostreamtree_output_shp
  /home/xinhuang/software/mambaforge/envs/autostreamtree-dev/lib/python3.10/site-packages/mantel/_test.py:217: DeprecationWarning: `np.math` is a deprecated alias for the standard library `math` module (Deprecated Numpy 1.25). Replace usages of `np.math` with `math`
    n = np.math.factorial(m)  # number of possible matrix permutations

tests/test_cli.py::test_autostreamtree_output_shp
  /home/xinhuang/software/mambaforge/envs/autostreamtree-dev/lib/python3.10/site-packages/pyogrio/raw.py:530: RuntimeWarning: Creating a 256th field, but some DBF readers might only support 255 fields
    ogr_write(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

Addressing these warnings could enhance the software's reliability.

Additionally, I recommend integrating code coverage reporting with a service like codecov. This would provide a clearer understanding of the extent to which the unit tests cover the codebase.

[JOSS-REVIEW] autostreamtree install fails due to pysam

Running in Docker container

docker run -it --rm condaforge/miniforge3

Followed installation instructions,

(streamtree) root@79da57b9716d:/# mamba install -c ecoevoinfo -c conda-forge -c bioconda autostreamtree

Looking for: ['autostreamtree']

ecoevoinfo/linux-aarch64 (check zst)                Checked  0.2s
ecoevoinfo/noarch (check zst)                       Checked  0.1s
warning  libmamba Could not parse mod/etag header
warning  libmamba Could not parse mod/etag header
bioconda/linux-aarch64 (check zst)                  Checked  0.1s
bioconda/noarch (check zst)                         Checked  0.2s
ecoevoinfo/linux-aarch64                           125.0 B @   1.3kB/s  0.1s
bioconda/linux-aarch64                             854.0 B @   5.7kB/s  0.2s
ecoevoinfo/noarch                                    1.4kB @   7.1kB/s  0.2s
conda-forge/noarch                                  13.3MB @  27.5MB/s  0.5s
conda-forge/linux-aarch64                           10.2MB @  16.0MB/s  0.7s
bioconda/noarch                                      5.1MB @   4.6MB/s  1.0s

Pinned packages:
  - python 3.10.*


Could not solve for environment specs
The following package could not be installed
└─ autostreamtree is not installable because it requires
   └─ pysam, which does not exist (perhaps a missing channel).

Update

  • Install pysam via pip but still failing to install autostreamtree
conda install pip
pip install pysam

JOSS-REVIEW

[JOSS REVIEW] Suggestions for Software Paper

This is a part of the JOSS review (openjournals/joss-reviews#6160)

  1. A statement of need

In the provided text, it seems the phrase "reach- level within the spatial network" should be corrected to "river- level within the spatial network."

Despite this, quantitative frameworks for modelling the relationships between evolutionary and ecological processes (e.g., through spatio-genetic associations) are predominantly focused on landscapes, and as such often involving mechanistic assumptions which translate poorly to networks. We address this limitation by providing a novel package, autoStreamTree, that facilitates network modeling of genome-scale data. It first computes a graph representation from spatial databases, then analyses individual or population-level genetic data to ‘fit’ distance components at the stream- or reach- level within the spatial network. Doing so within a network context allows the explicit coupling of genetic variation with other network characteristics (e.g., environmental covariates), in turn promoting a downstream statistical process which can be leveraged to understand how those features drive evolutionary processes (e.g., dispersal/ gene flow). We demonstrate the utility of this approach with a case study in a small stream-dwelling fish in western North America.

  1. State of the field

It is noted that a "State of the field" section is absent in the manuscript. As this is a requirement for the journal, it is recommended that the authors include this section to provide context and relevance of the work in the current research landscape.

  1. Quality of writing

In the following text, the hyphens in the words "de-rived" and "correla-tions" should be removed for proper spelling.

Pairwise genetic distances from VCF-formatted genotypes are de-rived among individuals, sites, or populations (via a priori user-specifications). Options for sequence- and frequency-based statistics are provided (`-d/--dist`). Mantel tests are available to quantify correla-tions among genetic/ hydrologic distance matrices. The primary workflow is a least-squares procedure analogous to that used to compute branch lengths within a neighbor-joining phylogenetic tree [@Kalinowski2008]. The procedure fits components of the genetic matrix to *k*-segments in a network, such that fitted distance values (*r*) for each segment separating two populations will sum to the observed pairwise matrix value. This provides a distance (*r*~k~) for each of *k*-segments as the genetic distance ‘explained’ by that segment.

A tilde is missing in the term (linearized *F*~ST) from the following text.

Runtimes are reported for a 2021 Macbook Pro, 16GB memory, 3.2GHz M1 CPU. Time required to calculate/extract a minimal sub-graph containing 118 dissolved edges from RiverATLAS (North America shapefile totaling 986,463 original vertices) was 35min. Computing pairwise hydrologic distances required an additional 3sec. Pairwise population genetic distances were computed in ~24min (linearized *F*~ST), with Mantel test and distance fitting requiring 11sec and 10sec, respectively. Re-running the entire pipeline per-locus (i.e., `-r RUNLOC`) took 3h 34min. Fitted-*F*~ST~ for autoStreamTree (\autoref{fig:example}) matched that re-calculated using the @Kalinowski2008 method (adjusted *R*^2 = 0.9955; *p* < 2.2e-16). However, due to runtime constraints and manual pre-processing for the latter, per-locus distances were not attempted. The pRDA selected four variables, with 221 SNPs and 9 edges as outliers (\autoref{fig:example}).

The correct notation should be (linearized *F*~ST~)

Regarding the mention of the U.S. Fish & Wildlife Service in the acknowledgment section, it would be clearer if the authors either provide direct links to the relevant websites or explicitly state that they are acknowledging the U.S. Fish & Wildlife Service without providing links.

For Figure 1, I recommend adding a label "F_ST distances" adjacent to the color bar to enhance reader comprehension of the color coding in Figure 1A.

  1. References

The reference citations [Hagberg2008] and [Weir1984] are missing the "@" symbol. They should be corrected to [@Hagberg2008] and [@Weir1984] in the following text, respectively.

autoStreamTree employs the Python networkx library [Hagberg2008] to parse geospatial input (i.e., large stream networks) into a graph structure with stream segments as edges, sampling locations as endpoints, and river junctions as nodes. Sample data comprise a tab-delimited table of latitude/longitude coordinates, genome-wide variant data in VCF format, and (optionally) a tab-delimited population map. The data structure ‘graph’ on which downstream computations are performed is built as follows: 1) Sample points are ‘snapped’ to nearest river network nodes (i.e., defining endpoints); 2) Shortest paths are identified between each set of endpoints [@Dijkstra1959]; and 3) A minimal network of original geometries, with contiguous edges derived by joining individual segments with junctions (nodes) retained that fulfill shortest paths.

Users may also provide pre-computed genetic distance matrices either directly at individual or locus levels. Built-in options are provided to concatenate single-nucleotide polymorphisms (SNPs) either globally, or per contig/scaffold. Individual-level statistics include uncorrected *p*-distances (i.e., proportion of nucleotide differences/alignment length), aggregated by site- or at population-level (e.g., as median, arithmetic mean, or adjusted harmonic mean [@Rossman1990]), or computed as distances via several frequency-based methods (e.g., Chord distance [@CavalliSforza1967]; *F*~ST~ [Weir1984]). autoStreamTree can also be computed per-locus by specifying `-r RUNLOCI`, and with `-c LOC` in the case of phased data to treat linked SNPs to microhaplotypes.

The reference Hagberg2008 is not listed in the reference section. It should be added to maintain citation accuracy. For the VCF format, the authors should consider citing the paper by Danecek et al. 2011 to provide a comprehensive background and reference for this format.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.