Exploration and testing of methods to identify paralogs in GBS data
This repository includes scripts, data, and drafts pertaining to methods for sorting GBS/RAD tags into paralogous loci. The dataset being used is from Miscanthus sacchariflorus, which contains both diploid and tetraploid individuals. Groups of tags were identified that aligned to one location in the Sorghum bicolor v3 reference genome, but two locations in the Miscanthus sinensis v7 reference genome, where those two locations are on appropriate chromosomes given the known synteny between S. bicolor and M. sinensis.
This code is archived on Zenodo:
marker_CSV/190513miscanthus_sorghum_match.csv
contains TagDigger output listing every
tag location from the M. sinensis reference and how it matches, if at all,
to the S. bicolor reference. SAM files for TagDigger were generated by
Bowtie2 as part of the TASSEL-GBSv2 pipeline and are available at the
Illinois Data Bank.
marker_CSV/190513paralogs.csv
is filtered from the above file to contain only pairs
of M. sinensis tag locations that align to the same S. bicolor location
with appropriate synteny.
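
The pairing logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/filter_markers.R: the column names (`msi_chrom`, `msi_pos`, `sb_chrom`, `sb_pos`) and the synteny table are hypothetical placeholders, and the real TagDigger output may be laid out differently.

```python
from collections import defaultdict

# Placeholder synteny table: which pair of M. sinensis chromosomes is expected
# for each S. bicolor chromosome. These labels are illustrative, not real data.
SYNTENY = {"Sb01": {"Msi01", "Msi02"}, "Sb02": {"Msi03", "Msi04"}}


def find_paralog_pairs(rows, synteny=SYNTENY):
    """Return {sorghum_location: [miscanthus_locations]} for syntenic pairs.

    Each row is a dict with hypothetical keys 'msi_chrom', 'msi_pos',
    'sb_chrom', 'sb_pos' ('sb_chrom' is None when the tag had no match).
    """
    by_sorghum = defaultdict(list)
    for row in rows:
        if row["sb_chrom"] is not None:
            by_sorghum[(row["sb_chrom"], row["sb_pos"])].append(
                (row["msi_chrom"], row["msi_pos"])
            )

    pairs = {}
    for sb_loc, msi_locs in by_sorghum.items():
        chroms = {chrom for chrom, _ in msi_locs}
        # Keep exactly two Miscanthus locations, one on each expected chromosome.
        if len(msi_locs) == 2 and chroms == synteny.get(sb_loc[0]):
            pairs[sb_loc] = sorted(msi_locs)
    return pairs


rows = [
    {"msi_chrom": "Msi01", "msi_pos": 1200, "sb_chrom": "Sb01", "sb_pos": 100},
    {"msi_chrom": "Msi02", "msi_pos": 3400, "sb_chrom": "Sb01", "sb_pos": 100},
    {"msi_chrom": "Msi03", "msi_pos": 500, "sb_chrom": "Sb02", "sb_pos": 50},
    {"msi_chrom": "Msi03", "msi_pos": 900, "sb_chrom": "Sb02", "sb_pos": 50},
    {"msi_chrom": "Msi04", "msi_pos": 700, "sb_chrom": None, "sb_pos": None},
]
pairs = find_paralog_pairs(rows)
```

In this toy input only the Sb01 location survives: the Sb02 location has two hits, but both fall on the same Miscanthus chromosome, so it fails the synteny check.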
marker_CSV/190515paralog_tags.csv
contains tag sequences for all tag locations listed
in the above file.
scripts/hindhe_by_ploidy.R
imports M. sacchariflorus read depths from a VCF and
plots the distribution of Hind/He by ploidy. The resulting figure is an older
one that was used in presentations but is not included in the manuscript.
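
For readers unfamiliar with the statistic, a minimal sketch of a per-individual Hind/He calculation at a single locus is given below. It assumes that Hind is the probability that two reads sampled without replacement from one individual are different alleles, and that He is the expected heterozygosity computed from allele frequencies estimated as the mean within-individual read proportion. The function name and input layout are hypothetical, and the scripts in this repository may compute the statistic differently.

```python
def hind_he(depths):
    """Per-individual Hind/He at one locus.

    depths: list of per-individual lists of allele read counts.
    Returns a list with None where Hind/He is undefined
    (fewer than two reads, or no expected heterozygosity).
    """
    n_alleles = len(depths[0])
    # Allele frequencies: mean across individuals of read-depth proportions.
    props = [[d / sum(ind) for d in ind] for ind in depths if sum(ind) > 0]
    if not props:
        return [None] * len(depths)
    freqs = [sum(p[a] for p in props) / len(props) for a in range(n_alleles)]
    he = 1 - sum(f * f for f in freqs)

    out = []
    for ind in depths:
        total = sum(ind)
        if total < 2 or he == 0:
            out.append(None)
            continue
        # Probability that two reads drawn without replacement match.
        same = sum(d * (d - 1) for d in ind) / (total * (total - 1))
        out.append((1 - same) / he)
    return out


# Two apparent homozygotes and one apparent heterozygote at a biallelic locus.
ratios = hind_he([[10, 0], [0, 10], [5, 5]])
```

Averaging such values across loci within each individual, then grouping individuals by ploidy, gives the kind of distribution the script plots.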
scripts/filter_markers.R
imports marker_CSV/190513miscanthus_sorghum_match.csv
and filters it down to loci that are clearly paralogous, outputting
marker_CSV/190513paralogs.csv.
scripts/get_tag_seq.R
imports marker_CSV/190513paralogs.csv and a large
SAM file containing alignments of M. sacchariflorus tags to the
M. sinensis reference, then exports the sequences of tags for all
selected markers to marker_CSV/190515paralog_tags.csv.
scripts/import_tagtaxadist.R
imports marker_CSV/190515paralog_tags.csv as
well as a list of ploidy by accession and a large TagTaxaDist file containing
read counts output by TASSEL. It imports read counts just for the desired
tags and just in diploid and tetraploid individuals, then exports the read
depth matrices to workspaces/190515counts_matrices.RData.
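
The subsetting step can be pictured with the sketch below. It is only an illustration: the nested-dictionary layout for read counts is a hypothetical in-memory form, not the TagTaxaDist file format, and the function and variable names are not taken from scripts/import_tagtaxadist.R.

```python
def subset_counts(counts, wanted_tags, ploidy_by_accession, ploidies=(2, 4)):
    """Build one accession-by-tag read depth matrix per ploidy level.

    counts: {tag_sequence: {accession: read_count}} (hypothetical layout)
    wanted_tags: tag sequences to keep, in output column order
    ploidy_by_accession: {accession: ploidy}
    """
    tags = [t for t in wanted_tags if t in counts]
    result = {}
    for k in ploidies:
        accessions = sorted(a for a, p in ploidy_by_accession.items() if p == k)
        # Rows are accessions, columns are tags; missing counts become 0.
        depth = [[counts[t].get(a, 0) for t in tags] for a in accessions]
        result[k] = {"accessions": accessions, "tags": tags, "depth": depth}
    return result


counts = {
    "ACGT": {"acc1": 5, "acc2": 3, "acc3": 7},
    "ACGA": {"acc1": 2, "acc3": 1},
    "TTTT": {"acc2": 9},
}
ploidy = {"acc1": 2, "acc2": 4, "acc3": 3}
matrices = subset_counts(counts, ["ACGT", "ACGA"], ploidy)
```

Here the triploid accession `acc3` and the unwanted tag `TTTT` are both dropped, leaving one matrix for diploids and one for tetraploids.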
scripts/H_stats_sorghum_miscanthus.R
imports workspaces/190515counts_matrices.RData
and marker_CSV/190515paralog_tags.csv, then
estimates
scripts/optimize_temperature.R
examines the output of process_isoloci.py to
evaluate the effectiveness of the tabu search algorithm for optimizing Hind/He.
Figures are plotted to compare Hind/He before and after the tabu search.
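
As background, a generic tabu search over binary assignments is sketched below. It does not reproduce the objective or the move set used by process_isoloci.py. The toy cost function (balancing weights between two groups) stands in for a Hind/He-based objective, and every name in the block is hypothetical.

```python
from collections import deque


def tabu_search(initial, cost, n_iter=50, tenure=2):
    """Minimize cost(state) over lists of 0/1 values by single-position flips."""
    current = list(initial)
    best, best_cost = list(current), cost(current)
    tabu = deque(maxlen=tenure)  # positions flipped most recently
    for _ in range(n_iter):
        candidates = []
        for i in range(len(current)):
            neighbor = current.copy()
            neighbor[i] = 1 - neighbor[i]
            c = cost(neighbor)
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if i not in tabu or c < best_cost:
                candidates.append((c, i, neighbor))
        if not candidates:
            break
        # Take the best admissible move even when it is worse than the
        # current state; this is what lets the search leave local optima.
        c, i, current = min(candidates, key=lambda t: (t[0], t[1]))
        tabu.append(i)
        if c < best_cost:
            best, best_cost = list(current), c
    return best, best_cost


# Toy objective: split weights into two groups with equal totals.
weights = [4, 3, 2, 1]


def imbalance(assign):
    group1 = sum(w for w, a in zip(weights, assign) if a == 1)
    return abs(sum(weights) - 2 * group1)


best, best_cost = tabu_search([0, 0, 0, 0], imbalance)
```

Comparing `imbalance([0, 0, 0, 0])` with `best_cost` mirrors, in miniature, the before-and-after comparison that the script's figures make for Hind/He.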
scripts/get_inbreeding.R
tests code to estimate inbreeding from preliminary
Hind/He distributions, and also plots Hind/He vs. ploidy, depth, and proportion
of M. sinensis ancestry. This script generates Fig. 4 for the
main manuscript.
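
One way to see how inbreeding could be recovered from Hind/He is sketched below. It assumes that, for a well-behaved Mendelian locus, the expected value of Hind/He is (ploidy - 1) / ploidy * (1 - F), and simply inverts that relationship. The function name is hypothetical and the script itself may use a different estimator.

```python
def inbreeding_from_hindhe(mean_hindhe, ploidy):
    """Invert E[Hind/He] = (ploidy - 1) / ploidy * (1 - F) to solve for F."""
    return 1 - mean_hindhe * ploidy / (ploidy - 1)


# With no inbreeding, the assumed expectations are 0.5 for diploids and
# 0.75 for tetraploids, so both of these give F = 0.
f_diploid = inbreeding_from_hindhe(0.5, 2)
f_tetraploid = inbreeding_from_hindhe(0.75, 4)

# A diploid population whose typical Hind/He is 0.4 implies F of about 0.2.
f_inbred = inbreeding_from_hindhe(0.4, 2)
```

Under the same assumption, loci with Hind/He well above the expected value for the ploidy are the candidates for collapsed paralogs.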
scripts/isoloci_fun_test.py
contains code for testing individual functions
within the variant calling pipeline.
scripts/snps_v_haps.R
imports variants from a VCF, with and without phasing
SNPs into haplotypes. The distribution and variance of Hind/He are then compared
between SNPs and haplotypes. A figure is generated for visualizing the
distribution. This script generates Fig. S3 for Additional File 1, as well as
results for Table 2 in the main manuscript regarding distance to genes, and
Fig. 3 in the main manuscript regarding the effect of Hind/He filtering on
minor allele frequency.
scripts/variance_and_bias.R
uses simulated data to explore the impact of
population and technical parameters on the variance and bias of Hind/He.
This script generates Figs. 5 and 6 in the manuscript and Figs. S4 and S5 in
Additional File 1.
scripts/simulate_mapping_pops.R
uses simulated data to explore the variance
of Hind/He in diploid and tetraploid F1 mapping populations. This script
generates Fig. 7 of the main manuscript.
scripts/compare_approaches.R
compares the effectiveness of various approaches
for filtering paralogs using simulated data. Functions for the approaches
can be found in scripts/HoHe.R. This script generates the data found in
Table 3 in the main manuscript.
doc/using_hindhe.Rmd
contains a brief exploration of the Hind/He statistic
using an example dataset provided with polyRAD. This was superseded by the
tutorials incorporated into polyRAD, but is kept here for exploration of
read depth as it relates to Hind/He.