paralog_id

Exploration and testing of methods to identify paralogs in GBS data

This repository includes scripts, data, and drafts pertaining to methods for sorting GBS/RAD tags into paralogous loci. The dataset being used is from Miscanthus sacchariflorus, which contains both diploid and tetraploid individuals. Groups of tags were identified that aligned to one location in the Sorghum bicolor v3 reference genome, but two locations in the Miscanthus sinensis v7 reference genome, where those two locations are on appropriate chromosomes given the known synteny between S. bicolor and M. sinensis.

This code is archived on Zenodo:

File descriptions

CSV files

marker_CSV/190513miscanthus_sorghum_match.csv contains TagDigger output listing every tag location from the M. sinensis reference and how it matches, if at all, to the S. bicolor reference. SAM files for TagDigger were generated by Bowtie2 as part of the TASSEL-GBSv2 pipeline and are available at the Illinois Data Bank.

marker_CSV/190513paralogs.csv is filtered from the above file, to only contain pairs of M. sinensis tag locations that align to the same S. bicolor location, and only with appropriate synteny.

marker_CSV/190515paralog_tags.csv contains tag sequences for all tag locations listed in the above file.

Code

scripts/hindhe_by_ploidy.R imports M. sacchariflorus read depths from a VCF and plots the distribution of Hind/He by ploidy. This is an older figure used in presentations but not included in the manuscript.

scripts/filter_markers.R imports marker_CSV/190513miscanthus_sorghum_match.csv and filters it down to loci that are clearly paralogous, outputting marker_CSV/190513paralogs.csv.
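The filtering step can be illustrated with a minimal Python sketch. It is not the code in scripts/filter_markers.R: the location strings, the `syntenic` lookup table, and the rule of keeping only groups with exactly two M. sinensis locations are all assumptions made for illustration, and the real CSV columns may differ.

```python
from collections import defaultdict

# Hypothetical rows: (M. sinensis tag location, S. bicolor tag location).
# The real CSV's column names and location format are not reproduced here.
matches = [
    ("Chr01-1000", "Sb01-500"),
    ("Chr02-2000", "Sb01-500"),
    ("Chr03-3000", "Sb02-700"),
    ("Chr05-4000", "Sb03-900"),
    ("Chr06-5000", "Sb03-900"),
    ("Chr07-6000", "Sb03-900"),
    ("Chr04-8000", None),  # tag location with no match in S. bicolor
]

# Hypothetical synteny table: S. bicolor chromosome -> the pair of
# M. sinensis chromosomes assumed to correspond to it.
syntenic = {
    "Sb01": {"Chr01", "Chr02"},
    "Sb02": {"Chr03", "Chr04"},
    "Sb03": {"Chr05", "Chr06"},
}


def chrom(location):
    """Return the chromosome part of a 'chromosome-position' string."""
    return location.split("-")[0]


def find_paralog_pairs(matches, syntenic):
    """Keep S. bicolor locations matched by exactly two M. sinensis
    locations that sit on the expected pair of chromosomes."""
    by_sorghum = defaultdict(list)
    for mis_loc, sorg_loc in matches:
        if sorg_loc is not None:
            by_sorghum[sorg_loc].append(mis_loc)

    pairs = {}
    for sorg_loc, mis_locs in by_sorghum.items():
        if len(mis_locs) != 2:
            continue  # not a simple pair
        if {chrom(m) for m in mis_locs} == syntenic.get(chrom(sorg_loc)):
            pairs[sorg_loc] = tuple(sorted(mis_locs))
    return pairs


paralog_pairs = find_paralog_pairs(matches, syntenic)
```

In this toy input only `Sb01-500` is retained: `Sb02-700` has a single M. sinensis match and `Sb03-900` has three.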

scripts/get_tag_seq.R imports marker_CSV/190513paralogs.csv and a large SAM file containing alignments of M. sacchariflorus tags to the M. sinensis reference, then exports the sequences of tags for all selected markers to marker_CSV/190515paralog_tags.csv.

scripts/import_tagtaxadist.R imports marker_CSV/190515paralog_tags.csv as well as a list of ploidy by accession and a large TagTaxaDist file containing read counts output by TASSEL. It imports read counts just for the desired tags and just in diploid and tetraploid individuals, then exports the read depth matrices to workspaces/190515counts_matrices.RData.

scripts/H_stats_sorghum_miscanthus.R imports workspaces/190515counts_matrices.RData and marker_CSV/190515paralog_tags.csv, then estimates $H_{ind}$ and $H_E$ for all markers. The goal is to demonstrate that Hind/He can clearly distinguish one-copy and two-copy loci, i.e. those aligned to Miscanthus vs. those aligned to Sorghum. Figures are plotted to compare the distribution of Hind/He when tags are aligned to Miscanthus vs. Sorghum. This script generates Figs. 1 and 2 in the main manuscript and Figs. S1 and S2 in Additional File 1.
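For readers unfamiliar with the statistic, the following is a minimal Python sketch of Hind/He at a single locus. It is not the code from this script or from polyRAD. It assumes that Hind is the probability that two reads drawn without replacement from one individual are different alleles, that He is the expected heterozygosity computed from allele frequencies, and that allele frequencies can be approximated by the mean of within-individual read proportions. The function name and input layout are hypothetical.

```python
def hind_he(depths):
    """Per-individual Hind/He at one locus.

    depths[i][a] is the read depth of allele a in individual i.
    Returns None for individuals with fewer than two reads, or for
    every individual if the locus is monomorphic (He == 0).
    """
    n_alleles = len(depths[0])

    # Allele frequencies: mean of within-individual read proportions
    # (an assumption of this sketch), skipping individuals with no reads.
    genotyped = [d for d in depths if sum(d) > 0]
    freqs = [
        sum(d[a] / sum(d) for d in genotyped) / len(genotyped)
        for a in range(n_alleles)
    ]
    he = 1.0 - sum(p * p for p in freqs)

    ratios = []
    for d in depths:
        total = sum(d)
        if total < 2 or he == 0:
            ratios.append(None)
            continue
        # Probability that two reads sampled without replacement
        # carry the same allele.
        same = sum(x * (x - 1) for x in d) / (total * (total - 1))
        ratios.append((1.0 - same) / he)
    return ratios


# Toy locus: two homozygotes and one heterozygote, ten reads each.
example = hind_he([[10, 0], [0, 10], [5, 5]])
```

Under these assumptions, and in the absence of inbreeding, the expected value at a well-behaved locus is (ploidy - 1) / ploidy, so locus means well above 0.5 in diploids or 0.75 in tetraploids suggest collapsed paralogs.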

scripts/optimize_temperature.R examines the output of process_isoloci.py to evaluate the effectiveness of the tabu search algorithm for optimizing Hind/He. Figures are plotted to compare Hind/He before and after the tabu search.

scripts/get_inbreeding.R tests code to estimate inbreeding from preliminary Hind/He distributions, and also plots Hind/He vs. ploidy, depth, and proportion M. sinensis ancestry for the manuscript. This script generates Fig. 4 for the main manuscript.
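The relationship behind the inbreeding estimate can be sketched in a few lines of Python. This is an illustration rather than the script's code: it assumes that the expected value of Hind/He is (ploidy - 1) / ploidy * (1 - F), and the function name is hypothetical.

```python
def inbreeding_from_hindhe(mean_hindhe, ploidy):
    """Solve E[Hind/He] = (ploidy - 1) / ploidy * (1 - F) for F.

    mean_hindhe is assumed to summarize Hind/He over loci that are
    free of collapsed paralogs.
    """
    return 1.0 - mean_hindhe * ploidy / (ploidy - 1)


f_diploid = inbreeding_from_hindhe(0.375, ploidy=2)    # 0.25
f_tetraploid = inbreeding_from_hindhe(0.75, ploidy=4)  # 0.0
```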

scripts/isoloci_fun_test.py contains code for testing individual functions within the variant calling pipeline.

scripts/snps_v_haps.R imports variants from a VCF, with and without phasing SNPs into haplotypes. The distribution and variance of Hind/He are then compared between SNPs and haplotypes, and a figure is generated to visualize the distribution. This script generates Fig. S3 for Additional File 1, as well as results for Table 2 in the main manuscript regarding distance to genes, and Fig. 3 in the main manuscript regarding the effect of Hind/He filtering on minor allele frequency.

scripts/variance_and_bias.R uses simulated data to explore the impact of population and technical parameters on the variance and bias of Hind/He. This script generates Figs. 5 and 6 in the manuscript and Figs. S4 and S5 in Additional File 1.

scripts/simulate_mapping_pops.R uses simulated data to explore the variance of Hind/He in diploid and tetraploid F1 mapping populations. This script generates Fig. 7 of the main manuscript.

scripts/compare_approaches.R uses simulated data to compare the effectiveness of various approaches for filtering paralogs. Functions for the approaches can be found in scripts/HoHe.R. This script generates the data found in Table 3 in the main manuscript.

Documentation

doc/using_hindhe.Rmd contains a brief exploration of the Hind/He statistic using an example dataset provided with polyRAD. This was superseded by the tutorials incorporated into polyRAD, but is kept here for exploration of read depth as it relates to Hind/He.
