solgenomics / snpbinner

SNPbinner is a utility for the generation of genotype crossover points and binmaps based on SNP data across recombinant inbred lines.

Python 49.45% Perl 2.00% R 48.55%
bioinformatics analysis snp snp-data recombination genotype hmm

SNPbinner

SNPbinner is a Python 2.7 package and command line utility for the generation of genotype binmaps based on SNP genotype data across populations of recombinant inbred lines (RILs). Analysis using SNPbinner is performed in three parts: crosspoints, bins, and visualize.

Citing

SNPbinner can be cited as:

Gonda, I., H. Ashrafi, D.A. Lyon, S.R. Strickler, A.M. Hulse-Kemp, Q. Ma, H. Sun, K. Stoffel, A.F. Powell, S. Futrell, T.W. Thannhauser, Z. Fei, A.E. Van Deynze, L.A. Mueller, J.J. Giovannoni, and M.R. Foolad. 2019. Sequencing-based bin map construction of a tomato mapping population, facilitating high-resolution quantitative trait loci detection. Plant Genome 12:180010. doi:10.3835/plantgenome2018.02.0010

Table of Contents

Installation and Usage
Commands
    crosspoints
    bins
    visualize

Installation and Usage

SNPbinner requires Python 2.7. Python 3 is currently not supported.
The only non‑standard dependency of SNPbinner is Pillow, a PIL fork.

To install the SNPbinner utility, download or clone the repository and run

$ pip install REPO-PATH

Once installed, one can execute any of the commands below like so

$ snpbinner COMMAND [ARGS...]

Alternatively, without installing the package, one can execute any of the commands below using

$ python REPO-PATH/snpbinner COMMAND [ARGS...]

Commands

crosspoints

Description Usage Input Format Output Format

Description

crosspoints uses genotyped SNP data to identify likely crossover points. First, the script uses a pair of hidden Markov models (HMM) to predict genotype regions along the chromosome both with (3‑state) and without (2‑state) heterozygous regions. Then, the script identifies groupings of regions which are too short (based on a minimum distance between crosspoints set by the user). After that it follows the rules below to find crosspoints and merge away regions which are too short. The script then outputs the crosspoints for each RIL and the genotyped regions between them to a CSV file.


  1. If a group of alternating too‑short regions is long enough to be its own acceptably‑long genotype region, it will be treated as such and assigned the most likely genotype using the 3‑state HMM.
  2. If a group of alternating too‑short regions is surrounded by regions of the same genotype, all regions within that group are assigned the surrounding genotype.
  3. If a too‑short region has been genotyped as heterozygous by the 3‑state HMM, that section is replaced by the regions identified by the 2‑state HMM.
  4. If the first or last too‑short region is neighboring an acceptably‑long heterozygous region, the whole grouping will be assigned the heterozygous genotype.
  5. If a group of alternating too-short regions is bounded by two homozygous regions, the leftmost or rightmost too-short region (whichever is shorter) will be merged with its bounding homozygous region. This repeats until the group is empty, the contents having been merged into the two bounding regions.
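
Rules 2 and 5 above can be illustrated with a short sketch. This is not the code from crosspoints.py; the function name `resolve_short_group`, the `(start, end, genotype)` tuple layout, and the tie-breaking choice (merge to the left when both end regions are equally long) are assumptions made for illustration.

```python
def resolve_short_group(left, group, right):
    """Merge a contiguous run of too-short regions into its two neighbors.

    Every region is a (start, end, genotype) tuple. `left` and `right` are
    the acceptably long regions on either side of `group`.
    """
    if left[2] == right[2]:
        # Rule 2: same genotype on both sides, so the whole span is one region.
        return [(left[0], right[1], left[2])]
    group = list(group)
    left_end, right_start = left[1], right[0]
    while group:
        first, last = group[0], group[-1]
        # Rule 5: absorb whichever end region is shorter into its neighbor.
        # Assumption: ties (including a single remaining region) go left.
        if first[1] - first[0] <= last[1] - last[0]:
            left_end = group.pop(0)[1]
        else:
            right_start = group.pop()[0]
    return [(left[0], left_end, left[2]), (right_start, right[1], right[2])]


left = (0, 1000, "a")
right = (1400, 3000, "b")
group = [(1000, 1300, "b"), (1300, 1350, "a"), (1350, 1400, "b")]
merged = resolve_short_group(left, group, right)
```

In this example the two short regions on the right are absorbed first, leaving a single crosspoint between the two bounding regions.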

Usage

Running the crosspoints command requires an input path, output path, and a minimum size argument. There are also three optional arguments which can be found in the table below.

$ snpbinner crosspoints --input PATH --output PATH (--min-length INT | --min-ratio FLOAT) [optional args]  
Required Arguments
Argument            Type   Description
-i --input          PATH   Path to a SNP TSV, multiple paths, or a glob (e.g. myGenome.chr*.tsv).
-o --output         PATH   Path for the output CSV when there is a single input, or for a folder when there are multiple.
-m --min-length     INT    Minimum distance between crosspoints in base pairs. Cannot be used with --min-ratio.
-r --min-ratio      FLOAT  Minimum distance between crosspoints as a ratio of the chromosome length (0.01 would be 1% of the chromosome). Cannot be used with --min-length.
Optional Arguments
Argument            Type   Description
-c --cross-count    FLOAT  Used to calculate the transition probability. The state transition probability is this value divided by the chromosome length. (default: 4)
-l --chrom-len      INT    The length of the chromosome/scaffold on which the SNPs are located. If no length is provided (or multiple files are being processed), the last SNP is considered to be the last site on the chromosome.
-p --homogeneity    FLOAT  Used to calculate emission probabilities. For example, if 0.9 is used, a b-genotype region is predicted to consist of 90% b-genotype calls. (default: 0.9)
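
To make the roles of cross-count and homogeneity concrete, here is a minimal 2-state Viterbi sketch in plain Python. It is not SNPbinner's implementation. The function name, the genotype codes `a` and `b`, and the way the per-base-pair transition probability is scaled by the gap between markers (linear, capped at 0.5) are all assumptions; only "transition probability = cross-count / chromosome length" and the meaning of homogeneity come from the table above.

```python
from math import log


def viterbi_two_state(positions, calls, chrom_len, cross_count=4.0, homogeneity=0.9):
    """Most likely sequence of 'a'/'b' states for one RIL (illustrative)."""
    states = ("a", "b")
    # From the table: transition probability = cross-count / chromosome length.
    p_switch_per_bp = cross_count / float(chrom_len)

    def emit(state, call):
        # A region of genotype g is expected to contain `homogeneity` g calls.
        return log(homogeneity if call == state else 1.0 - homogeneity)

    def trans(prev, cur, p_switch):
        return log(p_switch if prev != cur else 1.0 - p_switch)

    score = dict((s, log(0.5) + emit(s, calls[0])) for s in states)
    back = []
    for i in range(1, len(calls)):
        gap = positions[i] - positions[i - 1]
        # Assumption: switching probability grows linearly with the gap.
        p_switch = min(max(p_switch_per_bp * gap, 1e-12), 0.5)
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda r: score[r] + trans(r, s, p_switch))
            pointers[s] = best
            new_score[s] = score[best] + trans(best, s, p_switch) + emit(s, calls[i])
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


positions = [10000 * i for i in range(1, 21)]
calls = list("aaaabaaaaa" + "b" * 10)  # one isolated 'b' call inside the 'a' block
path = viterbi_two_state(positions, calls, chrom_len=1000000)
```

With these values the isolated `b` call is treated as noise, and the path switches state only once, where the run of `b` calls begins.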

Input Format

Sample input file

Input should be formatted as a tab‑separated value (TSV) file with the following columns.
0 The SNP marker ID.
1 The position of the marker in base pairs from the start of the chromosome.
2+ RIL ID (header) and the called genotype of the RIL at each position.
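
As a rough illustration of this layout, the sketch below reads such a TSV into a list of positions and a per-RIL list of calls. It is independent of the snpbinner package. The function name, the header labels `marker` and `position`, and the genotype codes in the sample (`a`, `b`, `h`) are assumptions, not taken from the sample input file.

```python
import csv
import io


def read_snp_tsv(handle):
    """Return (positions, calls) where calls maps each RIL ID to its genotype calls."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    ril_ids = header[2:]          # columns 2+ carry the RIL IDs in the header
    positions = []
    calls = dict((ril, []) for ril in ril_ids)
    for row in reader:
        if not row:
            continue              # skip blank lines
        positions.append(int(row[1]))   # column 1: position in base pairs
        for ril, call in zip(ril_ids, row[2:]):
            calls[ril].append(call)
    return positions, calls


sample = io.StringIO(
    u"marker\tposition\tRIL_1\tRIL_2\n"
    u"snp_1\t1200\ta\tb\n"
    u"snp_2\t3400\ta\th\n"
    u"snp_3\t9100\tb\th\n"
)
positions, calls = read_snp_tsv(sample)
```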

Output Format

Sample output file

Output is formatted as a comma‑separated value (CSV) file with the following columns.
0 The RIL ID
Odd Location of a crosspoint. (Empty after the chromosome ends.)
Even Genotype in between the surrounding crosspoints. (Empty after the chromosome ends.)
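
The alternating layout can be unpacked into genotype regions as sketched below. The function name and the sample row are hypothetical, and the sketch assumes that locations are integers and that each genotype cell sits between the two locations that bound its region, as the column description implies.

```python
def parse_crosspoints_row(row):
    """Turn one CSV row into (ril_id, [(start, end, genotype), ...])."""
    ril_id = row[0]
    cells = [c for c in row[1:] if c != ""]    # drop the empty trailing cells
    locations = [int(c) for c in cells[0::2]]  # odd columns: crosspoint locations
    genotypes = cells[1::2]                    # even columns: genotype in between
    regions = list(zip(locations, locations[1:], genotypes))
    return ril_id, regions


line = "RIL_1,0,a,1500,h,4200,b,9000,,"
ril_id, regions = parse_crosspoints_row(line.split(","))
```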

bins

Description Usage Input Format Output Format

Description

bins takes the crosspoints predicted for each RIL and combines similar crosspoint locations to create a combined map of all crossover points across the RILs at a specified resolution. It then projects the genotype regions of each RIL back onto the map and outputs the representative (plurality) genotype of each RIL in each bin on the map. The procedure is as follows. Note that, to ensure the changes are obvious, the illustrations show a map with very low resolution (a large minimum bin size), so a significant amount of information is lost. A smaller bin size would create a more accurate map.

  1. The script begins by combining the crosspoints from all lines, including duplicates occurring at the same location.
  2. Contiguous series of crosspoints are then grouped together if they are closer to a neighbor than the specified minimum bin size.
  3. One‑dimensional k‑means optimization is then used to find the best placement for the bin boundaries (steps 2 and 4 below). This is repeated for every possible number of boundaries that can fit in the span of each group. In order to account for the minimum bin‑size constraint, once a possible set of boundaries has been converged upon by the k‑means algorithm, each mean is adjusted to ensure it is at least the minimum distance from its neighbors (steps 3 and 4 below). If this enters a cycle instead of converging on a working solution, the script will accept the adjusted boundaries without the second optimization step. Otherwise, optimization continues until a solution is reached with appropriately spaced boundaries.
    This k=3 example finishes due to a cycle (steps 3‑5).
  4. For each group, the solution whose value of k leads to the least variance from the adjusted means is placed into a list of final boundaries. These boundaries are then used to create bins for the final binmap.
  5. Each RIL is then projected onto this binmap and the results are output as a CSV. Each bin is assigned whichever genotype represents a plurality of its contents.
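
Steps 2 and 3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the k-means initialization (means spread evenly over the group's span) is a choice made here, and the minimum-spacing adjustment and the selection of the best k described above are left out.

```python
def group_crosspoints(crosspoints, min_bin_size):
    """Step 2: chain together crosspoints closer than min_bin_size to a neighbor."""
    points = sorted(crosspoints)
    groups = [[points[0]]]
    for p in points[1:]:
        if p - groups[-1][-1] < min_bin_size:
            groups[-1].append(p)
        else:
            groups.append([p])
    return groups


def kmeans_1d(points, k, iterations=100):
    """Step 3 (first half): plain 1-D k-means, returning k candidate boundaries."""
    points = sorted(points)
    lo, hi = points[0], points[-1]
    # Assumption: start with means spread evenly across the group's span.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Keep the old mean when a cluster receives no points.
        new_means = [sum(c) / float(len(c)) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:
            break
        means = new_means
    return means


groups = group_crosspoints([100, 120, 150, 5000, 5050, 9000], min_bin_size=200)
boundaries = kmeans_1d([100, 120, 150, 1000, 1020], k=2)
```

Duplicated crosspoints from different RILs are kept on purpose, so that locations shared by many lines pull the means toward themselves.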

Usage

Running the bins command requires an input path, output path, and a minimum size argument. Optionally, a binmap ID may also be provided.

$ snpbinner bins --input PATH --output PATH --min-bin-size INT [--binmap-id ID]
Required Arguments
Argument            Type   Description
-i --input          PATH   Path to a crosspoints CSV, multiple paths, or a glob (e.g. myGenome.chr*.crosp.csv).
-o --output         PATH   Path for the output CSV when there is a single input, or for a folder when there are multiple.
-l --min-bin-size   INT    Sets the minimum size (in bp) of each bin.
Optional Arguments
Argument            Type   Description
-n --binmap-id      ID     If a binmap ID is provided, a header row will be added and each column labeled with the given string.

Input Format

bins uses the output from crosspoints.
For details, see the crosspoints Output Format.

Output Format

Sample output file

Output is formatted as a comma‑separated value (CSV) file and has the following rows.
0 (Optional) The binmap ID
1 The start of each bin (in base pairs).
2 The end of each bin (in base pairs).
3 The center of each bin (in base pairs).
4+ RIL ID in the first cell, then the genotypes of each bin for that RIL.
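
The genotype cells in rows 4+ come from projecting each RIL's regions onto the bins. A minimal sketch of that projection follows. It assumes that "plurality" is measured in overlapping base pairs, which the description does not state explicitly; the function name and the sample values are hypothetical.

```python
def genotype_bins(regions, bin_edges):
    """Assign each bin the genotype covering the most base pairs within it.

    regions: list of (start, end, genotype) tuples for one RIL.
    bin_edges: sorted bin boundaries; bin i spans bin_edges[i]..bin_edges[i + 1].
    """
    calls = []
    for b_start, b_end in zip(bin_edges, bin_edges[1:]):
        overlap = {}
        for start, end, genotype in regions:
            shared = min(end, b_end) - max(start, b_start)
            if shared > 0:
                overlap[genotype] = overlap.get(genotype, 0) + shared
        # Sorting first makes ties resolve alphabetically (an arbitrary choice).
        calls.append(max(sorted(overlap), key=overlap.get) if overlap else "")
    return calls


regions = [(0, 1500, "a"), (1500, 4200, "h"), (4200, 9000, "b")]
row = ["RIL_1"] + genotype_bins(regions, [0, 2000, 5000, 9000])
```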

visualize

Description Usage Input Format Output Format

Description

visualize plots the inputs and outputs of bins and crosspoints. It can be used to visually check the results of the above commands to help determine the best values for each of the parameters. It can accept three filetypes (SNP input TSV, crosspoint CSV, and bin CSV). It then parses the files and groups the data by RIL, creating an image for each. In each row of the resulting images, regions are colored red, green, or blue, for genotype a, heterozygous, or genotype b, respectively. The binmap is represented in gray with adjacent bins alternating dark and light. The script can accept any combination or number of files for each of the different filetypes.

Example

Usage

$ snpbinner visualize --out PATH [--bins PATH]... [--crosspoints PATH]... [--snps PATH]...
Required Arguments
Argument            Type   Description
-o --out            PATH   Folder to which the resulting images should be saved.
Optional Arguments
Argument            Type   Description
-b --bins           PATH   bins output file to be added to the visualization.
-c --crosspoints    PATH   crosspoints output file to be added to the visualization.
-s --snps           PATH   SNP TSV (the crosspoints input file) to be added to the visualization.
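
The color scheme described above (red for genotype a, green for heterozygous, blue for genotype b) can be sketched as a mapping from genotype regions to horizontal pixel spans. This is not the visualize code: the function name, the genotype codes `a`, `h`, `b`, and the exact RGB values are assumptions. The resulting spans could be drawn with a library such as Pillow, which is deliberately not imported here.

```python
# Assumed genotype codes and pure RGB colors; the real palette may differ.
COLORS = {"a": (255, 0, 0), "h": (0, 255, 0), "b": (0, 0, 255)}


def region_spans(regions, chrom_len, width_px):
    """Scale (start, end, genotype) regions to (x0, x1, color) pixel spans."""
    spans = []
    for start, end, genotype in regions:
        x0 = int(start * width_px / float(chrom_len))
        x1 = int(end * width_px / float(chrom_len))
        spans.append((x0, x1, COLORS[genotype]))
    return spans


regions = [(0, 1500, "a"), (1500, 4200, "h"), (4200, 9000, "b")]
spans = region_spans(regions, chrom_len=9000, width_px=900)
```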

snpbinner's People

Contributors

dauglyon, lukasmueller

snpbinner's Issues

crosspoints.py error ValueError: math domain error

Hi,
I was trying to use the snpbinner crosspoints command with my SNP dataset. However, I am getting the following error:

Traceback (most recent call last):
File "/home/rama/.local/bin/snpbinner", line 9, in <module>
load_entry_point('snpbinner==0.2.3', 'console_scripts', 'snpbinner')()
File "/home/rama/Downloads/softwares/SNPbinner-master/snpbinner/main.py", line 24, in main
program_run_dictprogram_to_run
File "/home/rama/Downloads/softwares/SNPbinner-master/snpbinner/crosspoints.py", line 9, in _crosspoints_batcher
crosspoints(input_path[0],output_path,predicted_homogeneity,predicted_cross_count,chrom_len,min_state_length,min_state_ratio)
File "/home/rama/Downloads/softwares/SNPbinner-master/snpbinner/crosspoints.py", line 86, in crosspoints
hmm_all = hmm_all)
File "/home/rama/Downloads/softwares/SNPbinner-master/snpbinner/crosspoints.py", line 94, in _find_crosspoints
cross_points = hmm_all.gapped_viterbi(snplist)
File "/home/rama/Downloads/softwares/SNPbinner-master/snpbinner/crosspoints.py", line 276, in gapped_viterbi
gap_transition_adjustment = log(gap_size)
ValueError: math domain error

When I run the same command on the SNPs of individual chromosomes, it works fine. I am a biologist and new to Python. Please help me understand the problem.

Udita
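
A note on the traceback: Python's `math.log` raises `ValueError` for zero or negative arguments, so the failing line is consistent with a `gap_size` that is not positive. Assuming `gap_size` is the distance between consecutive marker positions (an assumption, since the relevant code is not shown here), this could happen when a file contains repeated positions or positions that restart, for example when several chromosomes share one file. The hypothetical helper below reports such markers in a list of positions.

```python
from math import log


def find_bad_gaps(positions):
    """Return (index, gap) pairs where a marker does not lie after the previous one."""
    bad = []
    for i in range(1, len(positions)):
        gap = positions[i] - positions[i - 1]
        if gap <= 0:
            bad.append((i, gap))
    return bad


# Positions restart at index 3, as they would at the start of a second chromosome.
bad = find_bad_gaps([100, 250, 900, 40, 300, 300])

# log() of a non-positive gap fails in the same way as in the traceback.
try:
    log(bad[0][1])
    raised = False
except ValueError:
    raised = True
```

An empty result means every position lies strictly after the one before it.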

Inconsistent Number of Individuals on Output

Hi David,
Following the last fix, the command now runs without an error on the offending contig files, but the output for those files contains an inconsistent number of samples. It would be beneficial to have a fix that keeps the number of samples in the output consistent with the input. I am working on a downstream fix, but I think it would be great to incorporate this upstream for greater user utility.
Best,
Amanda

emission probabilities

Hi, I cannot understand why the sum(A1*) is not equal to 1 in the emission(heterogeneous) probabilities matrix. Please help me. Thanks in advance!

Maximum number of markers handled in SNPbinner

I am trying to use SNPbinner for binning my rice RIL genotype data (170 lines), and binning is really helpful in my work. I have about 8000 SNPs on Chr1, but when I try to run the first crosspoints command with more than 4000 SNPs it returns the following error. Is there a limit on the number of markers per chromosome that SNPbinner can handle? Please let me know how I can deal with this issue.

File "/usr/local/bin/snpbinner", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/snpbinner/main.py", line 24, in main
program_run_dictprogram_to_run
File "/usr/local/lib/python2.7/site-packages/snpbinner/crosspoints.py", line 9, in _crosspoints_batcher
crosspoints(input_path[0],output_path,predicted_homogeneity,predicted_cross_count,chrom_len,min_state_length,min_state_ratio)
File "/usr/local/lib/python2.7/site-packages/snpbinner/crosspoints.py", line 86, in crosspoints
hmm_all = hmm_all)
File "/usr/local/lib/python2.7/site-packages/snpbinner/crosspoints.py", line 94, in _find_crosspoints
cross_points = hmm_all.gapped_viterbi(snplist)
File "/usr/local/lib/python2.7/site-packages/snpbinner/crosspoints.py", line 276, in gapped_viterbi
gap_transition_adjustment = log(gap_size)
ValueError: math domain error

Thanks
Anurag
