
isolate_parsers's Introduction

Breseqparser

These scripts consolidate and summarize breseq variant calls across a set of samples.

Installation

The scripts can be installed using pip:

pip install isolateparser

Usage


usage: breseqparser [-h] [-i FOLDER] [-o OUTPUT] [--fasta] [-w WHITELIST]
                    [-b BLACKLIST] [-m SAMPLE_MAP] [--filter-1000bp]
                    [--reference REFERENCE_LABEL]
                    [--snp-categories SNP_CATEGORIES] [--regex REGEX]
                    [--single]

optional arguments:
  -h, --help            show this help message and exit
  -i FOLDER, --input FOLDER
                        The breseq folder to parse.
  -o OUTPUT, --output OUTPUT
                        Where to save the output files. Should just be the
                        prefix, the file extensions will be added
                        automatically.
  --fasta               Whether to generate an aligned fasta file of all snps
                        in the breseq VCF file.
  -w WHITELIST, --whitelist WHITELIST
                        Samples not in the whitelist are ignored. Either a
                        comma-separated list of sample ids or a file with
                        each sample id occupying a single line.
  -b BLACKLIST, --blacklist BLACKLIST
                        Samples to ignore. See `--whitelist` for possible
                        input formats.
  -m SAMPLE_MAP, --sample-map SAMPLE_MAP
                        A file mapping sample ids to sample names. Use if the
                        subfolders in the breseqset folder are named
                        differently from the sample names. The file should
                        have two columns: `sampleId` and `sampleName`,
                        separated by a tab character.
  --filter-1000bp       Whether to filter out variants that occur within
                        1000bp of each other. Usually indicates a mapping
                        error.
  --reference REFERENCE_LABEL
                        The sample that was used as the reference, if
                        available.
  --snp-categories SNP_CATEGORIES
                        Categories to use when concatenating SNPs into a fasta
                        file.
  --regex REGEX         Used to extract sample names from the given filename.
                        Currently Disabled
  --single              Indicates that there is only one sample. Used for
                        debugging.
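
The `--sample-map` file described in the help text can be written with a few lines of Python. This is a minimal sketch, not part of the package: the helper names `write_sample_map` and `build_command` and the ids and file names in the usage note are hypothetical, and writing a header row is an assumption based on the column names given in the help text.

```python
import csv
from pathlib import Path


def write_sample_map(mapping, path):
    """Write a tab-separated file with `sampleId` and `sampleName` columns."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumption: the parser expects a header row with these column names.
        writer.writerow(["sampleId", "sampleName"])
        for sample_id, sample_name in mapping.items():
            writer.writerow([sample_id, sample_name])
    return path


def build_command(folder, output_prefix, sample_map=None):
    """Assemble a breseqparser command line using only the documented options."""
    command = ["breseqparser", "--input", str(folder), "--output", str(output_prefix)]
    if sample_map is not None:
        command += ["--sample-map", str(sample_map)]
    return command
```

For example, `write_sample_map({"S01": "Sample1"}, "sample_map.tsv")` followed by `build_command("breseq_folder", "summary", "sample_map.tsv")` gives an argument list that can be passed to `subprocess.run`.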

Input

The scripts expect a folder of individual breseq runs, with each subfolder named after the isolate/sample. The scripts require only the output.vcf, annotated.gd, and index.html files in each subfolder. Example folder:

    .breseq_folder
    |-- sample1
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
    |-- sample2
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
    |-- sample3
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
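
Before running the parser, it can help to confirm that every sample folder follows this layout. The sketch below is not part of the package: the function name `find_missing_files` is hypothetical, and the relative paths are taken from the example tree above.

```python
from pathlib import Path

# Relative paths of the three files the scripts read, per the example tree.
REQUIRED_FILES = (
    Path("data") / "output.vcf",
    Path("output") / "index.html",
    Path("output") / "evidence" / "annotated.gd",
)


def find_missing_files(breseq_folder):
    """Map each sample folder name to the required files it lacks."""
    missing = {}
    for sample_folder in sorted(Path(breseq_folder).iterdir()):
        if not sample_folder.is_dir():
            continue  # Skip stray files next to the sample folders.
        absent = [
            relative.as_posix()
            for relative in REQUIRED_FILES
            if not (sample_folder / relative).is_file()
        ]
        if absent:
            missing[sample_folder.name] = absent
    return missing
```

An empty dictionary means every sample folder contains all three files.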

Output

The scripts generate an Excel file in the breseq run folder with four sheets: comparison, variant, coverage, and junction. The variant, coverage, and junction sheets are the corresponding tables from all samples in the breseq run, concatenated.

Comparison table

Each row represents a single mutation seen in the sample callset; each sample has its own column giving the sequence observed in that sample.

| Sample1 | Sample2 | Sample3 | annotation | description | gene | locusTag | mutationCategory | position | presentIn | presentInAllSamples | ref | seq id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GG | GG | GG | intergenic (+65/+20) | putative lipoprotein/putative hydrolase | PFLU0045 - / - PFLU0046 | PFLU0045/PFLU0046 | small_indel | 45881 | 3 | 1 | G | NC_012660 |
| CC | CC | CC | intergenic (+17/-136) | microcin-processing peptidase 1. Unknown type peptidase. MEROPS family U62/hypothetical protein | PFLU0872 - / - PFLU0873 | PFLU0872/PFLU0873 | small_indel | 985333 | 3 | 1 | C | NC_012660 |
|  |  |  | intergenic (+57/+21) | hypothetical protein/putative helicase | PFLU3154 - / - PFLU3155 | PFLU3154/PFLU3155 | small_indel | 3447986 | 3 | 1 |  | NC_012660 |
| A | A | G | M350I (ATG-ATA) | putative GGDEF domain signaling protein | PFLU3571 - | PFLU3571 | snp_nonsynonymous | 3959631 | 2 | 0 | G | NC_012660 |
| A | A | C | T238P (ACC-CCC) | hybrid sensory histidine kinase in two-component regulatory system with UvrY | PFLU3777 - | PFLU3777 | snp_nonsynonymous | 4173231 | 1 | 0 | A | NC_012660 |
| G | G | GG | coding (322/1476 nt) | putative two-component system response regulator nitrogen regulation protein NR(I) | PFLU4443 - | PFLU4443 | small_indel | 4908233 | 1 | 0 | G | NC_012660 |
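
In the example rows, `presentIn` appears to count the samples whose sequence differs from `ref`, and `presentInAllSamples` appears to be 1 only when every sample differs. The sketch below encodes that reading. It is inferred from the example table, not taken from the package source, and the function name `summarize_presence` is hypothetical. It covers only rows where both the reference and the sample sequences are shown.

```python
def summarize_presence(ref, sample_alleles):
    """Return (presentIn, presentInAllSamples) for one comparison-table row.

    Assumption: a mutation is "present" in a sample when that sample's
    sequence differs from the reference sequence.
    """
    present_in = sum(1 for allele in sample_alleles if allele != ref)
    present_in_all = int(present_in == len(sample_alleles))
    return present_in, present_in_all


# Rows from the example table: (ref, [Sample1, Sample2, Sample3]).
print(summarize_presence("G", ["GG", "GG", "GG"]))  # position 45881
print(summarize_presence("G", ["A", "A", "G"]))     # position 3959631
print(summarize_presence("A", ["A", "A", "C"]))     # position 4173231
```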

Aligned fasta files

The scripts also generate three fasta files (breseq.snp.fasta, breseq.amino.fasta, breseq.codon.fasta) in which each sample is represented by the replacement bases, amino acids, and codons, respectively, of all nonsynonymous SNPs. Example:

>reference
GA
>Sample1
AA
>Sample2
AA
>Sample3
GC
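
The example above matches the two `snp_nonsynonymous` rows of the comparison table, concatenated one base per row. A minimal sketch of that concatenation follows. The function name `build_aligned_fasta`, the dictionary row format, and the ordering by position are assumptions made for illustration, not the package's actual implementation.

```python
def build_aligned_fasta(rows, sample_names, reference_label="reference"):
    """Concatenate one base per SNP row into an aligned FASTA string."""
    sequences = {reference_label: ""}
    sequences.update({name: "" for name in sample_names})
    # Assumption: SNPs are concatenated in order of genomic position.
    for row in sorted(rows, key=lambda r: r["position"]):
        sequences[reference_label] += row["ref"]
        for name in sample_names:
            sequences[name] += row[name]
    return "".join(f">{label}\n{sequence}\n" for label, sequence in sequences.items())


# The two nonsynonymous SNP rows from the comparison table.
snp_rows = [
    {"position": 3959631, "ref": "G", "Sample1": "A", "Sample2": "A", "Sample3": "G"},
    {"position": 4173231, "ref": "A", "Sample1": "A", "Sample2": "A", "Sample3": "C"},
]
print(build_aligned_fasta(snp_rows, ["Sample1", "Sample2", "Sample3"]), end="")
```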

isolate_parsers's People

Contributors

cdeitrick, pepepdodiu


isolate_parsers's Issues

Exception not handled: KeyError: 'position'

Hi,

I tried to run breseqparser on 4 breseq results, but it failed. The error message is below.

2021-06-12 07:56:38.896 | INFO     | isolateparser.workflow:update_tables:215 - Parsing 'DVT1579' ('DVT1579')
Traceback (most recent call last):
  File "/data/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'position'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/shc167/.local/bin/breseqparser", line 3, in <module>
    workflow.main()
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/workflow.py", line 391, in main
    isolateset_workflow.run(program_options.folder, program_options.reference_label, program_options.output)
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/workflow.py", line 63, in run
    variant_df, coverage_df, junction_df, summary_df = self.concatenate_callset_tables(parent_folder)
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/workflow.py", line 108, in concatenate_callset_tables
    self.update_tables(folder)
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/workflow.py", line 232, in update_tables
    snp_df, coverage_df, junction_df, gd_extra_df = breseq_output.run(
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/resultparser/breseq_folder_parser.py", line 164, in run
    index_df, coverage_df, junction_df = self.file_parser_index.run(sample_name, indexpath, set_index = self._set_table_index)
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/resultparser/parsers/parse_index.py", line 449, in run
    variant_table = self.variant_table_parser.run(sample_name, file_soup)
  File "/home/shc167/.local/lib/python3.8/site-packages/isolateparser/resultparser/parsers/parse_index.py", line 283, in run
    snp_df['position'] = snp_df['position'].apply(self._clean_position)
  File "/data/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/data/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'position'

Could you please fix it? Thank you very much!
