seq2geno's People

Contributors

alicemchardy, thkuo

seq2geno's Issues

wildcards problem

In a rule, all output files should use the same set of wildcards; otherwise a job may write to the wrong files.

  • How to start correcting?
  • How to fit the current scripts to this requirement while ensuring the correct dependencies?
  • How to avoid the wildcard multi-match problem? (i.e., "{sample}.tmp.fa" may be incorrectly matched by "{sample}.fa")
    -> Try single and simple extensions at every level (folder)
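
The multi-match problem can be reproduced outside Snakemake. A minimal sketch, assuming an unconstrained wildcard behaves like the regular expression `.+`; the helper `pattern_to_regex`, the file names, and the constraint `[^.]+` are illustrative only and not part of seq2geno. In a real Snakefile the same restriction would go into `wildcard_constraints`.

```python
import re

def pattern_to_regex(pattern, constraints=None):
    """Turn an output pattern such as '{sample}.fa' into a compiled regex.

    Hypothetical helper: each {name} becomes a named group matching '.+'
    unless a stricter expression is given in `constraints`.
    """
    constraints = constraints or {}
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:   # literal text between wildcards
            regex += re.escape(part)
        else:            # wildcard name
            regex += "(?P<{}>{})".format(part, constraints.get(part, ".+"))
    return re.compile(regex + "$")

loose = pattern_to_regex("{sample}.fa")
strict = pattern_to_regex("{sample}.fa", {"sample": r"[^.]+"})

# The temporary file is claimed by the rule that writes '{sample}.fa'.
print(loose.match("strainA.tmp.fa").group("sample"))   # strainA.tmp
# With dots excluded from the wildcard, the temporary file no longer matches.
print(strict.match("strainA.tmp.fa"))                  # None
print(strict.match("strainA.fa").group("sample"))      # strainA
```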

Division and collaboration

Two main questions to solve:

  • How are the jobs divided?
  • Who can guide whose changes, and at what scale? (Who leads the evolution?)

Clean folder

  • Where to put the smk scripts?
  • Where to put the external scripts?
  • Where to put the required software that is not available via conda?
  • Working folder, results folder, and the program location
  • Is there any software based on snakemake?

reduce the number of temporary files

There are too many temporary files. Snakemake, however, prefers passing information through files rather than variables, because this lets users resume at any rule without rerunning the rules already done.

Wildcards independent from the output filenames

The wildcards in each rule should be COMPLETELY INDEPENDENT from user-defined filenames. Otherwise, setting the filenames is not really user-friendly, while these final output names are not important to our workflow. The wildcards in our scripts only determine dependencies between rules and the intermediate results.
In short, the rule wildcards should be determined not by the filenames but by the variables (e.g., software, tmp_d, etc.) listed in the config file.

intermediate files to be removed

The dictionaries used by different files may be redundant
Review:

  • collect_rpg_data.R (for the expression table)

mutation_table.py \
-f dict.txt \
-a /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_annotation_with_ncRNAs_07_2011_12genes.tab \
-o DNA-Pool1.tab
  • dict.txt: prefix of ".flt.vcf" (the file some_line+".flt.vcf") [to be deprecated]
  • .flt.vcf: generated above, which contains only snps [to be simplified]
  • Pseudomonas_aeruginosa_PA14_annotation_with_ncRNAs_07_2011_12genes.tab: gene locus information [to be deprecated]
@Pseudomonas_aeruginosa_PA14|NC_008463|6537649
@Strain Refseq_Accession        Replicon        Locus_Tag       Feature_Type    Start   Stop    Strand  Gene_Name       Product_Name
Pseudomonas_aeruginosa_PA14     NC_008463       Chromosome      PA14_00010      CDS     483     2027    +       dnaA    chromosomal_replication_initiation_protein
Pseudomonas_aeruginosa_PA14     NC_008463       Chromosome      PA14_00020      CDS     2056    3159    +       dnaN    DNA_polymerase_III_subunit_beta

Snp2Amino.py \
-f DNA-Pool1.tab \
-g /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_ncRNA.gbk \
-o DNA_Pool1_final.tab

It doesn't generate a syn table but an "all" table...
To be rewritten with pandas.

  • DNA-Pool1.tab: created by mutation_table.py; containing four columns: "gene, pos, ref, alt"

id-dependent methods fail in case of duplicated ids in gbk

Because the current gbk (used previously by Ariane) includes duplicated gene ids (PA2570.1), which were likely subunits or paralogues, the counting script art2genecount.pl wasn't able to correctly count the read numbers.
Acceptable solution: the locus ids in the annotation file should all be unique
Be aware of them when developing the ng version
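
A check for duplicated ids could look like the sketch below. It assumes the locus ids have already been extracted from the annotation into a list; the function name `duplicated_ids` and all ids except PA2570.1 are made up for illustration.

```python
from collections import Counter

def duplicated_ids(locus_ids):
    """Return the ids that occur more than once, in order of first appearance."""
    counts = Counter(locus_ids)
    return [i for i in counts if counts[i] > 1]

# Hypothetical id list; only PA2570.1 is taken from the issue above.
ids = ["PA2570", "PA2570.1", "PA2571", "PA2570.1"]
print(duplicated_ids(ids))  # ['PA2570.1']
```

An empty list would mean that every locus id is unique, which is the condition the id-dependent scripts rely on.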

DESeq2 input

The expression levels (read counts) should be integers.
Change the setting of Salmon

legal characters

For example, some gene names in the roary outcome contain a quote, which caused a problem when launching mafft on the command line.
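
Two possible ways to handle such names, as a sketch: replace the illegal characters, or keep the name and quote it before it reaches the shell. The allowed character set, the function name `safe_name`, and the gene name are assumptions made for this example.

```python
import re
import shlex

def safe_name(gene_name):
    """Replace every character outside [A-Za-z0-9_.-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", gene_name)

gene = "group_12'a"       # hypothetical gene name containing a quote
print(safe_name(gene))    # group_12_a

# Alternative: keep the name but quote it before building a shell command.
quoted = shlex.quote(gene + ".fa")
print(quoted)
```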

gff file names

Should the assembler be included in the assembly name?
Review:

  • CONSTRUCT_ASSEMBLY.smk
  • COUNT_GPA.smk
  • CREATE_GPA_TABLE.smk

ambiguous wildcards

For example, the wildcards.mapper of {mapper}.vcf is 'bwa.snps' when matching the filtered outcome 'bwa.snps.vcf'

The input to Geno2Pheno

Besides creating the genomic results, also write the input file of Geno2Pheno automatically.
This function should also be open to the precomputed files (listed in the config file)

Lambda, expand, and wildcards

Snakemake's expand function is evaluated at the initialization stage, before any wildcard values are known, so it cannot depend on the wildcards of a rule. An input function (e.g., a lambda), however, can use wildcards.
Therefore, it seems better to use a lambda when the required files are in the middle of the workflow rather than at the end.
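
The difference in timing can be imitated in plain Python. This is not Snakemake code: `expand_like` is a made-up stand-in for an eager expansion, and the file patterns and the `reads/` path are hypothetical.

```python
from itertools import product
from types import SimpleNamespace

def expand_like(pattern, **values):
    """Imitation of an eager expansion: every combination is built immediately."""
    keys = list(values)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(values[k] for k in keys))]

# Evaluated once, when the workflow file is read: no wildcards exist yet.
all_vcfs = expand_like("{strain}.{mapper}.vcf", strain=["s1", "s2"], mapper=["bwa"])
print(all_vcfs)  # ['s1.bwa.vcf', 's2.bwa.vcf']

# Evaluated later, once per job, with that job's wildcard values.
reads_for = lambda wildcards: "reads/{}.fastq".format(wildcards.strain)
print(reads_for(SimpleNamespace(strain="s1")))  # reads/s1.fastq
```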

flexible about reads layout

Automatically detect mate-pair, paired-end, and single reads and set the proper options (e.g., for mapping). Consequently, the scripts (snakemake rules) should also be able to set the command conditionally.

samples table

Three columns with header specified:

  1. strains
  2. dna_reads
  3. rna_reads

The separator is tab (\t). For paired reads, the files should be both listed with the separator ',', such as fq1,fq2.
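
A minimal sketch of reading such a table, assuming the three column names above; the function names and the file names in the example are made up. Counting the files only separates single from paired reads, so mate-pair libraries would need additional information.

```python
import csv
import io

def read_samples(handle):
    """Parse the samples table; each reads field becomes a list of files."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        table[row["strains"]] = {
            col: [f for f in row[col].split(",") if f]
            for col in ("dna_reads", "rna_reads")
        }
    return table

def layout(files):
    """Two files are treated as paired reads, one file as single reads."""
    return {1: "single", 2: "paired"}.get(len(files), "unknown")

# Hypothetical table with one strain.
example = io.StringIO(
    "strains\tdna_reads\trna_reads\n"
    "s1\ts1_1.fq,s1_2.fq\ts1_rna.fq\n"
)
samples = read_samples(example)
print(layout(samples["s1"]["dna_reads"]))  # paired
print(layout(samples["s1"]["rna_reads"]))  # single
```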

software outputs

Try later versions of the external software (roary, prokka, stampy), but take care about the output formats.

Version control of the external software by the publication

user options

Review the dependencies and check if any default value could escape the rules

check reference genome

Include one more option to check reference data:

  • The chromosome name in the .fa and .gbk files should be the same
  • The .gbk file should have CDS features that include a 'locus_tag' qualifier
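
Both checks can be sketched as one function that returns a list of problems. The sketch assumes records shaped like Biopython's parsed GenBank records (attributes `id` and `features`, each feature with `type` and `qualifiers`); simple stand-in objects are used here so that the example runs without Biopython. Which attribute actually holds the chromosome name is an assumption.

```python
from types import SimpleNamespace

def check_reference(fasta_ids, gbk_records):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    gbk_ids = [rec.id for rec in gbk_records]
    if sorted(fasta_ids) != sorted(gbk_ids):
        problems.append("chromosome names differ: {} vs {}".format(fasta_ids, gbk_ids))
    for rec in gbk_records:
        for feat in rec.features:
            if feat.type == "CDS" and "locus_tag" not in feat.qualifiers:
                problems.append("CDS without locus_tag in {}".format(rec.id))
    return problems

# Stand-ins for parsed records; the second CDS lacks a locus_tag on purpose.
cds_ok = SimpleNamespace(type="CDS", qualifiers={"locus_tag": ["PA14_00010"]})
cds_bad = SimpleNamespace(type="CDS", qualifiers={"gene": ["dnaN"]})
rec = SimpleNamespace(id="NC_008463", features=[cds_ok, cds_bad])
print(check_reference(["NC_008463"], [rec]))
# ['CDS without locus_tag in NC_008463']
```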

functions about file validation

How do the file-checking functions interact with the main script?
Should they always return values?
Does the workflow exit at these functions or after something returned from them?

dryrun

Should the DAG plot be created when doing a dry-run, or not?

the indel table

Problems to solve:

  • core_genes_50.txt
    implemented in makeGroupAln.py
  • currently not able to expand by genes
    solved with "dynamic" function of snakemake
  • vcf2indel.py
    keep Aaron's algorithm but make it communicate with snakemake more smoothly
  • generate_indel_features.py and roary_PA14_abricate.txt,
    Script rewritten and the intermediate file is no longer required, because the users need to provide one more file...

Strategy:

  • Understand the roles of the intermediate files
  • Rewrite the external scripts
  • Try to reduce the intermediate files

use bcftools

Use bcftools to filter the variants,

  • size of indel (be careful about cases of indel)
abs(strlen(ALT)-strlen(REF)) >= 0
  • certain allele
ALT="."
  • variant types
TYPE!~"mnp" & TYPE!~"snp"

Example,

cat test.vcf | bcftools filter --include \
'TYPE!~"mnp" & TYPE!~"snp" & abs(strlen(ALT)-strlen(REF)) >= 0'

Workflow independence

The six final output files:

  1. phylogeny
  2. syn SNPs table
  3. non-syn SNPs table
  4. indel table
  5. expression table
  6. gpa table

should all be optional. When a file is precomputed or not needed by the user, the corresponding rules except for those required by the other outputs should be turned off.
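
One way to decide which final outputs to build, as a sketch: every config key used here (`wanted`, `precomputed`), the output names, and the file name are hypothetical. Rules needed by other outputs would still be pulled in through the normal dependencies.

```python
ALL_OUTPUTS = ["phylogeny", "syn_snps", "nonsyn_snps", "indel", "expression", "gpa"]

def requested_targets(config):
    """Keep only the outputs the user asked for and did not precompute."""
    targets = []
    for name in ALL_OUTPUTS:
        entry = config.get(name, {})
        if entry.get("wanted", False) and not entry.get("precomputed"):
            targets.append(name)
    return targets

# Hypothetical user config: gpa is precomputed, expression is not needed.
config = {
    "phylogeny": {"wanted": True},
    "gpa": {"wanted": True, "precomputed": "my_gpa.tab"},
    "expression": {"wanted": False},
}
print(requested_targets(config))  # ['phylogeny']
```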

errors when opening reference annotation

MAKE_CONS.smk:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6801: invalid start byte

which was likely caused by the Windows-based working environment of Susanne's lab.
Fixed by changing the genbank opening line:

rec= SeqIO.read(open(gbk_f, 'r', encoding='windows-1252'), 'gb')

This is a workaround; the line should be changed back to a general form.
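
A more general form could try several encodings in turn. The sketch below is an assumption about how that might look, not the current code; the function name `open_text` and the list of encodings are made up. The decoded text could then be wrapped in `io.StringIO` and handed to `SeqIO.read`.

```python
import os
import tempfile

def open_text(path, encodings=("utf-8", "windows-1252")):
    """Read a text file, trying each encoding in turn."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as handle:
                return handle.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of {} could decode {}".format(encodings, path))

# Small test file containing the byte from the error message above.
with tempfile.NamedTemporaryFile("wb", suffix=".gbk", delete=False) as tmp:
    tmp.write(b"product\xa0name\n")
text, used = open_text(tmp.name)
os.remove(tmp.name)
print(used)  # windows-1252
```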

stampy

To have Stampy available, some problems need to be dealt with:

  • installation
  • environment (python2); can be solved by stacking the environments though
  • interaction with bwa: It's not recommended to run bwa from within the stampy command; bwa should be run before using stampy. Our previous usage, however, launches bwa directly in the command.

user interface

Use a python script to pass user's options to snakemake?
If so, the snakemake API may be helpful

roary

Roary failed to create blast db files

more functions to include

More functions to include:

  • ancestral reconstruction of expression levels
  • differential expression analysis

software choices

Multi-choice about software in:

  • DETECT_SNPS::mapping (bwa vs stampy)
  • DETECT_SNPS::create_vcf (freebayes vs samtools)

manage the commands

Put the external scripts in another folder that is accessible by the core snakefile

the environment

Use the "script" function to execute python and r scripts instead of calling them with "shell".
The scripts therefore need to be rewritten to fit the environment (e.g., python and r versions) and the variables they inherit.

path parsing

The old scripts cannot correctly parse paths, although these scripts may be removed in the future...
For now, use less dangerous file and dir names

synonymous mutation detection

Multiple lines in a vcf may affect the same codon (tri-nucleotide). As the joint impact of several point mutations may differ from that of each individual mutation, the detection of syn/non-syn should be modified.
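
A small worked case, assuming the standard genetic code; the function names and the example codon are chosen for illustration. Evaluated one by one, the second SNP looks non-synonymous, but together the two SNPs leave the amino acid unchanged.

```python
from itertools import product

# Standard genetic code in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def mutate(codon, changes):
    """Apply {position_in_codon: alt_base} to a codon and return the new codon."""
    bases = list(codon)
    for pos, alt in changes.items():
        bases[pos] = alt
    return "".join(bases)

def is_synonymous(codon, changes):
    """True if the mutated codon encodes the same amino acid."""
    return CODON_TABLE[codon] == CODON_TABLE[mutate(codon, changes)]

ref = "AGA"                  # arginine
snps = {0: "C", 2: "T"}      # two vcf lines hitting the same codon
for pos, alt in snps.items():
    print(pos, alt, is_synonymous(ref, {pos: alt}))   # 0 C True, then 2 T False
print("joint", is_synonymous(ref, snps))              # joint True
```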

rules about phenotype table

In phenotype table, multiple phenotypes can be listed as columns in the same file. Each phenotype should be binary, which means each value can only be 1 or 0. For n samples with m phenotypes, the shape of table should be (n+1)-by-(m+1), where the first column includes all the sample names and the header line contains all the phenotype labels.
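
These rules could be checked as sketched below. The tab separator is an assumption (it is only stated for the other tables), and the function name and the phenotype labels are made up.

```python
import csv
import io

def phenotype_problems(handle):
    """Return a list of rule violations found in a phenotype table."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    problems = []
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append("line {}: expected {} columns, found {}".format(
                line_no, len(header), len(row)))
        for label, value in zip(header[1:], row[1:]):
            if value not in ("0", "1"):
                problems.append("line {}: {} has non-binary value {!r}".format(
                    line_no, label, value))
    return problems

# Hypothetical table: 2 samples, 2 phenotypes, one illegal value.
table = io.StringIO("\tresistant\tbiofilm\ns1\t1\t0\ns2\t0\t2\n")
print(phenotype_problems(table))
# ["line 3: biofilm has non-binary value '2'"]
```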

rules about feature tables

For the feature tables, which include those of

  • syn/non-syn SNPs
  • indels
  • gpa
  • expression levels

the feature names are listed in columns and presented at the header, and the strains are listed in rows and presented as row names. A table of n strains and m features should be (n+1)-by-(m+1). The upper-left can remain blank. The separator is tab (\t). Illegal characters shouldn't be used in feature names.
