The seq2geno from thkuo

counting the coverage

Use Rsubread (quick tutorial included in https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html) to count the coverage from bam files

feature reduction

compress the tables (binary ones) by removing redundant features
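
A minimal sketch of one way to do this, assuming "redundant" means features with an identical 0/1 pattern across all strains. The function name and the dict-based table layout are made up for illustration; they are not part of seq2geno.

```python
from collections import defaultdict

def reduce_redundant_features(table):
    """Collapse features whose 0/1 pattern across strains is identical.

    table: dict mapping feature name -> tuple of 0/1 values (one per strain,
    always in the same strain order).
    Returns (reduced, groups): one representative feature per pattern, and
    the list of original features each representative stands for.
    """
    by_pattern = defaultdict(list)
    for feature, values in table.items():
        by_pattern[tuple(values)].append(feature)

    reduced, groups = {}, {}
    for pattern, features in by_pattern.items():
        representative = features[0]  # first feature seen with this pattern
        reduced[representative] = pattern
        groups[representative] = features
    return reduced, groups

# Toy table: geneA and geneB carry the same information.
table = {"geneA": (1, 0, 1), "geneB": (1, 0, 1), "geneC": (0, 1, 1)}
reduced, groups = reduce_redundant_features(table)
```

Keeping the groups mapping means the compressed table can be expanded again when the results are interpreted.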

wildcards problem

In a rule, all the output files should use the same wildcards to avoid incorrect writing.

How to start correcting?
How to fit the current scripts to this requirement while ensuring the correct dependencies?
How to avoid the wildcard multi-match problem? (e.g. "{sample}.tmp.fa" may be incorrectly matched by "{sample}.fa")
-> Try single and simple extensions at every level (folder)
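
The multi-match can be reproduced outside snakemake with a simplified stand-in for its pattern matching. This is only a sketch: pattern_to_regex is a hypothetical helper, and it assumes an unconstrained wildcard behaves like the regex ".+". In a Snakefile, the restriction shown here would be expressed with wildcard_constraints.

```python
import re

def pattern_to_regex(pattern, constraint=".+"):
    """Turn a '{name}'-style filename pattern into an anchored regex.

    Simplified imitation of wildcard matching; every wildcard gets the
    same constraint.
    """
    parts = re.split(r"\{(\w+)\}", pattern)  # literal, name, literal, ...
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += "(?P<{}>{})".format(part, constraint)
    return re.compile(regex + "$")

# Unconstrained: the intermediate file is matched by the final pattern.
loose = pattern_to_regex("{sample}.fa")
loose_sample = loose.match("strainA.tmp.fa").group("sample")

# Constrained to "no dots": the intermediate file no longer matches.
strict = pattern_to_regex("{sample}.fa", constraint=r"[^.]+")
strict_match = strict.match("strainA.tmp.fa")
strict_sample = strict.match("strainA.fa").group("sample")
```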

Division and collaboration

Two main questions to solve:

How are the jobs divided?
Who can guide whose changes, and at what scale? (Who leads the evolution?)

Clean folder

Where to put the smk scripts?
Where to put the external scripts?
Where to put the required software that is not available via conda?
Working folder, results folder, and the program location
Is there any software based on snakemake?

reduce the number of temporary files

There are too many temporary files. Snakemake, however, prefers passing information through files rather than variables, because this lets users resume at any rule without rerunning the ones already finished.

format commands

Format the commands under "shell" or "run".

remove redundant scripts

Replace indel_detection/gene_clusters2multi_fasta.py with the one used in mi-tip

Wildcards independent from the output filenames

The wildcards in each rule should be COMPLETELY INDEPENDENT from user-defined filenames. Otherwise, setting the filenames is not really user-friendly, while these final output names are not important to our workflow. The wildcards in our scripts only determine dependencies between rules and the intermediate results.
In short, it should not be the filenames but the variables (eg. software, tmp_d...etc) listed in the config file that play the roles in determining the rule wildcards.

intermediate files to be removed

The dictionaries used by different files may be redundant
Review:

collect_rpg_data.R (for the expression table)

mutation_table.py \
-f dict.txt \
-a /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_annotation_with_ncRNAs_07_2011_12genes.tab \
-o DNA-Pool1.tab

dict.txt: prefix of ".flt.vcf" (the file some_line+".flt.vcf") [to be deprecated]
.flt.vcf: generated above, which contains only snps [to be simplified]
Pseudomonas_aeruginosa_PA14_annotation_with_ncRNAs_07_2011_12genes.tab: gene locus information [to be deprecated]

@Pseudomonas_aeruginosa_PA14|NC_008463|6537649
@Strain Refseq_Accession        Replicon        Locus_Tag       Feature_Type    Start   Stop    Strand  Gene_Name       Product_Name
Pseudomonas_aeruginosa_PA14     NC_008463       Chromosome      PA14_00010      CDS     483     2027    +       dnaA    chromosomal_replication_initiation_protein
Pseudomonas_aeruginosa_PA14     NC_008463       Chromosome      PA14_00020      CDS     2056    3159    +       dnaN    DNA_polymerase_III_subunit_beta

Snp2Amino.py \
-f DNA-Pool1.tab \
-g /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_ncRNA.gbk \
-o DNA_Pool1_final.tab

It doesn't generate syn table but an all table...
to be rewritten with pandas

DNA-Pool1.tab: created by mutation_table.py; containing four columns: "gene, pos, ref, alt"

id-dependent methods fail in case of duplicated ids in gbk

Because the current gbk (used previously by Ariane) includes duplicated gene ids (PA2570.1), which were likely subunits or paralogues, the counting script art2genecount.pl wasn't able to correctly count the read numbers.
Acceptable solution: the locus ids in the annotation file should all be unique
Be aware of them when developing the ng version
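
A quick uniqueness check can be sketched as below. The id list is a toy example; how the locus ids are extracted from the gbk is left out on purpose.

```python
from collections import Counter

def duplicated_ids(locus_ids):
    """Return the sorted ids that occur more than once."""
    counts = Counter(locus_ids)
    return sorted(locus_id for locus_id, n in counts.items() if n > 1)

# Toy input: PA2570.1 appears twice, as in the problematic gbk.
ids = ["PA2570", "PA2570.1", "PA2570.1", "PA2571"]
duplicates = duplicated_ids(ids)
```

An empty result means the annotation is safe for id-dependent methods.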

clean dirty scripts and methods

Review, clean, and update every involved script and software usage

gene names as wildcards

For the indel table, try using the gene names as wildcards

DESeq2 input

The expression levels (read counts) should be integers.
Change the setting of Salmon

legal characters

For example, some gene names in the roary output contain a quote character, which caused problems when launching mafft from the command line
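
Two possible defenses, sketched with the standard library only: rename the offending names, or quote them when the command is built. The helper names and the set of allowed characters are assumptions, not what seq2geno does.

```python
import re
import shlex

def safe_name(gene_name):
    """Replace every character outside [A-Za-z0-9_.-] with '_'."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", gene_name)

def build_alignment_command(program, in_file, out_file):
    """Build 'program in > out' with both paths quoted for the shell."""
    return "{} {} > {}".format(
        program, shlex.quote(in_file), shlex.quote(out_file)
    )

renamed = safe_name("group_1's")
command = build_alignment_command("mafft", "group_1's.fa", "group_1's.aln")
# Splitting the command again recovers the original file names intact.
tokens = shlex.split(command)
```

Renaming keeps the file names simple everywhere, while quoting keeps the original names; which one fits better depends on whether the names must match the roary output later.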

gff file names

Should assembler be included in the assembly name?
Review:

CONSTRUCT_ASSEMBLY.smk
COUNT_GPA.smk
CREATE_GPA_TABLE.smk

Parsing reference annotation file

When parsing the annotation file of reference, the target feature (ie. locus_tag) is likely not included...

ambiguous wildcards

For example, the wildcards.mapper of {mapper}.vcf is 'bwa.snps' when matching the filtered outcome 'bwa.snps.vcf'

The input to Geno2Pheno

Besides creating the genomic results, also write the input file of Geno2Pheno automatically.
This function should also be open to the precomputed files (listed in the config file)

Lambda, expand, and wildcards

Snakemake's expand function has nothing to do with the dependencies among rules: it is evaluated at the initialization stage, before any wildcard values are known. A lambda (input function), however, can use wildcards.
Therefore, it seems better to use a lambda when the required files are in the middle of the workflow rather than at the end.
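
The difference can be illustrated in plain python. This is a sketch only: DNA_READS and reads_for are made-up names, and SimpleNamespace merely stands in for the wildcards object that snakemake would pass to an input function.

```python
from types import SimpleNamespace

# Hypothetical sample sheet, as it might be loaded from the config file.
DNA_READS = {
    "strainA": ["A_1.fq.gz", "A_2.fq.gz"],
    "strainB": ["B.fq.gz"],
}

# expand-like: built once, up front, for every strain. In a Snakefile this
# would be expand("{strain}.bam", strain=list(DNA_READS)).
all_bams = ["{}.bam".format(strain) for strain in DNA_READS]

def reads_for(wildcards):
    """Input function: resolved per job, once the wildcard values are known."""
    return DNA_READS[wildcards.strain]

# In a Snakefile this would be 'input: reads_for' or the equivalent
# 'input: lambda wildcards: DNA_READS[wildcards.strain]'.
# Outside snakemake, a stand-in object is enough to exercise it:
job_wildcards = SimpleNamespace(strain="strainB")
job_inputs = reads_for(job_wildcards)
```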

flexible about reads layout

Automatically detect mate-pair, paired-end, and single reads and set the proper options (e.g. for mapping). Consequently, the script (snakemake rules) should also be able to set the command conditionally.

samples table

Three columns with header specified:

strains
dna_reads
rna_reads

The separator is tab (\t). For paired reads, the files should be both listed with the separator ',', such as fq1,fq2.
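
Reading such a table could look like the sketch below. It assumes the layout is inferred only from the number of listed files, so mate-pair and paired-end reads cannot be told apart here; read_samples and the returned structure are illustrative, not the seq2geno implementation.

```python
import csv
import io

LAYOUTS = {0: "missing", 1: "single", 2: "paired"}

def read_samples(handle):
    """Parse the tab-separated samples table into a nested dict."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = {}
        for column in ("dna_reads", "rna_reads"):
            files = [f for f in (row[column] or "").split(",") if f]
            if len(files) not in LAYOUTS:
                raise ValueError(
                    "{}: {} files listed in {}".format(
                        row["strains"], len(files), column
                    )
                )
            entry[column] = {"files": files, "layout": LAYOUTS[len(files)]}
        samples[row["strains"]] = entry
    return samples

# Toy table: paired DNA reads, single RNA reads.
example = "strains\tdna_reads\trna_reads\nA\tA_1.fq,A_2.fq\tA_rna.fq\n"
samples = read_samples(io.StringIO(example))
```

The layout field can then decide which mapping command a rule assembles.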

software outputs

Try later versions of the external software: roary, prokka, stampy, but ...

take care about the output formats

Version control of the external software by the publication

user options

Review the dependencies and check if any default value could escape the rules

file timestamps matter for the DAG of snakemake

It seems that output files created earlier than their input files are not recognized as up to date by snakemake
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#ignoring-timestamps

Determine whether this behavior should be on or not

check reference genome

Include one more option to check reference data:

The chromosome name in the .fa and .gbk files should be the same
The .gbk file should have CDS features that include a 'locus_tag' qualifier
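
Both checks can be sketched with plain text parsing, as below. This is a rough illustration on toy data that assumes the standard GenBank flat-file layout (feature keys indented by five spaces, qualifiers indented further); a real implementation would probably rely on a proper parser such as Biopython instead. All function names are made up.

```python
def fasta_names(fasta_text):
    """Sequence names: first word of every '>' header line."""
    return [line[1:].split()[0] for line in fasta_text.splitlines()
            if line.startswith(">")]

def genbank_names(gbk_text):
    """Sequence names: second field of every LOCUS line."""
    return [line.split()[1] for line in gbk_text.splitlines()
            if line.startswith("LOCUS")]

def count_cds_without_locus_tag(gbk_text):
    """Count CDS features that carry no /locus_tag qualifier."""
    missing, in_cds, has_tag = 0, False, False
    for line in gbk_text.splitlines():
        is_feature_key = (line.startswith("     ") and len(line) > 5
                          and not line[5].isspace())
        if is_feature_key or not line.startswith(" "):
            # A new feature or a new top-level section closes the open CDS.
            if in_cds and not has_tag:
                missing += 1
            in_cds = is_feature_key and line.split()[0] == "CDS"
            has_tag = False
        elif in_cds and line.strip().startswith("/locus_tag="):
            has_tag = True
    if in_cds and not has_tag:
        missing += 1
    return missing

# Toy reference: matching names, but the second CDS lacks a locus_tag.
FASTA = ">chr1 toy reference\nATGAGCTAAATGAGCTAA\n"
GBK = (
    "LOCUS       chr1      18 bp    DNA     linear\n"
    "FEATURES             Location/Qualifiers\n"
    "     CDS             1..9\n"
    '                     /locus_tag="toy_0001"\n'
    "     CDS             10..18\n"
    '                     /gene="toyB"\n'
    "ORIGIN\n"
    "//\n"
)
names_agree = fasta_names(FASTA) == genbank_names(GBK)
n_missing = count_cds_without_locus_tag(GBK)
```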

functions about file validation

How do the functions checking files interact with the main script?
Should they always return values?
Does the workflow exit at these functions or after something returned from them?

dryrun

With or without creating DAG plot when doing dry-run

roary usage

The software roary automatically excludes very small genes and pseudogenes
sanger-pathogens/Roary#367
sanger-pathogens/Roary#288
Therefore, genes that are less likely to function would not be included in the gpa table and indel table

the indel table

Problems to solve:

core_genes_50.txt
implemented in makeGroupAln.py
currently not able to expand by genes
solved with "dynamic" function of snakemake
vcf2indel.py
keep Aaron's algorithm but make it communicate with snakemake more smoothly
generate_indel_features.py and roary_PA14_abricate.txt,
Script rewritten and the intermediate file is no longer required, because the users need to provide one more file...

Strategy:

Understand the roles of the intermediate files
Rewrite the external scripts
Try to reduce the intermediate files

use bcftools

Use bcftools to filter the variants,

size of indel (be careful about cases of indel)

abs(strlen(ALT)-strlen(REF)) >= 0

certain allele

ALT="."

variant types

TYPE!~"mnp" & TYPE!~"snp"

Example,

cat test.vcf | bcftools filter --include \
'TYPE!~"mnp" & \
TYPE!~"snp" & \
abs(strlen(ALT)-strlen(REF)) >= 0'

Workflow independence

The six final output files:

phylogeny
syn SNPs table
non-syn SNPs table
indel table
expression table
gpa table

should all be optional. When a file is precomputed or not needed by the user, the corresponding rules except for those required by the other outputs should be turned off.
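
One way to make the outputs optional is to build the target list from the config, as sketched here. The config keys are hypothetical; in a Snakefile the returned list could be handed to the input of the first rule (e.g. rule all).

```python
# Hypothetical config keys, one per final output.
ALL_OUTPUTS = (
    "phylogeny",
    "syn_snps_table",
    "nonsyn_snps_table",
    "indel_table",
    "expression_table",
    "gpa_table",
)

def requested_targets(config):
    """Return the output paths to build.

    Keys that are absent, None, or empty are treated as "not needed",
    so no rule is triggered on their behalf.
    """
    return [config[key] for key in ALL_OUTPUTS if config.get(key)]

# Toy config: only the tree and the gpa table are requested.
config = {"phylogeny": "tree.nwk", "gpa_table": "gpa.tab", "indel_table": None}
targets = requested_targets(config)
```

Rules shared with the requested outputs still run, because snakemake resolves them from the remaining targets.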

errors when opening reference annotation

MAKE_CONS.smk:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6801: invalid start byte

which was likely caused by the windows-based working environment of Susanne's lab.
Fixed by changing the genbank opening line:

rec= SeqIO.read(open(gbk_f, 'r', encoding='windows-1252'), 'gb')

This workaround should be changed back to a general solution
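
A more general variant might try a list of encodings in order instead of hard-coding one. The sketch below assumes utf-8 first and windows-1252 as the fallback; the function name is made up, and the returned text would still have to be passed on to the genbank parser (for instance through io.StringIO).

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "windows-1252")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next one
    raise last_error

# Demo: byte 0xa0 is invalid as a utf-8 start byte but valid in windows-1252.
with tempfile.NamedTemporaryFile("wb", suffix=".gbk", delete=False) as tmp:
    tmp.write(b"LOCUS\xa0demo\n")
text, used_encoding = read_text_with_fallback(tmp.name)
os.remove(tmp.name)
```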

stampy

To have Stampy available, some problems need to be dealt with:

installation
environment (python2); can be solved by stacking the environments though
interaction with bwa: It's not recommended to launch bwa from the stampy command; bwa should be run before stampy. Our previous usage, however, launches bwa directly in the stampy command.

user interface

Use a python script to pass user's options to snakemake?
If so, the snakemake API may be helpful

roary

Roary failed to create blast db files

more functions to include

More functions to include:

ancestral reconstruction about expr. levels
differential expression analysis

freebayes with multiple samples

(For ng version)
The freebayes documentation suggests:

In practice, the discriminant power of the method will improve if you run multiple samples simultaneously.

Refer to freebayes README

software choices

Multi-choice about software in:

DETECT_SNPS::mapping (bwa vs stampy)
DETECT_SNPS::create_vcf (freebayes vs samtools)

freebayes

The options for haploid:

-p 1

loading the main environment

The python package subprocess cannot directly activate a virtual environment by calling source. source is not an executable but a shell builtin, so it has to be run through a shell.
https://stackoverflow.com/questions/7040592/calling-the-source-command-from-subprocess-popen
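
A sketch of the usual workaround: start bash explicitly and let it source the file before the real command, all in one shell line. It assumes bash is available; run_in_env and the demo variable are made up for illustration.

```python
import os
import shlex
import subprocess
import tempfile

def run_in_env(activate_script, command):
    """Run `command` in a bash shell after sourcing `activate_script`.

    Sourcing only affects the shell that does it, so both steps have to
    happen inside the same bash process.
    """
    shell_line = "source {} && {}".format(shlex.quote(activate_script), command)
    result = subprocess.run(
        ["bash", "-c", shell_line],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# Demo with a stand-in "activate" script that only exports one variable.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
    tmp.write("export SEQ2GENO_DEMO=ready\n")
output = run_in_env(tmp.name, 'echo "$SEQ2GENO_DEMO"')
os.remove(tmp.name)
```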

manage the commands

Put the external scripts in another folder that is accessible by the core snakefile

the environment

Use "script" function to execute python and r scripts instead of calling them with "shell".
The scripts thus need to be rewritten to fit the environment (eg. python and r versions) and the inherited variables.

path parsing

The old scripts cannot correctly parse paths, although these scripts may be removed in the future...
For now, use less dangerous file and dir names

load the local environment

Environment installation directory

synonymous mutation detection

Multiple lines in a vcf may affect the same codon (tri-nucleotide). As the joint impact of point mutations may differ from that of each individual mutation, the detection of syn/non-syn should be modified.
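
The joint effect can be illustrated with a small sketch that groups SNPs by codon before translating. It assumes the standard genetic code, a forward-strand coding sequence, and 0-based positions relative to the start of the CDS; indels and reverse-strand genes are ignored, and codon_effects is a made-up name.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codon_effects(cds, snps):
    """Apply all SNPs falling into the same codon together.

    cds: coding sequence starting at the first codon position.
    snps: iterable of (0-based position in cds, alternative base).
    Returns {codon_index: (reference amino acid, alternative amino acid)}.
    """
    by_codon = defaultdict(list)
    for pos, alt in snps:
        by_codon[pos // 3].append((pos % 3, alt))

    effects = {}
    for index, changes in by_codon.items():
        ref_codon = cds[3 * index: 3 * index + 3]
        alt_codon = list(ref_codon)
        for offset, alt in changes:
            alt_codon[offset] = alt
        effects[index] = (CODON_TABLE[ref_codon],
                          CODON_TABLE["".join(alt_codon)])
    return effects

cds = "ATGAGCTAA"  # toy CDS: Met, Ser (AGC), stop
joint = codon_effects(cds, [(3, "T"), (4, "C")])
separate = [codon_effects(cds, [snp]) for snp in [(3, "T"), (4, "C")]]
```

In this toy case each SNP alone changes the serine codon (to Cys or Thr), while both together give TCC, which is serine again.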

rules about phenotype table

In phenotype table, multiple phenotypes can be listed as columns in the same file. Each phenotype should be binary, which means each value can only be 1 or 0. For n samples with m phenotypes, the shape of table should be (n+1)-by-(m+1), where the first column includes all the sample names and the header line contains all the phenotype labels.
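
These rules could be checked before the workflow starts, for example as in the sketch below. The function name and the wording of the messages are made up; the only rules enforced are the ones stated above (consistent column count, values 0 or 1).

```python
def validate_phenotype_table(lines):
    """Return a list of problems found in a tab-separated phenotype table.

    lines: iterable of text lines, header first. The upper-left cell of the
    header may be blank.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return ["table is empty"]

    header, body = rows[0], rows[1:]
    problems = []
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append("line {}: expected {} columns, found {}".format(
                number, len(header), len(row)))
        bad = [value for value in row[1:] if value not in ("0", "1")]
        if bad:
            problems.append("line {}: non-binary values {}".format(number, bad))
    return problems

# Toy tables: 2 samples x 2 phenotypes (valid), and one with a value of 2.
good = ["\tresistant\tbiofilm\n", "A\t1\t0\n", "B\t0\t1\n"]
bad = ["\tresistant\n", "A\t2\n"]
good_problems = validate_phenotype_table(good)
bad_problems = validate_phenotype_table(bad)
```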

process reads and mapping results

Check out picard

create submodules

jvarkits is currently linked to the official github repo.

rules about feature tables

For the feature tables, which include those of

syn/non-syn SNPs
indels
gpa
expression levels

the feature names are listed in columns and presented at the header, and the strains are listed in rows and presented as row names. A table of n strains and m features should be (n+1)-by-(m+1). The upper-left can remain blank. The separator is tab (\t). Illegal characters shouldn't be used in feature names.
