thkuo / seq2geno
Computational pipeline for genomic features of bacterial population
License: GNU General Public License v3.0
Use Rsubread (quick tutorial included in https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html) to count the coverage from bam files
Compress the tables (binary ones) by removing redundant features
In a rule, all the output files should use the same wildcards to avoid incorrect writing.
Two main questions to solve:
There are too many temporary files. Snakemake, however, prefers passing information through files rather than variables, because this allows users to resume at any rule without rerunning those already done.
Format the commands under "shell" or "run".
indel_detection/gene_clusters2multi_fasta.py to the one used in mi-tip
The wildcards in each rule should be COMPLETELY INDEPENDENT from user-defined filenames. Otherwise, setting the filenames is not really user-friendly, while these final output names are not important to our workflow. The wildcards in our scripts only determine dependencies between rules and the intermediate results.
In short, it should not be the filenames but the variables (eg. software, tmp_d...etc) listed in the config file that play the roles in determining the rule wildcards.
The dictionaries used by different files may be redundant
Review:
mutation_table.py \
-f dict.txt \
-a /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_annotation_with_ncRNAs_07_2011_12genes.tab \
-o DNA-Pool1.tab
@Pseudomonas_aeruginosa_PA14|NC_008463|6537649
@Strain Refseq_Accession Replicon Locus_Tag Feature_Type Start Stop Strand Gene_Name Product_Name
Pseudomonas_aeruginosa_PA14 NC_008463 Chromosome PA14_00010 CDS 483 2027 + dnaA chromosomal_replication_initiation_protein
Pseudomonas_aeruginosa_PA14 NC_008463 Chromosome PA14_00020 CDS 2056 3159 + dnaN DNA_polymerase_III_subunit_beta
Snp2Amino.py \
-f DNA-Pool1.tab \
-g /data3/reference_sequences/Pseudomonas_aeruginosa_PA14_ncRNA.gbk \
-o DNA_Pool1_final.tab
It doesn't generate syn table but an all table...
to be rewritten with pandas
Because the current gbk (used previously by Ariane) includes duplicated gene ids (PA2570.1), which were likely subunits or paralogues, the counting script art2genecount.pl wasn't able to correctly count the read numbers.
Acceptable solution: the locus ids in the annotation file should all be unique
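A minimal sketch of how the uniqueness requirement could be checked before counting. The function name and the example ids are hypothetical; only PA2570.1 comes from the note above, and how the ids are extracted from the annotation file is left out.

```python
from collections import Counter


def find_duplicated_ids(locus_ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(locus_ids)
    return [locus_id for locus_id in counts if counts[locus_id] > 1]


# Hypothetical list of locus ids read from an annotation file
ids = ["PA2570", "PA2570.1", "PA2571", "PA2570.1"]
duplicated = find_duplicated_ids(ids)
if duplicated:
    print("Non-unique locus ids:", ", ".join(duplicated))
```

Running such a check at the start of the workflow would stop the run before the counting step silently miscounts reads.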
Be aware of them when developing the ng version
Review, clean, and update every involved script and software usage
For the indel table, try using the gene names as wildcards
The expression levels (read counts) should be integers.
Change the setting of Salmon
For example, some gene names in the roary output contain a quote character, which caused a problem when launching mafft in the shell command
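One possible way to keep such names from breaking a shell command is to quote them with shlex.quote from the Python standard library. This is only a sketch: the helper name and the file path are hypothetical, and the workflow may prefer renaming the files instead.

```python
import shlex


def build_mafft_command(fasta_path):
    """Build a mafft command line in which the input path is shell-safe."""
    # shlex.quote escapes quotes, spaces and other shell metacharacters
    return "mafft --auto {}".format(shlex.quote(fasta_path))


# Hypothetical gene file whose name contains a single quote
command = build_mafft_command("extracted/gene's.fa")
print(command)
```

The quoted string can then be used inside a "shell" directive or passed to a subprocess call that goes through a shell.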
Should assembler be included in the assembly name?
Review:
When parsing the annotation file of reference, the target feature (ie. locus_tag) is likely not included...
For example, the wildcards.mapper of {mapper}.vcf is 'bwa.snps' when matching the filtered outcome 'bwa.snps.vcf'
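The behaviour can be reproduced with plain regular expressions, since an unconstrained snakemake wildcard matches the pattern ".+". The sketch below compares that default with a constraint that forbids dots; the constraint "[^.]+" is an assumption chosen for illustration, and in a Snakefile it would be set through wildcard_constraints.

```python
import re

# An unconstrained wildcard behaves like ".+", so it also swallows ".snps"
unconstrained = re.compile(r"^(?P<mapper>.+)\.vcf$")
# A wildcard that is not allowed to contain dots
constrained = re.compile(r"^(?P<mapper>[^.]+)\.vcf$")

greedy_match = unconstrained.match("bwa.snps.vcf")
print(greedy_match.group("mapper"))

# The constrained pattern rejects the filtered file and accepts the raw one
print(constrained.match("bwa.snps.vcf"))
print(constrained.match("bwa.vcf").group("mapper"))
```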
Besides creating the genomic results, also write the input file of Geno2Pheno automatically.
This function should also be open to the precomputed files (listed in the config file)
The function expand of snakemake has nothing to do with the dependencies among rules, because it is done at the initialization stage. The other function, lambda, however, can use wildcards.
Therefore, it seems better to use lambda when the required files are in the middle but not at the end of workflow.
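The difference in timing can be imitated in plain Python. The sketch below does not use snakemake itself: expand_like is a simplified stand-in for expand, SimpleNamespace stands in for the wildcards object, and all file patterns are hypothetical.

```python
from itertools import product
from types import SimpleNamespace


def expand_like(pattern, **values):
    """Simplified stand-in for expand: returns every combination right away."""
    keys = list(values)
    combos = product(*(values[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


# Evaluated immediately, before any job or wildcard exists
all_vcfs = expand_like("{sample}.{mapper}.vcf", sample=["s1", "s2"], mapper=["bwa"])

# Evaluated later, once per job, with that job's wildcard values
reads_for = lambda wildcards: "reads/{}.fq".format(wildcards.sample)

print(all_vcfs)
print(reads_for(SimpleNamespace(sample="s1")))
```

The list is fixed as soon as the file is parsed, while the lambda only produces a path when it is handed the wildcards of a concrete job, which is why it suits rules in the middle of the workflow.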
Automatically detect mate-pair, paired-end, and single reads and set proper options (ie. for mapping). Consequently, the script (snakemake rules) should also be able to conditionally set the command
Three columns with header specified:
The separator is tab (\t). For paired reads, the files should be both listed with the separator ',', such as fq1,fq2.
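A short sketch of how one line of such a list might be parsed. The column names are not given here, so the functions only check the number of fields and split a reads field; both function names and the example values are hypothetical.

```python
def parse_sample_line(line):
    """Split one tab-separated line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got {}".format(len(fields)))
    return fields


def split_reads(field):
    """Return one file for single reads or two files for paired reads."""
    files = [name.strip() for name in field.split(",")]
    if len(files) not in (1, 2) or not all(files):
        raise ValueError("reads must be 'fq' or 'fq1,fq2': {!r}".format(field))
    return files


# Hypothetical values; which column holds the reads is an assumption
example_fields = parse_sample_line("strainA\tfq1,fq2\tref.fa\n")
example_reads = split_reads(example_fields[1])
print(example_reads)
```

The number of returned files could then drive the conditional choice between single-read and paired-read mapping options.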
Try later versions of the external software: roary, prokka, stampy, but ...
Version control of the external software by the publication
Review the dependencies and check if any default value could escape the rules
It seems that the output files created earlier than the input files are not recognized by snakemake
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#ignoring-timestamps
Include one more option to check reference data:
How do the functions that check files interact with the main script?
Should they always return values?
Does the workflow exit at these functions or after something returned from them?
With or without creating DAG plot when doing dry-run
The software roary automatically excludes the very small genes and pseudogenes
sanger-pathogens/Roary#367
sanger-pathogens/Roary#288
Therefore, genes that are less likely to function would not be included in the gpa table and indel table
Problems to solve:
Strategy:
Use bcftools to filter the variants,
abs(strlen(ALT)-strlen(REF)) >= 0
ALT="."
TYPE!~"mnp" & TYPE!~"snp"
Example,
cat test.vcf | bcftools filter --include \
'TYPE!~"mnp" & \
TYPE!~"snp" & \
abs(strlen(ALT)-strlen(REF)) >= 0'
The six final output files:
should all be optional. When a file is precomputed or not needed by the user, the corresponding rules except for those required by the other outputs should be turned off.
MAKE_CONS.smk:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6801: invalid start byte
which was likely caused by the windows-based working environment of Susanne's lab.
Fixed by changing the genbank opening line:
rec= SeqIO.read(open(gbk_f, 'r', encoding='windows-1252'), 'gb')
This is a workaround and should be replaced by a general solution.
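One way a more general solution might look is to try utf-8 first and fall back to windows-1252 only when decoding fails. This is a sketch under that assumption; the function name, the default encoding list and the demo file are hypothetical.

```python
import os
import tempfile


def open_with_fallback(path, encodings=("utf-8", "windows-1252")):
    """Open a text file with the first encoding that decodes all of it."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                handle.read()  # force decoding of the whole file
        except UnicodeDecodeError:
            continue
        return open(path, "r", encoding=encoding)
    raise ValueError("none of the encodings could decode " + path)


# Demo file containing byte 0xa0, which is not valid utf-8 on its own
with tempfile.NamedTemporaryFile(delete=False, suffix=".gbk") as tmp:
    tmp.write(b"LOCUS\xa0demo\n")
with open_with_fallback(tmp.name) as handle:
    text = handle.read()
os.remove(tmp.name)
```

The returned handle could be passed to SeqIO.read(handle, 'gb') in place of the hard-coded open call.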
To have Stampy available, some problems need to be dealt with:
Use a python script to pass user's options to snakemake?
If so, the snakemake API may be helpful
Roary failed to create blast db files
More functions to include:
(For ng version)
The freebayes documentation suggests that
In practice, the discriminant power of the method will improve if you run multiple samples simultaneously.
Refer to freebayes README
Multi-choice about software in:
The options for haploid:
-p 1
The python package subprocess cannot directly activate a virtual environment by calling source. source is not a command but a shell builtin, so it has to be run inside a shell, together with the command that needs the environment.
https://stackoverflow.com/questions/7040592/calling-the-source-command-from-subprocess-popen
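A minimal sketch of the usual workaround: start bash explicitly and run source and the actual command in the same shell process. The helper name, the demo script and the variable SEQ2GENO_DEMO are hypothetical, and bash is assumed to be installed.

```python
import os
import shlex
import subprocess
import tempfile


def run_after_source(env_script, command):
    """Run `command` in a bash process that first sources `env_script`."""
    shell_line = "source {} && {}".format(shlex.quote(env_script), command)
    return subprocess.run(
        ["bash", "-c", shell_line],
        capture_output=True, text=True, check=True,
    )


# Demo: the sourced script sets a variable that the command then reads
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
    tmp.write("export SEQ2GENO_DEMO=activated\n")
result = run_after_source(tmp.name, "echo $SEQ2GENO_DEMO")
os.remove(tmp.name)
demo_value = result.stdout.strip()
print(demo_value)
```

The environment only exists inside that bash process, so every command that depends on it has to be part of the same call.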
Put the external scripts in another folder that is accessible by the core snakefile
Use "script" function to execute python and r scripts instead of calling them with "shell".
The scripts thus need to be rewritten to fit the environment (eg. python and r versions) and the inherited variables.
The old scripts cannot correctly parse paths, although these scripts may be removed in the future...
For now, use less dangerous file and dir names
Environment installation directory
Multiple lines in a vcf may influence the same codon (tri-nucleotide). As the joint impact of point mutations may be different from that of the individual mutations, the detection of syn/non-syn should be modified.
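A small worked example of the effect, using the standard genetic code. The codon TTA and the two substitutions are chosen for illustration only, and the helper names are hypothetical; this is not how Snp2Amino.py works.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snps(codon, snps):
    """Apply (offset_in_codon, alt_base) substitutions to a codon."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


def is_synonymous(ref_codon, alt_codon):
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


ref = "TTA"          # leucine
snp_a = (0, "C")     # alone: CTA, still leucine
snp_b = (2, "T")     # alone: TTT, phenylalanine
only_a = is_synonymous(ref, apply_snps(ref, [snp_a]))
only_b = is_synonymous(ref, apply_snps(ref, [snp_b]))
joint = is_synonymous(ref, apply_snps(ref, [snp_a, snp_b]))  # CTT, leucine
print(only_a, only_b, joint)
```

Evaluated line by line, the second variant would be reported as non-synonymous, while the codon carrying both variants still encodes leucine. Variants therefore need to be grouped by codon before they are classified.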
In phenotype table, multiple phenotypes can be listed as columns in the same file. Each phenotype should be binary, which means each value can only be 1 or 0. For n samples with m phenotypes, the shape of table should be (n+1)-by-(m+1), where the first column includes all the sample names and the header line contains all the phenotype labels.
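The layout can be checked with a few lines of Python. The sketch assumes a tab separator, which is stated for the feature tables but not for the phenotype table; the function name, sample names and phenotype labels are hypothetical.

```python
import csv
import io


def read_phenotype_table(handle):
    """Read an (n+1)-by-(m+1) table and check that every value is 0 or 1."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    phenotypes = header[1:]
    table = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError("row {!r} has the wrong number of columns".format(row[0]))
        sample, values = row[0], row[1:]
        if any(value not in ("0", "1") for value in values):
            raise ValueError("non-binary value for sample {!r}".format(sample))
        table[sample] = dict(zip(phenotypes, (int(v) for v in values)))
    return phenotypes, table


# Hypothetical table with n = 2 samples and m = 2 phenotypes
demo = io.StringIO("\tresistant_A\tresistant_B\ns1\t1\t0\ns2\t0\t0\n")
phenotypes, table = read_phenotype_table(demo)
print(phenotypes, table)
```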
Check out picard
jvarkits is currently linked to the official github repo.
For the feature tables, which includes those of
the feature names are listed in columns and presented at the header, and the strains are listed in rows and presented as row names. A table of n strains and m features should be (n+1)-by-(m+1). The upper-left can remain blank. The separator is tab (\t). Illegal characters shouldn't be used in feature names.