Comments (4)
[vcf2indel.py]
The current binary labels are ambiguous and can lead to incorrect ML results.
A 1 may mean an insertion, a deletion, a missing value, or a mixture of indels. Furthermore, for a given indel feature, is 1 1 1 0 0 equivalent to 0 0 0 1 1? Are two features that have the same values redundant?
Key questions to answer:
- Should the definition of indels be reference-based?
- How should missing values be handled (e.g. the dots in the ALT and GT fields)?
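To make the equivalence question concrete, here is a minimal sketch (not code from vcf2indel.py; the helper name `same_partition` is made up for illustration). It checks whether two binary feature vectors split the samples into the same two groups, which is the case both when they are identical and when one is the bitwise complement of the other.

```python
def same_partition(a, b):
    """Return True if binary vectors a and b separate the samples identically.

    Two vectors carry the same grouping when they are equal or when
    every position is flipped (0 <-> 1).
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must cover the same samples")
    identical = all(x == y for x, y in zip(a, b))
    complementary = all(x != y for x, y in zip(a, b))
    return identical or complementary


print(same_partition([1, 1, 1, 0, 0], [0, 0, 0, 1, 1]))  # True
print(same_partition([1, 1, 1, 0, 0], [1, 1, 0, 0, 0]))  # False
```

Whether complementary features should be treated as the same feature depends on whether the labels are reference-based: without a fixed meaning for 1, the two vectors above are indistinguishable in content.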
from seq2geno.
In the four filters of vcf2indel.py:
case 1: not existing
    if ref == ".": ...
case 2: majority-based assignment
    elif alt == ".": ...
case 3: mixture of majority-based and reference-based methods
    elif len(sample_names.loc[samples == "."]) > sample_names.shape[0]/2: ...
case 4: reference-based
    else: ...
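The branch order can be sketched as a small standalone function. This is only an illustration: the conditions are taken from the snippets above, but the branch bodies are elided there, so the sketch only returns a label. The function name `which_case` and the assumption that `genotypes` is a plain list of per-sample values in which "." marks a missing call are mine, not the script's.

```python
def which_case(ref, alt, genotypes):
    """Return the label of the branch a VCF record would fall into.

    ref, alt  -- REF and ALT fields of the record (strings)
    genotypes -- assumed: one value per sample, "." meaning missing
    """
    n_missing = sum(1 for g in genotypes if g == ".")
    if ref == ".":
        return "case 1: not existing"
    elif alt == ".":
        return "case 2: majority-based"
    elif n_missing > len(genotypes) / 2:
        return "case 3: mixed majority/reference-based"
    else:
        return "case 4: reference-based"


print(which_case("A", "AT", [".", ".", "0"]))  # case 3: mixed majority/reference-based
print(which_case("A", "AT", [".", "0", "0"]))  # case 4: reference-based
```

Because the same record can be labeled by a different rule depending on how many samples are missing, the meaning of a 1 in the output table changes from feature to feature.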
from seq2geno.
For now, only the connection with the main snakemake script has been updated.
from seq2geno.
Even with the original methods, the indels were incorrectly detected in this case:
>CH2500
>F1659
>CH2502
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>CH2522
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>ESP088
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
>MHH15083
ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTG
GTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTC
TCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGC
AGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTG
GTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCT
ACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCG
GCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAG
GACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCG
GTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGC
CACCTGCGCCTGCAGGACTGA
and the VCF output was:
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=561>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CH2500 CH2502 CH2522 ESP088 F1659 MHH15083
chrUn 0 . ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGCGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA ATGAGCCGCTTTGAAATCGCCTTTTCCGGCCAGTTGGTCGCCGGCGCCCGTCCCGAGGTGGTCAAGGCCAACCTGGCCAAGCTGTTCCAGGCCGACGCGCAGCGTATCGAACTGCTGTTCTCCGGCCGCCGGGTGGTGATCAAGAACAACCTCGATGCCGCCTCCGCGGAAAAATACCGCAGCGTGCTGGAGCGAGCGGGAGCGATCGCCGTGGTCGCCGAGATGGAGGTCGAGGAGGTGGTCATGGCGCCGCCGCCTGCGCAGACGACTCCCGTGGAGGCCCCGCAGACCCGTGCCGCTACTGGTACCAGCGCGCCCGCCGGACGCTTGCAGGTAGCGCCGCGGGACGGCTACATGGCGGCGTTCGCCGAGGTCGATGCGCCGGATTTCGGCCTGGCTCCGGTAGGCGCCGACCTACAGGACGCCAAGGCCGAAGCCGAGGCGCCGAAACTCGACCTGAGCCGCTTCAGCGTCGCCCCGGTCGGTAGCGACATGGGCCAGGCACGCTCCGAGCCAGCGGCTCCGGCTCCGGACACCAGCCACCTGCGCCTGCAGGACTGA . . DP=4 GT:DP ./. 0/0:1 1/1:1 0/0:1 ./. 0/0:1
The gene is missing in two strains (CH2500 and F1659, which have empty FASTA entries and ./. genotypes), but the old method did not detect this. It could still be listed in the gpa table, though...
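As a rough illustration of how the missing strains could be picked out of the record above, the sketch below scans the sample columns for fully missing genotypes. It is not the existing vcf2indel.py logic; the helper name `missing_samples` is hypothetical, and the sample names and fields are copied from the VCF header and record shown above.

```python
def missing_samples(sample_names, sample_fields, fmt="GT:DP"):
    """Return the samples whose GT has only missing alleles (e.g. ./.)."""
    gt_index = fmt.split(":").index("GT")
    missing = []
    for name, field in zip(sample_names, sample_fields):
        gt = field.split(":")[gt_index]
        # Treat phased (|) and unphased (/) separators the same way.
        alleles = gt.replace("|", "/").split("/")
        if all(allele == "." for allele in alleles):
            missing.append(name)
    return missing


samples = ["CH2500", "CH2502", "CH2522", "ESP088", "F1659", "MHH15083"]
fields = ["./.", "0/0:1", "1/1:1", "0/0:1", "./.", "0/0:1"]
print(missing_samples(samples, fields))  # ['CH2500', 'F1659']
```

A missing genotype by itself does not prove the gene is absent, so this would at most flag candidates to cross-check against the gpa table.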
from seq2geno.