mgymrek / lobstr-code Goto Github PK
View Code? Open in Web Editor NEWlobSTR: a short tandem repeat profiler for next generation sequencing data
License: GNU General Public License v3.0
lobSTR: a short tandem repeat profiler for next generation sequencing data
License: GNU General Public License v3.0
The lobSTR tool is no longer actively maintained and unfortunately the documentation website has gone down. We recommend using HipSTR (https://github.com/tfwillems/hipstr) as an alternative for genotyping STRs from short reads.
Hi, I was trying to filter a vcf created with the 4.0 beta version, when the filtering script crashed with this error:
Traceback (most recent call last):
File "/projects/cees/bin/lobstr/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/lobSTR_filter_vcf.py", line 132, in
for record in reader:
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 539, in next
alt = self._map(self._parse_alt, row[4].split(','))
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 347, in _map
for x in iterable]
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 515, in _parse_alt
elif str[0] == '.' and len(str) > 1:
IndexError: string index out of range
The following entry is likely the culprit:
LG15 1591293 . AAATAAAAATAAAATAAAA ,AAATAAAA,AAATAAAAATAAAATAAAAAAAATAAAAT 669.516 . END=1591311;MOTIF=AAATA;NS=10;REF=3.8;RL=19;RU=AAATA;VT=STR;RPA=0,1.6,5.8 GT:ALLREADS:AML:DISTENDS:DP:GB:PL:Q:SB:STITCH 0/3:0|1;10|1:0.980102/0.961652:10:2:0/10:16,22,191,22,125,122,0,26,24,20:0.941778:2:0 0/3:0|2;10|3:0.999776/0.999994:28:5:0/10:52,67,478,67,347,341,0,52,49,37:0.999769:1.44444:0 0/3:0|1;10|1:0.980102/0.961652:51:2:0/10:16,22,191,22,125,122,0,26,24,20:0.941778:4.25:0 3/3:10|2:0.999969/0.999969:67:2:10/10:44,50,197,50,197,197,5,6,6,0:0.499146:2:0 0/3:0|4;10|2:1/0.997848:41.6667:6:0/10:26,44,574,44,311,299,0,105,99,87:0.997848:26:0 2/3:-11|4;-1|1;10|2:0.999984/0.999784:8:7:-11/10:165,149,354,44,191,176,140,112,0,385:0.999784:26:0 1/0:-19|1;0|7;10|1:0.998671/0.999881:-3.875:9:-19/0:71,0,740,29,283,288,73,161,180,233:0.998671:4.25:0 0/0:0|1:0.993932/0.993932:-28:1:0/0:0,3,98,3,33,30,3,29,27,26:0.330906:5:0 3/3:10|4:1/1:1.5:4:10/10:89,101,395,101,395,395,11,12,12,0:0.798921:5:0 0/3:0|2;10|1:0.999907/0.940512:6.66667:3:0/10:13,22,287,22,155,149,0,52,49,43:0.940419:9.25:0
The ALT field starts with comma, which I don't think is according to specs.
allelotype was run on a set of BWA aligned bams, like this:
allelotype --command classify --bam bam1,bam2,bam3,bam4,bam5,bam6,bam7,bam8,bam9,bam10
--strinfo genome_strinfo.tab --noise_model /projects/cees/bin/lobstr/lobSTR-bin-Linux-x86_64-3.0.2/share/lobSTR/models/illumina_v2.0.3
--index-prefix genome_index/lobSTR_ --out genome_ncc
--filter-mapq0 --realign --max-repeats-in-ends 3 --min-read-end-match 10 >allelotype_ncc.out 2> allelotype_ncc.err
What's the correct way to use autoconf during install? It wasn't explicit in the INSTALL file. Naively, I tried autoconf configure.ac
and got an error.
$as_echo "" >&6; }
{ $as_echo "$as_me:$LINENO: result: To change installtion path, use:" >&5
$as_echo " To change installtion path, use:" >&6; }
{ $as_echo "$as_me:$LINENO: result: ./configure --prefix NEW-PATH" >&5
$as_echo " ./configure --prefix NEW-PATH" >&6; }
{ $as_echo "$as_me:$LINENO: result: " >&5
$as_echo "" >&6; }
configure.ac:19: error: possibly undefined macro: AM_INIT_AUTOMAKE
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
configure.ac:34: error: possibly undefined macro: AC_MSG_ERROR
ERROR: command 'intersectBed -a /tmp/tmpeUwvrC/lobSTR_ref_merged.bed -b /tmp/tmpeUwvrC/lobSTR_ref_plusflank_sorted.bed -wa -wb | awk '{print $1 "\t" $2 "\t" $3 "\t" $7""$8""$9";"}' | bedtools groupby -g 1,2,3 -c 4 -o concat > index_dir/lobSTR_mergedref.bed' failed.
STDOUT:
***** ERROR: Requested column 4, but database file - only has fields 1 - 0.
why always show up this problem when I build STR index, can you help me? thank you
We should build an hg38 reference and post to the website, and add scripts for the TRF step to the repo.
Take strand information from bam file alignments into account when deciding which direction to trim from.
We don't consider alleles that would result in a negative number of repeats. We should also:
I am trying to build a lobSTR model for the non-PCR-free nano model. I have a BAM file generated by BWA-MEM. The program crashes as follows.
_______
\\ // / -^--\ |
|||| / /\_____/ /
{\ ______{ } / lobSTR: profiling short tandem repeats
{_}{\{\{\{| \=@____/ from high-throughput sequencing data
<{_{-{-{-{-| ====---- >>>
{ }{/{/{/{|______ _/=@_____
{/ { } \ Copyright (C) 2011-2014 Melissa Gymrek
|||| \ \______ \ <[email protected]>
// \\ \ _^_\ |
\______/
[allelotype-4.0.0] 2016-07-01.16:07:48 ProgressMeter: Getting run info
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:7:2104:18568:6126
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:5:2216:25246:24972
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:8:2115:25053:57688
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:2:1204:2006:6513
[...]
[allelotype-4.0.0] 2016-07-01.17:44:57 WARNING: No extended reference sequence found for locus Y:28818965 read ST-E00127:299:HG2TYCCXX:2:1115:18254:52766
[allelotype-4.0.0] 2016-07-01.17:44:57 WARNING: No extended reference sequence found for locus Y:28818965 read ST-E00127:299:HG2TYCCXX:8:1212:6654:24866
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
Aborted (core dumped)
Here is my command line
lobSTR-bin-Linux-x86_64-4.0.0/bin/allelotype \
--command train \
--bam my.bam \
--haploid X,Y \
--strinfo lobSTR-bin-Linux-x86_64-4.0.0/GRCh37_v3.0.2/lobstr_v3.0.2_GRCh37_strinfo.tab \
--index-prefix lobSTR-bin-Linux-x86_64-4.0.0/GRCh37_v3.0.2/lobstr_v3.0.2_GRCh37_ref/lobSTR_ \
--noise_model illumina_v3
The help message printed by allelotype is no longer accurate. Alignment filters which were originally not applied have been changed so that they are applied by default, but the help message is confusing and seems to suggest otherwise. In addition, the default values printed in the help message are hard-coded instead of using the corresponding variables that are used in the source code.
I suggest that we update the help message appropriately and switch from using the hardcoded values to the values themselves
java -jar /data/melissa/sources/picard-tools-1.59/ValidateSamFile.jar INPUT=test.aligned.bam
gives the errors:
Mate reference index (MRNM) does not match reference index of mate
Mapped mate should have mate reference name
Mate alignment does not match alignment start of mate
and warnings:
WARNING: Read name LP6005441-DNA_A01.aligned.bam, Older BAM file -- does not have terminator block
We should make sure the bams pass validation.
Right now lobSTR re-annotates the library and sample information. But if these already exist and particularly if a BAM has multiple libraries in it, it could be useful for lobSTR to use the existing info rather than having to run a separate lobSTR job for each sample/lib.
When an alternate allele has a length of 0 and there's only 1 alternate allele, the alternate allele field is printed in the VCF as the empty string. While this is technically not incorrect, it causes awk to think there's one less field than every other line in the VCF file. I suspect this will break a lot of subsequent tools, and we may want to consider requiring that alleles have a length >= 1
lobSTR seems to mis-genotype reads (or at least some reads) where the STR (in this case, a poly-G region) has an N in the middle. Is there any way to fix/change this behavior?
Example:
The following read was put into lobSTR (STR bold):
TCTCGGCATCAACATCCAGAGTTTAGGGACCATGTCCCAGTCTCTGTGAGGTGGATGGGAAGTCAACATTAGTTGACTGAGCACCACCTGCGTGGAAGATGCAGCCCCCCCCNGCCCCATCACTGGGAATACAGTGCTGAGCAGGACAGCACCTGATGTGCGAGGGGGAAGACAGACAACAAATACATAAGCAATGGAATGTACCTTTGGCAGGCCGAT
The tags attached to the read by lobSTR afterwards are:
XS:i:46990694
XE:i:46990706
XR:Z:C
XD:i:-5
XC:f:13
XG:Z:CCCCCCC
XX:i:1
XM:i:-1
XQ:i:41
RG:Z:lobSTR;s66;spike_in
NM:i:7
I believe this to be incorrectly genotyped, as the tags should be:
XS:i:46990694
XE:i:46990706
XR:Z:C
XD:i:1
XC:f:13
XG:Z:CCCCCCCCNGCCCC
...
Is there an easy way to fix this behavior?
As a note, we can't use HipSTR for this application because we want a genotype per read, and HipSTR does not allow us to get the level of detail we need for our studies (effectively, pooled populations of cells with unknown population size).
Brendan
Hello, I am on of the contributor for homebrew and recently new version of g++5 and boost have been updated and there is error:
==> brew install --build-bottle --verbose homebrew/science/lobstr
==> Installing lobstr from homebrew/science
git config --replace-all homebrew.private false
==> Downloading https://github.com/mgymrek/lobstr-code/releases/download/v4.0.6/lobSTR-4.0.6.tar.gz
Already downloaded: /home/linuxbrew/.cache/Homebrew/lobstr-4.0.6.tar.gz
==> Verifying lobstr-4.0.6.tar.gz checksum
tar xzf /home/linuxbrew/.cache/Homebrew/lobstr-4.0.6.tar.gz
==> ./configure --prefix=/home/linuxbrew/.linuxbrew/Cellar/lobstr/4.0.6_3 --disable-dependency-tracking
.
==> make
.
Last 150 lines from /home/linuxbrew/.cache/Homebrew/Logs/lobstr/02.make:
2017-08-31 19:09:28 +0000
make
make all-recursive
make[1]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6'
Making all in m4
make[2]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/m4'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/m4'
Making all in src
make[2]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/src'
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-AlignmentUtils.o `test -f 'AlignmentUtils.cpp' || echo './'`AlignmentUtils.cpp
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-BamFileReader.o `test -f 'BamFileReader.cpp' || echo './'`BamFileReader.cpp
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-BamPairedFileReader.o `test -f 'BamPairedFileReader.cpp' || echo './'`BamPairedFileReader.cpp
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-BWAReadAligner.o `test -f 'BWAReadAligner.cpp' || echo './'`BWAReadAligner.cpp
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-common.o `test -f 'common.cpp' || echo './'`common.cpp
g++-5 -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\"" -c -o liblobstr_a-EntropyDetection.o `test -f 'EntropyDetection.cpp' || echo './'`EntropyDetection.cpp
common.cpp: In function ‘bool fexists(const char*)’:
common.cpp:195:10: error: cannot convert ‘std::ifstream {aka std::basic_ifstream<char>}’ to ‘bool’ in return
return ifile;
^
Makefile:1958: recipe for target 'liblobstr_a-common.o' failed
make[2]: *** [liblobstr_a-common.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/src'
Makefile:441: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6'
Makefile:373: recipe for target 'all' failed
make: *** [all] Error 2
The distribution is only scaled for positive step sizes:
https://github.com/mgymrek/lobstr-code/blob/master/src/NoiseModel.cpp#L376
Need to add another for loop from -3*maxperiod to 0. PDFs don't sum to 1.
I tried a fastq file with mixed upper/lower case and these failed to align until I tried a new fastq with all reads converted to upper case. We should probably just convert all reads to upper case right off the bat
[lsx@macaca lobSTR_data]$ python scripts/lobstr_index.py --str lobSTR_region --ref CR_genome_chr_edition.fa --out output_directory
Traceback (most recent call last):
File "scripts/lobstr_index.py", line 216, in
if not DEBUG: GetRefFasta(genome, refkeys, merged_str_file, str_ref_fasta, str_map_file)
File "scripts/lobstr_index.py", line 149, in GetRefFasta
chrom, start, end, annot = line.strip().split("\t")
ValueError: need more than 1 value to unpack
How to address this issue? Do anyone meet this issue? I will appreciate if you help me. thank you!
Pointed out by Isidro Cortes. When a @pg line is in the header, we report an error that no read groups are found "ERROR: No read groups found". This could be an issue in the underlying BamTools library.
When running lobSTR on multiple processors, the BAM output is truncated. If run on a small input file (<couple thousand reads), the BAM will be empty. If run on large files, the BAM has output but is missing an EOF marker. Seems the program is ending before SamFileWriter's destructor is called.
Make the aligned.stats file produce output, even in the case of no aligned reads, to know whether a job successfully completed.
Hello,
I am a one of the Homebrew contributor. Recently, cppunit has been upgraded to version 1.14 and now
brew install lobstr
doesn't work.
Could you fix that problem and make lobstr compatible with version 1.14?
Best regards,
Tatyana Mozgacheva
Hello,
When I use the following commend for Filtering lobSTR VCFs:
lobSTR_filter_vcf.py --vcf ABCD.vcf > ABCD_filter
it always shows:
......
import vcf
ImportError: No module named vcf
The ABCD.vcf was exactly generated by by lobSTR, and I am using lobSTR-4.0.6
So, what's the problem?
Thanks a lot !!!
It would be awesome if you could report Q scores in a log scale (e.g. Phred-style). I'm finding that for high coverage genomes, most of the Q scores are equal to 1, and so it makes it difficult to distinguish false positives.
The allelotype vcf seems sound but I get these errors when running the scripts on it.
filter:
Traceback (most recent call last):
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/filter_v2.py", line 165, in
if LogScore(sample["Q"]) < args.call_log_score:
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/filter_v2.py", line 32, in LogScore
return -1*math.log10(1-score+SMALLNUM)
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
vcf_to_tab:
Traceback (most recent call last):
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/lobSTR_vcf_to_tab.py", line 40, in
vals.extend(sample["GB"].split("/"))
AttributeError: 'NoneType' object has no attribute 'split'
Hello,
I have met with a problem when I install this software, that is "make" commond, and the process of make has failed.
the error as follow:
make all-recursive
make[1]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6'
Making all in m4
make[2]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/m4'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/m4'
Making all in src
make[2]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/src'
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-AlignmentUtils.o -MD -MP -MF .deps/liblobstr_a-AlignmentUtils.Tpo -c -o liblobstr_a-AlignmentUtils.o test -f 'AlignmentUtils.cpp' || echo './'
AlignmentUtils.cpp
mv -f .deps/liblobstr_a-AlignmentUtils.Tpo .deps/liblobstr_a-AlignmentUtils.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BamFileReader.o -MD -MP -MF .deps/liblobstr_a-BamFileReader.Tpo -c -o liblobstr_a-BamFileReader.o test -f 'BamFileReader.cpp' || echo './'
BamFileReader.cpp
mv -f .deps/liblobstr_a-BamFileReader.Tpo .deps/liblobstr_a-BamFileReader.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BamPairedFileReader.o -MD -MP -MF .deps/liblobstr_a-BamPairedFileReader.Tpo -c -o liblobstr_a-BamPairedFileReader.o test -f 'BamPairedFileReader.cpp' || echo './'
BamPairedFileReader.cpp
mv -f .deps/liblobstr_a-BamPairedFileReader.Tpo .deps/liblobstr_a-BamPairedFileReader.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BWAReadAligner.o -MD -MP -MF .deps/liblobstr_a-BWAReadAligner.Tpo -c -o liblobstr_a-BWAReadAligner.o test -f 'BWAReadAligner.cpp' || echo './'
BWAReadAligner.cpp
mv -f .deps/liblobstr_a-BWAReadAligner.Tpo .deps/liblobstr_a-BWAReadAligner.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-common.o -MD -MP -MF .deps/liblobstr_a-common.Tpo -c -o liblobstr_a-common.o test -f 'common.cpp' || echo './'
common.cpp
common.cpp: In function 'bool fexists(const char*)':
common.cpp:195:10: error: cannot convert 'std::ifstream {aka std::basic_ifstream}' to 'bool' in return
return ifile;
^~~~~
Makefile:1954: recipe for target 'liblobstr_a-common.o' failed
make[2]: *** [liblobstr_a-common.o] Error 1
make[2]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/src'
Makefile:441: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6'
Makefile:373: recipe for target 'all' failed
make: *** [all] Error 2
Any help would be greatly appreciated!!
We should make it clear that the new reference is much different than that included in the original lobSTR paper, and generally describe how we made it. We should add this to the "documentation" tab.
It looks like the resource bundles are using a version of BWA that creates both the forward and reverse indexes (.pac, .rpac
) whereas BWA has for a long time move to using on a single direction index. This likely means that when creating custom indexes, we have to use a very old version of bwa to work. Any idea on how to improve this?
Hello!
I think this is a great tool. I was wondering if you know if it works to reconstruct STRs from ancient DNA whole genome sequencing.
Thanks!
I followed the instructions on the website for Genotyping Y-STRs and CODIS markers in ordered to get STR values of HG01515 from the 1000 Genomes Project: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/data/HG01515/alignment
I used the HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam
file and ran the following commands:
lobSTR --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ -f HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam --bam --noweb --rg-sample HG01515 --rg-lib HG01515 --fft-window-size 24 --fft-window-step 12 --out HG01515
samtools sort HG01515.aligned.bam HG01515.sorted
samtools index HG01515.sorted.bam
allelotype --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab --command classify --noise_model /usr/local/share/share/lobSTR/models/illumina_v3.pcrfree --bam HG01515.sorted.bam --min-border 5 --min-bp-before-indel 7 --maximal-end-match 15 --noweb --min-read-end-match 5 --out HG01515
intersectBed -a HG01515.vcf -b lobSTR_ystr_hg19.bed -wa -wb | cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | cut -f 4 --complement | awk '{print $0 "\t" $6+$4/$5}
and I got the following results:
chrY 3131152 0|1 0 4 12 DYS393 12
chrY 3279702 4|2 4 4 9 DYS640 10
chrY 3679660 0|1 0 4 10 DYS572 10
chrY 6911569 0|1 0 4 11 DYS455 11
chrY 7053359 4|1 4 4 16 DYS576 17
chrY 8126300 0|1 0 5 8 DYS450 8
chrY 8224156 0|2 0 4 11 DYS454 11
chrY 8426378 -3|1 -3 3 22 DYS481 21
chrY 8466195 0|1 0 4 11 DYS531 11
chrY 8555980 0|2 0 5 8 DYS590 8
chrY 14466994 -4|1 -4 4 16 DYS437 15
chrY 16134296 0|1 0 4 10 DYS641 10
chrY 16508484 0|1 0 3 8 DYS472 8
chrY 17426012 -5|3 -5 5 11 DYS643 10
chrY 18393226 0|1 0 4 12 DYS533 12
chrY 18718879 -4|2 -4 4 15 GATA-A10 14
chrY 19134850 0|2 0 3 12 DYS426 12
chrY 19622111 -8|1 -8 2 23 YCAIIa/b 19
chrY 20203508 6|2 6 3 10 DYS425 12
chrY 20440393 3|4 3 3 15 DYS395S1a/b 16
chrY 20842518 0|1 0 4 14 DYS385a/b 14
chrY 21050690 -4|1 -4 4 12 DYS461 11
chrY 21050842 4|1 4 4 10 DYS460 11
chrY 21317047 0|1 0 4 11 DYS462 11
chrY 21520224 -4|1 -4 4 13 DYS549 12
chrY 21656837 0|1 0 5 10 DYS594 10
chrY 22092602 0|1 0 4 12 DYS445 12
chrY 22562564 0|2 0 4 9 DYS578 9
chrY 22633873 0|2 0 3 13 DYS392 13
chrY 22634857 0|1 0 4 12 DYS636 12
chrY 23843595 0|1 0 4 10 DYS406S1 10
chrY 24365178 0|1 0 6 8 DYS448.2 8
chrY 26078851 0|3 0 4 10 DYS459a/b 10
chrY 27087611 0|1 0 4 15 DYS464a/b/c/d 15
As you can see, multicopy markers have only one value. I ran the same commands (I have a wrapper script) on a FTDNA Big Y .bam file and got only one value on multicopy markers too. Is there a problem with the documentation, with lobSTR or maybe I am doing something wrong?
The groupby
command is buggy in bedtools v2.26.0
. See arq5x/bedtools2#435. The lobstr website should be updated to say use v2.25.0 or lower.
The script lobSTR_filter_vcf.py to filter variants fails on cases where the called genotype is "./.", and the subsequent values for "Q", "DISTENDS" are None. For example, the script failed at a location where the sample call was :
Call(sample=xxx, CallData(GT=./., ALLREADS=None, AML=None, DISTENDS=None, DP=None, DPA=None, GB=None, PL=None, PQ=None, Q=None, SB=None, STITCH=None))
A possible solution could be to add a check for ./. in the sample["GT"] values on the two lines that already check to see if sample["GT"] is empty/valid, unless that would have some unintended consequences.
When running allelotype on OSX, using a relative path for ---out
gives a memory error:
allelotype \
--command classify \
--index-prefix smallref/small_lobstr_ref/lobSTR_ \
--strinfo smallref/smallref_strinfo.tab \
--bam test.aligned.sorted.bam \
--out /tmp/lobtest \
--verbose \
--noise_model ../models/illumina_v2.0.3
runs fine, but:
allelotype \
--command classify \
--index-prefix smallref/small_lobstr_ref/lobSTR_ \
--strinfo smallref/smallref_strinfo.tab \
--bam test.aligned.sorted.bam \
--out lobtest \
--verbose \
--noise_model ../models/illumina_v2.0.3
gives the error:
allelotype(8968) malloc: *** error for object 0x10030d4b0: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug
This appears to be due to an error in GetTransitionProb in NoiseModel.cpp. It is unclear how this is related to the output prefix path. However we are moving to use a different library for logistic regression so this function will be rewritten soon to use standard C++ scientific libraries, which will hopefully resolve this issue.
Hi guys
we are now using lobstr for MSI status detection for human NGS data generated from illumina X10 platform. And we are concerned with the allelotype in certain number of mononucleotide STR makers we designed in our panel, such as : MONO-27, BAT-26 and so on .
I just checked the vcf output from one sample, for the marker MONO-27, it reported like below:
> chr2 39536690 . TTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 9.99964e+06 . END=39536716;MOTIF=T;NS=1;REF=27;RL=27;RU=T;VT=STR;RPA=29,31 GT:ALLREADS:AML:DISTENDS:DP:DPA:GB:PL:PQ:Q:SB:STITCH 1/2:-1|1;0|5;1|7;2|9;3|4;4|1;7|1:0.991178/0.996765:47.7:28:106:2/4:9640,9640,9640,76,0,9640:1.91875:0.987943:26:0
but for the other markers like BAT-26, it reported like this:
> chr2 47641560 . AAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAA 0 . END=47641586;MOTIF=A;NS=1;REF=27;RL=27;RU=A;VT=STR;RPA=-9973 GT:ALLREADS:AML:DISTENDS:DP:DPA:GB:PL:PQ:Q:SB:STITCH 1/1:-10|2;-9|1;-8|4;-7|1;-6|8;-5|10;-4|16;-3|13;-2|28;-1|20;0|9;1|1;6|1:-nan/-nan:0:114:245:-10000/-10000:-9990000,-10000000,-10000000:nan:-nan:2:0
So is there any tags which could provide us the information about whether this STR marker is stable or unstable? Because as I understand, if it gave a confirmed called status such as the upper report (RPA=29,31 vs ref is 27), then it should be a stable marker in this sample; otherwise it is an unstable marker like the lower report (RPA=-9973 vs ref=27). Am I correct? or some other tags could provide the information?
And by the way , what the meaning if RPA got a value below 0, such as RPA=-9973?
I believe there is a discrepancy between the ALLREADS format field and the GB format field in the VCF allelotyper output. The length differences reported by the two fields should match but it appears that they're off by the length of one repeat unit. For instance, given the format identifier ALLREADS:AML:DP:GB:Q:GT:STITCH, here are some VCF entries with a discrepancy
These discrepancies are ubiquitous and I believe affect every call. Is this indeed an error, or am I misunderstanding the way the field is specified?
I have found a bug in lobSTR, or more exactly BamTools that affects lobSTR. When it comes to optional integer fields, the GetTag()
function used in ReadContainer.cpp
will sometimes not find a correct number if it is a negative one. Instead it will find an arbitrary positive number. I have pointed out this in an issue in BamTools (pezmaster31/bamtools#90). There is a suggestion of how to write a function that works correctly. I suggest that you take a look at that and rewrite the function ReadContainer::AddReadsFromFile()
. This should circumvent the bug in BamTools.
The quality score is calculated as:
(likelihood of the maxlik genotype)/(sum of likelihoods of all genotypes)
at higher and higher coverages the raw likelihood score actually gets smaller, and SMALL_CONST=1e-10 is likely to be bigger than the actual likelihoods, so the denominator shrink at a smaller rate than the numerator and thus the Q scores will get lower.
Various analyses using the Illumina Platinum Genomes dataset has indicated that reads with homopolymer stutter errors quite often contain an 'N' preceding or following the stutter artifact. Therefore, it may be advantageous to ignore these reads when running the allelotyper as such changes in length are problematic for the stutter model and commonly result in genotyping errors. I therefore suggest adding an additional read filter that causes the allelotyper to ignore any reads containing 'N' characters.
Hi Melissa,
I'm running lobSTR on 1000 Genomes NA12878 fastq files and I encountered a formatting error. May you look into it?
The files are:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR000921_1.filt.fastq.gz
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR000921_2.filt.fastq.gz
command:
../programs/lobSTR-3.0.2/bin/lobSTR --p1 /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz --p2 /net/1000g/atks/data/NA12878/SRR000921_2.filt.fastq.gz --index-prefix ../programs/lobSTR-3.0.2/resource/lobstr_v3.0.2_hg19_ref/lobSTR_ -o na12878 --rg-lib abc --rg-sample NA12878
I tried it with the non gzipped versions and the same error was encountered.
error:
[lobSTR-3.0.2] 2014-11-21.09:29:08 ProgressMeter: Getting run info
[lobSTR-3.0.2] 2014-11-21.09:29:08 ProgressMeter: Initializing...
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Running detection/alignment...
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Processing files /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz and /net/1000g/atks/data/NA12878/SRR000921_2.filt.fastq.gz
[lobSTR-3.0.2] 2014-11-21.09:29:34 ERROR: Found Invalid FASTA ID in file /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz line 1 (expected '>' character)
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Outputting run statistics
I'm on Arch Linux x64 with automake v1.14.
On master, running "reconf" fails (specifically, the automake part), with this error:
automake: error: global options already processed
automake: Please contact <[email protected]>.
at /usr/share/automake-1.14/Automake/Channels.pm line 662, <GEN0> line 179.
Automake::Channels::msg('automake', '', 'global options already processed') called at /usr/share/automake-1.14/Automake/ChannelDefs.pm line 212
Automake::ChannelDefs::prog_error('global options already processed') called at /usr/share/automake-1.14/Automake/Options.pm line 421
Automake::Options::process_global_option_list('HASH(0x22d51f8)', 'HASH(0x22d52b8)', 'HASH(0x22d5318)') called at /usr/bin/automake line 5331
Automake::scan_autoconf_traces('configure.ac') called at /usr/bin/automake line 5431
Automake::scan_autoconf_files() called at /usr/bin/automake line 8259
However, if I merge the two different AM_INIT_AUTOMAKE stanzas in configure.ac into one, it works.
Specifically:
Use these to either discard reads or incorporate into an alignment score.
This is being worked on on the foreignbam
branch. The plan:
--foreignbam
flag.--foreignbam
flag, since all BAMs will be processed in this way.This warning should happen before trying anything else. See question on the mailing list
Dear Author,
I have generated VCF file for lobSTR and use vcftools to get STR frequency (see below):
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
chr14 16035037 3 130 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTT:0.938462 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTTT:0.0461538 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTTTT:0.0153846
chr14 16036677 2 2 TTTCTTTTCTTTTCTTTTCTTCTTTTCTCTCTTCTTTTTTCTTCTCTCTTCTTCCTTCCTTTCTTTCTTCTTTCTTTCTCTCACTCTCTTTCTTTCTTTCTCTCTTTCTTTCTTTCTCTTCTTTCTTTCTTTCCCTTTCTTTCTTTCTTTCTTTATTTTTTTCTTTCTTTC:0 TTTCTTTTCTTTTCTTTTCTTCTTTTCTCTCTTCTTTTTTCTTCTCTCTTCTTCCTTCCTTTCTTTCTTCTTTCTTTCTCTCACTCTCTTTCTTTCTTTCTCTCTTTCTTTCTTTCTCTTCTTTCTTTCTTTCCCTTTCTTTCTTTCTTTCTTTATTTTTTTCTTTC:1
chr14 16037656 6 212 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTC:0.34434 TTCTTTCTTTCTTCTTTCTTTCTTTC:0.45283 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTC:0.0471698 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTC:0.0660377 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCCTTT:0.0801887 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCCTTTCTTT:0.00943396
chr14 16037717 6 336 ATTGATTGATTGATTGATTGATTGATT:0.75 ATTGATTGATTGATTGATTGATT:0.0595238 ATTGATTGATTGATTGATTGATTGATTGAT:0.00297619 ATTGATTGATTGATTGATTGATTGATTGATT:0.125 ATTGATTGATTGATTGATTGATTGATTGATTGATT:0.0505952 ATTGATTGATTGATTGATTGATTGATTGATTGATTGATT:0.0119048
I see that I can use this number in your equation in a paper. But just wonder if I could use your "analyze_heterozygosity_new.py" in the str_catalog_supplemental_scripts directory to get heterozygosity per locus.
Thank you in advance.
Weerachai
Right now the gap penalty is essentially 0 since I have int GAPEXTEND = 0.1. We should incorporate the improved realignment.
With the updated reference structure, the --extend
paramter is ignored and hard coded in the bash commands generated. Fix this to read the user-specified value if it's provided
This is the version that i used 'allelotype-4.0.6'.The align software that i used is bwa-mem.So when i put in too many(3048 samples) of bam files ,it will report an error which is 'result/res.allelotype.stats': Too many open files
But when i put 334 bam samples into it ,Lobstr could run.
The sample that I used is rice
Several users have tried multi-sample calling where BAMs have the same read group IDs but a different SM in each file. It would be nice to be able to handle this without the need to change those read group IDs.
We should either:
@agordon do you have any experience with this? Should pthread just work on Mac?
PLs should always be positive, with the maximum likelihood genotyping having a PL of 0.
This problem probably has to do with the fact that we force lobSTR to only consider a restricted set of alleles at annotated loci.
hi, I meet with problem when I try to perform test, my command as follow:
lobSTR --index-prefix ./STR/hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_
--p1 tmp_1.fq --p2 tmp_2.fq
--q
-o test
--rg-sample sample mysample --rg-lib mylibrary
and the error as follow:
ERROR: Found Invalid FASTA ID in file tmp_1.fq line 1 (expected '>' character)
No reads aligned
could you give me some suggetions?
THANKS!
Hi,
I did perform a test using the hg19_v3.0.2/tmp_1.fq/tmp_2.fq files provided in lobstr-code and it did generate test.aligned.bam, test.aligned.stats, test.vcf, and test.allelotype.stats.
I have sequence-capture illumina data and I'm trying to genotype 30 microsatellite loci in a bat species using lobstr. I generated a .bed file using TRF for 30 ~1000 bp regions of that bat genome.
Here's how I created my reference:
python2 lobstr_index.py --str ./lobSTR/bat_lobstr.bed \ --ref ./lobSTR/bat_str.fasta --out ./lobSTR/lobref/ --extend 500 python2 GetSTRInfo.py ./lobSTR/bat_lobstr.bed \ /home/user/workspace/lobSTR/bat_str.fasta > ./lobref/strinfo.tab
Here are my commands for the alignment:
lobSTR --extend 500 \ --p1 2004-AR-40_R1.fastq \ --p2 2004-AR-40_R2.fastq \ --fastq --index-prefix ./lobref/lobSTR_ --out 2004-AR-40 \ --rg-sample 2004-AR-40 --rg-lib 2004-AR-40
samtools sort 2004-AR-40.aligned.bam -o 2004-AR-40.sorted.bam samtools index 2004-AR-40.sorted.bam
That did create the files 2004-AR-40.aligned.bam, 2004-AR-40.aligned.stats, 2004-AR-40.sorted.bam, 2004-AR-40.sorted.bam.bai.
Here are my commands for the allele call:
../tools/lobSTR-bin-Linux-x86_64-4.0.6/bin/allelotype \ --command classify --bam 2004-AR-40.sorted.bam \ --noise_model ../tools/lobSTR-bin-Linux-x86_64-4.0.6/share/lobSTR/models/illumina_v3.pcrfree \ --out 2004-AR-40.sorted.bam --strinfo ./lobref/strinfo.tab \ --index-prefix ./lobref/lobSTR_
I get the following error:
[allelotype-4.0.6] 2020-04-21.16:30:21 ProgressMeter: Getting run info
terminate called after throwing an instance of std::out_of_range
what(): basic_string::substr: __pos (which is 18446744073709551166) > this->size() (which is 1130)
Aborted (core dumped)
It seems to be angry with the first contig in my obSTR_ref.fasta file, because it is 1131 bp in size?
Any help would be greatly appreciated!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.