mgymrek / lobstr-code Goto Github PK

lobSTR: a short tandem repeat profiler for next generation sequencing data

License: GNU General Public License v3.0

Shell 2.95% Python 4.19% Ruby 0.11% C++ 67.42% C 21.20% CMake 0.06% Makefile 1.00% M4 3.08%

lobstr-code's Introduction

The lobSTR tool is no longer actively maintained and unfortunately the documentation website has gone down. We recommend using HipSTR (https://github.com/tfwillems/hipstr) as an alternative for genotyping STRs from short reads.

lobstr-code's People

Contributors

Stargazers

Watchers

Forkers

whitehead gilesc agordon drdaviddelorenzo roland-ewald yesimon honglongwu gymreklab bioinfotools hy714335634 gbrsales sjackman ichobits radygenomics engineerkhan dowing youpu-chen

lobstr-code's Issues

Starting with comma in alt field

Hi, I was trying to filter a vcf created with the 4.0 beta version, when the filtering script crashed with this error:
Traceback (most recent call last):
File "/projects/cees/bin/lobstr/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/lobSTR_filter_vcf.py", line 132, in
for record in reader:
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 539, in next
alt = self._map(self._parse_alt, row[4].split(','))
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 347, in _map
for x in iterable]
File "/cluster/software/VERSIONS/python_packages-2.7_5/cluster/software/VERSIONS/python2-2.7.10/lib/python2.7/site-packages/vcf/parser.py", line 515, in _parse_alt
elif str[0] == '.' and len(str) > 1:
IndexError: string index out of range

The following entry is likely the culprit:
LG15 1591293 . AAATAAAAATAAAATAAAA ,AAATAAAA,AAATAAAAATAAAATAAAAAAAATAAAAT 669.516 . END=1591311;MOTIF=AAATA;NS=10;REF=3.8;RL=19;RU=AAATA;VT=STR;RPA=0,1.6,5.8 GT:ALLREADS:AML:DISTENDS:DP:GB:PL:Q:SB:STITCH 0/3:0|1;10|1:0.980102/0.961652:10:2:0/10:16,22,191,22,125,122,0,26,24,20:0.941778:2:0 0/3:0|2;10|3:0.999776/0.999994:28:5:0/10:52,67,478,67,347,341,0,52,49,37:0.999769:1.44444:0 0/3:0|1;10|1:0.980102/0.961652:51:2:0/10:16,22,191,22,125,122,0,26,24,20:0.941778:4.25:0 3/3:10|2:0.999969/0.999969:67:2:10/10:44,50,197,50,197,197,5,6,6,0:0.499146:2:0 0/3:0|4;10|2:1/0.997848:41.6667:6:0/10:26,44,574,44,311,299,0,105,99,87:0.997848:26:0 2/3:-11|4;-1|1;10|2:0.999984/0.999784:8:7:-11/10:165,149,354,44,191,176,140,112,0,385:0.999784:26:0 1/0:-19|1;0|7;10|1:0.998671/0.999881:-3.875:9:-19/0:71,0,740,29,283,288,73,161,180,233:0.998671:4.25:0 0/0:0|1:0.993932/0.993932:-28:1:0/0:0,3,98,3,33,30,3,29,27,26:0.330906:5:0 3/3:10|4:1/1:1.5:4:10/10:89,101,395,101,395,395,11,12,12,0:0.798921:5:0 0/3:0|2;10|1:0.999907/0.940512:6.66667:3:0/10:13,22,287,22,155,149,0,52,49,43:0.940419:9.25:0

The ALT field starts with comma, which I don't think is according to specs.
allelotype was run on a set of BWA aligned bams, like this:
allelotype --command classify --bam bam1,bam2,bam3,bam4,bam5,bam6,bam7,bam8,bam9,bam10
--strinfo genome_strinfo.tab --noise_model /projects/cees/bin/lobstr/lobSTR-bin-Linux-x86_64-3.0.2/share/lobSTR/models/illumina_v2.0.3
--index-prefix genome_index/lobSTR_ --out genome_ncc
--filter-mapq0 --realign --max-repeats-in-ends 3 --min-read-end-match 10 >allelotype_ncc.out 2> allelotype_ncc.err

autoconf

What's the correct way to use autoconf during install? It wasn't explicit in the INSTALL file. Naively, I tried autoconf configure.ac and got an error.

$as_echo "" >&6; }
{ $as_echo "$as_me:$LINENO: result:  To change installtion path, use:" >&5
$as_echo " To change installtion path, use:" >&6; }
{ $as_echo "$as_me:$LINENO: result:     ./configure --prefix NEW-PATH" >&5
$as_echo "    ./configure --prefix NEW-PATH" >&6; }
{ $as_echo "$as_me:$LINENO: result: " >&5
$as_echo "" >&6; }
configure.ac:19: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:34: error: possibly undefined macro: AC_MSG_ERROR

lobstr_index.py can't build correctly index

ERROR: command 'intersectBed -a /tmp/tmpeUwvrC/lobSTR_ref_merged.bed -b /tmp/tmpeUwvrC/lobSTR_ref_plusflank_sorted.bed -wa -wb | awk '{print $1 "\t" $2 "\t" $3 "\t" $7""$8""$9";"}' | bedtools groupby -g 1,2,3 -c 4 -o concat > index_dir/lobSTR_mergedref.bed' failed.

STDOUT:

***** ERROR: Requested column 4, but database file - only has fields 1 - 0.

why always show up this problem when I build STR index, can you help me? thank you

Make an hg38 reference

We should build an hg38 reference and post to the website, and add scripts for the TRF step to the repo.

Direction of read trimming in bam files

Take strand information from bam file alignments into account when deciding which direction to trim from.

Don't use reads supporting invalid alleles to calculate genotype likelihoods

We don't consider alleles that would result in a negative number of repeats. We should also:

not consider reads coming from invalid alleles when doing genotype scoring
don't attempt to report genotypes for calls where all reads are thrown out as being from invalid alleles

Program crashing on --command train

I am trying to build a lobSTR model for the non-PCR-free nano model. I have a BAM file generated by BWA-MEM. The program crashes as follows.


                       _______
             \\ //  /     -^--\ |
             ||||  / /\_____/ /
 {\         ______{ }        /         lobSTR: profiling short tandem repeats
 {_}{\{\{\{|         \=@____/          from high-throughput sequencing data
<{_{-{-{-{-| ====---- >>>
 { }{/{/{/{|______  _/=@_____
 {/               { }        \         Copyright (C) 2011-2014 Melissa Gymrek
            ||||  \ \______  \         <[email protected]>
             // \\  \    _^_\  |
                     \______/

[allelotype-4.0.0] 2016-07-01.16:07:48 ProgressMeter: Getting run info
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:7:2104:18568:6126
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:5:2216:25246:24972
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:8:2115:25053:57688
[allelotype-4.0.0] 2016-07-01.17:39:12 WARNING: No extended reference sequence found for locus X:308108 read ST-E00127:299:HG2TYCCXX:2:1204:2006:6513
[...]
[allelotype-4.0.0] 2016-07-01.17:44:57 WARNING: No extended reference sequence found for locus Y:28818965 read ST-E00127:299:HG2TYCCXX:2:1115:18254:52766
[allelotype-4.0.0] 2016-07-01.17:44:57 WARNING: No extended reference sequence found for locus Y:28818965 read ST-E00127:299:HG2TYCCXX:8:1212:6654:24866
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
Aborted (core dumped)

Here is my command line

lobSTR-bin-Linux-x86_64-4.0.0/bin/allelotype \
    --command train \
    --bam my.bam \
    --haploid X,Y \
    --strinfo lobSTR-bin-Linux-x86_64-4.0.0/GRCh37_v3.0.2/lobstr_v3.0.2_GRCh37_strinfo.tab \
    --index-prefix lobSTR-bin-Linux-x86_64-4.0.0/GRCh37_v3.0.2/lobstr_v3.0.2_GRCh37_ref/lobSTR_ \
    --noise_model illumina_v3

Allelotype help message

The help message printed by allelotype is no longer accurate. Alignment filters which were originally not applied have been changed so that they are applied by default, but the help message is confusing and seems to suggest otherwise. In addition, the default values printed in the help message are hard-coded instead of using the corresponding variables that are used in the source code.

I suggest that we update the help message appropriately and switch from using the hardcoded values to the values themselves

lobSTR bam output doesn't pass Picard ValidateSamFile.jar

java -jar /data/melissa/sources/picard-tools-1.59/ValidateSamFile.jar INPUT=test.aligned.bam

gives the errors:
Mate reference index (MRNM) does not match reference index of mate
Mapped mate should have mate reference name
Mate alignment does not match alignment start of mate

and warnings:
WARNING: Read name LP6005441-DNA_A01.aligned.bam, Older BAM file -- does not have terminator block

We should make sure the bams pass validation.

Ability to use existing RG information

Right now lobSTR re-annotates the library and sample information. But if these already exist and particularly if a BAM has multiple libraries in it, it could be useful for lobSTR to use the existing info rather than having to run a separate lobSTR job for each sample/lib.

Alternate allele length of zero

When an alternate allele has a length of 0 and there's only 1 alternate allele, the alternate allele field is printed in the VCF as the empty string. While this is technically not incorrect, it causes awk to think there's one less field than every other line in the VCF file. I suspect this will break a lot of subsequent tools, and we may want to consider requiring that alleles have a length >= 1

lobSTR seems to fail when an N is located in the middle of the STR

lobSTR seems to mis-genotype reads (or at least some reads) where the STR (in this case, a poly-G region) has an N in the middle. Is there any way to fix/change this behavior?

Example:

The following read was put into lobSTR (STR bold):
TCTCGGCATCAACATCCAGAGTTTAGGGACCATGTCCCAGTCTCTGTGAGGTGGATGGGAAGTCAACATTAGTTGACTGAGCACCACCTGCGTGGAAGATGCAGCCCCCCCCNGCCCCATCACTGGGAATACAGTGCTGAGCAGGACAGCACCTGATGTGCGAGGGGGAAGACAGACAACAAATACATAAGCAATGGAATGTACCTTTGGCAGGCCGAT

The tags attached to the read by lobSTR afterwards are:

XS:i:46990694
XE:i:46990706
XR:Z:C
XD:i:-5
XC:f:13
XG:Z:CCCCCCC
XX:i:1
XM:i:-1
XQ:i:41
RG:Z:lobSTR;s66;spike_in
NM:i:7

I believe this to be incorrectly genotyped, as the tags should be:
XS:i:46990694
XE:i:46990706
XR:Z:C
XD:i:1
XC:f:13
XG:Z:CCCCCCCCNGCCCC
...

Is there an easy way to fix this behavior?

As a note, we can't use HipSTR for this application because we want a genotype per read, and HipSTR does not allow us to get the level of detail we need for our studies (effectively, pooled populations of cells with unknown population size).

Brendan

New g++5, boost. Lobsrt fails.

Hello, I am on of the contributor for homebrew and recently new version of g++5 and boost have been updated and there is error:


==> brew install --build-bottle --verbose homebrew/science/lobstr
==> Installing lobstr from homebrew/science
git config --replace-all homebrew.private false
==> Downloading https://github.com/mgymrek/lobstr-code/releases/download/v4.0.6/lobSTR-4.0.6.tar.gz
Already downloaded: /home/linuxbrew/.cache/Homebrew/lobstr-4.0.6.tar.gz
==> Verifying lobstr-4.0.6.tar.gz checksum
tar xzf /home/linuxbrew/.cache/Homebrew/lobstr-4.0.6.tar.gz
==> ./configure --prefix=/home/linuxbrew/.linuxbrew/Cellar/lobstr/4.0.6_3 --disable-dependency-tracking
.
==> make
.
Last 150 lines from /home/linuxbrew/.cache/Homebrew/Logs/lobstr/02.make:
2017-08-31 19:09:28 +0000

make

make  all-recursive
make[1]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6'
Making all in m4
make[2]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/m4'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/m4'
Making all in src
make[2]: Entering directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/src'
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-AlignmentUtils.o `test -f 'AlignmentUtils.cpp' || echo './'`AlignmentUtils.cpp
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-BamFileReader.o `test -f 'BamFileReader.cpp' || echo './'`BamFileReader.cpp
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-BamPairedFileReader.o `test -f 'BamPairedFileReader.cpp' || echo './'`BamPairedFileReader.cpp
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-BWAReadAligner.o `test -f 'BWAReadAligner.cpp' || echo './'`BWAReadAligner.cpp
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-common.o `test -f 'common.cpp' || echo './'`common.cpp
g++-5 -DHAVE_CONFIG_H -I. -I..   -I../src/ -I../ -I/home/linuxbrew/.linuxbrew/Cellar/gsl/2.4/include -I/usr/include   -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION="\"4.0.6\"" -D_MACHTYPE="\"x86_64\""  -c -o liblobstr_a-EntropyDetection.o `test -f 'EntropyDetection.cpp' || echo './'`EntropyDetection.cpp
common.cpp: In function ‘bool fexists(const char*)’:
common.cpp:195:10: error: cannot convert ‘std::ifstream {aka std::basic_ifstream<char>}’ to ‘bool’ in return
   return ifile;
          ^
Makefile:1958: recipe for target 'liblobstr_a-common.o' failed
make[2]: *** [liblobstr_a-common.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6/src'
Makefile:441: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/tmp/lobstr-20170831-4088-1ol7xyc/lobSTR-4.0.6'
Makefile:373: recipe for target 'all' failed
make: *** [all] Error 2

scale negative side of step size PDF

The distribution is only scaled for positive step sizes:

https://github.com/mgymrek/lobstr-code/blob/master/src/NoiseModel.cpp#L376

Need to add another for loop from -3*maxperiod to 0. PDFs don't sum to 1.

Convert reads to upper case before doing anything

I tried a fastq file with mixed upper/lower case and these failed to align until I tried a new fastq with all reads converted to upper case. We should probably just convert all reads to upper case right off the bat

Building index failure

[lsx@macaca lobSTR_data]$ python scripts/lobstr_index.py --str lobSTR_region --ref CR_genome_chr_edition.fa --out output_directory
Traceback (most recent call last):
File "scripts/lobstr_index.py", line 216, in
if not DEBUG: GetRefFasta(genome, refkeys, merged_str_file, str_ref_fasta, str_map_file)
File "scripts/lobstr_index.py", line 149, in GetRefFasta
chrom, start, end, annot = line.strip().split("\t")
ValueError: need more than 1 value to unpack
How to address this issue? Do anyone meet this issue? I will appreciate if you help me. thank you!

Fail to read read groups when PG is in header

Pointed out by Isidro Cortes. When a @pg line is in the header, we report an error that no read groups are found "ERROR: No read groups found". This could be an issue in the underlying BamTools library.

BAM output for lobSTR in multi-threaded mode is truncated

When running lobSTR on multiple processors, the BAM output is truncated. If run on a small input file (<couple thousand reads), the BAM will be empty. If run on large files, the BAM has output but is missing an EOF marker. Seems the program is ending before SamFileWriter's destructor is called.

stats output sometimes does not produce full output

Make the aligned.stats file produce output, even in the case of no aligned reads, to know whether a job successfully completed.

lobstr is not compatible with cppunit 1.14!

Hello,
I am a one of the Homebrew contributor. Recently, cppunit has been upgraded to version 1.14 and now
brew install lobstr doesn't work.
Could you fix that problem and make lobstr compatible with version 1.14?

Best regards,
Tatyana Mozgacheva

Filtering lobSTR VCFs

Hello,

When I use the following commend for Filtering lobSTR VCFs:

lobSTR_filter_vcf.py --vcf ABCD.vcf > ABCD_filter

it always shows:
......
import vcf
ImportError: No module named vcf

The ABCD.vcf was exactly generated by by lobSTR, and I am using lobSTR-4.0.6
So, what's the problem?

Thanks a lot !!!

Report Q scores in a log scale

It would be awesome if you could report Q scores in a log scale (e.g. Phred-style). I'm finding that for high coverage genomes, most of the Q scores are equal to 1, and so it makes it difficult to distinguish false positives.

NoneType Errors running filter and vcf_to_tab scripts

The allelotype vcf seems sound but I get these errors when running the scripts on it.

filter:
Traceback (most recent call last):
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/filter_v2.py", line 165, in
if LogScore(sample["Q"]) < args.call_log_score:
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/filter_v2.py", line 32, in LogScore
return -1*math.log10(1-score+SMALLNUM)
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

vcf_to_tab:
Traceback (most recent call last):
File "/home/mnaymik/TOOLS/lobSTR-bin-Linux-x86_64-4.0.0/share/lobSTR/scripts/lobSTR_vcf_to_tab.py", line 40, in
vals.extend(sample["GB"].split("/"))
AttributeError: 'NoneType' object has no attribute 'split'

the

Hello,
I have met with a problem when I install this software, that is "make" commond, and the process of make has failed.
the error as follow:
make all-recursive
make[1]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6'
Making all in m4
make[2]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/m4'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/m4'
Making all in src
make[2]: Entering directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/src'
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-AlignmentUtils.o -MD -MP -MF .deps/liblobstr_a-AlignmentUtils.Tpo -c -o liblobstr_a-AlignmentUtils.o test -f 'AlignmentUtils.cpp' || echo './'AlignmentUtils.cpp
mv -f .deps/liblobstr_a-AlignmentUtils.Tpo .deps/liblobstr_a-AlignmentUtils.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BamFileReader.o -MD -MP -MF .deps/liblobstr_a-BamFileReader.Tpo -c -o liblobstr_a-BamFileReader.o test -f 'BamFileReader.cpp' || echo './'BamFileReader.cpp
mv -f .deps/liblobstr_a-BamFileReader.Tpo .deps/liblobstr_a-BamFileReader.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BamPairedFileReader.o -MD -MP -MF .deps/liblobstr_a-BamPairedFileReader.Tpo -c -o liblobstr_a-BamPairedFileReader.o test -f 'BamPairedFileReader.cpp' || echo './'BamPairedFileReader.cpp
mv -f .deps/liblobstr_a-BamPairedFileReader.Tpo .deps/liblobstr_a-BamPairedFileReader.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-BWAReadAligner.o -MD -MP -MF .deps/liblobstr_a-BWAReadAligner.Tpo -c -o liblobstr_a-BWAReadAligner.o test -f 'BWAReadAligner.cpp' || echo './'BWAReadAligner.cpp
mv -f .deps/liblobstr_a-BWAReadAligner.Tpo .deps/liblobstr_a-BWAReadAligner.Po
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I../ -I/usr/local/include -I/usr/local/include -g -O2 -Wall -Wextra -Wswitch-default -Wno-strict-aliasing -pthread -D_GIT_VERSION=""4.0.6"" -D_MACHTYPE=""x86_64"" -I/usr/include -MT liblobstr_a-common.o -MD -MP -MF .deps/liblobstr_a-common.Tpo -c -o liblobstr_a-common.o test -f 'common.cpp' || echo './'common.cpp
common.cpp: In function 'bool fexists(const char*)':
common.cpp:195:10: error: cannot convert 'std::ifstream {aka std::basic_ifstream}' to 'bool' in return
return ifile;
^~~~~
Makefile:1954: recipe for target 'liblobstr_a-common.o' failed
make[2]: *** [liblobstr_a-common.o] Error 1
make[2]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6/src'
Makefile:441: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/XiaominYang/biosoftware/lobSTR-4.0.6'
Makefile:373: recipe for target 'all' failed
make: *** [all] Error 2

Any help would be greatly appreciated!!

Add documentation of the new reference set to the website

We should make it clear that the new reference is much different than that included in the original lobSTR paper, and generally describe how we made it. We should add this to the "documentation" tab.

Support modern versions of BWA

It looks like the resource bundles are using a version of BWA that creates both the forward and reverse indexes (.pac, .rpac) whereas BWA has for a long time move to using on a single direction index. This likely means that when creating custom indexes, we have to use a very old version of bwa to work. Any idea on how to improve this?

Question about application in ancient DNA

Hello!
I think this is a great tool. I was wondering if you know if it works to reconstruct STRs from ancient DNA whole genome sequencing.
Thanks!

Only one value for multicopy STR markers reported

I followed the instructions on the website for Genotyping Y-STRs and CODIS markers in ordered to get STR values of HG01515 from the 1000 Genomes Project: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/data/HG01515/alignment I used the HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam file and ran the following commands:

lobSTR --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ -f HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam --bam --noweb --rg-sample HG01515 --rg-lib HG01515 --fft-window-size 24 --fft-window-step 12 --out HG01515

samtools sort HG01515.aligned.bam HG01515.sorted

samtools index HG01515.sorted.bam

allelotype --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab --command classify --noise_model /usr/local/share/share/lobSTR/models/illumina_v3.pcrfree --bam HG01515.sorted.bam --min-border 5 --min-bp-before-indel 7 --maximal-end-match 15 --noweb --min-read-end-match 5 --out HG01515

intersectBed -a HG01515.vcf -b lobSTR_ystr_hg19.bed -wa -wb | cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | cut -f 4 --complement | awk '{print $0 "\t" $6+$4/$5}

and I got the following results:

chrY    3131152 0|1 0   4   12  DYS393  12
chrY    3279702 4|2 4   4   9   DYS640  10
chrY    3679660 0|1 0   4   10  DYS572  10
chrY    6911569 0|1 0   4   11  DYS455  11
chrY    7053359 4|1 4   4   16  DYS576  17
chrY    8126300 0|1 0   5   8   DYS450  8
chrY    8224156 0|2 0   4   11  DYS454  11
chrY    8426378 -3|1    -3  3   22  DYS481  21
chrY    8466195 0|1 0   4   11  DYS531  11
chrY    8555980 0|2 0   5   8   DYS590  8
chrY    14466994    -4|1    -4  4   16  DYS437  15
chrY    16134296    0|1 0   4   10  DYS641  10
chrY    16508484    0|1 0   3   8   DYS472  8
chrY    17426012    -5|3    -5  5   11  DYS643  10
chrY    18393226    0|1 0   4   12  DYS533  12
chrY    18718879    -4|2    -4  4   15  GATA-A10    14
chrY    19134850    0|2 0   3   12  DYS426  12
chrY    19622111    -8|1    -8  2   23  YCAIIa/b    19
chrY    20203508    6|2 6   3   10  DYS425  12
chrY    20440393    3|4 3   3   15  DYS395S1a/b 16
chrY    20842518    0|1 0   4   14  DYS385a/b   14
chrY    21050690    -4|1    -4  4   12  DYS461  11
chrY    21050842    4|1 4   4   10  DYS460  11
chrY    21317047    0|1 0   4   11  DYS462  11
chrY    21520224    -4|1    -4  4   13  DYS549  12
chrY    21656837    0|1 0   5   10  DYS594  10
chrY    22092602    0|1 0   4   12  DYS445  12
chrY    22562564    0|2 0   4   9   DYS578  9
chrY    22633873    0|2 0   3   13  DYS392  13
chrY    22634857    0|1 0   4   12  DYS636  12
chrY    23843595    0|1 0   4   10  DYS406S1    10
chrY    24365178    0|1 0   6   8   DYS448.2    8
chrY    26078851    0|3 0   4   10  DYS459a/b   10
chrY    27087611    0|1 0   4   15  DYS464a/b/c/d   15

As you can see, multicopy markers have only one value. I ran the same commands (I have a wrapper script) on a FTDNA Big Y .bam file and got only one value on multicopy markers too. Is there a problem with the documentation, with lobSTR or maybe I am doing something wrong?

Bedtools v2.26.0 does not work with lobstr

The groupby command is buggy in bedtools v2.26.0. See arq5x/bedtools2#435. The lobstr website should be updated to say use v2.25.0 or lower.

script lobSTR_filter_vcf.py failure

The script lobSTR_filter_vcf.py to filter variants fails on cases where the called genotype is "./.", and the subsequent values for "Q", "DISTENDS" are None. For example, the script failed at a location where the sample call was :

Call(sample=xxx, CallData(GT=./., ALLREADS=None, AML=None, DISTENDS=None, DP=None, DPA=None, GB=None, PL=None, PQ=None, Q=None, SB=None, STITCH=None))

A possible solution could be to add a check for ./. in the sample["GT"] values on the two lines that already check to see if sample["GT"] is empty/valid, unless that would have some unintended consequences.

Memory issue when running allelotype on OSX

When running allelotype on OSX, using a relative path for ---out gives a memory error:

allelotype \
  --command classify \
  --index-prefix smallref/small_lobstr_ref/lobSTR_ \
  --strinfo smallref/smallref_strinfo.tab \
  --bam test.aligned.sorted.bam \
  --out /tmp/lobtest \
  --verbose \
  --noise_model ../models/illumina_v2.0.3

runs fine, but:

allelotype \
  --command classify \
  --index-prefix smallref/small_lobstr_ref/lobSTR_ \
  --strinfo smallref/smallref_strinfo.tab \
  --bam test.aligned.sorted.bam \
  --out lobtest \
  --verbose \
  --noise_model ../models/illumina_v2.0.3

gives the error:
allelotype(8968) malloc: *** error for object 0x10030d4b0: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug

This appears to be due to an error in GetTransitionProb in NoiseModel.cpp. It is unclear how this is related to the output prefix path. However we are moving to use a different library for logistic regression so this function will be rewritten soon to use standard C++ scientific libraries, which will hopefully resolve this issue.

MSI/MSS evaluation by using lobstr

Hi guys
we are now using lobstr for MSI status detection for human NGS data generated from illumina X10 platform. And we are concerned with the allelotype in certain number of mononucleotide STR makers we designed in our panel, such as : MONO-27, BAT-26 and so on .
I just checked the vcf output from one sample, for the marker MONO-27, it reported like below:
> chr2 39536690 . TTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 9.99964e+06 . END=39536716;MOTIF=T;NS=1;REF=27;RL=27;RU=T;VT=STR;RPA=29,31 GT:ALLREADS:AML:DISTENDS:DP:DPA:GB:PL:PQ:Q:SB:STITCH 1/2:-1|1;0|5;1|7;2|9;3|4;4|1;7|1:0.991178/0.996765:47.7:28:106:2/4:9640,9640,9640,76,0,9640:1.91875:0.987943:26:0

    but for the other markers like BAT-26, it reported like this:
    >   chr2    47641560        .       AAAAAAAAAAAAAAAAAAAAAAAAAAA     AAAAAAAAAAAAAAAAAAAAAAAAAAA     0       .       END=47641586;MOTIF=A;NS=1;REF=27;RL=27;RU=A;VT=STR;RPA=-9973       GT:ALLREADS:AML:DISTENDS:DP:DPA:GB:PL:PQ:Q:SB:STITCH    1/1:-10|2;-9|1;-8|4;-7|1;-6|8;-5|10;-4|16;-3|13;-2|28;-1|20;0|9;1|1;6|1:-nan/-nan:0:114:245:-10000/-10000:-9990000,-10000000,-10000000:nan:-nan:2:0

    So is there any tags which could provide us the information about whether this STR marker is stable or unstable? Because as I understand, if it gave a confirmed called status such as the upper report (RPA=29,31 vs ref is 27), then it should be a stable marker in this sample; otherwise it is an unstable marker like the lower report (RPA=-9973 vs ref=27). Am I correct? or some other tags could provide the information?
   And by the way , what the meaning if RPA got a value below 0, such as RPA=-9973?

ALLREADS flag discrepancy

I believe there is a discrepancy between the ALLREADS format field and the GB format field in the VCF allelotyper output. The length differences reported by the two fields should match but it appears that they're off by the length of one repeat unit. For instance, given the format identifier ALLREADS:AML:DP:GB:Q:GT:STITCH, here are some VCF entries with a discrepancy

For a dinucleotide locus:
0|2:0.673853/0.673853:2:-2/-2:0.107933:10/10:0
6|2:0.673006/0.673006:2:4/4:0.107592:12/12:0
For a tetranucleotide locus:
-15|1:0.69445/0.69445:1:-19/-19:0.145835:1/1:0
-7|2:0.830356/0.830356:2:-11/-11:0.262465:3/3:0

These discrepancies are ubiquitous and I believe affect every call. Is this indeed an error, or am I misunderstanding the way the field is specified?

Bug in BamTools::BamReader::GetTag() affecting allelotype.

I have found a bug in lobSTR, or more exactly BamTools that affects lobSTR. When it comes to optional integer fields, the GetTag() function used in ReadContainer.cpp will sometimes not find a correct number if it is a negative one. Instead it will find an arbitrary positive number. I have pointed out this in an issue in BamTools (pezmaster31/bamtools#90). There is a suggestion of how to write a function that works correctly. I suggest that you take a look at that and rewrite the function ReadContainer::AddReadsFromFile(). This should circumvent the bug in BamTools.

Q score for heterozygous sites decreases with coverage

The quality score is calculated as:
(likelihood of the maxlik genotype)/(sum of likelihoods of all genotypes)

at higher and higher coverages the raw likelihood score actually gets smaller, and SMALL_CONST=1e-10 is likely to be bigger than the actual likelihoods, so the denominator shrink at a smaller rate than the numerator and thus the Q scores will get lower.

Additional allelotype filter

Various analyses using the Illumina Platinum Genomes dataset has indicated that reads with homopolymer stutter errors quite often contain an 'N' preceding or following the stutter artifact. Therefore, it may be advantageous to ignore these reads when running the allelotyper as such changes in length are problematic for the stutter model and commonly result in genotyping errors. I therefore suggest adding an additional read filter that causes the allelotyper to ignore any reads containing 'N' characters.

Invalida FASTQ ID

Hi Melissa,

I'm running lobSTR on 1000 Genomes NA12878 fastq files and I encountered a formatting error. May you look into it?

The files are:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR000921_1.filt.fastq.gz
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR000921_2.filt.fastq.gz

command:

../programs/lobSTR-3.0.2/bin/lobSTR --p1 /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz --p2 /net/1000g/atks/data/NA12878/SRR000921_2.filt.fastq.gz --index-prefix ../programs/lobSTR-3.0.2/resource/lobstr_v3.0.2_hg19_ref/lobSTR_ -o na12878 --rg-lib abc --rg-sample NA12878

I tried it with the non gzipped versions and the same error was encountered.

error:

[lobSTR-3.0.2] 2014-11-21.09:29:08 ProgressMeter: Getting run info
[lobSTR-3.0.2] 2014-11-21.09:29:08 ProgressMeter: Initializing...
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Running detection/alignment...
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Processing files /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz and /net/1000g/atks/data/NA12878/SRR000921_2.filt.fastq.gz
[lobSTR-3.0.2] 2014-11-21.09:29:34 ERROR: Found Invalid FASTA ID in file /net/1000g/atks/data/NA12878/SRR000921_1.filt.fastq.gz line 1 (expected '>' character)
[lobSTR-3.0.2] 2014-11-21.09:29:34 ProgressMeter: Outputting run statistics

Doesn't compile due to duplicated AM_INIT_AUTOMAKE stanzas in configure.ac

I'm on Arch Linux x64 with automake v1.14.

On master, running "reconf" fails (specifically, the automake part), with this error:

automake: error: global options already processed
automake: Please contact <[email protected]>.
 at /usr/share/automake-1.14/Automake/Channels.pm line 662, <GEN0> line 179.
        Automake::Channels::msg('automake', '', 'global options already processed') called at /usr/share/automake-1.14/Automake/ChannelDefs.pm line 212
        Automake::ChannelDefs::prog_error('global options already processed') called at /usr/share/automake-1.14/Automake/Options.pm line 421
        Automake::Options::process_global_option_list('HASH(0x22d51f8)', 'HASH(0x22d52b8)', 'HASH(0x22d5318)') called at /usr/bin/automake line 5331
        Automake::scan_autoconf_traces('configure.ac') called at /usr/bin/automake line 5431
        Automake::scan_autoconf_files() called at /usr/bin/automake line 8259

However, if I merge the two different AM_INIT_AUTOMAKE stanzas in configure.ac into one, it works.

add post-alignment checks

Specifically:

check if large gaps near the end of reads
check if after alignment the read still spans with enough flanking bp on each side

Use these to either discard reads or incorporate into an alignment score.

allelotype: Process BAM files created by other aligners

This is being worked on on the foreignbam branch. The plan:

Implement parsing reads from BAM files created by tools besides lobSTR with the --foreignbam flag.
Change to processing all BAMs (lobSTR and otherwise) with this method
Extend this to allow mapping the same read to multiple STRs, as happens often with longer reads anyway
Get rid of --foreignbam flag, since all BAMs will be processed in this way.

lobstr_index should give a clear warning if lobSTRIndex binary is not installed

This warning should happen before trying anything else. See question on the mailing list

Heterzygosity calculation?

Dear Author,

I have generated VCF file for lobSTR and use vcftools to get STR frequency (see below):

CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
chr14 16035037 3 130 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTT:0.938462 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTTT:0.0461538 TTTTCTTTTCTTTTCTTTTTTTTTTTTCTTTTCTTTCTTTTCTTTTCTTTTT:0.0153846
chr14 16036677 2 2 TTTCTTTTCTTTTCTTTTCTTCTTTTCTCTCTTCTTTTTTCTTCTCTCTTCTTCCTTCCTTTCTTTCTTCTTTCTTTCTCTCACTCTCTTTCTTTCTTTCTCTCTTTCTTTCTTTCTCTTCTTTCTTTCTTTCCCTTTCTTTCTTTCTTTCTTTATTTTTTTCTTTCTTTC:0 TTTCTTTTCTTTTCTTTTCTTCTTTTCTCTCTTCTTTTTTCTTCTCTCTTCTTCCTTCCTTTCTTTCTTCTTTCTTTCTCTCACTCTCTTTCTTTCTTTCTCTCTTTCTTTCTTTCTCTTCTTTCTTTCTTTCCCTTTCTTTCTTTCTTTCTTTATTTTTTTCTTTC:1
chr14 16037656 6 212 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTC:0.34434 TTCTTTCTTTCTTCTTTCTTTCTTTC:0.45283 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTC:0.0471698 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTC:0.0660377 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCCTTT:0.0801887 TTCTTTCTTTCTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCCTTTCTTT:0.00943396
chr14 16037717 6 336 ATTGATTGATTGATTGATTGATTGATT:0.75 ATTGATTGATTGATTGATTGATT:0.0595238 ATTGATTGATTGATTGATTGATTGATTGAT:0.00297619 ATTGATTGATTGATTGATTGATTGATTGATT:0.125 ATTGATTGATTGATTGATTGATTGATTGATTGATT:0.0505952 ATTGATTGATTGATTGATTGATTGATTGATTGATTGATT:0.0119048

I see that I can use this number in your equation in a paper. But just wonder if I could use your "analyze_heterozygosity_new.py" in the str_catalog_supplemental_scripts directory to get heterozygosity per locus.

Thank you in advance.

Weerachai

Improve NW scoring

Right now the gap penalty is essentially 0 since I have int GAPEXTEND = 0.1. We should incorporate the improved realignment.

Extend parameter is hard coded in lobstr_index.py

With the updated reference structure, the --extend paramter is ignored and hard coded in the bash commands generated. Fix this to read the user-specified value if it's provided

I have 3000 samples of bam files to put in.

This is the version that i used 'allelotype-4.0.6'.The align software that i used is bwa-mem.So when i put in too many(3048 samples) of bam files ,it will report an error which is 'result/res.allelotype.stats': Too many open files

But when i put 334 bam samples into it ,Lobstr could run.
The sample that I used is rice

Duplicated read group IDs cause different samples to be merged

Several users have tried multi-sample calling where BAMs have the same read group IDs but a different SM in each file. It would be nice to be able to handle this without the need to change those read group IDs.

Multi-threading does not work on OSX

We should either:

Make it work
Disallow it using a flag when we compile on Mac

@agordon do you have any experience with this? Should pthread just work on Mac?

Using --annotations gives negative PL scores

PLs should always be positive, with the maximum likelihood genotyping having a PL of 0.

This problem probably has to do with the fact that we force lobSTR to only consider a restricted set of alleles at annotated loci.

the proble of test

hi, I meet with problem when I try to perform test, my command as follow:
lobSTR --index-prefix ./STR/hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_
--p1 tmp_1.fq --p2 tmp_2.fq
--q
-o test
--rg-sample sample mysample --rg-lib mylibrary

and the error as follow:
ERROR: Found Invalid FASTA ID in file tmp_1.fq line 1 (expected '>' character)

No reads aligned
could you give me some suggetions?
THANKS!

terminate called after throwing an instance of 'std::out_of_range'

Hi,

I did perform a test using the hg19_v3.0.2/tmp_1.fq/tmp_2.fq files provided in lobstr-code and it did generate test.aligned.bam, test.aligned.stats, test.vcf, and test.allelotype.stats.

I have sequence-capture illumina data and I'm trying to genotype 30 microsatellite loci in a bat species using lobstr. I generated a .bed file using TRF for 30 ~1000 bp regions of that bat genome.

Here's how I created my reference:
python2 lobstr_index.py --str ./lobSTR/bat_lobstr.bed \ --ref ./lobSTR/bat_str.fasta --out ./lobSTR/lobref/ --extend 500 python2 GetSTRInfo.py ./lobSTR/bat_lobstr.bed \ /home/user/workspace/lobSTR/bat_str.fasta > ./lobref/strinfo.tab

Here are my commands for the alignment:
lobSTR --extend 500 \ --p1 2004-AR-40_R1.fastq \ --p2 2004-AR-40_R2.fastq \ --fastq --index-prefix ./lobref/lobSTR_ --out 2004-AR-40 \ --rg-sample 2004-AR-40 --rg-lib 2004-AR-40

samtools sort 2004-AR-40.aligned.bam -o 2004-AR-40.sorted.bam samtools index 2004-AR-40.sorted.bam

That did create the files 2004-AR-40.aligned.bam, 2004-AR-40.aligned.stats, 2004-AR-40.sorted.bam, 2004-AR-40.sorted.bam.bai.

Here are my commands for the allele call:
../tools/lobSTR-bin-Linux-x86_64-4.0.6/bin/allelotype \ --command classify --bam 2004-AR-40.sorted.bam \ --noise_model ../tools/lobSTR-bin-Linux-x86_64-4.0.6/share/lobSTR/models/illumina_v3.pcrfree \ --out 2004-AR-40.sorted.bam --strinfo ./lobref/strinfo.tab \ --index-prefix ./lobref/lobSTR_

I get the following error:
[allelotype-4.0.6] 2020-04-21.16:30:21 ProgressMeter: Getting run info
terminate called after throwing an instance of std::out_of_range
what(): basic_string::substr: __pos (which is 18446744073709551166) > this->size() (which is 1130)
Aborted (core dumped)

It seems to be angry with the first contig in my obSTR_ref.fasta file, because it is 1131 bp in size?

Any help would be greatly appreciated!