bcgsc / straglr Goto Github PK

Tandem repeat expansion detection or genotyping from long-read alignments

License: Other

Python 100.00%

bioinformatics genomics tandem-repeats long-reads

straglr's Introduction

Straglr - Short-tandem repeat genotyping using long reads

Straglr is a tool that can be used for genome-wide scans for tandem repeat(TR) expansions or targeted genotyping using long-read alignments.

Installation

Straglr is implemented in Python 3.8 and has been tested in Linux environment.

Straglr depends on Tandem Repeat Finder(TRF) for identifying TRs and blastn for iotif matching. (TRF and blastn executables must be in $PATH). Other Python dependencies are listed in requirements.txt.

The file environment.yaml can by used by conda to create an environment with all dependencies installed:

conda env create --name straglr --file=environment.yaml

Straglr can be added to the environment via pip,

pip install git+https://github.com/bcgsc/[email protected]#egg=straglr

(for example to install v1.3.0), or run directly from the cloned repository:

conda activate straglr
./straglr.py

Input

Long read alignments sorted by genomic coordindates in BAM format against the reference genome. Suggested aligner: Minimap2 -- Please use the option -Y to enable soft-clipping so that read sequences can be assessed directly from the BAM file.

Usage

python straglr.py <mm2.bam> <reference_fasta> <output_prefix> [--loci loci.bed] [--exclude skip_regions.bed] [--chroms chr] [--regions regions.bed] [--min_support N] [--min_ins_size N] [--min_str_len N] [--max_str_len N] [--nprocs N] [--genotype_in_size] [--max_num_clusters N] [--min_cluster_size N] [--working_dir] [--tmpdir] [--debug]

Some common parameters:

--loci: a BED file containing loci to be genotyped. 4 column BED format: chromosome start end repeat

--exclude: a BED file containing regions to be skipped in genome-scan (e.g. long segmental duplications or pericentromeric regions)

--chroms: space-separated list of specific chromosomes for genome-scan

--regions: a BED file containing regions to be used only in genome-scan

--include_alt_chroms: include ALT chromosomes (chromosomes with "_" in names) in genome scan (Default: NOT included)

--use_unpaired_clips: include examination of unpaired clipped alignments in genome scan to detect expansion beyond read size (Default:NOT used)

--min_support: minimum number of suppport reads for an expansion to be captured in genome-scan (Default:2)

--min_ins_size: minimum increase in size (relative to the reference genome) for an expansion to be captured in genome-scan (Default:100)

--min_str_len: minimum length of repeat-motif for an expansion to be captured in genome-scan (Default:2)

--max_str_len: maximum length of repeat-motif for an expansion to be captured in genome-scan (Default:50)

--nprocs: number of processes to use in Python's multiprocessing (Default:1)

--genotype_in_size: report genotype (column 5 of TSV output) in terms of allele sizes instead of copy numbers

--max_num_clusters: maximum number of clusters to be tried in Gausssian Mixture Model (GMM) clustering (Default:2)

--min_cluster_size: minimum number of reads required to constitute a cluster (allele) in GMM clustering (Default:2)

--trf_args: TRF arguments (Default:2 5 5 80 10 10 500)

--tmpdir: user-specified directory for holding temporary files

Example application: genome scan to detect TRs longer than the reference genome by 100bp:

The most common use of Straglr is for detecting TR expansions over the reference genome by a defined size threshold. This will save computations spent on genotyping the majority of TRs in the human genome with no substantial change in lengths. The identified expanded alleles can then be screened for pathogenicity by comparing against known TR polymorphisms. A sample Straglr run to detect expansions larger than the reference alleles by 100 bp on TR loci 2-100bp in motif length:

straglr.py sample.mm2.bam hg38.fa straglr_scan --min_str_len 2 --max_str_len 100 --min_ins_size 100 --genotype_in_size --exclude hg38.exclude.bed --min_support 2 --max_num_clusters 2 --nprocs 32

Highly repetitive genomic regions may be problematic for aligners and give rise to questionable genotyping results. They can be skipped over in Straglr's genome scan. To generate a bed file that contains all segmental duplications, centromeric and gap regions for exclusion from a Straglr run:

(cut -f1-3 hg38.segdups.bed;awk '$3-$2>=10000' hg38.simple_repeats.bed | cut -f1-3;cat hg38.centromeres.bed hg38_gaps.bed) | bedtools sort -i - | bedtools merge -i - -d 1000 > hg38.exclude.bed

Example application: genome-wide genotyping

Download UCSC Simple Repeats track in output format all fields from selected table using the online Table Browser tool

Convert downloaded table into bed format, skipping all homopolymers, with last field specifying motif sequence, e.g.:

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

Split whole-genome bed file into batches with smaller numbers of loci (e.g. 10,000), e.g.:
```
split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed batch
```

Run Straglr on Minimap2 alignments for each batch of TR loci in parallel on, for example, computing cluster, e.g.:

straglr.py sample.mm2.bam hg38.fa batch0001 --genotype_in_size --min_support 2 --loci batch0001.bed --max_str_len 100 --max_num_clusters 2 --nprocs 32

Output

<output_prefix>.tsv - detailed output one support read per line

chrom - chromosome name
start - start coordinate of locus
end - end coordinate of locus
target_repeat - consensus(shortest) repeat motif from genome scan or target motif in genotyping
locus - locus in UCSC format (chrom:start-end)
coverage - coverage depth of locus
genotype - copy numbers (default) or sizes (--genotype_in_size) of each allele detected for given locus, separate by semi-colon(";") if multiple alleles detected, with number of support reads in bracket following each allele copy number/size. An example of a heterozygyous allele in size: 990.8(10);30.9(10) (Alleles preceded by > indicate minimum values, as full alleles are not captured in any support reads)
actual_repeat - actual repeat motif detected in mapped read
read_name - mapped read name
copy_number - number of copies of repeat in allele
size - size of allele
read_start - start position of repeat in support read
strand - strand of reference genome from which read originates
allele - allele to which support read is assigned

read_status - classification of mapped read

"full": read captures entire repeat (counted as support read)
"partial": read does not capture entire repeat (counted as support read)
"skipped": "not_spanning" - read does not span across locus (NOT counted as support read)
"failed" - read not used for genotyping (NOT counted as support read). Reasons are indicated with following descriptors:

Descriptor	Explanation
cannot_extract_sequence	cannot extract repeat sequence, could be because the repeat is deleted for the read in question, or regions flanking motif are deleted
motif_size_out_of_range	motif size detected outside specified size range
insufficent_repeat_coverage	repeat detected does not cover enough (50%) of expansion/insertion sequence
partial_and_insufficient_span	repeat not covering enough (90%) query minus flanking sequences
unmatched_motif	no repeat found matching target motif

<output_prefix>.bed - summarized genotypes one locus per line
- chrom - chromosome name
- start - start coordinate of locus
- end - end coordinate of locus
- repeat_unit - repeat motifi
- allele<N>.size, where N={1,2,3...} depending on --max_num_clusters e.g. N={1,2} if --max_num_clusters==2 (default)
- allele<N>.copy_number
- allele<N>.support
<output_prefix>.vcf

Utilities

straglr_compare.py: compare Straglr's results
extract_repeats.py: extract repeat sequences from alignment BAM given Straglr's TSV output

Contact

Readman Chiu

Citation

Chiu R, Rajan-Babu IS, Friedman JM, Birol I. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol 22, 224 (2021). https://doi.org/10.1186/s13059-021-02447-3

straglr's People

Contributors

Stargazers

Watchers

Forkers

unique379r char-al jesson-mark philres lavafroth

straglr's Issues

Straglr not finding TR's that are definitely present in targeted sequencing data

Hello,

I'm testing Straglr out on some data from targeted sequencing sources. We can find the repeat regions with other repeat measuring programs and by examining the fasta/bam files. However, when straglr is run, no errors are shown but the output files are empty. We have tried by aligning to whole chromosome and gene-specific references, attempted on two separate sources and continue to get the same results.

Any help would be appreciated

Could I set the work directory for straglr?

Hi~
I ran straglr on our cluster but some tasks failed. I found that the default work directory of straglr is /tmp (many tmp files under this directory were found when running straglr), but the storage of /tmp of each node of our cluster is only about 50GB which leaded to the crash of straglr when the storage of /tmp was full. So, could I set the working directory?

Best wishes,
ttbond

Genotyping amplicons?

Hi,

Thanks for a great tool. I am playing around with genotyping amplicon data from Nanopore sequencing. I can get Straglr to call certain STRs but not others, and I wonder if I need to do something differently. I have tried changing motifs, positions of the sequence, etc, but I have yet to be successful. Are there any suggestions you can give, please?

I've attached an example file aligned to grch38, and I'm running straglr (latest version) thus:

straglr.py barcode32.new.sorted.bam genome.fa batch1 --genotype_in_size --min_support 1 --loci strtest.bed --max_str_len 1000 --max_num_clusters 2 --nprocs 8

And I get:

#chrom	start	end	repeat_unit	allele1:size	allele1:copy_number	allele1:support	allele2:size	allele2:copy_number	allele2:support
chr1	204156332	204156364	ACAG	31.0	7.8	8	-	-	-
chr11	2171086	2171116	TGAA	32.1	8.0	13	-	-	-
chr5	150076322	150076397	CTAT	68.4	17.1	139	-	-	-
chrX	134481492	134481561	TCTA	72.5	18.1	2	-	-	-
chrX	67545317	67545419	GCA	94.4	31.5	7	-	-	-

str.tar.gz
Uploading str.tar.gz…

Interpretation on <output_prefix>.tsv

Hi there,

I've received an output file, and according to the readme, the allele column should indicate the allele to which the support read is assigned, based on the genotype results. I'm puzzled as to why it's displaying NA instead of 34.5, and why it does not consider 349.2 as the second allele.

chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	40b2e343-cc45-4c7c-9714-bede62e39748	CCCGTGAGCCCCCCACCACTCCCTC,CCCGTGAGCCCCCCACCACTCCCTCCCCGTGAGCCCCCACACCCTC	34.0	850	15218	-	34.5	full
chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	9fef502d-c95e-47e6-b8e7-7e3f3de33850	GCCCCCACCACTCCCTCCCCGTGA	34.0	849	7101	+	34.5	full
chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	4a03adc1-1d6a-4efe-a38d-fb5fa7a12581	TGAGCCCCCACCACTCCCTCCCCG	34.0	849	744	+	34.5	full
chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	43170596-da45-4fd9-b4a0-a26c934cf4c3	CCTCCCCGTGAGCCCCCACCACTC	32.4	810	480	+	34.5	full
chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	8ed4d156-5898-46a9-b4a2-12c9be80944f	CCCCGTGAGCCCCCACCACTCCCT	32.2	804	2157	+	34.5	full
chr19	1049437	1050023	CCCCGTGAGCCCCCCACCACTCCCT	chr19:1049437-1050023	15	34.5(13)	37df087b-09e5-4b7c-bf5e-452310692056	CCCCGTGGGCCCCCCACCACTCCCT	349.2	8730	2280	+	NA	full

Any comment would be appreciated.

Best,
Hsin

Error processing plasmid reads

Hi,
I am working on counting repeats in ONT plasmid data that we have. Having recently read your paper, we decided to use this method but the yield and the result is not what we are expecting. This is a truth set, so we roughly know how much the count should be.

One of the two major problems we are facing is that it throws the following error consistently.

problem getting seq1 b9435da3-21e3-4de8-aee4-393700401474 ('pBluescript', 762, 821, 'CAG') 682 None 359
problem getting seq1 90bf2650-66ed-4e94-8fb3-5c777860bbc2 ('pBluescript', 762, 821, 'CAG') None 901 None
problem getting seq1 8d4f73e1-508c-4d70-93d9-c010b03a3524 ('pBluescript', 762, 821, 'CAG') 682 None 270

We thought this might be because of a problem related to the BAM file, but we checked that and regenerated that to make sure that the BAM file makes sense,

Further, we thought that maybe providing a FASTA file might aid in retrieving the read and getting a count. So, we provided a FASTA file using the following option.

--reads_fasta

But that led us to another error, this time a TypeError.

The key issue is that the first method mentioned, using the BAM file, yields only ~1200 reads that are processed whereas there are ~200,000 reads that are present.

We would appreciate some help in understanding how to tackle this issue.

Thank you!

P.S. 'pBluescript' is the name of the reference, it is a custom reference built for the purpose of this experiment.

Errors about running straglr to call tandem repeats

Hi, I'm trying to use Straglr to call tandem repeat. After the program finished, I've got the ins_merged.bed and tsv files. However, when I checked the log file, I found some error messages:

problem getting seq1 m64061_210713_112710/160105725/ccs ['chr19', 22927574, 22927624, 'AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTC,AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTC,CCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCCCGCCCGGCCAGCCG,CCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTCAGCCCCCCTGC,GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCC'] None 22927724 None
problem getting seq1 m64031_210322_010222/106758944/ccs ['chr19', 22927574, 22927624, 'AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTC,AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTC,CCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCCCGCCCGGCCAGCCG,CCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTCAGCCCCCCTGC,GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCC'] None 22927724 None

In another unfinished program, the error message is as follows:

trf input /tmp/tmpw7x9ozqo
can't generate temp file: /tmp/tmpz0dufyxc
gg 894074
ins all /tmp/tmpjl9mcqu3
gg 865674
trf input /tmp/tmpwcix8m7s
can't generate temp file: /tmp/tmp56_xzfhi
trf input /tmp/tmps9o2l3sh
can't generate temp file: /tmp/tmp08g0hvzk
trf input /tmp/tmp0h0uwas1
can't generate temp file: /tmp/tmpg0yd059k
trf input /tmp/tmpenawf5nt
can't generate temp file: /tmp/tmplzox77_x
trf input /tmp/tmp33jk6v2k
can't generate temp file: /tmp/tmppamp3_4z
trf input /tmp/tmp_n6qb_ks
can't generate temp file: /tmp/tmpkz_d23h3
trf input /tmp/tmpdor68mgx
can't generate temp file: /tmp/tmph1u5397w

I wonder if this will affect the accuracy and credibility of the resulting files. And I want to know how to solve these error messages.
Any suggestions? Thanks ever so much!

Problem with straglr_compare

Hi, I would like to analyze WGS long-reads data of a trio (proband, mother and father) with straglr, but I encountered some problems with straglr_compare.

First I run straglr on the 3 bam files (aligned with minimap2) like so:

python3 straglr.py sample.bam ref.fasta sample_ID \
--exclude hg38_centromeres_telomeres_UCSC.bed \
--min_support 1 --min_ins_size 10 \

and got no errors or warnings.

Then I run 3 instances of straglr_compare:

The first time I ran:
python3 ./straglr_compare.py proband.tsv mother.tsv proband_vs_mother.txt
and got the proper result

Then I did:
python3 ./straglr_compare.py proband.tsv father.tsv proband_vs_father.txt
and got this error:

Traceback (most recent call last):
  File "/work/gspirito/software/straglr-master/straglr_compare.py", line 371, in <module>
    main()
  File "/work/gspirito/software/straglr-master/straglr_compare.py", line 359, in main
    vs_controls.append(vs_each_control(test_bed, control_bed, args.pval_cutoff, min_expansion=args.min_expansion, min_support=args.min_support, label=control_result))
  File "/work/gspirito/software/straglr-master/straglr_compare.py", line 147, in vs_each_control
    control_allele = float(control_cols[j-1])
ValueError: could not convert string to float: '.'

Finally I ran:
python3 ./straglr_compare.py father.tsv mother.tsv father_vs_mother.txt
and got no errors but no output file either.

Am I doing something wrong?
Is there a way to perform a 'proband vs mother+father' comparison?

Thanks in advance for the response.

Giovanni

Different results from different versions of straglr

Hi, I'm now using straglr to analyze a tandem repeat. BeforeI have installed straglr version 1.1.1 and now I downloaded the newest source codes in the zip format. The newest version is 1.2.0.
I ran these two versions to analyze my data and found different results. My motif is CGG and reference copy number is 11.
The parameters I specified are --max_str_len 50 --min_str_len 2 and others are default.
Specifically, the old version of straglr found 7 reads whose motif copy number is as below:

946.7 843.0 400.3 395.0 18.7 16.7 16.7

And the motif number of clustered allele is 894.8(2);169.5(5). While these numbers are true accoding to my manual inspection, the cluster result is not ideal.

The newer version of straglr found 8 reads whose motif copy number is:

400.3 18.7 16.7 16.7 16.3 16.0 15.0 14.3

And the motif number of clustered allele is 16.2(7). You can see that reads that whose copy number are 946.7 and 843.0 are not reported in newer straglr and newer straglr found some new reads that older straglr didn't find.

I don't understand why there is such difference. Could you explain it?

Thanks!

Question about the inconsistent motif between loci.bed and output.bed

Hi there,

I was trying to use Straglr in genotype mode. I ran the command as follows
python straglr.py myseq.sorted.bam genome.fa myseq_straglr --loci loci.bed --nprocs 4

Here is my loci.bed:

chr9    27573484        27573546        CCCCGG

However, I got the following output.bed

#chrom  start   end     repeat_unit     allele1:size    allele1:copy_number     allele1:support allele2:size    allele2:copy_number     allele2:support
chr9    27573484        27573546        GCCCCG  103.80000000000001      17.3    106     -       -       -

I am wondering why repeat unit are different, and whether there is any solution to detect my target pattern (CCCCGG) on Straglr.

Any comments would be appreciated.

Best,
Hsin

VCF output for hetrozygotes seems to be have incorrect GT's

Hi @readmanchiu

Thanks again for implementing the VCF output!

Unfortunatly I run into an issue with the GTs for hetrozygotes:
I seem to get two VCF lines in these cases, of which one has a 0/0 GT and the other a 1/1 GT.

I would expect:

a single line with a 1/2 call if there are 2 variants
a single line with a 1/0 call if there is one variant STR and one reference STR
or alternatively
two lines with a 1/0 GT if there are 2 variants

STR count per strand

Hi,

I was testing out Straglr on some of the reads that we have available to us and there are some issues with respect to the strand output.

I noticed that the bed file has an overall allele size and copy number output for the entire locus.

But when I look at the tsv for the per read analysis, I do see the column labelled genotype which contains the size or copy number. But this number does not seem to align with the copy number in the copy number column for that particular read. Instead it matched the genotype information in the bed file.

Would it be possible to get the strand data from straglr and determine definitely which strand the copy number in the copy number column is associated with?

I have attached some sample output from straglr:
chr9 27573525 27573542 GCCCCG 3616.9(28);14.8(79) e156b402-4d03-4050-97dd-3c11faf43946 667.8 4007 869 3616.9 #with genotype_in_size flag set
chr9 27573525 27573542 GCCCCG 602.8(28);2.5(79) e156b402-4d03-4050-97dd-3c11faf43946 667.8 4007 869 602.8 #without genotype_in_size flag

Not all loci in output bedfile

Dear @readmanchiu,

I am using straglr via the vip pipeline (https://github.com/molgenis/vip/), as described in an earlier thread. Here, we use as an input a bed file with loci. However, I expected all loci to be in the output vcf. However, often loci are missing in the output. Which ones are in and out differs per sample.

We have not yet found the cause of the missing loci. Could it be that straglr filters loci based on quality? If so, is there an option to force all loci in the vcf and use the QUAL and filter columns to indicate the low quality, but keep the locus in the output vcf file?

Because we are using the @philres fork, it could be an issue related to that fork, but I believe this question is not related to the altered code. If you have any insights they would be very welcome.

About quality of detected VNTRs

Hi, @readmanchiu

I'm currently wondering how to assess the quality of tandem repeats detected by Straglr. Since Straglr can report number of support reads for each TR, I wonder if a threshold could be applied to filter TRs whose number of support reads are below this threshold. If so, could you give any suggestions about how to choose this threshold?

Thanks!

Jesson-mark

Discrepancy allele length and sequence

Dear @readmanchiu

I am currently using Straglr 1.5.0 and produced the following output for the FGF14 locus. If I am correct there is a discrepancy in the sequence shown in the vcf and the AL/ALR and AC/ACR. If my breakdown is correct then.
GT = 0/1 (meaning that there is one REF allele and one ALT allele)
DP = 187
AL=33.6/995.8
ALR=13-44/78-1216
AC=11.2/331.9
ACR=4.3-14.7/26.0-405.3
AD=106/72
No ALT_MOTIF (./.)

If I look at the sequences I see reference sequence 50 GAA repeat units and alternative sequence around 11 copies, matching the first allele. Combined with the 0/1 GT this would make a sample with two repeats, one of 11 RU and one of 50 RU. However, I was expecting a sequence of around 332 RU, fiven the AC values.
EDIT: could it be because the ref length (50) falls within the range 26.0-405.3? If so, then this is incorrectly so in my opinion. There are only a few reads supporting an intermediate length allelle with a length close to the reference length.

##contig=<ID=chr13,length=114364328>
##INFO=<ID=LOCUS,Number=1,Type=String,Description="Locus ID">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of repeat">
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Reference motif">
##INFO=<ID=COPIES,Number=1,Type=Float,Description="Reference copies">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=AL,Number=.,Type=String,Description="Allelic lengths">
##FORMAT=<ID=ALR,Number=.,Type=String,Description="Allelic length ranges">
##FORMAT=<ID=AC,Number=.,Type=String,Description="Allelic copies">
##FORMAT=<ID=ACR,Number=.,Type=String,Description="Allelic copy ranges">
##FORMAT=<ID=AD,Number=.,Type=String,Description="Allelic depths">
##FORMAT=<ID=ALT_MOTIF,Number=.,Type=String,Description="Alternate motif(s)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT .
chr13 102161576 . GAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA AAGAAAGAAGAAGAAGAAGAAGAAAGAAGAAGAA . PASS LOCUS=FGF14;END=102161726;MOTIF=GAA;COPIES=50.0 GT:DP:AL:ALR:AC:ACR:AD:ALT_MOTIF 0/1:187:33.6/995.8:13-44/78-1216:11.2/331.9:4.3-14.7/26.0-405.3:106/72:./.

A slice of the middle of the tsv file also shows the two 11.2 and 331.9 alleles:

chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 30da0ee0-9108-47c3-a4dd-77be7b1e0c35 GAA 328.3 985 8588 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) ba2e5868-6ae0-4136-bf1f-413a26c2ac31 GAA 326.3 979 8554 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 22143951-e646-46cf-8017-7c50e468c9ea GAA 317.7 953 8590 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 24631e93-98a1-4790-8578-f8b84416d19b GAA 301.3 904 8578 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 9828cb5e-226d-4448-a374-358c8f7ab486 GAA 297.7 893 8545 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 1551aae9-d9a3-4d0c-b9bf-681af0af2028 GAA 223.7 671 8588 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) c9886ee4-15ab-430e-8afe-230a6a94be26 GAA 166.3 499 8624 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 9de80113-5231-493f-bd50-9332f42050e3 GAA 102.3 307 8588 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 23150b37-212f-48f7-8718-bc2ade8c062c GAA 74.3 223 8602 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 5ba7e228-b2dc-4854-8057-cfeb2ff59c43 GAA 70.3 211 8583 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 27fd23f4-7a1d-4886-943d-5929ddd83094 GAA 53.7 161 8586 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 26cb932f-71e3-4e80-8c87-9aaa2e578082 GAA 39.0 117 8670 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 498288d8-f4e4-4b2c-8a1d-c21a60a3a3ef GAA 26.0 78 8605 + 331.9 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) c8aaed8f-7d96-472b-8092-b08f23a2ebfc GAA 14.7 44 8494 + 11.2 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 18a898df-0f87-4326-ab29-22b4335b265e GAA 14.3 43 8518 + 11.2 full
chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) ca73a40e-fce8-451b-9a0d-751c4b90e897 GAA 13.7 41 6430 - 11.2 full

And the bed results:

#chrom start end repeat_unit allele1:size allele1:copy_number allele1:support allele2:size allele2:copy_number allele2:support
chr13 102161576 102161726 AAG 995.6999999999999 331.9 72 33.599999999999994 11.2 106

These are the 26.0 copy read details (not sure if they help):

Read name = 498288d8-f4e4-4b2c-8a1d-c21a60a3a3ef
Read length = 10.909bp
Flags = 0
Mapping = Primary @ MAPQ 60
Reference span = chr13:102.152.965-102.163.897 (+) = 10.933bp
Cigar = 28S44M2D44M1D16M1D22M1D13M3D52M1I78M2D72M1D113M2D68M1I96M1D29M3D98M1D3M2I1M1D5M1I16M1I52M2D118M1D107M1D206M1I738M1D99M3D320M1I137M1I665M1D41M2D375M1D612M1D57M2I4M3D342M1I719M1D172M3D135M1I175M2I292M1D385M1D2M1D13M3D116M1D135M2I6M1I3M1D5M2D49M1D72M1D22M2D25M1I63M2D19M1D16M1I62M1D80M1I11M3I67M1D54M1D7M1D88M2D33M2I51M1I85M1D39M1I44M1D86M1D10M1D253M1D148M2I29M1I8M1I15M1I37M1I146M1D57M1D51M1D11M1D155M1D3M1D23M1D7M1D2M1I76M3I81M2D8M1D72M1D61M1D20M1D58M1D24M1I4M2D40M2D19M1I166M1D55M1D151M1D19M1D177M1D91M1I37M1I118M2I33M1I18M2D1M2D8M1D61M1D49M1D134M1I60M1I101M1D11M1I119M1D167M2D63M2S
Clipping = Left 28 soft; Right 2 soft
s1 = 6690
s2 = 96
NM = 244
AS = 20350
de = 0.0188
rl = 4406
cm = 1032
nn = 0
tp = P
ms = 20383
Hidden tags: MD, RG
Location = chr13:102.161.586
Base = G @ QV 29

Alignment start position = chr13:102152965
GATGTGCTTCGTTCAGTTACGTATTGCTCCGTGGCAGGCCCAGGTCCTTGCTAGCATCAGGCTGTGCACAGCAGGTGGAGCTGCTGGGAG....AACTGTGGCTGTCTTGTCACCACTGTATCCCTAGTGTCTGCTGGCTAAAGAGATGAGAAAACAGAAAATATTTCTTTCAGAGAGCAAATGACCGCAATCGTCAGTCAGTGTAAGCTTGCCTTTGCAAATGAAGGAAACTCTTATCTTAGTTGTAAAATATCAATATTCTCTATGCAACCAACCCTTGTGAAGGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGAATCGTTTGGGAGTTCATCGATGAGATGAGATCATCGTGGACGGGACCTGCTTAATTCATCATTGCGTTCCCAATATAGTGCGCCTAGAAGATAATTAAAAAAAAAAAAAAAAACCCTGGAGAATGAGTGAATGACATTGCTCACTATATGGACCTAAATTTAATCTCTTTATCAAGGAAAATTATAACTGTTTTCTTTTTTATCTAA...

In IGV it looks like this

The subsetted read is the same one as in the normal cram file, but extracted to better see the GAA/GGA pattern.

Can you help me on how to interpret these results?

Repeat Interruptions

Hi,

I was wondering if stragler is able to deal with repeat interruption.
For some diseases we can see interruptions of the repeats. Like for Huntington, where the normal repeat CAGCAGCAG... can be interrupted by CAA. Is there a way to pass this to stragler?

Bests
Stefan

Straglr could not find some TR that clearly existed

Hi, I'm using Straglr to detect TR based on my PacBio HiFi data. My parameters are shown below:

straglr.py M1.sort.filter.bam hg38_22_XYM.fa straglr_scan_min_ins50.tsv \
 --min_ins_size 50 \
 --genotype_in_size \
 --max_num_clusters 2 \
 --min_support 2 \
 --nprocs 23

However, I found that there were some TR that I could clearly see on igV that I couldn't find with Straglr, and I wondered if I was setting the parameters incorrectly.I set the insert length to be greater than 50bp and need two reads to support it, but I think the insert sequence meets both conditions.

As shown in the figure, there is a tandem repeat expansion exists in both C1 and M1 and is greater than 50bp supported by 2 or more reads. Oddly, I was able to find the this tandem repeat expansion with Straglr in C1, but not in M1. I wonder why? Is there any way to improve it? Do you have any suggestions?Thank you very much!

Problem with --trf_args

No matter what values I assign to --trf_args, I get this error: straglr.py: error: argument --trf_args: expected 7 arguments

For reference, here is the command I am running

sbj=PBMC_XXXX
bam=/mnt/seq/GRCh38/${sbj}/${sbj}.GRCh38.sorted.bam
fa_ref=/mnt/VNTR/VNTR_gene_strglr/hg38.fa
output=PBMC_XXXX_straglr
loci_bed=/mnt/vntr_region_motifs.e.bed

# Provide the TRF arguments, such as Match, Mismatch, Delta, PM, PI, Minscore, MaxPeriod
trf_args="2 7 7 80 10 50 500"

python /mnt/VNTR/VNTR_gene_strglr/straglr-master/straglr.py $bam $fa_ref $output --loci $loci_bed --debug --trf /home/ilaria99/TRF --trf_args $trf_args

IndexError: list index out of range

I am running STRaglr 1.5.0 (the current release) and get the following error on multiple CRAMs:

python straglr.py NA19676.hg38.cram ../1KG_ONT_VIENNA_hg38.fa NA19676.hg38_straglr_patho --loci ../20240328_straglr_catalog.bed 
Traceback (most recent call last):
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 101, in <module>
    main()
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 98, in main
    tre_finder.output_vcf(variants, '{}.vcf'.format(args.out_prefix))
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 1513, in output_vcf
    fails = Variant.find_fails(variants)
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/variant.py", line 244, in find_fails
    failed_reason = Counter(failed_reasons).most_common(1)[0][0]
IndexError: list index out of range

Any suggestions as to what might cause this?

Error: malformed BED entry

Hi,

I tried running straglr like this to detect TR expansions in ONT whole genome sequencing:

python straglr.py mmi.bam ref.fa straglr_scan --min_str_len 2 --max_str_len 100 --min_ins_size 100 --genotype_in_size --min_support 2 --max_num_clusters 2 --nprocs 10 --exclude hg38.exclude.bed

and I generated hg38.exclude.bed like this:

(cut -f1-3 hg38.segdups.bed;awk '$3-$2>=10000' hg38.simple_repeats.bed | cut -f1-3;cat hg38.centromeres.bed hg38_gaps.bed) | awk -v OFS='\t' '{print $1, $2, $3}' | bedtools sort -i - | bedtools merge -i - -d 1000 > hg38.exclude.bed

But while running straglr I get the following warnings:

***** WARNING: File /tmp/tmpmof4bq6q has inconsistent naming convention for record:
GL000008.2	0	209709

***** WARNING: File /tmp/tmpmof4bq6q has inconsistent naming convention for record:
GL000008.2	0	209709

^[[19~***** WARNING: File /tmp/tmpx0ihvtut has inconsistent naming convention for record:
KI270915.1	7002	7003	2d8640dc-1a9a-43d3-98df-be39a98ef816_851_196

***** WARNING: File /tmp/tmpx0ihvtut has inconsistent naming convention for record:
KI270915.1	7002	7003	2d8640dc-1a9a-43d3-98df-be39a98ef816_851_196

And then this error:

pybedtools.helpers.BEDToolsError: 
Command was:

	bedtools sort -i /tmp/pybedtools.fxeierw9.tmp

Error message was:
Error: malformed BED entry at line 787. Start Coordinate detected that is < 0. Exiting.

Should I remove all patchs alignments from my bam ?

I would love some insights on that.

Many thanks in advance

ziphra

ValueError: invalid coordinates

Dear @readmanchiu,
I am using latest straglr.py under python-3.10.2 and GNU/Linux x86_64. Here is the commend
python3 straglr.py ${aligned_bam}/SRR9951099_ONT.trimmed-ref.SOFT.bam ${ref_genome} SRR9951099_ONT --nprocs 16 --min_ins_size 50 --max_str_len 100
Then, it produced the following error. May I know how to fix it. Thank you!

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1135, in get_alleles
    rescued = self.rescue_missed_clipped(missed_clipped, genome_fasta)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 899, in rescue_missed_clipped
    pstart, pend, pseq = self.get_probe(clipped_end, locus, genome_fasta)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 864, in get_probe
    pseq = genome_fasta.fetch(locus[0], max(0, pstart), min(pend, genome_fasta.get_reference_length(locus[0])))
  File "pysam/libcfaidx.pyx", line 288, in pysam.libcfaidx.FastaFile.fetch
  File "pysam/libcutils.pyx", line 256, in pysam.libcutils.parse_region
ValueError: invalid coordinates: start (7740) > stop (3402)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pub/jenyuw/Software/straglr/straglr.py", line 93, in <module>
    main()
  File "/pub/jenyuw/Software/straglr/straglr.py", line 82, in main
    variants = tre_finder.examine_ins(ins, min_expansion=args.min_ins_size)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1254, in examine_ins
    variants = self.collect_alleles(merged_loci)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1288, in collect_alleles
    batched_results = parallel_process(self.get_alleles, batches, self.nprocs)
  File "/dfs6/pub/jenyuw/Software/straglr/src/utils.py", line 21, in parallel_process
    results = p.map(func, args)
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/pathos/multiprocessing.py", line 154, in map
    return _pool.map(star(f), zip(*args), **kwds)
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
ValueError: invalid coordinates: start (7740) > stop (3402)

Here is how I produced the alignment:

minimap2 -t ${nT} -a -x ${mapping_option[$read_type]} -Y \
${ref_genome} ${file} |\
samtools view -b -h -@ ${nT} -o - |\
samtools sort -@ ${nT} -o ${aligned_bam}/${name}.trimmed-ref.SOFT.bam
samtools index -@ ${nT} ${aligned_bam}/${name}.trimmed-ref.SOFT.bam

What does parameter `min_support` mean in genotyping mode of Straglr?

Hi,

Recently I have used Straglr to genotype a STR of my data. And I found a quite strange phenomenon.
With the default parameters, I got these copy numbers of TR of several reads:

946.7 843 400.3 395 18.7 16.7 16.7

and the copy numbers of two clusters(alleles) are 894.8(the first two reads) and 169.5(other reads). It seems reasonable.

But when I turned min_cluster_size to 3 or 4, the results are same as that of 2! Which means even if min_cluster_size is bigger than 2, the first two reads are still clustered together. That's strange.

After some simple inspections of TREFinder class, I found that min_cluster_size is affected by min_support, shown in the code below:

self.min_cluster_size = min_cluster_size if min_cluster_size < self.min_support else self.min_support

I don't know why min_cluster_size cannot be larger than min_support. Since it is useful in genome-scan mode, what does it mean in genotyping mode?

Thank you for taking the time. Looking forward to your reply !
Jesson-mark

Error about "ValueError: start out of range (-265)"

hi, @readmanchiu
Thank you for this useful tool. An error occurred when I called STR using nanopore sequencing data.
Could you please help me with this? Thank you so much!

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line -1, in
File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 869, in get_alleles
for aln in bam.fetch(locus[0], locus[1] - split_neighbour_size, locus[2] + split_neighbour_size):
File "pysam/libcalignmentfile.pyx", line 1091, in pysam.libcalignmentfile.AlignmentFile.fetch
File "pysam/libchtslib.pyx", line 690, in pysam.libchtslib.HTSFile.parse_region
ValueError: start out of range (-265)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/straglr.py", line 86, in
main()
File "/usr/local/bin/straglr.py", line 79, in main
variants = tre_finder.genotype(args.loci)
File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1130, in genotype
return self.collect_alleles(loci)
File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1104, in collect_alleles
batched_results = parallel_process(self.get_alleles, batches, self.nprocs)
File "/usr/local/lib/python3.10/site-packages/src/utils.py", line 21, in parallel_process
results = p.map(func, args)
File "/usr/local/lib/python3.10/site-packages/pathos/multiprocessing.py", line 139, in map
return _pool.map(star(f), zip(*args)) # chunksize
File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
ValueError: start out of range (-265)

IndexError: list index out of range

I am trying to run STRaglr 1.5.0 (current release) from a docker container, on ONT data with a catalog of TR regions for chr21 as the input bed file. I kept getting an error so I tried playing around with some parameters. However I kept getting the same error. The command I ran looked like this:

straglr.py map-sminimap2-HG002_hg38_chr21.bam ../chr21/chr21.fa output_straglr --loci HG002_repeats_straglr.bed --min_support 1 --min_ins_size 0 --min_cluster_size 0 --nprocs 5

However I get this error:

multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1252, in get_alleles
    self.update_refs(variants, genome_fasta)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1271, in update_refs
    refs = self.extract_refs_trf(trf_input)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 607, in extract_refs_trf
    data_motif = cols[3]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/straglr.py", line 101, in <module>
    main()
  File "/usr/local/bin/straglr.py", line 93, in main
    variants = tre_finder.genotype(args.loci)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1426, in genotype
    return self.collect_alleles(loci)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1398, in collect_alleles
    batched_results = parallel_process(self.get_alleles, batches, self.nprocs)
  File "/usr/local/lib/python3.10/site-packages/src/utils.py", line 21, in parallel_process
    results = p.map(func, args)
  File "/usr/local/lib/python3.10/site-packages/pathos/multiprocessing.py", line 154, in map
    return _pool.map(star(f), zip(*args), **kwds)
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
IndexError: list index out of range

I tried running a debug like

straglr.py map-sminimap2-HG002_hg38_chr21.bam  ../chr21/chr21.fa output_straglr  \
 --loci ../chr21/chr21.fa  \
 --min_support 1 \
 --min_ins_size 0 \
 --min_cluster_size 0 \
 --nprocs $CORES \
 --tmpdir $DEBUG_DIR_STRAGLR/tmp/ \
 --debug > $DEBUG_DIR_STRAGLR/debug.log 2>&1

Any suggestions on what could cause the error? I included all the debug files and the input region bed file.
chr21_regions.bed.gz
debug.log.gz
tmp.tar.gz

Loci with multiple motifs at same position

I have called a bunch of samples with straglr and merged the (chrom, start, end, motif) beds into one unified bed. I would like to use this merged bed to genotype each sample again. Sometimes different motifs occur at the same position, but straglr only calles one of them multiple Times. Example:

Here is a position of the merged bed file with multiple motifs at the same position.

$ cat results/str_candidates/merged.bed | grep 56828456
Y	56828456	56836900	CATTC
Y	56828456	56836900	TCCAT

I use this bed for genotyping.
What I see:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13

What I am expecting:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	CATTC	...	...	...	...	...	...
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13

Why is detecting and genotyping Short Tandem Repeats (STRs) challenging?

Hi,
Thank you for developing the excellent straglr tool. I've read your paper "Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences". I'm still confused by the following questions:

straglr requires specifying the parameter --loci: a BED file containing loci to be genotyped. Since we know the structure and location of motifs on the reference genome, what are the challenges in detecting motifs and their repeat counts in reads?
How can we evaluate the performance between various STR detection and genotyping tools? Is there a gold standard dataset available to evaluate the sensitivity and precision of different tools?

genome-scan: IndexError: list index out of range

Encountering an issue on data from the ONT 1KG Vienna cohort. I've pulled the CRAMs, dumped to fastq, realigned with minimap2, then run straglr in scan mode as follows:

python /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py ${cramFile}.cram ${params.reference} ${cramFile}_straglr_scan --min_str_len 2 --max_str_len 100 --min_ins_size 60 --genotype_in_size --min_support 2 --max_num_clusters 2 --nprocs ${task.cpus-1}

I get the following error:

  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  /vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/ins.py:189: UserWarning: multiple_iterators not implemented for CRAM
    for pileup_col in bam.pileup(region[0], int(region[1]), int(region[2])):
  multiprocess.pool.RemoteTraceback: 
  """
  Traceback (most recent call last):
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
      result = (True, func(*args, **kwds))
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
      return list(map(*args))
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
      func = lambda args: f(*args)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 153, in extract_tres
      expansions = self.analyze_trf(grouped_results, target_flank, unpaired_clips)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 164, in analyze_trf
      same_pats = self.find_similar_long_patterns_ins(results)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 194, in find_similar_long_patterns_ins
      same_pats = self.parse_pat_blastn(blastn_out)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 593, in parse_pat_blastn
      pid = float(cols[2])
  IndexError: list index out of range
  """
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 101, in <module>
      main()
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 89, in main
      variants = tre_finder.examine_ins(ins, min_expansion=args.min_ins_size)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 1351, in examine_ins
      batched_results = parallel_process(self.extract_tres, batches, self.nprocs)
    File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/utils.py", line 21, in parallel_process
      results = p.map(func, args)
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/pathos/multiprocessing.py", line 135, in map
      return _pool.map(star(f), zip(*args)) # chunksize
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
      return self._map_async(func, iterable, mapstar, chunksize).get()
    File "/vast/projects/fearnleyl_ukbiobank/miniconda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
      raise self._value
  IndexError: list index out of range

An issue about reproducibility.

**Hi, @readmanchiu

I am a new fan of the first-tier tool Straglr. I am trying to use this tool to genotype the genome-wide STR/VNTR for my ONT data.
I followed your guidance in the "Example application: genome-wide genotyping" part.
However, I found an issue about reproducibility.

Here is my steps:**

(1) I download UCSC Simple Repeats track as an tsv file;
(2) I deleted the prefix "chr" to be compatible with my genome reference;
(3) I also deleted the alternative contigs and kept those STR/VNTR loci in the 22+XY chromosome;
(4) I converted downloaded table into bed format using:

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

(5) I split whole-genome bed file into batches using:

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

Then, I run the Genotype mode (locus given):

work_dir=$(pwd)
map_ref=/genetics/home/zengyiheng/project/zyh-pipeline/reference/grch38/GRCh38_alignment.fa
straglr_sif=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/straglr_blastn_trf.sif
str_bed2=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/str_db_ucsc_hg38/nochr/simple_repeats_nochr_0000.bed
mkdir ${work_dir}/str
mkdir ${work_dir}/tmp

cd ${work_dir}/str
singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci ${str_bed2}
--min_str_len 3
--max_str_len 100
--min_ins_size 100
--genotype_in_size
--min_support 2
--max_num_clusters 2
--nprocs ${cpu_thread}
--tmpdir ${work_dir}/tmp

=============

The [simple_repeats_nochr_0000.bed] contains 10000 loci;
I run the above command four times respectively setting --nprocs ${cpu_thread} to be 2 / 4 / 48 / 48,
and found the results are not the same.

[the --nprocs i used] [the number of loci genotyped successfully recorded in the straglr_scan_19620.bed file ]
--nprocs 2 [9719 loci]
--nprocs 4 [9808 loci]
--nprocs 48 (1st run) [9878 loci]
--nprocs 48 (2nd run) [9880 loci]

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

It really confused me that the numbers of output loci were not 10000 and different from different runs.
Could you please be so kind to help me with this issue? Thank you so much. : )

IndexError: list index out of range

Hello,

I am running Straglr for multiple samples and one of the samples is unsuccessful with the below error:

Command

python straglr.py <sample.bam> <ref.fna>

Error:

Aug 04, 2023 10:00:57:IndexError: list index out of range
Aug 04, 2023 10:00:57:    allele1_repeat_count = round(cols[5])
Aug 04, 2023 10:00:57:  File "/root/miniconda3/envs/straglr/lib/python3.10/site-packages/straglr/tre.py", line 1259, in output_vcf
Aug 04, 2023 10:00:57:    tre_finder.output_vcf(variants, args.vcf, args.sample, args.loci)
Aug 04, 2023 10:00:57:  File "/root/miniconda3/envs/straglr/lib/python3.10/site-packages/straglr/straglr_genotype.py", line 78, in main
Aug 04, 2023 10:00:57:    sys.exit(main())
Aug 04, 2023 10:00:57:  File "/root/miniconda3/envs/straglr/bin/straglr-genotype", line 8, in <module>
Aug 04, 2023 10:00:57:Traceback (most recent call last):
Aug 04, 2023 10:00:50:IndexError: list index out of range
Aug 04, 2023 10:00:50:    allele1_repeat_count = round(cols[5])
Aug 04, 2023 10:00:50:  File "/root/miniconda3/envs/straglr/lib/python3.10/site-packages/straglr/tre.py", line 1259, in output_vcf
Aug 04, 2023 10:00:50:    tre_finder.output_vcf(variants, args.vcf, args.sample, args.loci)
Aug 04, 2023 10:00:44:  File "/root/miniconda3/envs/straglr/lib/python3.10/site-packages/straglr/straglr_genotype.py", line 78, in main
Aug 04, 2023 10:00:44:    sys.exit(main())
Aug 04, 2023 10:00:44:  File "/root/miniconda3/envs/straglr/bin/straglr-genotype", line 8, in <module>
Aug 04, 2023 10:00:44:Traceback (most recent call last):

Could you please let me know if the reason is no TR available? I see one similar issue reported on your Git page?

An issue about complex STRs

Hi, @readmanchiu , Sorry for bother again.
I am using this first-tier tool for STR counting for my neurogenetic patients.

Here is one issue I want to report:

There is one disease named "CANVAS (Cerebellar ataxia, neuropathy, and vestibular areflexia syndrome)", which is caused by an expansion of (AAGGG)n repeat in RFC1 gene. (https://omim.org/entry/102579?search=RFC1&highlight=rfc1)

The sticky situation lies in:

The reference sequence is (AAAAG)n for this loci (hg38, 4:39,348,424-39,348,483).
There is an (AAGGG)n expansion in my data (ONT), which is confirmed by visualizing manually by IGV. (attached below)
When I use the bed file below, Straglr outputs nothing.

【bed file】
4 39348424 39348483 AAAAG"

I am not willing to give up this tool for its outstanding performance. Can Straglr deal with this "complex STRs (changed motif situation)"? I will be pleased if Straglr can deal with this situation, which will make it SUPER perfect.
Thank you!

[attached file1: IGV visualization for my patient's ONT data]

[attached file2: 3510bp insertion in the first line]

AGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGCGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGCAATACAGAAGAAGAAGTAATACAGAAGGAAGGAAGGAAGGGAAGGGAAGGAAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGCAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAAGGAAGGAAGGGAAGGGAAGGGAAGGGAGGAAGGGAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGCGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG

Symbolic alleles in VCF output

Hi @readmanchiu

For VCF output of STR's I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.

Is there a reason Straglr is outputing the full sequence as ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead?
That would greatly help us with integrating the tool in our pipeline.

Straglr seems to use multiple threads than specified threads

Hi, @readmanchiu .

I have been using Straglr to detect VNTRs of my ONT data. After running multiple samples using 1 or 3 cpus(specified by --nproc parameter of Straglr), I found that our computing nodes was running busy, which isn't a normal status. When program is using more threads(or cpus) than specified, the computing node will be busy.

Although this busy status is only lasting a few minutes(about 20 minutes), when multiple programs are running on the same computing node, this busy status is extremely high, which may harm node.

After some tests, I found that after the temp files of TRF are generated, the computing node begin to running busy. But I couldn't figure out which step after that is using more threads. Could you give me some suggestions or directions on how to deal with this problem?

Looking forward to your reply.
Thanks.

Jesson-mark

IndexError: list index out of range

Hi there,

I executed the following commad

python ~/bin/straglr/straglr.py ~/stimulated/myseq.sorted.bam ~/Genome/hg38/genome.fa myseq --nprocs 4

I got the error message as follow

  Traceback (most recent call last):
    File "/home/bin/straglr/straglr.py", line 93, in <module>
      main()
    File "/home/bin/straglr/straglr.py", line 82, in main
      variants = tre_finder.examine_ins(ins, min_expansion=args.min_ins_size)
    File "/home/bin/straglr/src/tre.py", line 1254, in examine_ins
      variants = self.collect_alleles(merged_loci)
    File "/home/bin/straglr/src/tre.py", line 1289, in collect_alleles
      tre_variants = combine_batch_results(batched_results, type(batched_results[0]))
  IndexError: list index out of range

Any help would be much appreciated!

Many thanks,
Hsin

Missing output

Thank you for creating Straglr! Does Straglr produce output for repeats whose length is equal to the reference? I am asking this because the output for some repeats seems to be missing.

Conda release

Hi all,

any chance we will see straglr on conda soon?
Thanks,

Davide

Calling complex STR patterns

Dear @readmanchiu,

Some of the STR's we are interested in have a more complex pattern instead of a simple repeating sequence.
e.g. something like (TTTTA){5}TTA(TTCTA){5}
(5 TTTTA's followed by a single TTA and then 5 TTCTA's)

Is this something Straglr would be able to do? And if so what would be the correct way te specify a unit like this in the loci bed file?

And a related question:
is there documentation on how to specify the repeat pattern? I've been playing around with "*" and "+" signs, also in combination with brackets and curly brackets. Are constructions with these tokens supported?

Stable release of staglr

Hi @readmanchiu,

We're interested in getting the output as part of our MAVIS pipeline, while we could just clone the what's on the repository right now - is there a stable release we should be using after v1.1 that you submitted before publication?

Could Gender support and VCF output be added?

Dear @readmanchiu,

We are using Straglr in our pipeline (https://github.com/molgenis/vip/) to call STR's on nanopore data.
Since all the modules in our pipeline are build to have VCF input and VCF output we ended up using the fork from @philres.
We also know the sex of our samples and this fork seems to have added value in that regard.

However, we also would like to offer the genome scan function to our users (currently we only support the genotyping with a loci bed file), and I cannot get that to work with installing the python egg from this fork. (but using the original repo leaves me without VCF output)
Furthermore it seems to be some commits behind the main branch of this original repo.

So we are looking for the best of both worlds (forks), and my questions are:
Would you consider adding the VCF format as output in the future
Would you consider adding the sample sex as an input parameter and use it in the STR calling on the X chromosome?

TypeError: unsupported operand type(s) for /: 'str' and 'int'

I'm encountering the following error running straglr 1.2.0:

    File "/home/ont_user/venv/bin/straglr.py", line 80, in <module>
      main()
    File "/home/ont_user/venv/bin/straglr.py", line 76, in main
      tre_finder.output_bed(variants, '{}.bed'.format(args.out_prefix))
    File "/home/ont_user/venv/lib/python3.8/site-packages/src/tre.py", line 1180, in output_bed
      copy_numbers.append(round(allele / len(variant[4]) , 1))
  TypeError: unsupported operand type(s) for /: 'str' and 'int'

Is this a known behaviour?

Understanding output

Hi, I've recently tried using straglr in genotype-only mode to detect STR from nanopore data.

My data is minimap2 aligned
STR position I'm looking at is in hg38 (chr11:5555444-5555483, motif: GAAG), these coordinates were used as input to the required BED file. Note that to its left there is STR of AGAG (chr11:5555369-5555442) but was not included in the BED file as it was not what I'm looking for.
For the support reads in the tsv file, i extracted them from the bam file to visualize in the IGV

For this particular read in the output tsv file:
chr11 5555444 5555483 GAAG 27.6(10);9.0(412) ba5240c0-cf30-40a9-8811-b4e442a235ac 31.2 125 163 27.6

and this was the information when viewed in IGV:

Hence, i have the following questions:
a. As the reference (hg38) has 10 repeat units, tsv report that there are 31.2 repeat units (insertion of 125bp). May i check what might have caused it to call a 31.2 repeat units ? Could it have included the AGAG motif (reference has 18.8 copies) to its left? Does that mean that Straglr takes the whole read for calculations and not the nucleotides that fall within the coordinates in the BED file?
b. Under read_start in the tsv file, does it include the soft clipped reads as well?

Do let me know if any other information is needed. Thank you !

TRF options

Hello !

I have been using your tool for a few months now, and I've been wondering for some time if it would be possible to modify the parameters for running TRF ? From what I've seen in the source code, it's fixed with this string:
'2 5 5 80 10 10 500 -d -h' (it's the default value for the initialization and it's the same value in the "make_trf_sensitive" function)

Perhaps I am wrong, and this is intended for optimized results...

I thank you sincerely for your understanding.

bcgsc / straglr Goto Github PK

straglr's Introduction

Straglr - Short-tandem repeat genotyping using long reads

Installation

Input

Usage

Example application: genome scan to detect TRs longer than the reference genome by 100bp:

Example application: genome-wide genotyping

Output

Utilities

Contact

Citation

straglr's People

Contributors

Stargazers

Watchers

Forkers

straglr's Issues

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

Error:

Recommend Projects

Recommend Topics

Recommend Org