kaist-ina / bwa-meme
BWA-MEME: Faster BWA-MEM2 using learned-index
Home Page: https://ina.kaist.ac.kr/projects/bwa-meme/
License: MIT License
Hi there,
When running conda install -c conda-forge -c bioconda bwa-meme, conda (or mamba) can't find the package. The same happens with the search command. Is the recipe not available anymore?
I tried to build from source, but I'm on macOS and it is not straightforward. There are some posts related to bwa-mem2 that suggest some solutions, but I didn't manage to get it to work yet. So a conda-based solution would save some headaches.
With thanks!
I'm finding it takes ~25 minutes to load the various components of the indexes; without -7 it is only a couple of minutes.
The loading of the core reference files up to the following message runs at ~100% CPU:
* Reading reference genome..
* Binary seq file = /home/kr525/rds/hpc-work/data/ref/Homo_sapiens_assembly38.fasta.0123
* Reference genome size: 6434693834 bp
* Done reading reference genome !!
The following section runs at 5-15% CPU, indicating disk wait:
[M::memoryAllocLearned::LEARNED] Reading kmer SA index File to memory
[M::memoryAllocLearned::LEARNED] Reading ref2sa index File to memory
Is there anything obvious relating to the file reading that could account for this?
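If the slow phase is random-access bound, sequentially pre-reading the index files into the page cache before launching may help. A minimal sketch, assuming ordinary files on a local filesystem; prewarm is a hypothetical helper and the file names are placeholders:

```shell
# Sequentially read each file once so that later random access hits the OS
# page cache instead of demand-paging from disk in random order.
prewarm() {
  for f in "$@"; do
    cat -- "$f" > /dev/null && echo "warmed: $f"
  done
}
# e.g. prewarm ref.fasta.suffixarray_uint64 ref.fasta.ref2sa_packed
```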
I expect this is unrelated, but I did notice that ref.suffixarray_uint64 can be compressed with gzip -1 for a ~50% reduction in size. The decompression cost at compression level 1 is negligible compared to the disk latency (and will be more cost-effective for systems with IOPS accounting).
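The compress-then-stream idea can be sketched with synthetic data; sa.bin here stands in for ref.suffixarray_uint64:

```shell
# Compress with gzip -1 (fastest level, cheapest decompression), then
# decompress as a stream at load time instead of reading the raw file.
head -c 1000000 /dev/zero > sa.bin        # stand-in for the real index file
gzip -1 -c sa.bin > sa.bin.gz
echo "original: $(wc -c < sa.bin) bytes, compressed: $(wc -c < sa.bin.gz) bytes"
gzip -dc sa.bin.gz | wc -c                # stream-decompress; prints the original size
rm -f sa.bin sa.bin.gz
```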
Hi,
the version number when invoking bwa-meme version is still 1.0.5.
g++: error: unrecognized command line option ‘-msse’
g++: error: unrecognized command line option ‘-msse2’
g++: error: unrecognized command line option ‘-msse3’
g++: error: unrecognized command line option ‘-mssse3’
g++: error: unrecognized command line option ‘-msse4.1’
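These unrecognized -msse* options usually mean g++ is targeting a non-x86 architecture (e.g. Apple Silicon), where the x86 SIMD flags simply don't exist. A hypothetical helper illustrating the distinction:

```shell
# x86 SIMD flags (-msse*, -mavx*) are only understood by compilers targeting
# x86; on arm64/aarch64 hosts g++ rejects them with exactly this error.
simd_hint() {
  case "$1" in
    x86_64)        echo "x86 target: -msse*/-mavx* flags are valid" ;;
    arm64|aarch64) echo "ARM target: x86 SIMD flags are not recognized" ;;
    *)             echo "unknown arch: $1" ;;
  esac
}
simd_hint "$(uname -m)"
```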
Hi,
I'm trying to make the index for the human genome FASTA GRCh38.p14, and I've set up this command:
bwa-meme index -a meme -t 16 -p bwa-meme/bwa-meme_GRCh38.p14_genomic.fna fasta/GRCh38.p14_genomic.fna
But it still uses a single core for processing. Can you help me, please?
Hello! We faced a difficulty when we tried to install BWA-MEME on a Linux server (bwa-mem2 is already installed).
We followed the installation steps and got errors at make -j32 arch=avx512. It showed many warnings and two errors.
The two errors are:
src/LearnedIndex_seeding.cpp:468:5: error: ‘__builtin_expect_with_probability’ was not declared in this scope
src/LearnedIndex_seeding.cpp:565:5: error: ‘__builtin_expect_with_probability’ was not declared in this scope
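As far as I know, __builtin_expect_with_probability was introduced in GCC 9, so older compilers fail with exactly this error. A quick sketch for checking the compiler's major version; gcc_major is a hypothetical helper:

```shell
# __builtin_expect_with_probability requires GCC >= 9 (an assumption based on
# GCC release notes); extract the major version to check the installed compiler.
gcc_major() {
  # works on `gcc -dumpversion`-style strings, e.g. "8.4.0" -> 8, "12" -> 12
  echo "${1%%.*}"
}
gcc_major "$(gcc -dumpversion 2>/dev/null || echo 0)"
```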
How can we solve the problem? Thank you.
Is the very large value of -K 100000000 (100 million) used for a specific reason? Initially this was to prevent variability when specifying different numbers of threads. The parabricks comparison command indicates 10000000 (10 million). Would a 10 million value work without any detriment to run time?
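Since -K fixes the number of input bases consumed per batch (which is what makes the output independent of thread count), the batch count implied by a given value is simple arithmetic. A sketch with hypothetical totals:

```shell
# Batch count implied by a -K value; the read total below is an assumption
# (~30x human WGS worth of read bases), not a measured figure.
total_bp=90000000000
chunk_bp=100000000     # -K 100000000
batches=$(( (total_bp + chunk_bp - 1) / chunk_bp ))   # ceiling division
echo "$batches batches"
```

A smaller -K (e.g. 10 million) means ~10x more, smaller batches; whether that hurts run time depends on per-batch overhead.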
I built a Docker container, then downloaded the references provided and ran the lowest-memory-requirement mode. The hardware is a Mac M-series with 64 GB RAM; allowable RAM is 50 GB.
docker warning: WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
bwa-meme_mode1 version
Looking to launch executable "/opt/conda/bin/bwa-meme_mode1.sse42", simd = _mode1.sse42
Launching executable "/opt/conda/bin/bwa-meme_mode1.sse42"
Identical to BWA-MEM2 2.2
BWA-MEME v1.0.4
MEME mode 1: uses 38GB for index size in runtime
bwa-meme_mode1 mem -7 -Y -t 1 Homo_sapiens_assembly38.fasta 20A0012672-20A0012672_57977-WGS_R1_001.fastq.gz 20A0012672-20A0012672_57977-WGS_R2_001.fastq.gz -o 20A0012672_bwa-meme.sam
[0000] read_chunk: 10000000, work_chunk_size: 10000024, nseq: 68388
[0000][ M::kt_pipeline] read 68388 sequences (10000024 bp)...
[0000] Reallocating initial memory allocations!!
[0000] Calling mem_process_seqs.., task: 0
[0000] 1. Calling kt_for - worker_bwt
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault
real 0m50.682s
user 0m32.737s
sys 0m17.065s
Should I compile it for the Apple M-series? Please give me some instructions to do that.
I understand the reasons for not including everything in the main recipe, but it would be very useful to provide the tools for training.
I'm assuming that training is required to use human GRCh38 (with alts suitable for bwakit). This would make the tool far more attractive for groups with diverse species/build requirements.
BTW, I'm actually looking at this to do a bake-off against NVIDIA Parabricks, specifically as that implementation doesn't allow for use of bwa-postalt.js or any other modifications, due to the heavy cost/effort required to change it.
Hi there,
I used conda: conda create -n meme -c bioconda -c conda-forge bwa-meme=1.0.6
It doesn't throw any error. Then I activate my env and run bwa-meme; the output is ERROR: fail to find the right executable. I can't figure out why this happens.
With thanks!
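The wrapper dispatches to a SIMD-specific binary (the logs elsewhere show names like bwa-meme_mode1.sse42 and bwa-meme_mode3.avx512bw), and this error suggests no binary matching the CPU was found. A hypothetical sketch of that kind of dispatch, using /proc/cpuinfo-style flag names:

```shell
# Pick the best SIMD level present in a CPU flags string; pick_simd is a
# hypothetical illustration of the wrapper's selection logic, not its source.
pick_simd() {
  flags=" $1 "
  for level in avx512bw avx2 sse4_2; do
    case "$flags" in *" $level "*) echo "$level"; return ;; esac
  done
  echo "none"
}
pick_simd "fpu sse4_2 avx avx2"
```

If none of the expected levels is reported (e.g. inside certain VMs or emulators), a dispatcher like this finds no matching executable.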
Hey,
due to lower memory on a different server, I was running bwa-meme (v1.0.6) mode1.
Samtools sort throws
[E::sam_parse1] SEQ and QUAL are of different length
I thought this was fixed in 1.0.6 but maybe only for mode3?
Best
Christo
Can you advise about compiling a Windows binary for BWA-MEME?
What are the resource requirements for the build_rmis_dna.sh script?
Dear BWA-MEME team:
When using a bacterial genome as reference, I get the following error (the genomes are highly fragmented and assembled from a metagenome):
(base) [jzhao399@login-phoenix-3 Competitive_mapping]$ build_rmis_dna.sh ./all_mags_rename.new.fasta
Training top-level pwl model layer
Training second-level linear model layer (num models = 268435456)
[2nd layer]Computing lower bound stats...
[2nd layer]Fixing empty models...
Computing last level errors...
Average gap: 0.8713072761893272
Total Partial model num: 0, Leaf of partial model num: 0
Total last layer model num: 268435456
Partial start at idx:268435456
Model build time: 103312 ms
Average model error: 1.914389257033157 (0.0000020606146782561575%)
Average model L2 error: 78.10451166276728
Average model log2 error: 1.9188643583795388
Max model log2 error: 7.189824558880018
Max model error on model 79172320: 146 (0.0001571518132585239%)
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', rmi_lib/src/codegen.rs:399:28
note: run with RUST_BACKTRACE=1
environment variable to display a backtrace
Any idea? I installed everything via conda.
Thanks,
Jianshu
Hello,
When I was using bwa-meme with -7 on some data, it turned out the quality and sequence were not the same length; but when I use bwa-meme without -7, or bwa, the error is gone.
How could that be? Thank you!
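To pinpoint the offending records, one can scan the SAM output for rows where SEQ (field 10) and QUAL (field 11) differ in length, which is exactly what samtools rejects. A sketch; find_bad_records is a hypothetical helper and out.sam a placeholder name:

```shell
# Print line number and read name of SAM records whose SEQ and QUAL lengths
# differ (skipping headers and records with QUAL set to '*').
find_bad_records() {
  awk -F'\t' '!/^@/ && $11 != "*" && length($10) != length($11) {
    print "line " NR ": " $1
  }' "$1"
}
# e.g. find_bad_records out.sam
```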
[0000] read_chunk: 80000000, work_chunk_size: 80000037, nseq: 531012
[0000][ M::kt_pipeline] read 531012 sequences (80000037 bp)...
[0000] Reallocating initial memory allocations!!
[0000] Calling mem_process_seqs.., task: 0
[0000] 1. Calling kt_for - worker_bwt
[0000] read_chunk: 80000000, work_chunk_size: 80000068, nseq: 530970
[0000][ M::kt_pipeline] read 530970 sequences (80000068 bp)...
[0000] 2. Calling kt_for - worker_aln
[0000] Inferring insert size distribution of PE reads from data, l_pac: 460349660, n: 531012
[0000][PE] # candidate unique pairs for (FF, FR, RF, RR): (20, 156214, 41, 27)
[0000][PE] analyzing insert size distribution for orientation FF...
[0000][PE] (25, 50, 75) percentile: (221, 707, 3516)
[0000][PE] low and high boundaries for computing mean and std.dev: (1, 10106)
[0000][PE] mean and std.dev: (1828.05, 2315.16)
[0000][PE] low and high boundaries for proper pairs: (1, 13401)
[0000][PE] analyzing insert size distribution for orientation FR...
[0000][PE] (25, 50, 75) percentile: (248, 299, 358)
[0000][PE] low and high boundaries for computing mean and std.dev: (28, 578)
[0000][PE] mean and std.dev: (303.99, 82.31)
[0000][PE] low and high boundaries for proper pairs: (1, 688)
[0000][PE] analyzing insert size distribution for orientation RF...
[0000][PE] (25, 50, 75) percentile: (324, 1223, 4529)
[0000][PE] low and high boundaries for computing mean and std.dev: (1, 12939)
[0000][PE] mean and std.dev: (2298.71, 2392.36)
[0000][PE] low and high boundaries for proper pairs: (1, 17144)
[0000][PE] analyzing insert size distribution for orientation RR...
[0000][PE] (25, 50, 75) percentile: (476, 1843, 6054)
[0000][PE] low and high boundaries for computing mean and std.dev: (1, 17210)
[0000][PE] mean and std.dev: (3128.30, 3213.03)
[0000][PE] low and high boundaries for proper pairs: (1, 22788)
[0000][PE] skip orientation FF
[0000][PE] skip orientation RF
[0000][PE] skip orientation RR
[0000] 3. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 531012 reads in 260.608 CPU sec, 32.703 real sec
[0000] Calling mem_process_seqs.., task: 1
[0000] 1. Calling kt_for - worker_bwt
[0000] read_chunk: 80000000, work_chunk_size: 80000073, nseq: 530864
[0000][ M::kt_pipeline] read 530864 sequences (80000073 bp)...
[E::sam_parse1] SEQ and QUAL are of different length
samtools sort: truncated file. Aborting
Dear developer:
bwa: Version: 0.7.17-r1188
BWA-MEME: v1.0.5
bwa flagstat of the BAM:
338883556 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
448166 + 0 supplementary
0 + 0 duplicates
330144737 + 0 mapped (97.42% : N/A)
338435390 + 0 paired in sequencing
169217695 + 0 read1
169217695 + 0 read2
322879394 + 0 properly paired (95.40% : N/A)
329460362 + 0 with itself and mate mapped
236209 + 0 singletons (0.07% : N/A)
5641738 + 0 with mate mapped to a different chr
2394586 + 0 with mate mapped to a different chr (mapQ>=5)
BWA-MEME flagstat of the BAM:
338883548 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
448158 + 0 supplementary
0 + 0 duplicates
330144743 + 0 mapped (97.42% : N/A)
338435390 + 0 paired in sequencing
169217695 + 0 read1
169217695 + 0 read2
322879718 + 0 properly paired (95.40% : N/A)
329460388 + 0 with itself and mate mapped
236197 + 0 singletons (0.07% : N/A)
5641548 + 0 with mate mapped to a different chr
2394572 + 0 with mate mapped to a different chr (mapQ>=5)
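For comparisons like the two flagstat outputs above, a small script can diff the leading counts line by line; flagstat_diff and the file names are placeholders:

```shell
# Compare two `samtools flagstat` outputs: report every line whose leading
# count differs between the two runs.
flagstat_diff() {
  # paste joins corresponding lines with a tab; compare the first token of each
  paste "$1" "$2" | awk -F'\t' '{
    split($1, a, " "); split($2, b, " ")
    if (a[1] != b[1]) print "line " NR ": " a[1] " vs " b[1]
  }'
}
# e.g. flagstat_diff bwa.flagstat bwa-meme.flagstat
```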
Hi,
I'm trying to compile bwa-meme from source with the command:
sudo make -j 32
but I get this error and I don't know how to resolve it. Can you help me, please?
I'm working on an e2-standard-32 Google Cloud machine type.
Hello, when I use version 1.0.6 of bwa-meme with something like:
" bwa-meme mem -M -a -t 8 $2 $3 $4 |mbuffer -m 8G |samtools sort --reference $2 -o $5.cram -O CRAM -@ 8 -"
it exits without any error report. How could that be?
I have generated the indexes and model, but when I try to run the command I see the following error:
It mentions a Homo_sapiens_assembly38.fasta.bwt.2bit.64 file, but I am using the original reference file.
Please help.
Thanks
-Raju
Some questions/comments on the index command. Compared to -a mem2, will -a meme skip the BWT build? bwa-meme index references bwa-mem2 and doesn't list meme as an option for -a. (Sorry, hopefully I'm more helpful than irritating.)
I used the command outlined in the Building pipeline with Samtools https://github.com/kaist-ina/BWA-MEME#building-pipeline-with-samtools
Looking to launch executable "/home/husamia/./BWA-MEME/bwa-meme_mode3.avx512bw", simd = _mode3.avx512bw
Launching executable "/home/husamia/./BWA-MEME/bwa-meme_mode3.avx512bw"
-----------------------------
Executing in AVX512 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: /mnt/c/Research/Homo_sapiens_assembly38.fasta
* Entering FMI_search
Reading other elements of the index from files /mnt/c/Research/Homo_sapiens_assembly38.fasta
* Index prefix: /mnt/c/Research/Homo_sapiens_assembly38.fasta
* Read 0 ALT contigs
* Reading reference genome..
* Binary seq file = /mnt/c/Research/Homo_sapiens_assembly38.fasta.0123
* Reference genome size: 6434693834 bp
* Done reading reference genome !!
------------------------------------------
1. Memory pre-allocation for Chaining: 1419.8876 MB
2. Memory pre-allocation for BSW: 4792.3405 MB
[M::memoryAllocLearned::MEME] Reading Learned-index models into memory
[Learned-Config] MODE:3 SEARCH_METHOD: 1 MEM_TRADEOFF:1 EXPONENTIAL_SMEMSEARCH: 1 DEBUG_MODE:0 Num 2nd Models:268435456 PWL Bits Used:28
[M::memoryAllocLearned::MEME] Loading RMI model and Pac reference file took 66.232 sec
[M::memoryAllocLearned::MEME] Reading suffix array into memory
[M::memoryAllocLearned::MEME] Loading pos_packed file took 285.735 sec
[M::memoryAllocLearned::MEME] Generating SA, 64-bit Suffix and ISA in memory
[W::sam_hdr_create] Ignored @SQ SN:HLA-C*08:02:01:01 : bad or missing LN tag
[E::sam_hrecs_error] Malformed key:value pair at line 3253: "@SQ SN:HLA-C*08:02:01:01 "
[E::sam_hrecs_error] Malformed key:value pair at line 3253: "@SQ SN:HLA-C*08:02:01:01 "
samtools sort: failed to change sort order header to 'coordinate'
summary: 124 kiByte in 5min 40.9sec - average of 0.4 kiB/s
is the LICENSE file in the repo for bwa-meme? (It appears to be for bwa-mem2).
What is the exact license for bwa-meme? thanks.
Hi,
I had installed bwa-meme, but it failed at the bwa-mem2 mem step.
My index-building commands:
./bwa-mem2 index -a meme -t 32 human.fna ;
./build_rmis_dna.sh human.fna
All index file :
$ ls human.fna* -hal
lrwxrwxrwx 1 hudeneil hudeneil 45 Oct 4 10:37 human.fna
-rw-rw-r-- 1 hudeneil hudeneil 5.8G Oct 4 12:16 human.fna.0123
-rw-rw-r-- 1 hudeneil hudeneil 1.1K Oct 4 11:56 human.fna.amb
-rw-rw-r-- 1 hudeneil hudeneil 1.9K Oct 4 11:56 human.fna.ann
-rw-rw-r-- 1 hudeneil hudeneil 2.9G Oct 4 11:55 human.fna.bwt
-rw-rw-r-- 1 hudeneil hudeneil 742M Oct 4 11:56 human.fna.pac
-rw-rw-r-- 1 hudeneil hudeneil 29G Oct 4 15:05 human.fna.pos_packed
-rw-rw-r-- 1 hudeneil hudeneil 76G Oct 4 15:05 human.fna.possa_packed
-rw-rw-r-- 1 hudeneil hudeneil 29G Oct 4 15:05 human.fna.ref2sa_packed
-rw-rw-r-- 1 hudeneil hudeneil 1.5G Oct 4 12:13 human.fna.sa
-rw-rw-r-- 1 hudeneil hudeneil 47G Oct 4 15:05 human.fna.suffixarray_uint64
-rw-rw-r-- 1 hudeneil hudeneil 122 Oct 4 15:23 human.fna.suffixarray_uint64_data.h
-rw-rw-r-- 1 hudeneil hudeneil 8 Oct 4 15:23 human.fna.suffixarray_uint64_L0_PARAMETERS
-rw-rw-r-- 1 hudeneil hudeneil 1.2G Oct 4 15:24 human.fna.suffixarray_uint64_L1_PARAMETERS
-rw-rw-r-- 1 hudeneil hudeneil 6.0G Oct 4 15:24 human.fna.suffixarray_uint64_L2_PARAMETERS
drwxrwxr-x 5 hudeneil hudeneil 244 Oct 4 10:14 RMI
drwxrwxr-x 2 hudeneil hudeneil 258 Oct 4 15:24 rmi_data
Segmentation fault (core dumped)
There is the following message:
$ time ~/tools/BWA-MEME/bwa-mem2 mem -Y -K 100000000 -t 32 -7 ~/tools/BWA-MEME/human.fna S461.trim.fq -o S461.trim.fq.mem.sam
-----------------------------
Executing in AVX512 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: /home/hudeneil/tools/BWA-MEME/human.fna
* Entering FMI_search
Reading other elements of the index from files /home/hudeneil/tools/BWA-MEME/human.fna
* Index prefix: /home/hudeneil/tools/BWA-MEME/human.fna
* Read 0 ALT contigs
* Reading reference genome..
* Binary seq file = /home/hudeneil/tools/BWA-MEME/human.fna.0123
* Reference genome size: 6224193392 bp
* Done reading reference genome !!
------------------------------------------
1. Memory pre-allocation for Chaining: 1419.8876 MB
2. Memory pre-allocation for BSW: 7667.7448 MB
Segmentation fault (core dumped)
How to solve this problem? Thank you.
Is it possible to build all chipset binaries and have the tool auto-select the correct one for the system it is running on in the same way as the original bwa-mem2 works?
https://github.com/bwa-mem2/bwa-mem2#installation
Without this I'm not convinced a bioconda version will be as useful.
Hello,
After git-cloning the code, I ran make -j32
and got:
...
In file included from src/bwtindex.cpp:43:
src/Learnedindex.h:35:27: note: initializing argument 1 of ‘void buildSAandLEP(char*, int)’
35 | void buildSAandLEP( char* prefix, int num_threads);
| ~~~~~~^~~~~~
src/bwamem_pair.cpp: In function ‘int mem_matesw_batch_pre(const mem_opt_t*, const bntseq_t*, const uint8_t*, const mem_pestat_t*, const mem_alnreg_t*, int, const uint8_t*, mem_alnreg_v*, mem_cache*, int, int32_t, int32_t&, int32_t&, int32_t)’:
src/bwamem_pair.cpp:1158:33: warning: '0' flag ignored with precision and ‘%d’ gnu_printf format [-Wformat=]
1158 | fprintf(stderr, "[0000][%0.4d] Re-allocating (doubling) seqBufRefs in %s\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/bwamem_pair.cpp:1175:33: warning: '0' flag ignored with precision and ‘%d’ gnu_printf format [-Wformat=]
1175 | fprintf(stderr, "[0000][%0.4d] Re-allocating (doubling) seqBufQers in %s\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/bwamem_pair.cpp:1192:33: warning: '0' flag ignored with precision and ‘%d’ gnu_printf format [-Wformat=]
1192 | fprintf(stderr, "[0000][%0.4d] Re-allocating seqPairs in %s\n", tid, __func__);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: Leaving directory '/home/user/soft/BWA-MEME'
make: *** [Makefile:123: multi] Error 2
gcc version is 9.4.0, ubuntu1~20.04.1
What can be the problem?
Thank you in advance,
Adily
First, thank you for developing this nice tool.
I got an error while indexing human genome.
Step1. FASTA index
Step2. build rmis
build_rmis_dna.sh genome.fasta
Training top-level pwl model layer
Training second-level linear model layer (num models = 268435456)
[2nd layer]Computing lower bound stats...
[2nd layer]Fixing empty models...
Computing last level errors...
Average gap: 19.046705737228734
Total Partial model num: 41300812, Leaf of partial model num: 186417
Total last layer model num: 309549851
Partial start at idx:268249039
Model build time: 867524 ms
Average model error: 1494.9997627890018 (0.00002420435750882617%)
Average model L2 error: 42821253900152.66
Average model log2 error: 5.169317311164392
Max model log2 error: 18.635250483499124
Max model error on model 309539124: 407164 (0.006592069956143941%)
I hope you can help me. Thank you.