bede / hostile
Precise host read removal
License: MIT License
Would probably have to be for single reads only (nanopore)
Currently this is hardcoded. It would be nice if this could be overridden using an environment variable
Masking instructions are buried in the supplement. Masking between arbitrary refs should be automated and integrated into the tool as a subcommand
Currently a custom database can be specified using --index. However, this needs to be already available on a local filesystem. If no index is specified and the default (human-t2t-hla) is not cached locally, it is downloaded. It would be useful if applications depending on Hostile could override the default database such that a custom database could be automatically downloaded on first run if not already present.
Another way to do this would be to implement a database / fetch subcommand
I'm working on integrating this into a service.
If I want to cache the index files to a non-user-specific shared location (e.g.), would the best way to do that be to maybe override the XDG_DATA_DIR?
Line 32 in 82b3a5d
I was tracing through the code, and it looks like this isn't exposed as a parameter, like the output dir is.
But am I missing something?
Thanks!
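One way an overridable cache location could look is sketched below. This is an illustration under stated assumptions, not Hostile's actual code: the environment variable EXAMPLE_CACHE_DIR and the function resolve_cache_dir are hypothetical names, and the fallback follows the XDG convention of using XDG_DATA_HOME, then ~/.local/share.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    env: Optional[Mapping[str, str]] = None,
    env_var: str = "EXAMPLE_CACHE_DIR",  # hypothetical variable name
) -> Path:
    """Return the directory in which index files would be cached."""
    env = os.environ if env is None else env
    override = env.get(env_var)
    if override:
        # An explicit override wins, e.g. a shared, non-user-specific location
        return Path(override)
    xdg = env.get("XDG_DATA_HOME")
    base = Path(xdg) if xdg else Path.home() / ".local" / "share"
    return base / "hostile"
```

A service could then export the variable once and have every user share the same cached indexes.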
no need for manual string replacing :-)
I, at least, know it works for env create.
Lines 4 to 5 in 57be277
Currently Bowtie2 is the default backend and is tried first regardless of the input read type, with Minimap2 as the fallback. As Bowtie2 is poorly suited to long reads and long read performance was evaluated with Minimap2 in the paper, Bowtie2 should probably not be the default for long reads.
How I think it should work:
bowtie2 backend by default
--aligner minimap2 --> minimap2 with sr preset
map-ont preset
--aligner bowtie2 --> bowtie2 unpaired mode
I don't think there is a major need to be able to customise away from the map-ont preset for other long read technologies given that we are simply throwing reads out, but this may need to be revisited in future.
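The proposed defaults can be sketched as a small lookup. This is one reading of the list above, assuming the first two items describe short reads and the last two describe long reads; choose_aligner and SETTINGS are hypothetical names, not part of Hostile.

```python
from typing import Optional, Tuple

# (long_reads, aligner) -> preset or mode, following the proposal above
SETTINGS = {
    (False, "bowtie2"): None,        # short reads, default backend
    (False, "minimap2"): "sr",       # short reads via --aligner minimap2
    (True, "minimap2"): "map-ont",   # long reads, proposed default
    (True, "bowtie2"): "unpaired",   # long reads via --aligner bowtie2
}


def choose_aligner(
    long_reads: bool, requested: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Pick the backend and its preset/mode for the given read type."""
    aligner = requested or ("minimap2" if long_reads else "bowtie2")
    if (long_reads, aligner) not in SETTINGS:
        raise ValueError(f"Unknown aligner: {aligner}")
    return aligner, SETTINGS[(long_reads, aligner)]
```

Keeping the mapping in one table makes it easy to revisit if other long read presets are needed later.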
Hi! I was just wondering if this would be suitable for removing human contaminants from metatranscriptomics (RNA-seq)? And, if so, would you recommend using the default human-t2t or the human-t2t + argos985?
Thank you!
Hi, the software seems very good, but I am new to metagenome analysis and I have questions about selecting appropriate reference genomes (indexes) when using 'hostile'.
I read the "Reference genomes (indexes)" part of the README, and my understanding is: compared with using 'human-t2t-hla', additional reads from the '985 reference grade bacterial genomes' will be preserved if 'human-t2t-hla-argos985' is used.
Is my understanding right?
I have metagenome sequencing data from human stool samples, and I want to analyze the bacteria, archaea, fungi, and viruses in these samples. I have used 'fastp' for quality control, and the next step should be host decontamination to remove reads from the host, i.e. humans (am I right?). Can this step be completed using 'hostile'? If so, how should I select reference genomes (indexes)? Should I select 'human-t2t-hla' for my objective?
Thank you!
Hi @bede
Hostile looks pretty awesome! You've pretty much got everything already in place to submit to Bioconda.
Are you OK if I do this? Otherwise if you are planning to, I'll hold off.
Cheers!
Robert
Refactoring around Task and Batch classes could simplify things, and make it practically possible to use temporary directories. Initial GIL paranoia wrt parallelisation led to slightly unwieldy current implementation.
Notes
.mmi, or .fasta with or without compression
Samtools considers an empty SAM/BAM file invalid, irritatingly. Workaround is to create an empty but valid gzip file e.g.:
import gzip

# Opening for writing and closing immediately yields an empty but valid gzip file
with gzip.open('empty.fastq.gz', 'wb') as f:
    pass
Hi,
I've noticed that Hostile provides three options for reference genomes. I'm wondering which reference I should select in order to remove human genes from my microbiome samples before conducting metagenomics classification.
Could you please explain the advantages of including HLA, argos985, and mycob140 in the host removal process, particularly in the context of clinical samples with CNS infections?
Needs a test case
For some datasets, Bowtie2's CPU utilisation decreases with increasing numbers of threads. Workaround is not to use more than 8-16 threads. Raised upstream BenLangmead/bowtie2#437
% hostile dehost --fastq1 tests/data/h37rv_10.r1.fastq.gz --fastq2 tests/data/h37rv_10.r2.fastq.gz --out-dir test
INFO: Using Bowtie2
INFO: Using cached human index (/Users/bede/Library/Application Support/hostile/human-bowtie2)
Dehosting: 0%| | 0/1 [00:00<?, ?it/s]Exception occurred during executing command bowtie2 -x '/Users/bede/Library/Application Support/hostile/human-bowtie2' -1 'tests/data/h37rv_10.r1.fastq.gz' -2 'tests/data/h37rv_10.r2.fastq.gz' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk 'BEGIN{FS=OFS="\t"} {$1=int((NR+1)/2)" "; print $0}' | samtools fastq --threads 5 -c 6 -N -1 'test/h37rv_10.r1.dehosted_1.fastq.gz' -2 'test/h37rv_10.r2.dehosted_2.fastq.gz': Command '['/bin/bash', '-c', 'bowtie2 -x \'/Users/bede/Library/Application Support/hostile/human-bowtie2\' -1 \'tests/data/h37rv_10.r1.fastq.gz\' -2 \'tests/data/h37rv_10.r2.fastq.gz\' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk \'BEGIN{FS=OFS="\\t"} {$1=int((NR+1)/2)" "; print $0}\' | samtools fastq --threads 5 -c 6 -N -1 \'test/h37rv_10.r1.dehosted_1.fastq.gz\' -2 \'test/h37rv_10.r2.dehosted_2.fastq.gz\'']' returned non-zero exit status 1.
Dehosting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.09it/s]
Traceback (most recent call last):
File "/Users/bede/miniconda3/envs/hostile/bin/hostile", line 8, in <module>
sys.exit(main())
File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 68, in main
defopt.run(
File "/Users/bede/miniconda3/envs/hostile/lib/python3.10/site-packages/defopt.py", line 356, in run
return call()
File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 28, in dehost
stats = lib.dehost_paired_fastqs(
File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 140, in dehost_paired_fastqs
stats = gather_stats(fastqs, out_dir=out_dir)
File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 89, in gather_stats
n_reads_in = util.parse_count_file(n_reads_in_path)
File "/Users/bede/Research/Git/hostile/src/hostile/util.py", line 55, in parse_count_file
with open(path, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'test/h37rv_10.r1.reads_in.txt'
e.g. Removed x/y reads (z%)
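A summary line of that shape could be produced along these lines. This is only a sketch: format_removed is a hypothetical helper, and it assumes x is the number of reads removed and y the number of input reads.

```python
def format_removed(reads_in: int, reads_out: int) -> str:
    """Build a 'Removed x/y reads (z%)' summary from read counts."""
    removed = reads_in - reads_out
    # Avoid dividing by zero when the input contained no reads
    pct = 100 * removed / reads_in if reads_in else 0.0
    return f"Removed {removed}/{reads_in} reads ({pct:.2f}%)"
```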
Two calls to tee in stream?
For whatever reason, when initially implementing long read support using Minimap2, I was unable to demonstrate significantly reduced execution time versus recreating the index from scratch every time hostile clean is called. Using a prebuilt index was only marginally quicker and frankly not worth the complexity of managing indexes. However, recently I tested whether this is still the case and observed that running hostile clean on a small long read fastq drops from taking ~45s to ~7s through use of a precomputed index.
This behaviour should first be characterised / verified on Linux and MacOS. Assuming the performance benefits are replicated on both OSs, adding invisible (but suitably logged) index caching and reuse should be done unless a good reason not to do so becomes apparent.
This will dramatically reduce execution time for processing many long read samples where this redundant indexing overhead is painful.
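A rough sketch of how invisible index caching might be keyed is shown below. The helper names cached_index_path and index_command are hypothetical and this is not how Hostile does it; the only external fact relied on is that minimap2 writes a reusable index with -d. Keying on the resolved path and preset alone would miss in-place edits to the reference, so a real implementation would probably also want the file size or modification time in the key.

```python
import hashlib
from pathlib import Path
from typing import List


def cached_index_path(reference: Path, cache_dir: Path, preset: str = "map-ont") -> Path:
    """Derive a stable .mmi path for a given reference and preset."""
    key = hashlib.sha256(f"{reference.resolve()}:{preset}".encode()).hexdigest()[:16]
    return cache_dir / f"{reference.name}.{preset}.{key}.mmi"


def index_command(reference: Path, index_path: Path, preset: str = "map-ont") -> List[str]:
    """Command that builds the index once; later runs can align against index_path."""
    return ["minimap2", "-x", preset, "-d", str(index_path), str(reference)]
```

The caller would run index_command only when cached_index_path does not yet exist, and log whether the index was built or reused.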
Should probably complain about output files already existing unless a --force flag is given
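The check could be as small as the following sketch, where check_outputs is a hypothetical helper and the force argument stands in for the proposed --force flag.

```python
from pathlib import Path
from typing import Iterable, Union


def check_outputs(paths: Iterable[Union[str, Path]], force: bool = False) -> None:
    """Raise if any output path already exists, unless force is set."""
    existing = [str(p) for p in map(Path, paths) if p.exists()]
    if existing and not force:
        raise FileExistsError(
            "Output file(s) already exist: "
            + ", ".join(existing)
            + ". Use --force to overwrite."
        )
```

Running this before any alignment starts means a long job fails immediately rather than after the work is done.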
Currently a default ref/index is downloaded even if a custom index is specified. Thanks for raising @pvanheus
gen_clean_cmd() and gen_paired_clean_cmd() both mutate Aligner.cmd and Aligner.paired_cmd, and since Aligner is instantiated once, templating fails after processing the first fastq / pair of fastqs, leading to corruption of the output of subsequent samples when using the Python API. Templating should not mutate the Aligner instance. Needs tests for single and paired reads.
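The non-mutating approach can be illustrated with a toy class. AlignerSketch and its placeholder names are assumptions for illustration and do not mirror the real Aligner; the point is only that the template is read, never reassigned, so every call starts from the same string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: accidental assignment to cmd raises an error
class AlignerSketch:
    cmd: str  # command template with named placeholders

    def gen_clean_cmd(self, fastq: str, index: str) -> str:
        # Return a new string; the stored template is left untouched
        return self.cmd.format(FASTQ=fastq, INDEX=index)


aligner = AlignerSketch(cmd="minimap2 -ax map-ont {INDEX} {FASTQ}")
first = aligner.gen_clean_cmd("a.fastq.gz", "ref.mmi")
second = aligner.gen_clean_cmd("b.fastq.gz", "ref.mmi")
```

A regression test would then call the generator twice on one instance and check that the second command refers to the second sample.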
A common use I have for hostile is to clean up and concatenate a directory of fastqs prior to moving them off instrument. It would help if I could do all of this with a single one-liner such as:
cat *.fastq.gz | hostile clean --fastq1 - > combined_clean.fastq.gz
This would make my life easier. As it stands, hostile works absolutely fine since I can just concatenate and then run the hostile clean command, but doing it all in a more pipe-centric way would be a nice-to-have!
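On the reading side, concatenated gzip input is not an obstacle in Python, since the gzip module reads multi-member streams transparently. The sketch below uses a hypothetical iter_fastq_lines helper and an in-memory stream standing in for sys.stdin.buffer; it says nothing about how Hostile itself would wire up a "-" argument.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_fastq_lines(stream: BinaryIO) -> Iterator[str]:
    """Yield decoded lines from a (possibly concatenated) gzip stream."""
    with gzip.open(stream, "rt") as fh:
        yield from fh


# Two separately compressed files joined together, as `cat *.fastq.gz` would do
demo = io.BytesIO(
    gzip.compress(b"@r1\nACGT\n+\nIIII\n") + gzip.compress(b"@r2\nTTTT\n+\nIIII\n")
)
demo_lines = list(iter_fastq_lines(demo))
```

In a command-line tool the same function would be called with sys.stdin.buffer when the input path is "-".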
Currently if a genome/index download is abandoned, Hostile may think it's present and correct, leading to errors. Could download to a temp location and move into $XDG_DATA_DIR or download and rename etc
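The download-then-rename idea can be sketched as below, assuming the download arrives as an iterable of byte chunks. atomic_write is a hypothetical helper; the temporary file is created in the destination directory so that os.replace stays on one filesystem and the final path only ever appears complete.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write(dest: Path, chunks: Iterable[bytes]) -> None:
    """Write chunks to a temporary file, then move it into place in one step."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
        os.replace(tmp_name, dest)  # dest becomes visible only when complete
    except BaseException:
        # Interrupted or failed download: remove the partial file and re-raise
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this pattern, the presence of the destination file is a reliable signal that the download finished.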
Thanks for providing us with a hassle-free and fast dehosting tool.
I am however running into an issue when using PE fastq files and providing a custom reference fasta file for Bos taurus.
For ONT fastq files, hostile automatically indexes the fasta file, but the same is not true for PE Bowtie2 mode. Can this be implemented?
hostile clean --fastq1 SRR27845761_1.fastq.gz --fastq2 SRR27845761_2.fastq.gz --threads 10 --index Bos_taurus.ARS-UCD1.3.dna.toplevel.fa
10:37:37 INFO: Hostile version 1.1.0. Mode: paired short read (Bowtie2)
Traceback (most recent call last):
File "/home/subudhak/miniconda3/envs/serotyper/bin/hostile", line 10, in <module>
sys.exit(main())
File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 154, in main
defopt.run(
File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/defopt.py", line 356, in run
return call()
File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 68, in clean
stats = lib.clean_paired_fastqs(
File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/lib.py", line 225, in clean_paired_fastqs
index_path = aligner.value.check_index(index, offline=offline)
File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/aligner.py", line 61, in check_index
raise FileNotFoundError(message)
FileNotFoundError: Bos_taurus.ARS-UCD1.3.dna.toplevel.fa is neither a valid custom index path nor a valid standard index name
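What is being asked for could be approached roughly as follows: detect that --index points at a FASTA file and build a Bowtie2 index from it first. This is a sketch only; is_fasta, bowtie2_build_cmd and the suffix list are hypothetical, and the only external tool assumed is bowtie2-build, which ships with Bowtie2.

```python
from pathlib import Path
from typing import List

FASTA_SUFFIXES = (".fa", ".fasta", ".fna")  # assumed set of accepted extensions


def is_fasta(path: str) -> bool:
    """Guess from the file name whether a path is a (possibly gzipped) FASTA."""
    name = Path(path).name.lower()
    if name.endswith(".gz"):
        name = name[:-3]
    return name.endswith(FASTA_SUFFIXES)


def bowtie2_build_cmd(fasta: str, index_prefix: str, threads: int = 1) -> List[str]:
    """Command that builds a Bowtie2 index with the given prefix from a FASTA."""
    return ["bowtie2-build", "--threads", str(threads), fasta, index_prefix]
```

Building a Bowtie2 index for a mammalian genome takes considerably longer than a Minimap2 index, so caching the result would matter more here.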
I want to be able to utilize single-end short-read data in addition to paired-end data
Acceptance Criteria:
By implementing this user story, users can utilize single-end short-read data alongside paired-end data in their genomic analysis workflows, enhancing the tool's utility and accessibility for diverse research needs.
For unpaired short reads
Do you have plans to create a reference genome with viruses masked?
Running multiple instances of Hostile on the same FASTQs in the same directory corrupts decontamination statistics since they will write to the same count files. Could fix by putting these inside a tempfile.TemporaryDirectory
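The suggested fix might look something like this sketch, in which each run gets its own private directory for the count files. run_with_private_counts and the work callback are hypothetical; the file naming simply echoes the reads_in / reads_out count files seen in the logs.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def run_with_private_counts(
    fastq_stem: str, work: Callable[[Path, Path], None]
) -> Tuple[int, int]:
    """Run work() with count files placed in a per-run temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        reads_in = Path(tmp) / f"{fastq_stem}.reads_in.txt"
        reads_out = Path(tmp) / f"{fastq_stem}.reads_out.txt"
        work(reads_in, reads_out)  # the pipeline writes its counts here
        # Read the counts before the directory is cleaned up on exit
        return int(reads_in.read_text()), int(reads_out.read_text())
```

Because each invocation creates a fresh directory, concurrent runs on the same FASTQs no longer share count files.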
CM.