bede / hostile

Precise host read removal

License: MIT License

Python 98.93% Dockerfile 1.07%

hostile's Issues

Support for stdin (and stdout)

Would probably have to be for single reads only (nanopore)

Allow remote index repository to be overridden

Currently this is hardcoded. It would be nice if this could be overridden using an environment variable
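A minimal sketch of what such an override could look like. The variable name `HOSTILE_REPOSITORY_URL` and the default URL are placeholders invented for illustration; neither is taken from Hostile's code.

```python
import os

# Placeholder default; the real hardcoded URL is not shown in this issue.
DEFAULT_REPOSITORY_URL = "https://example.org/hostile-indexes"


def get_repository_url(env=os.environ) -> str:
    """Return the index repository URL, preferring an environment override."""
    # HOSTILE_REPOSITORY_URL is a hypothetical variable name.
    url = env.get("HOSTILE_REPOSITORY_URL", "").strip()
    return url.rstrip("/") if url else DEFAULT_REPOSITORY_URL
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.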

Automatic masking

Masking instructions are buried in the supplement. Masking between arbitrary refs should be automated and integrated into the tool as a subcommand

Allow non-default databases in cloud bucket to be downloaded on first run

Currently a custom database can be specified using --index. However, it must already be available on a local filesystem. If no index is specified and the default (human-t2t-hla) is not cached locally, the default is downloaded. It would be useful if applications depending on Hostile could override the default database, so that a custom database is downloaded automatically on first run if not already present.

Another way to do this would be to implement a database / fetch subcommand

What's the best way to override the index download directory?

I'm working on integrating this into a service.
If I want to cache the index files in a shared, non-user-specific location, would the best way to do that be to override XDG_DATA_DIR?

hostile/src/hostile/util.py

Line 32 in 82b3a5d

XDG_DATA_DIR = Path(user_data_dir("hostile", "Bede Constantinides"))

I was tracing through the code, and it looks like this isn't exposed as a parameter, like the output dir is.
But am I missing something?

Thanks!
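One way this could be exposed, sketched under assumptions: the environment variable name `HOSTILE_CACHE_DIR` and the function `resolve_cache_dir` are hypothetical, and `default_dir` stands in for the value currently computed with `user_data_dir(...)`.

```python
import os
from pathlib import Path


def resolve_cache_dir(default_dir: Path, env=os.environ) -> Path:
    """Pick the index cache directory, letting an environment variable win."""
    # HOSTILE_CACHE_DIR is a hypothetical variable name used for illustration.
    override = env.get("HOSTILE_CACHE_DIR")
    cache_dir = Path(override).expanduser() if override else default_dir
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

A service could then point the variable at a shared path before starting, leaving the per-user default unchanged for everyone else.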

I think you can use --name (-n) on the mamba command line to override the environment name in the env file.

no need for manual string replacing :-)

I, at least, know it works for env create.

hostile/Dockerfile

Lines 4 to 5 in 57be277

    
    RUN sed -i 's/name: hostile/name: base/' hostile/environment.yml
    RUN mamba env update -f hostile/environment.yml

Automatically choose most appropriate alignment backend for the input read type

Currently Bowtie2 is the default backend and is tried first regardless of the input read type, with Minimap2 as the fallback. As Bowtie2 is poorly suited to long reads and long read performance was evaluated with Minimap2 in the paper, Bowtie2 should probably not be the default for long reads.

How I think it should work:

Paired input --> bowtie2 backend by default
Paired input with --aligner minimap2 --> minimap2 with sr preset
Unpaired input --> Minimap2 with map-ont preset
Unpaired input with --aligner bowtie2 --> bowtie2 unpaired mode

I don't think there is a major need to be able to customise away from the map-ont preset for other long read technologies given that we are simply throwing reads out, but this may need to be revisited in future
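The four cases above can be sketched as a single selection function. The function name and the returned mode labels `paired` and `unpaired` are illustrative assumptions; `sr` and `map-ont` are the Minimap2 presets named in the list.

```python
from typing import Optional, Tuple


def choose_aligner(paired: bool, aligner: Optional[str] = None) -> Tuple[str, str]:
    """Map input type and optional user override to (backend, mode)."""
    if aligner not in (None, "bowtie2", "minimap2"):
        raise ValueError(f"Unknown aligner: {aligner!r}")
    if paired:
        # Paired input: Bowtie2 unless Minimap2 is requested explicitly
        if aligner == "minimap2":
            return ("minimap2", "sr")
        return ("bowtie2", "paired")
    # Unpaired input: Minimap2 unless Bowtie2 is requested explicitly
    if aligner == "bowtie2":
        return ("bowtie2", "unpaired")
    return ("minimap2", "map-ont")
```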

Suitable for metatranscriptomics?

Hi! I was just wondering if this would be suitable for removing human contaminants from metatranscriptomics (RNA-seq)? And, if so, would you recommend using the default human-t2t or the human-t2t + argos985?

Thank you!

Add aligner and index information to output json

Selection of appropriate reference genomes (indexes)

Hi, the software seems very good, but I am new in metagenome analysis and I have questions about the selection of appropriate reference genomes (indexes) when using 'hostile'.

I read the "Reference genomes (indexes)" part of the README, and my understanding is: compared with using 'human-t2t-hla', additional reads from the '985 reference grade bacterial genomes' will be preserved if 'human-t2t-hla-argos985' is used.
Is my understanding right?

I have metagenome sequencing data from human stool samples, and I want to analyze the bacteria, archaea, fungi, and viruses in these samples. I have used 'fastp' for quality control, and the next step should be host decontamination to remove reads from the host, i.e. humans (am I right?). Can this step be completed using 'hostile'? If so, how should I select reference genomes (indexes)? Should I select 'human-t2t-hla' for my objective?

Thank you!

Adding to Bioconda

Hi @bede

Hostile looks pretty awesome! You've pretty much got everything already in place to submit to Bioconda.

Are you OK if I do this? Otherwise if you are planning to, I'll hold off.

Cheers!
Robert

Refactor with more OOP

Refactoring around Task and Batch classes could simplify things, and make it practically possible to use temporary directories. Initial GIL paranoia wrt parallelisation led to the slightly unwieldy current implementation.

Support --custom-index

Notes

mm2 supports either .mmi, or .fasta with or without compression
bt2 requires a prebuilt index (takes forever) split across numerous files, specified as a path without an extension
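A rough sketch of how these two cases might be validated. The function names and the list of accepted FASTA suffixes are assumptions for illustration; the only external fact relied on is that a Bowtie2 index at `<prefix>` includes a file named `<prefix>.1.bt2` (or `<prefix>.1.bt2l` for large indexes).

```python
from pathlib import Path

# Assumed set of accepted suffixes; a real implementation may accept more.
MINIMAP2_SUFFIXES = (".mmi", ".fa", ".fasta", ".fa.gz", ".fasta.gz")


def is_minimap2_reference(path: Path) -> bool:
    """True if path exists and looks like an .mmi index or a (gzipped) FASTA."""
    return path.is_file() and path.name.lower().endswith(MINIMAP2_SUFFIXES)


def is_bowtie2_index(prefix: Path) -> bool:
    """True if a Bowtie2 index exists at this extension-less prefix."""
    # Bowtie2 writes <prefix>.1.bt2 (or .1.bt2l for large indexes) among others.
    return any(Path(f"{prefix}.1.{ext}").is_file() for ext in ("bt2", "bt2l"))
```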

Support for uncompressed fastq

Check if input files exist

When 0 reads remain after decontamination, an invalid gzip file is created

Irritatingly, Samtools considers an empty SAM/BAM file invalid. A workaround is to create an empty but valid gzip file, e.g.:

import gzip
with gzip.open('empty.fastq.gz', 'wb') as f:
    pass  # opening and closing in write mode yields a valid, empty gzip file

Regarding the choice of Reference genomes

Hi,
I've noticed that Hostile provides three options for reference genomes. I'm wondering which reference I should select in order to remove human genes from my microbiome samples before conducting metagenomics classification.

Could you please explain the advantages of including HLA, argos985, and mycob140 in the host removal process, particularly in the context of clinical samples with CNS infections?

Crashes with exclusively contaminated input

Needs a test case

BAM ingest

Bowtie2 performance suffers when using many threads in some conditions

For some datasets, Bowtie2's CPU utilisation decreases with increasing numbers of threads. Workaround is not to use more than 8-16 threads. Raised upstream BenLangmead/bowtie2#437

Make fastq header stripping optional

Make --debug useful

out_dir / --out-dir broken

% hostile dehost --fastq1 tests/data/h37rv_10.r1.fastq.gz --fastq2 tests/data/h37rv_10.r2.fastq.gz --out-dir test
INFO: Using Bowtie2
INFO: Using cached human index (/Users/bede/Library/Application Support/hostile/human-bowtie2)
Dehosting:   0%|                                                                                                                | 0/1 [00:00<?, ?it/s]Exception occurred during executing command bowtie2 -x '/Users/bede/Library/Application Support/hostile/human-bowtie2' -1 'tests/data/h37rv_10.r1.fastq.gz' -2 'tests/data/h37rv_10.r2.fastq.gz' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk 'BEGIN{FS=OFS="\t"} {$1=int((NR+1)/2)" "; print $0}' | samtools fastq --threads 5 -c 6 -N -1 'test/h37rv_10.r1.dehosted_1.fastq.gz' -2 'test/h37rv_10.r2.dehosted_2.fastq.gz': Command '['/bin/bash', '-c', 'bowtie2 -x \'/Users/bede/Library/Application Support/hostile/human-bowtie2\' -1 \'tests/data/h37rv_10.r1.fastq.gz\' -2 \'tests/data/h37rv_10.r2.fastq.gz\' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk \'BEGIN{FS=OFS="\\t"} {$1=int((NR+1)/2)" "; print $0}\' | samtools fastq --threads 5 -c 6 -N -1 \'test/h37rv_10.r1.dehosted_1.fastq.gz\' -2 \'test/h37rv_10.r2.dehosted_2.fastq.gz\'']' returned non-zero exit status 1.
Dehosting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.09it/s]
Traceback (most recent call last):
  File "/Users/bede/miniconda3/envs/hostile/bin/hostile", line 8, in <module>
    sys.exit(main())
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 68, in main
    defopt.run(
  File "/Users/bede/miniconda3/envs/hostile/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 28, in dehost
    stats = lib.dehost_paired_fastqs(
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 140, in dehost_paired_fastqs
    stats = gather_stats(fastqs, out_dir=out_dir)
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 89, in gather_stats
    n_reads_in = util.parse_count_file(n_reads_in_path)
  File "/Users/bede/Research/Git/hostile/src/hostile/util.py", line 55, in parse_count_file
    with open(path, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'test/h37rv_10.r1.reads_in.txt'

Add completion message on stderr

e.g. Removed x/y reads (z%)
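A small sketch of how such a message could be formatted. The function name and the two-decimal precision are illustrative choices, not part of the existing code.

```python
def completion_message(n_reads_in: int, n_reads_out: int) -> str:
    """Format a one-line summary of how many reads were removed."""
    n_removed = n_reads_in - n_reads_out
    # Guard against empty input to avoid dividing by zero
    percent = 100 * n_removed / n_reads_in if n_reads_in else 0.0
    return f"Removed {n_removed}/{n_reads_in} reads ({percent:.2f}%)"
```

The result could be emitted with `print(completion_message(n_in, n_out), file=sys.stderr)` so that stdout stays free for machine-readable output.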

Return number of reads before and after decontamination

Two calls to tee in stream?

Automatically generate and cache minimap2 indexes to eliminate redundant indexing overhead

For whatever reason, when initially implementing long read support using Minimap2, I was unable to demonstrate significantly reduced execution time versus recreating the index from scratch every time hostile clean is called. Using a prebuilt index was only marginally quicker and frankly not worth the complexity of managing indexes. However, recently I tested whether this is still the case and observed that running hostile clean on a small long read fastq drops from taking ~45s to ~7s through use of a precomputed index.

This behaviour should first be characterised / verified on Linux and macOS. Assuming the performance benefits are replicated on both OSs, invisible (but suitably logged) index caching and reuse should be added unless a good reason not to do so becomes apparent.

This will dramatically reduce execution time for processing many long read samples where this redundant indexing overhead is painful.
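A possible shape for the caching logic, as a sketch only. The function name, the cache file naming scheme, and the choice of the `map-ont` preset are assumptions; the command uses Minimap2's `-d` option, which writes an index to a file. The function only plans the work and does not run Minimap2 itself.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def plan_minimap2_index(ref: Path, cache_dir: Path) -> Tuple[Path, Optional[List[str]]]:
    """Return the cached .mmi path and the build command if it is missing."""
    mmi_path = cache_dir / f"{ref.name}.mmi"
    if mmi_path.is_file():
        return mmi_path, None  # Reuse the cached index
    # `minimap2 -x PRESET -d OUT.mmi REF` writes an index without aligning
    build_cmd = ["minimap2", "-x", "map-ont", "-d", str(mmi_path), str(ref)]
    return mmi_path, build_cmd
```

Because Minimap2 indexing parameters depend on the preset, a cached index should only be reused with the preset it was built for.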

Avoid overwriting output unless forced

Should probably complain about output files already existing unless a --force flag is given
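A minimal sketch of such a check. The function name is hypothetical; `--force` is the flag proposed above.

```python
from pathlib import Path
from typing import Iterable


def check_outputs(paths: Iterable[Path], force: bool = False) -> None:
    """Raise FileExistsError if any output exists and force is not set."""
    existing = [str(p) for p in paths if p.exists()]
    if existing and not force:
        raise FileExistsError(
            "Output file(s) already exist, use --force to overwrite: "
            + ", ".join(existing)
        )
```

Running this before any alignment starts means the user finds out immediately, not after a long job has already clobbered earlier results.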

Skip checking presence of default ref/index if user supplies custom ref/index

Currently a default ref/index is downloaded even if a custom index is specified. Thanks for raising @pvanheus

Validate scrubbed output in tests/CI

Corrupted output if decontaminating more than one sample using Python API

gen_clean_cmd() and gen_paired_clean_cmd() both mutate Aligner.cmd and Aligner.paired_cmd, and since Aligner is instantiated once, templating fails after processing the first fastq / pair of fastqs leading to corruption of the output of subsequent samples when using the Python API. Templating should not mutate Aligner instance. Needs tests for single and paired reads.
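To illustrate the intended fix, here is a simplified stand-in for the class. The template string and the field layout are hypothetical; the real `Aligner` has more attributes. The point is that the method returns a freshly rendered command and leaves the stored template untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aligner:
    """Hypothetical stand-in for the real class; holds an immutable template."""
    cmd: str = "minimap2 -ax map-ont '{index}' '{fastq}'"

    def gen_clean_cmd(self, index: str, fastq: str) -> str:
        # Build a new string on each call; self.cmd is never reassigned
        return self.cmd.format(index=index, fastq=fastq)
```

Making the dataclass frozen turns any accidental reassignment of `cmd` into an immediate error, which is the kind of regression a test over two consecutive samples would also catch.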

Dockerise

Add versions and options to json log

Support accepting stdin instead of a specific filepath for single ended data

A common use I have for hostile is to clean up and concatenate a directory of fastqs prior to moving them off the instrument. It would help if I could do all of this with a single one-liner such as:

cat *.fastq.gz | hostile clean --fastq1 - > combined_clean.fastq.gz

This would make my life easier. As it stands, hostile works absolutely fine, since I can just concatenate and then run the hostile clean command, but doing it all in a more pipe-centric way would be nice to have!
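One piece of this that can be sketched independently is reading a stream that may or may not be gzipped, as would arrive from `cat *.fastq.gz`. The helper name is hypothetical; it relies on Python's `gzip` module reading concatenated gzip members as one continuous stream.

```python
import gzip
import io
from typing import BinaryIO


def open_maybe_gzip(stream: io.BufferedReader) -> BinaryIO:
    """Wrap a binary stream so gzipped and plain input read the same way."""
    # peek() inspects buffered bytes without consuming them
    if stream.peek(2)[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return stream
```

When `--fastq1 -` is given, the tool could call `open_maybe_gzip(sys.stdin.buffer)` and hand the resulting handle to the aligner's input.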

Discard partially downloaded indexes

Currently if a genome/index download is abandoned, Hostile may think it's present and correct leading to errors. Could download to a temp location and move into $XDG_DATA_DIR or download and rename etc
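The download-then-rename idea could be sketched as follows. The function name and the `write_to` callback are illustrative assumptions; the callback stands in for whatever streams the remote file to disk.

```python
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


def fetch_atomically(dest: Path, write_to: Callable[[BinaryIO], None]) -> Path:
    """Write via a temporary file, then rename so dest is never partial."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fh:
            write_to(fh)  # e.g. stream the HTTP response body into fh
        os.replace(tmp_name, dest)  # atomic on POSIX within one filesystem
    except BaseException:
        # Clean up the partial file on failure or interruption
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

With this pattern, the presence of `dest` implies the write completed, so a simple existence check is enough to decide whether the index is cached.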

Mode: paired short read (Bowtie2) fails when index is provided as reference fasta file

Thanks for providing us with a hassle-free and fast dehosting tool.

I am however running into an issue when using PE fastq files and providing the custom reference fasta file of Bos taurus.
For ONT fastq files, hostile automatically indexes the fasta file but the same is not true for PE bowtie mode. Can this be implemented?

hostile clean --fastq1 SRR27845761_1.fastq.gz --fastq2 SRR27845761_2.fastq.gz --threads 10 --index Bos_taurus.ARS-UCD1.3.dna.toplevel.fa

10:37:37 INFO: Hostile version 1.1.0. Mode: paired short read (Bowtie2)
Traceback (most recent call last):
  File "/home/subudhak/miniconda3/envs/serotyper/bin/hostile", line 10, in <module>
    sys.exit(main())
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 154, in main
    defopt.run(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 68, in clean
    stats = lib.clean_paired_fastqs(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/lib.py", line 225, in clean_paired_fastqs
    index_path = aligner.value.check_index(index, offline=offline)
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/aligner.py", line 61, in check_index
    raise FileNotFoundError(message)
FileNotFoundError: Bos_taurus.ARS-UCD1.3.dna.toplevel.fa is neither a valid custom index path nor a valid standard index name

Incorporating Single-End Short Read Data Support

I want to be able to utilize single-end short-read data in addition to paired-end data

Acceptance Criteria:

As a user, I should be able to specify single-end short-read data as input to the Bowtie2 tool.
The tool should correctly process and align single-end reads using Bowtie2.
The tool's documentation should reflect the newly added support for single-end short-read data.
The tool's performance with single-end data should be evaluated and compared to its performance with paired-end data.
The tool should provide appropriate warnings or errors if incompatible data types are provided as input.

By implementing this user story, users can utilize single-end short-read data alongside paired-end data in their genomic analysis workflows, enhancing the tool's utility and accessibility for diverse research needs.
