hostile's People

Contributors

bede

hostile's Issues

Automatic masking

Masking instructions are currently buried in the supplement. Masking between arbitrary reference genomes should be automated and integrated into the tool as a subcommand.

Allow non-default databases in cloud bucket to be downloaded on first run

Currently a custom database can be specified using --index. However, it needs to be already available on a local filesystem. If no index is specified and the default (human-t2t-hla) is not cached locally, the default is downloaded. It would be useful if applications depending on Hostile could override the default database, so that a custom database is automatically downloaded on first run if not already present.

Another way to do this would be to implement a database / fetch subcommand.

What's the best way to override the index download directory?

I'm working on integrating this into a service.
If I want to cache the index files to a non-user-specific shared location (e.g.), would the best way to do that be to maybe override the XDG_DATA_DIR?

XDG_DATA_DIR = Path(user_data_dir("hostile", "Bede Constantinides"))

I was tracing through the code, and it looks like this isn't exposed as a parameter, like the output dir is.
But am I missing something?

Thanks!
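
For illustration, here is a minimal sketch of how a service might let an environment variable take precedence over the per-user data directory. The variable name HOSTILE_CACHE_DIR and the function resolve_cache_dir are assumptions made for this example; nothing in this issue confirms that hostile reads such a variable.

```python
import os
from pathlib import Path

# Hypothetical variable name, chosen for illustration only.
CACHE_ENV_VAR = "HOSTILE_CACHE_DIR"


def resolve_cache_dir(default_dir: Path) -> Path:
    """Return the override directory if the variable is set, else the default."""
    override = os.environ.get(CACHE_ENV_VAR)
    if override:
        return Path(override).expanduser()
    return Path(default_dir)
```

A service could then export the variable once (pointing at a shared, writable location) and leave per-user installs untouched.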

Automatically choose most appropriate alignment backend for the input read type

Currently Bowtie2 is the default backend and is tried first regardless of the input read type, with Minimap2 as the fallback. As Bowtie2 is poorly suited to long reads and long read performance was evaluated with Minimap2 in the paper, Bowtie2 should probably not be the default for long reads.

How I think it should work:

  • Paired input --> Bowtie2 backend by default
  • Paired input with --aligner minimap2 --> Minimap2 with sr preset
  • Unpaired input --> Minimap2 with map-ont preset
  • Unpaired input with --aligner bowtie2 --> Bowtie2 in unpaired mode

I don't think there is a major need to customise away from the map-ont preset for other long read technologies, given that we are simply throwing reads out, but this may need to be revisited in future.
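
The four rules above can be sketched as a small selection function. The name choose_backend and its return shape are assumptions for this example, not hostile's actual API.

```python
from typing import Optional, Tuple

VALID_ALIGNERS = ("bowtie2", "minimap2")


def choose_backend(paired: bool, aligner: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (backend, minimap2_preset); the preset is None for Bowtie2."""
    if aligner is not None and aligner not in VALID_ALIGNERS:
        raise ValueError(f"Unknown aligner: {aligner}")
    # With no explicit choice, paired input implies short reads (Bowtie2)
    # and unpaired input implies long reads (Minimap2).
    backend = aligner or ("bowtie2" if paired else "minimap2")
    if backend == "bowtie2":
        return ("bowtie2", None)
    return ("minimap2", "sr" if paired else "map-ont")
```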

Suitable for metatranscriptomics?

Hi! I was just wondering if this would be suitable for removing human contaminants from metatranscriptomics (RNA-seq)? And, if so, would you recommend using the default human-t2t or the human-t2t + argos985?

Thank you!

Selection of appropriate reference genomes (indexes)

Hi, the software seems very good, but I am new to metagenome analysis and have questions about selecting appropriate reference genomes (indexes) when using 'hostile'.

I read the "Reference genomes (indexes)" part of the README, and my understanding is that, compared with using 'human-t2t-hla', additional reads from the '985 reference grade bacterial genomes' will be preserved if 'human-t2t-hla-argos985' is used.
Is my understanding right?

I have metagenome sequencing data from human stool samples, and I want to analyze the bacteria, archaea, fungi, and viruses in these samples. I have used 'fastp' for quality control, and the next step should be host decontamination to remove reads from the host, i.e. humans (am I right?). Can this step be completed using 'hostile'? If so, how should I select reference genomes (indexes)? Should I select 'human-t2t-hla' for my objective?

Thank you!

Adding to Bioconda

Hi @bede

Hostile looks pretty awesome! You've pretty much got everything already in place to submit to Bioconda.

Are you OK if I do this? Otherwise if you are planning to, I'll hold off.

Cheers!
Robert

Refactor with more OOP

Refactoring around Task and Batch classes could simplify things and make it practical to use temporary directories. Initial GIL paranoia regarding parallelisation led to the slightly unwieldy current implementation.

Support --custom-index

Notes

  • mm2 supports either .mmi, or .fasta with or without compression
  • bt2 requires a prebuilt index (takes forever) split across numerous files, specified as a path without an extension
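
As a rough sketch of how these two cases could be told apart when validating a custom index path: the function names and the list of FASTA suffixes are assumptions for this example. The `.1.bt2` / `.1.bt2l` file names are the ones Bowtie2 writes for small and large indexes.

```python
from pathlib import Path

# Assumed set of FASTA suffixes; extend as needed.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna")


def looks_like_minimap2_reference(path: str) -> bool:
    """True for a .mmi index or a FASTA file, optionally gzip-compressed."""
    name = Path(path).name.lower()
    if name.endswith(".mmi"):
        return True
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return name.endswith(FASTA_SUFFIXES)


def is_bowtie2_index(prefix: str) -> bool:
    """True if an extensionless prefix has a Bowtie2 index file next to it."""
    return any(Path(f"{prefix}.1.{ext}").is_file() for ext in ("bt2", "bt2l"))
```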

Regarding the choice of Reference genomes

Hi,
I've noticed that Hostile provides three options for reference genomes. I'm wondering which reference I should select in order to remove human genes from my microbiome samples before conducting metagenomics classification.

Could you please explain the advantages of including HLA, argos985, and mycob140 in the host removal process, particularly in the context of clinical samples with CNS infections?

out_dir / --out-dir broken

% hostile dehost --fastq1 tests/data/h37rv_10.r1.fastq.gz --fastq2 tests/data/h37rv_10.r2.fastq.gz --out-dir test
INFO: Using Bowtie2
INFO: Using cached human index (/Users/bede/Library/Application Support/hostile/human-bowtie2)
Dehosting:   0%|                                                                                                                | 0/1 [00:00<?, ?it/s]Exception occurred during executing command bowtie2 -x '/Users/bede/Library/Application Support/hostile/human-bowtie2' -1 'tests/data/h37rv_10.r1.fastq.gz' -2 'tests/data/h37rv_10.r2.fastq.gz' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk 'BEGIN{FS=OFS="\t"} {$1=int((NR+1)/2)" "; print $0}' | samtools fastq --threads 5 -c 6 -N -1 'test/h37rv_10.r1.dehosted_1.fastq.gz' -2 'test/h37rv_10.r2.dehosted_2.fastq.gz': Command '['/bin/bash', '-c', 'bowtie2 -x \'/Users/bede/Library/Application Support/hostile/human-bowtie2\' -1 \'tests/data/h37rv_10.r1.fastq.gz\' -2 \'tests/data/h37rv_10.r2.fastq.gz\' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk \'BEGIN{FS=OFS="\\t"} {$1=int((NR+1)/2)" "; print $0}\' | samtools fastq --threads 5 -c 6 -N -1 \'test/h37rv_10.r1.dehosted_1.fastq.gz\' -2 \'test/h37rv_10.r2.dehosted_2.fastq.gz\'']' returned non-zero exit status 1.
Dehosting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.09it/s]
Traceback (most recent call last):
  File "/Users/bede/miniconda3/envs/hostile/bin/hostile", line 8, in <module>
    sys.exit(main())
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 68, in main
    defopt.run(
  File "/Users/bede/miniconda3/envs/hostile/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 28, in dehost
    stats = lib.dehost_paired_fastqs(
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 140, in dehost_paired_fastqs
    stats = gather_stats(fastqs, out_dir=out_dir)
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 89, in gather_stats
    n_reads_in = util.parse_count_file(n_reads_in_path)
  File "/Users/bede/Research/Git/hostile/src/hostile/util.py", line 55, in parse_count_file
    with open(path, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'test/h37rv_10.r1.reads_in.txt'
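
One plausible cause, which the output above does not confirm, is that the directory passed to --out-dir does not exist when the shell pipeline tries to write into it. A minimal sketch of guarding against that, assuming a hypothetical helper called before the pipeline is launched:

```python
from pathlib import Path


def prepare_out_dir(out_dir: str) -> Path:
    """Create the output directory (and any parents) if it is missing."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```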

Automatically generate and cache minimap2 indexes to eliminate redundant indexing overhead

For whatever reason, when initially implementing long read support using Minimap2, I was unable to demonstrate significantly reduced execution time versus recreating the index from scratch every time hostile clean is called. Using a prebuilt index was only marginally quicker and frankly not worth the complexity of managing indexes. However, recently I tested whether this is still the case and observed that running hostile clean on a small long read fastq drops from taking ~45s to ~7s through use of a precomputed index.

This behaviour should first be characterised and verified on Linux and macOS. Assuming the performance benefits are replicated on both, invisible (but suitably logged) index caching and reuse should be added unless a good reason not to do so becomes apparent.

This will dramatically reduce execution time for processing many long read samples where this redundant indexing overhead is painful.
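
A minimal sketch of what such caching could look like, assuming a cache key derived from the reference path, size, modification time and preset. The function names and the key scheme are assumptions for this example; `minimap2 -d` is the standard way to write an index to disk.

```python
import hashlib
from pathlib import Path
from typing import List


def cached_index_path(reference: str, cache_dir: str, preset: str = "map-ont") -> Path:
    """Derive a cache file name that changes if the reference or preset changes."""
    ref = Path(reference).resolve()
    stat = ref.stat()
    key = f"{ref}:{stat.st_size}:{stat.st_mtime_ns}:{preset}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{ref.name}.{preset}.{digest}.mmi"


def index_build_cmd(reference: str, index_path: Path, preset: str = "map-ont") -> List[str]:
    """Command that writes a minimap2 index for later reuse."""
    return ["minimap2", "-x", preset, "-d", str(index_path), str(reference)]
```

The caller would run the build command only when the cached path does not yet exist, and log whether the index was built or reused.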

Corrupted output if decontaminating more than one sample using Python API

gen_clean_cmd() and gen_paired_clean_cmd() both mutate Aligner.cmd and Aligner.paired_cmd. Since Aligner is instantiated once, templating fails after the first fastq (or pair of fastqs) is processed, corrupting the output of subsequent samples when using the Python API. Templating should not mutate the Aligner instance. This needs tests for single and paired reads.
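
To illustrate the intended behaviour, here is a simplified stand-in (not the real Aligner class) in which the template is immutable and each call returns a freshly rendered command. The class name, template placeholders and method signature are assumptions for this sketch.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class AlignerSketch:
    """Simplified stand-in: the command template is never modified."""

    cmd: str  # template with $INDEX and $FASTQ placeholders

    def gen_clean_cmd(self, fastq: str, index: str) -> str:
        # Render into a new string; self.cmd stays a reusable template.
        return Template(self.cmd).substitute(INDEX=index, FASTQ=fastq)


aligner = AlignerSketch(cmd="minimap2 -ax map-ont $INDEX $FASTQ")
first = aligner.gen_clean_cmd("sample1.fastq.gz", "human.mmi")
second = aligner.gen_clean_cmd("sample2.fastq.gz", "human.mmi")
```

Because the dataclass is frozen, any accidental assignment to aligner.cmd raises an error instead of silently corrupting later samples.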

Support accepting stdin instead of a specific filepath for single ended data

A common use I have for hostile is to clean up and concatenate a directory of fastqs prior to moving them off the instrument. It would help if I could do all of this with a one-liner such as:

cat *.fastq.gz | hostile clean --fastq1 - > combined_clean.fastq.gz

This would make my life easier. As it stands, hostile works absolutely fine, since I can just concatenate and then run the hostile clean command, but doing it all in a more pipe-centric way would be nice to have!
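
A minimal sketch of the usual convention, where "-" means standard input. The helper name open_fastq and its stdin parameter are assumptions for this example and do not describe how hostile handles --fastq1.

```python
import sys
from typing import BinaryIO, Optional


def open_fastq(path: str, stdin: Optional[BinaryIO] = None) -> BinaryIO:
    """Return a binary stream for the given path, treating "-" as stdin."""
    if path == "-":
        # Allow a stream to be injected for testing; default to real stdin.
        return stdin if stdin is not None else sys.stdin.buffer
    return open(path, "rb")
```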

Discard partially downloaded indexes

Currently, if a genome/index download is abandoned, Hostile may treat the partial files as present and correct, leading to errors. It could download to a temporary location and then move the files into $XDG_DATA_DIR, or download under a temporary name and rename on completion.
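
The download-then-rename idea can be sketched as follows. The function name fetch_atomically and the write callback are assumptions for this example; the callback stands in for whatever actually streams the download.

```python
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


def fetch_atomically(dest: Path, write: Callable[[BinaryIO], None]) -> Path:
    """Write to a temporary file beside dest, then rename it into place."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fh:
            write(fh)
        os.replace(tmp_name, dest)
    except BaseException:
        # Interrupted or failed: discard the partial file and re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

With this pattern the final path only ever exists once the write has completed, so a presence check on it remains meaningful.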

Mode: paired short read (Bowtie2) fails when index is provided as reference fasta file

Thanks for providing us with a hassle-free and fast dehosting tool.

I am, however, running into an issue when using PE fastq files with a custom reference fasta file for Bos taurus.
For ONT fastq files, hostile automatically indexes the fasta file, but the same is not true for PE Bowtie2 mode. Can this be implemented?

hostile clean --fastq1 SRR27845761_1.fastq.gz --fastq2 SRR27845761_2.fastq.gz --threads 10     --index Bos_taurus.ARS-UCD1.3.dna.toplevel.fa
10:37:37 INFO: Hostile version 1.1.0. Mode: paired short read (Bowtie2)
Traceback (most recent call last):
  File "/home/subudhak/miniconda3/envs/serotyper/bin/hostile", line 10, in <module>
    sys.exit(main())
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 154, in main
    defopt.run(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 68, in clean
    stats = lib.clean_paired_fastqs(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/lib.py", line 225, in clean_paired_fastqs
    index_path = aligner.value.check_index(index, offline=offline)
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/aligner.py", line 61, in check_index
    raise FileNotFoundError(message)
FileNotFoundError: Bos_taurus.ARS-UCD1.3.dna.toplevel.fa is neither a valid custom index path nor a valid standard index name
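
As a possible workaround until this is supported, a Bowtie2 index can be built from the fasta beforehand with bowtie2-build and its prefix passed to --index. The sketch below only assembles that command; the helper names and the prefix-naming rule are assumptions for this example.

```python
from pathlib import Path
from typing import List


def default_index_prefix(fasta: str) -> Path:
    """Strip .gz and a FASTA suffix to get an extensionless index prefix."""
    path = Path(fasta)
    name = path.name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    for suffix in (".fasta", ".fna", ".fa"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return path.with_name(name)


def bowtie2_build_cmd(fasta: str, threads: int = 1) -> List[str]:
    """Command that builds a Bowtie2 index next to the FASTA file."""
    prefix = default_index_prefix(fasta)
    return ["bowtie2-build", "--threads", str(threads), str(fasta), str(prefix)]
```

For the file in this report, the prefix would be Bos_taurus.ARS-UCD1.3.dna.toplevel, which is the value to give --index once the build has finished.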

Incorporating Single-End Short Read Data Support

I want to be able to use single-end short-read data in addition to paired-end data.

Acceptance Criteria:

  1. As a user, I should be able to specify single-end short-read data as input to the Bowtie2 tool.
  2. The tool should correctly process and align single-end reads using Bowtie2.
  3. The tool's documentation should reflect the newly added support for single-end short-read data.
  4. The tool's performance with single-end data should be evaluated and compared to its performance with paired-end data.
  5. The tool should provide appropriate warnings or errors if incompatible data types are provided as input.

By implementing this user story, users can utilize single-end short-read data alongside paired-end data in their genomic analysis workflows, enhancing the tool's utility and accessibility for diverse research needs.
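
Criteria 1, 2 and 5 could be sketched roughly as below. Bowtie2 takes unpaired reads with -U and paired reads with -1/-2, and samtools selects unmapped single reads with -f 4 and pairs where both mates are unmapped with -f 12. The function name and the specific validation rule are assumptions for this example.

```python
from typing import List, Optional, Tuple


def bowtie2_read_args(fastq1: str, fastq2: Optional[str] = None) -> Tuple[List[str], int]:
    """Return (Bowtie2 input arguments, samtools -f flag for unmapped reads)."""
    if not fastq1:
        raise ValueError("fastq1 is required")
    if fastq2 is None:
        # Single-end: unpaired input, keep reads flagged as unmapped (4).
        return (["-U", fastq1], 4)
    if fastq1 == fastq2:
        raise ValueError("fastq1 and fastq2 must be different files")
    # Paired-end: keep pairs where read and mate are both unmapped (4 + 8).
    return (["-1", fastq1, "-2", fastq2], 12)
```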
