pdimens / harpy
Process raw haplotagging data, from raw sequences to phased haplotypes, batteries included.
Home Page: https://pdimens.github.io/harpy
License: GNU General Public License v3.0
I'm not satisfied with how depth is currently being calculated. A better approach would be samtools depth -a, then calculating windowed depth afterward in R using a convolution.
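The idea, roughly (a minimal sketch in Python/numpy rather than R; the window size and output format are assumptions, not settled choices):

```python
# Minimal sketch of the windowed-convolution idea, in Python/numpy rather
# than R. Assumes three-column `samtools depth -a` output: contig, pos, depth.
import sys
import numpy as np

WINDOW = 10_000  # hypothetical window size

depths = {}
for line in open(sys.argv[1]):
    contig, _, depth = line.rstrip("\n").split("\t")
    depths.setdefault(contig, []).append(int(depth))

kernel = np.ones(WINDOW) / WINDOW
for contig, d in depths.items():
    # 'same'-mode convolution with a uniform kernel = sliding-window mean depth
    windowed = np.convolve(np.asarray(d, dtype=float), kernel, mode="same")
    # emit one value per window for a compact summary
    for pos in range(0, len(windowed), WINDOW):
        print(contig, pos + 1, f"{windowed[pos]:.2f}", sep="\t")
```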
Another module that creates a template config.yaml file to run a Snakemake executor plugin.
usage:
harpy hpc <scheduler>
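For a SLURM scheduler, the generated template might look something like this (a hypothetical sketch of a Snakemake 8 profile for the slurm executor plugin; the keys and values shown are illustrative, not Harpy's actual output):

```yaml
# hypothetical template written by `harpy hpc slurm`
executor: slurm
jobs: 50                    # maximum concurrent scheduler jobs
default-resources:
  slurm_partition: general  # adjust to your cluster's partition
  mem_mb: 2000
  runtime: 60               # minutes
```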
The most recent commit fixes issues where there are single-allele or triallelic (or more) sites, but those sites get lost when merging the unused sites back to make the final VCF. These ignored sites should be isolated too and also merged back into the final VCF.
Add a deconvolute module, and deconvolution for harpy qc, using QuickDeconvolution: https://github.com/RolandFaure/QuickDeconvolution?tab=readme-ov-file
Since ema on bioconda has been updated to 0.7.0, remove the bundled ema-h, likewise remove that part from the build script, and add ema as a dep in the meta.yaml file necessary for conda-build. Also, change references to ema-h to ema in the snakemake workflow.
Add a module called something like harpy run INPUT that lets you manually trigger the snakemake workflow in a set-up Harpy workflow folder without Harpy redoing and overwriting things.
When using -x (--extra-params) for the QC module, the workflow results in an error about config files not existing.
Version: 0.8.0 and under
╭─ Harpy qc ──────────────────────────────────────────────────────────────╮
│ Samples: 2 │
│ Output Directory: QC/ │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Traceback (most recent call last):
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/cli.py", line 1940, in args_to_api
config_settings=ConfigSettings(
^^^^^^^^^^^^^^^
File "<string>", line 6, in __init__
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 304, in __post_init__
self.overwrite_config = self._get_overwrite_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 312, in _get_overwrite_config
update_config(overwrite_config, load_configfile(f))
^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 38, in load_configfile
config = _load_configfile(configpath)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 12, in _load_configfile
obj = open(configpath_or_obj, encoding="utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'extra: --trim_poly_g\n'
### Before submitting
- [X] I have read the [relevant documentation](https://pdimens.github.io/harpy/).
- [X] I am using the latest release of Harpy.
Hello,
Thanks for this fantastic pipeline and the detailed notes that come with it, very useful!
I went through the quality trimming and read mapping modules; the snakemake rules are executed correctly, but a problem happens when running ema-h.
I checked the ema-h commands by hand one by one outside of the Harpy pipeline. I found no problem with the ema-h count -p command, which outputs a ~50 MB ema-ncnt file. The ema-count summary generated by Harpy also looks good, e.g.:
"Sample","ReadsIgnored","BarcodeValid","BarcodeTotal","PercentValid","PercentIgnored"
"Sample_1","302610","22730346","22730346","100","1.33"
But then the ema-h preproc -p command generates a list of empty ema-bin-xxx output files.
Here is the detailed command line I ran outside of Harpy:
seqfu interleave -1 /my_path_to_/Trimming/Sample_1.R1.fq.gz -2 /my_path_to_/Trimming/Sample_1.R2.fq.gz | ema-h preproc -p -n 500 -t 2 -o ./ /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt
And the output:
:: Bucketing 1 inputs into 500 files with 2 threads
:: Loading known counts file /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt ...
:: Loading known counts ... done in 4.8 s
:: File assignment ... done in 2.0 s
:: [==================================================] 0 out of 22,730,346 (100.00%)
Writing barcodes ... done in 49.5 s
It looks like although the data contain 22,730,346 valid barcodes, none of them have been processed/written by ema-h preproc -p.
Do you have any idea what might be going on here?
Thanks!
Pierre-Alexandre
Thank you for this great resource. I realise it is still under development, so I'm unsure whether to expect modules to run smoothly.
After some initial teething issues with sample names in my own vcf file created with bcftools mpileup outside harpy, I ran into a cryptic conda issue. So I went a step back and used the harpy snp mpileup module to obtain a bcf, which finished with an error (I think it was just unable to compile the bcf html report), but the files looked ok within SNP/mpileup/. So then I tried imputation with harpy impute, using this file as input.
my command is:
harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute
my stitch param file is:
model usebx bxlimit k s ngen
diploid TRUE 50000 30 1 500
diploid FALSE 50000 30 1 500
diploid TRUE 50000 40 1 500
diploid FALSE 50000 40 1 500
This produces the same cryptic conda error as before, which I paste below.
I also ran harpy preflight bam bams; the whole snakemake workflow builds a DAG correctly, seems to create/activate conda environments fine, and runs correctly (after I manually ran install.packages("flexdashboard"), which was the initial error). So the preflight checks with the bam files were ok.
Version: 0.9.1
(harpy) [mgm49@login-n-1 harpy]$ harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute
(harpy) [mgm49@login-n-1 harpy]$ less log.impute
Traceback (most recent call last):
File "/home/mgm49/miniconda3/envs/harpy/bin/harpy", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
cli()
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
shutil.copy2(result, destination)
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 475, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 260, in copyfile
with open(src, 'rb') as fsrc:
^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'
Hi Pavel,
I tested the new conda version of Harpy installed with mamba, and it works great!
I just had to install the R package "circlize" myself in my Harpy environment because apparently it didn't come with the conda installation.
I'm now working with the BWA mapping pipeline instead of EMA following your suggestion. The alignment works well, as do the various sumstats output by Harpy align, with the exception of the reportGencov.Rmd script, which I had to modify in two places to match the format of the sample.cov.gz file produced with the BWA option (I replaced 'Barcodes' with 'alignments' and modified the circos.genomicTrack() function).
I was wondering if you were planning to integrate into the BWA pipeline the molecule lengths inferred from BX tags (possibly also binned as a histogram), as you did with the EMA option. We are currently looking into how to extract this information directly from the sorted bam; I'll let you know if we find a solution on our side.
Best
Pierre-Alexandre
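In case it helps, here is a rough sketch of how those lengths could be pulled straight from a sorted, BX-tagged BAM with pysam (the gap cutoff and grouping rule are assumptions, not Harpy's method):

```python
# Minimal sketch (not Harpy's implementation): infer molecule lengths from BX
# tags in a position-sorted BAM. Assumes alignments sharing a BX tag on the
# same contig belong to one molecule unless separated by more than MAX_GAP.
import pysam

MAX_GAP = 50_000  # hypothetical distance cutoff between linked reads

def molecule_lengths(bam_path):
    start, last_end, lengths = {}, {}, []
    with pysam.AlignmentFile(bam_path) as bam:
        for aln in bam:
            if aln.is_unmapped or not aln.has_tag("BX"):
                continue
            key = (aln.reference_name, aln.get_tag("BX"))
            if key in last_end and aln.reference_start - last_end[key] > MAX_GAP:
                # gap too large: close the previous molecule, start a new one
                lengths.append(last_end[key] - start.pop(key))
            start.setdefault(key, aln.reference_start)
            last_end[key] = aln.reference_end
    lengths.extend(last_end[k] - start[k] for k in start)  # close remaining
    return lengths  # ready to bin into a histogram
```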
The ridgeplots don't scale well with sample number. Consider alternatives:
Version: 1.2
When using --genome, the genome isn't being created in Genome/, which is where it should be.
Version: 1.2.1
alphabetize workflow options
Move the {log} of variant calling into a single file with a basic Python run block, featuring:
interval1 message1
interval1 message2
interval2 message1
etc...
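Something like this hypothetical Snakemake rule could do it (the rule and file names are illustrative, not Harpy's actual workflow):

```
rule aggregate_call_logs:
    input:
        expand("logs/call.{interval}.log", interval=INTERVALS)  # INTERVALS assumed defined
    output:
        "logs/variant_calling.log"
    run:
        # concatenate per-interval logs, prefixing each line with its interval
        # so messages stay attributable in the single combined file
        with open(output[0], "w") as out:
            for logfile in input:
                interval = logfile.split(".")[-2]  # e.g. "interval1"
                for line in open(logfile):
                    out.write(f"{interval} {line}")
```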
Currently, harpy snp expects the input bam files to be indexed. Create a rule to index bam files so as to avoid hiccups. #70
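A minimal sketch of such a rule (names are illustrative):

```
# index any input bam on demand so downstream rules can require the .bai
rule index_bam:
    input:
        "{sample}.bam"
    output:
        "{sample}.bam.bai"
    shell:
        "samtools index {input}"
```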
STITCH is version-bound if kept in the r.yaml environment harpy uses, so create a separate environment just for it.
Provide an option like -o and --output-prefix to specify an output folder, otherwise defaulting to harpy.
You have fewer samples in the --vcf input than the provided alignments, so you use --vcf-samples to only consider the samples in the vcf, even if more alignments were provided as input because of wildcards or providing a full directory.
The file checks all pass, but the workflow still uses all the input BAMs and terminates with an error when trying to extract snps for a sample that doesn't exist in the vcf.
Version: 1.2.1
[Wed Jul 10 10:56:45 2024]
Error in rule extract_homozygous:
jobid: 2204
input: snp/qual.dp.miss100.maf.bcf
output: Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
shell:
bcftools view -s neg3_ShadHap3 -Ou snp/qual.dp.miss100.maf.bcf | bcftools view -i 'GT="hom"' > Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job extract_homozygous since they might be corrupted:
Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
With the release of Snakemake 8.x, this would be a good time to reevaluate how Harpy is distributed. Instead of having one environment with all the dependencies (~1 GB disk space), it may be a lot easier to strip the harpy conda package down to the essentials, containerize the rest, and let snakemake use that container (or multiple) to execute the workflows.
The essentials would likely just be:
It would then make sense to create a container for each analytical module so that the deps can be managed independently of the entire workflow, as sketched below.
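For illustration, a per-module container could be declared on the rules themselves (the image name, paths, and rule body here are assumptions, not Harpy's actual setup):

```
rule align_bwa:
    input:
        fq = "QC/{sample}.fq.gz",
        genome = "Genome/genome.fasta"
    output:
        "Align/bwa/{sample}.bam"
    container:
        "docker://pdimens/harpy-align:latest"  # assumed per-module image
    shell:
        "bwa mem {input.genome} {input.fq} | samtools sort -o {output} -"
```

Snakemake would then pull and run each module's image when invoked with container-based software deployment, keeping each module's deps independent of the base harpy environment.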
Have the harpy align modules create an aggregate report for all samples
Using samtools markdup will allow marking duplicates in a barcode-aware fashion. The process follows:
samtools collate -o namecollate.bam example.bam    # group reads by name so fixmate sees mate pairs together
samtools fixmate -m namecollate.bam fixmate.bam    # add the mate-score tags markdup needs
samtools sort -o positionsort.bam fixmate.bam      # markdup requires coordinate-sorted input
samtools markdup positionsort.bam markdup.bam      # mark duplicates; the --barcode-tag BX option makes this barcode-aware
In steps with an input vcf, cross-reference the sample names in the vcf with the sample names available in the input directory.
The goal is to catch sample name mismatches early, or to run without error and print some kind of log info saying "sample blah was in the directory but not listed in the vcf". See the sketch below.
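A minimal sketch of that check (illustrative, not Harpy's code; assumes samples are identified by BAM basename):

```python
# Illustrative check: compare VCF sample names to BAM basenames in a directory.
from pathlib import Path
import pysam

def check_samples(vcf_path, bam_dir):
    vcf_samples = set(pysam.VariantFile(vcf_path).header.samples)
    bam_samples = {p.stem for p in Path(bam_dir).glob("*.bam")}
    for sample in sorted(bam_samples - vcf_samples):
        # not fatal: alignments present but absent from the vcf
        print(f"sample {sample} was in the directory but not listed in the vcf")
    missing = sorted(vcf_samples - bam_samples)
    if missing:
        # fatal: the vcf expects samples that have no alignments to draw from
        raise SystemExit(f"samples in the vcf without alignments: {', '.join(missing)}")
```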
https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotagphase
Add the -D parameter to fastp to deduplicate the reads.
There should be an option like -r to skip the creation of the html reports, should a user want that.
It would be useful to have a bit more control over the adapter-trimming parts of harpy. At minimum, a flag that lets you disable adapter trimming.
For reads >100bp, minimap2 is comparable to bwa in accuracy and much, much faster. Add it as an aligner option.
With the completion of bioconda/bioconda-recipes#33333 (comment), pysam can be a direct dependency for the base harpy conda environment, meaning the python scripts that rely on it can now be installed globally via /bin
Improve the way molecular coverage is calculated in bxStats.py.
reverse the order so there is more consistent control of this workflow
Given that the new version of harpy now uses a more robust config.yaml file, it would be more valuable to replace --print-only with --config-only, where the harpy module does all the validations and sets up the output directory, but stops short of invoking snakemake.
Mimic the work done in #108 for the SV plots.
Investigate if there is merit to having Cue as an SV caller: https://github.com/PopicLab/cue?tab=readme-ov-file
For better hpc compliance and avoiding copying data (or symlinking), try a new approach where:
- input files are resolved to absolute paths with Path(I).resolve()
- the resolved paths are written to workflow/config.yaml under an inputs: section
- the workflow reads the config.yaml file and builds the samplename wildcards from it
A sketch of this approach follows below.
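A rough sketch of that flow (names are assumptions, not Harpy's actual code):

```python
# Illustrative sketch: resolve inputs to absolute paths, record them in the
# workflow config, and let the Snakefile build wildcards from the config.
from pathlib import Path
import yaml

def record_inputs(inputs, workflowdir):
    resolved = [str(Path(i).resolve()) for i in inputs]
    config_path = Path(workflowdir) / "config.yaml"
    config = {}
    if config_path.exists():
        config = yaml.safe_load(config_path.read_text()) or {}
    config["inputs"] = resolved
    config_path.write_text(yaml.dump(config))
    return resolved

# inside the Snakefile, wildcards then come from the config rather than symlinks:
# samplenames = [Path(i).stem for i in config["inputs"]]
```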
Minor issue, just thought I'd let you know. I note that at some point in the past the module harpy snp was called harpy variants, and in the wiki there are pages which still refer to harpy variants, which is a bit confusing, e.g. https://pdimens.github.io/harpy/modules/phase/
thanks
strobealign is much faster than BWA and minimap2 (which is itself faster than bwa) and as (or more) accurate, so it should be added as an aligner option.