pdimens / harpy

Process raw haplotagging data, from raw sequences to phased haplotypes, batteries included.

Home Page: https://pdimens.github.io/harpy

License: GNU General Public License v3.0

Python 59.35% Shell 0.19% R 0.42% Perl 39.26% C++ 0.79%
bioinformatics haplotype linked-reads pipeline sequencing variant-calling

harpy's People

Contributors: pdimens

Forkers: gmkov
harpy's Issues

change alignment depth to use samtools depth

Description of feature

I'm not satisfied with how depth is currently being calculated. A better approach would be samtools depth -a, then computing windowed averages afterward in R using a convolution.
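The proposed windowed calculation can be sketched as a simple box convolution over per-base depths. This is an illustrative Python sketch (the issue proposes doing it in R, and the window size here is arbitrary), assuming the per-base depths have already been parsed from the tab-separated `samtools depth -a` output:

```python
# Sliding-window mean of per-base depths (a box convolution).
# Window size is an illustrative choice, not Harpy's actual setting.
def windowed_depth(depths, window=5):
    """Return the moving average of `depths` over a fixed-size window."""
    if window < 1 or window > len(depths):
        raise ValueError("window must be between 1 and len(depths)")
    # Running sum avoids recomputing the full window at each step.
    kernel_sum = sum(depths[:window])
    means = [kernel_sum / window]
    for i in range(window, len(depths)):
        kernel_sum += depths[i] - depths[i - window]
        means.append(kernel_sum / window)
    return means
```

With a window of 2, `windowed_depth([1, 2, 3, 4, 5], window=2)` yields `[1.5, 2.5, 3.5, 4.5]`.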

harpy hpc to create executor plugin templates

Description of feature

Another module that creates a template config.yaml file for running a snakemake executor plugin:

  • slurm
  • lsf
  • htcondor
  • googlebatch
  • generic

usage:

harpy hpc <scheduler>
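A minimal sketch of what such a template generator could look like. Everything here is hypothetical: the function name, the file naming, and the template contents are placeholders, not the actual options of the snakemake executor plugins:

```python
# Hypothetical sketch of `harpy hpc <scheduler>`: write a starter
# config.yaml for a snakemake executor plugin. Template bodies are
# illustrative placeholders, not real plugin option sets.
from pathlib import Path

TEMPLATES = {
    "slurm": "executor: slurm\ndefault-resources:\n  slurm_partition: <partition>\n",
    "lsf": "executor: lsf\ndefault-resources:\n  lsf_queue: <queue>\n",
    "htcondor": "executor: htcondor\n",
    "googlebatch": "executor: googlebatch\n",
    "generic": "executor: cluster-generic\ncluster-generic-submit-cmd: <command>\n",
}

def write_hpc_template(scheduler: str, outdir: str = ".") -> Path:
    """Write a config template for the given scheduler; return its path."""
    if scheduler not in TEMPLATES:
        raise ValueError(f"unknown scheduler: {scheduler}")
    out = Path(outdir) / f"config.{scheduler}.yaml"
    out.write_text(TEMPLATES[scheduler])
    return out
```

The user would then fill in the angle-bracket placeholders before launching a workflow.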

[phase] retain sites with less than or more than 2 alleles

Description of feature

The most recent commit fixes issues where there are single-allele or triallelic (or more) sites, but those sites get lost when merging the unused sites back to make the final vcf. These ignored sites should also be isolated and merged back into the final vcf.

remove ema-h

Since ema on bioconda has been updated to 0.7.0, remove the bundled ema-h, remove that part from the build script, and add ema as a dependency in the meta.yaml file used by conda-build.

Also, change references to ema-h to ema in the snakemake workflow.

add manual rerun

Description of feature

Add a module called something like harpy run INPUT that lets you manually trigger the snakemake workflow in an existing Harpy workflow folder without Harpy redoing and overwriting things.

-x for module `qc` creates error

Describe the bug

When using -x (--extra-params) for the QC module, the workflow results in an error about config files not existing.

Harpy Version

0.8.0 and under

File that triggers the error (if applicable)

No response

Harpy error log

╭─ Harpy qc ──────────────────────────────────────────────────────────────╮
│ Samples: 2                                                              │
│ Output Directory: QC/                                                   │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Traceback (most recent call last):

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/cli.py", line 1940, in args_to_api
    config_settings=ConfigSettings(
                    ^^^^^^^^^^^^^^^

  File "<string>", line 6, in __init__

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 304, in __post_init__
    self.overwrite_config = self._get_overwrite_config()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 312, in _get_overwrite_config
    update_config(overwrite_config, load_configfile(f))
                                    ^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 38, in load_configfile
    config = _load_configfile(configpath)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 12, in _load_configfile
    obj = open(configpath_or_obj, encoding="utf-8")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

FileNotFoundError: [Errno 2] No such file or directory: 'extra: --trim_poly_g\n'


Before submitting

- [X] I have read the [relevant documentation](https://pdimens.github.io/harpy/).
- [X] I am using the latest release of Harpy.

ema-h preproc -p generating empty ema-bin-xxx files

Hello,

Thanks for this fantastic pipeline and the detailed notes that come with it, very useful!
I went through the quality trimming and read mapping modules; the snakemake rules executed correctly, but a problem occurs when running ema-h.

I checked the ema-h commands by hand, one by one, outside the Harpy pipeline. I found no problem with the ema-h count -p command, which outputs a ~50 MB ema-ncnt file. The ema-count summary generated by Harpy also looks good, e.g.:

"Sample","ReadsIgnored","BarcodeValid","BarcodeTotal","PercentValid","PercentIgnored"
"Sample_1","302610","22730346","22730346","100","1.33"

But then the ema-h preproc -p command generates a list of empty ema-bin-xxx output files.

Here is the detailed command line I ran out of Harpy:
seqfu interleave -1 /my_path_to_/Trimming/Sample_1.R1.fq.gz -2 /my_path_to_/Trimming/Sample_1.R2.fq.gz | ema-h preproc -p -n 500 -t 2 -o ./ /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt

And the output:

:: Bucketing 1 inputs into 500 files with 2 threads
:: Loading known counts file /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt ...
:: Loading known counts ... done in 4.8 s
:: File assignment ... done in 2.0 s
:: [==================================================] 0 out of 22,730,346 (100.00%)
Writing barcodes ... done in 49.5 s

It looks like although the data contain 22,730,346 valid barcodes, none of them were processed/written by ema-h preproc -p.

Do you have any idea what might be going on here?
Thanks!
Pierre-Alexandre

harpy impute error IsADirectoryError:

Describe the bug

Thank you for this great resource. I realise it is still under development, so I'm unsure whether to expect modules to run smoothly.

After some initial teething issues with sample names in my own vcf file created with bcftools mpileup outside Harpy, I ran into a cryptic conda issue. So I went a step back and used the harpy snp mpileup module to obtain a bcf, which finished with an error (I think it was just unable to compile the bcf HTML report), but the files looked ok within SNP/mpileup/. I then tried imputation with harpy impute, using this file as input.

my command is:

harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

my stitch param file is:

model   usebx   bxlimit k       s       ngen
diploid TRUE    50000   30      1       500
diploid FALSE   50000   30      1       500
diploid TRUE    50000   40      1       500
diploid FALSE   50000   40      1       500

This produces the same cryptic conda error as before, which I paste below.

I also ran harpy preflight bam bams; snakemake built the DAG correctly, seemed to create/activate the conda environments fine, and ran correctly (after I manually ran install.packages("flexdashboard"), which was the initial error). So the preflight checks with the bam files were ok.

Harpy Version

0.9.1

File that triggers the error (if applicable)

No response

Harpy error log

(harpy) [mgm49@login-n-1 harpy]$ harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

(harpy) [mgm49@login-n-1 harpy]$ less log.impute 
Traceback (most recent call last):
  File "/home/mgm49/miniconda3/envs/harpy/bin/harpy", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
    cli()
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
    fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
    shutil.copy2(result, destination)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 475, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 260, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'

Before submitting

Molecule lengths inferred from BX tags with BWA

Hi Pavel,

I tested the new conda version of Harpy installed with mamba, and it works great!
I just had to install the R package "circlize" myself in my Harpy environment because apparently it didn't come with the conda installation.

I'm now working with the BWA mapping pipeline instead of EMA following your suggestion. The alignment works well, as do the various sumstats output by Harpy align, with the exception of the reportGencov.Rmd script, which I had to modify in two places to match the format of the sample.cov.gz file produced with the BWA option (I replaced 'Barcodes' with 'alignments' and modified the circos.genomicTrack() function).

I was wondering whether you were planning to integrate the molecule lengths inferred from BX tags (possibly binned as a histogram) into the BWA pipeline, as you did with the EMA option. We are currently looking into how to extract this information directly from the sorted bam; I'll let you know if we find a solution on our side.

Best
Pierre-Alexandre

[phase] ridgeplot replacement

Describe the bug

The ridgeplots don't scale well with sample number. Consider alternatives:

  • plotly where you can toggle the legend on/off
  • horizontal barplot (gradient) colored by peak height
  • others?

Harpy Version

1.2

File that triggers the error (if applicable)

No response

Harpy error log

No response

Before submitting

[phase] --genome error

Describe the bug

  • when using --genome the genome isn't being created in Genome/, which is where it should be

Harpy Version

1.2.1

File that triggers the error (if applicable)

No response

Harpy error log

No response

Before submitting

Add bam indexing to snp module

Description of feature

Currently, harpy snp expects the input bam files to be indexed. Create a rule to index bam files so as to avoid hiccups. #70
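Before adding such a rule, the workflow needs to know which inputs actually lack an index. A hedged sketch (function name hypothetical) that checks both common index naming conventions, `sample.bam.bai` and `sample.bai`:

```python
# Determine which input BAMs lack an index so a workflow rule can
# create them. Checks both `x.bam.bai` and `x.bai` conventions.
from pathlib import Path

def bams_needing_index(bam_paths):
    """Return the subset of BAM paths that have no index file on disk."""
    missing = []
    for bam in map(Path, bam_paths):
        sidecar = Path(str(bam) + ".bai")   # sample.bam.bai
        swapped = bam.with_suffix(".bai")   # sample.bai
        if not (sidecar.exists() or swapped.exists()):
            missing.append(bam)
    return missing
```

An indexing rule would then run samtools index only over the returned files instead of failing mid-workflow.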

output prefix option

Description of feature

Provide an option like -o/--output-prefix to specify an output folder, otherwise defaulting to harpy.

[phase] --vcf-samples not working correctly

Describe the bug

Expected behavior

You have fewer samples in the --vcf input than provided alignments, so you use --vcf-samples to only consider the samples in the vcf, even if more alignments were provided as input because of wildcards or providing a full directory.

Observed behavior

The file checks all pass, but the workflow still uses all the input BAMs and terminates with an error when trying to extract SNPs for a sample that doesn't exist in the vcf.

Harpy Version

1.2.1

File that triggers the error (if applicable)

No response

Harpy error log

[Wed Jul 10 10:56:45 2024]
Error in rule extract_homozygous:
    jobid: 2204
    input: snp/qual.dp.miss100.maf.bcf
    output: Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
    shell:
        
        bcftools view -s neg3_ShadHap3 -Ou snp/qual.dp.miss100.maf.bcf | bcftools view -i 'GT="hom"' > Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job extract_homozygous since they might be corrupted:
Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf

Before submitting

containerize

With the release of Snakemake 8.x, this would be a good time to reevaluate how Harpy is distributed. Instead of having one environment with all the dependencies (~1 GB of disk space), it may be a lot easier to strip the harpy conda package down to the essentials, containerize the rest, and let snakemake use that container (or multiple containers) to execute the workflows.

The essentials would likely just be:

  • snakemake
  • rich-click

It would then make sense to create a container for each analytical module so that the deps can be managed independently of the entire workflow.

mark duplicates with samtools

Using samtools markdup will allow marking duplicates in a barcode-aware fashion. The process follows:

samtools collate -o namecollate.bam example.bam

samtools fixmate -m namecollate.bam fixmate.bam

samtools sort -o positionsort.bam fixmate.bam

samtools markdup positionsort.bam markdup.bam

validate samplename

In steps with an input vcf, cross-reference the sample names in the vcf with the sample names available in the input directory.

The goal is to catch sample-name mismatches early, or to run without error and print some kind of log message saying "sample blah was in the directory but not listed in the vcf".
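The cross-check itself is just set arithmetic over the two name lists. A sketch (function and key names hypothetical) that reports both directions of mismatch so the caller can decide whether to error out or only log:

```python
# Compare VCF sample names against the sample names implied by files
# in the input directory, reporting members of each side that are
# missing from the other.
def crosscheck_samples(vcf_samples, dir_samples):
    """Return sorted lists of samples missing from each source."""
    vcf_set, dir_set = set(vcf_samples), set(dir_samples)
    return {
        "missing_in_dir": sorted(vcf_set - dir_set),   # hard error candidates
        "unlisted_in_vcf": sorted(dir_set - vcf_set),  # log-only candidates
    }
```

For example, `crosscheck_samples(["a", "b"], ["b", "c"])` returns `{"missing_in_dir": ["a"], "unlisted_in_vcf": ["c"]}`.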

add minimap2 aligner

Description of feature

For reads >100bp, minimap2 is comparable to bwa in accuracy and much, much faster. Add it as an aligner option.

molecule coverage

Description of feature

Improve the way molecular coverage is calculated in bxStats.py.

  1. just use inferred lengths
  2. make sure the numbers make sense
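Point 1 can be sketched as the span between the first and last aligned positions of a barcode's reads. This is an illustrative definition of "inferred length" only; the input shape (a mapping of BX tag to alignment positions) is an assumption, not bxStats.py's actual interface:

```python
# Illustrative "inferred molecule length": per barcode, the span
# between its first and last aligned read positions. Input shape
# (BX tag -> list of start positions) is an assumption.
def inferred_molecule_lengths(barcode_positions):
    """Return {BX tag: inferred length} for barcodes with >= 2 reads."""
    lengths = {}
    for bx, positions in barcode_positions.items():
        if len(positions) < 2:
            continue  # a single read gives no usable span
        lengths[bx] = max(positions) - min(positions)
    return lengths
```

A sanity check for point 2 could then assert that the resulting lengths fall within a plausible range for linked-read molecules before they feed into coverage calculations.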

[internal] replace --print-only with --config-only

Description of feature

Given that the new version of harpy now uses a more robust config.yaml file, it would be more valuable to replace --print-only with --config-only, where the harpy module does all the validations and sets up the output directory, but stops short of invoking snakemake

remove need for workflow/input

Description of feature

For better hpc compliance and avoiding copying data (or symlinking), try a new approach where:

  • all inputs have fully resolved paths via Path(I).resolve()
  • inputs are listed in workflow/config.yaml under inputs: section
  • the snakefile parses the new config.yaml file and builds the samplename wildcards from it
  • all the associated name-clash checks are kept
  • [new] check that the file is readable
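The first two bullets above can be sketched together: resolve each input path, verify readability, and emit an inputs: section for workflow/config.yaml. The YAML is written by hand here to stay dependency-free, and the key name is illustrative:

```python
# Sketch: fully resolve input paths, verify they are readable, and
# render them as an `inputs:` section for workflow/config.yaml.
import os
from pathlib import Path

def inputs_yaml(paths):
    """Return YAML text listing resolved, readable input paths."""
    lines = ["inputs:"]
    for p in paths:
        resolved = Path(p).resolve()
        if not os.access(resolved, os.R_OK):
            raise PermissionError(f"input is not readable: {resolved}")
        lines.append(f"  - {resolved}")
    return "\n".join(lines) + "\n"
```

The snakefile would then read this section back and derive the samplename wildcards from the listed paths rather than from a copied/symlinked workflow/input directory.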

Checklist

  • demultiplex
  • preflight
  • qc
  • align
    • bwa
    • ema
    • minimap
  • snp
    • mpileup
    • freebayes
  • sv
    • naibr
    • naibr populations
    • leviathan
    • leviathan populations
  • impute
  • phase
  • simulate
    • linkedreads
    • variants...
