pdimens / harpy
Process raw haplotagging data, from raw sequences to phased haplotypes, batteries included.
Home Page: https://pdimens.github.io/harpy
License: GNU General Public License v3.0
I'm not satisfied with how depth is currently being calculated. A better approach would be samtools depth -a, then calculating windowed depth afterward in R using a convolution.
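The idea, roughly (a minimal sketch in Python/numpy rather than R; the window size and output format are assumptions, not settled choices):

```python
# Minimal sketch of the windowed-convolution idea, in Python/numpy rather
# than R. Assumes three-column `samtools depth -a` output: contig, pos, depth.
import sys
import numpy as np

WINDOW = 10_000  # hypothetical window size

depths = {}
for line in open(sys.argv[1]):
    contig, _, depth = line.rstrip("\n").split("\t")
    depths.setdefault(contig, []).append(int(depth))

kernel = np.ones(WINDOW) / WINDOW
for contig, d in depths.items():
    # 'same'-mode convolution with a uniform kernel = sliding-window mean depth
    windowed = np.convolve(np.asarray(d, dtype=float), kernel, mode="same")
    # emit one value per window for a compact summary
    for pos in range(0, len(windowed), WINDOW):
        print(contig, pos + 1, f"{windowed[pos]:.2f}", sep="\t")
```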
Another module that creates a template config.yaml file to run a Snakemake executor plugin.
usage:
harpy hpc <scheduler>
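For a SLURM scheduler, the generated template might look something like this (a hypothetical sketch of a Snakemake 8 profile for the slurm executor plugin; the keys and values shown are illustrative, not Harpy's actual output):

```yaml
# hypothetical template written by `harpy hpc slurm`
executor: slurm
jobs: 50                    # maximum concurrent scheduler jobs
default-resources:
  slurm_partition: general  # adjust to your cluster's partition
  mem_mb: 2000
  runtime: 60               # minutes
```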
The most recent commit fixes issues where there are single-allele or triallelic (or more) sites, but those sites get lost when merging the unused sites back to make the final VCF. These ignored sites should be isolated too and also merged back into the final VCF.
Add a deconvolute module, and deconvolution for harpy qc, using QuickDeconvolution: https://github.com/RolandFaure/QuickDeconvolution?tab=readme-ov-file
Since ema on bioconda has been updated to 0.7.0, remove the bundled ema-h, likewise remove that part from the build script, and add ema as a dep in the meta.yaml file necessary for conda-build. Also, change references to ema-h to ema in the snakemake workflow.
Add a module called something like harpy run INPUT that lets you manually trigger the snakemake workflow in a set-up Harpy workflow folder without Harpy redoing and overwriting things.
When using -x (--extra-params) for the QC module, the workflow results in an error about config files not existing.
Version: 0.8.0 and under
╭─ Harpy qc ──────────────────────────────────────────────────────────────╮
│ Samples: 2 │
│ Output Directory: QC/ │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Traceback (most recent call last):
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/cli.py", line 1940, in args_to_api
config_settings=ConfigSettings(
^^^^^^^^^^^^^^^
File "<string>", line 6, in __init__
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 304, in __post_init__
self.overwrite_config = self._get_overwrite_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 312, in _get_overwrite_config
update_config(overwrite_config, load_configfile(f))
^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 38, in load_configfile
config = _load_configfile(configpath)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 12, in _load_configfile
obj = open(configpath_or_obj, encoding="utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'extra: --trim_poly_g\n'
### Before submitting
- [X] I have read the [relevant documentation](https://pdimens.github.io/harpy/).
- [X] I am using the latest release of Harpy.
Hello,
Thanks for this fantastic pipeline and the detailed notes that come with it, very useful!
I went through the quality trimming and read mapping modules; the snakemake rules are executed correctly, but a problem happens when running ema-h.
I checked the ema-h commands by hand one by one outside of the Harpy pipeline. I found no problem with the ema-h count -p command, which outputs a ~50 MB ema-ncnt file. The ema-count summary generated by Harpy also looks good, e.g.:
"Sample","ReadsIgnored","BarcodeValid","BarcodeTotal","PercentValid","PercentIgnored"
"Sample_1","302610","22730346","22730346","100","1.33"
But then the ema-h preproc -p command generates a list of empty ema-bin-xxx output files.
Here is the detailed command line I ran outside of Harpy:
seqfu interleave -1 /my_path_to_/Trimming/Sample_1.R1.fq.gz -2 /my_path_to_/Trimming/Sample_1.R2.fq.gz | ema-h preproc -p -n 500 -t 2 -o ./ /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt
And the output:
:: Bucketing 1 inputs into 500 files with 2 threads
:: Loading known counts file /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt ...
:: Loading known counts ... done in 4.8 s
:: File assignment ... done in 2.0 s
:: [==================================================] 0 out of 22,730,346 (100.00%)
Writing barcodes ... done in 49.5 s
It looks like although the data contain 22,730,346 valid barcodes, none of them have been processed/written by ema-h preproc -p.
Do you have any idea what might be going on here?
Thanks!
Pierre-Alexandre
Thank you for this great resource. I realise it is still under development, so I'm unsure whether to expect modules to run smoothly.
After some initial teething issues with sample names in my own vcf file created with bcftools mpileup outside harpy, I ran into a cryptic conda issue. So I went a step back and used the harpy snp mpileup module to obtain a bcf, which finished with an error (I think it was just unable to compile the bcf html report), but the files looked ok within SNP/mpileup/. So then I tried imputation with harpy impute, using this file as input.
my command is:
harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute
my stitch param file is:
model usebx bxlimit k s ngen
diploid TRUE 50000 30 1 500
diploid FALSE 50000 30 1 500
diploid TRUE 50000 40 1 500
diploid FALSE 50000 40 1 500
This produces the same cryptic conda error as before, which I paste below.
I also ran harpy preflight bam bams; the whole snakemake workflow builds a DAG correctly, seems to create/activate conda environments fine, and runs correctly (after I manually ran install.packages("flexdashboard"), which was the initial error). So the preflight checks with the bam files were ok.
Version: 0.9.1
(harpy) [mgm49@login-n-1 harpy]$ harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute
(harpy) [mgm49@login-n-1 harpy]$ less log.impute
Traceback (most recent call last):
File "/home/mgm49/miniconda3/envs/harpy/bin/harpy", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
cli()
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
shutil.copy2(result, destination)
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 475, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 260, in copyfile
with open(src, 'rb') as fsrc:
^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'
Hi Pavel,
I tested the new conda version of Harpy installed with mamba, and it works great!
I just had to install the R package "circlize" myself in my Harpy environment because apparently it didn't come with the conda installation.
I'm now working with the BWA mapping pipeline instead of EMA following your suggestion. The alignment works well, as do the various sumstats output by Harpy align, with the exception of the reportGencov.Rmd script, which I had to modify in two places to match the format of the sample.cov.gz file produced with the BWA option (I replaced 'Barcodes' with 'alignments' and modified the circos.genomicTrack() function).
I was wondering if you were planning to integrate into the BWA pipeline the molecule lengths inferred from BX tags (possibly also binned as a histogram), as you did with the EMA option. We are currently looking into how to extract this information directly from the sorted bam; I'll let you know if we find a solution on our side.
Best
Pierre-Alexandre
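In case it helps, here is a rough sketch of how those lengths could be pulled straight from a sorted, BX-tagged BAM with pysam (the gap cutoff and grouping rule are assumptions, not Harpy's method):

```python
# Minimal sketch (not Harpy's implementation): infer molecule lengths from BX
# tags in a position-sorted BAM. Assumes alignments sharing a BX tag on the
# same contig belong to one molecule unless separated by more than MAX_GAP.
import pysam

MAX_GAP = 50_000  # hypothetical distance cutoff between linked reads

def molecule_lengths(bam_path):
    start, last_end, lengths = {}, {}, []
    with pysam.AlignmentFile(bam_path) as bam:
        for aln in bam:
            if aln.is_unmapped or not aln.has_tag("BX"):
                continue
            key = (aln.reference_name, aln.get_tag("BX"))
            if key in last_end and aln.reference_start - last_end[key] > MAX_GAP:
                # gap too large: close the previous molecule, start a new one
                lengths.append(last_end[key] - start.pop(key))
            start.setdefault(key, aln.reference_start)
            last_end[key] = aln.reference_end
    lengths.extend(last_end[k] - start[k] for k in start)  # close remaining
    return lengths  # ready to bin into a histogram
```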
The ridgeplots don't scale well with sample number. Consider alternatives:
Version: 1.2
When using --genome, the genome isn't being created in Genome/, which is where it should be.
Version: 1.2.1
alphabetize workflow options
Move the {log} of variant calling into a single file with a basic Python run block, featuring:
interval1 message1
interval1 message2
interval2 message1
etc...
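Something like this hypothetical Snakemake rule could do it (the rule and file names are illustrative, not Harpy's actual workflow):

```
rule aggregate_call_logs:
    input:
        expand("logs/call.{interval}.log", interval=INTERVALS)  # INTERVALS assumed defined
    output:
        "logs/variant_calling.log"
    run:
        # concatenate per-interval logs, prefixing each line with its interval
        # so messages stay attributable in the single combined file
        with open(output[0], "w") as out:
            for logfile in input:
                interval = logfile.split(".")[-2]  # e.g. "interval1"
                for line in open(logfile):
                    out.write(f"{interval} {line}")
```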
Currently, harpy snp expects the input bam files to be indexed. Create a rule to index bam files so as to avoid hiccups. #70
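A minimal sketch of such a rule (names are illustrative):

```
# index any input bam on demand so downstream rules can require the .bai
rule index_bam:
    input:
        "{sample}.bam"
    output:
        "{sample}.bam.bai"
    shell:
        "samtools index {input}"
```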
STITCH is version-bound if kept in the r.yaml environment harpy uses, so create a separate environment just for it.
Provide an option like -o and --output-prefix to specify an output folder, otherwise defaulting to harpy.
You have fewer samples in the --vcf input than the provided alignments, so you use --vcf-samples to only consider the samples in the vcf, even if more alignments were provided as input because of wildcards or providing a full directory.
The file checks all pass, but the workflow still uses all the input BAMs and terminates with an error when trying to extract snps for a sample that doesn't exist in the vcf.
Version: 1.2.1
[Wed Jul 10 10:56:45 2024]
Error in rule extract_homozygous:
jobid: 2204
input: snp/qual.dp.miss100.maf.bcf
output: Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
shell:
bcftools view -s neg3_ShadHap3 -Ou snp/qual.dp.miss100.maf.bcf | bcftools view -i 'GT="hom"' > Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job extract_homozygous since they might be corrupted:
Phase/workflow/input/vcf/neg3_ShadHap3.hom.vcf
With the release of Snakemake 8.x, this would be a good time to reevaluate how Harpy is distributed. Instead of having one environment with all the dependencies (~1 GB disk space), it may be a lot easier to strip the harpy conda package down to the essentials, containerize the rest, and let snakemake use that container (or multiple) to execute the workflows.
The essentials would likely just be:
It would then make sense to create a container for each analytical module so that the deps can be managed independently of the entire workflow, as sketched below.
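For illustration, a per-module container could be declared on the rules themselves (the image name, paths, and rule body here are assumptions, not Harpy's actual setup):

```
rule align_bwa:
    input:
        fq = "QC/{sample}.fq.gz",
        genome = "Genome/genome.fasta"
    output:
        "Align/bwa/{sample}.bam"
    container:
        "docker://pdimens/harpy-align:latest"  # assumed per-module image
    shell:
        "bwa mem {input.genome} {input.fq} | samtools sort -o {output} -"
```

Snakemake would then pull and run each module's image when invoked with container-based software deployment, keeping each module's deps independent of the base harpy environment.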
Have the harpy align modules create an aggregate report for all samples
Using samtools markdup will allow marking duplicates in a barcode-aware fashion. The process follows:
samtools collate -o namecollate.bam example.bam    # group reads by name so fixmate sees mate pairs together
samtools fixmate -m namecollate.bam fixmate.bam    # add the mate-score tags markdup needs
samtools sort -o positionsort.bam fixmate.bam      # markdup requires coordinate-sorted input
samtools markdup positionsort.bam markdup.bam      # mark duplicates; the --barcode-tag BX option makes this barcode-aware
In steps with an input vcf, cross-reference the sample names in the vcf with the sample names available in the input directory.
The goal is to catch sample name mismatches early, or to run without error and print some kind of log info saying "sample blah was in the directory but not listed in the vcf". See the sketch below.
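A minimal sketch of that check (illustrative, not Harpy's code; assumes samples are identified by BAM basename):

```python
# Illustrative check: compare VCF sample names to BAM basenames in a directory.
from pathlib import Path
import pysam

def check_samples(vcf_path, bam_dir):
    vcf_samples = set(pysam.VariantFile(vcf_path).header.samples)
    bam_samples = {p.stem for p in Path(bam_dir).glob("*.bam")}
    for sample in sorted(bam_samples - vcf_samples):
        # not fatal: alignments present but absent from the vcf
        print(f"sample {sample} was in the directory but not listed in the vcf")
    missing = sorted(vcf_samples - bam_samples)
    if missing:
        # fatal: the vcf expects samples that have no alignments to draw from
        raise SystemExit(f"samples in the vcf without alignments: {', '.join(missing)}")
```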
https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotagphase
Add the -D parameter to fastp to deduplicate the reads.
There should be an option like -r to skip the creation of the html reports, should a user want that.
It would be useful to have a bit more control over the adapter-trimming parts of harpy. At minimum, a flag that lets you disable adapter trimming.
For reads >100bp, minimap2 is comparable to bwa in accuracy and much, much faster. Add it as an aligner option.
With the completion of bioconda/bioconda-recipes#33333 (comment), pysam can be a direct dependency for the base harpy conda environment, meaning the python scripts that rely on it can now be installed globally via /bin
Improve the way molecular coverage is calculated in bxStats.py.
reverse the order so there is more consistent control of this workflow
Given that the new version of harpy now uses a more robust config.yaml file, it would be more valuable to replace --print-only with --config-only, where the harpy module does all the validations and sets up the output directory, but stops short of invoking snakemake.
Mimic the work done in #108 for the SV plots.
Investigate if there is merit to having Cue as an SV caller: https://github.com/PopicLab/cue?tab=readme-ov-file
For better hpc compliance and avoiding copying data (or symlinking), try a new approach where:
- input files are resolved to absolute paths with Path(I).resolve()
- the resolved paths are written to workflow/config.yaml under an inputs: section
- the workflow reads the config.yaml file and builds the samplename wildcards from it
A sketch of this approach follows below.
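A rough sketch of that flow (names are assumptions, not Harpy's actual code):

```python
# Illustrative sketch: resolve inputs to absolute paths, record them in the
# workflow config, and let the Snakefile build wildcards from the config.
from pathlib import Path
import yaml

def record_inputs(inputs, workflowdir):
    resolved = [str(Path(i).resolve()) for i in inputs]
    config_path = Path(workflowdir) / "config.yaml"
    config = {}
    if config_path.exists():
        config = yaml.safe_load(config_path.read_text()) or {}
    config["inputs"] = resolved
    config_path.write_text(yaml.dump(config))
    return resolved

# inside the Snakefile, wildcards then come from the config rather than symlinks:
# samplenames = [Path(i).stem for i in config["inputs"]]
```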
Minor issue, just thought I'd let you know. I note that at some point in the past the module harpy snp was called harpy variants, and in the wiki there are pages which still refer to harpy variants, which is a bit confusing, e.g. https://pdimens.github.io/harpy/modules/phase/
thanks
strobealign is much faster than BWA and minimap2 (which is itself faster than bwa) and as (or more) accurate, so it should be added as an aligner option.