
princess's Introduction

Hi there 👋, Medhat Mahmoud

Bioinformatics scientist

Computational biology Ph.D. and software engineer. Love programming in Java, Go, Python, and R. Thrive on challenges and live for breakthroughs (and coffee). Working on data analysis, genome assembly, and comparative analysis; also interested in evolution, mutation, and phylogenetics. Where others say “that’s impossible”, I say “when do I start?” (usually after a cup of coffee).

Skills:

  • 🔭 I’m currently working on Human genomic variations
  • 🌱 I’m currently learning Rust
  • 👯 I’m looking to collaborate on open-source bioinformatics projects

Links: GitHub · LinkedIn · Twitter · Stack Overflow · website


princess's People

Contributors

fritzsedlazeck, mehelmy


princess's Issues

Question on running princess on single machine

Hello,
I'm trying to run princess on a single node using a total of 24 cores. When I run the task using the command:

princess all -f reference.fa -j 24 -a ngmlr -e -u -r ccs -d ./ -g princess.log -s all.reads.fq.gz

The workflow runs fine; however, I noticed that NGMLR is actually using only 5 cores at a time, which makes it run inefficiently. Is there a way to specify the number of cores for each individual substage?
Otherwise, is it possible to break the workflow down into substages and run them one after another to make better use of the resources? For example, would something like the following use all 24 cores for the alignment, the variant calling, and the phasing?

princess align -j 24 [OPTIONS HERE]
princess variant -j 24 [OPTIONS HERE]
princess phase -j 24 [OPTIONS HERE]

Thank you in advance,
Andrea
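
A minimal sketch of the sequential approach described above, chaining the subcommands so that each stage has the whole node to itself. The subcommand names and options are copied from the question; whether each subcommand accepts exactly these options is the very thing being asked, so treat this as an illustration only:

    #!/usr/bin/env bash
    set -euo pipefail   # stop at the first stage that fails

    # Each stage runs alone on the node, so -j 24 is fully available to it.
    princess align   -j 24 -f reference.fa -a ngmlr -r ccs -d ./ -s all.reads.fq.gz
    princess variant -j 24 -f reference.fa -r ccs -d ./
    princess phase   -j 24 -f reference.fa -r ccs -d ./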

Issues running on cluster2

Dear Medhat,

Thank you very much for your constant help.
Princess worked fine before, but now the previously working pipeline has stopped working on our server.
I really appreciate your help with this.

This is what I have done so far:

(1) Re-ran the pipeline that had worked in May and got an error (in August)
(2) Re-installed the newest Princess and replaced the default .yml files with the ones for our environment
(3) Got the same error

This is what we used to get
This is what we now get: snakemake.log
This is what we now get: general.log
The script

Thank you very much!!!

Failed to read parameters in config file

Hello.

I created a virtual environment for princess and did the setup as described in the README file. Here is my command for princess: ./princess all -d /path/to/output/ -r ccs -e False -a ngmlr -s /path/to/sample.fastq.gz -f hg38.fa -c chr22.

I got this error:
Invalid config definition: Config entries have to be defined as name=value pairs.

None of the parameters specified was modified in the config.yaml inside /path/to/output/. I tried editing the config file manually but still got the same error. Did I miss any step?

Thank you!
James
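
As a side note on this error (an assumption about its origin, since Princess drives a Snakemake workflow underneath): the message is the one Snakemake raises when an entry passed via --config is not a name=value pair, so the culprit is more likely something forwarded to --config without an '=' than the contents of config.yaml itself. For example:

    # Valid: every --config entry has the form name=value
    snakemake --config sample=sample.fastq.gz read_type=ccs

    # Invalid: a bare token without '=' triggers
    # "Invalid config definition: Config entries have to be defined as name=value pairs."
    snakemake --config sample sample.fastq.gz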

Parental information for phasing

Hi,

The preprint mentions that parental information can be used by the tool, but it is not clear from the help documentation where this can be supplied or how to tell the tool which sample is which parent. Ideally I'd like to give it to the all subcommand; would this be possible?

Thank you!

An error

Hi there

I ran a trial test using the command line as: princess all --directory analysis --ReadType ccs --ref /home/bwu3/Used_V41_P13_107/p107.ens.fasta --jobs 7 --sampleFiles HiFi.fastq.gz --latency-wait 200 -p -c 2 --verbose

The error below popped up:
########
Error in rule mergeAlign:
jobid: 4
output: /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/data.bam
log: /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/merge.log (check log file(s) for error message)
conda-env: /mnt/chsrhome/bwu3/test_file/analysis/.snakemake/conda/a34f2739af6ede4165125922d923477d_
shell:

    samtools merge -@ 5 /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/data.bam /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/HiFi.fastq.gz.bam > /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/merge.log 2>&1

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
cluster_jobid: 3999

Error executing rule mergeAlign on cluster (jobid: 4, external: 3999, jobscript: /mnt/chsrhome/bwu3/test_file/analysis/.snakemake/tmp.a0pkihz4/snakejob.mergeAlign.4.sh). For error details see the cluster log and the log files of the involved rule(s).
########

Could you please give me some suggestions? Thanks
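
A hedged debugging sketch for this kind of mergeAlign failure (paths copied from the log above; the real cause has to come from merge.log): read the rule's log and re-run the merge by hand, ideally inside the conda environment Snakemake created, to surface the underlying samtools error.

    # 1. The rule redirects stderr to merge.log, so the samtools error is here:
    cat /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/merge.log

    # 2. Re-run the merge manually. Note that samtools merge refuses to
    #    overwrite an existing output unless -f is given, a common reason
    #    for a non-zero exit on a re-run.
    samtools merge -f -@ 5 \
        /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/data.bam \
        /mnt/chsrhome/bwu3/test_file/analysis/align/minimap/HiFi.fastq.gz.bam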

Issues running on cluster and locally

Issue for running locally:

Command:
./princess all -d /path/to/output/directory -r ont -e False -a minimap -s /path/to/fastq.gz -f /path/to/reference.fna -c chr15 -m -md /path/to/fast5/folder/

Error:
Invalid config definition: Config entries have to be defined as name=value pairs.

Issue for running on cluster:

Command:
./princess all -d /path/to/output/directory -r ont -a minimap -s /path/to/fastq.gz -f /path/to/reference.fna -c chr15 -m -md /path/to/fast5/folder/

Error:

Traceback (most recent call last):
  File "/path/to/working/directory/cluster/scheduler.py", line 79, in <module>
    raise Exception("Job can't be submitted\n"+output.decode("utf-8")+error.decode("utf-8"))
Exception: Job can't be submitted
Unable to run job: attribute "m_numa_nodes" is not a integer value.
Exiting.

command not found

Hi!

After princess was installed, "princess -h" doesn't work. It says "command not found".
Can anyone help?

Thanks! :)
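
A hedged sketch of the usual fix for a "command not found" right after installation, assuming the install left a princess script inside the cloned repository rather than on PATH (the path below is hypothetical):

    # Call the script explicitly from the cloned repository...
    cd /path/to/princess
    ./princess -h

    # ...or put the repository on PATH for the current shell session.
    export PATH="/path/to/princess:$PATH"
    princess -h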

Could not clone princess

Hi @MeHelmy ,

I get this error when trying to clone princess:

(base) [vakbari@gphost07 princess]$ git clone git@github.com:MeHelmy/princess.git
Cloning into 'princess'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.

How can I clone it?
Many thanks,
Vahid.
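
The "Permission denied (publickey)" message means GitHub is rejecting the SSH key, not that the repository is missing. Assuming the repository is public (the other installation reports on this page suggest it is), cloning over HTTPS sidesteps the SSH setup entirely:

    git clone https://github.com/MeHelmy/princess.git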

can your software be used on a local machine

We want to use your software to detect SVs and SNVs from an ONT FASTA file; however, we do not have a cluster. Can your software be used on a local machine? If it can, how should we revise the config.yaml file?

ModuleNotFoundError: No module named 'skbuild'

Hello @MeHelmy
When I try to install princess, the last step throws this error message: ModuleNotFoundError: No module named 'skbuild'.
I have tried the method referred to in https://github.com/MeHelmy/princess/issues/2, but it failed too.

Configuration:
Ubuntu 5.4.0-6ubuntu1~16.04.5
python3.7.3

Can I change "pypy3 -m ensurepip; pypy3 -m pip install --no-cache-dir intervaltree blosc" to "python -m ensurepip; python -m pip install --no-cache-dir intervaltree blosc" in install.sh?

best,
XFY

ModuleNotFoundError: No module named 'skbuild'

Hello @MeHelmy
When I try to install princess, the last step throws this error message.

Can you tell me the hardware and software environment you use for this software?
Or you could make a Docker image of this software and publish it to Docker Hub; then anyone could use your software easily.

    Using cached https://files.pythonhosted.org/packages/00/83/b4a77d044e78ad1a45610eb88f745be2fd2c6d658f9798a15e384b7d57c9/wheel-0.33.6-py2.py3-none-any.whl
  Collecting scikit-build
    Using cached https://files.pythonhosted.org/packages/8a/b5/c6ca60421991c22e69b9a950b0d046e06d714f79f7071946ab885c7115fb/scikit_build-0.10.0-py2.py3-none-any.whl
  Collecting cmake
    Using cached https://files.pythonhosted.org/packages/0a/f5/3212616a15b4112d7ad075a407f007eac2cac59292a9d973f2ee7e3c4068/cmake-3.13.2.post1.tar.gz
      Complete output from command python setup.py egg_info:
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/tmp/pip-install-jnys9rno/cmake/setup.py", line 7, in <module>
          from skbuild import setup
      ModuleNotFoundError: No module named 'skbuild'
  
      ----------------------------------------
  Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-jnys9rno/cmake/
  You are using pip version 18.1, however version 20.0.1 is available.
  You should consider upgrading via the 'pip install --upgrade pip' command.
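
A hedged workaround for both skbuild reports above (an assumption about the cause, based on the traceback: pip is building the cmake package from source, and cmake's setup.py imports skbuild before it has been installed). Upgrading pip so it can pick prebuilt wheels, and installing the build prerequisites first, usually avoids the failing source build:

    # Upgrade pip so prebuilt wheels are preferred over source builds,
    # then install the build prerequisites before the actual packages.
    python -m pip install --upgrade pip
    python -m pip install scikit-build cmake
    python -m pip install --no-cache-dir intervaltree blosc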

About Phasing in Princess

I was in contact with the developer, but he seems to be busy recently, so I am posting it here as well in the hope that other users may have an answer.

Q (1) Phasing Accuracy
I used IGV to check some SV loci. So far I could not find “clean” loci, i.e., loci where the phasing and genotype do not contradict between IGV and the VCF.
For example, the locus below is called heterozygous in all six individuals, but in IGV it looks like del/del in one individual and ref/ref in another.
How can we interpret this? Am I misinterpreting the VCF result?

How does Princess do the phasing? This is not a particularly odd locus, but in my brief manual inspection I could not find non-contradicting sites so far.

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  LLsal  Barry  tanner  Bond  Klopp  Brian
ssa01  139041651  0_24650  N  <DEL>  .  PASS  PRECISE;SVMETHOD=JASMINE;CHR2=ssa01;END=139043334;STD_quant_start=3.33542;STD_quant_stop=8.2991;Kurtosis_quant_start=1.31966;Kurtosis_quant_stop=2.30177;SVTYPE=DEL;RNAMES=4250ef10-dff4-4c5c-9c75-8ad085dcf7a9,43cd9070-cfef-48e3-af31-9ab39d2b93e7,49e571b9-1c82-405c-907e-f019f88f37de,8ba13242-89a8-46b0-8bbe-9ded59f71356,9674455c-0317-4798-9419-cb9e3c6db356,969e1d1e-d696-4d44-8c07-892fbfa14bd7,f6d1babd-83a1-4876-859b-0bff91cad0ce,f7a73e18-1dd5-4032-ac7e-55f61225e5c8;SUPTYPE=SR;SVLEN=-1683;STRANDS=+-;RE=8;REF_strand=1,1;AF=0.8;CONFLICT=0;OLDTYPE=DEL;IS_SPECIFIC=0;STARTVARIANCE=-4.000000;ENDVARIANCE=0.000000;AVG_LEN=-1683.000000;AVG_START=139041651.000000;AVG_END=139043334.000000;SUPP_VEC_EXT=111111;IDLIST_EXT=24650,24650,24650,24650,24650,24650;SUPP_EXT=6;SUPP_VEC=111111;SUPP=6;IDLIST=24650,24650,24650,24650,24650,24650;REFINEDALT=.  GT:IS:OT:OS:DV:DR  0|1:0:DEL:.:8:2  0|1:0:DEL:.:8:2  0|1:0:DEL:.:8:2  0|1:0:DEL:.:8:2  0|1:0:DEL:.:8:2  0|1:0:DEL:.:8:2

(IGV screenshot: ssa01_139041651_DEL)

Q (2) Phased Bam file?
The output BAM file is named “minimap.hap.bam”, so I assumed that phasing information was already incorporated, but it was not apparent in IGV.
The second picture shows a BAM file from another phasing tool.
Is it possible to tag maternal and paternal reads differently, as in this yellow/pink picture (or is it already done somehow)?

(IGV screenshot: phased_bam)

Thank you very much,
Marie
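
A hedged way to check whether minimap.hap.bam is actually haplotagged (an assumption: a WhatsHap-style haplotagging step writes an HP tag on each phased read; if Princess tags reads differently, the tag name below is wrong):

    # Count reads carrying a haplotype (HP) tag; zero means no tagging.
    samtools view minimap.hap.bam | grep -c 'HP:i:'

    # If the tag is present, IGV can reproduce the yellow/pink split:
    # right-click the track, "Group alignments by" -> "tag" and enter HP
    # (and/or "Color alignments by" -> "tag" -> HP).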

Can't run this line...

Hi! Hope you can help me. I can't run this line from the README:

princess all -d ./princess_all -r ont -s reads.split00.fastq.gz reads.split01.fastq.gz -f hs37d5_mainchr.fa

I also tried to run samtools unsuccessfully...

(screenshot: Screen Shot 2022-11-14 at 15 50 22)

Specifying read types per file input

Is there a way to specify the -r option per input file? My samples are a mix of PacBio and Nanopore reads. I was guessing that the SV and SNP genotyping would be more accurate if I called the whole population at once, rather than running separate Princess runs for Nanopore and PacBio.

Let me know if this is just not possible, or if merging the results post hoc would be just as good.

Thanks!
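
A hedged sketch of the post-hoc alternative, assuming one Princess run per technology followed by a cross-run SV merge (Jasmine is named only because the SV records shown in the phasing question above carry SVMETHOD=JASMINE; all file names and output locations below are hypothetical):

    # One run per technology, each with its own read type
    princess all -d ./out_ont    -r ont -s ont_reads.fastq.gz    -f ref.fa
    princess all -d ./out_pacbio -r ccs -s pacbio_reads.fastq.gz -f ref.fa

    # Merge the resulting SV call sets across runs with Jasmine
    ls out_ont/sv/*.vcf out_pacbio/sv/*.vcf > vcf_list.txt
    jasmine file_list=vcf_list.txt out_file=merged_sv.vcf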

align stage completed successfully but small align bam

Hello.

I ran the align command

bsub -q normal -e ../logs/align.e -o ../logs/align.o -n 32 "princess align -d ../output/17092020 --ReadType clr -u -e --Aligner minimap --samplesFiles list.txt --ref /gpfs/projects/ymokrab_Lab/REF/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/input/ref/GRCh37.fa --jobs 200 --log align.log"

and it ended successfully

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 200
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 indexBam
1 mergeAlign
1 minimap2
3
cat pictures/start.txt

[Thu Sep 17 16:45:25 2020]
Job 1: Running minimap2 , sample is: list.txt

        minimap2  -ax map-pb "/gpfs/projects/ymokrab_Lab/REF/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/input/ref/GRCh37.fa"  "/gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/list.txt" -H "--MD" -t "5" | samtools sort -@ 5 - > "/gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/list.txt.bam"

[Thu Sep 17 16:46:30 2020]
Finished job 1.
1 of 3 steps (33%) done

[Thu Sep 17 16:46:30 2020]
Job 2: Indexing /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/list.txt.bam

samtools index /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/list.txt.bam
[Thu Sep 17 16:46:30 2020]
Finished job 2.
2 of 3 steps (67%) done

[Thu Sep 17 16:46:30 2020]
Job 0: Mergeing data

    samtools merge -@ 5 /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/data.bam /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/list.txt.bam > /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/align/minimap/merge.log 2>&1

[Thu Sep 17 16:46:30 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/.snakemake/log/2020-09-17T164524.510866.snakemake.log
mkdir -p /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/snake_log && find . -maxdepth 1 -name 'snakejob.*' -type f -print0 | xargs -0r mv -t /gpfs/projects/ymokrab_Lab/SDR400076/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/output/17092020/snake_log && cat pictures/success.txt

but the output BAM seems very small. I checked the stderr and the log files but did not find anything wrong. Could you please advise?

-rw-rw---- 1 rmohamadrazali sysbio 1.5K Sep 17 16:46 list.txt.bam
-rw-rw---- 1 rmohamadrazali sysbio 0 Sep 17 16:46 list.txt.log
-rw-rw---- 1 rmohamadrazali sysbio 688 Sep 17 16:46 list.txt.bam.bai
-rw-rw---- 1 rmohamadrazali sysbio 0 Sep 17 16:46 merge.log
-rw-rw---- 1 rmohamadrazali sysbio 1.5K Sep 17 16:46 data.bam
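
A hedged observation and sketch: the minimap2 line in the log above shows the text file list.txt itself being passed as the read input, which would explain a 1.5K BAM. If list.txt is a list of FASTQ paths, the read files probably need to be given to --samplesFiles directly, as in the multi-file example elsewhere on this page (the sample file names below are hypothetical; whether --samplesFiles also accepts a list file is exactly what would need confirming):

    princess align -d ../output/17092020 --ReadType clr -u -e --Aligner minimap \
        --samplesFiles sample1.subreads.fastq.gz sample2.subreads.fastq.gz \
        --ref /gpfs/projects/ymokrab_Lab/REF/REF005_Long_read_structural_variants/princess/MNDY02/MNDY0201/input/ref/GRCh37.fa \
        --jobs 200 --log align.log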

Issues with running on Slurm

Hello

I'm trying to run Princess on a cluster managed by Slurm. I've followed the 4 steps indicated to change the configuration files and the minimap job has been submitted and is running.

However, despite the configuration file specifying 12 CPUs, it seems like the job only requested 3 CPUs on the cluster and the minimap command line only specifies 3 threads. Are there any more settings I need to change to increase the number of threads used and requested?

The job running on the cluster:
minimap2 -Y -R @RG\tSM:SAMPLE\tID:SAMPLE -ax map-ont /mnt/ScratchProjects/Causative/reference/bovine_ARS-UCD1.2/GCF_002263795.2_ARS-UCD1.3_genomic.fna.gz /mnt/ScratchProjects/Causative/bovine_11978/princess/filtlong_11978.fq.gz --MD -t 3 -y

The command to submit to the cluster:
sbatch --parsable --job-name=snakejob.minimap2 -n 3 --mem=20G --partition=smallmem --time=72:00:00 /net/fs-2/scale/OrionStore/ScratchProjects/Causative/bovine_11978/princess/.snakemake/tmp.tqfwec9x/snakejob.minimap2.3.sh
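
A hedged illustration of where the 3 likely comes from (an assumption: the submission template fills sbatch's -n from Snakemake's per-rule {threads}, so it is the rule's thread setting, not only the total job/CPU count, that has to be raised). A generic Snakemake invocation of this shape, not Princess's actual template:

    # -n is filled from each rule's declared threads; --jobs only caps
    # how many such jobs may be submitted at once.
    snakemake --jobs 12 \
        --cluster "sbatch --parsable --job-name=smk.{rule} -n {threads} --mem=20G --partition=smallmem --time=72:00:00"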
