nphase's People

Contributors

omaroakheart

nphase's Issues

Short reads unsuccessfully mapped to reference

When I use the Conda installation, I get this error:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
ERROR: Short reads unsuccessfully mapped to reference

this is my command:

conda activate polyploidPhasing
nphase pipeline --sampleName TME3 --reference /media/vol2/scratch/lcornet/cassava/NPHASE/tme-hybrid-complete.fasta --R1 /media/vol2/scratch/lcornet/cassava/Dataset/raw-illumina/tme3/separator-R1.paired.fastq --R2 /media/vol2/scratch/lcornet/cassava/Dataset/raw-illumina/tme3/separator-R2.paired.fastq --longReads /media/vol2/scratch/lcornet/cassava/Dataset/ONT/tme3/tme3_all_cells_more_than_8_kb.fastq --longReadPlatform ont --output /media/vol2/scratch/lcornet/cassava/NPHASE/nphase --threads 40

The same command works (at least through this first step) with a Singularity container and the same files. Is there something more to do with conda? Could it be linked to a RAM problem?

Thanks,

nPhase is too slow/memory intensive

Right now nPhase is fully deterministic and doesn't use any time-saving heuristics. It also makes no use of parallelization, and in general it isn't optimized. It would be good to work on these points to make it more appropriate for use on larger, more heterozygous genomes.
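As an illustration of the parallelization point: pairwise similarity between read SNP profiles is a naturally parallel computation. A minimal sketch with multiprocessing follows; the readSNPs structure and the Jaccard metric are illustrative assumptions, not nPhase's actual data structures or similarity measure.

```
# Illustrative only: compute pairwise similarities between read SNP profiles in
# parallel. The readSNPs dict and the Jaccard metric are placeholders.
from itertools import combinations
from multiprocessing import Pool

readSNPs = {
    "read_1": {"chr1:100=A", "chr1:250=T", "chr1:900=G"},
    "read_2": {"chr1:100=A", "chr1:250=T", "chr1:900=C"},
    "read_3": {"chr1:250=C", "chr1:900=C"},
}

def jaccard(pair):
    a, b = pair
    shared = len(readSNPs[a] & readSNPs[b])
    total = len(readSNPs[a] | readSNPs[b])
    return a, b, shared / total if total else 0.0

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for a, b, sim in pool.map(jaccard, combinations(readSNPs, 2)):
            print(f"{a} vs {b}: {sim:.2f}")
```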

Applying to contigs or chromosome-level scaffolds

Hi,

Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use nPhase for my genome projects.

FYI:
Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14 kb)
Assembled contigs: ~1,000 contigs (N50 = ~3 Mb)
Chromosome-level scaffolds: 20 chromosomes, scaffolded with Pore-C data (PromethION)
Trio data: not available

Given these circumstances, I would like to ask your professional opinion: at which stage would it be better to use nPhase, 1) the assembled contigs or 2) the chromosome-level scaffolds?

Thank you in advance!

Any advice on how to create a haplotype-resolved assembly from the phased FastQ reads

I now have phased nanopore reads from nPhase. Thank you very much for the tool and your help so far.

Now the task is to generate a first haplotype-resolved assembly. The phased reads are distributed into 191 clusters of different sizes. For a tetraploid organism with 16 chromosomes, that is approximately 3 clusters per expected haplotype, which looks like a good number to me. However, the clusters vary widely in read count (from 1 to several thousand reads) and total length (from 87 bp to 57 Mbp!). So my plan is to: 1) filter out any cluster < 10 kb and any singleton clusters, 2) sort the clusters by chromosome (as given in the .tsv), and 3) assemble each remaining cluster individually with Flye or NECAT, in a way similar to what is described in O'Donnell et al. 2023. But I am not sure whether I understand the methods correctly.

Thank you very much for your help.
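For step 1 of that plan, here is a minimal sketch of a size/singleton filter, assuming the phased reads sit in Phased/FastQ/ as one FASTQ file (possibly gzipped) per cluster; the directory layout and both cutoffs are assumptions, not fixed nPhase conventions.

```
# Report which phased clusters pass a read-count and total-length filter.
import glob
import gzip
import os

MIN_TOTAL_BP = 10_000   # assumed cutoff from the plan above
MIN_READS = 2           # drop singleton clusters

def open_maybe_gz(path):
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

for path in sorted(glob.glob("Phased/FastQ/*.fastq*")):
    reads, total_bp = 0, 0
    with open_maybe_gz(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:               # sequence line of each 4-line FASTQ record
                reads += 1
                total_bp += len(line.strip())
    keep = reads >= MIN_READS and total_bp >= MIN_TOTAL_BP
    print(f"{os.path.basename(path)}\t{reads} reads\t{total_bp} bp\t{'keep' if keep else 'skip'}")
```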

ZeroDivisionError in assignLongReadToSNPs when samLines < 100

Hi,

I've come across an edge case whilst messing around with the tool. When the number of reads in the long-read alignment file is less than 100, the following code in function assignLongReadToSNPs() will fail:

if i%(int(len(samLines)/100)) == 0:
    print(int(i/len(samLines)*100)+1,"%")

If len(samLines)/100 is less than 1, int() truncates it to 0, which results in a ZeroDivisionError in the if conditional.

Cheers
Alastair

nPhase version: nPhase pipeline 1.1.3
Downloaded through conda
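A minimal sketch of one possible guard for the snippet quoted above; report_progress is a hypothetical wrapper, not the actual nPhase function.

```
# The max(1, ...) guard keeps the step size non-zero when fewer than 100
# alignment lines are present, avoiding the ZeroDivisionError.
def report_progress(i, samLines):
    step = max(1, len(samLines) // 100)
    if i % step == 0:
        print(min(100, int(i / len(samLines) * 100) + 1), "%")

samLines = ["alignment"] * 5   # a tiny input that previously triggered the crash
for i in range(len(samLines)):
    report_progress(i, samLines)
```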

Running nPhase partial with mapped short reads

Hi!
I have an issue when trying to run nPhase partial (version v1.0.12) using both mapped long reads and mapped short reads.
Running:
nphase partial --sampleName $ID --reference $ref --output ~/nPhase/ --longReads $longReads --mappedLongReads $mappedLongReads.sam --mappedShortReads $mappedShortReads.bam

gave me the following error:
Traceback (most recent call last):
  File "/home/denkena/miniconda3/envs/polyploidPhasing/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/home/denkena/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 557, in main
    partialPipeline(args)
  File "/home/denkena/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 375, in partialPipeline
    SNPVCFFile=open(SNPVCFFilePath,"r")
FileNotFoundError: [Errno 2] No such file or directory: '/home/denkena/nPhase/matchedNormal_DNA/VariantCalls/shortReads/matchedNormal_DNA.SNPs.vcf'

It seems the pipeline is skipping the Variant Calling from the mapped short reads? Is there something I can do to get this to run?
Thanks!

Number of haplotigs

I have a question concerning the number of haplotigs.

I have a diploid genome of 173 contigs.
nPhase predicted 757 haplotigs.
After visualization, some contigs have a high ploidy (frequently 4-6).

Is this a result of the options/thresholds? Should I specify something else (I am currently running with the default options)?

errors in partial and full pipeline

Hi, this is a very interesting tool for us. I tried the following but encountered an error:
nphase partial --sampleName $samplename --reference $reference --output ./ --mappedShortReads $mappedShortReads \
--longReads $longreads --mappedLongReads $mappedONTread

Traceback (most recent call last):
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 513, in main
    partialPipeline(args)
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 333, in partialPipeline
    shortReadBam==args.mappedShortReads
UnboundLocalError: local variable 'shortReadBam' referenced before assignment

Any idea ?
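For reference, the traceback points at a line written as a comparison (==) rather than an assignment (=), so shortReadBam is never bound when a pre-mapped short-read file is supplied. A minimal sketch of the pattern follows; map_short_reads, the "noMapped" sentinel and the surrounding logic are assumptions, not code copied from nPhase.

```
# Hypothetical reconstruction of the branch, for illustration only.
from types import SimpleNamespace

def map_short_reads(args):
    return f"{args.output}/Mapped/shortReads/{args.sampleName}.final.bam"  # stand-in

def choose_short_read_bam(args):
    if args.mappedShortReads == "noMapped":
        return map_short_reads(args)
    # The traceback suggests the original line read
    # "shortReadBam==args.mappedShortReads" (a comparison that binds nothing);
    # an assignment is what this branch needs:
    shortReadBam = args.mappedShortReads
    return shortReadBam

args = SimpleNamespace(mappedShortReads="shortReads.final.bam", output=".", sampleName="sample")
print(choose_short_read_bam(args))
```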

I also tried the whole pipeline, but after the MarkDuplicates step the BAM file seems to be removed, and sorting the BAM file gives an error.
Here is the relevant part of the log file:
INFO 2020-09-14 15:25:16 MarkDuplicates Read 448,000,000 records. Elapsed time: 08:23:29s. Time for last 1,000,000: 1,100s. Last read position: StSOLv1.1ch01:34,366,000
INFO 2020-09-14 15:25:16 MarkDuplicates Tracking 55877209 as yet unmatched pairs. 55877209 records in RAM.
INFO 2020-09-14 15:47:00 MarkDuplicates Read 449,000,000 records. Elapsed time: 08:45:13s. Time for last 1,000,000: 1,303s. Last read position: StSOLv1.1ch01:34,414,428
INFO 2020-09-14 15:47:00 MarkDuplicates Tracking 55885188 as yet unmatched pairs. 55885188 records in RAM.
INFO 2020-09-14 16:18:50 MarkDuplicates Read 450,000,000 records. Elapsed time: 09:17:03s. Time for last 1,000,000: 1,909s. Last read position: StSOLv1.1ch01:34,477,480
INFO 2020-09-14 16:18:50 MarkDuplicates Tracking 55925108 as yet unmatched pairs. 55925108 records in RAM.
[Mon Sep 14 16:51:18 CEST 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 589.56 minutes.
Runtime.totalMemory()=29487529984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Array.java:75)
at java.util.Arrays.parallelSort(Arrays.java:1178)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:247)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:182)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:553)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar MarkDuplicates --REMOVE_DUPLICATES true -I ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.bam -O ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam -M /dev/null

STDOUT:

COMMAND: samtools sort ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam -o ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.final.bam

STDERR:

[E::hts_open_format] Failed to open file ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam
samtools sort: can't open "./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam": No such file or directory

nPhase on one chromosome

Hi,
I am working on tetraploid potato and trying to run nPhase on my dataset, but as you mention yourself it is very time-consuming, and I had to terminate the process after running it for 8 days. I was wondering whether there is any way to parallelize the process, for example by running it one chromosome at a time as you present in your paper, though I haven't figured out how to do so. Is there an option in the algorithm, or did you just subset the dataset before running it?
Best regards,
Elsa

error in: Identified heterozygous SNPs in short read VCF

I installed nPhase in a Singularity container (actually, I installed the conda environment inside the container).
I have an issue while running nphase and I am not sure whether it's a problem with the container or with nphase itself:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Traceback (most recent call last):
  File "/opt/miniconda/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 298, in main
    nPhasePipeline(args)
  File "/opt/miniconda/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 140, in nPhasePipeline
    line=[line[0],str(int(line[1])-1),line[1]]
ValueError: invalid literal for int() with base 10: '62166_obj'

Can you help me ?
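For what it's worth, the ValueError shows a field that should hold a numeric position actually holding part of a contig name (62166_obj). If positions are encoded as contigName_position strings, a plain split('_') would break on contig names that themselves contain underscores. A minimal sketch of a more tolerant parse, under that assumption:

```
# Assumes SNP identifiers of the form "<contigName>_<position>"; splitting from
# the right keeps contig names with underscores (e.g. "62166_obj") intact.
def split_snp_id(snp_id):
    contig, pos = snp_id.rsplit("_", 1)
    return contig, int(pos)

print(split_snp_id("62166_obj_1534"))   # ('62166_obj', 1534)
print(split_snp_id("chr01_250"))        # ('chr01', 250)
```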

Running nphase partial with a .bam file, or a .sam file that isn't sorted or still has a header

Currently nphase partial fails if you don't provide the exact expected format for mapped long reads. I'll close this issue once I've updated it to automatically detect whether the user submits a .bam or .sam file and process it accordingly.

If you run into any issues using nphase partial with a mapped long-read file, follow these steps:

samtools view -h -t reference.fa -@ threadNumber -F 260 longReads.bam -o longReads.sam
samtools sort longReads.sam -@ threadNumber -o longReads.sorted.sam
samtools view longReads.sorted.sam -@ threadNumber -o longReads.sorted.noHeader.sam

Then re-run your command and replace longReads.bam with longReads.sorted.noHeader.sam

nPhase has no option to set max heterozygosity level

Whenever someone wants to run nPhase on a very heterozygous, large genome and runs into memory issues or excessive runtimes, I recommend subsampling the VCF down to a 0.5~1% heterozygosity level, since nPhase will still perform well (and run significantly faster). This is probably a pain for everyone, and it would be good to include an option to set a maximum heterozygosity level so that the pipeline can do the subsampling on its own.
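Until such an option exists, here is a minimal sketch of the kind of subsampling it would perform, assuming a single-sample VCF of heterozygous SNPs; the file names, genome size and target level are placeholders.

```
# Randomly thin a heterozygous-SNP VCF down to a target density
# (0.5 % of the genome, i.e. on average one het SNP per 200 bp).
import random

genome_size = 12_000_000   # bp, placeholder (roughly yeast-sized)
target_het = 0.005         # keep at most genome_size * target_het SNPs
in_vcf, out_vcf = "sample.SNPs.vcf", "sample.SNPs.subsampled.vcf"

header, records = [], []
with open(in_vcf) as fh:
    for line in fh:
        (header if line.startswith("#") else records).append(line)

max_snps = int(genome_size * target_het)
if len(records) > max_snps:
    kept = set(random.sample(range(len(records)), max_snps))
    records = [rec for i, rec in enumerate(records) if i in kept]  # preserves order

with open(out_vcf, "w") as out:
    out.writelines(header + records)
```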

nphase takes a long time at the GATK step

I am now running nPhase on my real data and noticed it takes a very long time to finish. My data is yeast ONT reads with ~120X coverage and Illumina reads with ~100X coverage.

It seems nPhase is stuck in the GATK HaplotypeCaller step of the short reads, because this is the only process that is running (at only ~100% CPU):

michaeld  40243  40242 98 Jan02 ?        9-21:53:35 java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller 
-R ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna -ploidy 2 -I ./nphase_out/Kveik_sample_6/Mapped/shortReads/Kveik_sample_6.final.bam -O ./nphase_out/Kveik_sample_6/VariantCalls/shortReads/Kveik_sample_6.vcf

This is the full command line I am using to run it:

nohup nice -2 nphase pipeline --sampleName Kveik_sample_6 --longReads sample6_cleaned_trimmed.fastq.gz \
--longReadPlatform ont --R1 ../../Illumina_seq/Fastq/6-1_S8_R1_001.fastq.gz --R2 ../../Illumina_seq/Fastq/6-1_S8_R2_001.fastq.gz \
--reference ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna --output ./nphase_out --threads 120 > nphase_nohup.out &

The machine has 144 threads and 2 TB of RAM, so resources should not be the problem; I also remember that variant calling on the short reads didn't take that long when I ran GATK directly. Could one set some performance options for java/GATK to increase the speed?
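For reference, when running HaplotypeCaller outside of nPhase, one common way to use more cores is to scatter the calling by chromosome with -L and merge the per-chromosome VCFs afterwards. A rough sketch of that pattern follows; the paths, sample name and contig names are placeholders, and this is not something the nPhase pipeline currently does for you.

```
# Scatter-by-chromosome HaplotypeCaller runs launched concurrently, then merged.
import subprocess

ref = "GCF_000146045.2_R64_genomic.fna"
bam = "Kveik_sample_6.final.bam"
chromosomes = ["NC_001133.9", "NC_001134.8"]   # placeholder contig names

procs, vcfs = [], []
for chrom in chromosomes:
    out = f"Kveik_sample_6.{chrom}.vcf"
    procs.append(subprocess.Popen(["gatk", "HaplotypeCaller", "-R", ref, "-ploidy", "2",
                                   "-I", bam, "-L", chrom, "-O", out]))
    vcfs.append(out)
for p in procs:
    if p.wait() != 0:
        raise RuntimeError("HaplotypeCaller failed")

# Merge the per-chromosome calls back into a single VCF.
merge_cmd = ["gatk", "MergeVcfs", "-O", "Kveik_sample_6.vcf"]
for vcf in vcfs:
    merge_cmd += ["-I", vcf]
subprocess.run(merge_cmd, check=True)
```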

Nphase partial error

Hi,

I was hoping to use your tool to phase a draft assembly I have, using a VCF file generated by medaka. I have retried the tool three times and I keep getting the error below. Could you please advise on how I can run the tool?

Traceback (most recent call last):
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 560, in main
    partialPipeline(args)
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 449, in partialPipeline
    phaseTool.nPhase(validatedSNPAssignmentsFile,args.strainName,contextDepthsFile,phasedPath,basePath,args.reference,args.minSim,args.minOvl,args.$
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhase.py", line 499, in nPhase
    visDataTextFull=nPhaseFunctions.giveMeFullData(cleanAllClusters)
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 766, in giveMeFullData
    previouslySeen={sortedClusterLines[0][1]:i}
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/sortedcontainers/sortedlist.py", line 891, in __getitem__
    raise IndexError('list index out of range')
IndexError: list index out of range

The output file reads as follows:

Tue Oct 26 03:23:32 EDT 2021
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
1 %
2 %
...
100 %
Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.

The code I used is as follows: 
samtools view -h -t Racon_purge.fa -@ 16 -F 260 Racon_aligned.bam -o /core/Racon_aligned.sam
samtools sort /core/Racon_aligned.sam -@ 16 -o /core/Racon_aligned_sorted.sam
samtools view /core/Racon_aligned_sorted.sam -@ 16 -o /core/Racon_aligned_sorted.noHeader.sam

conda activate polyploidPhasing

nphase partial --sampleName Racon_phase--reference Racon_purge.fasta --output /core/Phased --longReads /core/PromethionFlonge_Porechopped.fasta_no_contaminants.fa --vcf medaka_variant/round_1_phased.vcf --mappedLongReads /core/Racon_aligned_sorted.noHeader.sam --threads 16

Thank you.

No plots and phased FastQ generated, AttributeError: 'DataFrame' object has no attribute 'append'

I am trying to run nPhase on long-read and short-read data from a tetraploid yeast strain. However, after running nphase pipeline or nphase algorithm, no plots are generated in Phased/Plots and no files appear in Phased/FastQ, although the .tsv files are generated.

Command:

          nphase algorithm --sampleName Kveik_sample_6 --longReads   sample6_cleaned_trimmed.fastq.gz --contextDepth  \
          nphase_out/Kveik_sample_6/Overlaps/Kveik_sample_6.contextDepths.tsv --processedLongReads \
          nphase_out/Kveik_sample_6/VariantCalls/longReads/Kveik_sample_6.hetPositions.SNPxLongReads.validated.tsv  \
          --output ./nphase_out --threads 20 --reference ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna

There is no error message in the log file, but an error is printed on STDERR. Here is the output of the run:

Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (163818 sequences)
334 clusters
Initializing cachedCluster
Filling cachedCluster with similarity information
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:727: PlotnineWarning: Saving 18 x 10 in image.
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:730: PlotnineWarning: Filename: ./nphase_out/Kveik_sample_6/Phased/Plots/Kveik_sample_6_0.1_0.01_0.05_0_phasedVis.svg
Preparing initial similarity index
Starting clustering loop (168228 sequences)
187 clusters

Phased files can be found at ./nphase_out/Kveik_sample_6/Phased
The *_variants.tsv file contains information on the consensus heterozygous variants present in each predicted haplotig.
The *_clusterReadNames.tsv file contains information on the reads which comprise each cluster.
Traceback (most recent call last):
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 589, in main
    nPhaseAlgorithm(args)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 256, in nPhaseAlgorithm
    nPhaseFunctions.generatePhasingVis(simpleOutPath,datavisPath)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 594, in generatePhasingVis
    ggsave(g,filename=outputSVG,width=18,height=10)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 761, in ggsave
    return plot.save(*arg, **kwargs)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 750, in save
    raise err
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 747, in save
    _save()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 734, in _save
    fig = figure[0] = self.draw()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 181, in draw
    return self._draw(return_ggplot)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 188, in _draw
    self._build()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 284, in _build
    layout.setup(layers, self)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/layout.py", line 64, in setup
    layer.data = self.facet.map(ldata, self.layout)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/facet_wrap.py", line 136, in map
    keys = join_keys(facet_vals, layout, self.vars)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/utils.py", line 370, in join_keys
    joint = x[by].append(y[by], ignore_index=True)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

nPhase was installed using conda on Ubuntu, following the instructions.

Python and pandas version:

Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'2.0.3'

Linux ubuntu-compute-2 5.-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
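For context, DataFrame.append was removed in pandas 2.0, and the installed plotnine build still calls it in join_keys, so upgrading plotnine to a release that supports pandas 2.x, or pinning pandas below 2.0, should avoid this. The pandas 2.x equivalent of the failing call looks like this (a sketch of the pattern only, not a patch for plotnine):

```
# pandas 2.x removed DataFrame.append; pd.concat replaces the pattern in the
# traceback (x[by].append(y[by], ignore_index=True)).
import pandas as pd

x = pd.DataFrame({"contig": ["chr01", "chr02"], "cluster": [1, 2]})
y = pd.DataFrame({"contig": ["chr03"], "cluster": [3]})
by = ["contig"]

# old (pandas < 2.0): joint = x[by].append(y[by], ignore_index=True)
joint = pd.concat([x[by], y[by]], ignore_index=True)
print(joint)
```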

NameError: name 'cachedSimilarities' is not defined

I keep getting this error, which I suspect terminates the clustering iteration at 374 clusters every time, no matter what parameters I change.

Even with this error, nPhase could still produce output. But 374 clusters is far too many, as I am expecting 3 clusters/haplotypes from this data.

Is this NameError normal? Is there any possible way to debug this on my end? Thank you in advance!

Traceback (most recent call last):
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhase.py", line 179, in cacheFiller
    cacheA = cachedSimilarities[cacheKey]
NameError: name 'cachedSimilarities' is not defined
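One thing that may be relevant: the traceback comes from a multiprocessing worker on macOS, where Python 3.8's default start method is "spawn", so module-level globals assigned at runtime in the parent (such as a cachedSimilarities cache) are not inherited by workers the way they are under "fork" on Linux. A minimal sketch of the difference; the names mirror the traceback, but the code is illustrative and not nPhase's.

```
# A global assigned at runtime in the parent is visible to "fork" workers but
# not to "spawn" workers (the macOS default since Python 3.8).
import multiprocessing as mp

def cacheFiller(cacheKey):
    # Mirrors the failing lookup: a global cache filled at runtime in the parent.
    return cachedSimilarities[cacheKey]

if __name__ == "__main__":
    cachedSimilarities = {"someKey": 0.95}   # assigned at runtime, not at import
    for method in ("fork", "spawn"):
        try:
            ctx = mp.get_context(method)
            with ctx.Pool(1) as pool:
                print(method, "->", pool.apply(cacheFiller, ("someKey",)))
        except Exception as exc:
            print(method, "->", type(exc).__name__, exc)
```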

Duration and Heterozygosity

Hi, I'm trying to run nPhase on a tetraploid yeast genome I sequenced with PacBio HiFi reads.

Here is my code:

nphase pipeline --sampleName trialRun1 --reference s288c.fa --longReads pacbioHiFireads.fastq.gz --longReadPlatform pacbio --R1 illuminaShortRead_1.fastq --R2 illuminaShortRead_2.fastq --output trialRun1Output --threads 8

The issue I am having is that it's taking a long time to run. My computer accidentally restarted during the run, so the progress from the past day or two is gone, and before I run it again I wanted to see if there is a flag I could add, or something I could do to the data, to make it go faster.

I saw your messages in issues #11 and #10 but was wondering whether anything has been done to the nPhase code to account for this. I also read up on your solution to #12, but I'm not sure how to go about doing that. I can run the command I have written above and I don't mind if it takes longer, but how much time, on average, would it take to run a tetraploid yeast genome through nPhase?

fastQ file creation failed

In v1.0.8 and prior versions, the FastQ file creation code can error out, saying that it can't find a given read.

No clusters found

I get an error when I use nPhase with PacBio data:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
1 %
2 %
...
100 %
Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
99 %
100 %
Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Traceback (most recent call last):
  File "/usr/local/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 562, in main
    nPhasePipeline(args)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 173, in nPhasePipeline
    phaseTool.nPhase(validatedSNPAssignmentsFile,args.strainName,contextDepthsFile,phasedPath,basePath,args.reference,args.minSim,args.minOvl,args.minLen,args.maxID)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhase.py", line 499, in nPhase
    visDataTextFull=nPhaseFunctions.giveMeFullData(cleanAllClusters)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 766, in giveMeFullData
    previouslySeen={sortedClusterLines[0][1]:i}
  File "/usr/local/lib/python3.8/site-packages/sortedcontainers/sortedlist.py", line 891, in __getitem__
    raise IndexError('list index out of range')
IndexError: list index out of range

nPhase was launched like this:
nphase pipeline --sampleName 52594 --reference /mnt/CANU-pilon.fasta --R1 /mnt/52594_R1-fastp.fastq --R2 /mnt/52594_R2-fastp.fastq --longReads /mnt/52594.fastq --longReadPlatform pacbio --output /mnt/outNPHASE --threads 20

Can you help me to solve this?
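"Starting clustering loop (0 sequences)" suggests that no long reads made it through with validated heterozygous SNPs, so there is nothing to cluster. A quick sanity check, assuming the intermediate file follows the path pattern visible in the other issues above and holds one entry per line:

```
# Count entries in the validated SNP-by-long-read table; a zero here would
# explain the "0 sequences" clustering input. The path pattern is inferred from
# other issues on this page and may differ between nPhase versions.
path = "/mnt/outNPHASE/52594/VariantCalls/longReads/52594.hetPositions.SNPxLongReads.validated.tsv"
with open(path) as fh:
    n_lines = sum(1 for line in fh if line.strip())
print(n_lines, "non-empty lines in", path)
```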

Output FQ file is empty

Hi,
I want to separate my FastQ file with nphase, but the output folder (OUTPUT_FOLDER/Phased/FastQ/) is empty.
Do you know the reason? Many thanks!

Process killed after variant calling on short reads

Hi Omar, I have tried to run nPhase on one sample of tetraploid potato. It takes a lot of time, but today it finally finished with variant calling on short reads. I have the output vcf file and also the vcf file with selected SNPs - but then the process stops. I tried to start the process again with nPhase partial, but it is killed again. I can't seem to figure out what the problem is, do you have an idea? I've attached the log file here.

Best, Elsa

fullLog.txt
