zstephens / neat-genreads
NEAT read simulation tools
License: Other
Hello Zach,
I was generating simulated reads using the following command:
"python /u/sbundhoo/neat-genreads/genReads.py -r /.mounts/labs/PDE/data/reference/hg19/fasta/UCSC/chr21.fa -R 101 -o /scratch2/users/sbundhoo/02_02 --bam --vcf --gz --pe 300 30 -c 50 -to 0 -t /u/sbundhoo/bed_files/sorted_offtarget500b_chr21_all.bed"
However, when I specify -to to 0, I get an error as such:
"ZeroDivisionError: float division by zero"
Isn't the off-target coverage option supposed to work with 0 as the argument?
Thanks,
Shaheen
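One way to make -to 0 behave as "no off-target reads" rather than crashing would be a guard like the following sketch. The function and parameter names here (scale_reads, off_target_cov) are illustrative stand-ins, not NEAT's actual internals:

```python
# Hedged sketch: treat an off-target coverage of 0 as "emit no
# off-target reads" instead of dividing by it.  Names are illustrative,
# not NEAT's real code.
def scale_reads(n_reads, off_target_cov, on_target_cov):
    if off_target_cov == 0:
        return 0  # -to 0: skip off-target sampling entirely
    return int(n_reads * off_target_cov / float(on_target_cov))

print(scale_reads(1000, 0, 50))  # 0, instead of ZeroDivisionError
print(scale_reads(1000, 5, 50))  # 100
```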
Hi Zach,
I recently noticed that when a trinucleotide mutation (in this case, GAT->GCT) is not found in the sample, that key is not included in the dictionary. Does this sound like an easy fix? Do we need to draw keys from a hard-coded list of 192, or is there a more efficient way to solve this?
Thanks,
Matt
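Pre-populating all 192 keys up front is one straightforward fix for the missing-context problem. A minimal sketch (the 'GAT->GCT' key layout is an assumption for illustration, not necessarily the dict format genMutModel actually uses):

```python
# Hedged sketch: enumerate every trinucleotide mutation key so contexts
# absent from the sample still exist (with count 0).  64 trinucleotide
# contexts x 3 alternate middle bases = 192 keys.
import itertools

BASES = 'ACGT'

def init_trinuc_counts():
    counts = {}
    for tri in itertools.product(BASES, repeat=3):
        ref = ''.join(tri)
        for alt_mid in BASES:
            if alt_mid == ref[1]:
                continue  # the middle base must actually change
            alt = ref[0] + alt_mid + ref[2]
            counts['{}->{}'.format(ref, alt)] = 0
    return counts

counts = init_trinuc_counts()
print(len(counts))  # 192
```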
Hello,
I'm trying to perform read generation on sample data coming from https://github.com/slzhao/QC3#sExample. I am using two FASTQ files (3_1.example.fastq, 3_2.example.fastq) and a BAM file (3.example.bam).
I've used genSeqErrorModel.py (providing both FASTQ files via the -i and -i2 options) to generate a sequencing error model.
I am using hg38 reference file truncated to contain only chr1.
I am also using a simple .bed file to simulate targeted sequencing:
chr1 10021 914744 1.797693135e+308 0 0
I am trying to generate single read data, with different values of read length and coverage. The command looks as follows:
python /path/to/neat-genreads/genReads.py -r '/path/to/hg38/chr1.fa' -R 76 -c 4 -o '/path/to/out' -t '/path/to/target_region.bed' -to 0 -e '/path/to/sequencing_error_model.pickle' --bam --vcf --rng 0
For read length of 125, read generation executes successfully (for coverage of 4X, and for coverage of 10X).
For read length of 76, read generation fails (both for 4X and 10X coverage), with an exception:
Using default gc-bias model.
found index /path/to/hg38/chr1.fa.fai
reading chr1... 54.779 (sec)
--------------------------------
sampling reads...
1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% Traceback (most recent call last):
File "/path/to/neat-genreads/genReads.py", line 749, in <module>
main()
File "/path/to/neat-genreads/genReads.py", line 550, in main
sequences.update(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE)
File "/path/to/neat-genreads/py/SequenceContainer.py", line 233, in update
self.init_basicVars(xOffset, sequence, ploidy, windowOverlap, readLen)
File "/path/to/neat-genreads/py/SequenceContainer.py", line 67, in init_basicVars
self.blackList[p][-self.winBuffer] = 3
IndexError: index -76 is out of bounds for axis 0 with size 75
I just generated 3 datasets with 3 different random seeds, and they are all pairwise identical (diff reports no differences).
Hi,
I was trying to run genReads.py and provided it with a VCF to add in variants. I also specified that this is a monoploid genome. None of my variants were present in the output. I then tried adjusting my VCF to have two alleles, and then set the ploidy to diploid. In this case, the variants were present in the output. So there seems to be something wrong with the monoploid settings and how it pulls in variants from a VCF file.
Thanks,
Jessica
Hi Zachary,
Thank you for making this! Is it possible to make it clearer in the documentation how to run genReads.py with a VCF? What is the exact VCF format that is expected? I am trying to use a VCF from COSMIC but am getting errors, presumably due to missing GT/AF fields. I did modify the parseVCF function to set "INCLUDE_FAIL = True", but am still getting errors. Any suggestions on how to modify the COSMIC VCF for compatibility would be greatly appreciated!
Example from COSMIC VCF:
##fileformat=VCFv4.1
##source=COSMICv77
##reference=GRCh37
##fileDate=20160407
##comment="Missing nucleotide details indicate ambiguity during curation process"
##comment="URL stub for COSM ID field (use numeric portion of ID)='http://grch37-cancer.sanger.ac.uk/cosmic/mutation/overview?id='"
##comment="REF and ALT sequences are both forward strand
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=SNP,Number=0,Type=Flag,Description="classified as SNP">
#CHROM POS ID REF ALT QUAL FILTER INFO
7 55242448 COSM3734669 A G . . GENE=EGFR;STRAND=+;CDS=c.2218A>G;AA=p.I740V;CNT=1
7 55242449 COSM85892 T TTCCCGTCGCTATTAAAAT . . GENE=EGFR;STRAND=+;CDS=c.2219_2220ins18;AA=p.K745_E746insIPVAIK;CNT=1
7 55242449 COSM52934 T C . . GENE=EGFR;STRAND=+;CDS=c.2219T>C;AA=p.I740T;CNT=1
7 55242463 COSM18420 AAGG A . . GENE=EGFR;STRAND=+;CDS=c.2234_2236delAGG;AA=p.E746delE;CNT=2
7 55242463 COSM1190791 AAGGAATTAAGAGAAG A . . GENE=EGFR;STRAND=+;CDS=c.2234_2248del15;AA=p.K745_A750>T;CNT=4
7 55242464 COSM13549 AGGAATTAAGAGAAGCAA AAG . . GENE=EGFR;STRAND=+;CDS=c.2235_2251>AG;AA=p.E746_T751>A;CNT=1
7 55242464 COSM255152 A AAACTCCCGTCGCTATCAA . . GENE=EGFR;STRAND=+;CDS=c.2234_2235ins18;AA=p.K745_E746insTPVAIK;CNT=1
7 55242464 COSM13550 AGGAATTAAGAGAAG AAATTC . . GENE=EGFR;STRAND=+;CDS=c.2235_2248>AATTC;AA=p.E746_A750>IP;CNT=3
7 55242464 COSM24869 AGGAATTAAGAGAAGCAAC A . . GENE=EGFR;STRAND=+;CDS=c.2235_2252del18;AA=p.E746_T751delELREAT;CNT=1
7 55242464 COSM6223 AGGAATTAAGAGAAGC A . . GENE=EGFR;STRAND=+;CDS=c.2235_2249del15;AA=p.E746_A750delELREA;CNT=1039
7 55242464 COSM51504 A AAATTCCCGTCGCTATCAA . . GENE=EGFR;STRAND=+;CDS=c.2234_2235ins18;AA=p.K745_E746insIPVAIK;CNT=6
Regards,
Palak Sheth
It seems that genMutModel has an import statement for matplotlib that is never used:
import matplotlib.pyplot as mpl
Dear Zach,
I am trying to run computeGC.py, and I have generated a genomecov file from bedtools as follows:
bedtools genomecov -d -ibam id.sorted.bam -g ${ref} > id.genomecov
Then I passed id.genomecov to NEAT's computeGC.py:
python ${neat}utilities/computeGC.py -r ${ref} -i id.genomecov -w 50000 -o id_GC_models.p
I got this result:
line 103, in <module>
avgCov = allMean/float(runningTot)
ZeroDivisionError: float division by zero
Any suggestions on how I can fix this, or maybe change the sliding window? Any help is appreciated!
Thank you in advance.
Regards
Shatha
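For context, the crash means runningTot stayed 0, i.e. no window contributed to the average. A sketch of what a guard could look like (the names mirror the traceback, but the surrounding logic here is illustrative, not computeGC.py's actual code):

```python
# Hedged sketch: if no coverage window survives filtering, runningTot
# is 0 and the averaging step divides by zero.  A guard turns that into
# an actionable message instead of a traceback.
def avg_coverage(window_totals):
    all_mean = sum(window_totals)
    running_tot = len(window_totals)
    if running_tot == 0:
        raise SystemExit('No usable coverage windows found; check that '
                         'the genomecov chromosome names match the reference.')
    return all_mean / float(running_tot)

print(avg_coverage([20.0, 40.0]))  # 30.0
```

A common cause of the empty-window case is a chromosome-name mismatch (e.g. "chr1" vs "1") between the genomecov file and the reference, so that is worth checking first.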
When I try to simulate reads for the masked version of the A. thaliana genome, after some progress I experience the following error:
Error: weight or value vector given to DiscreteDistribution() are 0-length.
Data set:
ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_rm.toplevel.fa.gz
Parameters:
-R 101 --pe 300 30 -c 20 --vcf --bam
It does not happen with the unmasked version.
I created a toolshed package for this great tool. You can find it in the main toolshed under "neat-genreads".
https://toolshed.g2.bx.psu.edu/
It also has all the utilities (except the plotting tools) that you need to create the model files.
It uses the most recent release of neat-genReads.
ENJOY!
Hello Zach,
I was trying to generate a tumor/normal pair using the method suggested in the tutorial. I used the same script with the slightest changes (file names, directories, path names, etc.). However, I am running into the following error (copied from the error log file):
Traceback (most recent call last):
File "neat-genreads/genReads.py", line 150, in <module>
MUT_MODEL = parseInputMutationModel(MUT_MODEL,1)
File "/u/sbundhoo/neat-genreads/py/SequenceContainer.py", line 725, in parseInputMutationModel
outModel[8][i][j][l] /= float(sum(outModel[8][i][j]))
ZeroDivisionError: float division by zero
When I ran the script with the files you provided in the tutorial folder, it seems to work fine though. I am not sure what may be causing this error.
Thanks,
Shaheen
Hello,
I've performed read generation on sample data coming from https://github.com/slzhao/QC3#sExample.
I have used two FASTQ files (3_1.example.fastq, 3_2.example.fastq) and a BAM file (3.example.bam). I want to generate paired-end data.
I have used the following utils to provide additional models for the generation: computeFraglen.py and genSeqErrorModel.py (providing both FASTQ files using the -i and -i2 options). Then, I've run FastQC on both the input and output (generated) files.
For the first generated FASTQ file, results were quite similar to the original FASTQ file.
However, for the second generated FASTQ file, the quality values seem to be reversed (with regard to position in read); see the attached FastQC graphs below:
Is it a bug in neat-genreads? Or is this behavior expected? Am I missing some knowledge about how R2 reads are stored?
The below commands create empty VCFs. The tumor run only had a vcf given for the last 6 runs.
I imagine this is because the normal run (a) had no VCF given and (b) no mutation rate (-M 0
), while the tumor run (c) also had no mutation rate and (d) only included mutations present in the given VCF. It makes sense in this scenario that there is no need to create a VCF--is this correct / the expected behavior?
python /neat-genreads/genReads.py \
-R 126 -c 100 -M 0 --pe 300 30 \
-o /data/neat/vaf_15/normal \
-r /refs/ensembl/ensembl_build_75.fa \
-t /data/regions.bed \
--bam --vcf --job N 36
python /neat-genreads/genReads.py \
-R 126 --pe 300 30 -c 100 -M 0 \
-o /data/neat/vaf_15/tumor11 \
-t /data/regions.bed \
-r /refs/ensembl/ensembl_build_75.fa \
-v /refs/exac/exac.vcf \
--bam --vcf --job 1 36
Here's the directory output:
-rw-r--r-- 1 root root 821 Oct 13 12:24 tumor_golden.vcf.job01of36
-rw-r--r-- 1 root root 0 Oct 13 09:36 tumor_golden.vcf.job02of36
-rw-r--r-- 1 root root 0 Oct 13 09:30 tumor_golden.vcf.job03of36
-rw-r--r-- 1 root root 0 Oct 13 09:32 tumor_golden.vcf.job04of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job05of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job06of36
-rw-r--r-- 1 root root 0 Oct 13 09:32 tumor_golden.vcf.job07of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job08of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job09of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job10of36
-rw-r--r-- 1 root root 0 Oct 13 09:34 tumor_golden.vcf.job11of36
-rw-r--r-- 1 root root 0 Oct 13 09:34 tumor_golden.vcf.job12of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job13of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job14of36
-rw-r--r-- 1 root root 0 Oct 13 09:33 tumor_golden.vcf.job15of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job16of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job17of36
-rw-r--r-- 1 root root 0 Oct 13 09:33 tumor_golden.vcf.job18of36
-rw-r--r-- 1 root root 0 Oct 13 09:30 tumor_golden.vcf.job19of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job20of36
-rw-r--r-- 1 root root 0 Oct 13 09:36 tumor_golden.vcf.job21of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job22of36
-rw-r--r-- 1 root root 0 Oct 13 09:30 tumor_golden.vcf.job23of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job24of36
-rw-r--r-- 1 root root 0 Oct 13 09:34 tumor_golden.vcf.job25of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job26of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job27of36
-rw-r--r-- 1 root root 0 Oct 13 09:35 tumor_golden.vcf.job28of36
-rw-r--r-- 1 root root 0 Oct 13 09:29 tumor_golden.vcf.job29of36
-rw-r--r-- 1 root root 22125196019 Oct 13 15:06 tumor_merged_golden.bam
-rw-r--r-- 1 root root 3284 Oct 13 15:06 tumor_merged_golden.vcf
In genReads.py, around lines 764-766:
if myChr not in inputRegions:
mutRateRegions[myChr] = [-1]
mutRateValues[myChr] = [0.0]
I think inputRegions is supposed to be mutRateRegions.
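A runnable miniature of the fix this report suggests (the dicts here are stubs standing in for genReads.py's own variables):

```python
# Miniature of the suggested fix: test membership against
# mutRateRegions (the dict actually being populated), not inputRegions.
mutRateRegions = {}
mutRateValues = {}
for myChr in ['chr1', 'chr2']:
    if myChr not in mutRateRegions:  # was: inputRegions (likely a typo)
        mutRateRegions[myChr] = [-1]
        mutRateValues[myChr] = [0.0]
print(mutRateRegions)
```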
For some reason, when I generate reads using a VCF, most of the threads exit silently a few minutes after starting. This seems to happen before the threads are assigned parts of the genome. It is not always the same thread, nor is it always the same number of threads (8-10 still remain).
Command:
REGIONS="/data/regions.bed"
REFERENCE="/refs/ensembl/ensembl_build_75.fa"
VARIANTS="/scripts/cosmic_fix.vcf"
OUT_DIR=/data
VAF=15
TUMOR_PRCNT=$(( ${VAF} * 2 ))
G_COV=$(echo "scale=2; (100 - ${TUMOR_PRCNT}) * 4" | bc -l | xargs printf "%.0f")
S_COV=$(echo "scale=2; ${TUMOR_PRCNT} * 4" | bc -l | xargs printf "%.0f")
G_PREFIX=${OUT_DIR}/tumor_g
S_PREFIX=${OUT_DIR}/tumor_s
for i in $(seq 1 ${MAX_JOB})
do
set -x
python /neat-genreads/genReads.py \
-r "${REFERENCE}" \
-R 126 \
-o "${S_PREFIX}" \
--pe 300 30 \
-t "${REGIONS}" \
-c ${S_COV} \
-M 0 \
-v "${VARIANTS}" \
--vcf --bam \
--job $i ${MAX_JOB} 2>&1 | tee ${S_PREFIX}_${i}.log &
set +x
done
wait
for i in $(seq 1 ${MAX_JOB})
do
set -x
python /neat-genreads/genReads.py \
-r "${REFERENCE}" \
-R 126 \
-o ${G_PREFIX} \
--pe 300 30 \
-t "${REGIONS}" \
-c ${G_COV} \
-M 0 \
--vcf --bam \
--job $i ${MAX_JOB} 2>&1 | tee ${G_PREFIX}_${i}.log &
set +x
done
wait
Output (before fail)
Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (126), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /refs/ensembl/ensembl_build_75.fa.fai
--------------------------------
reading input VCF...
Edit: This may be due to memory requirements. parseVCF contains the unclosed open() mentioned in a different issue; however, this still occurs when that is fixed. The VCF file I'm using is 800 MB.
For the most part the code is not too difficult to follow, but readability could be improved by limiting line length to 80 columns and using better whitespace. I would recommend using a linter and following PEP 8 standards.
This is less an issue and more of a suggestion. Whether or not these recommendations are implemented is, of course, entirely up to you. Either way, thank you for taking the time to write this software!
For example, line 58 in probability.py:
# current
return str(self.weights)+' '+str(self.values)+' '+self.method
# better
return str(self.weights) + ' ' + str(self.values) + ' ' + self.method
# best
return '{} {} {}'.format(self.weights, self.values, self.method)
# another option (map to str first; weights and values are not strings)
return ' '.join(map(str, [self.weights, self.values, self.method]))
Similarly, lines 22-24 in probability.py:
if len(self.values) != len(self.weights):
print '\nError: length and weights and values vectors must be the same.\n'
exit(1)
could be more pythonic by using exceptions:
if len(self.values) != len(self.weights):
msg = 'Length and weight vectors must be the same length'
raise ValueError(msg)
(also, take a look at the logging package instead of using print).
Finally, tabs. I personally prefer using spaces (PEP 8 also recommends spaces), but the decision is ultimately yours; however, please try to be consistent. For example, lines 98-101 in SequenceContainer.py are indented with tabs up to the normal indentation level, but the arguments are aligned to the [ using spaces.
Hi,
Just a recommendation: don't you think it'd be a good idea to update the genReads.py and mergeJobs.py scripts so that they take into consideration those using Python 3? If you include parentheses in the print statements, the scripts would still work in Python 2.
Thank you.
At higher coverage parameters (e.g., -c 500), I am quite often getting the following:
Error: Something went wrong!
(14487, 'TCC', 'TTT') TTC
from this command
for i in $(seq 1 10)
do
( python /neat-genreads/genReads.py \
-r ${REFERENCE} \
-R 126 \
-o "${prefix}" \
--pe 300 30 \
-t "${REGIONS}" \
-c 500 \
-M 0 \
--bam --vcf \
--job $i 10 ) &
done
I am unsure what the exact error is, but since the script exits when this occurs, the resulting files are incomplete.
-rw-r--r-- 1 root root 151M Oct 12 16:55 normal_golden.bam.job01of10
-rw-r--r-- 1 root root 264M Oct 12 17:06 normal_golden.bam.job03of10
-rw-r--r-- 1 root root 45M Oct 12 16:43 normal_golden.bam.job04of10
-rw-r--r-- 1 root root 172M Oct 12 16:57 normal_golden.bam.job05of10
-rw-r--r-- 1 root root 164M Oct 12 16:56 normal_golden.bam.job06of10
-rw-r--r-- 1 root root 1.1M Oct 12 16:39 normal_golden.bam.job07of10
-rw-r--r-- 1 root root 85M Oct 12 16:48 normal_golden.bam.job08of10
-rw-r--r-- 1 root root 30M Oct 12 16:42 normal_golden.bam.job09of10
-rw-r--r-- 1 root root 137M Oct 12 16:53 normal_golden.bam.job10of10
-rw-r--r-- 1 root root 821 Oct 12 16:55 normal_golden.vcf.job01of10
-rw-r--r-- 1 root root 843K Oct 12 17:06 normal_golden.vcf.job03of10
-rw-r--r-- 1 root root 0 Oct 12 16:38 normal_golden.vcf.job04of10
-rw-r--r-- 1 root root 0 Oct 12 16:38 normal_golden.vcf.job05of10
-rw-r--r-- 1 root root 0 Oct 12 16:38 normal_golden.vcf.job06of10
-rw-r--r-- 1 root root 0 Oct 12 16:38 normal_golden.vcf.job07of10
-rw-r--r-- 1 root root 0 Oct 12 16:38 normal_golden.vcf.job08of10
-rw-r--r-- 1 root root 89K Oct 12 16:42 normal_golden.vcf.job09of10
-rw-r--r-- 1 root root 16K Oct 12 16:53 normal_golden.vcf.job10of10
-rw-r--r-- 1 root root 1.7G Oct 12 17:09 normal_merged_golden.bam
-rw-r--r-- 1 root root 947K Oct 12 17:09 normal_merged_golden.vcf
-rw-r--r-- 1 root root 2.0G Oct 12 17:07 normal_merged_read1.fq
-rw-r--r-- 1 root root 2.0G Oct 12 17:08 normal_merged_read2.fq
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read1.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read1.fq.job03of10
-rw-r--r-- 1 root root 53M Oct 12 16:43 normal_read1.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read1.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read1.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read1.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read1.fq.job08of10
-rw-r--r-- 1 root root 35M Oct 12 16:42 normal_read1.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read1.fq.job10of10
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read2.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read2.fq.job03of10
-rw-r--r-- 1 root root 53M Oct 12 16:43 normal_read2.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read2.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read2.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read2.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read2.fq.job08of10
-rw-r--r-- 1 root root 35M Oct 12 16:42 normal_read2.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read2.fq.job10of10
Hi
I downloaded the latest NEAT from this site and tried to simulate some reads containing variants from a VCF, but the output VCF contains no AF information. Here is the command I use:
python neat-genreads-master/genReads.py -r human_g1k_v37.fasta -t test.bed -v test.vcf --bam --vcf -R 50 -o test --pe 300 30 -M 0.1 -p 10
Hi zstephens,
I am new to simulation; I came across your paper in the PLOS ONE journal, so I decided to try your simulator to generate simulated normal/tumor data to evaluate CNV tools.
Regarding the --pe parameter, can I interpret its first value as the mean insert size? In our lab's real data, the expected insert size is 500 bp, so can I set this parameter to "--pe 500 30"?
Many thanks!
Hi folks,
First of all, many thanks for providing this simulator.
I am trying to generate targeted region reads and I get the following error:
reading chr1... 39.328 (sec)
--------------------------------
sampling reads...
Traceback (most recent call last):
File "../genReads.py", line 662, in <module>
main()
File "../genReads.py", line 508, in main
tCov = not(bisect.bisect(inputRegions[refIndex[RI][0]],rInd)%2) or not(bisect.bisect(inputRegions[refIndex[RI][0]],rInd+GC_WINDOW_SIZE)%2)
KeyError: 'chr1'
Would be grateful if you could please help me to resolve this.
Best,
MH
Hey there!
I ran the following command:
python genReads.py -r /shared/seq/hg38.fa -R 126 -o /home/eap/simulated_reads/phenyl2 -t /home/eap/docencia/biocomp/phenylk.bed
but I get reads from the whole genome! Any help? I attach the bed file.
phenylk.bed.txt
I keep getting this error, but I cannot tell if this is an issue or not. Looking at the code, the regions should be spread evenly across the threads, and if there are too many threads, it will lower the number of threads.
This job id has no regions to process, exiting...
I wonder--is this check working? (And does it happen before the reference is loaded in memory?)
Hello,
(the following error is present on commit e96b43f)
I am trying to generate only the VCF file using the HG19 reference (chr1 only), given a target file (attached to the issue).
The command is:
python $PATH_TO_NEAT/genReads.py -r $PATH_TO_HG19_CHR1_FASTA -R 10 -o $OUTPUT_PREFIX -t target.bed -to 0 --vcf --no-fastq --rng 0
I get the following error:
generating vcf...
1% 2%
Error: Attempting to insert variant out of window bounds:
(-984, 'C', 'T') --> blackList[0:1000]
The following bug is present in 5b8bc84.
I want to generate a VCF file only. I am using the following command:
python $NEAT2_PATH/genReads.py -r "$HG38_CHR1" -R 50 -c 1 -o "$WORKDIR/common-vcf/" -t "$TARGET_BED" -to 0 --no-fastq --vcf --rng 0
The reference sequence is chr1 from HG38.
Target BED content is as follows: chr1 10021 914744 1.797693135e+308 0 0
I get a "variable referenced before assignment" error:
Only producing VCF output, that should speed things up a bit...
Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (50), rescaling model...
Using default gc-bias model.
Bypassing FASTQ generation...
found index ...
reading chr1... 60.868 (sec)
--------------------------------
generating vcf...
Traceback (most recent call last):
File ".../neat-genreads2-wip/genReads.py", line 789, in <module>
main()
File ".../neat-genreads2-wip/genReads.py", line 616, in main
if sum(coverage_dat[2]) >= LOW_COV_THRESH:
UnboundLocalError: local variable 'coverage_dat' referenced before assignment
I've generally been setting the number of threads equal to the coverage, but if I set the coverage too high (200x), this results in MemoryErrors.
I need about 100x coverage (a single sample with -c 100 nets me about 30-40x). I've been generating two sets of samples at 100x and combining them, but I'm not sure what sorts of side effects this may have, plus it takes twice the time.
Do you have an estimate of the optimal coverage / thread settings?
In py/vcfFunc.py, line 88, for line in open(vcfPath,'r'): leaves an unclosed file handle. Also, there's a mixture of using the context manager (with open(...) as fh:) and just calling open (fh = open(...)). The latter is fine so long as the handle is closed even when exceptions occur, but the former is better (e.g., refFunc.py:83,160).
Also, OutputFileWriter should open files as needed (just append) rather than keeping them open; opening a file is a very low-cost operation. Another option would be to add a context manager to OutputFileWriter, e.g., with ofw.openVcf(...) as fh:, so the files remain open only as long as you intend.
Hi,
When I run the following line:
python genReads.py -r test1.fa -R 101 -o test3.fq --bam --vcf --pe 300 30
I receive this error:
UnboundLocalError: local variable 'havePrinted100' referenced before assignment
I already looked at other answers and can't seem to fix this.
Hi there! Any plans to make this into an installable package via pip and/or conda?
Hi,
Just wondering if there's any option to compare BAM files. I tried to write a script for that purpose. With the NEAT-generated FASTQ, there is a huge difference (in the absolute distance of mapped positions) between the golden BAM given by NEAT and the BAM from bwa. But with ART-simulated FASTQ as input, the BAM from bwa is very close to the true BAM.
Thanks a lot!
-Han
When using the --bam option I keep getting the following error with the resulting file:
[E::bam_read1] CIGAR and query sequence lengths differ for simulatedReads-FN677756.1-74597/1
on samtools 1.6
I'm pretty sure this is a recent thing, because it seemed to work before, though perhaps with another version of samtools.
I'm trying to simulate normal and tumor samples with high levels of coverage, but the coverage is much lower than the given value.
I generated samples with the below command, which resulted in two 6.2G FASTQ files, one 5.2G BAM file, and an empty VCF file.
for i in $(seq 1 100)
do
( python /neat-genreads/genReads.py \
-r /refs/ensembl/ensembl_build_75.fa \
-R 126 \
-o /data/neat/normal \
--pe 300 30 \
-t /data/regions.bed \
-c 100 \
-M 0 \
--bam --vcf \
--job $i 100 ) &
done
wait
python /neat-genreads/mergeJobs.py \
-i /data/neat/normal -o /data/neat/normal_merged -s /bin/samtools
Calling samtools bedcov with the region file showed that several thousand reads were aligning within each 2-7 Mbp region, and samtools mpileup showed that the BAM was averaging about 17 reads per bp (for a specified region).
# bedcov
...
1 XXX YYY 2885
1 XXX YYY 11359
1 XXX YYY 3256
...
# mpileup (within region with 11359 reads)
...
1 XXX N 16 NNNN NNNN
1 XXX N 16 NNNN NNNN
1 XXX N 18 NNNN NNNN
1 XXX N 18 NNNN NNNN
1 XXX N 17 NNNN NNNN
...
Similarly, I tried to generate a synthetic tumor file with the following:
for i in $(seq 1 80)
do
( python /neat-genreads/genReads.py \
-r /refs/ensembl/ensembl_build_75.fa \
-R 126 \
-o /data/neat/tumor \
--pe 300 30 \
-t /data/regions.bed \
-c 80 \
-M 0 \
--bam --vcf \
--job $i 100 ) &
done
for i in $(seq 81 100)
do
( python /neat-genreads/genReads.py \
-r /refs/ensembl/ensembl_build_75.fa \
-R 126 \
-o /data/neat/tumor \
--pe 300 30 \
-t /data/regions.bed \
-c 20 \
-M 0 \
--bam --vcf \
-v /refs/cosmic/cosmic.vcf.gz \
--job $i 100 ) &
done
wait
python /neat-genreads/mergeJobs.py \
-i /data/neat/tumor -o /data/neat/tumor_merged -s /bin/samtools
This resulted in two 11G FASTQs, one 9.3G BAM, and another empty VCF. The subsequent bedcov and mpileup showed better coverage:
# bedcov
1 XXX YYY 4887
1 XXX YYY 20249
1 XXX YYY 7470
# mpileup (for 20249)
1 XXX N 25
1 XXX N 25
1 XXX N 27
1 XXX N 27
1 XXX N 27
I bumped the coverage up to -c 1000 (still using 100 jobs), and (not unexpectedly) bash killed the jobs.
I simulated ~1 million 2x150bp reads with rescaled error rates using the -E parameter. Using the golden BAM file, I ran samtools calmd to generate the MD and NM (edit distance to the reference) tags, because the BAM doesn't contain them by default.
full set of options:
--pe 500 100 -E 0.1 -c 40 --bam -M 0 -p 1 --rng 1 -R 150
Here is a plot of the NM values for -E 0.1 (expecting a median of 150*0.1 = 15):
The median is 6.
And for -E 0.3 (the maximum):
The median is 10, but I would have expected 45.
This is using the default error model. Because -E at the very least seems to rescale the error rate at higher values, this isn't a bug per se, but it could be critically misleading. That is, if you are benchmarking a tool, it could mislead you into thinking the tool can tolerate much higher error rates than it can in reality.
I think you have a circular dependency in your imports.
Using a fresh clone of the repo:
python genReads.py -r test.fasta -R 101 -o simulated_data
Traceback (most recent call last):
File "genReads.py", line 38, in <module>
from SequenceContainer import SequenceContainer, ReadContainer, parseInputMutationModel
File "/neat-genreads/py/SequenceContainer.py", line 10, in <module>
from cigar import CigarString
ImportError: cannot import name CigarString
Also, any plans to make this an actual package with a setup.py, etc.?
After aligning and SNP calling using mpileup, I realized no (-) strand reads were being produced (reverse-complement sequences from the provided FASTA). Is there an option for this?
Thanks
Hi Zachary,
Thank you for making this! Is it possible to make it clearer in the documentation about running genReads.py using multiple threads? I couldn't exactly understand what arguments are expected with the command "--job" to achieve this.
I'd really like to be able to use your tool inside of Galaxy, but before I embark on a custom install, I was wondering if there is a version already out there. I tried the usual places (toolshed, Google) but did not find anything, so I'm guessing I should start embarking?
After sorting the golden bam I encounter this message:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name A.tha_01-1-35477/2, No M or N operator between pair of I operators in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:182)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
This is the read:
A.tha_01-1-35477/2 147 1 24491 70 85M2I1D6I57M = 24142 -499 AAAATTGGTTACCTATTGGTTTTGGTTGGTTGATTTATCTTACGTTATCAGCAAGCAGAAGTGATCCGTAATCAAACTTGTTTTCAATTGGACATTATTGTCCGAGGGGCGGTGAGATGGGACAGGACTTTTGGGATTCTCGAAGGTGGC 06FFFE>FC>F?FF?@FFFF9GAEGGGEGGGGD5FD)5FFGFGEFE@DGGFG3@FAGFGFGA>GG3GGGEFBFGG4G><;GG>G:F&GGGGBGGE2GFF6F9*B85GE6?FCEEBFG@GGGFE@3DCDGG;E<;FEDG:GEFA-3GG<FD
Also:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 9111, Read name A.tha_02-1-6417/1, CIGAR covers 148 bases but the sequence is 150 read bases
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:812)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:182)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
python genReads.py -r hg19.fa -R 150 -c 3000 -t target.bed -o sim1_norm --bam --vcf --pe 350 30
The program was still trying to sample reads from the whole genome, rather than being limited to the target.bed regions.
Hi,
I was using the genSeqErrorModel.py script with two input FASTQ files (using the -i2 option), and what caught my attention is that the script actually reads the same file twice (judging by the "reading xxx.fastq" log line).
A brief investigation of the script showed that the parseFQ function takes an inf argument, but what it actually uses is the INF variable, which corresponds to the first FASTQ file.
Hi
Thanks to all the authors for such a well-designed tool!
I'm a novice in the bioinformatics field. I want to generate paired tumor/normal sequence data for testing variant callers, but I have no idea what parameters should be set.
Also, based on the default Mutation Models, would it be possible to do this?
Hi,
How can I determine the exact coverage, and not just the average, using your tool?
Thanks for your attention!
When I try to build the index of the golden.bam I encounter an error:
java -jar $PICARD BuildBamIndex I=golden.bam O=golden.bam.bai
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG ID:NEAT; File /home/sequentia1/Alvaro/todo/ggg/masked_2n_control_golden.bam; Line number 9
at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:265)
at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:346)
at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:175)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:108)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:208)
at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:137)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
When I try to correct it by AddOrReplaceReadGroups, there is another error:
java -jar $PICARD AddOrReplaceReadGroups I=golden.bam O=golden_RG.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG ID:NEAT; File /home/sequentia1/Alvaro/todo/ggg/masked_2n_control_golden.bam; Line number 9
at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:265)
at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:346)
at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:175)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:108)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:152)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
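Both failures come from the same root cause: htsjdk rejects any @RG header line that lacks an SM (sample) tag. One workaround is to patch the header text before indexing. A minimal sketch (the sample name "sample1" is a placeholder, not something NEAT emits):

```python
def add_sm_tag(header_line, sample="sample1"):
    """Append an SM tag to an @RG header line if it lacks one.

    Non-@RG lines and @RG lines that already carry SM: are returned
    unchanged, so the function is safe to map over a whole header.
    """
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] == "@RG" and not any(f.startswith("SM:") for f in fields):
        fields.append("SM:" + sample)
    return "\t".join(fields)
```

The patched header text could then be applied to the BAM with `samtools reheader`, after which BuildBamIndex should accept the file.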
Hello Zach,
I have been generating simulated data with different option combinations. I noticed that when I specify --no-fastq together with --bam, the generated BAM file is tiny and contains no data. However, when I run the very same command without --no-fastq, the BAM file looks fine.
Shaheen
The coverage across most of the sites specified in the given BED file is extremely low, typically 10x or less.
This is not true everywhere, though: some regions are extremely saturated.
Thank you so much for the prompt reply!
I'd also like to ask:
(1) Why does the read length in the "Usage" description differ from the one given on the command line? Is it a typo, or something subtle I should be careful about?
For example, if I want a read length of 150 bp, should -R be set to 151 on the command line?
(2) After generating the sequence with this command, I got the golden VCF:
python genReads.py -r chr21.fa -R 150 -o /home/phoebe/neat-genreads/0728_chr21_simulated --bam --vcf --pe 300 30
What does the WP in the column INFO stand for?
Getting an error that I think is due to not using --bam.
sampling reads...
Traceback (most recent call last):
File "/neat-genreads/genReads.py", line 637, in <module>
main()
File "/neat-genreads/genReads.py", line 585, in main
myRefIndex = indices_by_refName[bamHeader[0][RI][0]]
TypeError: 'NoneType' object has no attribute '__getitem__'
40.290 (sec)
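The TypeError above is consistent with bamHeader being None because --bam was omitted, so no header was ever built before the merge step tried to index into it. A defensive guard would turn the opaque failure into an actionable message. This is a sketch, not the actual genReads.py code:

```python
def require_bam_header(bam_header):
    """Fail loudly when the BAM header is missing instead of letting a
    'NoneType' indexing error surface later in the merge step."""
    if bam_header is None:
        raise ValueError(
            "no BAM header available; rerun genReads.py with --bam "
            "before merging per-job outputs")
    return bam_header
```

With the guard in place, the lookup `indices_by_refName[bamHeader[0][RI][0]]` would only ever run on a real header.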
Hi,
I'm trying to use the genReads script and am running into following error:
Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
index not found, creating one... 5.100 (sec)
Traceback (most recent call last):
File "neat-genreads-master/genReads.py", line 791, in <module>
main()
File "neat-genreads-master/genReads.py", line 353, in main
OFW = OutputFileWriter(OUT_PREFIX,paired=PAIRED_END,BAM_header=bamHeader,VCF_header=vcfHeader,gzipped=GZIPPED_OUT,jobTuple=(MYJOB,corrected_nJobs),noFASTQ=NO_FASTQ)
File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/OutputFileWriter.py", line 120, in __init__
self.bam_file.write(n[0]+'\0')
File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/biopython_modified_bgzf.py", line 73, in write
self._write_block(self._buffer[:65536])
File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/biopython_modified_bgzf.py", line 59, in _write_block
bsize = struct.pack("<H", len(compressed) + 25) # includes -1
struct.error: ushort format requires 0 <= number <= USHRT_MAX
I'm using osX High Sierra.
My command was:
python neat-genreads-master/genReads.py -r reference/human_g1k_v37.fasta -R 150 -o simNEAT/simulated.NEAT.R150.rg1kv37.c150.pe300_30 -c 150 --pe 300 30 --bam --vcf
Any ideas why it's giving me this error? From my limited testing, it seems to occur with the --vcf flag.
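The struct.error itself is a consequence of the BGZF on-disk format: each compressed block's size is stored in an unsigned 16-bit BSIZE field, so packing a block that compressed to more than 65535 bytes must fail. A small demonstration of that limit:

```python
import struct

def pack_bsize(n):
    """Pack a BGZF block size as the format's unsigned 16-bit field.

    Returns the packed bytes, or None when n exceeds 65535 (USHRT_MAX),
    which is exactly the failure mode in the traceback above.
    """
    try:
        return struct.pack("<H", n)
    except struct.error:
        return None

pack_bsize(65535)  # largest representable block size
pack_bsize(70000)  # too large for the format -> None
```

So the crash means a compressed block grew past the 64 KiB the format allows, rather than anything platform-specific to macOS.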
Just came across the following error in the master
branch. I haven't had a chance to track it down, but I'll update if I do.
for i in $(seq 1 100)
do
python genReads.py \
-r /refs/ensembl/ensembl_build_75.fa \
-R 126 \
-o /data/neat/normal \
--pe 300 30 \
-t /data/regions.bed \
-c 100 \
-M 0 \
--bam --vcf \
--job $i 100
done
A single command is giving the following error:
Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (126), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /refs/ensembl/ensembl_build_75.fa.fai
enumerating all non-N regions in reference sequence...
reading 1... 33.343 (sec)
/neat-genreads/py/probability.py:86: RuntimeWarning: divide by zero encountered in log
w_range = [np.exp(k*np.log(l) - l - logFactorial_list[k]) for k in k_range]
/neat-genreads/py/probability.py:86: RuntimeWarning: invalid value encountered in double_scalars
w_range = [np.exp(k*np.log(l) - l - logFactorial_list[k]) for k in k_range]
Traceback (most recent call last):
File "/neat-genreads/genReads.py", line 622, in <module>
main()
File "/neat-genreads/genReads.py", line 497, in main
sequences = SequenceContainer(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE,coverage_dat,onlyVCF=ONLY_VCF)
File "/neat-genreads/py/SequenceContainer.py", line 33, in __init__
self.indelsToAdd = [n.sample() for n in self.ind_pois]
File "/neat-genreads/py/probability.py", line 63, in sample
return degenerateVal
NameError: global name 'degenerateVal' is not defined
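The NameError looks like the λ = 0 case (here forced by -M 0): np.log(0) diverges, the Poisson weights become NaN, and the sampler falls into a degenerate branch where degenerateVal was never assigned. A sketch of an inverse-CDF Poisson sampler that handles the zero-rate case explicitly (illustrative, not the actual probability.py code):

```python
import math
import random

def sample_poisson(lam, rng=random.random):
    """Inverse-CDF Poisson sampler that treats lam <= 0 as a point mass
    at zero instead of taking log(0) and crashing."""
    if lam <= 0.0:
        return 0  # degenerate distribution: no mutations/indels to add
    u = rng()
    k = 0
    p = math.exp(-lam)  # P(X = 0)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k     # P(X = k) from P(X = k - 1)
        cdf += p
    return k
```

With a guard like this, -M 0 simply yields zero indels per window instead of raising.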