Yes, NanoSim works fine with large genomes, such as human. The run time should range from minutes to hours, depending on the desired number of reads. If it has already been running for 3 days, you can stop it, because something is going wrong.
Please make sure you have version 1.1.0, as it fixed several bugs. If you still run into the problem, let me know which dataset you used to generate the models. You can also email me the log file so that I can better understand the problem.
Thanks!
from nanosim.
> Please make sure you have version 1.1.0, as it fixed several bugs.
I'm using the latest commit, and the other tools meet the requirements:
$ lastal --version
lastal 847
$ R --version
R version 3.3.3 (2017-03-06) -- "Another Canoe"...
$ python --version
Python 2.7.5
$ python -c "import numpy; print numpy.version.version;"
1.12.1
> let me know which dataset you used to generate the models
I obtained the dataset from the Nanopore WGS Consortium repository. Here is the link to the fastq file:
http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz
Here are the steps I followed:
- Make a set of training reads from the first 5000 reads in the fastq file:
$ cat rel3-nanopore-wgs-3306352129-FAB42798.fastq | awk 'NR % 4 == 1 {print ">s" (++x)} NR % 4 == 2' | head -n 10000 > real.fasta
- Run the read_analysis.py script:
$ read_analysis.py -i real.fasta -r hg38.fa
2017-04-20 18:00:03: Read pre-process and unaligned reads analysis
2017-04-20 18:00:03: Alignment with LAST
2017-04-20 23:27:25: Aligned reads analysis
2017-04-20 23:27:25: match and error models
2017-04-20 23:27:33: Model fitting
2017-04-20 23:35:42: Finished!
$ ls
hg38.fa training_besthit.maf
model_fitting.Rout training_del.hist
real.fasta training_error_markov_model
ref_genome.bck training.fasta
ref_genome.des training_first_match.hist
ref_genome.prj training_ht_ratio
ref_genome.sds training_ins.hist
ref_genome.ssp training.maf
ref_genome.suf training_match.hist
ref_genome.tis training_match_markov_model
rel3-nanopore-wgs-3306352129-FAB42798.fastq training_mis.hist
training_aligned_length_ecdf training_model_profile
training_aligned_reads_ecdf training_unaligned_length_ecdf
training_align_ratio
- Run the simulator.py script:
$ simulator.py linear -r hg38.fa
But the simulator.py script keeps running without generating any simulated reads, and simulated.log is empty.
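For anyone who wants to reproduce the first step without awk, here is a minimal Python sketch of the same conversion. The function name `fastq_to_fasta` is just an illustration (it is not part of NanoSim), and it assumes strict 4-line FASTQ records, as the awk one-liner does.

```python
from itertools import islice


def fastq_to_fasta(fastq_lines, max_reads):
    """Yield FASTA lines (">s1", sequence, ">s2", ...) for the first max_reads records.

    Hypothetical helper; assumes every FASTQ record spans exactly 4 lines.
    """
    # Only look at the lines that belong to the first max_reads records.
    for i, line in enumerate(islice(fastq_lines, 4 * max_reads)):
        if i % 4 == 1:  # second line of each record holds the sequence
            yield ">s%d" % (i // 4 + 1)
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Same file names as in the steps above.
    with open("rel3-nanopore-wgs-3306352129-FAB42798.fastq") as src, \
            open("real.fasta", "w") as dst:
        for out_line in fastq_to_fasta(src, 5000):
            dst.write(out_line + "\n")
```

Like the awk version, this renames the reads to s1, s2, and so on, and writes two lines per read, so 5000 reads give 10000 lines.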
Please let me know if you need any of the files.
Thanks.
from nanosim.
Hi there,
Did you have a chance to look into this issue?
Thanks
from nanosim.
I repeated your steps and didn't notice any error. One thing to note is that reading in the reference genome takes a long time, given its size. Mine hasn't finished reading in yet, but simulated.log is not empty. If you force quit the program, it will show information like this:
2017-04-25 10:16:15: ../simulator/src/simulator.py linear -r ~/genome.fa -n 10
2017-04-25 10:16:15: Read error profile
2017-04-25 10:16:15: Read ECDF of unaligned reads
2017-04-25 10:16:15: Read ECDF of aligned reads
2017-04-25 10:16:15: Read in reference genome
If you do have a log file like this, it means the program is still reading in the reference genome, so it is probably better to use a faster machine or just wait. If your log file is empty, unfortunately I cannot reproduce the problem.
from nanosim.
I get the same log when I kill the process. But I don't think reading the reference genome should take more than 3 days, as I am using a powerful workstation.
I am starting to suspect that there is a problem with handling large genomes. Please let me know if waiting helps you generate reads.
from nanosim.
NanoSim reads in the genome and converts all bases to upper case. I think this is the most time-consuming part for large genomes right now. I'll fix the code and let you know. You don't have to wait for it to finish. Thanks for letting me know about this!
from nanosim.
I actually fixed the code and will send a pull request. There were two very slow parts in the code.
- Reading the reference fasta file, as you mentioned. This can be fixed with a generator function (I used the one from here).
- Fixing ambiguous cases in the reference. I used a method that avoids slicing the string.
Note that this code is still not optimized for memory use, which can be fixed.
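The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the names `read_fasta`, `resolve_ambiguous`, and `IUPAC` are hypothetical, and it assumes that "ambiguous cases" means IUPAC ambiguity codes (such as N or R) that are replaced by a randomly chosen compatible base.

```python
import random

# IUPAC ambiguity codes mapped to the bases each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def read_fasta(handle):
    """Yield (name, sequence) pairs one record at a time (hypothetical helper)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                # Join the collected lines once per record.
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def resolve_ambiguous(seq, rng=random):
    """Upper-case seq and swap each ambiguity code for a random compatible base."""
    bases = list(seq.upper())  # a list can be edited in place, unlike a str
    for i, base in enumerate(bases):
        if base in IUPAC:
            bases[i] = rng.choice(IUPAC[base])
    return "".join(bases)
```

Rebuilding a string by slicing around every replaced base copies the whole sequence each time, which becomes quadratic on a chromosome with many ambiguous positions. Editing a list in place and joining once is linear. As noted, this still holds a full copy of each sequence in memory.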
from nanosim.