Hello, I tried to simulate reads using NanoSim for the human genome.

Does NanoSim works for large genomes? about nanosim HOT 7 CLOSED

bcgsc commented on May 30, 2024

Does NanoSim works for large genomes?

from nanosim.

Comments (7)

cheny19 commented on May 30, 2024

Yes, NanoSim works fine with large genomes, like human. The run time should range from minutes to hours, depending on the desired number of reads. If it has been running for 3 days already, you can stop it now, because it means something is going wrong.

Please make sure you got version 1.1.0, as it fixed several bugs. If you still run into the problem, let me know which dataset you used to generate the models, and you can also email me the log file so I have a better understanding of the problem.

Thanks!

from nanosim.

haghshenas commented on May 30, 2024

Please make sure you got version 1.1.0, as it fixed several bugs.

I'm using the latest commit. And other tools meet the requirements:

$ lastal --version
lastal 847

$ R --version
R version 3.3.3 (2017-03-06) -- "Another Canoe"...

$ python --version
Python 2.7.5

$ python -c "import numpy; print numpy.version.version;"
1.12.1

let me know which dataset you used to generate the models

I obtained the dataset from Nanopore WGS Consortium repository. Here is the link to the fastq file:
http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz

Here are the steps I followed:

Make a set of training reads from top 5000 reads in the fastq file:

$ cat rel3-nanopore-wgs-3306352129-FAB42798.fastq | awk 'NR % 4 == 1 {print ">s" (++x)} NR % 4 == 2' | head -n 10000 > real.fasta

Run the read_analysis.py script:

$ read_analysis.py -i real.fasta -r hg38.fa
2017-04-20 18:00:03: Read pre-process and unaligned reads analysis
2017-04-20 18:00:03: Alignment with LAST
2017-04-20 23:27:25: Aligned reads analysis
2017-04-20 23:27:25: match and error models
2017-04-20 23:27:33: Model fitting
2017-04-20 23:35:42: Finished!
$ ls
hg38.fa                                      training_besthit.maf
model_fitting.Rout                           training_del.hist
real.fasta                                   training_error_markov_model
ref_genome.bck                               training.fasta
ref_genome.des                               training_first_match.hist
ref_genome.prj                               training_ht_ratio
ref_genome.sds                               training_ins.hist
ref_genome.ssp                               training.maf
ref_genome.suf                               training_match.hist
ref_genome.tis                               training_match_markov_model
rel3-nanopore-wgs-3306352129-FAB42798.fastq  training_mis.hist
training_aligned_length_ecdf                 training_model_profile
training_aligned_reads_ecdf                  training_unaligned_length_ecdf
training_align_ratio

Run the simulator.py script:

$ simulator.py linear -r hg38.fa

But the simulator.py script keeps running without generating any simulated reads. Also simulated.log is empty.

Please let me know if you need any of the files.
Thanks.

from nanosim.

haghshenas commented on May 30, 2024

Hi there,

Did you have a chance to look into this issue?

Thanks

from nanosim.

cheny19 commented on May 30, 2024

I repeated your steps and didn't notice any error. One thing to notice is it takes long to read in the reference genome given the size. I haven't finished reading in yet, but the simulated.log is not empty. If you force quit the program, It will show information like this:

2017-04-25 10:16:15: ../simulator/src/simulator.py linear -r ~/genome.fa -n 10
2017-04-25 10:16:15: Read error profile
2017-04-25 10:16:15: Read ECDF of unaligned reads
2017-04-25 10:16:15: Read ECDF of aligned reads
2017-04-25 10:16:15: Read in reference genome

If you do have such log file, it means it's trying to read in the reference genome, and probably better to choose a faster machine, or just wait. If your log file is empty, unfortunately I cannot reproduce this problem.

from nanosim.

haghshenas commented on May 30, 2024

I get the same log when I kill the process. But I don't think reading the reference genome should take more than 3 days as I am using a powerful workstation.
I start to suspect that there is a problem with handling large genome. Please let me know if waiting helps you with generating reads.

from nanosim.

cheny19 commented on May 30, 2024

NanoSim reads in the genome and converts all bases to upper case. I think this is the most time consuming part for large genome right now. I'll fix the code and let you know. You don't have to wait for it to run. Thanks for letting me know about this!

from nanosim.

haghshenas commented on May 30, 2024

I actually fixed the code and will send a pull request. There were two very slow parts in the code.

reading the reference fasta file as you mentioned. It can be fixed with a generator function (I used the one from here)
fixing ambiguous cases in the reference. I used a method that avoids slicing the string.

Note that this code is still not optimized in using memory which can be fixed.

from nanosim.

Does NanoSim works for large genomes? about nanosim HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent