
Comments (7)

cheny19 commented on May 30, 2024

Yes, NanoSim works fine with large genomes such as human. The run time should range from minutes to hours, depending on the desired number of reads. If it has already been running for 3 days, you can stop it, because that means something has gone wrong.

Please make sure you have version 1.1.0, which fixed several bugs. If you still run into the problem, let me know which dataset you used to generate the models. You can also email me the log file so I can better understand the problem.

Thanks!

from nanosim.

haghshenas commented on May 30, 2024

Please make sure you got version 1.1.0, as it fixed several bugs.

I'm using the latest commit, and the other tools meet the requirements:

$ lastal --version
lastal 847

$ R --version
R version 3.3.3 (2017-03-06) -- "Another Canoe"...

$ python --version
Python 2.7.5

$ python -c "import numpy; print numpy.version.version;"
1.12.1

let me know which dataset you used to generate the models

I obtained the dataset from the Nanopore WGS Consortium repository. Here is the link to the fastq file:
http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz

Here are the steps I followed:

  1. Make a set of training reads from the top 5000 reads in the fastq file:
$ cat rel3-nanopore-wgs-3306352129-FAB42798.fastq | awk 'NR % 4 == 1 {print ">s" (++x)} NR % 4 == 2' | head -n 10000 > real.fasta
  2. Run the read_analysis.py script:
$ read_analysis.py -i real.fasta -r hg38.fa
2017-04-20 18:00:03: Read pre-process and unaligned reads analysis
2017-04-20 18:00:03: Alignment with LAST
2017-04-20 23:27:25: Aligned reads analysis
2017-04-20 23:27:25: match and error models
2017-04-20 23:27:33: Model fitting
2017-04-20 23:35:42: Finished!
$ ls
hg38.fa                                      training_besthit.maf
model_fitting.Rout                           training_del.hist
real.fasta                                   training_error_markov_model
ref_genome.bck                               training.fasta
ref_genome.des                               training_first_match.hist
ref_genome.prj                               training_ht_ratio
ref_genome.sds                               training_ins.hist
ref_genome.ssp                               training.maf
ref_genome.suf                               training_match.hist
ref_genome.tis                               training_match_markov_model
rel3-nanopore-wgs-3306352129-FAB42798.fastq  training_mis.hist
training_aligned_length_ecdf                 training_model_profile
training_aligned_reads_ecdf                  training_unaligned_length_ecdf
training_align_ratio
  3. Run the simulator.py script:
$ simulator.py linear -r hg38.fa
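
For reference, the subsampling in the first step can also be sketched in Python. This is only an illustration of what the awk one-liner does (take the first 5000 records, rename them s1, s2, ... and keep the sequence line). The function name `fastq_to_fasta` is made up here, and like the awk command it assumes every FASTQ record is exactly four lines.

```python
import io
from itertools import islice


def fastq_to_fasta(fastq_handle, fasta_handle, max_reads=5000):
    """Write the first max_reads FASTQ records as FASTA, renamed s1, s2, ...

    Assumes strict 4-line FASTQ records (header, sequence, '+', quality),
    which is the same assumption the awk one-liner makes.
    """
    written = 0
    while written < max_reads:
        record = list(islice(fastq_handle, 4))  # next 4 lines = one record
        if len(record) < 4:  # end of file or truncated record
            break
        written += 1
        sequence = record[1].rstrip("\r\n")
        fasta_handle.write(">s%d\n%s\n" % (written, sequence))
    return written


# Hypothetical usage on the compressed download, without unpacking it first:
#   import gzip
#   with gzip.open("rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz", "rt") as fq, \
#           open("real.fasta", "w") as fa:
#       fastq_to_fasta(fq, fa, max_reads=5000)
```

Because it stops reading after the requested number of records, this never scans the rest of the fastq file.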

But the simulator.py script keeps running without generating any simulated reads. Also, simulated.log is empty.

Please let me know if you need any of the files.
Thanks.


haghshenas commented on May 30, 2024

Hi there,

Did you have a chance to look into this issue?

Thanks


cheny19 commented on May 30, 2024

I repeated your steps and didn't see any errors. One thing to note is that reading in the reference genome takes a long time given its size. My run hasn't finished reading it in yet, but simulated.log is not empty. If you force-quit the program, it will show information like this:

2017-04-25 10:16:15: ../simulator/src/simulator.py linear -r ~/genome.fa -n 10
2017-04-25 10:16:15: Read error profile
2017-04-25 10:16:15: Read ECDF of unaligned reads
2017-04-25 10:16:15: Read ECDF of aligned reads
2017-04-25 10:16:15: Read in reference genome

If you have a log file like this, it means the program is still reading in the reference genome, so it is probably better to use a faster machine or just wait. If your log file is empty, unfortunately I cannot reproduce the problem.


haghshenas commented on May 30, 2024

I get the same log when I kill the process. But I don't think reading the reference genome should take more than 3 days, as I am using a powerful workstation.
I am starting to suspect that there is a problem with handling large genomes. Please let me know whether waiting eventually produces reads for you.


cheny19 commented on May 30, 2024

NanoSim reads in the genome and converts all bases to upper case. I think this is currently the most time-consuming part for large genomes. I'll fix the code and let you know. You don't have to wait for it to finish running. Thanks for letting me know about this!


haghshenas commented on May 30, 2024

I actually fixed the code and will send a pull request. There were two very slow parts in the code:

  1. Reading the reference FASTA file, as you mentioned. This can be fixed with a generator function (I used the one from here).
  2. Fixing ambiguous cases in the reference. I used a method that avoids slicing the string.

Note that the code is still not optimized for memory use, which can be fixed later.
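
To illustrate the two ideas, here is a minimal sketch. It is not the code from the pull request, and `read_fasta`, `resolve_ambiguous`, and the `IUPAC` table are names made up for this example. It also assumes that "fixing ambiguous cases" means replacing each IUPAC ambiguity code with a random compatible base, which the thread does not confirm. The generator collects each record's lines in a list and joins them once, and the replacement builds the new sequence in a single pass, so neither step repeatedly copies a chromosome-sized string the way `seq = seq[:i] + base + seq[i+1:]` would.

```python
import io
import random

# IUPAC ambiguity codes mapped to the bases each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def read_fasta(handle):
    """Yield (header, upper-cased sequence) pairs, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # join once per record
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def resolve_ambiguous(seq, rng=random):
    """Replace every ambiguity code with a random compatible base.

    One pass over the sequence and one join at the end, with no slicing.
    """
    return "".join(rng.choice(IUPAC[b]) if b in IUPAC else b for b in seq)


# Small in-memory demonstration (a real run would pass an open file handle).
demo = io.StringIO(">chr1 test\nacgt\nNNry\n>chr2\nGG\n")
for header, sequence in read_fasta(demo):
    print(header, resolve_ambiguous(sequence, random.Random(0)))
```

Each chromosome is still held in memory as a whole string here, so this addresses run time only, not memory use.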

