bcgsc / nanosim

Nanopore sequence read simulator

License: Other

Python 99.55% Lua 0.45%

genome transcriptome simulator oxford-nanopore nanopore-sequencing

nanosim's Issues

[IndexError: string index out of range] Trying to introduce errors at indices > length of read

NanoSim occasionally fails with an error caused by generating a ref_l greater than the length of a reference contig. In this example, my reference has only one contig of length ~850. Note that I added two lines to print the error_dict and new_read variables for debugging.

[borabi@xavier3 freddie]$ simulator.py linear -r reference.fasta -c training -o simulated -n 3
len(new_read)
851
error_dict:
{13: ['del', 1], 22.5: ['ins', 4], 48: ['mis', 1], 52: ['mis', 1], 53: ['mis', 1], 134: ['del', 1], 147: ['del', 1], 165: ['mis', 1], 166: ['mis', 1], 177: ['del', 1], 228: ['mis', 1], 229: ['mis', 1], 237.5: ['ins', 1], 246: ['mis', 1], 250: ['mis', 1], 265: ['del', 1], 288: ['del', 1], 310: ['del', 1], 337: ['mis', 1], 370: ['del', 1], 371: ['mis', 2], 494: ['del', 1], 507.5: ['ins', 3], 521: ['mis', 1], 531.5: ['ins', 5], 535.5: ['ins', 3], 573: ['del', 3], 577: ['del', 1], 578.5: ['ins', 1], 590: ['del', 3], 615: ['mis', 2], 618: ['del', 1], 638: ['del', 1], 671: ['mis', 1], 674: ['mis', 2], 699: ['del', 2], 701: ['mis', 1], 752: ['del', 2], 754: ['mis', 2], 756.5: ['ins', 1], 759: ['mis', 1], 763: ['mis', 1], 770: ['mis', 1], 775: ['mis', 2], 778: ['mis', 1], 782: ['del', 1], 810: ['mis', 1], 819.5: ['ins', 2], 845.5: ['ins', 2], 870: ['mis', 1], 890: ['del', 2], 909: ['del', 1], 928: ['del', 2], 942.5: ['ins', 2], 943.5: ['ins', 2], 974: ['mis', 1], 983: ['del', 3], 996.5: ['ins', 1], 1018: ['mis', 1], 1054: ['mis', 2], 1067: ['del', 1], 1072: ['del', 3], 1078: ['mis', 1], 1090: ['mis', 1], 1104.5: ['ins', 3], 1107: ['mis', 1], 1122: ['del', 1], 1125: ['mis', 2], 1127: ['mis', 2], 1130: ['mis', 2], 1132: ['mis', 1], 1136: ['del', 2], 1149: ['del', 2], 1153: ['del', 1], 1163: ['mis', 1], 1170.5: ['ins', 1], 1211.5: ['ins', 2], 1212: ['mis', 1], 1214: ['del', 1], 1215: ['mis', 1], 1223: ['del', 1], 1233: ['mis', 1], 1258: ['mis', 1], 1291: ['mis', 1], 1299: ['mis', 1], 1303.5: ['ins', 1], 1305: ['mis', 2], 1310.5: ['ins', 4], 1314: ['mis', 1], 1316: ['del', 1], 1322.5: ['ins', 2], 1341.5: ['ins', 1], 1358: ['del', 1]}
Traceback (most recent call last):
  File "simulator.py", line 741, in <module>
    main()
  File "simulator.py", line 735, in main
    max_readlength, min_readlength)
  File "simulator.py", line 384, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "simulator.py", line 581, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range
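
The debug output above shows error positions up to 1358 for a read of 851 bases, which is consistent with the `read[key + i]` lookup running off the end of the string. A minimal sketch of a guard, assuming the `{position: [kind, length]}` layout printed above; `clip_error_dict` is a hypothetical helper for illustration, not part of NanoSim:

```python
def clip_error_dict(error_dict, read_len):
    """Drop or shorten errors that would index past the end of the read.

    Hypothetical helper; assumes the {position: [kind, length]} layout
    shown in the debug output above.
    """
    clipped = {}
    for pos, (kind, length) in sorted(error_dict.items()):
        if pos >= read_len:
            continue  # the error starts beyond the read, so skip it
        if kind in ("mis", "del"):
            # mismatches and deletions consume read bases, so cap the length
            length = min(length, read_len - int(pos))
        clipped[pos] = [kind, length]
    return clipped


errors = {13: ["del", 1], 845.5: ["ins", 2], 849: ["mis", 4], 870: ["mis", 1]}
print(clip_error_dict(errors, 851))
```

This only hides the symptom; the underlying question of why a ref_l longer than the contig is generated would still need a fix in the simulator itself.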

Unaligned read fraction is sampled from only a single strand of the reference

It would make sense for unaligned reads to be sampled from both strands at random.
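
A minimal sketch of what strand-random sampling could look like. The function names and the 50/50 strand probability are assumptions made for illustration; this is not NanoSim's sampling code:

```python
import random

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_unaligned(ref, length, rng=random):
    """Sample a fragment of `length` bases from either strand (hypothetical helper)."""
    start = rng.randint(0, len(ref) - length)
    fragment = ref[start:start + length]
    if rng.random() < 0.5:
        return reverse_complement(fragment), "R"
    return fragment, "F"
```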

Make easy installable package using pip or bioconda

To increase ease of installation and dependency management, please consider adding your tool to pip/PyPI and/or bioconda. If you need any help getting this done please let me know.

Covering the entire genome

Does NanoSim simulate reads from every part of the reference genome? Is there a way to configure coverage or a minimum number of reads to ensure that the entire genome is covered? My requirement is that WGS assembly should be possible from the artificial reads generated.

NanoSim Simulation and reference use

Hi,
I'm working with NanoSim to simulate reads from customized references. To do that I'm using:

a predefined error profile (from nanosim-h), ecoli_R9_1D;
two different customized references (to see the impact of homopolymers on reads).
But according to my results, changing the reference file seems to have no impact on read simulation: the error profile introduces the same type and number of errors into reads of the same length.
Are my results correct (i.e. NanoSim uses just the error profile), or does it also consider the reference file (number of homopolymers and other things)?
Thanks.

Division by zero and other edge case errors

There are a couple of bugs I encountered, which are mostly of the same nature:

No unmapped reads means that read_analysis.py will not generate a training_unaligned_length.pkl file, which breaks the simulator.py script:

Traceback (most recent call last):
  File "extern/nanosim/src/simulator.py", line 739, in <module>
    main()
  File "extern/nanosim/src/simulator.py", line 723, in main
    read_profile(number, model_prefix, perfect)
  File "extern/nanosim/src/simulator.py", line 167, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/borabi/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 570, in load
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'genes/E2F1/P000R001/training_unaligned_length.pkl'

No head soft clipping
No mismatch/ins/del in a read:
Example of the last:

2019-04-03 16:49:11: match and error models
Traceback (most recent call last):
  File "extern/nanosim/src/read_analysis.py", line 190, in <module>
    main(sys.argv[1:])
  File "extern/nanosim/src/read_analysis.py", line 180, in main
    error_model.hist(prefix, file_extension)
  File "extern/nanosim/src/besthit_to_histogram.py", line 383, in hist
    out_error_rate.write("Mismatch rate:\t" + str(total_mis * 1.0 / (total_mis + total_match + total_del)) + '\n')
ZeroDivisionError: float division by zero
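
The last traceback comes from dividing by `total_mis + total_match + total_del` when all three are zero. A minimal sketch of a guarded division; `safe_rate` is a hypothetical helper, and reporting 0.0 for an empty denominator is an assumption about the desired behaviour:

```python
def safe_rate(count, total):
    """Return count / total, or 0.0 when there is nothing to divide by."""
    return float(count) / total if total else 0.0


# Same shape as the failing line in besthit_to_histogram.py, with all counts zero.
total_mis, total_match, total_del = 0, 0, 0
line = "Mismatch rate:\t" + str(safe_rate(total_mis, total_mis + total_match + total_del)) + "\n"
```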

Change read length distribution of simulated reads

I don't think it is possible, but I wanted to change some parameters in my simulated reads, to test different factors. One of the parameters I really want to change is the read length like max, average, min. I tried changing some parameters in the ecdf files (_aligned_reads_ecdf, _aligned_length_ecdf and _unaligned_length_ecdf) by changing the bins and the percentages after them, but that doesn't seem to work. Only changing _aligned_reads_ecdf (what I tried first) doesn't even seem to do anything.

Do you know how I could possibly change the read lengths of the reads I am going to simulate? I believe that the error rates should be enough to make the reads? I guess they do not change according to the size of the reads (that is only library preparation)?

Feature request: reporting SAM file for simulated reads

It would be really helpful to report the true alignment of reads to the reference in SAM format. For example, Simlord does this when simulating PacBio reads.

Error encountered in the KernelDensity function

Dear authors,

The error encountered is ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

I used the datasets provided by the Loman lab and an ecoli reference genome as input.

The error prevents me from running the simulator.py script.

Any suggestions you have regarding how to resolve the issue are appreciated! Or can you make the precomputed profiles available again?

Thank you!

root@549758d01510:/# INPUT=/data/R9_Ecoli_K12_MG1655_lambda_MinKNOW_0.51.1.62.all.fasta
root@549758d01510:/# REF=/data/13002263354.fna
root@549758d01510:/# PROFILE=/data/ecoli
root@549758d01510:/#
root@549758d01510:/# read_analysis.py -i $INPUT -r $REF -o $PROFILE
2019-02-21 18:10:06: Read pre-process and unaligned reads analysis
2019-02-21 18:10:23: Alignment with minimap2
[M::mm_idx_gen::0.187*0.85] collected minimizers
[M::mm_idx_gen::0.271*0.88] sorted minimizers
[M::main::0.272*0.88] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.288*0.87] mid_occ = 12
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.304*0.86] distinct minimizers: 838544 (98.18% are singletons); average occurrences: 1.034; average spacing: 5.352
Killed
2019-02-21 18:13:00: Aligned reads analysis
Traceback (most recent call last):
  File "/NanoSim/src/read_analysis.py", line 190, in <module>
    main(sys.argv[1:])
  File "/NanoSim/src/read_analysis.py", line 161, in main
    num_aligned = align.head_align_tail(prefix, file_extension)
  File "/NanoSim/src/head_align_tail_dist.py", line 175, in head_align_tail
    kde_aligned = KernelDensity(bandwidth=10).fit(aligned_2d)
  File "/root/.local/lib/python3.7/site-packages/sklearn/neighbors/kde.py", line 128, in fit
    X = check_array(X, order='C', dtype=DTYPE)
  File "/root/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 582, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
root@549758d01510:/# ls -lah /data
total 1.8G
drwxr-xr-x 7 root root  224 Feb 21 18:12 .
drwxr-xr-x 1 root root 4.0K Feb 21 18:08 ..
-rw-rw-r-- 1 root root 4.6M Feb 20 21:55 13002263354.fna
-rw-r--r-- 1 root root 901M May 25  2016 R9_Ecoli_K12_MG1655_lambda_MinKNOW_0.51.1.62.all.fasta
-rw-r--r-- 1 root root    0 Feb 21 18:10 ecoli.sam
-rw-r--r-- 1 root root    0 Feb 21 18:12 ecoli_primary.sam
-rw-r--r-- 1 root root 901M Feb 21 18:10 ecoli_processed.fasta

problem with dna_type

Dear developer,
I get this error when I run the tool with this command:
command:
~/Downloads/NanoSim/src/simulator.py genome -dna_type linear -rg Fusve2_AssemblyScaffolds.fasta -c training

error:
`running the code with following parameters:

ref_g Fusve2_AssemblyScaffolds.fasta
model_prefix training
out simulated
number 20000
perfect False
kmer_bias 0
dna_type linear
strandness None
sd_readlength None
median_readlength None
2019-05-21 08:25:33: /home/lfaino/Downloads/NanoSim/src/simulator.py genome -dna_type linear -rg Fusve2_AssemblyScaffolds.fasta -c training
mkdir: missing operand
Try 'mkdir --help' for more information.
2019-05-21 08:25:33: Read in reference
Traceback (most recent call last):
  File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 1184, in <module>
    main()
  File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 1123, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness)
  File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 281, in read_profile
    if len(seq_dict) > 1 and dna_type == "circular":
NameError: global name 'dna_type' is not defined`

Add option to control error percentage

Hey there,

Would it be possible to have an additional option to be able to control the average error rate in the simulated reads?

Usage being that I would like to create a range of simulated reads with varying error profiles in order to test the error tolerances of various other tools. This tool would be fantastic for this because it uses actual ONT characteristics.

Maybe by editing the following?

 with open(model_prefix + "_unaligned_length_ecdf", 'r') as u_profile:
        new = u_profile.readline().strip()
        rate = new.split('\t')[1]
        # if parameter perfect is used, all reads should be aligned, number_aligned equals total number of reads.
        if per or rate == "100%":
            number_aligned = number
        else:
            number_aligned = int(round(number * float(rate) / (float(rate) + 1)))
        number_unaligned = number - number_aligned
        unaligned_dict = read_ecdf(u_profile)

What do you think?

In the meantime, I'm just going to hack out some kind of solution, but it would be great if it was part of the tool.

Cheers

Will NanoSim adapt to revisions in nanopore chemistry?

Irrespective of the nanopore chemistry, will NanoSim be able to model the error profile for a given MinION read?

Calculate length of simulated reads

Hi,

I have a question regarding the header notation of the simulated reads.
In the given example, for the "aligned" reads you say "92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.".
Does this mean that the total length of the simulated read is 12,804bp (92 + 12710 + 2)?

This is what I thought, but when I compared the header info from NanoSim (first column in the example below) with the length of the sequence itself (second column in the example below), these numbers don't match (the length is always longer):
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.1213634_aligned_7_R9_8481_25 | 8563
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.989475_aligned_8_F31_6280_22 | 6406
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.2385960_aligned_9_R1_3551_22 | 3642

Can you please help me understand how I can calculate the length of the simulated reads based on the information provided in the header and/or the other model files?

Thank you,
Natasha
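
A minimal sketch of pulling the head, middle and tail lengths out of a read name, assuming the underscore-separated layout of the README example quoted in the question (`<reference>_<start>_<aligned|unaligned>_<index>_<strand>_<head>_<middle>_<tail>`). `parse_read_name` is a hypothetical helper. The three headers listed above show the strand and head fused (e.g. `R9`), so they would need a different split:

```python
def parse_read_name(name):
    """Split a simulated read name into its fields.

    Assumes the layout of the README example:
    <reference>_<start>_<aligned|unaligned>_<index>_<strand>_<head>_<middle>_<tail>
    """
    # Split from the right so underscores inside the reference name are kept.
    ref, start, status, index, strand, head, middle, tail = name.rsplit("_", 7)
    return {
        "reference": ref, "start": int(start), "status": status,
        "index": int(index), "strand": strand,
        "head": int(head), "middle": int(middle), "tail": int(tail),
    }


fields = parse_read_name("ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2")
header_total = fields["head"] + fields["middle"] + fields["tail"]
```

For the README example, `header_total` is 92 + 12710 + 2. One possible reason the written sequence is longer or shorter than this sum is that the header records lengths before insertions and deletions are applied; that is an assumption, not something confirmed in this thread.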

Simulator fails using different reads

Hi,
The read_analysis.py and simulator.py steps work fine when I use the supplied ecoli and yeast reads in the profile stage. However, when I use Nick Loman's MAP-006-2 reads ( http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/ ) I get this error in the simulation stage:

Traceback (most recent call last):
  File "/home/mike/NanoSim/src/simulator.py", line 643, in <module>
    main()
  File "/home/mike/NanoSim/src/simulator.py", line 635, in main
    read_profile(number, model_prefix, perfect)
  File "/home/mike/NanoSim/src/simulator.py", line 149, in read_profile
    number_aligned = int(round(number * float(rate) / (float(rate) + 1)))
ValueError: invalid literal for float(): 100%

These were my commands:

read_analysis.py -i ecoli_K12_MG1655_ref.fa -r MAP006-2_2D_pass.fasta -o MAP

simulator.py circular -r Pmarinus.fasta -c MAP -n 10 -o pmarinus_MAP_10

This data set has longer reads (>9kbp) and higher accuracy than the supplied data sets I think.
Any ideas?

Many thanks,
Mike
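
The traceback shows `float(rate)` being called on the string `100%`. A minimal sketch of a split that treats `100%` as "every read aligned", mirroring the `rate == "100%"` check in the snippet quoted in another issue on this page; `split_read_counts` is a hypothetical helper, not NanoSim's code:

```python
def split_read_counts(number, rate, perfect=False):
    """Return (number_aligned, number_unaligned).

    `rate` is the aligned/unaligned ratio as a string read from the profile;
    the literal "100%" is taken to mean that every read aligned.
    """
    if perfect or rate.strip() == "100%":
        number_aligned = number
    else:
        ratio = float(rate)
        number_aligned = int(round(number * ratio / (ratio + 1)))
    # Computed on every path, so the caller always gets both counts.
    return number_aligned, number - number_aligned
```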

Can your program take coverage information into account?

Excuse me. If I want to control the coverage of the simulated reads, what should I do?

For example, I have 10 contigs in the reference fasta file. If I want to generate simulated reads with 10X coverage, how should I set the parameters?

Thank you!
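
As a rough guide, the average depth is (number of reads × mean read length) / total reference length, so the read count passed to `-n` can be estimated from a target coverage. A minimal sketch; the mean read length depends on the trained profile and is an assumed input here, and `reads_for_coverage` is a hypothetical helper:

```python
import math


def reads_for_coverage(contig_lengths, coverage, mean_read_length):
    """Estimate the number of reads needed for an average depth of `coverage`."""
    total_bases = sum(contig_lengths)
    return int(math.ceil(coverage * total_bases / float(mean_read_length)))


# Illustrative numbers only: 10 contigs of 500 kb, 10X, 8 kb mean read length.
n_reads = reads_for_coverage([500000] * 10, 10, 8000)
```

This gives an average depth only; it does not guarantee that every base of every contig is covered.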

using nanosim to generate fastq files?

Is there a way to make fastq files with nanosim? It seems to me that it would be preferable to generate fastq versus fasta files, so that nanopore QC software (eg., nanofilt) can be run on the fastq files.
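
Until FASTQ output is available, one workaround is to convert the simulated FASTA and attach a placeholder quality string. A minimal sketch; `fasta_to_fastq` is a hypothetical helper, and the constant quality (`I`, Phred 40 in Phred+33 encoding) is not modelled on real data, so quality-based filters will treat every base as high quality:

```python
def fasta_to_fastq(fasta_lines, quality_char="I"):
    """Yield FASTQ records with a constant placeholder quality string."""
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seq = "".join(chunks)
                yield "@%s\n%s\n+\n%s\n" % (name, seq, quality_char * len(seq))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may be wrapped over several lines
    if name is not None:
        seq = "".join(chunks)
        yield "@%s\n%s\n+\n%s\n" % (name, seq, quality_char * len(seq))
```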

Read Simulation: reference length limitation?

Hi,
I was hoping to use this tool to simulate some reads for a set of amplicons. However, it gives me an error when I try it:

src/simulator.py linear -r ~/R/projects/umi.sim/nanosim.input.fasta -c training.juplasmid -o sim.pcrprod
Traceback (most recent call last):
  File "src/simulator.py", line 716, in <module>
    main()
  File "src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "src/simulator.py", line 284, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

The reference file contains 10,000 entries with a size of 1.3 kb each, so I thought maybe it cannot deal with so many entries. That is why I tried the simulation with a plasmid of 7.4 kb, but that outputs the same error.

When I try it with a 49 kb lambda genome it works, which is why I assume there is a limitation on the reference size.

Is it possible to adjust NanoSim for shorter references or am I missing something?

simulator.py gets stuck (training done on NGMLR bam file)

Hello,

I wanted to compare the results of an SV detection program when using minimap2 mapping (minisv) or NGMLR mapping (ngmlrsv) as input, on reads simulated by nanosim.

In order not to favour minisv (minimap2 being nanosim default mapper), I wanted to use nanosim with NGMLR mapping as input.

Although read_analysis.py works fine with an NGMLR bam file, when I run simulator.py asking for 100,000 simulated reads, it gets stuck, each time at a different number of generated reads, usually above 20,000.

I am using nanosim 2.2.0 and the following command lines (2nd step asking for 80G of ram):
read_analysis.py -i sub10k.fa -m sub10kreads.genome.aln.sam -r Perca_flavescens.PFLA1.1.dna.toplevel.okseqids.fa.gz -t 4 > read_analysis.out 2> read_analysis.log

simulator.py linear -r Perca_flavescens.PFLA1.1.dna.toplevel.okseqids.fa -n 100000 -c training -o simulating > simulated.out 2> simulated.err

I have put the input files needed for the second step here, let me know if you need anything else?
http://genoweb.toulouse.inra.fr/~sdjebali/issues/nanosim/tosend.tar.gz

Best,
Sarah

AttributeError: 'HTSeq._HTSeq.SAM_Alignment' object has no attribute 'supplementary'

Problem with the FTP address for the example data

Dear team,

I have a problem with the FTP address for the example data. I used the one in the example.sh file:
ftp://ftp.bcgsc.ca/supplementary/NanoSim/
but the directory "supplementary/NanoSim" does not exist.
Do you have an updated link?

Thanks,

Best regards,
Maud

Identifying if bacteria is circular or linear

Is it okay to assume most bacterial genomes are circular? If not, what should be provided as input to the simulation step when we are not sure whether the genome is circular or linear?

ecdf files are incorrectly written in head_align_tail_dist.py if max_length <25000

The last bin in the header is 25000-4194. This, along with the fact that the entire column of 4000-5000 is 0, results in a crash in simulator.py at

NanoSim/src/simulator.py

Line 95 in 5c02a2e

last_key = sorted(ecdf_dict[ecdf_key[i]].keys())[-1]

with IndexError
vanilla_training_align_ratio.txt

Thank you for your attention!

If the average read length is very similar to your reference length then nanosim goes into an endless loop while trying to get a read to simulate

E.g. full-length amplicon sequencing
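
Where exactly the loop occurs in simulator.py is not stated here. One way such a loop can arise is rejection sampling that keeps retrying a read that cannot fit on a linear reference. A minimal sketch of a placement check that reports failure instead of retrying; `pick_start` is a hypothetical helper:

```python
import random


def pick_start(ref_len, read_len, linear=True, rng=random):
    """Pick a start position for a read, or None if the read cannot fit."""
    if linear:
        if read_len > ref_len:
            return None  # let the caller shorten the read instead of looping
        return rng.randint(0, ref_len - read_len)
    return rng.randint(0, ref_len - 1)  # circular: any start position works
```

A caller could respond to `None` by truncating the read to the reference length, which is the natural outcome for full-length amplicons.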

What will NanoSim do when it encounters NNNN region?

Hi,

I am using NanoSim to simulate nanopore reads from the hg38 genome. I am wondering what NanoSim is programmed to do when it encounters NNNN regions in the hg38 genome. I am asking because I have encountered some simulated reads that actually contain nucleotide (ATCG) sequences at NNNN regions, and these nucleotide sequences are unmappable to anywhere in the genome.

To explain more clearly:
Region in hg38 genome:  AGCTCATGCAAGGGANNNNNNNNNNNNNNNNNNNNN
NanoSim simulated read: AGCTCATGCAAGGGATCGAGCTGATCGGATCGGATGC

Please explain how this sequence (TCGAGCTGATCGGATCGGATGC) is generated.

Thanks

Best,
Tham
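
This does not answer how NanoSim fills N regions, but it can help to identify which simulated reads originate from them, using the start position and length recorded in the read name. A minimal sketch with hypothetical helpers `n_runs` and `overlaps_n`:

```python
import re


def n_runs(seq, min_len=1):
    """Return (start, end) half-open intervals of consecutive N bases."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [m.span() for m in pattern.finditer(seq)]


def overlaps_n(start, end, runs):
    """True if the half-open interval [start, end) touches any N run."""
    return any(s < end and start < e for s, e in runs)


runs = n_runs("ACGTNNNNNAC")
```

Reads flagged by `overlaps_n` could then be excluded before mapping-based evaluation.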

power overflow

Hi,
When I was running the read_analysis file, I got this:

/mixed_model.py:25: RuntimeWarning: overflow encountered in power
wei_cdf = 1 - np.exp(-1 * np.power(x / l, k))

I was wondering if any action is required for this warning.
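
The warning is raised when `(x / l) ** k` exceeds the floating-point range. In that regime the Weibull CDF `1 - exp(-(x / l) ** k)` is 1 to machine precision, so the overflowed elements end up at their mathematical limit. Whether this affects the fit in practice is not something this sketch can tell. Below is an overflow-free scalar evaluation using only the standard library, with `scale` and `shape` standing in for `l` and `k`; it is an illustration, not the code in mixed_model.py:

```python
import math


def weibull_cdf(x, scale, shape):
    """Evaluate 1 - exp(-(x / scale) ** shape) without overflowing."""
    if x <= 0:
        return 0.0
    log_term = shape * math.log(x / scale)  # log of (x / scale) ** shape
    if log_term > 700:  # exp(700) is close to the largest finite double
        return 1.0
    return 1.0 - math.exp(-math.exp(log_term))
```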

Training model failure

Hi,

I ran the scripts as:

read_analysis.py -i C002_05_6_50kb_TemplateFail.fasta -r E_coli_K12_NC000913.1.fasta

simulator.py circular -r E_coli_K12_NC000913.1.fasta -n 200000 --max_len 200000

and I get the following errors:

2017-02-01 20:24:49: Read pre-process and unaligned reads analysis
2017-02-01 20:24:51: Alignment with LAST
2017-02-01 20:36:07: Aligned reads analysis
2017-02-01 20:36:08: match and error models
2017-02-01 20:36:34: Model fitting
2017-02-01 20:41:16: Finished!
Traceback (most recent call last):
  File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 671, in <module>
    main()
  File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 663, in main
    read_profile(number, model_prefix, perfect, max_readlength, min_readlength)
  File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 129, in read_profile
    with open(model_profile, 'r') as mod_profile:
IOError: [Errno 2] No such file or directory: 'training_model_profile'

Could you kindly help me understand what the problem is?

Thanks

Options min_len and max_len throw an error or do not work

Trying to use min_len gives:

python NanoSim/src/simulator.py linear -r ecoli.fa -c projects/cheny_prj/nanopore/paper/R9/1D/ecoli -n 500 --min_len 1500
./simulator.py [command]
[command] circular | linear
Do not choose 'circular' when there is more than one sequence in the reference
:
-h : print usage message
-r : reference genome in fasta file, specify path and file name, REQUIRED
-c : The prefix of training set profiles, same as the output prefix in read_analysis.py, default = training
-o : The prefix of output file, default = 'simulated'
-n : Number of generated reads, default = 20,000 reads
--max_len : Maximum read length, default = Inf
--min_len : Minimum read length, default = 50
--perfect: Output perfect reads, no mutations, default = False
--KmerBias: prohibits homopolymers with length >= n bases in output reads, default = 6

Trying to use max_len gives an error:

python NanoSim/src/simulator.py linear -r ecoli.fa -c projects/cheny_prj/nanopore/paper/R9/1D/ecoli -n 500 --max_len 1000
Traceback (most recent call last):
  File "NanoSim/src/simulator.py", line 671, in <module>
    main()
  File "NanoSim/src/simulator.py", line 665, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "NanoSim/src/simulator.py", line 344, in simulation
    read_mutated = ''.join(np.random.choice(BASES, head)) + read_mutated
AttributeError: 'module' object has no attribute 'choice'

Index out of range

Hello, I'm trying to run nanofilt, and after running simulator.py I get the following error message:

  File "~/soft/nanosim/src/simulator.py", line 739, in <module>
    main()
  File "~/soft/nanosim/src/simulator.py", line 733, in main
    max_readlength, min_readlength)
  File "~/soft/nanosim/src/simulator.py", line 295, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "~/soft/nanosim/src/simulator.py", line 579, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

I tried to change --max_len; however, it doesn't have any effect.

Do you know what the issue is?

Thanks for your help

nanopore data

Does NanoSim work for generating RNA data as well?

Be able to specify threads when using LAST

I'm running the read_analysis.py tool, but I'm finding the LAST alignment quite slow. I'm pretty sure LAST can be threaded, and this should be an option.

all reads aligned bug?

Hi developers
With a particular simulation we repeatedly get an error when starting the sim, because the file _unaligned_length.pkl has not been generated during training. This has not been a problem when running nanosim on our other datasets. Is this happening because there are no unaligned reads in this particular training set? Is there a way to sort this out so we can still get a sim from this dataset?
thanks!
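
If the training set really has no unaligned reads, the simulator would need to tolerate the missing file. A minimal sketch of an optional load; `load_optional_model` is a hypothetical helper, and how the simulator should behave when the model is `None` (for example, generating zero unaligned reads) is an assumption:

```python
import os


def load_optional_model(path, loader):
    """Load a model file if it exists; return None when it is absent."""
    if not os.path.exists(path):
        return None
    return loader(path)


# Intended use (not executed here), with joblib as in the traceback on this page:
# kde_unaligned = load_optional_model(model_prefix + "_unaligned_length.pkl", joblib.load)
```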

.log file for read_analysis.py? (Killed error)

I have been trying to run read_analysis.py for some time now. I am now at a stage where I get "Killed" every time after a while, with no error code or other explanation. Earlier errors I could solve using the Python error messages, but here I get nothing. simulator.py does seem to produce a log file (according to the readme), but read_analysis.py doesn't seem to, so I cannot look into what goes wrong.

#copy/paste of my terminal screen:

[user@01 ~]$ cd Documents/tools/NanoSim/src/
[user@01 src]$ python read_analysis.py -i /path/to/reads/ERR2173373.fasta -r /path/to/ref/OMOL01.fasta -m /path/to/sam/arabidopsis_minimap_CStag.sam -o /path/to/output 
2018-10-26 08:04:04: Read pre-process and unaligned reads analysis
2018-10-26 08:04:59: Processing alignment file: sam
2018-10-26 08:24:27: Aligned reads analysis
2018-10-26 08:41:20: match and error models
Killed
[user@01 src]$

P.S. I made my own minimap2.sam, because the NanoSim variant keeps giving me a memory error. I did use the same parameters though.

1D simulation reads

Could NanoSim generate simulated Oxford Nanopore 1D reads?

Using "sam" alignments with "read_analysis.py"

Hi,

I would like to use my own alignment file with "read_analysis.py" and skip the default alignment with LAST. However, the alignment file I have is in "sam" format, and "read_analysis.py" requires an alignment file in "maf" format. I was wondering if it is possible to use "sam" alignments in this step, or if not, do you know how I can convert the "sam" files to "maf" ones?

Thank you,
Natasha

Minimap2 vs Last running time

Hi,

I tried "NanoSim-2.0" with both "minimap2" and "last" on my data.
The alignment finishes quickly; however, the run with "minimap2" took 10 hours compared to the run with "last" that took about an hour.

For "minimap2", you can see below that almost 10 hours passed before the run was reported as finished.
2018-05-01 02:44:54: Model fitting
2018-05-01 12:20:20: Finished!
while for "last" on the same data, it took only about an hour:
2018-05-01 02:51:10: Model fitting
2018-05-01 03:43:43: Finished!

This is when I ran "read_analysis.py", so I was wondering if this is normal, or I am missing some optimization step for "minimap2"? I did use "minimap2" with 4 threads.

Thank you,
Natasha

Is it possible to train the error model using human DNA reads and then simulate bacterial genomic reads?

The reason I want to do this is that, for the latest MinION chemistry, I have only human DNA reads, and I want to simulate bacterial reads for the latest chemistry.

Define read length parameters

Hello,

Is it possible to use a generic genome to train for error rates and the distribution, and to run the simulation stage with manually defined mean read length ?

Thanks for your help
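
Another issue on this page uses `--median_len` and `--sd_len` options on simulator.py, which may cover this. Independently of NanoSim, a minimal sketch of drawing read lengths with a chosen median, assuming a log-normal distribution as a stand-in; `sample_read_lengths` is a hypothetical helper and the interpretation of `sd_log` as a log-scale standard deviation is an assumption:

```python
import math
import random


def sample_read_lengths(n, median_len, sd_log, rng=random):
    """Draw n read lengths from a log-normal with the given median.

    sd_log is the standard deviation on the natural-log scale (assumed units).
    """
    mu = math.log(median_len)  # the median of a log-normal is exp(mu)
    return [max(1, int(round(rng.lognormvariate(mu, sd_log)))) for _ in range(n)]
```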

"Index out of range" Error in read_analysis.py

When tested, the read_analysis.py step outputs the following message after minimap2 alignment:

2018-06-26 02:00:48: Read pre-process and unaligned reads analysis
2018-06-26 02:01:23: Alignment with minimap2
2018-06-26 03:39:49: Aligned reads analysis
2018-06-26 04:12:57: match and error models
Traceback (most recent call last):
  File "/lab01/Tools/NanoSim-2.0.0/src/read_analysis.py", line 202, in <module>
    main(sys.argv[1:])
  File "/lab01/Tools/NanoSim-2.0.0/src/read_analysis.py", line 186, in main
    error_model.hist(prefix, file_extension)
  File "/lab01/Tools/NanoSim-2.0.0/src/besthit_to_histogram.py", line 330, in hist
    add_dict(list_hist[i], dic_mis)
IndexError: list index out of range

The input is a fastq file downloaded from the Nanopore WGS Consortium for NA12878 and converted to fasta using seqtk. The reference is GRCh38. I've tested both version 2.1.0 and version 2.0.0.

example files absent

trying to run the example.sh, I get folder not found errors
could it be that the files have been moved around?
thanks

wget ftp://ftp.bcgsc.ca/supplementary/NanoSim/ecoli_R7_2D.fasta
--2019-06-06 10:56:57--  ftp://ftp.bcgsc.ca/supplementary/NanoSim/ecoli_R7_2D.fasta
           => ‘ecoli_R7_2D.fasta’
Resolving ftp.bcgsc.ca (ftp.bcgsc.ca)... 134.87.4.91
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/NanoSim ... 
No such directory ‘supplementary/NanoSim’.

=> it seems the files are now @ http://www.bcgsc.ca/downloads/supplementary/NanoSim/ which could be corrected in the example.sh wget commands

Requirement of latest nanopore MinION reads

Could you specify the nanopore version used for the input ecoli reads mentioned here? Do you have the latest MinION simulated reads? If so, could you please share them?

Read Output ID

Relative to the comments in the readme:

ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
All information before the first _ are chromosome information. 468529 is the start position and unaligned suggesting it should be unaligned to the reference. The first 0 is the sequence index. F represents a forward strand. 0_3236_0 means that sequence length extracted from the reference is 3236 bases.
ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
This is an aligned read coming from chromosome XI at position 115406. 16565 is the index of simulation. R represents a reverse complement strand. 92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.

The 4th field confuses me. Is it "index of simulation" or "sequence index", and what does this mean? Is it basically a unique read ID? If so, why does the numbering reset when transitioning between unaligned reads and aligned reads?

eg. an example of what I see in the output:

ref|NC-001137|-[chromosome=V]_468529_unaligned_123_F_0_3236_0
ref|NC-001143|-[chromosome=XI]_115406_aligned_0_R_92_12710_2
The numbering resets. If it is intended to be an ID of sorts (which would be useful, as I may want to convert the header to something simpler), it should not reset.

simulator.py: UnicodeDecodeError

I used the command below and obtained an error:

$ python ~/repositories/NanoSim/src/simulator.py linear -r ~/GRCh38_recommended/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz -o nanosim-10000 --median_len 10000 --sd_len 1
/home/wdecoster/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 739, in <module>
    main()
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 730, in main
    max_readlength, min_readlength, median_readlength, sd_readlength)
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 254, in simulation
    for seqN, seqS, seqQ in readfq(infile):
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 211, in readfq
    for l in fp:  # search for the start of the next record
  File "/home/wdecoster/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This error was resolved by using a non-compressed reference genome, but it would be good if you could either support compressed reference genomes or alternatively give a more informative error message.

Cheers,
Wouter
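
The byte 0x8b at position 1 in the error is the second byte of the gzip magic number (0x1f 0x8b), which fits the compressed reference being read as plain text. A minimal sketch of an opener that checks the magic number first, assuming Python 3; `open_reference` is a hypothetical helper, not a NanoSim function:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def open_reference(path):
    """Open a FASTA file as text, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Checking the content rather than the `.gz` extension also covers compressed files that have been renamed.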

indexerror: string index out of range

What can I do if my reference genome is smaller than the reference used for training the error model? I get the error "indexerror: string index out of range" in this case.

Maybe a bug in the error_list function

Hi, I am just wondering if the error_list function works correctly. When creating an e_dict, it sometimes introduces steps of a negative length. I think that it might cause problems in the mutate_read function as it assumes (without any checking) that the step length (val[1]) will be positive. The program does not crash but I am afraid that the resulting data do not fully correspond to the used model. To observe the problem, you can use assert key>=0 after key = int(round(key)).

Unavailability of long Read MinION template

Hi,

As I understand, we need a long raw read from MinION for the simulator to do a read error profile characterization. I am unable to find raw long reads online apart from ecoli and yeast. Do we need reads of each genus to simulate artificial reads of that genus? For ex, I need artificial reads of 10 types of bacteria, do I need template reads for all 10 of them to configure the simulator?

Does NanoSim work for large genomes?

Hello,

I tried to simulate reads using NanoSim for the human genome.
The read_analysis.py script worked and generated the required models.
However, the simulator.py script has been running for 3 days now and is not generating any output or simulated reads.
Is there anything special I need to do for large genomes?
Can you please comment on how NanoSim scales for human size genome?

Perfect option gives error

The --perfect option gives the following error for me:

Traceback (most recent call last):
  File "./simulator.py", line 646, in <module>
    main()
  File "./simulator.py", line 638, in main
    read_profile(number, model_prefix, perfect)
  File "./simulator.py", line 156, in read_profile
    for i in xrange(number_unaligned):
UnboundLocalError: local variable 'number_unaligned' referenced before assignment

not python>=2.6 compatible

The README.md for nanosim states:

Python (2.6 or above)

Thus, nanosim should be compatible with Python>=2.6 and Python>=3. However, when using nanosim v1.3.0, I got the following error:

Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 716, in <module>
    main()
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 708, in main
    read_profile(number, model_prefix, perfect, max_readlength, min_readlength)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 148, in read_profile
    match_ht_list = read_ecdf(fm_profile)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 79, in read_ecdf
    for i in xrange(lanes):
NameError: name 'xrange' is not defined

...which suggests that the code is not actually python>=3 compatible. It would help to clarify the README or update the python code. Also, the bioconda recipe for nanosim should specify python>=2.6.
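
`xrange` was removed in Python 3, where `range` is already lazy. A minimal sketch of a common compatibility shim that lets the same loop run under both interpreters; whether the rest of the code base has other Python 2 constructs is not addressed by this:

```python
try:
    xrange  # exists on Python 2
except NameError:
    xrange = range  # Python 3: range is already a lazy sequence

lanes = 3
visited = [i for i in xrange(lanes)]
```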

Read length distribution unexpected peaks

I used the following commands to simulate reads for an E. coli strain from the NCBI database. I simulated 2 sets of reads with different length parameters, but in both cases, when I plotted the distribution of read lengths, regular but unexpected peaks in the frequency of certain read lengths can be seen.
Is there an explanation for this and is it avoidable?

`~/NanoSim-2.1.0$simulator.py circular -r ../MG1655_reference.fasta -c ecoli -n 200000 -o MG1655_Q3_simulation --max_len 2500

~/NanoSim-2.1.0$simulator.py circular -r ../MG1655_reference.fasta -c ecoli -n 200000 -o MG1655_Q2_simulation --min_len 2500 --max_len 5000`

model fitting

Hi,

I am trying to run NanoSim on a Drosophila dataset. I ran it fine on E. coli and other organisms, but now I am getting this repeatedly (~10,000 times) in the model_fitting.Rout file:

...

}

mis.fit.tmp1 <- mis_fit_func(LL.mis)
[1] "Try different initial value of MLE"
[1] "Try different initial value of MLE"

Do you have any ideas how to get around this or what is wrong? Thanks!
