I used the following commands to simulate reads for an E.coli strain from the NCBI dat

Read length distribution unexpected peaks about nanosim HOT 9 CLOSED

bcgsc commented on May 30, 2024

Read length distribution unexpected peaks

from nanosim.

Comments (9)

cheny19 commented on May 30, 2024

NanoSim tries to simulate the length distribution from the profile. Do you also see this type of peaks in your ecoli profile? If so, that means NanoSim is simply doing its job. If not, please send me your training data and I'll investigate more into this issue. Thanks!
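One way to check a profile or a simulated read set for this kind of peak is to bin the read lengths and flag bins that tower over their immediate neighbours. The snippet below is only a minimal sketch of that check: `length_histogram`, `find_spikes`, the bin width, and the threshold `factor` are all hypothetical names and values chosen for illustration, not part of NanoSim.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100):
    """Count reads per fixed-width length bin, keyed by the bin's start."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def find_spikes(hist, bin_width=100, factor=3.0):
    """Return bin starts whose count exceeds `factor` times the mean of
    the two adjacent bins. Bins with two empty neighbours are skipped."""
    spikes = []
    for start, count in sorted(hist.items()):
        left = hist.get(start - bin_width, 0)
        right = hist.get(start + bin_width, 0)
        neighbour_mean = (left + right) / 2
        if neighbour_mean > 0 and count > factor * neighbour_mean:
            spikes.append(start)
    return spikes


# Toy data: a sharp peak in the 200-299 bp bin.
toy_lengths = [150] * 2 + [250] * 20 + [350] * 2
toy_hist = length_histogram(toy_lengths)
toy_spikes = find_spikes(toy_hist)
```

Running the same check on the training reads and on the simulated reads would show whether a peak is inherited from the data or introduced by the simulation.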


GeorgiaBreckell commented on May 30, 2024

I used the default R9 1D E. coli profile provided with NanoSim. Thanks

…

On 11 July 2018 at 06:18, Chen Yang ***@***.***> wrote:

> NanoSim tries to simulate the length distribution from the profile. Do you also see this type of peaks in your ecoli profile? If so, that means NanoSim is simply doing its job. If not, please send me your training data and I'll investigate more into this issue. Thanks!


GeorgiaBreckell commented on May 30, 2024

Have you had a chance to look into this further? We have continued to use NanoSim with the R9 1D E. coli profile, but if possible we would like an explanation for these peaks. Thanks for your time!


harisankar991 commented on May 30, 2024

Did you see similar peaks in the training dataset as well? Any leads on this?


cheny19 commented on May 30, 2024

I finally got time to sit down and seriously look at this issue. I don't see similar peaks in the training set, but I'll need to inspect more to figure it out and fix this behaviour.


cheny19 commented on May 30, 2024

@georgiab7 Sorry about the late reply. After some careful inspection, I found that this behaviour is due to the binning strategy used to simulate the align ratio of each read. Basically, during training, all reads are divided into several bins according to their length, and each bin has its own ECDF to be simulated. With this strategy, the overall CDFs of total read length and align ratio are captured quite well, but thanks to you, we noticed that there are some spikes in the PDF/histogram.

I hope this didn't affect your work with the simulated data. We are currently trying to find a better way to simulate the align ratio and will incorporate it in our next release. Stay tuned!
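To make the binning strategy described above concrete, here is a toy sketch of per-bin ECDF sampling. It is not NanoSim's code: the function names, the single bin edge at 1000 bp, and the training values are all assumptions made up for illustration. Reads are grouped by length bin, each bin keeps its own sorted align ratios (its ECDF), and a ratio is drawn by inverse-transform sampling from the bin that the query length falls into.

```python
import bisect
import random
from collections import defaultdict


def build_binned_ecdfs(reads, bin_edges):
    """Group (read_length, align_ratio) pairs by length bin and keep the
    sorted ratios of each bin, which fully describe that bin's ECDF."""
    bins = defaultdict(list)
    for length, ratio in reads:
        bins[bisect.bisect_right(bin_edges, length)].append(ratio)
    return {idx: sorted(ratios) for idx, ratios in bins.items()}


def sample_ratio(ecdfs, bin_edges, length, rng):
    """Draw an align ratio from the ECDF of the bin containing `length`
    using inverse-transform sampling on a uniform random number."""
    ratios = ecdfs[bisect.bisect_right(bin_edges, length)]
    u = rng.random()
    k = min(int(u * len(ratios)), len(ratios) - 1)
    return ratios[k]


# Hypothetical training data: short reads align worse than long reads.
toy_reads = [(500, 0.80), (800, 0.85), (1500, 0.95), (1800, 0.97)]
toy_edges = [1000]
toy_ecdfs = build_binned_ecdfs(toy_reads, toy_edges)

rng = random.Random(0)
below_edge = {sample_ratio(toy_ecdfs, toy_edges, 999, rng) for _ in range(50)}
above_edge = {sample_ratio(toy_ecdfs, toy_edges, 1001, rng) for _ in range(50)}
```

In this toy setup, two reads whose lengths differ by only 2 bp draw from completely different ratio distributions because they sit on opposite sides of a bin edge. A discontinuity of this kind is one plausible way for the overall CDF to look fine while the histogram shows spikes.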


FiniDG commented on May 30, 2024

I also found a weird peak in the original, publicly available MinION FASTQ files (so not produced by NanoSim).

I found out via these webpages (nr1 and nr2) that this is called a spike-in, which is put in there deliberately. Maybe training with this data results in these peaks as well?

I'm not sure if this is related, but I'd like to know whether I am able to remove these reads. I will also try to plot my training data to see how it is distributed, but I think that mapping with minimap2 beforehand should already solve this issue?

EDIT:

The training dataset also contains this peak.

So it could be a cause. I think I need to find out how to get rid of the spike-in data in my dataset.


cheny19 commented on May 30, 2024

Thanks for the inspection! However, the peaks in reads simulated from the R9 1D E. coli profile are different from the spike-in sequences here, @FiniDG-HAN. To remove the spike-in data, you can align your reads against the spike-in reference and remove the reads that align. Normally the spike-in sequences are manually generated, so there should be a reference for them.
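The filtering step suggested above could look roughly like this. The sketch assumes the reads were already aligned to the spike-in reference with an aligner that writes PAF, for example something like `minimap2 -x map-ont spikein.fa reads.fastq > hits.paf`, where both file names are placeholders. It also assumes plain four-line FASTQ records. The helper names `aligned_read_ids` and `drop_reads` are hypothetical.

```python
def aligned_read_ids(paf_lines):
    """Collect query names (PAF column 1) of reads with at least one hit."""
    ids = set()
    for line in paf_lines:
        line = line.strip()
        if line:
            ids.add(line.split("\t")[0])
    return ids


def drop_reads(fastq_lines, drop_ids):
    """Return the FASTQ lines of every record whose ID is not in drop_ids.
    Assumes each record is exactly four lines long."""
    kept = []
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        read_id = record[0][1:].split()[0]  # strip '@', cut the description
        if read_id not in drop_ids:
            kept.extend(record)
    return kept


# Toy input: readB hits the spike-in reference, readA does not.
toy_paf = ["readB\t3560\t0\t3560\t+\tspike\t3560\t0\t3560\t3500\t3560\t60"]
toy_fastq = ["@readA desc", "ACGT", "+", "IIII",
             "@readB", "GGCC", "+", "IIII"]
toy_filtered = drop_reads(toy_fastq, aligned_read_ids(toy_paf))
```

In practice you would probably also want to require a minimum mapping quality or aligned fraction before discarding a read, so that short spurious hits do not remove genuine reads.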


cheny19 commented on May 30, 2024

@georgiab7 and @FiniDG-HAN, we recently changed the simulation algorithm and now use Kernel Density Functions to simulate the read length and align ratio of each read. Reads simulated with this strategy do not have these peaks, and the read length distribution is smoother. The distributions of unaligned region length, aligned region length, and total read length are closer to the training data than in previous releases. Please download the latest version and give it a shot. Thanks for your patience!
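For readers unfamiliar with the idea, the following is a minimal sketch of drawing read lengths from a kernel density estimate, written with the standard library only. It is not the NanoSim implementation, which may use a different kernel, bandwidth, or library. The Gaussian kernel, the normal-reference bandwidth rule, the choice to work on log lengths (to keep samples positive), and the function names are all assumptions for illustration.

```python
import math
import random
import statistics


def kde_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def sample_lengths_kde(train_lengths, n, rng):
    """Sample n read lengths from a Gaussian KDE fitted to log lengths.
    Sampling a Gaussian KDE is equivalent to picking a training point at
    random and adding Gaussian noise with the kernel bandwidth."""
    logs = [math.log(x) for x in train_lengths]
    h = kde_bandwidth(logs)
    samples = []
    for _ in range(n):
        centre = rng.choice(logs)
        samples.append(max(1, round(math.exp(rng.gauss(centre, h)))))
    return samples


# Hypothetical training lengths in bp.
toy_train = [1000, 2000, 4000, 8000]
toy_samples = sample_lengths_kde(toy_train, 50, random.Random(0))
```

Because every sample is a training value plus continuous noise, the simulated lengths are spread smoothly around the observed ones instead of piling up at a few discrete values, which is consistent with the smoother distribution reported above.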
