khanghoj / dammet Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 1.0 192 KB

License: MIT License

Makefile 2.16% C++ 97.84%

dammet's People

Contributors

Stargazers

Watchers

Forkers

mikkelschubert

dammet's Issues

getSites' arguments-output-file name

Hi,

This is just a quick report that when running DamMet getSites the file [prefix].getSites.args is created as indicated by the log, but the file created is named [prefix].estDEAM.args instead.

DamMet v1.0.4

Thank you!
Best.

Deamination rate matrix for nonmethylated and methylated cytosines, input gargammel

Hi Kristian,

Me again 0:D I'm simulating aDNA sequencing data using gargammel and its new feature to add DNA methylation patterns that you and Gabriel Renaud added to it to test DamMet. All works fine, but my question today is about the deamination rates matrix for nonmethylated and methylated cytosines that one needs to feed gargammel (using the options --methyl -matfilenonmeth -matfilemeth).

How exactly do you build the tables for nonmethylated and for methylated cytosines? I know how this tables are supposed to look, as explained in gargammel's GitHub page:

A->C	A->G	A->T	C->A	C->G	C->T	G->A	G->C	G->T	T->A	T->C	T->G
0	1.853e-3 [1.726e-3..1.989e-3]	4.064e-3 [3.875e-3..4.263e-3]	3.269e-3 [3.099e-3..3.448e-3]	6.661e-3 [6.254e-3..7.094e-3] 3.057e-3 [2.785e-3..3.355e-3] 8.004e-2 [7.865e-2..8.145e-2] 1.236e-2 [    1.183e-2..1.292e-2] 4.131e-3 [3.828e-3..4.459e-3] 6.703e-3 [6.314e-3..7.116e-3] 3.845e-3 [3.624e-3..4.079e-3] 4.581e-3 [4.339e-3..4.836e-3] 2.169e-3 [2.005e-3..2.347e-3]
1	1.986e-3 [1.849e-3..2.134e-3]	4.273e-3 [4.070e-3..4.487e-3]	3.030e-3 [2.859e-3..3.211e-3]	5.357e-3 [5.001e-3..5.738e-3] 3.188e-3 [2.916e-3..3.485e-3] 1.427e-2 [1.369e-2..1.488e-2] 9.514e-3 [    9.075e-3..9.974e-3]	3.316e-3 [3.061e-3..3.593e-3] 5.061e-3 [4.743e-3..5.400e-3] 3.421e-3 [3.216e-3..3.639e-3] 4.865e-3 [4.620e-3..5.124e-3]	2.201e-3 [2.038e-3..2.377e-3]

And I know you can get this information from a PMD profiling tool as mapDamage. However, after having this information and in order to use the --methyl option in gargammel, I would have to duplicate this tables and have different values(rates) for the "C->T" (5p matrix) and "G->A" (3p matrix) columns; one for the nonmethylated cytosines (-matfilenonmeth) and another for the methylated cytosines (-matfilemeth). Am I right? If so, the only way (to me, so far) to obtain specific deamination rates for nonmethylated and methylated cytosines is using DamMet. Therefore, I would have to combine the information about all possible bases misincorporation rate from mapDamage and the "C->T" and "G->A" misincorporation rate, for nonmethylated and methylated positions, from DamMet to create the mentioned tables. Is this correct or I'm getting it all wrong? This is probably a stupid question and I'm missing something, but I just can't be sure without asking you about it.

Thanks a lot in advance!

All the best,
Katterinne

Bug: Segmentation fault (core dumped)

Dear Hanghoj,

This is to report a bug in DamMet that is preventing the successful use of the tool on real aDNA data.

Briefly, DamMet estDEAM stops when reading the reads and shows the error message: "Segmentation fault (core dumped)". This happens with different real aDNA sequencing data with an average read depth ranging from ~3.8X to ~50X, and using different computers (16 CPUs/32 GiB RAM and 32CPUs/256 GiB RAM) and OS (Linux Ubuntu, and MacOS). Interestingly, DamMet works fine for some samples and fails for others.

If you're interested in the details, let me walk you through it...

I have run DamMet on 16 real aDNA sequencing data (input: BAM file of the alignment of the aDNA sequences against the human reference genome) with an average read depth ranging from ~1.5X to ~50X. For 5 of these samples, DamMet worked just fine (both steps: estDEAM and estF). These 5 samples have an average read depth ranging from ~1.5X to ~10.6X. On the other hand, for the other 11 samples with an average read depth ranging from ~3.8X to ~50X, DamMet failed. DamMet estDEAM stops when reading the reads and shows the error message: "Segmentation fault (core dumped)". I tried to run DamMet again with these samples in different computers with the following specifications: 16 CPUs/32 GiB RAM and 32CPUs/256 GiB RAM, Linux (Ubuntu) and MacOS. Same thing happened every time. The example command line I was using:

DamMet estDEAM -b sample.bam -r human_ref_genome.fasta -cf chromosomes.list -O sample -R sample_RGs.list

Notes:

I was using -cf with a list of all chromosomes chromosomes.list(1 to 22, X, Y, and MT).
I tried with and without providing a read group list (-R). It didn't make a difference.

Additional information I have from experimenting with the tool, trying to understand why it wouldn't work...

Additionally, even knowing that DamMet runs on one chromosome at a time, I tried again using the -c option to run one chromosome at a time. By doing this I could run DamMet with most of the chromosomes of the samples that previously failed, but for several of the chromosomes, again, DamMet will stop when reading the reads (either giving a read group list or not) and shows the error message: "Segmentation fault (core dumped)". The number and identity of the chromosomes that failed would depend on each sample. The example command line I was using:

DamMet estDEAM -b sample.bam -r human_ref_genome.fasta -c 1 -O sample_chr1 -R sample_RGs.list

Next, I tried again, this time using the -L option to filter the reads by length for those chromosomes that failed. I found that, by trying increasing read length values on each chromosome, eventually DamMet will run successfully. The example command line I was using:

DamMet estDEAM -b sample.bam -r human_ref_genome.fasta -c 6 -L 80 -O sample_chr6 -R sample_RGs.list

Notes:

I used the -L option starting from 35 with an increase of 5 (the minimum read length in all my samples is 30).

Finally, when I wanted to run DamMet's second step on the same chromosomes I had to run independently and with a read length filter for the first step, it was necessary to do the same for estF to work.

Example data for reproducing error...

Using this Dropbox link you can download a zip file with the necessary files to reproduce the error:

The files you will find:
Alignment file and its index: SF12_DS_05.bam, SF12_DS_05.bam.bai
(This sample has an average read depth of ~5X)
Human reference fasta file and index: hs37d5.fa, hs37d5.fa.fai
Read group list file: SF12DS05_RGs.list

When running DamMet one chromosome at a time using the -c option, the chromosome that fails is: 6.

DamMet estDEAM -b SF12_DS_05.bam -r hs37d5.fa -c 6 -O SF12DS05_chr6 -R SF12DS05_RGs.list

Note: It works when using -L 80.

I don't want to have to reduce the total number of reads in order to run DamMet, for some samples and chromosomes I had to use high read length filters for DamMet to work. This is wasting valuable information. Besides, I didn't find a correlation between BAM file size or depth and the bug. Either regarding computational resources.

I hope something can be done about this, I'm looking forward to use DamMet to infer methylation in my ancient genomes, which I need for further analysis. Please, let me know if you need more details or if I can help somehow.

Thank you very much in advance. And thanks for your work in epiPALEOMIX, gargammel, and DamMet!

Regards,
Katterinne

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.