uds-helms / beclear
Correction of batch effects in DNA methylation data
Home Page: https://bioconductor.org/packages/release/bioc/html/BEclear.html
License: GNU General Public License v3.0
Simplify the gdepoch and dlossp functions so that only gdepoch is used and it operates on the whole matrix instead of iterating over each cell.
This should improve the performance as well.
Do this after issue #10
Is your feature request related to a problem? Please describe.
At the moment the LFM has to account for all the variation in the data.
It could however improve the data imputation to add a bias, which accounts for sample and feature specific effects so that the LFM would only need to account for the effect of the interactions of samples and features.
Describe the solution you'd like
As described by Koren et al
Bias could either just be row and column means or also be trained during the GD
Additional context
It could also be interesting to return the bias to receive some feedback, and possibly for further analyses.
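The simpler of the two options above (bias as row and column means) could look roughly like the following sketch. It is written in plain Python rather than R, represents missing cells as `None`, and the function names `baseline_biases` and `residuals` are hypothetical, not part of BEclear. The LFM would then be fitted to the residuals only.

```python
def baseline_biases(matrix):
    """Return (global_mean, row_bias, col_bias) from the observed cells.

    Missing cells are None. Each bias is the mean deviation of the
    observed cells in that row or column from the global mean.
    """
    observed = [v for row in matrix for v in row if v is not None]
    mu = sum(observed) / len(observed)

    def mean_deviation(values):
        devs = [v - mu for v in values if v is not None]
        return sum(devs) / len(devs) if devs else 0.0

    row_bias = [mean_deviation(row) for row in matrix]
    col_bias = [mean_deviation(col) for col in zip(*matrix)]
    return mu, row_bias, col_bias


def residuals(matrix, mu, row_bias, col_bias):
    """Subtract the baseline so only interaction effects remain to be modelled."""
    return [
        [None if v is None else v - mu - row_bias[i] - col_bias[j]
         for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


data = [[0.2, 0.4, None],
        [0.6, 0.8, 0.7]]
mu, row_bias, col_bias = baseline_biases(data)
res = residuals(data, mu, row_bias, col_bias)
```

Returning `mu`, `row_bias` and `col_bias` alongside the imputed matrix would also cover the feedback idea mentioned above.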
Is your feature request related to a problem? Please describe.
ALS is an alternative to GD for solving the LFM. It has some advantages when it comes to parallelisation.
Describe the solution you'd like
As described by Koren et al
Additional context
See also https://github.com/Livia-Rasp/Raspository/blob/master/R/imputeALS.R
for an untested implementation.
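To illustrate why ALS parallelises well, here is a deliberately simplified sketch: rank one only, plain Python, missing cells as `None`. It is not the linked implementation and not BEclear code; the names `als_rank_one` and `objective` and the regularisation weight `lam` are assumptions. With one factor fixed, each entry of the other factor has a closed-form ridge solution that does not depend on the other entries, so the inner loops could run in parallel.

```python
def objective(D, l, r, lam):
    """Regularised squared error over the observed cells."""
    err = sum((D[i][j] - l[i] * r[j]) ** 2
              for i in range(len(D)) for j in range(len(D[0]))
              if D[i][j] is not None)
    return err + lam * (sum(x * x for x in l) + sum(x * x for x in r))


def als_rank_one(D, epochs=20, lam=0.1):
    """Alternate between exact updates of the row and column factors."""
    n, m = len(D), len(D[0])
    l, r = [1.0] * n, [1.0] * m
    history = []
    for _ in range(epochs):
        # Row factors: each l[i] depends only on row i and the fixed r.
        for i in range(n):
            obs = [j for j in range(m) if D[i][j] is not None]
            l[i] = (sum(D[i][j] * r[j] for j in obs)
                    / (lam + sum(r[j] ** 2 for j in obs)))
        # Column factors: each r[j] depends only on column j and the fixed l.
        for j in range(m):
            obs = [i for i in range(n) if D[i][j] is not None]
            r[j] = (sum(D[i][j] * l[i] for i in obs)
                    / (lam + sum(l[i] ** 2 for i in obs)))
        history.append(objective(D, l, r, lam))
    return l, r, history


l, r, history = als_rank_one([[0.2, 0.4, None],
                              [0.6, 0.8, 0.7]])
```

Because every update is an exact minimiser of the regularised objective for that entry, the recorded objective should never increase between epochs.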
Now those functions only accept data.tables
One past user asked by email how to use their "own" data as input.
We could probably provide an example in a vignette where the data is read in from a file.
Is your feature request related to a problem? Please describe.
One possible idea is to split the overall matrix into small blocks first and do GD, then merge them into larger blocks and continue GD.
Describe alternatives you've considered
It is not clear at the moment whether this method is feasible, or whether the latent factors of the blocks can even be merged easily.
The error, i.e. the difference D - L*R, is already calculated in the loss function, but is then calculated again during the gradient descent.
Saving the error should save time. Implement this after issue #11
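A minimal sketch of the idea, in plain Python rather than the package's R code: the error matrix is computed once per epoch and then feeds both the loss and the gradient step. The function names, the `None` convention for missing cells, and the parameters `lam` (regularisation) and `gamma` (learning rate) are assumptions for illustration.

```python
def error_matrix(D, L, R):
    """E = D - L*R on observed cells; missing cells (None) contribute 0."""
    k = len(R)
    return [
        [0.0 if D[i][j] is None
         else D[i][j] - sum(L[i][f] * R[f][j] for f in range(k))
         for j in range(len(D[0]))]
        for i in range(len(D))
    ]


def loss_from_error(E, L, R, lam):
    """Half the squared error plus half the L2 penalty on both factors."""
    sq = sum(e * e for row in E for e in row)
    reg = sum(x * x for row in L for x in row) + sum(x * x for row in R for x in row)
    return 0.5 * sq + 0.5 * lam * reg


def gd_epoch(D, L, R, lam, gamma):
    """One gradient step; returns the new factors and the loss before the step."""
    E = error_matrix(D, L, R)            # computed once ...
    loss = loss_from_error(E, L, R, lam)  # ... used for the loss ...
    n, m, k = len(D), len(D[0]), len(R)
    # ... and reused for both gradients.
    new_L = [[L[i][f] + gamma * (sum(E[i][j] * R[f][j] for j in range(m)) - lam * L[i][f])
              for f in range(k)] for i in range(n)]
    new_R = [[R[f][j] + gamma * (sum(E[i][j] * L[i][f] for i in range(n)) - lam * R[f][j])
              for j in range(m)] for f in range(k)]
    return new_L, new_R, loss


D = [[0.2, 0.4, None], [0.6, 0.8, 0.7]]
L, R = [[0.5], [0.5]], [[0.5, 0.5, 0.5]]
L, R, loss = gd_epoch(D, L, R, lam=0.01, gamma=0.05)
```

Note that the loss returned here belongs to the factors before the update, which is the price of sharing the error matrix within one epoch.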
Change return value to a named list to make it more readable
Use data.table for the block matrices to improve performance, as a lot of the runtime seems to be lost to matrix accesses.
Error with bigger dataset:
In median_batch - median_others :
longer object length is not a multiple of shorter object length
The calculation of the p-values returned:
ks.test - not enough 'x' data
for some users. This is most probably caused by NAs already present in their matrix.
Is your feature request related to a problem? Please describe.
When looking at a group of batches and their BEscores, it can be of interest to find which batches are outliers with regard to their BEscore.
Describe the solution you'd like
As described by Akulenko et al.
Describe alternatives you've considered
Maybe use another package for outlier detection
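The method from Akulenko et al. is not reproduced here. Purely as a stand-in to show the shape such a helper could take, the sketch below flags batches whose score falls outside Tukey's IQR fences. It is plain Python; the function name `find_outlier_batches`, the batch labels and the scores are made up for illustration.

```python
import statistics


def find_outlier_batches(scores, k=1.5):
    """Return batch ids whose score lies outside [Q1 - k*IQR, Q3 + k*IQR].

    `scores` maps a batch id to its score. This is Tukey's rule, used
    here as a substitute technique, not the published method.
    """
    q1, _, q3 = statistics.quantiles(scores.values(), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [batch for batch, s in scores.items() if s < low or s > high]


scores = {"b1": 0.10, "b2": 0.12, "b3": 0.11, "b4": 0.13, "b5": 0.90}
outliers = find_outlier_batches(scores)
```

With only a handful of batches, quartile-based fences are coarse, so a dedicated small-sample outlier test may be the better choice in practice.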
Check for malformatted data e.g. This could help with issues like issue #6
Is your feature request related to a problem? Please describe.
For now BEclear has only been tested on methylation data, but it could possibly also be applied to other kinds of data.
High memory usage, about 5 times higher than the actual input data, during the usage of calcMedians.
This is probably because of temporary copies of the data matrix and could probably be avoided using data.table.
Is your feature request related to a problem? Please describe.
During the GD, exactly as many epochs are executed as predefined. It might, however, speed up the computation if the method stopped after convergence.
Describe the solution you'd like
Define a threshold for convergence and then test for it during each epoch.
Additional context
Return the loss and some information about its convergence to confirm to the user whether it converges at all.
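A sketch of how the threshold test and the returned convergence information could fit together, in plain Python. The driver `run_until_converged`, its parameter names and the toy step function are hypothetical; in the package the step would be one GD epoch over the factor matrices.

```python
def run_until_converged(step, state, max_epochs=1000, threshold=1e-8):
    """Call `step` until the loss changes by less than `threshold`.

    `step(state)` must return (new_state, loss). Returns the final state,
    the loss history, and a flag telling whether the threshold was reached
    before `max_epochs` ran out.
    """
    losses = []
    for _ in range(max_epochs):
        state, loss = step(state)
        converged = bool(losses) and abs(losses[-1] - loss) < threshold
        losses.append(loss)
        if converged:
            return state, losses, True
    return state, losses, False


def toy_step(x, gamma=0.1):
    """One gradient step on f(x) = (x - 3)^2, standing in for a GD epoch."""
    loss = (x - 3.0) ** 2
    return x - gamma * 2.0 * (x - 3.0), loss


x, losses, converged = run_until_converged(toy_step, 0.0)
```

Returning the loss history and the flag lets the caller distinguish a run that converged early from one that simply exhausted its epochs.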
Error with bigger dataset:
Error in serialize(data, node$con, xdr = FALSE) :
error writing to connection
Error in unserialize(node$con) :
embedded nul in string: 'B\n\002\0\0\0\001\003\003\0'
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Extract preprocessing (finding rows with only NA's etc) and make it optional.
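One possible shape for an extracted, optional preprocessing step, sketched in plain Python with missing values as `None`. The function name `drop_all_missing_rows` and the `enabled` switch are assumptions, and only the "rows with only NAs" case named above is covered.

```python
def drop_all_missing_rows(matrix, row_names, enabled=True):
    """Remove rows that contain only missing values (None).

    Returns (kept_matrix, kept_names, removed_names). With enabled=False
    the input is passed through unchanged, which makes the step optional.
    """
    if not enabled:
        return matrix, list(row_names), []
    kept_matrix, kept_names, removed_names = [], [], []
    for name, row in zip(row_names, matrix):
        if all(v is None for v in row):
            removed_names.append(name)
        else:
            kept_matrix.append(row)
            kept_names.append(name)
    return kept_matrix, kept_names, removed_names


matrix = [[0.2, None], [None, None], [0.5, 0.7]]
kept, names, removed = drop_all_missing_rows(matrix, ["cg1", "cg2", "cg3"])
```

Reporting the removed row names keeps the step transparent for users who would rather handle such rows themselves.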
Is your feature request related to a problem? Please describe.
The imputation of BEclear saves on the disk the solution for each block by default. Afterwards they are loaded and merged again.
This, however, doesn't help much with memory consumption, and not saving them to disk could improve run time.
Describe the solution you'd like
Make saving the temporary solutions optional.
Is your feature request related to a problem? Please describe.
Those two functions are right now implemented with for loops instead of apply functions or clever usage of data.table features. This makes them unnecessarily slow.
Describe the solution you'd like
Replace them with straightforward use of data.table functions.
Describe alternatives you've considered
Using lapply instead.
Change the is and js of long and wide matrices
Is your feature request related to a problem? Please describe.
Even though BEclear is primarily intended for detecting and correcting batch effects, it could be sensible to provide some way to work with data where no batches are defined.
Describe the solution you'd like
Just test each sample against all other samples, i.e. treat each sample as a batch.
Describe alternatives you've considered
Define batches de novo by e.g. clustering the samples.
Some users might want to set the threshold differently.
The calculation of the medians takes a long time on big datasets.
Maybe I could find a better implementation through Rcpp reverse imports.
Write at least one test case with help of testthat for the gdepoch function