uds-helms / beclear

Correction of batch effects in DNA methylation data

Home Page: https://bioconductor.org/packages/release/bioc/html/BEclear.html

License: GNU General Public License v3.0

R 90.67% TeX 7.11% C++ 2.22%

bioconductor-package dna-methylation rpackage missing-values batch-effects methylation missing-data latent-factor-model stochastic-gradient-descent

beclear's Issues

Simplify gradient descent function

Simplify the gdepoch and dlossp functions so that only gdepoch is needed and it operates on the whole matrix instead of iterating over each cell of the matrix.
This should also improve performance.

Do this after issue #10

[FEATURE] Bias modelling

Is your feature request related to a problem? Please describe.
At the moment the LFM has to account for all the variation in the data.
Adding a bias that accounts for sample- and feature-specific effects could improve the imputation, because the LFM would then only need to model the interactions between samples and features.

Describe the solution you'd like
As described by Koren et al.
The bias could either simply be the row and column means, or it could also be trained during the GD.

Additional context
It could also be interesting to return the bias to receive some feedback, and possibly for further analyses.
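
The simpler of the two options, a bias built from means, could look roughly like the sketch below. It is written in Python for illustration only and is not BEclear code: the function names are hypothetical, rows are assumed to be features and columns samples, and missing cells are assumed to be NaN. The offsets are plain means of the deviations from the global mean, which is a simplification and not necessarily the exact baseline estimate from Koren et al.

```python
import math


def fit_mean_biases(data):
    """Estimate a global mean plus per-row and per-column offsets.

    `data` is a list of rows; missing cells are float('nan') and are skipped.
    """
    observed = [v for row in data for v in row if not math.isnan(v)]
    mu = sum(observed) / len(observed)

    def mean_offset(values):
        # Mean deviation from the global mean over the observed cells only.
        offsets = [v - mu for v in values if not math.isnan(v)]
        return sum(offsets) / len(offsets) if offsets else 0.0

    row_bias = [mean_offset(row) for row in data]
    col_bias = [mean_offset(col) for col in zip(*data)]
    return mu, row_bias, col_bias


def remove_biases(data, mu, row_bias, col_bias):
    """Return the residual matrix that a latent factor model would be fitted to.

    Missing cells stay NaN because NaN minus anything is NaN.
    """
    return [
        [v - mu - row_bias[i] - col_bias[j] for j, v in enumerate(row)]
        for i, row in enumerate(data)
    ]
```

Returning `mu`, `row_bias` and `col_bias` alongside the imputed matrix would give the feedback mentioned above.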

[FEATURE] Alternating Least Squares (ALS)

Is your feature request related to a problem? Please describe.
ALS is an alternative to GD for solving the LFM. It has some advantages when it comes to parallelisation.

Describe the solution you'd like
As described by Koren et al.

Additional context
See also https://github.com/Livia-Rasp/Raspository/blob/master/R/imputeALS.R
for an untested implementation.
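
To make the idea concrete, here is a deliberately reduced sketch of ALS in Python. It assumes a single latent factor (rank 1) so that each least-squares update has a closed form, treats NaN as missing, and uses hypothetical names; it is neither the linked implementation nor BEclear code. With more than one factor, each update becomes a small regularised linear system per row or column instead.

```python
import math


def als_rank1(data, epochs=50, lam=0.0):
    """Fit data[i][j] ~ left[i] * right[j] on the observed cells only."""
    n_rows, n_cols = len(data), len(data[0])
    left = [1.0] * n_rows
    right = [1.0] * n_cols
    for _ in range(epochs):
        # Fix `right`, solve for each left[i] independently.
        for i in range(n_rows):
            num = den = 0.0
            for j in range(n_cols):
                if not math.isnan(data[i][j]):
                    num += data[i][j] * right[j]
                    den += right[j] ** 2
            left[i] = num / (den + lam) if den + lam > 0 else 0.0
        # Fix `left`, solve for each right[j] independently.
        for j in range(n_cols):
            num = den = 0.0
            for i in range(n_rows):
                if not math.isnan(data[i][j]):
                    num += data[i][j] * left[i]
                    den += left[i] ** 2
            right[j] = num / (den + lam) if den + lam > 0 else 0.0
    return left, right
```

Within each half-step the updates do not depend on each other, which is where the parallelisation advantage comes from.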

Use matrix for calculations of Medians and P-Values

Currently those functions only accept data.tables.

Input of data

A past user asked by email how to use their own data as input.
We could provide an example in a vignette where data is read in from a file.

[FEATURE] After Merging Blocks Continue GD

Is your feature request related to a problem? Please describe.
One possible idea is to split the overall matrix into small blocks first and do GD, then merge them into larger blocks and continue GD.

Describe alternatives you've considered
It is not clear at the moment whether this method is feasible, or whether the latent factors of the blocks can even be merged easily.

Reuse error from the loss function

The error, i.e. the difference D - L*R, is already calculated in the loss function, but is then calculated again during the gradient descent.
Saving the error should save time. Implement this after issue #11.
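
A sketch of what reusing the error could look like, in Python and not taken from BEclear: the error matrix E = D - L*R is computed once per epoch and then feeds both the loss and the matrix-form gradients. The loss form (squared error plus lam times the squared norms of L and R), the function names and the NaN convention for missing cells are all assumptions made for this illustration.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def error_matrix(data, left, right):
    """E = D - L*R, with missing cells (NaN in D) contributing zero error."""
    pred = matmul(left, right)
    return [
        [0.0 if math.isnan(d) else d - p for d, p in zip(d_row, p_row)]
        for d_row, p_row in zip(data, pred)
    ]


def gd_epoch(data, left, right, lam, gamma):
    """One epoch; returns updated factors and the loss before the update."""
    err = error_matrix(data, left, right)  # computed once, used twice below
    loss = sum(e * e for row in err for e in row) + lam * sum(
        v * v for m in (left, right) for row in m for v in row
    )
    grad_left = matmul(err, [list(c) for c in zip(*right)])  # E * R^T
    grad_right = matmul([list(c) for c in zip(*left)], err)  # L^T * E
    new_left = [
        [l + gamma * (2 * g - 2 * lam * l) for l, g in zip(l_row, g_row)]
        for l_row, g_row in zip(left, grad_left)
    ]
    new_right = [
        [r + gamma * (2 * g - 2 * lam * r) for r, g in zip(r_row, g_row)]
        for r_row, g_row in zip(right, grad_right)
    ]
    return new_left, new_right, loss
```

Because the update works on whole matrices, the same structure also avoids iterating over individual cells in the update step.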

calcPositions: change return value

Change return value to a named list to make it more readable

Use data.table in data imputation

Use data.table for the block matrices to improve performance, as a lot of the runtime seems to be lost to matrix accesses.

In median_batch - median_others : longer object length is not a multiple of shorter object length

Error with bigger dataset:

In median_batch - median_others :
longer object length is not a multiple of shorter object length

ks.test - not enough 'x' data

The calculation of the p-values returned:
ks.test - not enough 'x' data
for some users. This is most probably because of NAs already present in their matrix.

[FEATURE] Dixon Test for Outlier Detection

Is your feature request related to a problem? Please describe.
When looking at a group of batches and their BEscores, it can be of interest to find which batches are outliers with regard to their BEscore.

Describe the solution you'd like
As described by Akulenko et al.

Describe alternatives you've considered
Maybe use another package for outlier detection
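
For orientation, the classic Dixon Q statistic (gap to the nearest neighbour divided by the full range) can be sketched as below. This is an illustrative Python sketch with hypothetical names, not BEclear code, and it may differ from the exact variant used by Akulenko et al. The critical value is passed in by the caller and has to come from a published Q table for the given number of batches and confidence level.

```python
def dixon_q(values):
    """Return (q_low, q_high): Q statistics for the smallest and largest value."""
    xs = sorted(values)
    if len(xs) < 3:
        raise ValueError("Dixon's Q test needs at least three values")
    spread = xs[-1] - xs[0]
    if spread == 0:
        return 0.0, 0.0  # all values identical, nothing can be an outlier
    return (xs[1] - xs[0]) / spread, (xs[-1] - xs[-2]) / spread


def flag_outlier_batch(scores, q_crit):
    """Return the id of the most extreme batch if its Q exceeds q_crit, else None.

    `scores` maps a batch id to its BEscore.
    """
    ordered = sorted(scores, key=scores.get)
    q_low, q_high = dixon_q(scores.values())
    if max(q_low, q_high) <= q_crit:
        return None
    return ordered[-1] if q_high >= q_low else ordered[0]


# Hypothetical BEscores; 0.625 is the commonly tabulated 95% critical value
# for six observations and should be checked against a reference table.
scores = {"b1": 0.10, "b2": 0.12, "b3": 0.11, "b4": 0.13, "b5": 0.12, "b6": 0.90}
suspect = flag_outlier_batch(scores, q_crit=0.625)
```

The test only judges one extreme value at a time, so finding several outlying batches would need repeated application or a different method.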

Add more checks at the beginning of the functions

Check for malformed data, for example. This could help with issues like issue #6.
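
As an illustration of the kind of checks meant here, the following Python sketch collects problems up front instead of failing deep inside a calculation. The specific checks, the argument layout (a matrix with rows as features, a list of sample ids, a mapping from sample id to batch id) and all names are assumptions for this example, not BEclear's interface.

```python
import math


def check_input(data, row_ids, sample_ids, batch_of):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(data) != len(row_ids):
        problems.append("number of rows does not match number of row ids")
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("duplicated sample ids")
    for row_id, row in zip(row_ids, data):
        if len(row) != len(sample_ids):
            problems.append(f"row {row_id} has {len(row)} values, expected {len(sample_ids)}")
        elif all(math.isnan(v) for v in row):
            problems.append(f"row {row_id} contains only missing values")
    unassigned = [s for s in sample_ids if s not in batch_of]
    if unassigned:
        problems.append("samples without batch: " + ", ".join(unassigned))
    return problems
```

Reporting every problem at once, instead of stopping at the first, tends to make user reports easier to act on.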

[FEATURE] Testing BEclear on other data-sets

Is your feature request related to a problem? Please describe.
For now BEclear has only been tested on methylation data, but it could possibly also be applied to other kinds of data.

Limit memory during calcMedian

Memory usage during calcMedians is high, about 5 times the size of the actual input data.
This is probably caused by temporary copies of the data matrix and could probably be avoided by using data.table.

[FEATURE] Test For Convergence In Epochs

Is your feature request related to a problem? Please describe.
During the GD, exactly as many epochs are executed as predefined. It might, however, speed up the computation if the method stopped after convergence.

Describe the solution you'd like
Define a threshold for convergence and then test for it during each epoch.

Additional context
Return the loss and some information about its convergence, so the user gets confirmation of whether it converges at all.
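
One possible shape for this, sketched in Python with hypothetical names: a driver that runs epochs until the loss changes by less than a threshold, and returns the loss history together with a convergence flag. The toy objective below stands in for the real loss and is only there to make the sketch runnable; the stopping rule (absolute change between consecutive epochs) is one assumption among several reasonable choices.

```python
def run_until_converged(epoch_fn, max_epochs, threshold):
    """Call epoch_fn() up to max_epochs times; each call returns the current loss.

    Stops early once two consecutive losses differ by less than `threshold`.
    Returns (loss_history, converged).
    """
    history = []
    for _ in range(max_epochs):
        history.append(epoch_fn())
        if len(history) > 1 and abs(history[-2] - history[-1]) < threshold:
            return history, True
    return history, False


# Toy stand-in for one GD epoch: minimise (x - 3)^2 with a fixed step size.
state = {"x": 0.0}


def toy_epoch():
    state["x"] -= 0.25 * 2 * (state["x"] - 3)
    return (state["x"] - 3) ** 2


history, converged = run_until_converged(toy_epoch, max_epochs=100, threshold=1e-6)
```

Returning `history` lets the user plot or inspect the loss, and `converged` tells them whether the epoch limit was hit first.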

Error in serialize(data, node$con, xdr = FALSE)

Error with bigger dataset:

Error in serialize(data, node$con, xdr = FALSE) :
error writing to connection
Error in unserialize(node$con) :
embedded nul in string: 'B\n\002\0\0\0\001\003\003\0'
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal

Extract preprocessing

Extract preprocessing (finding rows with only NA's etc) and make it optional.

[FEATURE] Don't Save Temporary Results By Default

Is your feature request related to a problem? Please describe.
By default, the imputation in BEclear saves the solution for each block to disk. Afterwards they are loaded and merged again.
However, this doesn't help much with memory consumption, and not saving them to disk could improve run time.

Describe the solution you'd like
Make saving the temporary solutions optional.

[FEATURE] Replace for loops in calcSummary and calcScore functions

Is your feature request related to a problem? Please describe.
Those two functions are right now implemented with for loops instead of apply functions or clever usage of data.table features. This makes them unnecessarily slow.

Describe the solution you'd like
Replace them with straightforward use of data.table functions.

Describe alternatives you've considered
Using lapply instead.

test_localLoss is and js

Change the i's and j's of the long and wide matrices

[FEATURE] Treating Data-Sets Without Batches

Is your feature request related to a problem? Please describe.
Even though BEclear is primarily intended for detecting and correcting batch effects, it could be sensible to provide some way to work with data where no batches are defined.

Describe the solution you'd like
Just test each sample against all other samples, i.e. treat each sample as a batch.

Describe alternatives you've considered
Define batches de novo by e.g. clustering the samples.
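
The first option, treating each sample as its own batch, can be pictured with the Python sketch below. It is illustrative only and not BEclear code: names are hypothetical, rows are assumed to be features, there are assumed to be at least two samples, and no values are missing. With one sample per batch, the "batch median" for a feature is just that sample's value, which is compared against the median of all other samples.

```python
from statistics import median


def one_batch_per_sample(sample_ids):
    """Batch assignment in which every sample forms its own batch."""
    return {sample_id: sample_id for sample_id in sample_ids}


def median_differences(data, sample_ids):
    """Per sample, the per-feature difference between its value and the
    median of all other samples. Columns of `data` follow `sample_ids`."""
    diffs = {}
    for j, sample_id in enumerate(sample_ids):
        diffs[sample_id] = [
            row[j] - median(row[:j] + row[j + 1:]) for row in data
        ]
    return diffs
```

Note that this produces as many "batches" as samples, so any downstream step that scales with the number of batches becomes correspondingly more expensive.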