
dsBase's Introduction

dsBase

DataSHIELD server-side base R library.

Build-status badges (dsBase status, dsBaseClient status, dsBaseClient tests) are published for each branch: Master, 6.0, 6.0.1, 6.1, 6.1.1, 6.2 and 6.3.

License

About

DataSHIELD is a software package that allows you to do non-disclosive federated analysis on sensitive data. Our website (https://www.datashield.org) has in-depth descriptions of what it is, how it works and how to install it. A key point to highlight is that DataSHIELD has a client-server infrastructure, so the dsBase package (https://github.com/datashield/dsBase) needs to be used in conjunction with the dsBaseClient package (https://github.com/datashield/dsBaseClient); using one without the other makes no sense.

Detailed instructions on how to install DataSHIELD are at https://www.datashield.org/wiki. The code here is organised as:

Location: What is it?
obiba CRAN: Where you should probably install DataSHIELD from.
releases: Stable releases.
master branch: Mostly in sync with the latest release; changes rarely.

Contributors

agaye, alexwesterberg, beccawilson, datashield-testing, davraam, escri11, leireabarrategui, ollybutters, pb51, stuartwheater, tombisho, ymarcon


dsBase's Issues

Is there a way to manually validate requests?

Hey, thanks a lot for the great work here!

I wonder if there is a way to flag certain users so that their requests are manually validated by an administrator?

As far as I understand: I can allow a specific set of functions (whether aggregate or not) for a user A, and then A can run them as much as they want, possibly on the dataset or on a subset of it. Is this correct?

Thanks a lot for your help!

permissive mode and booleDS, c and rep

I am currently revisiting some of the disclosure attacks and was wondering about the functions booleDS, c and rep, which don't seem to be included in the list of functions to be blocked with the introduction of permissive mode.

booleDS can be used for the same purpose as subsetting, because you can perform a logical test (e.g. on an index) to mark a single row with a 1 and the rest with zeros. You can then multiply, take the mean, etc., to recover the value.
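
A plain-R sketch of this attack with made-up values; in DataSHIELD the equivalent steps would run server side via ds.assign() and ds.mean():

x <- c(5.2, 6.1, 4.8, 7.0)              # the sensitive column, length N = 4
mask <- as.numeric(seq_along(x) == 3)   # logical test on an index: c(0, 0, 1, 0)
mean(x * mask) * length(x)              # mean * N recovers x[3], i.e. 4.8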

The c and rep functions are normally natively available via ds.assign:

ds.assign(toAssign = "c(D$LAB_TSC,D$LAB_TSC)", newobj = "fact1")

ds.assign(toAssign = "c(1,rep(0,4127))", newobj = "fact1")

These allow the same tricks as cDS and repDS.

Add in concept of "concentration" for disclosure control

DataSHIELD doesn't currently appear to have the concept of concentration as one of the disclosure controls.

The idea is to limit the proportion of a statistic that can be contributed by a single value in the set of values being sampled. In simple terms, if we have the numbers 0.1, 0.2, 0.3, 0.5, 4e6, 0.6, 0.5, then we should block the mean because one value dominates and is therefore disclosive. At the moment, this passes the standard nfilter.tab test.

The limit could be set so that no single value may account for more than 0.9 of the statistic.

The first functions where this will be implemented are ds.mean() and similar. One of the attack modes is to create a vector of all 0s except a single 1, multiply it with the column of interest and take the mean. Knowing the length allows a value to be recreated, and moving the 1 allows all values to be recreated. This change will stop this attack.
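
A minimal sketch of such a concentration check, assuming a 0.9 threshold (the function name is hypothetical, not part of dsBase):

# Hypothetical concentration trap: block a statistic when one value
# contributes more than `threshold` of the (absolute) total.
concentration.ok <- function(x, threshold = 0.9) {
  contribution <- abs(x) / sum(abs(x))
  max(contribution) <= threshold
}

concentration.ok(c(0.1, 0.2, 0.3, 0.5, 4e6, 0.6, 0.5))  # FALSE: 4e6 dominates, so block the mean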

This control will not help with other differencing attacks (as per Stefan's work).

lines #65-#80 in glmSLMADS1 are not needed and cause errors in v6.0

In glmSLMADS1 we have the lines:

# Remember: model.variables and then varnames INCLUDE BOTH yvect AND linear predictor components
model.variables <- unlist(strsplit(formulatext, split="|", fixed=TRUE))

varnames <- c()
for(i in 1:length(model.variables)){
  elt <- unlist(strsplit(model.variables[i], split="$", fixed=TRUE))
  if(length(elt) > 1){
    assign(elt[length(elt)], eval(parse(text=model.variables[i])))
    originalFormula <- gsub(model.variables[i], elt[length(elt)], originalFormula, fixed=TRUE)
    varnames <- append(varnames, elt[length(elt)])
  }else{
    varnames <- append(varnames, elt)
  }
}

varnames <- unique(varnames)

The variables varnames and originalFormula are not used in the rest of the function. Furthermore, the eval(parse(...)) expression throws an error under v6.0-dev using DSLite, because it asks for objects from the D data frame in the parent.frame. Seeing as the whole code block is unused, I propose removing it.

mg object failed to be created when executing dsBase::glmerSLMADS2

As I was carrying out some analysis using the command ds.glmerSLMA I received a DataSHIELD error saying that the object mg from the function dsBase::glmerSLMADS2 was not found.
When I checked the source code of dsBase::glmerSLMADS2 on GitHub, I realized that this object is created within a try() at line 392 (see code below) of this file: https://github.com/datashield/dsBase/blob/master/R/glmerSLMADS2.R

At line 392: iterations <- utils::capture.output(try(mg <- lme4::glmer(formula2use, offset=offset, weights=weights, data=dataDF,
    family = family, nAGQ=nAGQ, verbose = verbose, control=control.obj, start = start)))

It seems that in my case the function fails to create the mg object inside the try.

Datashield error Message I got:
Error while evaluating 'dsBase::glmerSLMADS2(flag_pavk~sex_cd+age+flag_nicotine+flag_aat+yyy1xxxpatient_numzzz, NULL, NULL, "D", "binomial", NULL, NULL, 1L, 0, NULL, NULL)' -> Error in summary(mg) : object 'mg' not found\n"

Can I have more details on when or why the try can fail to create the mg object?

Thanks in advance for your explanations.

lines #90-#157 in glmSLMADS2 can probably be replaced with 4 lines of code and prevent errors for DSLite in v6.0

In glmSLMADS2, we have the following code:

# Rewrite formula extracting variables nested in structures like data frame or list
# (e.g. D$A~D$B will be re-written A~B)
# Note final product is a list of the variables in the model (yvector and covariates)
# it is NOT a list of model terms - these are derived later

# Convert formula into an editable character string
formulatext <- Reduce(paste, deparse(formula))

# First save original model formula
originalFormula <- formulatext

# Convert formula string into separate variable names split by |
formulatext <- gsub(" ", "", formulatext, fixed=TRUE)
formulatext <- gsub("~", "|", formulatext, fixed=TRUE)
formulatext <- gsub("+", "|", formulatext, fixed=TRUE)
formulatext <- gsub("*", "|", formulatext, fixed=TRUE)
formulatext <- gsub("||", "|", formulatext, fixed=TRUE)

# Remember: model.variables and then varnames INCLUDE BOTH yvect AND linear predictor components
model.variables <- unlist(strsplit(formulatext, split="|", fixed=TRUE))

varnames <- c()
for(i in 1:length(model.variables)){
  elt <- unlist(strsplit(model.variables[i], split="$", fixed=TRUE))
  if(length(elt) > 1){
    assign(elt[length(elt)], eval(parse(text=model.variables[i])))
    originalFormula.modified <- gsub(model.variables[i], elt[length(elt)], originalFormula, fixed=TRUE)
    varnames <- append(varnames, elt[length(elt)])
  }else{
    varnames <- append(varnames, elt)
  }
}
varnames <- unique(varnames)

if(!is.null(dataName)){
  for(v in 1:length(varnames)){
    varnames[v] <- paste0(dataName, "$", varnames[v])

    test.string.0 <- paste0(dataName, "$", "0")
    test.string.1 <- paste0(dataName, "$", "1")
    if(varnames[v]==test.string.0) varnames[v] <- "0"
    if(varnames[v]==test.string.1) varnames[v] <- "1"
  }
  cbindraw.text <- paste0("cbind(", paste(varnames, collapse=","), ")")
}else{
  cbindraw.text <- paste0("cbind(", paste(varnames, collapse=","), ")")
}

# Identify and use variable names to count missings
all.data <- eval(parse(text=cbindraw.text))

Ntotal <- dim(all.data)[1]

nomiss.any <- stats::complete.cases(all.data)
nomiss.any.data <- all.data[nomiss.any,]
N.nomiss.any <- dim(nomiss.any.data)[1]

Nvalid <- N.nomiss.any
Nmissing <- Ntotal - Nvalid

formula2use <- stats::as.formula(paste0(Reduce(paste, deparse(originalFormula)))) # here we need the formula as a 'call' object

Briefly, the code takes the formula, extracts the variables, builds a matrix of them, and counts the rows with a missing value. The net result is that we have:

Ntotal: the number of rows of data in the model
Nvalid: the number of rows of data in the model without missings
Nmissing: the difference between the two numbers above
formula2use: exactly the same as the input formula variable

The N* variables are just returned at the end and are not used for disclosure checks etc. The same is true of varnames: it is not used anywhere else. These values can simply be obtained by doing the following after fitting the model:

Nvalid <- length(mg$residuals)
Nmissing <- length(mg$na.action)
Ntotal <- Nvalid + Nmissing

and to get the formula we can just do

formula2use <- formula

These changes also prevent some potential difficulties with environments when using DSLite, as well as making the code simpler.

Thoughts about disclosure traps for repDS, rbindDS and cDS

Attacks that isolate or duplicate particular rows rely on creating some kind of index vector. Modifications to repDS, rbindDS and cDS might help protect against this style of attack while allowing legitimate use to continue without issue.

The proposals are:

  1. Add concentration traps to rbindDS and cDS to stop vectors like (1,0,0,0,0,0,0,0,0) that can be used to isolate rows.
  2. Stop the generation of vectors of unique values (i.e. an index) using repDS; a sketch of such a trap follows below. We need to check whether there is a legitimate reason for doing this. Note that this is not a complete solution: it is still possible to generate a few vectors whose rows aren't unique and then combine them with arithmetic operations that yield a vector with unique rows.

The next step might be to stop arithmetic that results in a vector of unique values, but that would also block genuine usage.
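
A minimal sketch of the uniqueness trap mentioned in point 2, assuming it would sit inside repDS (the function name and error message are hypothetical):

# Hypothetical trap: refuse to return a vector whose elements are all
# unique, since such a vector can act as a row index.
block.if.unique <- function(x) {
  if (length(x) > 1 && !any(duplicated(x))) {
    stop("FAILED: result would be a vector of unique values (potential index)", call. = FALSE)
  }
  x
}

block.if.unique(rep(c(0, 1), times = 4))  # allowed: values repeat
# block.if.unique(seq_len(4128))          # would be blocked: every value is unique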

Installing DataSHIELD packages fails

Hi,
I'm trying to install DataSHIELD through the Opal web admin interface. The process seems to die silently. I see this in the logs:

ERROR org.obiba.opal.r.rserve.RserveService - DataShield packages properties extraction failed
org.obiba.opal.spi.r.RRuntimeException: Error in strsplit(pp, ",") : non-character argument

I can install/compile all the DataSHIELD packages and dependencies through R. They appear among the packages in the R administration page in Opal, but not in the DataSHIELD admin page. The packages are physically located in either /usr/lib64/R/library or /var/lib/rserver/R/library.

I'm on CentOS 7, using Opal 4.1.3.

Any clue?

offsets and weights don't work in v6.0-dev for glmSLMA (and therefore for the lmerSLMA functions too)

In v5.1 offsets and weights work as expected for glmSLMA

In v6.0-dev, the following error is returned:

invalid type (closure) for variable '(offset)'

I believe this is related to the modification of line 167 onwards:

if(!(is.null(offset))){
  cbindtext.offset <- paste0("cbind(", offset, ")")
  offset <- eval(parse(text=cbindtext.offset))
}

whereas in v6.0 env = parent.frame() has been added to the eval() call.

I will attempt to figure out what is going on, although I don't have a good understanding of how R environments work with DSI. My guess is that the offset vector is in the 'wrong' environment to be accessed by the glm model...
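
For reference, my reading of the v6.0 change is roughly the following (a sketch, not the actual committed code):

# In v6.0 the offset expression is evaluated in the caller's environment,
# which is where the server-side symbols live under DSLite.
if (!(is.null(offset))) {
  cbindtext.offset <- paste0("cbind(", offset, ")")
  offset <- eval(parse(text = cbindtext.offset), envir = parent.frame())
}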

lsDS does not work with DSLite

lsDS assumes that the symbols live in the R global environment. In DSLite this is not the case: the server symbols are in the parent.frame of the function call, and R's GlobalEnv is that of the client. For comparison, see the DSI::datashield.symbols() output.

To reproduce:

library(DSLite)
library(dsBaseClient)

# prepare DSLite server
data("CNSIM1")
dslite.server <- newDSLiteServer(tables=list(CNSIM1=CNSIM1))

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",  url = "dslite.server",
               table = "CNSIM1", driver = "DSLiteDriver")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata, assign = TRUE)
ds.ls()
DSI::datashield.symbols(conns)
datashield.logout(conns)

Output:

> ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "builder"       "CNSIM1"        "conns"         "dslite.server" "logindata"    

> DSI::datashield.symbols(conns)
$study1
[1] "D"

subsetByClass and friends

Hello,

I think I found a problem in the subsetByClassHelper functions. Either that or I'm missing something obvious. Here's a little test case:

opals <- datashield.login(logindata)
# a factor with 7 levels:
datashield.assign(opals, 'fact', quote(rep(c('a','b', 'c', 'd', 'e', 'f', 'g'),10)))
ds.asFactor('fact', 'fact')
#check:
ds.levels('fact')
ds.subsetByClass('fact')
ds.summary('subClasses')

And the result:

....
[1] "fact.level_a_EMPTY" "fact.level_b_EMPTY" "fact.level_c_EMPTY" "fact.level_d_EMPTY" "fact.level_e_EMPTY" "fact.level_f_EMPTY" "fact.level_g_EMPTY"

I had a quick look in the code and I see this in all subsetByClassHelpers:

...
        for (j in 1:length(categories)) {
            indices <- which(var == as.numeric(categories[j]))
...

Why as.numeric? I didn't look too closely, but if I remove as.numeric it seems to work: it populates the subsets. Am I missing something here about how R handles factors?
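
For reference, a minimal illustration of what I think goes wrong (my understanding, not verified against the full helper code):

f <- factor(rep(c("a", "b"), 5))
categories <- levels(f)                 # c("a", "b")
as.numeric(categories[1])               # NA with a coercion warning: "a" is not numeric
which(f == as.numeric(categories[1]))   # integer(0), so every subset comes out empty
which(f == categories[1])               # 1 3 5 7 9: comparing against the label works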

Thanks,
Iulian

Subset with Columns and Complete Cases does not perform as expected because it performs a complete cases subset first.

Hello,

I am not sure if this is an intended pattern, but it's something I didn't expect. I also don't know whether the new release already addresses the issue, but in case it doesn't I don't want to forget about this one.

As part of getting descriptive summaries of the number of participants I currently have available in a subsetted data frame, I run:

Example:

# D is a 100 row dataframe with columns a,b,c
# a has 100 values
# b has 87 values, the rest are missing
# c has 1 value, the rest are missing
ds.subset(x = 'D', subset = 'D2', cols = c('a','b'), completeCases = TRUE)

I would expect D2 to have 87 rows, because column b is the limiting column with 87 values.

However:

ds.length('D2$a')
# returns 1

The complete-cases directive was run on data frame 'D' first, and the result was then subsetted to columns a and b; because of column c, only 1 row of 'D' had complete cases.

This is verified in the server-side function https://github.com/datashield/dsBase/blob/master/R/subsetDS.R, lines 48-58, where the completeCases parameter is handled by taking complete cases of the data frame that is about to be subsetted into columns.

This is a problem: we often work with large 'D' data frames with many columns that have varying amounts of missing values. If the completeCases parameter always shortens the data frame to the least-defined variable, we could miss out on a lot of information when we perform analyses with the subset. This is particularly apparent when we are working with variables whose harmonisations have not been finalised or that have very few values by design.

I suggest either changing those few lines so that the column subset occurs before the complete casing (a sketch follows below), or changing the documentation of the ds.subset function so that this behaviour is clear. Should I make a pull request for this?
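
A minimal sketch of the proposed reordering, using made-up data shaped like the example above:

# Made-up data frame matching the example: a is full, b has 87 values, c has 1.
D <- data.frame(a = rnorm(100),
                b = c(rnorm(87), rep(NA, 13)),
                c = c(rnorm(1), rep(NA, 99)))

# Current behaviour (simplified): complete cases taken on the full frame first
nrow(D[stats::complete.cases(D), c("a", "b")])   # 1, limited by column c

# Proposed behaviour: subset the columns first, then take complete cases
D.cols <- D[, c("a", "b")]
nrow(D.cols[stats::complete.cases(D.cols), ])    # 87, limited by column b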

Cheers,
Paul

subsetDS

I think an additional check is needed in subsetDS. At the moment we check that, if subsetting by rows is used, the resulting subset contains more than, say, 5 rows. However, we should probably also use that threshold to enforce that more than 5 rows are removed. Otherwise you can take the mean, remove 1 row, retake the mean, and work out individual values.
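
A minimal sketch of such a two-sided check, with the threshold hard-coded as 5 for illustration (in dsBase it would come from the relevant nfilter setting):

# Hypothetical two-sided trap: block subsets that keep too few rows AND
# subsets that remove too few rows (which enables a differencing attack).
check.subset.size <- function(n.original, n.subset, threshold = 5) {
  n.removed <- n.original - n.subset
  if (n.subset <= threshold) stop("FAILED: subset has too few rows")
  if (n.removed > 0 && n.removed <= threshold) stop("FAILED: too few rows removed")
  invisible(TRUE)
}

check.subset.size(100, 80)        # passes: 80 rows kept, 20 removed
try(check.subset.size(100, 99))   # blocked: removing 1 row allows mean differencing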

how to run a code profiler for heavy processing loads

Normally, when an analysis fails due to the load on the system, a code profiler can help you identify which step caused the failure. That lets you put in a workaround to stop the failure from happening.

In a DataSHIELD context this is both more serious and more challenging to solve. It is more serious because the person running the code does not control the server: if it crashes, there is a long delay while they email the server owner to ask for a restart. This has been a big usability issue for us previously and in current work on InterConnect.

In terms of solving the problem, you cannot just run the profiler on the client side; it needs to look at how the code is running on the server side. As a first attempt, the profiler could run on the client to determine at least which call to the server causes the failure, but that does not help identify what is causing the failure on the server side. Perhaps we could develop a ds.profiler that wraps another call and runs it with a profiler on the server side? I don't think it will be easy! :-)
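
As a starting point, a rough sketch of what a server-side profiling wrapper could look like, using R's built-in Rprof (profileDS is a proposed name; nothing like it exists yet, and any real version would need disclosure checks on the expression):

# Hypothetical aggregate function: run an expression under Rprof and return
# the timing summary so the client can see where server-side time is spent.
profileDS <- function(expr.text) {
  prof.file <- tempfile(fileext = ".out")
  utils::Rprof(prof.file)
  result <- try(eval(parse(text = expr.text)), silent = TRUE)
  utils::Rprof(NULL)                     # stop profiling
  list(result = result,
       by.total = utils::summaryRprof(prof.file)$by.total)
}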

rbindDS converts characters to numerics

rbindDS converts character variables to numeric variables because it uses the data.matrix function. We could replace this with R's rbind function to avoid the issue.
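
A quick illustration of the coercion (the exact behaviour of data.matrix on character columns varies a little between R versions):

df <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
data.matrix(rbind(df, df))  # x is coerced to numeric (factor codes or NA)
rbind(df, df)               # plain rbind keeps x as a character column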

Prototype cell key perturbation to enhance disclosure control against difference attacks

Stefan has illustrated two methods of retrieving data from DataSHIELD with difference attacks. In short:

(1) by comparing the mean of a column with all rows and with one row removed
(2) by comparing the mean of a column with all rows and with one row duplicated

This is hard to protect against because it is done by creating two subsets that generally have large numbers of rows.

Research indicates that the best protection against difference attacks is to add noise. The cellKey R package provides the ability to add noise to a table; it could be repackaged for DataSHIELD use.

The issues to address are:

  • when to apply the noise: on import of data into the session, or when the data are split into subsets?
  • the cell key process has been used for census data and tends to be evaluated on a particular data set to see if it is appropriate. How would that work for DataSHIELD, with its diverse datasets?
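
For illustration only, a toy version of the general idea; the real cellKey approach uses deterministic record keys rather than plain random noise:

# Toy sketch: perturb a released mean with small random noise so that two
# subsets differing by one row no longer reveal that row's exact value.
noisy.mean <- function(x, noise.sd.fraction = 0.05) {
  mean(x) + stats::rnorm(1, mean = 0, sd = noise.sd.fraction * stats::sd(x))
}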

Add garbage collector agg function

Add the possibility to trigger the R garbage collector on the server side by calling gc(), as a complement to the rm() function (called by rmDS).

> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1695520 90.6    2607968 139.3  2607968 139.3
Vcells 3576174 27.3    8388608  64.0  4930149  37.7

A bonus would be to return the output of gc(): it is not disclosive, and it may be useful to the user.

This would encourage good memory-management practices on the R server side, for better use of the shared system resources.
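
A minimal sketch of what such an aggregate function could look like (gcDS is an assumed name):

# Hypothetical server-side aggregate function: trigger garbage collection
# and return the non-disclosive gc() summary matrix to the client.
gcDS <- function() {
  base::gc(verbose = FALSE)
}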

Subsetting a table is broken in DSI / v6 branch

I have a table patient with column age.
I want to subset the table to get all patients with age >= 40.

ds.subset(x = 'patient', subset = 'less_patients', logicalOperator = 'age>=', threshold = 40, datasources = conns)

Expected

A new table less_patients is created with the patients of age>=40.
Or perhaps a message telling me that I do not have enough patients to filter them.

Actual

I get an error message:

Error: Command 'subsetDS("patient", FALSE, NULL, NULL, 2, 40, "age")' failed on 'opal-dsi': Error while evaluating 'is.null(base::assign('less_patients', value={dsBase::subsetDS("patient", FALSE, NULL, NULL, 2, 40, "age")}))' ->
Error in D[, 1] : object of type 'closure' is not subsettable

finish error handling on glmerSLMADS2

At the moment the mg object is created within a try() statement, so errors should be caught and returned to the user, but they are not. Subsequent code still uses the mg object even when it has not been created, leading to an ungraceful failure.
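
A minimal sketch of the missing guard, simplified and with the argument list trimmed (one possible shape of the fix, not the committed code):

# Check the try() result and return the error message to the client
# instead of proceeding to summary(mg) with mg undefined.
fit <- try(lme4::glmer(formula2use, data = dataDF, family = family), silent = TRUE)
if (inherits(fit, "try-error")) {
  return(list(errorMessage = conditionMessage(attr(fit, "condition"))))
}
mg <- fit
model.summary <- summary(mg)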
