Coder Social home page Coder Social logo

Comments (15)

PascalCrepey avatar PascalCrepey commented on July 26, 2024

It's true that at some point, we need to assume that the base that is passed is good and has been checked. I'm ok with option 1 if we rename the functions so that it is clear for the user when the data is supposed to be clean or not. It would allow us to discriminate between high level functions (which call other functions to do the cleaning/checking and building) and low-level ones which only do what they are supposed to do. For example, we could say that all functions like *_from_base need to have checked and clean data, and functions like *_from_rawdata call functions to do the cleaning and potentially call *_from_base functions.
I think it may simplify the building blocks, what do you think ?

from hospitalnetwork.

ClementMassonnaud avatar ClementMassonnaud commented on July 26, 2024

I agree that the best workflow is to perform checkBase() separately.
Here the call to checkBase() is only to see if the database was previously checked. If there is an issue with the base it will not be fixed because no arguments are passed to checkBase(). So it would raise an error and the user would have to use checkBase() separately to fix the database.

I agree that is we must worry about the running time, but I'm not sure that it would be an issue here. checkBase() is merely checking column names, column classes, missing values, which are not very resource-intensive tasks. Even the adjust_overlapping_stays() function is just comparing dates between rows, which is quite fast with data.table, I think. What could take time with the checkBase() function is actually changing the database, but in this case the function would simply raise an error.
I could run some benchmarks to see how the function behaves with larger data sets.

from hospitalnetwork.

ClementMassonnaud avatar ClementMassonnaud commented on July 26, 2024

It's true that at some point, we need to assume that the base that is passed is good and has been checked. I'm ok with option 1 if we rename the functions so that it is clear for the user when the data is supposed to be clean or not.

I am always a bit concerned when we let the responsibility to the users when it can be quite easily handled by the program. Of course, the idea is not to assume the users won't understand the functions, but since we have a rather simple function at hand to double-check, we might as well use it.

As for adding various wrappers of functions, I am afraid it would add some unnecessary complexity to the package. I know that igraph is doing something similar, but I am not so fan. But this is a personal opinion really

from hospitalnetwork.

MikeLydeamore avatar MikeLydeamore commented on July 26, 2024

On a large (~2 million admissions) database, the checkBase() function does take a noticeable amount of time.

Could you add an extra class to the output of the base from checkBase() and then assume if you have an object of that class there is no need to run checkBase() again? If this sounds sensible, happy to patch it in.

from hospitalnetwork.

ClementMassonnaud avatar ClementMassonnaud commented on July 26, 2024

Yes, I thought about that too, but I don't know enough about R classes to implement that.
It seemed like the ideal solution, but I was wondering if creating a new class would make it a bit more difficult for the users to deal with the newly cleaned base (for instance if they want to write it as csv file, etc.)

P.S: could you also detail the benchmark you did on the function? I'm curious to see

from hospitalnetwork.

MikeLydeamore avatar MikeLydeamore commented on July 26, 2024

I sourced all the package files so I could use profvis to have a look, and I also ran checkBase(base) on the base file I got out of a previous checkBase (with relevant arguments), as I think this is what is happening inside the function. Benchmark says that 80 of the 117 seconds are taken up by checkBase.
I'm running this on real data so I can't share, but if you're interested I'm happy to give the simulated data a go and see what happens.

Haven't yet tried not checking the base and comparing, but I'll try and look into that and report back.
image

from hospitalnetwork.

ClementMassonnaud avatar ClementMassonnaud commented on July 26, 2024

Thank you very much for the details
It seems I was misleading, checkBase() takes indeed too much time to be called systematically within the function.
Looking again at the code, it might be due to checking of missing values, which, as I wrote it, involves a lot of string comparisons.

So, I would vote either for your solution of the extra class, or for option 2 proposed by @tjibbed

from hospitalnetwork.

tjibbed avatar tjibbed commented on July 26, 2024

Good to see @MikeLydeamore joining the discussion!

I'm tempted to agree that an extra class as output of checkBase(), and possible input of the edgelist_from_base() [and relating functions) might be the best way forward. We could carry the report on numbers of errors forward in that class as well, so that in the end we have a measure for input data quality, which can be exported.

@ClementMassonnaud 's point is important though:

but I was wondering if creating a new class would make it a bit more difficult for the users to deal with the newly cleaned base (for instance if they want to write it as csv file, etc.)

However, I'd imagine that the user would still be able to simply export the database element from that class (e.g. checkedData$base) to obtain a csv file with just the checked data. Right?

from hospitalnetwork.

PascalCrepey avatar PascalCrepey commented on July 26, 2024

Well, without creating a new object or anything like that, we can just use attributes.
At the end of checkBase(), instead of

if (returnReport) {
  return(report)
} else {
  return(report$base)
}

we could do :

#add the report as a named list attribute to the data.table
attr(report$base, "report") <- list(
  failedParse = report$failedParse,
  removedMissing = report$removedMissing, 
  missing = report$missing, 
  negativeLOS = report$negativeLOS,
  removedErrors = report$removedErrors, 
  removedDuplicates = report$removedDuplicates,
  neededIterations = report$neededIterations, 
  allIterations = report$allIterations, 
  addedAOS = report$addedAOS)

#add the class "checked.base" to the list of class so that we can easily 
#identify whether the base has been checked or not. 
if(!inherits(report$base, "checked.base") class(report$base) <- c("checked.base", class(report$base))

return(report$base)

Later on we can check whether the base has been checked by

if(inherits(my_potentially_checked_base, "checked.base")){
  ##do stuff on the checked base
}

and the report can be accessed via

attr(my_checked_base, "report")

but other than that, the object is still a data.table and behaves as it should...

Would it solve the issue at stake here ?

from hospitalnetwork.

MikeLydeamore avatar MikeLydeamore commented on July 26, 2024

I think so - classes in R aren't exclusive so just tagging the checked.base on the end won't stop the data.table functionality.
Given that the report isn't a thing the user is likely to want frequently, it seems OK to set as an attribute in my opinion.

from hospitalnetwork.

tjibbed avatar tjibbed commented on July 26, 2024

Sounds like a good solution (this really pushes the boundaries of my R-knowledge... again, learning a lot here)

from hospitalnetwork.

ClementMassonnaud avatar ClementMassonnaud commented on July 26, 2024

Sounds good to me

from hospitalnetwork.

PascalCrepey avatar PascalCrepey commented on July 26, 2024

One additional thing:
I can't remember what we decided regarding the naming/renaming of the columns in the database. I know it's done in checkBase, but since we now have a way of checking if the base has been checked, I don't really see the point of still putting the alternative column names in argument to some of our functions... and anyway when checkbase() was called with no argument, it would have stopped the running without even using the alternative column names... 
I think it's still useful to allow alternative names for functions that one should be able to use on its own (not called by another function) but my guess is that it's a limited list... 
Which functions do you think the user may want to call directly ? which one do you call directly ?

from hospitalnetwork.

tjibbed avatar tjibbed commented on July 26, 2024

I would say we'd only need the possibility of alternative column names in

  • checkBase
  • hospinet_from_subject_database

If we rename the columns in checkBase(), the alternative names can be removed from the following functions:

  • checkFormat
  • checkMissingErrors
  • adjust_overlapping_stays
  • matrix_from_base [Actually, completely deprecated?]
  • edgelist_from_base

I would say we keep these functions internal, and have the users create a hospiNet first, and from there they can get the edgelist and matrix directly.
In that case, any reconstruction of the matrix is through the edgelist, and there would be no direct route from base to matrix. I'm fine with this, as generally the edgelist is much smaller than the matrix. matrix_from_base would therefore lose it's usefulness.

I'll have a go at it.

from hospitalnetwork.

tjibbed avatar tjibbed commented on July 26, 2024

Oh by the way: I noticed @MikeLydeamore removed checkBase() from hospinet_from_subject_database() when inserting the check for hospinet.base.

I inserted it again, but within that check. It now reads

if (!inherits(base, "hospinet.base")) { message("Input database was not checked yet, which is required for network reconstruction. Running 'checkBase()' with default parameters. If this doesn't work, please run checkBase() separatelty with custom parameters first.") base=checkBase(base, verbose=verbose, ... ) }

So if an unchecked database is input, the script will first try to check the database with the available information. The user can input any parameters for checkBase() as parameters in hospinet_from_subject_database(), which passes them on.

This runs fine on my large real database.

If the database was already checked, it will skip it and go straight to calculating the edgelist and the rest of the hospinet object.

from hospitalnetwork.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.