r-lib / memoise

Easy memoisation for R

Home Page: https://memoise.r-lib.org

License: Other

memoise's Introduction

memoise

The memoise package makes it easy to memoise R functions. Memoisation (https://en.wikipedia.org/wiki/Memoization) caches function calls so that when a function is called again with previously seen inputs, it returns the previously computed output instead of recomputing it.

Installation

Install from CRAN with

install.packages("memoise")

Usage

To memoise a function, use memoise():

library(memoise)
f <- function(x) {
  Sys.sleep(1)
  mean(x)
}
mf <- memoise(f)

system.time(mf(1:10))
#>    user  system elapsed
#>   0.002   0.000   1.003
system.time(mf(1:10))
#>    user  system elapsed
#>   0.000   0.000   0.001

You can clear mf’s cache with:

forget(mf)

And you can test whether a function is memoised with is.memoised().
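
As a brief illustration of both helpers, here is a minimal sketch. The function name `slow_square` is made up for this example and is not part of the package.

```r
library(memoise)

# Hypothetical function used only for illustration
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}
msq <- memoise(slow_square)

is.memoised(slow_square)  # the original function is not memoised
is.memoised(msq)          # the wrapper returned by memoise() is

msq(4)       # computed and stored in the cache
forget(msq)  # clears the cache, so the next call recomputes
msq(4)
```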

Caches

By default, memoise uses an in-memory cache, using cache_mem() from the cachem package. cachem::cache_disk() allows caching using files on a local filesystem.

Both cachem::cache_mem() and cachem::cache_disk() support automatic pruning by default; this means that they will not keep growing past a certain size, and eventually older items will be removed from the cache. The default size for a cache_mem() is 512 MB, and the default size for a cache_disk() is 1 GB, but this can be customized by specifying max_size:

# 100 MB limit
cm <- cachem::cache_mem(max_size = 100 * 1024^2)

mf <- memoise(f, cache = cm)

You can also change the maximum age of items in the cache with max_age:

# Expire items in cache after 15 minutes
cm <- cachem::cache_mem(max_age = 15 * 60)

mf <- memoise(f, cache = cm)

By default, a cache_disk() uses a subdirectory of the R process’s temp directory, but it is possible to specify the directory. This is useful for persisting a cache across R sessions, sharing a cache among different processes, or even for synchronizing across the network.

# Store in "R-myapp" directory inside of user-level cache directory
cd <- cachem::cache_disk(rappdirs::user_cache_dir("R-myapp"))

# Store in Dropbox
cdb <- cachem::cache_disk("~/Dropbox/.rcache")

A single cache object can be shared among multiple memoised functions. By default, the cache key includes not only the arguments to the function, but also the body of the function. This essentially eliminates the possibility of a cache collision, even if two memoised functions are called with the same arguments.

m <- cachem::cache_mem()

times2 <- memoise(function(x) { x * 2 }, cache = m)
times4 <- memoise(function(x) { x * 4 }, cache = m)

times2(10)
#> [1] 20
times4(10)
#> [1] 40

Cache API

It is possible to use other caching backends with memoise. These caching objects must be key-value stores which use the same API as those from the cachem package. The following methods are required for full compatibility with memoise:

$set(key, value): Sets a key to value in the cache.
$get(key): Gets the value associated with key. If the key is not in the cache, this returns an object with class "key_missing".
$exists(key): Checks for the existence of key in the cache.
$remove(key): Removes the value for key from the cache.
$reset(): Resets the cache, clearing all key/value pairs.

Note that the sentinel value for missing keys can be created by calling cachem::key_missing(), or structure(list(), class = "key_missing").
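
The methods above can be sketched with nothing but base R. The constructor name `cache_env` is hypothetical (it is not part of memoise or cachem), and the store has no pruning, so treat this as a starting point under those assumptions rather than a production backend.

```r
# Hypothetical constructor: an environment-backed key-value store exposing
# the five methods listed above. Not part of memoise or cachem.
cache_env <- function() {
  store <- new.env(parent = emptyenv())

  # Sentinel returned when a key is absent, built as described above
  key_missing <- function() structure(list(), class = "key_missing")

  list(
    set = function(key, value) {
      assign(key, value, envir = store)
      invisible(NULL)
    },
    get = function(key) {
      # Return the sentinel instead of failing on a miss
      if (base::exists(key, envir = store, inherits = FALSE)) {
        base::get(key, envir = store, inherits = FALSE)
      } else {
        key_missing()
      }
    },
    exists = function(key) {
      base::exists(key, envir = store, inherits = FALSE)
    },
    remove = function(key) {
      if (base::exists(key, envir = store, inherits = FALSE)) {
        rm(list = key, envir = store)
      }
      invisible(NULL)
    },
    reset = function() {
      rm(list = ls(store, all.names = TRUE), envir = store)
      invisible(NULL)
    }
  )
}

cm <- cache_env()
cm$set("abc123", 42)
cm$get("abc123")
inherits(cm$get("nope"), "key_missing")
```

An object like this could then be passed as memoise(f, cache = cache_env()).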

Old-style cache objects

Before version 2.0, memoise used different caching objects, which did not have automatic pruning and had a slightly different API. These caching objects can still be used, but we recommend using the caching objects from cachem when possible.

With the old-style caching objects, memoise first checks for the existence of a key in the cache, and if present, it fetches the value. This results in a possible race condition (when using caches other than the memory cache): an object could be deleted from the cache after the existence check, but before the value is fetched. With the new cachem-style caching objects, the possibility of a race condition is eliminated: memoise simply tries to fetch the key, and if it’s not present in the cache, the cache returns a sentinel value indicating that it’s missing. (Note that the caching objects must also be designed to avoid a similar race condition internally.)

The following cache objects do not currently have an equivalent in cachem.

cache_s3() allows caching on Amazon S3. It requires you to specify a bucket using cache_name. Bucket names must be unique among all S3 users.
```
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "<access key>",
  "AWS_SECRET_ACCESS_KEY" = "<access secret>"
)
cache <- cache_s3("<unique bucket name>")
```
cache_gcs() saves the cache to Google Cloud Storage. It requires you to authenticate by downloading a JSON authentication file, and specifying a pre-made bucket:
```
Sys.setenv(
  "GCS_AUTH_FILE" = "<google-service-json>",
  "GCS_DEFAULT_BUCKET" = "unique-bucket-name"
)
gcs <- cache_gcs()
```

memoise's Issues

Name clashes between arguments and memoising procedure

The problem: Argument names of a memoised function (say, f) can be hijacked by names in the body or enclosing environment of the memoising function (memoise(f)).

Example:

memoise(function(hash) hash)("Am I OK?")
#> [1] "9ec45ba0998d8bc3150c..." (output truncated)

This is not a problem that is likely to be encountered for exotic names, like `_f`. For ordinary syntactic names, like hash or args, the likelihood of a name clash is still very low, but not zero. Nevertheless, it might be worthwhile to reduce this likelihood to zero, as this seems possible with a few small changes to memoise() (and, accordingly, has_cache()).

A solution:

To fix clashes with names in the enclosing environment — invoke such names by explicit reference to bindings in the enclosing environment.
To fix clashes with names assigned in the memoised function body — call the underlying (non-memoised) function f in the calling environment, rather than in the function's execution environment, as currently done.

Implementing 1) amounts to changing `_f` to encl$`_f`, etc., where encl gets parent.env(environment()). Implementing 2) amounts to replacing .init_call in memoise.R with eval.parent(`[[<-`(match.call(), 1L, encl$`_f`)).
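
The two steps can be sketched in base R as follows. The wrapper `wrap()` is a hypothetical stand-in for memoise() that only forwards the call and does no caching; it is meant to show the evaluation pattern, not the package's actual code.

```r
# Hypothetical stand-in for memoise(): forwards the call without caching
wrap <- function(f) {
  `_f` <- f
  function(...) {
    # Step 1: reach internal names through the enclosing environment
    encl <- parent.env(environment())
    # Step 2: put the wrapped function in the function position of the
    # matched call and evaluate that call in the caller's frame
    eval.parent(`[[<-`(match.call(), 1L, encl$`_f`))
  }
}

g <- wrap(function(hash) hash)
g("OK!")
```

Because the call is evaluated in the caller's frame, names such as hash inside the wrapper never shadow the arguments of the wrapped function.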

I have implemented such fixes in commits 018ef0a, 7a46d42, e32fb1b. memoise() is even a bit simpler now, because bquote()'ing is no longer necessary. All existing tests pass, in addition to a test that verifies the absence of name clashes.

With such changes, argument names won't be hijacked by internal names:

memoise(function(hash) hash)("OK!")
#> [1] "OK!"

Infinite Improbability default_args

Jim,

A possible bug:

Line 123: lapply(default_args, eval, envir = environment()))

If default_args contains one of the already defined symbols in memo_f or its enclosed environment (e.g., _f), it will use this value instead of the default argument specified in the function definition. Improbable example:

# based on test: "argument names don't clash with names in memoised function body"
library(memoise)

f <- function(
  # note that `_f` is not included as argument
  `_cache`, `_additional`,
  mc, encl, called_args, default_args, args, hash, res, xtra = `_f`
) list(`_f`, `_cache`, `_additional`, mc, encl, called_args, default_args, args, hash, res, xtra)
f_mem <- memoise(f)

`_f` <- 100
(unlist(f(1, 2, 3, 4, 5, 6, 7, 8, 9)))
# [1] 100   1   2   3   4   5   6   7   8   9 100
(unlist(f_mem(1, 2, 3, 4, 5, 6, 7, 8, 9)))
# [1] 100   1   2   3   4   5   6   7   8   9 100
# looks good

`_f` <- 200
(unlist(f(1, 2, 3, 4, 5, 6, 7, 8, 9)))
# [1] 200   1   2   3   4   5   6   7   8   9 200
(unlist(f_mem(1, 2, 3, 4, 5, 6, 7, 8, 9)))
# [1] 100   1   2   3   4   5   6   7   8   9 100
# does not match

I believe using

envir = environment(encl$`_f`)

would solve this problem as there is no other eval that takes place inside the memo_f frame / enclosing environment. (All the other tests pass.)

Q: We still have eval of _additional on Line 127 which takes place in encl --> there might be a possible (but improbable) conflict of ... formula with names in encl: _f, _additional and _cache. Is this correct?

Memoise with data.table

I found an interesting issue that could cause some serious mischief.

Because of the way data.table and memoise use environments, if you use memoise within a data.table's group by feature you can get an error where the wrong input / output pair get matched.

library(data.table)
library(memoise)

fib <- function(x) if(x<2) x else fib(x - 2) + fib(x - 1)
fib <- memoise(fib)

dat <- data.table(x=as.numeric(1:10))
dat[ , y := fib(x), x]
dat

fib(1)  ## This returns the largest value for x

I don't know that I would call this an error or a bug in memoise or data.table, but it's good to know.

expose `algo` / digest in cache_

Would it be possible to expose to the user the algo parameter for digest added in 4b3eb9f?

128 characters for a file name might be a little too much for Windows when the total path length should be 260 (although this is changing). Moreover, other hash functions are faster.

Even better, would it be possible to supply the digest function from the cache_ closure (as it is done with reset, set, etc.)? This would allow the user to provide their own digest function.

prevent filesystem indexing for caches?

I apologize if this is out of scope, but I noticed that several independent processes on my Mac have started devoting a lot of resources to indexing my cache directories (Spotlight, Time Machine, my online backup service, and [if the caches were in my working directory] RStudio ).

I'd imagine that Windows and Linux users could have similar problems.

I'm not sure whether there's anything you can do about these from inside your package, but I wanted to put it on your radar just in case. Also, it might be worth pointing out in the documentation that users might want to keep their caches in a folder that has these services turned off.

Limit memory usage in a cache

Would it be possible to have a form of the memory cache that will only use a specific amount of memory or store a particular number of recent function/argument combinations? This strikes me as possibly useful when looping over very large datasets and computing the same expensive object for multiple numbers of elements in the dataset.

If one just caches all the expensive objects, the cache will become very large (is there any garbage collection on cached objects?). One often needs only the most recent function/argument combination if the arguments are ordered in this type of looping over a dataset.

I suppose a replacement for the memory cache which used a linked list of function/argument combinations rather than a hash set could work.
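
For comparison, the cachem caches described in the README above can be bounded by item count as well as by size. The sketch below assumes cachem's `max_n` and `evict` arguments to cache_mem(); the function `expensive` and the limit of 100 entries are made up for illustration.

```r
library(memoise)

# Keep at most 100 entries, evicting the least recently used ones first
recent_only <- cachem::cache_mem(max_n = 100, evict = "lru")

# Hypothetical expensive computation
expensive <- function(i) {
  Sys.sleep(0.01)
  i^2
}
m_expensive <- memoise(expensive, cache = recent_only)

for (i in 1:5) m_expensive(i)
recent_only$size()  # number of entries currently held
```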

unmemoise function?

Firstly, this package is really great! Thanks for the development and support. I'm writing to see if there'd be interest in an unmemoise(fn) function which will then make sure that this will work:

fn <- memoise(fn)
is.memoised(fn) # TRUE
unmemoise(fn)
is.memoised(fn) # FALSE

A usecase for this would be for example an interactive session where larger requests to the same functions might be suitable for caching but more granular requests might be better off run on real time data.

Has there been already any thought on such feature?

Worth warning about memoising random functions

Hey there,

This isn't necessarily a problem with memoise at all, but rather a warning that might be worth giving to folks who read about the package.

If someone tries to memoise a function that includes any sort of randomization in it, their outputs will not be random if the inputs are identical, and it could be really hard to tell just by looking at the results. Here's a simple example:

rnorm_mem <- memoise(rnorm)
x <- rnorm_mem(10)
y <- rnorm_mem(10)
all.equal(x,y)
#> [1] TRUE

It seems pretty obvious, but I could definitely see someone getting a little trigger-happy with memoise and making this mistake without realizing it.

feature request: improve caching of large objects

When a function is being called with the same object, as opposed to an object which is equal but located somewhere else in memory, there should be a way to avoid the calculation of the checksum or hash. I don't know how to do it, but I have some immutable 100MB objects and it seems like I'm mostly waiting for a checksum to be calculated.

I don't know if the semantics of weak references in R allows us to ensure that a weakly-referenced object has not been modified, but if so then this could be turned into a solution for making Memoise fast for large objects too.

Available Packages

Using version 1.0.0 on a fresh session

 ap = memoise(available.packages)

This works

ap(method="libcurl")

but

> ap()
Error in serialize(object, connection = NULL, ascii = ascii) : 
  argument "method" is missing, with no default

Typo in `README.md`

There are some typos in Filesystem example code
- I'll PR this right away

memoise thinks an optional lmer argument is required

memoise somehow thinks that the "subset" argument in lme4::lmer is required. Not sure if this is lme4's issue or memoise's....

> library(lme4)
Loading required package: Matrix

Attaching package: ‘Matrix’

The following objects are masked from ‘package:base’:

    crossprod, tcrossprod

> data("Arabidopsis")
> library(memoise)
> mem_lmer = memoise(lmer)
> mem_lmer(gen ~ (1|rack) + nutrient, data = Arabidopsis, na.action = na.omit)
Error in serialize(object, connection = NULL, ascii = ascii) : 
  argument "subset" is missing, with no default
> lmer(gen ~ (1|rack) + nutrient, data = Arabidopsis, na.action = na.omit)
Linear mixed model fit by REML ['lmerMod']
Formula: gen ~ (1 | rack) + nutrient
   Data: Arabidopsis
REML criterion at convergence: 4601.868
Random effects:
 Groups   Name        Std.Dev. 
 rack     (Intercept) 5.207e-07
 Residual             9.603e+00
Number of obs: 625, groups:  rack, 2
Fixed Effects:
(Intercept)     nutrient  
   20.74405      0.04816  


> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] memoise_1.0.0 lme4_1.1-10   Matrix_1.2-3 

loaded via a namespace (and not attached):
 [1] minqa_1.2.4     MASS_7.3-45     tools_3.2.3     Rcpp_0.12.3     splines_3.2.3   nlme_3.1-124    grid_3.2.3      digest_0.6.9    nloptr_1.0.4   
[10] lattice_0.20-33

Memoise with plot

I suspect this is the same issue as #19

p = memoise(plot)
p(1:10) # Bad
p(1:10, 1:10) #Good

Also, I'm thinking that this could be useful in a shiny application (wrapped around a plot call). Is this a sensible thought?

path in caching backends should be absolute

For example, with cache_filesystem, if path is relative and the working directory changes, then the caching operations will happen in the wrong directory.
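
Until the backends do this themselves, one workaround is to resolve the path before creating the cache. The helper name `absolute_cache_dir` below is hypothetical; it only uses base R.

```r
# Hypothetical helper: create the directory if needed and return its
# absolute path, so later setwd() calls cannot redirect the cache
absolute_cache_dir <- function(path) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  normalizePath(path, winslash = "/", mustWork = TRUE)
}

old_wd <- setwd(tempdir())
cache_path <- absolute_cache_dir("rcache_example")
setwd(old_wd)

# The resolved path still points at the same directory after setwd()
dir.exists(cache_path)

# It could then be used as: cache_filesystem(cache_path)
```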

Not working with paste functions

I got the following issue when using the memoise function on paste:

library(memoise)
p <- memoise(paste)

p("luca")
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?

Education on MIT license

It looks like memoise relies on the digest package (which is GPL-2). How does memoise not inherit GPL-2 from digest? Thanks for any education!

EDIT: have now posted this here: http://opensource.stackexchange.com/q/4414/6032 Thanks!

False positives for memoised functions in packages

Is it possible to include memoised functions in packages? I did something like this:

dictionary_internal <- function(lang, affix){
  #actual code
}

#' @export
dictionary <- memoise::memoise(dictionary_internal)

However the memoised version of dictionary ignores the parameter values and always returns the memoised value for any value of the arguments.

Clarify that memoise should only apply to pure function?

This is a suggestion for the package README and vignette. It may not be obvious to some users that memoise is only meant to be used on pure functions.

I searched Stack Overflow for a question I had about memoise and found several posts that try to memoise database queries, which is conceptually wrong.

Maybe the first line of the README should stress pure functions and point users to the pure function section of the Advanced R book?

A version of memoised function which has forgetful nature

I am memoising a function in a shiny application to limit calls to a database. However, I want it to reset its cache at the start of each new day. This is the solution I've come up with:

forget_using <- function(f, value) {
  v <- value()
  mem_f <- memoise::memoise(f)

  function(...) {
    v_ <- value()
    if (v != v_) {
      memoise::forget(mem_f)
      v <<- v_
    }
    mem_f(...)
  }
}

# example use case
mem_run_query <- forget_using(run_query, Sys.Date)

If you think something like this sits in the package let me know, and I'll do a pull request. In that event, I think a better name is probably needed.

Support for storr caches?

I love @richfitz's storr API, and it opens up all sorts of storage backends without requiring extra work from memoise itself.

memoise::has_cache in sapply not finding cached param setting

library(memoise)
a <- function(n) { runif(n) }

memA <- memoise(a)

memA(2)

has_cache(memA)(2)
#[1] TRUE

sapply(1:5,function(x) {cat(x,':',has_cache(memA)(x), ' ');return(x)})
#1 : FALSE  2 : FALSE  3 : FALSE  4 : FALSE  5 : FALSE

for(x in 1:5) cat(x,':',has_cache(memA)(x), ' ')
#1 : FALSE  2 : FALSE  3 : FALSE  4 : FALSE  5 : FALSE

caching strategies and key invalidation

By using ... of memoise, one can have a full cache invalidation, if I understand well. Meaning: when the result of ... (function) is different, the value is recalculated (I hope I'm right, the docs are a bit vague)

Anyway, having a more fine-grained memoization would be nice. Meaning: for example, the timeout function is for full cache invalidation. Having a key-specific invalidation makes more sense to me, so that one key is removed from the cache if it's not valid anymore.
Invalidating a key can be based on different cache strategies (or even any combination thereof). I'm thinking about many, and this may be extensible: 'first in first out' (with max number of keys), 'time to live' (max time) , 'max time idle' (max time not used), 'least recently used' (max number of keys), 'least frequently used' (max number of keys), ... .

It's more a thought/idea, but it would be really cool to have this...

hash function key in cached files

I've copied an old folder which contains cache files generated by memoise to a new system. For some reason, memoise now generates cache keys that do not match the existing ones. Can you give a bit of insight into how the cache keys are generated?

Exclude certain parameters from memoisation

Hey everyone,
I really like the idea of caching results to use them later as it promises to increase the performance significantly.
But I ran into a problem so let's say I have the following function:

myFunction <- function(a, b, sep){
   return(paste(a, b, sep =sep))
}

So if I memoise this function, each time one of the arguments changes it will create a new cache entry, so that call won't need to be rerun in the future. But what if the argument sep doesn't really matter to me, as it doesn't affect the result apart from changing the separator?
Is there a way to exclude it so that there will be only a new cache if a or b changes and not sep?
Or if not will this be a feature in an upcoming version?

Thanks in advance for help
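
For reference, a sketch of how this can look, assuming memoise 2.0 or later, where memoise() accepts an `omit_args` argument naming arguments to ignore when computing the cache key. The call counter `n_calls` is added only to make the cache hit visible.

```r
library(memoise)

n_calls <- 0
my_function <- function(a, b, sep) {
  n_calls <<- n_calls + 1  # counts real (non-cached) executions
  paste(a, b, sep = sep)
}

# Assumption: memoise >= 2.0, which provides omit_args
mf <- memoise(my_function, omit_args = "sep")

first  <- mf("x", "y", sep = "-")
second <- mf("x", "y", sep = "+")  # same key as the first call

n_calls
identical(first, second)
```

Note the consequence: the second call returns the value cached for the first separator, so this only makes sense when the omitted argument truly does not matter to the caller.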

Register default caching strategy

So can do once per session.

memoise breaks autocompletion in RStudio

This is just a small inconvenience.
It appears that applying memoise() to a function breaks the argument autocompletion for such a memoised function –– at least in RStudio.
Not sure on whose end that's a problem.

usage with multiple threads?

This looks like a great package. It's saving me and my collaborators a lot of unnecessary computation time.

I was wondering about how the package would perform if a memoized function were running in parallel on several threads, especially with caches stored on the filesystem. Given that the hashes are deterministic, it doesn't seem like there would be a problem, but I didn't see anything specifically about it in the documentation, so I thought it would be good to ask.

Thanks in advance!

memoise cache seems to be invalidated when re-knitting

I'm hoping to speed up re-knitting an rmarkdown document using memoise(cache=cache_filesystem(.... Unfortunately, the cache seems to be invalidated every time I re-knit. However, caching does work when I run my code in a plain Rgui session outside of RStudio/knitr. Small reproducible example (test_memoise.Rmd):

---
output: html_document
---

```{r}
library(memoise)
cache <- cache_filesystem("~/test_memoise")
fun <- function(input) {Sys.sleep(1); return(input)}
fun.mem <- memoise(fun, cache=cache)
system.time(sapply(1:2, fun.mem))
system.time(sapply(1:2, fun.mem))
```

What I would expect: The last two lines should take around 2 and 0 seconds, respectively, on first knitting. When re-knitting both lines should take negligible time.

What happens instead: Every knitting behaves like the first one.

system.time(sapply(1:2, fun.mem))
##    user  system elapsed 
##    0.03    0.01    2.05
system.time(sapply(1:2, fun.mem))
##    user  system elapsed 
##       0       0       0

But it works in plain Rgui:

> library(memoise)
> cache <- cache_filesystem("~/Dropbox/.rcache/test_memoise")
> fun <- function(input) {Sys.sleep(1); return(input)}
> fun.mem <- memoise(fun, cache=cache)
> system.time(sapply(letters, fun.mem))
   user  system elapsed 
   0.06    0.00    0.06

This is on RStudio 1.0.143 and Windows 10.

> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, mingw32             
 ui       RStudio (1.0.143)           
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Berlin               
 date     2017-05-18                  

Packages --------------------------------------------------------------------------------------------------
 package   * version     date       source                          
 base      * 3.4.0       2017-04-21 local                           
 compiler    3.4.0       2017-04-21 local                           
 datasets  * 3.4.0       2017-04-21 local                           
 devtools    1.13.1.9000 2017-05-18 Github (hadley/devtools@ad6f28e)
 digest      0.6.12      2017-01-27 CRAN (R 3.4.0)                  
 graphics  * 3.4.0       2017-04-21 local                           
 grDevices * 3.4.0       2017-04-21 local                           
 knitr       1.15.1      2016-11-22 CRAN (R 3.4.0)                  
 memoise   * 1.1.0       2017-04-21 CRAN (R 3.4.0)                  
 methods   * 3.4.0       2017-04-21 local                           
 pkgbuild    0.0.0.9000  2017-05-18 Github (r-pkgs/pkgbuild@8aab60b)
 pkgload     0.0.0.9000  2017-05-18 Github (r-pkgs/pkgload@119cf9a) 
 rlang       0.1.1       2017-05-18 Github (hadley/rlang@684221a)   
 stats     * 3.4.0       2017-04-21 local                           
 tools       3.4.0       2017-04-21 local                           
 utils     * 3.4.0       2017-04-21 local                           
 withr       1.0.2       2016-06-20 CRAN (R 3.4.0)

Invalidate cache for particular arguments

I wonder if it would be possible to introduce erase_cache(), which would be similar in interface to has_cache() and would selectively invalidate cache for particular function calls.

The rationale is that API packages sometimes experience connectivity outages and therefore cache NULL values for certain argument combinations. Calling forget() on the whole function seems excessive, especially if calls are expensive and results are stored on disk or S3. A better approach could be to check whether has_cache() is TRUE and the result is.null(), and then selectively invalidate the cache for that combination so that the next call repeats the API request.
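
A sketch of that check, assuming memoise 2.0 or later, which exports drop_cache() with the same calling convention as has_cache(). The function `fetch` is hypothetical and simulates an outage on its first call.

```r
library(memoise)

n_calls <- 0
# Hypothetical API call: fails (returns NULL) on the first attempt only
fetch <- function(id) {
  n_calls <<- n_calls + 1
  if (n_calls == 1) NULL else paste("record", id)
}
mfetch <- memoise(fetch)

mfetch(1)  # the NULL from the simulated outage is now cached

# Drop only the entry for this argument combination if it holds NULL
if (has_cache(mfetch)(1) && is.null(mfetch(1))) {
  drop_cache(mfetch)(1)
}

result <- mfetch(1)  # recomputed, because the entry was dropped
result
```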

support for missing arguments non-referentially transparent

I noticed that memoise assumes that the only arguments that can change are those provided by the user as implemented in these two lines of code.

 memo_f <- function(...) {
    hash <- digest(list(...))

If the function to be memoised also has default arguments that can change, this won't work

g = function(x = rnorm(1))x
g()
[1] -0.2685467
g()
[1] -1.901578
g()
[1] -0.9886473
h = memoise(g)
h()
[1] 0.07904293
 h()
[1] 0.07904293
h()
[1] 0.07904293
h()
[1] 0.07904293

My use case for this is when an argument is a filename whose contents can change, but not always, between invocations. Then memoising the function as is won't work, but if we added an argument with a default value set to file.info for that same file, then file.info would change every time the file contents are changed, at least, and trigger the computation. I think one way to implement this would be to access the missing arguments with formals and selectively evaluate them. A completely different approach would be to allow memoise users to specify how to eval certain arguments for the purpose of hashing, as in

memoise(f, custom.eval = list(filename = function(fn) file.info(fn)))

This is less intrusive because it doesn't require a change in the f function signature, but makes memoise more complicated. In the filename example we'd have

f = function(fname, timestamp = file.info(fname) ....
h = memoise(f) #will find timestamp through formals

as opposed to

f = function(fname) ...
h = memoise(f, custom.eval = list(fname = file.info(fname)))

Makes sense? Would a pull request towards either of these be considered?

How to deal with caching API job queues

Before I knew about this package, I had been trying to introduce API caching on my own, relying on a unique hash of the URL + arguments + any body to decide whether to read from the cache. I hope to perform API tests without the need for authentication, and I plan to use memoise rather than the .rds files I'm using now.

One problem I have though, is that some API calls are of the type "is this job finished?" (e.g gce_wait) which are all identical until they return "READY", so the caching fails.

Is this a scenario that can be covered with memoise?

memoisation with respect to custom output files

Memoise strikes me as an ideal cacher for Make-like build systems (examples: drake, remake). For this use case, it becomes important to track not just return values, but also custom output files, especially dynamic reports. Do you think such a feature would be appropriate for memoise?

knit_my_report <- memoise(knitr::knit, files = "my_report.md")

Cache invalidation is too strict?

I noticed that memoise gives different results when a function is source-d vs run line by line, or when a package is installed on one platform vs another. (As Jim Hester mentioned in #55, this happens because the function itself is included in the hash, and I guess the underlying bytecode can differ.)

Below is a quick example that demonstrates the non-caching behavior when the program is run line by line or sourced.

One way to address this, if you think it's a problem that should be addressed, would be to use or borrow from digest::sha1. sha1 is a generic that dispatches to methods that try to look at meaningful differences (see vignette). Note that in the example below, the digests differ, but the sha1s are the same.

Do you think this is worth addressing? It seems like a non-issue for in-memory caches, but could be a pain for on-disk, potentially cross-platform caches.

library(memoise)
library(digest)

cache <- cache_filesystem("temp_Rcache")
fn <- function() {
  message("Actually running the function")
}
message("Digest: ", digest(fn, algo = "xxhash64"), "\nsha1:   ", sha1(fn))
fn_cached <- memoise(fn, cache = cache)
fn_cached()

save/load file system cache to memory?

I'm using parallel version of a memoised function (something like parallel::mclapply).

I need to use file system cache across sessions, but file system cache will not be available for every thread. Since mclapply will fork current session, I think memory cache will work for them.

One solution would be to load the file system cache into memory at the beginning and save the memory cache to the file system when exiting (of course, the user code needs a proper exit process, otherwise it will not be saved). This should be possible and could also provide a performance boost for normal file system cache usage.

(Re)initialize file system cache if it does not exist.

library(memoise)
f = memoise(rnorm, cache = cache_filesystem("cache"))
dir.create("new_dir")
setwd("new_dir")
f(1) 
## Error in gzfile(file, mode) : cannot open the connection 
## In addition: Warning message: 
## In gzfile(file, mode) :
##    cannot open compressed file 'cache/75bed0259340d03b', 
##    probable reason 'No such file or directory'

Users may also want to write packages with already-memoised functions, and the users of those packages could call the functions from any working directory.

unexpected behavior with timeout(0)

It looks like timeout must be longer than 1e-7 in order to avoid a warning, and produces unexpected behavior when seconds=0 (result is NaN).

It would be great if timeout(0) would just correspond to turning off the timeout behavior.

memoising embedded functions?

memoise looks very interesting. I am trying it with some of my own functions and external libraries' ones. One question arose:
How do I memoise embedded functions that are called by the functions I am memoising:

package1::functionA
package2::functionB <- function() {
 package1::functionA()
}

I can do functionB <- memoise(package2::functionB). But in cases of external libraries (package2 in this case), I don't want to go in and modify package2::functionB. Is it possible to introduce the following?

package2::functionB <- function() {
 functionA <- memoise(package1::functionA())
}

By this?

functionB <- memoise(package2::functionB, subfunc="functionA")

here memoise takes one more argument subfunc to memoise that function too.

If it is not possible at the moment, I would like to make it a feature request.
Thanks

Allow additional cache types

I'd be interested in writing my own cache that stores/retrieves items in google datastore, dropbox, AWS, or elsewhere in a decentralized manner. Any thoughts/ideas on something like that? I'm working on a google datastore one right now.

Perhaps you could allow custom caches to be provided as an argument for the memoize function with the expectation that they'll provide functions for reset, set, get, has_key, and keys?

How to pass formula to memoise?

Hello, may I ask how to pass in a formula from a string to memoise?

This works:

library(memoise)
fn <- function() { i <<- i + 1; i }
i <- 0

fnm <- memoise(fn, ~{if(i>=2) TRUE else i})
fnm() # 1
fnm() # 2
fnm() # 3
fnm() # 3

...but how does this work?

library(memoise)
fn <- function() { i <<- i + 1; i }
i <- 0

my_formula <- ~{if(i>=2) TRUE else i}

fnm <- memoise(fn, my_formula)
# Error: `my_formula` must be a formula.

my_formula <- as.formula("~{if(i>=2) TRUE else i}")
fnm <- memoise(fn, my_formula)
# Error: `my_formula` must be a formula

my_formula <- ~{if(i>=2) TRUE else i}
memoise(fn, eval(substitute(my_formula)))
# Error: `eval(substitute(my_formula))` must be a formula.

Memoise used to work but not anymore...

I have used memoise() in the past and it worked very well, but not now.
Here is a simple reproducible example (not my real case of course) and the system info:

library(microbenchmark)
library(memoise)
mmed <- memoise(median)
set.seed(1)
x <- rnorm(1e6)
microbenchmark(median(x), mmed(x))
# Unit: milliseconds
#       expr      min       lq     mean   median       uq       max neval cld
#  median(x) 23.54106 27.85041 32.64600 29.72862 31.60537  76.18108   100  a 
#    mmed(x) 76.10862 83.50640 92.14334 87.06453 92.79896 194.83469   100   b
devtools::session_info()
# Session info ------------------------------------------------------------------
#  setting  value                       
#  version  R version 3.4.3 (2017-11-30)
#  system   x86_64, darwin15.6.0        
#  ui       X11                         
#  language (EN)                        
#  collate  en_US.UTF-8                 
#  tz       Europe/Rome                 
#  date     2017-12-29                  
# 
# Packages ----------------------------------------------------------------------
# package        * version date       source        
# base           * 3.4.3   2017-12-07 local         
# codetools        0.2-15  2016-10-05 CRAN (R 3.4.3)
# colorspace       1.3-2   2016-12-14 CRAN (R 3.4.0)
# compiler         3.4.3   2017-12-07 local         
# datasets       * 3.4.3   2017-12-07 local         
# devtools         1.13.0  2017-05-08 CRAN (R 3.4.0)
# digest           0.6.13  2017-12-14 cran (@0.6.13)
# ggplot2          2.2.1   2016-12-30 CRAN (R 3.4.0)
# graphics       * 3.4.3   2017-12-07 local         
# grDevices      * 3.4.3   2017-12-07 local         
# grid             3.4.3   2017-12-07 local         
# gtable           0.2.0   2016-02-26 CRAN (R 3.4.0)
# lattice          0.20-35 2017-03-25 CRAN (R 3.4.3)
# lazyeval         0.2.0   2016-06-12 CRAN (R 3.4.0)
# MASS             7.3-47  2017-02-26 CRAN (R 3.4.3)
# Matrix           1.2-12  2017-11-20 CRAN (R 3.4.3)
# memoise        * 1.1.0   2017-04-21 CRAN (R 3.4.0)
# methods        * 3.4.3   2017-12-07 local         
# microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.4.0)
# multcomp         1.4-6   2016-07-14 CRAN (R 3.4.0)
# munsell          0.4.3   2016-02-13 CRAN (R 3.4.0)
# mvtnorm          1.0-6   2017-03-02 CRAN (R 3.4.0)
# pillar           1.0.1   2017-11-27 CRAN (R 3.4.3)
# plyr             1.8.4   2016-06-08 CRAN (R 3.4.0)
# Rcpp             0.12.13 2017-09-28 CRAN (R 3.4.2)
# rlang            0.1.6   2017-12-21 CRAN (R 3.4.3)
# sandwich         2.3-4   2015-09-24 CRAN (R 3.4.0)
# scales           0.4.1   2016-11-09 CRAN (R 3.4.0)
# splines          3.4.3   2017-12-07 local         
# stats          * 3.4.3   2017-12-07 local         
# survival         2.41-3  2017-04-04 CRAN (R 3.4.3)
# TH.data          1.0-8   2017-01-23 CRAN (R 3.4.0)
# tibble           1.4.1   2017-12-25 CRAN (R 3.4.3)
# utils          * 3.4.3   2017-12-07 local         
# withr            2.0.0   2017-07-28 CRAN (R 3.4.1)
# zoo              1.8-0   2017-04-12 CRAN (R 3.4.0)

Question / Request: timeout time as variable not working when variable removed

Hi,

When memoising a function like this everything is fine:

f <- function(x) {Sys.sleep(2); x}
f <- memoise(f, seconds = ~memoise::timeout(500))
f(1)
f(1)

But when the timeout is passed as a variable, the variable seems to be evaluated on every function call, and an error is thrown once the variable has been removed.

f <- function(x) {Sys.sleep(2); x}
cacheTime <- 500
f <- memoise(f, seconds = ~memoise::timeout(cacheTime))
rm(cacheTime)
f(1)
f(1)

Is there an easy workaround for this?
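
One possible workaround is to create the formula inside a function, so that the value lives in that function's environment rather than in the global one. The sketch below assumes memoise evaluates each extra formula argument in the environment where the formula was created; make_key() and memoise_for() are hypothetical helper names:

```r
# Base R mechanism: a formula remembers the environment it was created in.
make_key <- function(cache_time) {
  force(cache_time)  # evaluate now, so the value is stored in this frame
  ~ as.numeric(Sys.time()) %/% cache_time
}

cacheTime <- 500
key <- make_key(cacheTime)
rm(cacheTime)  # the global variable is gone ...

# ... but the right-hand side still evaluates in the formula's own environment.
bucket <- eval(key[[2]], environment(key))

# The same idea applied to memoise (only run if the package is installed).
if (requireNamespace("memoise", quietly = TRUE)) {
  memoise_for <- function(f, cache_time) {
    force(cache_time)
    memoise::memoise(f, seconds = ~memoise::timeout(cache_time))
  }
  g <- memoise_for(function(x) x, 500)
  g(1)
}
```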

memoise does not cache in Shiny server

Hi,

I tried to get memoise to work with Shiny to cache results from a SPARQL endpoint to no avail.

Please see the following thread for more details:

https://groups.google.com/forum/#!topic/shiny-discuss/v4GLd5OSEvo

Add compression option for non-memory caches

Would it be possible to add a compression option for non-memory caches?

cache_filesystem, cache_gcs and cache_rds all use saveRDS() to serialise the object to disk, and the latter two then transfer over the network.

It would help if the user could trade off some CPU time, by compressing the saved object, for hopefully faster network transfers (and also stay within any cloud service size quotas).

I've submitted a PR with a proposed change to do this: #70

performance decrease

I have noticed a performance decrease between version 1.0.0 from CRAN and 1.0.0.9001 from GitHub, using R 3.3.2 (2016-10-31).

In this example, the median time of the memoised function doubled.

x= data.frame(x=1:100000, y=1:100000)
f <- function (i) { x[x[,1]==i,2] }
f2  = memoise::memoise(f)
microbenchmark::microbenchmark( f(50), f2(50) )

# mran ‘1.0.0’
Unit: microseconds
expr     min       lq       mean   median      uq      max neval
f(50)  681.561 741.7170 1020.15498 840.9110 856.430 4976.026   100
f2(50)  71.997  75.3565   92.76033  84.4765  88.796  969.864   100

# github ‘1.0.0.9001’
microbenchmark::microbenchmark( f(50), f2(50) )
Unit: microseconds
expr     min       lq      mean  median        uq      max neval
f(50)  799.953 813.8725 1038.1763 834.031 1372.7195 1878.610   100
f2(50) 151.352 158.7115  170.9893 166.391  175.9905  263.345   100

default arguments of a memoised function can't be updated

In commit 03566bb and later, if a memoised function's default arguments are updated with formals<- and the function is then called with a set of arguments that is not yet cached, the defaults that reach the function body are the original ones, not the updated ones.

That is because when a memoised function is actually evaluated rather than fetched from the cache, the function evaluated is _f in a manually constructed environment, not the memoised function itself, and _f's default arguments are not updated when the memoised function's arguments are.

It is not a problem in v1.1.0, where the whole set of arguments, both those given explicitly by the user and those supplied implicitly by defaults, is passed to _f. That is not the case in 03566bb, where only the arguments captured by match.call() are passed to _f.

Arguments need to be evaluated in the functions environment.

e <- new.env(parent = baseenv())
f <- function(x, y = a) { x + a }
environment(f) <- e
e$a <- 5
f(1)
#> [1] 6
f2 <- memoise(f)
f2(1)
#> Error in serialize(object, connection = NULL, ascii = ascii): object 'a' not found

Vignette about using memoise in a package?

I'm starting to use memoise in one package, following @hrbrmstr's examples, e.g. here: https://github.com/hrbrmstr/cymruservices/blob/f9a1287998bb6c096f4a00a8a159761a919fc397/R/asn.R

I might draft an Rmd about how to do this. That said, the documentation seems quite detailed, so I might realize it's useless. :-)

Can I use same folder as filesystem cache for multiple memoised functions?

@hadley This is a usage question. I decided to post it here instead of on SO because it's a very easy question for the package author, though difficult for other users to answer.

For convenience, I'd like to use the same folder as the filesystem cache for several functions (I need to save and restore the folder later), but I'm not sure whether the caches of different memoised functions will interfere with each other.

From the source code, it seems that calling forget() on a memoised function clears all content under the cache folder. So I guess I'm supposed to use a different folder for each function?

parameter-less functions memoised with cache_filesystem get corrupted

I have memoised the following two parameterless functions.

library(memoise)
f2 <- function() { 2 }
f3 <- function() { 3 }
fc <- cache_filesystem("./TEMP")

cachedf2 <- memoise(f2, cache = fc)
cachedf3 <- memoise(f3, cache = fc)
Running the following two statements then gives the same result for both:

cachedf2()
[1] 2
cachedf3()
[1] 2

Checking the ./TEMP directory, there is only one cache file, shared by both functions. I guess the cache key does not take the function itself into account. If so, I assume the same issue can happen for any two functions called with the same arguments.
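
The suspected collision can be illustrated with a toy key builder. This is a sketch of the idea only and does not reproduce memoise's actual hashing; key_args_only() and key_with_function() are hypothetical helpers that build plain-text keys with deparse():

```r
# Toy key builders (hypothetical, for illustration only).
key_args_only <- function(f, args) {
  # Key depends on the arguments alone; `f` is ignored on purpose.
  paste(deparse(args), collapse = "")
}
key_with_function <- function(f, args) {
  # Key also folds in the function body.
  paste(c(deparse(body(f)), deparse(args)), collapse = "")
}

f2 <- function() { 2 }
f3 <- function() { 3 }

# With no arguments, both calls map to the same key ...
identical(key_args_only(f2, list()), key_args_only(f3, list()))          # TRUE
# ... whereas including the function body keeps them apart.
identical(key_with_function(f2, list()), key_with_function(f3, list()))  # FALSE
```

Whatever the key looks like, giving each memoised function its own cache directory keeps their files from clashing.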

Memoise does not cache by environments (only by arguments)

> fn <- function(x) x + y
> environment(fn)$y <- 1
> fn2 <- memoise::memoise(fn)
> fn2(1)
[1] 2
> environment(fn)$y <- 2
> fn2(1)
[1] 2