
Comments (3)

delgadom commented on June 27, 2024

Did some preliminary research on other popular "caching" libraries. This Stack Overflow question was helpful.

First up, some takeaways

  1. This is not caching in the usual sense. In-memory caching (memoization) is by far the most popular use case, and it only persists objects within a session, not across sessions. For examples, see @functools.lru_cache and pycache.
  2. Most caching modules rely on pickle or cPickle. While this is useful, the pickle format is not guaranteed to be stable over long periods for all types of objects, especially complex data structures like pandas DataFrames.
  3. We may want flexibility in terms of loaders/writers, filepaths, and which arguments are allowed to affect caching behavior. Some of these are supported by some libraries but not all.
  4. It's nice to be able to define these caches in directory structures and locations that make sense outside the context of the caching library, so that we can, e.g., cache tasmax_squared as part of our pipeline but also allow anyone to discover, inspect, and use this object.
  5. Getting this right for simultaneous read/write in the cloud is hard and requires some serious engineering. We should be careful to either use a real library that handles this for us or to avoid these situations (e.g. make sure we're only using pure functions and that inputs fully specify outputs). That said, we don't need to worry about simultaneous writes resulting in corrupted data, as Google simply keeps the last-written object, whether over gcsfuse, gsutil, or google.cloud.storage.
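
As a rough illustration of points 3 and 4, here is a minimal sketch of a file cache with pluggable reader/writer functions and a human-readable path template. The names `cache_to_file`, `read_json`, `write_json`, and `square` are hypothetical and not part of any library discussed here; JSON stands in for whatever format (e.g. netCDF) suits the data, and no attempt is made to handle concurrent writers.

```python
import functools
import inspect
import json
import os
import tempfile


def cache_to_file(path_template, reader, writer):
    """Hypothetical decorator: cache results at a human-readable path.

    ``path_template`` is formatted with the function's bound arguments, so
    the cached file lives somewhere meaningful outside the caching code.
    """

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def inner(*args, **kwargs):
            # Normalize positional/keyword arguments and fill in defaults
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            path = path_template.format(**bound.arguments)

            if os.path.exists(path):
                return reader(path)

            result = func(*args, **kwargs)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            writer(result, path)
            return result

        return inner

    return decorator


def read_json(path):
    with open(path) as f:
        return json.load(f)


def write_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)


root = tempfile.mkdtemp()
calls = []  # records how many times the function body actually runs


@cache_to_file(os.path.join(root, "squares", "n{n}.json"), read_json, write_json)
def square(n):
    calls.append(n)
    return n * n


first = square(4)   # computed, then written to <root>/squares/n4.json
second = square(4)  # read back from disk; the function body does not run again
```

Because the path is derived from the arguments rather than from a hash, anyone can find and open the cached file without going through the decorator. The trade-off is that this only works for arguments that format cleanly into a file name.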

Now, the other remotely feasible libraries

On-disk pickled caches

  • shelve - built-in method for pickle-based on-disk "dictionaries". A bit more manual of a solution, but deserves mention.
  • pyfscache - potentially a great alternative for many of our very frequent API calls, e.g. to NOAA, which return little data and take a long time. Relies on cPickle and does not write objects as individual items that we could interpret outside the context of the cache, so probably not suitable for something like netCDF climate data.
  • cachetools
  • joblib.Memory - seems like a great option for pickle-based caching
  • bda.cache - may be wrapping pyfscache? Not really sure. Depends on cPickle for disk caching.
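
To make the "more manual" shelve option concrete, here is a small sketch of caching a slow lookup in a shelf. `slow_lookup` is a hypothetical stand-in for a slow API call (it does not contact NOAA), and the station ID is made up; only `shelve.open` and the dict-style access are standard library behavior.

```python
import os
import shelve
import tempfile

# The shelf lives in a temporary directory for this example
cache_path = os.path.join(tempfile.mkdtemp(), "api_cache")
calls = []  # records how many times the slow function actually runs


def slow_lookup(station_id):
    # Hypothetical stand-in for a slow remote API call
    calls.append(station_id)
    return {"station": station_id, "value": len(station_id)}


def cached_lookup(station_id):
    # shelve keys must be strings; values are pickled on assignment
    with shelve.open(cache_path) as db:
        if station_id not in db:
            db[station_id] = slow_lookup(station_id)
        return db[station_id]


a = cached_lookup("KSFO")  # runs slow_lookup and stores the result
b = cached_lookup("KSFO")  # served from the shelf on disk
```

The shelf persists across sessions, but all entries share one database file, so individual results are not discoverable as standalone files.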

Server-based caching solutions

These would be a radically different approach to computing... but maybe?

from rhg_compute_tools.

delgadom commented on June 27, 2024

oops.


delgadom commented on June 27, 2024

Here's my implementation for caching NOAA API calls

from __future__ import absolute_import

import os
import toolz
import pickle
import inspect
import hashlib
import functools

from os.path import join
from sklearn.gaussian_process.kernels import RBF, _check_length_scale
from scipy.spatial.distance import pdist, squareform, cdist
import numpy as np
import pandas as pd
import shapely as shp
import shapely.geometry
import scipy.interpolate

import pyTC.settings


def get_error_type_indices(ftrs):
    """Group the indices of errored futures by exception type."""
    io_indices = []
    fnf_indices = []
    other_indices = []
    for i, ftr in enumerate(ftrs):
        if ftr.status != "error":
            continue
        exc = ftr.exception()
        # FileNotFoundError subclasses OSError, so check it first
        if isinstance(exc, FileNotFoundError):
            fnf_indices.append(i)
        elif isinstance(exc, OSError):
            io_indices.append(i)
        else:
            other_indices.append(i)

    return {"io": io_indices, "fnf": fnf_indices, "other": other_indices}


@toolz.curry
def cache_result_in_pickle(func, cache_dir=None, makedirs=False, error="raise"):
    """
    Caches the results of a function in the specified directory

    Uses the python pickle module to store the results of a
    function call in a directory, with file names set to the
    sha256 hash of the function's arguments. Pass `redo=True`
    or delete the contents of the directory to reset the cache.

    Because the results are cached based only on function
    parameters, it is important that the function not have any
    side effects.

    Note that all function arguments are hashed to derive a
    cached filename, and that any change to any input will
    produce a new cached file. Therefore, functions that
    depend on complex, frequently changing objects, especially
    settings objects, should not be cached. Instead, cache
    lower-level functions with a small list of simple,
    explicit arguments.

    Note also that cached files are not cleaned up
    automatically, and therefore changes in the arguments to a
    function will result in a new set of cached files being
    saved without removing the older files. This could result
    in cache storage creep unless the cache is periodically
    cleared. Clearing the cache based on file creation date
    can be an important part of cache maintenance.

    .. todo::

        replace this function with a more complete
        implementation, e.g. the one described in
        [GH RhodiumGroup/rhg_compute_tools#56](https://github.com/RhodiumGroup/rhg_compute_tools/issues/56).

    Parameters
    ----------
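    func : function
        function to decorate. cannot have `redo`, `cache_dir`,
        `makedirs`, or `error` as arguments, since the wrapper
        consumes them.
    cache_dir : str, optional
        path to the root directory used in caching. If not
        provided, will use the `DIR_DATA_CACHE` attribute
        from `pyTC.settings.Settings()`, either one passed as `ps`
        to the wrapped func, or the default settings object if
        none is provided.
    makedirs : bool, optional
        if True, create the cache directory for the decorated
        function if it does not already exist. Default False.
    error : str, optional
        how to handle errors when writing the cache file. One of
        `'raise'` (default; re-raise the error), `'ignore'`
        (return the result without caching), or `'remove'`
        (delete any partially written file, then raise a
        `RuntimeError`).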

    Returns
    -------
    decorated : function
        Function, with cached results

    Examples
    --------

    .. code-block:: python

        >>> @cache_result_in_pickle(cache_dir=(tmpdir + '/cache'), makedirs=True)
        ... def long_running_func(i):
        ...     import time
        ...     time.sleep(0.1)
        ...     return i
        ...

    Initial calls will execute the function fully

    .. code-block:: python

        >>> long_running_func(1)  # > 0.1s
        1

    Subsequent calls will be much faster

    .. code-block:: python

        >>> long_running_func(1)  # << 0.1 s
        1

    Changing the arguments will result in re-evaluation

    .. code-block:: python

        >>> long_running_func(3)  # > 0.1s
        3

    Cached results are stored in the specified directory, under a
    subdirectory for each decorated function:

    .. code-block:: python

        >>> os.listdir(
        ...     tmpdir + '/cache/pyTC.utilities.long_running_func'
        ... )  # doctest: +NORMALIZE_WHITESPACE
        ...
        ['259ca9884c55ef7e909c0558978d73f915c6454d8e38bc576e8d48179138491a',
         '57630b792604ad1c663441890cda34728ffcb2c04d6b29dc720fd810318b61b6']

    Deleting these files would reset the cache without error. The cache can
    also be refreshed on a per-call basis by passing `redo=True` to the
    function call:

    .. code-block:: python

        >>> long_running_func(1, redo=True)  # > 0.1s
        1

    The parameters `'cache_dir'`, `'makedirs'`, and `'error'` can also be
    overridden at function call:

    .. code-block:: python

        >>> long_running_func(1, cache_dir=(tmpdir + '/cache2'))
        1
        >>> os.listdir(
        ...     tmpdir + '/cache2/pyTC.utilities.long_running_func'
        ... )  # doctest: +NORMALIZE_WHITESPACE
        ...
        ['259ca9884c55ef7e909c0558978d73f915c6454d8e38bc576e8d48179138491a']

    """

    funcname = ".".join([func.__module__, func.__name__])
    sig = inspect.Signature.from_callable(func)

    default_cache_dir = cache_dir
    default_makedirs = makedirs
    default_error = error

    @functools.wraps(func)
    def inner(*args, redo=False, cache_dir=None, makedirs=None, error=None, **kwargs):

        if cache_dir is None:
            cache_dir = default_cache_dir

        if makedirs is None:
            makedirs = default_makedirs

        if error is None:
            error = default_error

        if error is None:
            error = "raise"

        error = str(error).lower()
        assert error in [
            "raise",
            "ignore",
            "remove",
        ], "error must be one of `'raise'`, `'ignore'`, or `'remove'`"

        if cache_dir is None:
            ps = kwargs.get("ps")

            if ps is None:
                ps = pyTC.settings.Settings()

            cache_dir = ps.DIR_DATA_CACHE

        bound_args = sig.bind(*args, **kwargs)
        bound_args.apply_defaults()

        sha = hashlib.sha256(pickle.dumps(bound_args))
        path = os.path.join(cache_dir, funcname, sha.hexdigest())

        if not redo:
            try:
                with open(path, "rb") as f:
                    return pickle.load(f)
            except (OSError, EOFError, pickle.UnpicklingError):
                # missing, unreadable, or truncated cache file: recompute
                pass

        res = func(*args, **kwargs)

        try:
            if makedirs:
                os.makedirs(os.path.dirname(path), exist_ok=True)

            with open(path, "wb+") as f:
                pickle.dump(res, f)

        except (OSError, IOError, ValueError) as e:
            if error == "raise":
                raise
            elif error == "remove":
                try:
                    os.remove(path)
                except (IOError):
                    pass
                raise RuntimeError from e
            else:
                # case error == 'ignore'
                pass

        return res

    return inner
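
The docstring above notes that cached files are never cleaned up automatically. A minimal sketch of age-based pruning might look like the following. `prune_cache` is a hypothetical helper, not part of the implementation above, and it uses file modification time as a stand-in for creation date, since creation time is not available portably.

```python
import os
import tempfile
import time


def prune_cache(cache_dir, max_age_seconds, now=None):
    """Delete files under cache_dir older than max_age_seconds.

    Returns the list of removed paths. Age is measured from the file's
    modification time (an assumption; see the note above).
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
    return removed


# Build a throwaway cache directory with one stale and one fresh entry
cache_dir = tempfile.mkdtemp()
old_path = os.path.join(cache_dir, "old_entry")
new_path = os.path.join(cache_dir, "new_entry")
for p in (old_path, new_path):
    with open(p, "wb") as f:
        f.write(b"cached")

# Backdate one file by ten days to simulate a stale cache entry
ten_days_ago = time.time() - 10 * 24 * 3600
os.utime(old_path, (ten_days_ago, ten_days_ago))

# Remove anything older than seven days
removed = prune_cache(cache_dir, max_age_seconds=7 * 24 * 3600)
```

Since the decorator recomputes and rewrites any missing entry, pruning like this is safe to run at any time, at the cost of repeating the pruned calls.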
