
memoize's Introduction


Extended docs (including API docs) available at memoize.readthedocs.io.

What & Why

What: Caching library for asyncio Python applications.

Why: Python deserves a library that works in the async world (for instance, handles dog-piling) and has a proper, extensible API.

Etymology

In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again. (…) The term “memoization” was coined by Donald Michie in 1968 and is derived from the Latin word “memorandum” (“to be remembered”), usually truncated as “memo” in the English language, and thus carries the meaning of “turning [the results of] a function into something to be remembered.” ~ Wikipedia

Getting Started

Installation

Basic Installation

To get up & running, all you need to do is install:

pip install py-memoize

Installation of Extras

To harness the power of ujson (if the JSON SerDe is used), install the extra:

pip install py-memoize[ujson]

Usage

The provided examples use the default configuration to cache results in memory. For configuration options, see Configurability.

asyncio

To apply the default caching configuration, use:

import asyncio
import random
from memoize.wrapper import memoize


@memoize()
async def expensive_computation():
    return 'expensive-computation-' + str(random.randint(1, 100))


async def main():
    print(await expensive_computation())
    print(await expensive_computation())
    print(await expensive_computation())


if __name__ == "__main__":
    asyncio.run(main())

Features

Async-first

Asynchronous programming is often seen as a huge performance boost in Python. But with all the benefits it brings, there are also new concurrency-related caveats like dog-piling.

This library is built async-oriented from the ground up, which manifests, for example, in Dog-piling proofness and Async cache storage.

Configurability

With memoize you have the following under control:

  • the timeout applied to the cached method;
  • the entry builder (update & expiry policy);
  • the eviction strategy;
  • the key extractor;
  • the cache storage;
  • postprocessing of cached values.

All of these elements are open for extension (you can implement and plug in your own). Please contribute!
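For instance, a custom key extractor could look like the sketch below. This is a hypothetical example: it assumes the base class memoize.key.KeyExtractor exposes a format_key(method_reference, call_args, call_kwargs) method, as the built-in extractors implement, so verify the exact signature against the API docs before relying on it.

from memoize.key import KeyExtractor


class FirstArgKeyExtractor(KeyExtractor):
    # Hypothetical extractor: builds the cache key from the method name and
    # the first positional argument only, ignoring all remaining arguments.
    def format_key(self, method_reference, call_args, call_kwargs) -> str:
        first = call_args[0] if call_args else None
        return f'{method_reference.__name__}({first!r})'

Such an extractor would then be plugged in via MutableCacheConfiguration.set_key_extractor, as in the customization example below.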

An example of how to customize the default config (everything gets overridden):

from datetime import timedelta

from memoize.configuration import MutableCacheConfiguration, DefaultInMemoryCacheConfiguration
from memoize.entrybuilder import ProvidedLifeSpanCacheEntryBuilder
from memoize.eviction import LeastRecentlyUpdatedEvictionStrategy
from memoize.key import EncodedMethodNameAndArgsKeyExtractor
from memoize.postprocessing import DeepcopyPostprocessing
from memoize.storage import LocalInMemoryCacheStorage
from memoize.wrapper import memoize

@memoize(
    configuration=MutableCacheConfiguration
    .initialized_with(DefaultInMemoryCacheConfiguration())
    .set_method_timeout(value=timedelta(minutes=2))
    .set_entry_builder(ProvidedLifeSpanCacheEntryBuilder(update_after=timedelta(minutes=2),
                                                         expire_after=timedelta(minutes=5)))
    .set_eviction_strategy(LeastRecentlyUpdatedEvictionStrategy(capacity=2048))
    .set_key_extractor(EncodedMethodNameAndArgsKeyExtractor(skip_first_arg_as_self=False))
    .set_storage(LocalInMemoryCacheStorage())
    .set_postprocessing(DeepcopyPostprocessing())
)
async def cached():
    return 'dummy'

Still, you can use the default configuration, which:

  • sets timeout for underlying method to 2 minutes;
  • uses in-memory storage;
  • uses method instance & arguments to infer cache key;
  • stores up to 4096 elements in cache and evicts entries according to least recently updated policy;
  • refreshes elements after 10 minutes & ignores unrefreshed elements after 30 minutes;
  • does not post-process cached values.

If that satisfies you, just use the default config:

from memoize.configuration import DefaultInMemoryCacheConfiguration
from memoize.wrapper import memoize


@memoize(configuration=DefaultInMemoryCacheConfiguration())
async def cached():
    return 'dummy'

Also, if you want to stick to the building blocks of the default configuration but need to adjust some basic params:

from datetime import timedelta

from memoize.configuration import DefaultInMemoryCacheConfiguration
from memoize.wrapper import memoize


@memoize(configuration=DefaultInMemoryCacheConfiguration(capacity=4096, method_timeout=timedelta(minutes=2),
                                                         update_after=timedelta(minutes=10),
                                                         expire_after=timedelta(minutes=30)))
async def cached():
    return 'dummy'

Tunable eviction & async refreshing

Some caching libraries allow providing only a TTL. This may result in a scenario where, when a cache entry expires, latency increases as the new value has to be recomputed. To mitigate this periodic extra latency, multiple delays are often used. In the case of memoize there are two (see :class:`memoize.entrybuilder.ProvidedLifeSpanCacheEntryBuilder`):

  • update_after defines the delay after which a background/async update is executed;
  • expire_after defines the delay after which an entry is considered outdated and invalid.

This allows the cached value to be refreshed in the background without any observable latency. Moreover, if some of those background refreshes fail, they will still be retried in the background. Due to this beneficial feature, it is recommended that update_after be significantly shorter than expire_after.
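For instance, a cache that refreshes entries in the background after one minute but keeps serving them for up to ten can be assembled from the same building blocks as in the customization example above (the delay values here are purely illustrative):

from datetime import timedelta

from memoize.configuration import DefaultInMemoryCacheConfiguration, MutableCacheConfiguration
from memoize.entrybuilder import ProvidedLifeSpanCacheEntryBuilder
from memoize.wrapper import memoize


@memoize(configuration=MutableCacheConfiguration
         .initialized_with(DefaultInMemoryCacheConfiguration())
         .set_entry_builder(ProvidedLifeSpanCacheEntryBuilder(
             update_after=timedelta(minutes=1),     # background refresh starts after 1 minute
             expire_after=timedelta(minutes=10))))  # entry is considered invalid after 10 minutes
async def cached():
    return 'dummy'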

Dog-piling proofness

If some resource is accessed asynchronously, dog-piling may occur. Caches designed for synchronous Python code (like the built-in LRU cache) will allow multiple concurrent tasks to observe a miss for the same resource and will proceed to flood the underlying/cached backend with requests for that same resource.

As this defeats the purpose of caching (the backend is effectively not always protected by the cache), memoize has built-in dog-piling protection.

Under the hood, concurrent requests for the same resource (cache key) get collapsed to a single request to the backend. When the resource is fetched, all requesters obtain the result. On failure, all requesters get an exception (the same happens on timeout).

An example of what this is all about:

import asyncio
from datetime import timedelta

from aiocache import cached, SimpleMemoryCache  # version 0.11.1 used here as an example of another cache implementation

from memoize.configuration import DefaultInMemoryCacheConfiguration
from memoize.wrapper import memoize

# scenario configuration
concurrent_requests = 5
request_batches_execution_count = 50
cached_value_ttl_ms = 200
delay_between_request_batches_ms = 70

# results/statistics
unique_calls_under_memoize = 0
unique_calls_under_different_cache = 0


@memoize(configuration=DefaultInMemoryCacheConfiguration(update_after=timedelta(milliseconds=cached_value_ttl_ms)))
async def cached_with_memoize():
    global unique_calls_under_memoize
    unique_calls_under_memoize += 1
    await asyncio.sleep(0.01)
    return unique_calls_under_memoize


@cached(ttl=cached_value_ttl_ms / 1000, cache=SimpleMemoryCache)
async def cached_with_different_cache():
    global unique_calls_under_different_cache
    unique_calls_under_different_cache += 1
    await asyncio.sleep(0.01)
    return unique_calls_under_different_cache


async def main():
    for i in range(request_batches_execution_count):
        await asyncio.gather(*[x() for x in [cached_with_memoize] * concurrent_requests])
        await asyncio.gather(*[x() for x in [cached_with_different_cache] * concurrent_requests])
        await asyncio.sleep(delay_between_request_batches_ms / 1000)

    print("Memoize generated {} unique backend calls".format(unique_calls_under_memoize))
    print("Other cache generated {} unique backend calls".format(unique_calls_under_different_cache))
    predicted = (delay_between_request_batches_ms * request_batches_execution_count) // cached_value_ttl_ms
    print("Predicted (according to TTL) {} unique backend calls".format(predicted))

    # Printed:
    # Memoize generated 17 unique backend calls
    # Other cache generated 85 unique backend calls
    # Predicted (according to TTL) 17 unique backend calls

if __name__ == "__main__":
    asyncio.run(main())

Async cache storage

The cache storage interface allows you to fully harness the benefits of asynchronous programming (see the interface of :class:`memoize.storage.CacheStorage`).

Currently, memoize provides only in-memory storage for cache values (internally at RASP we have others). If you want, for instance, Redis integration, you need to implement it yourself (please contribute!), but memoize will make optimal use of your async implementation from the start.
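If you do go down that path, a Redis-backed storage might look roughly like the sketch below. This is a minimal, hypothetical sketch: it assumes the CacheStorage interface consists of async get/offer/release methods (mirroring LocalInMemoryCacheStorage), that CacheEntry is picklable, and it uses the redis-py asyncio client, which is not a memoize dependency; verify all of this against :class:`memoize.storage.CacheStorage` before use.

import pickle
from typing import Optional

from redis.asyncio import Redis  # assumption: redis-py asyncio client

from memoize.entry import CacheKey, CacheEntry
from memoize.storage import CacheStorage


class RedisCacheStorage(CacheStorage):
    # Hypothetical storage: keeps pickled entries in Redis under a key prefix.
    def __init__(self, redis: Redis, prefix: str = 'memoize:') -> None:
        self._redis = redis
        self._prefix = prefix

    async def get(self, key: CacheKey) -> Optional[CacheEntry]:
        raw = await self._redis.get(self._prefix + str(key))
        return pickle.loads(raw) if raw is not None else None

    async def offer(self, key: CacheKey, entry: CacheEntry) -> None:
        await self._redis.set(self._prefix + str(key), pickle.dumps(entry))

    async def release(self, key: CacheKey) -> None:
        await self._redis.delete(self._prefix + str(key))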

Manual Invalidation

You can also invalidate entries manually. To do so, create an instance of :class:`memoize.invalidation.InvalidationSupport` and pass it alongside the cache configuration. Then just pass the args and kwargs for which you want to invalidate an entry.

import asyncio
import random

from memoize.configuration import DefaultInMemoryCacheConfiguration
from memoize.invalidation import InvalidationSupport
from memoize.wrapper import memoize

invalidation = InvalidationSupport()


@memoize(configuration=DefaultInMemoryCacheConfiguration(), invalidation=invalidation)
async def expensive_computation(*args, **kwargs):
    return 'expensive-computation-' + str(random.randint(1, 100))


async def main():
    print(await expensive_computation('arg1', kwarg='kwarg1'))
    print(await expensive_computation('arg1', kwarg='kwarg1'))

    print("Invalidation #1")
    await invalidation.invalidate_for_arguments(('arg1',), {'kwarg': 'kwarg1'})

    print(await expensive_computation('arg1', kwarg='kwarg1'))
    print(await expensive_computation('arg1', kwarg='kwarg1'))

    print("Invalidation #2")
    await invalidation.invalidate_for_arguments(('arg1',), {'kwarg': 'kwarg1'})

    print(await expensive_computation('arg1', kwarg='kwarg1'))

    # Sample output:
    #
    # expensive-computation-98
    # expensive-computation-98
    # Invalidation #1
    # expensive-computation-73
    # expensive-computation-73
    # Invalidation #2
    # expensive-computation-59

if __name__ == "__main__":
    asyncio.run(main())

Openness to granular TTL

The default configuration sets update and expiry based on fixed values, which are the same for all entries. If you need to set different TTLs for different entries, you can do so by providing a custom :class:`memoize.entrybuilder.CacheEntryBuilder`.

import datetime
import asyncio
import random
from dataclasses import dataclass

from memoize.wrapper import memoize
from memoize.configuration import DefaultInMemoryCacheConfiguration, MutableCacheConfiguration
from memoize.entry import CacheKey, CacheEntry
from memoize.entrybuilder import CacheEntryBuilder
from memoize.storage import LocalInMemoryCacheStorage


@dataclass
class ValueWithTTL:
    value: str
    ttl_seconds: int  # for instance, it could be derived from Cache-Control response header


class TtlRespectingCacheEntryBuilder(CacheEntryBuilder):
    def build(self, key: CacheKey, value: ValueWithTTL):
        now = datetime.datetime.now(datetime.timezone.utc)
        ttl_ends_at = now + datetime.timedelta(seconds=value.ttl_seconds)
        return CacheEntry(
            created=now,
            update_after=ttl_ends_at,
            # allowing stale data for 10% of TTL
            expires_after=ttl_ends_at + datetime.timedelta(seconds=value.ttl_seconds // 10),
            value=value
        )


storage = LocalInMemoryCacheStorage()  # overridden & extracted for demonstration purposes only


@memoize(configuration=MutableCacheConfiguration
         .initialized_with(DefaultInMemoryCacheConfiguration())
         .set_entry_builder(TtlRespectingCacheEntryBuilder())
         .set_storage(storage))
async def external_call(key: str):
    return ValueWithTTL(
        value=f'{key}-result-{random.randint(1, 100)}',
        ttl_seconds=random.randint(60, 300)
    )


async def main():
    await external_call('a')
    await external_call('b')
    await external_call('b')

    print("Entries persisted in the cache:")
    for entry in storage._data.values():
        print('Entry: ', entry.value)
        print('Effective TTL: ', (entry.update_after - entry.created).total_seconds())

    # Entries persisted in the cache:
    # Entry: ValueWithTTL(value='a-result-79', ttl_seconds=148)
    # Effective TTL: 148.0
    # Entry: ValueWithTTL(value='b-result-27', ttl_seconds=192)
    # Effective TTL: 192.0


if __name__ == "__main__":
    asyncio.run(main())

memoize's People

Contributors

zmumi, mwek


memoize's Issues

`str` keys are inconsistent and unpredictable

__str__ methods are not meant to be unique and should not be used to determine equality of objects. Currently, cache keys are generated with the str function, which uses the __str__ methods of objects. This may lead to duplicate method calls or incorrectly cached results.

For example, let's consider the dict type from Python Standard Library.

>>> x = {"a": "1", "b": "2"}
>>> y = {"b": "2", "a": "1"}
>>> x == y
True
>>> str(x) == str(y)
False

These two objects are equal, but their string representations differ, so the library caches them under different keys.

For another example, let's say that we have the following class:

class MyClass:
  def __init__(self, x, y):
    self.x = x
    self.y = y
  def __str__(self):
    return str(f"MyClass(x={self.x})")

In this case, the cache is not stable and will return incorrect results.

>>> x = MyClass("1", "1")
>>> y = MyClass("1", "555")
>>> x == y
False
>>> str(x) == str(y)
True

I think the default KeyExtractors provided by the library should not have this unstable behavior. As a workaround, I created a custom KeyExtractor and returned the arguments themselves from the format_key method (by not complying with the str return type hint).

Memoizing synchronous methods

I would like to use your library to address the dog piling problem.

Currently, I use caches with both sync and async methods. I know I could convert sync methods to async, but I like how purely sync methods can currently be identified, and I would like your thoughts on how to do this with your library.

[Question] Blocking on update

Hi folks,

currently, when a cached value is being recomputed, the expired value is returned until it's replaced by the new one. Is there a way to configure memoize so that the call blocks in the same way as when no value is available?

One way to get this working might be altering the get method in the storage:

    async def get(self, key: CacheKey) -> Optional[CacheEntry]:
        entry = self._data.get(key, None)
        if entry:
            if entry.expires_after <= datetime.datetime.utcnow():
                await self.release(key)
                entry = None
        return entry

But it feels like a dirty hack. The other way might be re-implementing the responsible part of the wrapper, specifically this part:

        elif actual_entry is not None and update_statuses.is_being_updated(key):
            logger.debug('As update point reached but concurrent update already in progress, '
                         'relying on concurrent refresh to finish %s', key)
            return actual_entry

What I'm looking for is a 'proper' way to do this. Could you suggest one? Thanks!

edit: code error

Tests failing with python 3.10

From https://bugs.debian.org/1002199 :

======================================================================
ERROR: memoize (unittest.loader._FailedTest)

ImportError: Failed to import test module: memoize
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/loader.py", line 470, in _find_test_path
package = self._get_module_from_name(name)
File "/usr/lib/python3.10/unittest/loader.py", line 377, in _get_module_from_name
import(name)
File "/<>/.pybuild/cpython3_3.10_pymemoize/build/memoize/init.py", line 1, in
from .core import Memoizer
File "/<>/.pybuild/cpython3_3.10_pymemoize/build/memoize/core.py", line 4, in
from .func import MemoizedFunction
File "/<>/.pybuild/cpython3_3.10_pymemoize/build/memoize/func.py", line 9, in
from .options import OptionProperty
File "/<>/.pybuild/cpython3_3.10_pymemoize/build/memoize/options.py", line 2, in
from collections import Callable
ImportError: cannot import name 'Callable' from 'collections' (/usr/lib/python3.10/collections/init.py)

Is there any interest in supporting Python 3.10?

Missing exception details on concurrent calls

There is an inconsistency in the exceptions raised from memoized functions. When a function is called concurrently and it raises an exception, only the first caller receives the exception details. The remaining callers receive CachedMethodFailedExceptions, but these don't contain any information about the actual exception. This prevents the callers from handling the exceptions accordingly.

Here is a small piece of code to reproduce the issue:

from datetime import timedelta
from memoize.wrapper import memoize
from memoize.configuration import MutableCacheConfiguration
from memoize.entrybuilder import ProvidedLifeSpanCacheEntryBuilder
from memoize.eviction import NoEvictionStrategy
from memoize.key import EncodedMethodReferenceAndArgsKeyExtractor
from memoize.storage import LocalInMemoryCacheStorage
from asyncio import sleep, gather

@memoize(configuration=MutableCacheConfiguration(
    configured=True,
    storage=LocalInMemoryCacheStorage(),
    key_extractor=EncodedMethodReferenceAndArgsKeyExtractor(),
    method_timeout=timedelta(hours=1),
    entry_builder=ProvidedLifeSpanCacheEntryBuilder(update_after=timedelta(hours=1), expire_after=timedelta(hours=1)),
    eviction_strategy=NoEvictionStrategy(),
))
async def test():
    await sleep(1)
    raise Exception("test")


results = await gather(test(), test(), test(), test(), return_exceptions=True)

print(results)

Output:

[CachedMethodFailedException('Refresh failed to complete', Exception('test')), CachedMethodFailedException('Concurrent refresh failed to complete'), CachedMethodFailedException('Concurrent refresh failed to complete'), CachedMethodFailedException('Concurrent refresh failed to complete')]

Wrapping an HTTP client call and using the cache-control header value to cache the response

As the title indicates, I have a need for the wrapped method to define the cache refresh time of the data that is being returned.

If I make an HTTP call to get a config item that rarely changes, the server might respond with the header Cache-Control: MAX-AGE=900, so now I want to cache that data for 900 seconds. But if the server changes to respond with Cache-Control: MAX-AGE=1500, I don't want to change my code; I want it to adapt to the new Cache-Control header.

There may be other patterns in which this type of caching could be used.

Do you have a method to handle this? If not, I have a rough POC of a change to your wrapper that could work, which I'd be willing to provide if you are interested.

Is it possible to have a simple cache interface?

Hi, although this lib is a great find, I'd like to avoid the complex decorator method and instead use something like:

cache = AsyncCache()
cache.add(key, value)
cache.get(key)

And the cache in this case would be a single cache instance across the whole async app.

Question: Non-expiring cache

Hi,

Is there a shorter way to change the cache expiry or make a non-expiring cache? Currently, I have to do this:

from datetime import timedelta

from memoize.configuration import MutableCacheConfiguration, DefaultInMemoryCacheConfiguration
from memoize.entrybuilder import ProvidedLifeSpanCacheEntryBuilder
from memoize.wrapper import memoize


@memoize(configuration=MutableCacheConfiguration
         .initialized_with(DefaultInMemoryCacheConfiguration())
         .set_entry_builder(ProvidedLifeSpanCacheEntryBuilder(update_after=timedelta(minutes=60),
                                                              expire_after=timedelta(minutes=120))))

It's too verbose for changing just the validity. Is there a better way to achieve this?

Also, how can I invalidate a cache, please?

[Question] Does this library work with gevent (including dog-piling mitigation)?

Hi,
I came across this library when I was looking for ways to mitigate dog-piling when using gevent (with gunicorn), but the docs only talk about asyncio and Tornado support, with no mention of gevent/greenlets. Would this library work with gevent? I believe gevent would lead to the same dog-piling issue if I need to refresh the cache using a request to an external API that takes a few seconds.

Thanks for the great work!
