amakelov / mandala

A simple & elegant experiment tracking framework that integrates persistence logic & best practices directly into Python

License: Apache License 2.0

Python 31.93% Jupyter Notebook 68.07%
data-science experiment-tracking incremental-computation machine-learning

mandala's People

Contributors

amakelov, nschiefer

mandala's Issues

Expected a function or method, but got `<class 'method_descriptor'>`

This is a super cool idea that I can see being amazingly useful! I've been testing it on one of my research projects, and it seems to be failing with the above error.

The project is quite big, so if you can point me toward where to look for a cause, I'll do my best to provide more info!

Exception running https://github.com/amakelov/mandala/blob/master/tutorials/00_hello.ipynb

The last cell gives me this exception.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-0b8fd8fcfabb> in <cell line: 1>()
      2     a = Q() # a placeholder for a value
      3     x = inc(a) # same code as above
----> 4     y = add(21, x) # same code as above
      5     df = q.get_table(a.named('a'), x.named('x'), y.named('y'))
      6 df

5 frames
/usr/local/lib/python3.9/dist-packages/mandala/queries/weaver.py in qwrap(obj, tp, strict)
    463     else:
    464         if strict:
--> 465             raise ValueError("value must be a `ValQuery` or `Ref`")
    466         if tp is None:
    467             tp = AnyType()

ValueError: value must be a `ValQuery` or `Ref`

Queries against functions with added arguments don't include the old calls

Steps to reproduce:

from mandala.all import *

storage = Storage()

@op
def inc(x: int) -> int:
    return x + 1

with storage.run():
    z = inc(23)

# redefine the op with an added argument (default value provided)
@op
def inc(x: int, amount: int = 1) -> int:
    return x + amount

with storage.run():
    z = inc(23, 10)

# the query should still include the calls to the old signature, but it doesn't
df = storage.similar(z)

This also affects the tutorial notebook 01_logistic.ipynb

Arg names for ops are case insensitive

Running the following results in an error:

from mandala.all import *
storage = Storage()
@op
def f(x, X) -> int:
    return x + X
with storage.run():
    f(1, 2)

OperationalError: duplicate column name: X

(SQLite column names are case-insensitive, so the columns created for x and X collide.)

Content hashing and library versions

A key property of content hashes is that they are deterministic. This allows
the system to handle Python objects and automatically arrive at the right UID
(hence, storage location) behind the scenes, without you having to think or
make decisions about names or storage. A simple example:

@op()
def f(x) -> int:
    return x + 1

# on Monday...
with run(storage):
    f(23)

# on Tuesday...
with run(storage):
    f(23)

Since the hashing is deterministic, this will correctly figure out that f was
already executed with this input.

However, a problem can arise in a few ways:

  • when a custom object you're hashing depends on some library: you update the
    library version, something about the object's internal representation
    changes, and you may end up with a different content hash;
  • you change the version of the tool used to magically hash (almost) any
    Python object (currently joblib);
  • multiple people access the same storage while using different library
    versions, which leads to different content hashes for equivalent objects.

This can be very bad, since it can silently trigger a complete recomputation of
a pipeline.
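
To make the failure mode concrete, here is a minimal sketch assuming joblib.hash
as the underlying hasher (as mentioned above); the Point class is made up:

import joblib

class Point:
    """Stand-in for a library-defined object whose internals may change between versions."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Deterministic within a fixed environment: equal values hash identically...
assert joblib.hash(Point(1, 2)) == joblib.hash(Point(1, 2))

# ...but the hash is computed from the object's pickled internals, so if a new
# library version renames or adds an attribute on Point, equivalent objects
# produced before and after the upgrade will hash differently.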

This issue's goal is to figure out what our constraints for this are and design a solution.
Some very rough possibilities:

  • record library versions and flat-out refuse to compute if a change is
    detected (too much?)
  • enforce canonical, e.g. JSON-serializable, values only: Python's native
    types and recursive combinations thereof, plus arrays/series/dataframes. Is
    that good enough?
  • let people implement their own content hashes for custom objects, alongside
    objects that are easy to serialize (see the sketch after this list). This
    can always be added later.
  • ???
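
A minimal sketch of what the third option could look like; the __content_hash__
hook and the content_hash helper are hypothetical, not part of mandala, and
joblib.hash stands in for the generic fallback hasher:

import hashlib
import joblib

def content_hash(obj) -> str:
    """Prefer an object-defined hash; fall back to the generic hasher."""
    custom = getattr(obj, "__content_hash__", None)  # hypothetical hook
    if callable(custom):
        return custom()
    return joblib.hash(obj)

class Embedding:
    """Example custom object that controls its own content hash."""
    def __init__(self, vector):
        self.vector = vector

    def __content_hash__(self) -> str:
        # hash only the semantically meaningful data, not internal layout details
        payload = ",".join(repr(v) for v in self.vector)
        return hashlib.sha256(payload.encode()).hexdigest()

Such a hook would make the hash independent of how the hosting library lays out
the object internally, at the cost of requiring users to implement it correctly.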

`BLOB longer than INT_MAX bytes` with 100s of MB of data from data frames

File some-path/.venv/lib/python3.12/site-packages/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

OverflowError: BLOB longer than INT_MAX bytes

Hi @amakelov,

My setup is all Python and roughly the following: functions are imported into a
notebook from around the code base, and the notebook invokes them to:

  1. read data frames (each 100s of MB in size);
  2. transform the value columns and create X and y (these transformations are
     likely to evolve);
  3. run ML fits and out-of-sample evaluations on the results.

I was excited to try @op and wanted it to memoize the results of (1) and (2) above. I think the error comes from results that are "too big", i.e. they exceed what a single row in the database can hold (e.g. when I ran the notebook with @op and a small X matrix, it appeared to work).
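
A minimal sketch of the setup and of the limit being hit; load_frame and make_xy
are made-up stand-ins for my functions, the import style follows the earlier
issues on this page, and pickle stands in for mandala's serialize:

from mandala.all import *
import pandas as pd
import pickle

storage = Storage()

@op
def load_frame(path: str) -> pd.DataFrame:
    # hypothetical loader: each frame is hundreds of MB
    return pd.read_parquet(path)

@op
def make_xy(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical transform producing the X/y matrix
    return df.dropna()

# Python's sqlite3 module raises OverflowError for a single BLOB longer than
# INT_MAX (2**31 - 1) bytes, so a memoized result whose serialized form exceeds
# that cannot be stored in one row.
def fits_in_one_blob(value) -> bool:
    return len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)) <= 2**31 - 1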

Test cases for workflows (with toy functions, for now)

For this issue we want to

  • have more test cases... :)
  • get an idea of what we're dealing with in terms of the "synthetic" properties of workflows:
    • the composition logic (how are the functions composed, e.g. nested loops as @nschiefer described to me);
    • the "quantitative" dimensions of workflows, like how many function calls we have ("total number of nodes in the computational graph"), and the size of the longest sequence of chained function calls (something like the "diameter of the computational graph");
    • the interfaces of the functions involved, especially if there's going to be something funky going on.

Other aspects of this will be very simple (in-memory storages for now, toy functions).

I'm working on providing a rich enough environment to do this in (most importantly, enabling superops in the two flavors we discussed).
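
A minimal sketch of the kind of toy workflow such test cases might exercise
(nested-loop composition of chained @op calls); the ops are trivial placeholders
and the import style follows the earlier issues on this page:

from mandala.all import *

storage = Storage()

@op
def preprocess(x: int) -> int:
    return x + 1

@op
def train(x: int, seed: int) -> int:
    return x * (seed + 2)

@op
def evaluate(model: int) -> int:
    return model % 7

with storage.run():
    # nested loops over inputs and seeds; the chain preprocess -> train -> evaluate
    # gives the computational graph some depth, the loops give it some width
    for x in range(3):
        p = preprocess(x)
        for seed in range(2):
            model = train(p, seed)
            score = evaluate(model)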

Assigning output UIDs

There are two choices we have for this:

  • content hashing: once the function computes its outputs, they are content
    hashed and this is their UID.
    • good: calls that end up computing the same thing (rare, but not
      vanishingly so) do not duplicate storage; accidentally unwrapping and
      then re-wrapping an output won't mess up the UID.
    • bad: takes time for large objects; objects no longer have a unique
      history (so, for example, you can't generate a unique piece of code that
      led to this output)
  • causal hashing: the call UID is combined with e.g. the index of the output
    of the function to obtain a new UID.
    • good: each value has a unique history; fast to compute
    • bad: you could end up in a situation where you break the chain of
      relations between things if you unwrap an output (which will lose the causal
      UID), and then wrap it again (which will assign it a content-based UID).
      This means that the system treats the two values as different, so you could
      end up computing the same things twice, and your relational queries will be
      broken.

It's very easy to add a config option and switch between the two, but for the
sake of clarity I think we should figure out which one we want as the default.
Content hashing seems like the safer bet in terms of transparency and avoiding
"broken" state?
