amakelov / mandala

A simple & elegant experiment tracking framework that integrates persistence logic & best practices directly into Python

License: Apache License 2.0

Python 31.93% Jupyter Notebook 68.07%
data-science experiment-tracking incremental-computation machine-learning

mandala's People

Contributors

amakelov, nschiefer

mandala's Issues

Expected a function or method, but got `<class 'method_descriptor'>`

This is a super cool idea that I can see being amazingly useful! I've been testing it on one of my research projects, and it seems to be failing with the above error.

The project is quite big, so if you can point me toward where to look for a cause, I'll do my best to provide more info!

Exception running https://github.com/amakelov/mandala/blob/master/tutorials/00_hello.ipynb

The last cell gives me this exception.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-0b8fd8fcfabb> in <cell line: 1>()
      2     a = Q() # a placeholder for a value
      3     x = inc(a) # same code as above
----> 4     y = add(21, x) # same code as above
      5     df = q.get_table(a.named('a'), x.named('x'), y.named('y'))
      6 df

5 frames
/usr/local/lib/python3.9/dist-packages/mandala/queries/weaver.py in qwrap(obj, tp, strict)
    463     else:
    464         if strict:
--> 465             raise ValueError("value must be a `ValQuery` or `Ref`")
    466         if tp is None:
    467             tp = AnyType()

ValueError: value must be a `ValQuery` or `Ref`

Queries against functions with added arguments don't include the old calls

Steps to reproduce:

from mandala.all import *

storage = Storage()

@op
def inc(x: int) -> int:
    return x + 1

with storage.run():
    z = inc(23)

# redefine the op with an added argument (default value provided)
@op
def inc(x: int, amount: int = 1) -> int:
    return x + amount

with storage.run():
    z = inc(23, 10)

# the query should still include the calls to the old signature, but it doesn't
df = storage.similar(z)

This also affects the tutorial notebook 01_logistic.ipynb

Arg names for ops are case insensitive

Running the following results in an error:

from mandala.all import *
storage = Storage()
@op
def f(x, X) -> int:
    return x + X
with storage.run():
    f(1, 2)

OperationalError: duplicate column name: X

(SQLite column names are case-insensitive, so the columns created for x and X collide.)

Content hashing and library versions

A key property of content hashes is that they are deterministic. This allows
the system to handle Python objects and automatically arrive at the right UID
(hence, storage location) behind the scenes, without you having to think or
make decisions about names or storage. A simple example:

@op()
def f(x) -> int:
    return x + 1

# on Monday...
with run(storage):
    f(23)

# on Tuesday...
with run(storage):
    f(23)

Since the hashing is deterministic, this will correctly figure out that f was
already executed with this input.

However, a problem can arise in a few ways:

  • when a custom object you're hashing depends on some library: you update the
    library version, something about the object's internal representation
    changes, and you may end up with a different content hash;
  • you change the version of the tool used to magically hash (almost) any
    Python object (currently joblib);
  • multiple people access the same storage while using different library
    versions, which leads to different content hashes for equivalent objects.

This can be very bad, since it can silently trigger a complete recomputation of
a pipeline.
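
To make the failure mode concrete, here is a minimal sketch assuming joblib.hash
as the underlying hasher (as mentioned above); the Point class is made up:

import joblib

class Point:
    """Stand-in for a library-defined object whose internals may change between versions."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Deterministic within a fixed environment: equal values hash identically...
assert joblib.hash(Point(1, 2)) == joblib.hash(Point(1, 2))

# ...but the hash is computed from the object's pickled internals, so if a new
# library version renames or adds an attribute on Point, equivalent objects
# produced before and after the upgrade will hash differently.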

This issue's goal is to figure out what our constraints for this are and design a solution.
Some very rough possibilities:

  • record library versions and flat-out refuse to compute if a change is
    detected (too much?)
  • enforce canonical, e.g. JSON-serializable, values only: Python's native
    types and recursive combinations thereof, plus arrays/series/dataframes. Is
    that good enough?
  • let people implement their own content hashes for custom objects, alongside
    objects that are easy to serialize (see the sketch after this list). This
    can always be added later.
  • ???
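
A minimal sketch of what the third option could look like; the __content_hash__
hook and the content_hash helper are hypothetical, not part of mandala, and
joblib.hash stands in for the generic fallback hasher:

import hashlib
import joblib

def content_hash(obj) -> str:
    """Prefer an object-defined hash; fall back to the generic hasher."""
    custom = getattr(obj, "__content_hash__", None)  # hypothetical hook
    if callable(custom):
        return custom()
    return joblib.hash(obj)

class Embedding:
    """Example custom object that controls its own content hash."""
    def __init__(self, vector):
        self.vector = vector

    def __content_hash__(self) -> str:
        # hash only the semantically meaningful data, not internal layout details
        payload = ",".join(repr(v) for v in self.vector)
        return hashlib.sha256(payload.encode()).hexdigest()

Such a hook would make the hash independent of how the hosting library lays out
the object internally, at the cost of requiring users to implement it correctly.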

`BLOB longer than INT_MAX bytes` with 100s of MB of data from data frames

File some-path/.venv/lib/python3.12/site-packages/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

OverflowError: BLOB longer than INT_MAX bytes

Hi @amakelov,

My setup is all Python and roughly the following: functions are imported into a
notebook from around the code base, and the notebook invokes them to:

  1. read data frames (each 100s of MB in size);
  2. transform the value columns and create X and y (these transformations are
     likely to evolve);
  3. run ML fits and out-of-sample evaluations on the results.

I was excited to try @op and wanted it to memoize the results of (1) and (2) above. I think the error comes from results that are "too big", i.e. they exceed what a single row in the database can hold (e.g. when I ran the notebook with @op and a small X matrix, it appeared to work).
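
A minimal sketch of the setup and of the limit being hit; load_frame and make_xy
are made-up stand-ins for my functions, the import style follows the earlier
issues on this page, and pickle stands in for mandala's serialize:

from mandala.all import *
import pandas as pd
import pickle

storage = Storage()

@op
def load_frame(path: str) -> pd.DataFrame:
    # hypothetical loader: each frame is hundreds of MB
    return pd.read_parquet(path)

@op
def make_xy(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical transform producing the X/y matrix
    return df.dropna()

# Python's sqlite3 module raises OverflowError for a single BLOB longer than
# INT_MAX (2**31 - 1) bytes, so a memoized result whose serialized form exceeds
# that cannot be stored in one row.
def fits_in_one_blob(value) -> bool:
    return len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)) <= 2**31 - 1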

Test cases for workflows (with toy functions, for now)

For this issue we want to

  • have more test cases... :)
  • get an idea of what we're dealing with in terms of the "synthetic" properties of workflows:
    • the composition logic (how are the functions composed, e.g. nested loops as @nschiefer described to me);
    • the "quantitative" dimensions of workflows, like how many function calls we have ("total number of nodes in the computational graph"), and the size of the longest sequence of chained function calls (something like the "diameter of the computational graph");
    • the interfaces of the functions involved, especially if there's going to be something funky going on.

Other aspects of this will be very simple (in-memory storages for now, toy functions).

I'm working on providing a rich enough environment to do this in (most importantly, enabling superops in the two flavors we discussed).
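
A minimal sketch of the kind of toy workflow such test cases might exercise
(nested-loop composition of chained @op calls); the ops are trivial placeholders
and the import style follows the earlier issues on this page:

from mandala.all import *

storage = Storage()

@op
def preprocess(x: int) -> int:
    return x + 1

@op
def train(x: int, seed: int) -> int:
    return x * (seed + 2)

@op
def evaluate(model: int) -> int:
    return model % 7

with storage.run():
    # nested loops over inputs and seeds; the chain preprocess -> train -> evaluate
    # gives the computational graph some depth, the loops give it some width
    for x in range(3):
        p = preprocess(x)
        for seed in range(2):
            model = train(p, seed)
            score = evaluate(model)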

Assigning output UIDs

There are two choices we have for this:

  • content hashing: once the function computes its outputs, they are content
    hashed and this is their UID.
    • good: calls that end up computing the same thing (rare, but not
      vanishingly so) do not duplicate storage; accidentally unwrapping and
      then re-wrapping an output won't mess up the UID.
    • bad: takes time for large objects; objects no longer have a unique
      history (so, for example, you can't generate a unique piece of code that
      led to this output)
  • causal hashing: the call UID is combined with e.g. the index of the output
    of the function to obtain a new UID.
    • good: each value has a unique history; fast to compute
    • bad: you could end up in a situation where you break the chain of
      relations between things if you unwrap an output (which will lose the causal
      UID), and then wrap it again (which will assign it a content-based UID).
      This means that the system treats the two values as different, so you could
      end up computing the same things twice, and your relational queries will be
      broken.

It's very easy to add a config option and switch between the two, but for the
sake of clarity I think we should figure out which one we want as the default.
Content hashing seems like the safer bet in terms of transparency and avoiding
"broken" state?
