amakelov / mandala
A simple & elegant experiment tracking framework that integrates persistence logic & best practices directly into Python
License: Apache License 2.0
This is a super cool idea that I can see being amazingly useful! I've been testing it on one of my research projects, and it seems to be failing with the above error.
The project is quite big, so if you can point me to some idea of where to look for a cause I'll do my best to help provide some more info!
The last cell gives me this exception.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-11-0b8fd8fcfabb>](https://localhost:8080/#) in <cell line: 1>()
2 a = Q() # a placeholder for a value
3 x = inc(a) # same code as above
----> 4 y = add(21, x) # same code as above
5 df = q.get_table(a.named('a'), x.named('x'), y.named('y'))
6 df
5 frames
[/usr/local/lib/python3.9/dist-packages/mandala/queries/weaver.py](https://localhost:8080/#) in qwrap(obj, tp, strict)
463 else:
464 if strict:
--> 465 raise ValueError("value must be a `ValQuery` or `Ref`")
466 if tp is None:
467 tp = AnyType()
ValueError: value must be a `ValQuery` or `Ref`
Steps to reproduce:
storage = Storage()

@op
def inc(x: int) -> int:
    return x + 1

with storage.run():
    z = inc(23)

@op
def inc(x: int, amount: int = 1) -> int:
    return x + amount

with storage.run():
    z = inc(23, 10)

df = storage.similar(z)
This also affects the tutorial notebook 01_logistic.ipynb
Running the following results in an error:
from mandala.all import *

storage = Storage()

@op
def f(x, X) -> int:
    return x + X

with storage.run():
    f(1, 2)

OperationalError: duplicate column name: X
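One plausible cause, offered as an assumption rather than a confirmed diagnosis: SQLite compares column names case-insensitively, so if the op's argument names end up as column names, `x` and `X` would collide. The standalone sketch below reproduces the same kind of error using only the standard-library `sqlite3` module. The table name `f_calls` and the helper `try_create` are made up for illustration and are not part of mandala.

```python
import sqlite3


def try_create(columns):
    """Try to create a table with the given column names.

    Returns the error text if SQLite rejects the schema, otherwise None.
    """
    conn = sqlite3.connect(":memory:")
    try:
        cols = ", ".join(f'"{c}"' for c in columns)
        # Hypothetical table name; quoting does not make names case-sensitive.
        conn.execute(f"CREATE TABLE f_calls ({cols})")
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


error_clash = try_create(["x", "X"])     # names differ only by case
error_distinct = try_create(["x", "y"])  # genuinely distinct names

print(error_clash)
print(error_distinct)
```

If this is indeed the cause, renaming one of the arguments so the names differ by more than case would be a simple way to check.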
It's probably fine but we should make sure.
A key property of content hashes is that they are deterministic. This allows you
to handle Python objects and automatically arrive at the right UID (hence,
storage location) behind the scenes, without having to think or make decisions
about names or storage. E.g., a simple example would be
@op()
def f(x) -> int:
    return x + 1

# on Monday...
with run(storage):
    f(23)

# on Tuesday...
with run(storage):
    f(23)
Since the hashing is deterministic, this will correctly figure out that f was
already executed with this input.
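To make the idea concrete, here is a minimal sketch of memoization keyed on a deterministic content hash. It is not mandala's implementation: `content_hash` and `MemoStore` are hypothetical names, and hashing pickled bytes with SHA-256 is an assumption chosen for brevity. Pickle output is only stable for simple values within one Python version, which is the kind of fragility this issue is concerned with.

```python
import hashlib
import pickle


def content_hash(obj) -> str:
    """Hash the pickled bytes of obj (assumption: simple, picklable values)."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


class MemoStore:
    """Hypothetical in-memory store that reuses results for identical inputs."""

    def __init__(self):
        self.results = {}
        self.executions = 0  # counts how often a function body actually ran

    def call(self, func, *args):
        # The key depends only on the function name and the argument contents.
        key = content_hash((func.__name__, args))
        if key not in self.results:
            self.executions += 1
            self.results[key] = func(*args)
        return self.results[key]


def f(x):
    return x + 1


store = MemoStore()
monday = store.call(f, 23)   # computed
tuesday = store.call(f, 23)  # found by hash, not recomputed
print(monday, tuesday, store.executions)
```

If the hash of the same input ever changed between the two calls, the second call would miss the stored result and recompute.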
However, a problem can arise in a few ways:
... joblib currently).
This can be very bad since you could trigger a completely new computation of a
pipeline.
This issue's goal is to figure out what our constraints for this are and design a solution.
Some very rough possibilities:
File [some-path/.venv/lib/python3.12/site-packages/mandala/storage_utils.py:140](http://localhost:8888/~/Repos/yfacrypto/.venv/lib/python3.12/site-packages/mandala/storage_utils.py#line=139), in SQLiteDictStorage.set(self, key, value, conn)
136 @transaction
137 def set(
138 self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
139 ) -> None:
--> 140 conn.execute(
141 f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
142 (key, serialize(value)),
143 )
OverflowError: BLOB longer than INT_MAX bytes
Hi @amakelov,
My setup is all python and roughly the following:
Functions are imported into a notebook from around the code base, the notebook invokes those functions to:
X and y (these transformations are likely to evolve).
I was excited to try @op and wanted it to memoize the results of (1) and (2) above. I think the error comes from results that are "too big", i.e. that exceed the capacity of a single row in the database (e.g. I ran the notebook with @op and a small X matrix and it appeared to work).
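The traceback shows a single value being bound as one BLOB larger than INT_MAX bytes (about 2 GiB). One possible workaround, sketched here as an assumption and not as anything mandala provides, is to split the serialized bytes across several rows and reassemble them on read. The table name `chunks` and the helpers `set_chunked` and `get_chunked` are hypothetical, and the chunk size is tiny only to keep the demo small.

```python
import sqlite3

CHUNK_SIZE = 4  # tiny on purpose; a realistic value would be many megabytes


def set_chunked(conn, key, payload, chunk_size=CHUNK_SIZE):
    """Store payload (bytes) under key as a sequence of small BLOB rows."""
    conn.execute("DELETE FROM chunks WHERE key = ?", (key,))
    rows = [
        (key, idx, payload[start:start + chunk_size])
        for idx, start in enumerate(range(0, len(payload), chunk_size))
    ]
    conn.executemany(
        "INSERT INTO chunks (key, idx, data) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def get_chunked(conn, key):
    """Reassemble the payload for key by concatenating its chunks in order."""
    cur = conn.execute(
        "SELECT data FROM chunks WHERE key = ? ORDER BY idx", (key,)
    )
    return b"".join(row[0] for row in cur)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (key TEXT, idx INTEGER, data BLOB, "
    "PRIMARY KEY (key, idx))"
)

payload = b"0123456789"
set_chunked(conn, "big_result", payload)
restored = get_chunked(conn, "big_result")
n_rows = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(n_rows, restored == payload)
```

Storing very large arrays in files on disk and keeping only their paths in the database would be another option with different trade-offs.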
For this issue we want to
Other aspects of this will be very simple (in-memory storages for now, toy functions).
I'm working on providing a rich enough environment to do this in (most importantly, enabling superops in the two flavors we discussed).
There are two choices we have for this:
It's very easy to add a config option and switch between the two, but I think
it'd be good to figure out what we want to go for here in terms of clarity.
Seems like content hashing is a safer bet in terms of transparency and avoiding
"broken" state?