
linealabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.

Home Page: https://lineapy.org

License: Apache License 2.0

Python 45.87% Dockerfile 0.02% Jupyter Notebook 53.50% Makefile 0.11% Shell 0.01% Jinja 0.47% Mako 0.01%

lineapy's People

Contributors

1dividedby0, aayan636, andycui97, becca-miller, dependabot[bot], dorx, edwardlee4948, hogepodge, joshpoll, lazargugleta, lineainfra, lionsardesai, loganloganlogan, lorddarkula, maksimbr, marov, mingjerli, moustafa-a, pd-t, saulshanabrook, yifanwu, yoonspark


lineapy's Issues

DB refactoring

Right now, Library objects are stored directly by the SessionContext and ImportNode models (currently being pickled). Per @yifanwu's comments, we should instead create a separate ORM model for Library and reference it with LineaID.
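
A minimal sketch of what that might look like (assuming SQLAlchemy 1.4+ declarative models; the class and column names below are placeholders, not the actual lineapy schema):

from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class LibraryORM(Base):
    __tablename__ = "library"
    id = Column(String, primary_key=True)   # LineaID
    name = Column(String, nullable=False)
    version = Column(String)
    path = Column(String)

class ImportNodeORM(Base):
    __tablename__ = "import_node"
    id = Column(String, primary_key=True)                   # LineaID
    library_id = Column(String, ForeignKey("library.id"))   # reference, not a pickle
    library = relationship(LibraryORM)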

Supporting use cases beyond cli + Jupyter NB/Lab

Currently, the instrumentation relies on having control over the execution of the Python script---this means, right now, either the CLI or the kernel.

From this official post, it seems that AWS's EMR allows for custom kernels. However, we should test it ourselves to see how reliable it is.

My cursory search didn't reveal any Databricks support for custom kernels. They do have some exploratory work supporting JupyterLab, but even then they might use their own custom kernel.

These questions are important to understand for:

  • Short term: locating design partners
  • Long term: understanding our technical limitations and whether we need to take action to overcome them.

These all require approximately half a day's work.

Put linea cli into PATH, to support `lineapy my_python_file.py`

We are supporting the use case of lineapy my_python_file.py

To do so, I've been looking to Flask as a reference---Flask ships a flask run command. It uses click, but in the basic click tutorials (https://github.com/pallets/click) you still execute a Python file directly, e.g., linea-cli.py my_python_file.py. I would like us to figure out how to do something similar to flask run. I dug around https://github.com/pallets/flask/blob/main/src/flask/cli.py but didn't quite figure out how they did it. Maybe I'm just missing something obvious?
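
For reference, the usual way flask run-style commands get onto PATH is a console_scripts entry point wired to a click command. A minimal sketch (the module path, command name, and option handling here are assumptions, not the actual lineapy CLI):

# lineapy/cli.py (hypothetical location)
import click

@click.command()
@click.argument("file_name", type=click.Path(exists=True))
def linea_cli(file_name):
    """Run FILE_NAME under linea instrumentation."""
    click.echo(f"tracing {file_name}")
    # ... transform and execute the file here ...

# setup.py would then expose it on PATH, similar to how Flask exposes `flask`:
#   entry_points={"console_scripts": ["lineapy = lineapy.cli:linea_cli"]}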

Another benefit of lineapy my_python_file.py is that it introduces a layer of abstraction (in case we change the file name etc.)

Note that this needs Yifan's current work on transformers to be merged into main.

Future re-exec items

There are a few re-exec items that we haven't gotten to yet (and similarly for the database and the transformer, since the node types are not defined yet):

  • ClassNode
class Counter:
    def __init__(self):
        self.counter = 0

    def inc(self):
        self.counter += 1

c = Counter()
c.inc()
  • try/except and raise

@dorx and I discussed this, and these items can be deferred until we have something working end to end.

Re-exec involving external data

Add tests to include loading a csv to a pandas dataframe.

To make the test pass, we will also need another node type (in types.py) for the data source (we can start with CSV, but note that we also need to support S3, databases, etc.).
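
A rough sketch of what such a test could look like (scaffold only; the lineapy tracing/re-exec entry points are placeholders until the DataSourceNode work lands):

# test_csv_reexec.py
import pandas as pd

CODE = """
import pandas as pd
df = pd.read_csv("housing.csv")
rows = len(df)
"""

def test_reexec_with_csv(tmp_path):
    csv_path = tmp_path / "housing.csv"
    pd.DataFrame({"price": [1, 2, 3]}).to_csv(csv_path, index=False)
    # Trace CODE, then re-execute the captured graph and compare results, e.g.:
    #   graph = trace(CODE, working_directory=tmp_path)   # placeholder API
    #   assert execute_program(graph)["rows"] == 3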

@1dividedby0 Please come up with an initial plan and have @dorx and me sign off. This should be fun!

Future: maybe find ways to avoid using a custom Kernel and command line tool

@dorx and I explored ways to instrument the code without (1) instrumenting the kernel for Jupyter/IPython, and (2) a linea command-line tool for Python scripts.

We thought about something like monkey patching, as below:

from IPython import get_ipython
shell = get_ipython()
shell.do_execute = lambda *args: print("hello!")
a = 1

Or instrumenting it at the JavaScript level, but that seems really janky ¯_(ツ)_/¯. So we are just going to go with the kernel and command-line tool for now.

We need to look into whether instrumented kernels are compatible with Databricks---@dorx, could you please look into it?

Function of edges in the graph

If I have:

line_1: a = 0
line_2: b = a
line_3: a = 2

I could have edges line_1 -> line_2, line_2 -> line_3.
OR I could have line_1 -> line_2, line_1 -> line_3.
OR I could just have line_1 -> line_2.

In the first example, it doesn't matter what order we put the nodes in when we create the graph, because the program will always be executed in the desired order.

In the second example, we have the reassurance that line_1 (a = 0) will always execute before line_2 and line_3. But that can be problematic if line_3 executes before line_2.

In the third example (which is what we have right now), we are just ensuring minimum functionality, i.e., a has to be initialized before we reference it with b = a. In that case, we are completely reliant on ordering the nodes properly when we pass them into the graph constructor.

Which way do we want to go?

Support asynchronous evaluation (for IPython 7+)

Currently, everything in lineapy is synchronous. However, the Python world is changing (asyncio is now a built-in library). We should consider async support at some point as good eng practice (along with multi-threading).

This would otherwise NOT be on the critical path, if not for the fact that newer versions of JupyterLab expect the cell execution functions to be async. This blog post explains their rationale in more detail.

We should investigate the scope to which this applies---our design partners might still be on older versions of Jupyter.
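
For context, a minimal sketch of what an async execution hook could look like on top of ipykernel (assuming ipykernel >= 6, where do_execute may be a coroutine; the class name is a placeholder, not lineapy's kernel):

from ipykernel.ipkernel import IPythonKernel

class AsyncLineaKernel(IPythonKernel):
    async def do_execute(self, code, silent, store_history=True,
                         user_expressions=None, allow_stdin=False, **kwargs):
        # ... run lineapy's tracing hooks here (possibly awaiting I/O) ...
        return await super().do_execute(
            code, silent, store_history=store_history,
            user_expressions=user_expressions, allow_stdin=allow_stdin, **kwargs)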

MVP: a runtime version of lineapy

Given that LineaDB is going to require some iteration, the fastest path towards something usable is to keep everything in memory.

Plus, having the software decoupled from the database (when everything happens to be in memory) would be a performance enhancement.

To close this issue, we need the following minimum sets of features:

  • having the kernel & scripting version running
  • the transformer tracking calls
  • calls creating relevant nodes that we can use for different services like slicing, etc. (we should probably lift the slicer out of the DB class method at some point for this)

Make data persistent

As of the time this issue was created, the database is a transient SQLite in-memory store.

We should change this to a more persistent store (that does not get overwritten), so we can actually start persisting values across usage sessions.

This should be a quick fix: either change or override the default here:

database_uri: str = "sqlite:///:memory:"
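
For example, pointing it at a file-backed SQLite database (the file name here is arbitrary):

database_uri: str = "sqlite:///lineapy.db"   # persists across sessions, unlike :memory: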

Create a set of Node, Edge, and SessionContext objects for Housing Price example

Current API proposals:

from __future__ import annotations

from enum import Enum
from typing import Any, Optional, Tuple
import datetime
import uuid

class Node:
    name: str
    uuid: str    # populated on creation by uuid.uuid4()
    value: Any   # raw value of the node
    code: str
    session_id: str    # refers to SessionContext.uuid
    context: NodeContext

class NodeContext:
    line_number: int
    columns: Tuple[int, int]
    execution_time: datetime.datetime
    # TODO loop, conditional context

class DirectedEdge:
    source_node_id: str    # refers to Node.uuid
    sink_node_id: str      # refers to Node.uuid

class SessionContext:
    uuid: str    # populated on creation by uuid.uuid4()
    session_name: str    # obtained from name in with tracking(name=...):
    file_name: Optional[str]
    user_name: Optional[str]
    environment_type: SessionType
    creation_time: datetime.datetime
    hardware_spec: Optional[HardwareSpec]

class SessionType(Enum):
    JUPYTER = 1
    SCRIPT = 2

class HardwareSpec:
    # TODO
    ...

Create Executor API

  • Create a Flask app
  • Add Execute endpoint with query params "artifact_id" and optional param "version" (integer)
    • Execute endpoint should
      • Increment the version if "version" param is not specified
      • Create a new row in Execution table with version number and artifact id
      • Grab artifact from DB
      • Call get_graph_from_artifact_id
      • Call execute_program on the Graph object
      • Run through Graph object nodes and write their values to NodeValueORM with new version (including the value of the Artifact Node itself)
      • Create relationships between Execution row and NodeValue objects
      • Return the new Artifact JSON with its new NodeValue
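
A rough sketch of that endpoint shape (only get_graph_from_artifact_id and execute_program come from the notes above; the stubbed helpers and JSON shape are placeholders):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Stubs standing in for the LineaDB / Executor calls named above.
def next_version_for(artifact_id): return 1
def record_execution(artifact_id, version): pass
def get_graph_from_artifact_id(artifact_id): return None
def execute_program(graph): pass
def write_node_values(graph, version): return {}

@app.route("/execute")
def execute():
    artifact_id = request.args["artifact_id"]
    version = request.args.get("version", type=int)
    if version is None:
        version = next_version_for(artifact_id)   # increment when not specified
    record_execution(artifact_id, version)        # new row in the Execution table
    graph = get_graph_from_artifact_id(artifact_id)
    execute_program(graph)
    node_values = write_node_values(graph, version)   # persist NodeValueORM rows
    return jsonify({"artifact_id": artifact_id,
                    "version": version,
                    "node_values": node_values})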

Supporting chart artifacts

Currently, we just store the values of the nodes. However, some (popular) visualization libraries make use of side effects to render the visualization, so the variable the user is dealing with is often not the value of the chart.

Let's take matplotlib as an example, using a snippet from their gallery:

import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()

Which variable should we tell the Linea user to publish?

In order to support the DataAssetManager-based logic (saving some variable value), we would need to look into the return values of fig.savefig and plt.show to see which is the easiest for us to use (I dug around for 2 min, but it seems like it would take longer, e.g., the show function traces to a few overloads that we'd need to read through to understand).

Matplotlib is known to be tricky to deal with (global variables everywhere, exemplified by the use of plt). Vega-Lite (through Altair) is, I think, much better, but they still do not have a commonly used function that just returns the image binary---which is pretty unhelpful for the "normal" use cases.

Most visualization libraries do, however, offer very easy support for writing to an image file. For example, in the matplotlib case we have fig.savefig("test.png"), and for Altair we have the example below---notice the different file formats and their implications (the JSON spec we can easily render in our UI via JS, but it's less portable than a PDF/PNG, while the latter requires additional libraries, i.e., altair_saver).

import altair as alt
from vega_datasets import data

chart = alt.Chart(data.cars.url).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N'
)

chart.save('chart.json')
chart.save('chart.pdf')

So maybe instead of trying to dynamically figure out how to work with each library, we just directly intercept at the file level? I think both the pro and the con is that we rely on the user to figure out how to save, which makes Linea less magical but also means less work for us (there would still be more work on the DataAssetManager to also handle files rather than just in-memory variables).

My vote is to let the user give us the file containing the visualization. @dorx, what do you think?

If you are also on board, then we should think about the implications for the lineapy.publish API.

Support function definitions modifying globals

For example:

import math
a = 0
def my_function():
    a = math.factorial(5)
my_function()

The global variable a will not be modified unless global a is declared within the my_function() definition.
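
For reference, the variant that does mutate the global:

import math

a = 0

def my_function():
    global a              # without this, `a = ...` binds a new local variable
    a = math.factorial(5)

my_function()
print(a)  # 120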

eng setup

Let's add

  • requirements.txt & dev-requirements.txt
  • a readme
    • installation w/ venv etc.
    • descriptions of how to run the black and mypy checks.

Assigning @dorx since it's already running on your machine. Happy to take a stab at this too if you are busy.

Make `execute_program` evaluate nodes at a time (as opposed to a whole graph)

Currently, execute_program takes in a whole graph. To support the notebook use case, which is an ongoing session, we need to add an execute_node method that executes a list of nodes at a time. Importantly, the next execute_node call will share the same current program state (all the variables, function definitions, and imported modules). You can assume that these nodes are ordered properly.

As part of this refactor, you might need to refactor and reuse the walk method, since it is already evaluating one node at a time. You might also need to lift some of the function state in walk into the class instance state (e.g., scoped_locals). You can assume that each Executor instance is only associated with one session (so two scripts would correspond to two Executor instances).
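
A rough sketch of the shape this refactor could take (scoped_locals and walk come from the notes above; everything else is illustrative, not the real Executor):

class Executor:
    def __init__(self):
        # one Executor per session: shared variables, functions, imported modules
        self.scoped_locals = {}

    def execute_node(self, nodes):
        """Execute an ordered list of nodes against the shared session state."""
        for node in nodes:
            self.walk(node)

    def walk(self, node):
        # placeholder: evaluate a single node against self.scoped_locals
        exec(node.code, self.scoped_locals)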

I think it would be nice for @1dividedby0 to work on this given that he is the most familiar with the executor, or alternatively I can work on it as a forcing function for Dhruv's eng handoff. I have a slight preference to stay on the transformer/tracer side of things for less context switching.

cc @dorx for another take.

Cloud sync for local data

This would be a simple wrapper using S3 to upload a local file accessed during a run. This will help with re-execution.

However, we should probably defer this until we have a remote server set up, which won't be for a few weeks.
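
When we do get to it, the wrapper could be as small as a boto3 upload_file call (the bucket name and key scheme below are assumptions):

import os
import boto3

def sync_local_file(path: str, bucket: str = "linea-user-data") -> str:
    """Upload a local file accessed during a run and return its S3 URI."""
    key = os.path.basename(path)
    boto3.client("s3").upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"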

[Front-end] Tasks page

For documentation purposes only, since the code will be committed to linea-server.

This is the "Tasks" page from the Figma prototype. We need to make the following modifications from the Figma page:

  • Ignore the four boxes at the top. Have a single box that shows the number of active Tasks.

Future enhancements to `DataSourceNode`

We decided to just support local file systems for recognizing data sources.

Note that this is missing a few critical items:

This is similar to issue #20, low pri.

Fix the broken lineapy kernel installation

The kernel currently is not added to Jupyter's list of kernels with pip install -e .

It is, however, added to home/ubuntu/.local/share/jupyter/kernels/kernel, so I'm not sure what's going on. Need someone to help debug.

More proper way to use `NodeTransformer` in the iPython Kernel

Per a chat with Stephen (author of nbsafety), we should probably piggyback on the existing ast_transformers in the IPython kernel. I understood this pattern from:

old = _ipython().ast_transformers
_ipython().ast_transformers = old + transformers

I haven't had time to explore what the tradeoffs/implications are; just putting this down for future improvements.
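
For reference, a sketch of the registration pattern (InteractiveShell.ast_transformers is a list of ast.NodeTransformer instances applied to every cell before execution; the transformer below is a trivial stand-in, not lineapy's NodeTransformer):

import ast
from IPython import get_ipython

class NoopTransformer(ast.NodeTransformer):
    def visit_Module(self, node):
        # lineapy would rewrite/instrument the tree here
        self.generic_visit(node)
        return node

shell = get_ipython()
if shell is not None:
    shell.ast_transformers.append(NoopTransformer())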

Fix the caching warning related to the usage of `TypeDecorator` for SQLAlchemy

This warning is showing up for all SQLAlchemy node types that use TypeDecorator, which include AttributesDict and LineaID.

  /home/ubuntu/linea-dev/lineapy/lineapy/db/db.py:232: SAWarning: TypeDecorator LineaID(length=16) will not produce a cache key because the ``cache_ok`` flag is not set to True.  Set this flag to True if this type object's state is safe to use in a cache key, or False to disable this warning.
    node = self.session.query(NodeORM).filter(NodeORM.id == linea_id).one()
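
The fix the warning points at is declaring cache_ok on the custom types; a sketch (the underlying impl for LineaID is an assumption about how it is currently defined):

from sqlalchemy import String
from sqlalchemy.types import TypeDecorator

class LineaID(TypeDecorator):
    impl = String(16)
    cache_ok = True   # safe if the type holds no per-instance state; silences the SAWarning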

Executor to populate the `Library` `version` and `path` if not already present

Due to the current way the tracer interacts with the Executor, the first time the Library object is created, the tracer does not have access to the values---the executor does, so it should introspect the version and path and populate them if not already filled.

I think this should happen right after the following if statement

elif node.node_type == NodeType.ImportNode:
    node = cast(ImportNode, node)
    node.module = importlib.import_module(node.library.name)

in executor.py

Since the classes are all passed by reference, the Tracer can just pass the referenced Library to the database for serialization afterwards.
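
A sketch of the introspection step (the Library attribute names come from this issue; the version/path lookup strategy is an assumption):

import importlib

def populate_library_metadata(node):
    """Fill in Library.version/path from the imported module if missing."""
    module = importlib.import_module(node.library.name)
    node.module = module
    if not node.library.version:
        node.library.version = getattr(module, "__version__", "")
    if not node.library.path:
        node.library.path = getattr(module, "__file__", "") or ""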

@1dividedby0 please let me know what you think. If you might be up to it, this would be a great issue to bundle with #68 for a PR.

clarifying the semantics of variable alias in node definitions (and capture)

@1dividedby0 asked a very good question about what node we should use to represent a = b. Intuitively, we can think of this as an "identity function", but then the question is what the semantics of this identity function are. For values like numbers and strings, it's basically a "LiteralAssignNode", in that the target variable just gets a snapshot of the literal value. See below:

>>> a = 1
>>> b = a
>>> a
1
>>> a = 2
>>> b
1
>>> c = "hello"
>>> d = c
>>> c += "a"
>>> d
'hello'

When it's a mutable object, e.g., a list, then the "identity function" is really just an alias.

>>> x = [1,2,3]
>>> y = x
>>> x.append(4)
>>> y
[1, 2, 3, 4]

Given that we don't currently have such an "alias" representation, we need to create ObjectAliasNode (@dorx and I discussed the alternative where we just add an "alias" field to some other nodes, but that's likely going to be annoying for the relational DB). Note that this alias node should be used for slicing analysis as well---anything that touches x later will now be in y's dependencies. Note also that this shifts the burden onto the transformer API to differentiate between the two.

Add support for "headless" literals and variables.

Currently, our nodes do not support the following two cases, which really do not have any effect on the state of the program:

a = 1
a
1

@dorx thinks that we should accommodate these cases, which we could do by extending LiteralAssignNode and VariableAliasNode, making the assigned_variable_name field optional.

This would also mean, I think, that the following call will also have an edge between the call and the -11 node.

abs(-11)

This refactor should be pretty easy, but off the critical path for now.

Create tests to measure execution overhead (and savings) by `lineapy`

It's always good to have the ability to keep track of the overhead of lineapy so we can (1) assure our users, and (2) know when we need to start optimizing.

This is low pri for now. This would be a good onboarding item because it's pretty well isolated and involves end to end usage and instrumentation of the lineapy codebase.
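
A rough shape for such a test, comparing plain python against the lineapy CLI on the same script (the script path and the overhead bound are placeholders to be tuned):

import subprocess
import sys
import time

def timed_run(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def test_overhead(script="tests/housing.py"):   # placeholder script
    baseline = timed_run([sys.executable, script])
    traced = timed_run(["lineapy", script])
    assert traced < 2 * baseline + 1.0   # crude bound; revisit once we have real numbers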

Serialization mechanism

PR #51 used SQLAlchemy to pickle cached node values.

Inspecting where pickle is used and imported in the code, we can confirm that it is simply the standard Python pickle implementation.

There are two main things to look out for in the future:

(1) If we want better control over the pickling process. For instance, we might want to use cloudpickle instead for complex objects---see their GitHub readme for a discussion (I think Ray also uses cloudpickle per the discussion here). And someone wrote something here that we can dig through when we run into this problem.

(2) Performance---it seems like the SQLAlchemy code is single-threaded and synchronous? We might want to at least have two separate databases to make sure that the nodes get written, since they are more important---the values are just a cache.

For now, since we are beelining towards MVP, we can just use the existing implementation in the PR (using SQLAlchemy).
In the (near) future, to make sure we can detect when things go south, we should have:
[ ] good performance testing and tracking (also described in issue #37)
[ ] tests for more import scenarios (so that we can at least know exactly how pickling breaks)
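
If we do end up wanting the finer control from (1), SQLAlchemy's PickleType accepts a custom pickler, so swapping in cloudpickle could be as small as the following sketch (reusing the NodeValueORM name from the issues above):

import cloudpickle
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base
from sqlalchemy.types import PickleType

Base = declarative_base()

class NodeValueORM(Base):
    __tablename__ = "node_value"
    id = Column(Integer, primary_key=True)
    value = Column(PickleType(pickler=cloudpickle))   # any object with dumps/loads works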

@dorx what do you think?

And @1dividedby0, how do you feel about following up on the two TODOs once you are done with the tasks and REST endpoint sprint (if there is extra time left)?

Transformer feature requirements for identifying data sources

When generating a DataSourceNode, the transformer should identify when file accesses occur.

  • Not sure if there is a systematic way to do it
    • Option 1: Hard code popular libraries & methods, e.g., pandas's read_csv and boto3.
    • Option 2: Have the user instrument it and register with us.

And some details:

  • For the local files, use some path introspection to get the absolute path.

Discussion points: I think Option 1 is probably easier for a demo, but Option 2 is probably better UX and more scalable---I imagine that we need some user intervention around data sources to give users control over whether the data is synced to our cloud, or to point us to permissions to access the data, etc. @dorx, please comment! (I don't think I'll get to this until at least a week out.)
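
For a sense of what Option 1 involves, a sketch of a hard-coded check over the AST (the set of reader functions is illustrative, not exhaustive; ast.unparse needs Python 3.9+):

import ast

KNOWN_READERS = {"read_csv", "read_parquet", "read_sql"}

class DataSourceFinder(ast.NodeVisitor):
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in KNOWN_READERS:
            self.calls.append(ast.unparse(node))   # record the call as source text
        self.generic_visit(node)

finder = DataSourceFinder()
finder.visit(ast.parse('df = pd.read_csv("data/housing.csv")'))
print(finder.calls)   # the read_csv call, as source text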

Asynchronous Execution with Flask API

Right now we are assuming that the execution will take a couple of seconds (which is true for the stubs we have). But in the future, the execute API will need to return something immediately, even if the execution isn't finished.

This isn't necessary for MVP but is still very important to implement.
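
One common shape for this (illustrative only): return 202 Accepted with an execution id and run the work in the background; a real deployment would likely use a proper task queue rather than a thread.

import threading
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
STATUS = {}   # execution_id -> "running" | "done"

def run_execution(execution_id, artifact_id):
    # ... call the Executor here ...
    STATUS[execution_id] = "done"

@app.route("/execute", methods=["POST"])
def execute():
    artifact_id = request.args["artifact_id"]
    execution_id = str(uuid.uuid4())
    STATUS[execution_id] = "running"
    threading.Thread(target=run_execution, args=(execution_id, artifact_id)).start()
    return jsonify({"execution_id": execution_id}), 202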

Refactor helper methods in `Graph` (graph.py) to `GraphReader` (graph_reader.py)

Per discussion with @dorx, Graph should stay more or less just a data structure. Having GraphReader separated out allows for better abstraction (especially when we start having a GraphWriter).

This would require changes to downstream users of Graph---currently re-exec---so it made the most sense to have @1dividedby0 work on it. Dhruv, let us know if this is not clear or if you have pushback on the design!

Make `find_all_artifacts_derived_from_data_source` search through ALL graphs

Currently the method (in LineaDB) only searches within a specific graph, but in the future we want it to search through all graphs. This can be done in the following two ways:

  1. If we want it to work without a specific graph, we would need to be able to query the DirectedEdges. This can be done by adding a DirectedEdge table to the SQLAlchemy ORM. But that would also break from what we had decided for edges, since we're currently reconstructing edges just based on the nodes rather than querying a DB.
  2. Another way to implement this would be to go through all of the artifacts and check if their ancestors contain the DataSourceNode. This can be done by using get_ancestors_from_node (which is currently used in get_graph_from_artifact_id). But this may be slow, O(A*N), where A is the number of artifacts and N is the number of nodes; the worst case is O(N^2), where every node is an artifact and we traverse the entire node database for every node.

remove pickling of arrays and dictionaries and flatten out DB models

Right now we use PickleType as the column type for columns that hold arrays or dictionaries (e.g. arguments column in CallNodeORM). In the future we want to flatten our DB models out, so this is just a temporary way of handling arrays and dicts.

Technically there is an SQLAlchemy ARRAY type, but it is only supported on some backends (notably PostgreSQL), and it didn't work for me when I tried to use it.
