
linealabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.

Home Page: https://lineapy.org

License: Apache License 2.0

Python 45.87% Dockerfile 0.02% Jupyter Notebook 53.50% Makefile 0.11% Shell 0.01% Jinja 0.47% Mako 0.01%

lineapy's People

Contributors

1dividedby0, aayan636, andycui97, becca-miller, dependabot[bot], dorx, edwardlee4948, hogepodge, joshpoll, lazargugleta, lineainfra, lionsardesai, loganloganlogan, lorddarkula, maksimbr, marov, mingjerli, moustafa-a, pd-t, saulshanabrook, yifanwu, yoonspark


lineapy's Issues

DB refactoring

Right now, Library objects are stored directly by the SessionContext and ImportNode models (currently being pickled). Per @yifanwu's comments, we should instead create a separate ORM model for Library and reference it with LineaID.
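
A minimal sketch of what that might look like (assuming SQLAlchemy 1.4+ declarative models; the class and column names below are placeholders, not the actual lineapy schema):

from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class LibraryORM(Base):
    __tablename__ = "library"
    id = Column(String, primary_key=True)   # LineaID
    name = Column(String, nullable=False)
    version = Column(String)
    path = Column(String)

class ImportNodeORM(Base):
    __tablename__ = "import_node"
    id = Column(String, primary_key=True)                   # LineaID
    library_id = Column(String, ForeignKey("library.id"))   # reference, not a pickle
    library = relationship(LibraryORM)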

Supporting use cases beyond cli + Jupyter NB/Lab

Currently, the instrumentation relies on having control over the execution of the Python script---this means, right now, either the CLI or the kernel.

From this official post, it seems that AWS's EMR allows for custom kernels. However, we should test it ourselves to see how reliable it is.

My cursory search didn't reveal any Databricks support for custom kernels. They do have some exploratory work supporting JupyterLab, but even then they might use their own custom kernel.

These questions are important to understand for:

  • Short term: locating design partners
  • Long term: understanding our technical limitations and whether we need to take action to overcome them.

These all require approximately half a day's work.

Put linea cli into PATH, to support `lineapy my_python_file.py`

We are supporting the use case of lineapy my_python_file.py

To do so, I've been looking to Flask as a reference---Flask ships a flask run command. It uses click, but in the basic click tutorials (https://github.com/pallets/click) you still execute a Python file directly, e.g., linea-cli.py my_python_file.py. I would like us to figure out how to do something similar to flask run. I dug around https://github.com/pallets/flask/blob/main/src/flask/cli.py but didn't quite figure out how they did it. Maybe I'm just missing something obvious?
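
For reference, the usual way flask run-style commands get onto PATH is a console_scripts entry point wired to a click command. A minimal sketch (the module path, command name, and option handling here are assumptions, not the actual lineapy CLI):

# lineapy/cli.py (hypothetical location)
import click

@click.command()
@click.argument("file_name", type=click.Path(exists=True))
def linea_cli(file_name):
    """Run FILE_NAME under linea instrumentation."""
    click.echo(f"tracing {file_name}")
    # ... transform and execute the file here ...

# setup.py would then expose it on PATH, similar to how Flask exposes `flask`:
#   entry_points={"console_scripts": ["lineapy = lineapy.cli:linea_cli"]}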

Another benefit of lineapy my_python_file.py is that it introduces a layer of abstraction (in case we change the file name etc.)

Note that this needs Yifan's current work on transformers to be merged into main.

Future re-exec items

There are a few re-exec items that we haven't gotten to yet (and similarly for the database and the transformer, since the node types are not defined yet):

  • ClassNode
class Counter:
    def __init__(self):
        self.counter = 0

    def inc(self):
        self.counter += 1

c = Counter()
c.inc()
  • try/except and raise

@dorx and I discussed this, and these items can be deferred until we have something working end to end.

Re-exec involving external data

Add tests to include loading a csv to a pandas dataframe.

To make the test pass, we will also need another node type (in types.py) for the data source (we can start with CSV, but note that we also need to support S3, databases, etc.).
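
A rough sketch of what such a test could look like (scaffold only; the lineapy tracing/re-exec entry points are placeholders until the DataSourceNode work lands):

# test_csv_reexec.py
import pandas as pd

CODE = """
import pandas as pd
df = pd.read_csv("housing.csv")
rows = len(df)
"""

def test_reexec_with_csv(tmp_path):
    csv_path = tmp_path / "housing.csv"
    pd.DataFrame({"price": [1, 2, 3]}).to_csv(csv_path, index=False)
    # Trace CODE, then re-execute the captured graph and compare results, e.g.:
    #   graph = trace(CODE, working_directory=tmp_path)   # placeholder API
    #   assert execute_program(graph)["rows"] == 3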

@1dividedby0 Please come up with an initial plan and have @dorx and me sign off. This should be fun!

Future: maybe find ways to avoid using a custom Kernel and command line tool

@dorx and I explored ways to instrument the code without (1) instrumenting the kernel for Jupyter/IPython, and (2) a linea command-line tool for Python scripts.

We thought about something like monkey patching, as below:

from IPython import get_ipython
shell = get_ipython()
shell.do_execute = lambda *args: print("hello!")
a = 1

Or instrumenting it at the JavaScript level, but that seems really janky ¯_(ツ)_/¯. So we are just going to go with the kernel and command-line tool for now.

We need to look into whether instrumented kernels are compatible with Databricks---@dorx, could you please look into it?

Function of edges in the graph

If I have:

line_1: a = 0
line_2: b = a
line_3: a = 2

I could have edges line_1 -> line_2, line_2 -> line_3.
OR I could have line_1 -> line_2, line_1 -> line_3.
OR I could just have line_1 -> line_2.

In the first example, it doesn't matter what order we put the nodes in when we create the graph, because the program will always be executed in the desired order.

In the second example, we have the reassurance that line_1 (a = 0) will always execute before line_2 and line_3. But that can be problematic if line_3 executes before line_2.

In the third example (which is what we have right now), we are just ensuring minimum functionality, i.e., a has to be initialized before we reference it with b = a. In that case, we are completely reliant on ordering the nodes properly when we pass them into the graph constructor.

Which way do we want to go?

Support asynchronous evaluation (for IPython 7+)

Currently, everything in lineapy is synchronous. However, the Python world is changing (asyncio is now a built-in library). We should consider async support at some point as good eng practice (along with multi-threading).

This would otherwise NOT be on the critical path, if not for the fact that newer versions of JupyterLab expect the cell execution functions to be async. This blog post explains their rationale in more detail.

We should investigate the scope to which this applies---our design partners might still be on older versions of Jupyter.
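
For context, a minimal sketch of what an async execution hook could look like on top of ipykernel (assuming ipykernel >= 6, where do_execute may be a coroutine; the class name is a placeholder, not lineapy's kernel):

from ipykernel.ipkernel import IPythonKernel

class AsyncLineaKernel(IPythonKernel):
    async def do_execute(self, code, silent, store_history=True,
                         user_expressions=None, allow_stdin=False, **kwargs):
        # ... run lineapy's tracing hooks here (possibly awaiting I/O) ...
        return await super().do_execute(
            code, silent, store_history=store_history,
            user_expressions=user_expressions, allow_stdin=allow_stdin, **kwargs)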

MVP: a runtime version of lineapy

Given that LineaDB is going to require some iteration, the fastest path towards something usable is to keep everything in memory.

Plus, having the software decoupled from the database (when everything happens to be in memory) would be a performance enhancement.

To close this issue, we need the following minimum sets of features:

  • having the kernel & scripting version running
  • the transformer tracking calls
  • calls creating relevant nodes that we can use for different services like slicing, etc. (we should probably lift the slicer out of the DB class method at some point for this)

Make data persistent

As of the time this issue was created, the database is a transient SQLite in-memory store.

We should change this to a more persistent store (that does not get overwritten), so we can actually start persisting values across usage sessions.

This should be a quick fix: either change or override the default here:

database_uri: str = "sqlite:///:memory:"
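
For example, pointing it at a file-backed SQLite database (the file name here is arbitrary):

database_uri: str = "sqlite:///lineapy.db"   # persists across sessions, unlike :memory: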

Create a set of Node, Edge, and SessionContext objects for Housing Price example

Current API proposals:

from __future__ import annotations

from enum import Enum
from typing import Any, Optional, Tuple
import datetime
import uuid

class Node:
    name: str
    uuid: str    # populated on creation by uuid.uuid4()
    value: Any   # raw value of the node
    code: str
    session_id: str    # refers to SessionContext.uuid
    context: NodeContext

class NodeContext:
    line_number: int
    columns: Tuple[int, int]
    execution_time: datetime.datetime
    # TODO loop, conditional context

class DirectedEdge:
    source_node_id: str    # refers to Node.uuid
    sink_node_id: str      # refers to Node.uuid

class SessionContext:
    uuid: str    # populated on creation by uuid.uuid4()
    session_name: str    # obtained from name in with tracking(name=...):
    file_name: Optional[str]
    user_name: Optional[str]
    environment_type: SessionType
    creation_time: datetime.datetime
    hardware_spec: Optional[HardwareSpec]

class SessionType(Enum):
    JUPYTER = 1
    SCRIPT = 2

class HardwareSpec:
    # TODO
    ...

Create Executor API

  • Create a Flask app
  • Add Execute endpoint with query params "artifact_id" and optional param "version" (integer)
    • Execute endpoint should
      • Increment the version if "version" param is not specified
      • Create a new row in Execution table with version number and artifact id
      • Grab artifact from DB
      • Call get_graph_from_artifact_id
      • Call execute_program on the Graph object
      • Run through Graph object nodes and write their values to NodeValueORM with new version (including the value of the Artifact Node itself)
      • Create relationships between Execution row and NodeValue objects
      • Return the new Artifact JSON with its new NodeValue
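
A rough sketch of that endpoint shape (only get_graph_from_artifact_id and execute_program come from the notes above; the stubbed helpers and JSON shape are placeholders):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Stubs standing in for the LineaDB / Executor calls named above.
def next_version_for(artifact_id): return 1
def record_execution(artifact_id, version): pass
def get_graph_from_artifact_id(artifact_id): return None
def execute_program(graph): pass
def write_node_values(graph, version): return {}

@app.route("/execute")
def execute():
    artifact_id = request.args["artifact_id"]
    version = request.args.get("version", type=int)
    if version is None:
        version = next_version_for(artifact_id)   # increment when not specified
    record_execution(artifact_id, version)        # new row in the Execution table
    graph = get_graph_from_artifact_id(artifact_id)
    execute_program(graph)
    node_values = write_node_values(graph, version)   # persist NodeValueORM rows
    return jsonify({"artifact_id": artifact_id,
                    "version": version,
                    "node_values": node_values})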

Supporting chart artifacts

Currently, we just store the values of the nodes. However, some (popular) visualization libraries make use of side effects to render the visualization, so the variable the user is dealing with is often not the value of the chart.

Let's take matplotlib as an example, using a snippet from their gallery:

import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()

Which variable should we tell the Linea user to publish?

In order to support the DataAssetManager-based logic (saving some variable value), we would need to look into the return values of fig.savefig and plt.show to see which is the easiest for us to use (I dug around for 2 min, but it seems like it would take longer, e.g., the show function traces to a few overloads that we'd need to read through to understand).

Matplotlib is known to be tricky to deal with (global variables everywhere, exemplified by the use of plt). Vega-Lite (through Altair) is, I think, much better, but they still do not have a commonly used function that just returns the image binary---which is pretty unhelpful for the "normal" use cases.

Most visualization libraries do, however, offer very easy support for writing to an image file. For example, in the matplotlib case we have fig.savefig("test.png"), and for Altair we have the example below---notice the different file formats and their implications (the JSON spec we can easily render in our UI via JS, but it's less portable than a PDF/PNG, while the latter requires additional libraries, i.e., altair_saver).

import altair as alt
from vega_datasets import data

chart = alt.Chart(data.cars.url).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N'
)

chart.save('chart.json')
chart.save('chart.pdf')

So maybe instead of trying to dynamically figure out how to work with each library, we just directly intercept at the file level? I think both the pro and the con is that we rely on the user to figure out how to save, which makes Linea less magical but also means less work for us (there would still be more work on the DataAssetManager to also handle files rather than just in-memory variables).

My vote is to let the user give us the file containing the visualization. @dorx, what do you think?

If you are also on board, then we should think about the implications for the lineapy.publish API.

Support function definitions modifying globals

For example:

import math
a = 0
def my_function():
    a = math.factorial(5)
my_function()

The global variable a will not be modified unless global a is declared within the my_function() definition.
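
For reference, the variant that does mutate the global:

import math

a = 0

def my_function():
    global a              # without this, `a = ...` binds a new local variable
    a = math.factorial(5)

my_function()
print(a)  # 120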

eng setup

Let's add

  • requirements.txt & dev-requirements.txt
  • a readme
    • installation w/ venv etc.
    • descriptions of how to run the black and mypy checks.

Assigning @dorx since it's already running on your machine. Happy to take a stab at this too if you are busy.

Make `execute_program` evaluate nodes at a time (as opposed to a whole graph)

Currently, execute_program takes in a whole graph. To support the notebook use case, which is an ongoing session, we need to add an execute_node method that executes a list of nodes at a time. Importantly, the next execute_node call will share the same current program state (all the variables, function definitions, and imported modules). You can assume that these nodes are ordered properly.

As part of this refactor, you might need to refactor and reuse the walk method, since it is already evaluating one node at a time. You might also need to lift some of the function state in walk into the class instance state (e.g., scoped_locals). You can assume that each Executor instance is only associated with one session (so two scripts would correspond to two Executor instances).
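
A rough sketch of the shape this refactor could take (scoped_locals and walk come from the notes above; everything else is illustrative, not the real Executor):

class Executor:
    def __init__(self):
        # one Executor per session: shared variables, functions, imported modules
        self.scoped_locals = {}

    def execute_node(self, nodes):
        """Execute an ordered list of nodes against the shared session state."""
        for node in nodes:
            self.walk(node)

    def walk(self, node):
        # placeholder: evaluate a single node against self.scoped_locals
        exec(node.code, self.scoped_locals)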

I think it would be nice for @1dividedby0 to work on this given that he is the most familiar with the executor, or alternatively I can work on it as a forcing function for Dhruv's eng handoff. I have a slight preference to stay on the transformer/tracer side of things for less context switching.

cc @dorx for another take.

Cloud sync for local data

This would be a simple wrapper using S3 to upload a local file accessed during a run. This will help with re-execution.

However, we should probably defer this until we have a remote server set up, which won't be for a few weeks.
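
When we do get to it, the wrapper could be as small as a boto3 upload_file call (the bucket name and key scheme below are assumptions):

import os
import boto3

def sync_local_file(path: str, bucket: str = "linea-user-data") -> str:
    """Upload a local file accessed during a run and return its S3 URI."""
    key = os.path.basename(path)
    boto3.client("s3").upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"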

[Front-end] Tasks page

For documentation purposes only, since the code will be committed to linea-server.

This is the "Tasks" page from the Figma prototype. We need to make the following modifications from the Figma page:

  • Ignore the four boxes at the top. Have a single box that shows the number of active Tasks.

Future enhancements to `DataSourceNode`

We decided to just support local file systems for recognizing data sources.

Note that this is missing a few critical items:

This is similar to issue #20, low pri.

Fix the broken lineapy kernel installation

The kernel currently is not added to Jupyter's list of kernels with pip install -e .

It is, however, added to home/ubuntu/.local/share/jupyter/kernels/kernel, so I'm not sure what's going on. Need someone to help debug.

More proper way to use `NodeTransformer` in the iPython Kernel

Per a chat with Stephen (author of nbsafety), we should probably piggyback on the existing ast_transformers in the IPython kernel. I understood this pattern from:

old = _ipython().ast_transformers
_ipython().ast_transformers = old + transformers

I haven't had time to explore what the tradeoffs/implications are; just putting this down for future improvements.
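
For reference, a sketch of the registration pattern (InteractiveShell.ast_transformers is a list of ast.NodeTransformer instances applied to every cell before execution; the transformer below is a trivial stand-in, not lineapy's NodeTransformer):

import ast
from IPython import get_ipython

class NoopTransformer(ast.NodeTransformer):
    def visit_Module(self, node):
        # lineapy would rewrite/instrument the tree here
        self.generic_visit(node)
        return node

shell = get_ipython()
if shell is not None:
    shell.ast_transformers.append(NoopTransformer())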

Fix the caching warning related to the usage of `TypeDecorator` for SQLAlchemy

This warning is showing up for all SQLAlchemy node types that use TypeDecorator, which include AttributesDict and LineaID.

  /home/ubuntu/linea-dev/lineapy/lineapy/db/db.py:232: SAWarning: TypeDecorator LineaID(length=16) will not produce a cache key because the ``cache_ok`` flag is not set to True.  Set this flag to True if this type object's state is safe to use in a cache key, or False to disable this warning.
    node = self.session.query(NodeORM).filter(NodeORM.id == linea_id).one()
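
The fix the warning points at is declaring cache_ok on the custom types; a sketch (the underlying impl for LineaID is an assumption about how it is currently defined):

from sqlalchemy import String
from sqlalchemy.types import TypeDecorator

class LineaID(TypeDecorator):
    impl = String(16)
    cache_ok = True   # safe if the type holds no per-instance state; silences the SAWarning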

Executor to populate the `Library` `version` and `path` if not already present

Due to the current way the tracer interacts with the Executor, the first time the Library object is created, the tracer does not have access to the values---the executor does, so it should introspect the version and path and populate them if not already filled.

I think this should happen right after the following if statement

elif node.node_type == NodeType.ImportNode:
    node = cast(ImportNode, node)
    node.module = importlib.import_module(node.library.name)

in executor.py

Since the classes are all passed by reference, the Tracer can just pass the referenced Library to the database for serialization afterwards.
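
A sketch of the introspection step (the Library attribute names come from this issue; the version/path lookup strategy is an assumption):

import importlib

def populate_library_metadata(node):
    """Fill in Library.version/path from the imported module if missing."""
    module = importlib.import_module(node.library.name)
    node.module = module
    if not node.library.version:
        node.library.version = getattr(module, "__version__", "")
    if not node.library.path:
        node.library.path = getattr(module, "__file__", "") or ""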

@1dividedby0 please let me know what you think. If you might be up to it, this would be a great issue to bundle with #68 for a PR.

clarifying the semantics of variable alias in node definitions (and capture)

@1dividedby0 asked a very good question about what node we should use to represent a = b. Intuitively, we can think of this as an "identity function", but then the question is what the semantics of this identity function are. For values like numbers and strings, it's basically a "LiteralAssignNode", in that the target variable just gets a snapshot of the literal value. See below:

>>> a = 1
>>> b = a
>>> a
1
>>> a = 2
>>> b
1
>>> c = "hello"
>>> d = c
>>> c += "a"
>>> d
'hello'

When it's a mutable object, e.g., a list, then the "identity function" is really just an alias.

>>> x = [1,2,3]
>>> y = x
>>> x.append(4)
>>> y
[1, 2, 3, 4]

Given that we don't currently have such an "alias" representation, we need to create ObjectAliasNode (@dorx and I discussed the alternative where we just add an "alias" field to some other nodes, but that's likely going to be annoying for the relational DB). Note that this alias node should be used for slicing analysis as well---anything that touches x later will now be in y's dependencies. Note also that this shifts the burden onto the transformer API to differentiate between the two.

Add support for "headless" literals and variables.

Currently, our nodes do not support the following two cases, which really do not have any effect on the state of the program:

a = 1
a
1

@dorx thinks that we should accommodate these cases, which we could do by extending LiteralAssignNode and VariableAliasNode, making the assigned_variable_name field optional.

This would also mean, I think, that the following call will also have an edge between the call and the -11 node.

abs(-11)

This refactor should be pretty easy, but off the critical path for now.

Create tests to measure execution overhead (and savings) by `lineapy`

It's always good to have the ability to keep track of the overhead of lineapy so we can (1) assure our users, and (2) know when we need to start optimizing.

This is low pri for now. This would be a good onboarding item because it's pretty well isolated and involves end to end usage and instrumentation of the lineapy codebase.
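
A rough shape for such a test, comparing plain python against the lineapy CLI on the same script (the script path and the overhead bound are placeholders to be tuned):

import subprocess
import sys
import time

def timed_run(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def test_overhead(script="tests/housing.py"):   # placeholder script
    baseline = timed_run([sys.executable, script])
    traced = timed_run(["lineapy", script])
    assert traced < 2 * baseline + 1.0   # crude bound; revisit once we have real numbers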

Serialization mechanism

PR #51 used SQLAlchemy to pickle cached node values.

Inspecting where pickle is used and imported in the code, we can confirm that it is simply the standard Python pickle implementation.

There are two main things to look out for in the future:

(1) If we want better control over the pickling process. For instance, we might want to use cloudpickle instead for complex objects---see their GitHub readme for a discussion (I think Ray also uses cloudpickle per the discussion here). And someone wrote something here that we can dig through when we run into this problem.

(2) Performance---it seems like the SQLAlchemy code is single-threaded and synchronous? We might want to at least have two separate databases to make sure that the nodes get written, since they are more important---the values are just a cache.

For now, since we are beelining towards MVP, we can just use the existing implementation in the PR (using SQLAlchemy).
In the (near) future, to make sure we can detect when things go south, we should have:
[ ] good performance testing and tracking (also described in issue #37)
[ ] tests for more import scenarios (so that we can at least know exactly how pickling breaks)
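
If we do end up wanting the finer control from (1), SQLAlchemy's PickleType accepts a custom pickler, so swapping in cloudpickle could be as small as the following sketch (reusing the NodeValueORM name from the issues above):

import cloudpickle
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base
from sqlalchemy.types import PickleType

Base = declarative_base()

class NodeValueORM(Base):
    __tablename__ = "node_value"
    id = Column(Integer, primary_key=True)
    value = Column(PickleType(pickler=cloudpickle))   # any object with dumps/loads works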

@dorx what do you think?

And @1dividedby0, how do you feel about following up on the two TODOs once you are done with the tasks and REST endpoint sprint (if there is extra time left)?

Transformer feature requirements for identifying data sources

When generating a DataSourceNode, the transformer should identify when file accesses occur.

  • Not sure if there is a systematic way to do it
    • Option 1: Hard code popular libraries & methods, e.g., pandas's read_csv and boto3.
    • Option 2: Have the user instrument it and register with us.

And some details:

  • For the local files, use some path introspection to get the absolute path.

Discussion points: I think Option 1 is probably easier for a demo, but Option 2 is probably better UX and more scalable---I imagine that we need some user intervention around data sources to give users control over whether the data is synced to our cloud, or to point us to permissions to access the data, etc. @dorx, please comment! (I don't think I'll get to this until at least a week out.)
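
For a sense of what Option 1 involves, a sketch of a hard-coded check over the AST (the set of reader functions is illustrative, not exhaustive; ast.unparse needs Python 3.9+):

import ast

KNOWN_READERS = {"read_csv", "read_parquet", "read_sql"}

class DataSourceFinder(ast.NodeVisitor):
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in KNOWN_READERS:
            self.calls.append(ast.unparse(node))   # record the call as source text
        self.generic_visit(node)

finder = DataSourceFinder()
finder.visit(ast.parse('df = pd.read_csv("data/housing.csv")'))
print(finder.calls)   # the read_csv call, as source text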

Asynchronous Execution with Flask API

Right now we are assuming that the execution will take a couple of seconds (which is true for the stubs we have). But in the future, the execute API will need to return something immediately, even if the execution isn't finished.

This isn't necessary for MVP but is still very important to implement.
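
One common shape for this (illustrative only): return 202 Accepted with an execution id and run the work in the background; a real deployment would likely use a proper task queue rather than a thread.

import threading
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
STATUS = {}   # execution_id -> "running" | "done"

def run_execution(execution_id, artifact_id):
    # ... call the Executor here ...
    STATUS[execution_id] = "done"

@app.route("/execute", methods=["POST"])
def execute():
    artifact_id = request.args["artifact_id"]
    execution_id = str(uuid.uuid4())
    STATUS[execution_id] = "running"
    threading.Thread(target=run_execution, args=(execution_id, artifact_id)).start()
    return jsonify({"execution_id": execution_id}), 202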

Refactor helper methods in `Graph` (graph.py) to `GraphReader` (graph_reader.py)

Per discussion with @dorx, Graph should stay more or less just a data structure. Having GraphReader separated out allows for better abstraction (especially when we start having a GraphWriter).

This would require changes to downstream users of Graph---currently re-exec---so it made the most sense to have @1dividedby0 work on it. Dhruv, let us know if this is not clear or if you have pushback on the design!

Make `find_all_artifacts_derived_from_data_source` search through ALL graphs

Currently the method (in LineaDB) only searches within a specific graph, but in the future we want it to search through all graphs. This can be done in the following two ways:

  1. If we want it to work without a specific graph, we would need to be able to query the DirectedEdges. This can be done by adding a DirectedEdge table to the SQLAlchemy ORM. But that would also break from what we had decided for edges, since we're currently reconstructing edges just based on the nodes rather than querying a DB.
  2. Another way to implement this would be to go through all of the artifacts and check if their ancestors contain the DataSourceNode. This can be done by using get_ancestors_from_node (which is currently used in get_graph_from_artifact_id). But this may be slow, O(A*N), where A is the number of artifacts and N is the number of nodes; the worst case is O(N^2), where every node is an artifact and we traverse the entire node database for every node.

remove pickling of arrays and dictionaries and flatten out DB models

Right now we use PickleType as the column type for columns that hold arrays or dictionaries (e.g. arguments column in CallNodeORM). In the future we want to flatten our DB models out, so this is just a temporary way of handling arrays and dicts.

Technically there is an SQLAlchemy ARRAY type, but it is only supported on some backends (notably PostgreSQL), and it didn't work for me when I tried to use it.
