exorcist's People

Contributors

dwhswenson, richardjgowers

exorcist's Issues

TaskStatusDB: Set up an empty database

Empty database includes two tables:

  • tasks with columns:
    • taskid: str
    • status: int
    • last_modified: datetime
    • tries: int
  • dependencies with columns:
    • to: str (FK on tasks.taskid)
    • from: str (FK on tasks.taskid)

Main user-facing methods to implement:

  • __init__(self, engine: sqla.Engine)
  • from_filename(cls, filename: os.PathLike)
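
A minimal sketch of the empty schema described above. To stay dependency-free it uses the standard-library sqlite3 module instead of a SQLAlchemy engine, so it illustrates the table layout only, not the planned __init__/from_filename API. The helper name create_empty_db, the SQL column types, and the choice of taskid as primary key are assumptions.

```python
import sqlite3

# "from" and "to" are SQL keywords, so they must be quoted as identifiers.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    taskid TEXT PRIMARY KEY,
    status INTEGER NOT NULL,
    last_modified TIMESTAMP,
    tries INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS dependencies (
    "to" TEXT NOT NULL REFERENCES tasks(taskid),
    "from" TEXT NOT NULL REFERENCES tasks(taskid)
);
"""


def create_empty_db(conn: sqlite3.Connection) -> None:
    """Create the two empty tables (hypothetical helper)."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_empty_db(conn)

# Inspect what was created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
task_columns = [row[1] for row in conn.execute("PRAGMA table_info(tasks)")]
```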

DISCUSS: TaskStatusDB: Switch to storing status name instead of value?

In the original development, I stored the integer value associated with a given status in the TaskStatus enum. The question here is whether to use the string name from the enum instead. At this point I don't have a strong preference for one over the other, although I'm leaning a bit toward using the string name. Here are the advantages I see to each choice:

Why to use string name:

  • Allows us to use sqla.Enum as column type, which may do better validation of values (haven't checked this, but we're certainly not currently preventing the DB from storing an int with no meaning from the enum)
  • More obvious output if a user works directly with the DB (e.g., loading the tasks table into a pandas DataFrame): a meaningful string instead of a meaningless int
  • (I think) it will allow us to immediately get the enum object back, instead of converting the int value into the enum object being our responsibility. This could simplify future code based on an existing task database (dashboards, consistency checks, etc.)

Why to use int value:

  • Possible performance improvements (space and speed) over storing CHAR/VARCHAR.
  • If sqla.Enum is internally using CHAR, there might be migration issues if a new status is added to the enum (different CHAR length might be required)
  • In the short term, I think we're more likely to change the name of a status than its numerical value. That would be breaking for existing DBs using different string names.
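
To make the trade-off concrete, here is a small sketch of the two round trips using only the standard-library enum module. The integer values assigned to the members are assumptions for illustration; only the member names BLOCKED, AVAILABLE, and COMPLETED appear in these issues.

```python
from enum import Enum


class TaskStatus(Enum):
    # Integer values are hypothetical.
    BLOCKED = 0
    AVAILABLE = 1
    COMPLETED = 2


# What would be written to the DB under each choice.
stored_value = TaskStatus.AVAILABLE.value  # int
stored_name = TaskStatus.AVAILABLE.name    # str

# Converting back to the enum member.
from_value = TaskStatus(stored_value)  # lookup by value
from_name = TaskStatus[stored_name]    # lookup by name

# An int with no meaning in the enum is only rejected when we convert it.
try:
    TaskStatus(99)
    invalid_int_rejected = False
except ValueError:
    invalid_int_rejected = True

# A name that no longer exists (e.g. after a rename) fails the same way.
try:
    TaskStatus["READY"]
    stale_name_rejected = False
except KeyError:
    stale_name_rejected = True
```

Either form converts back in one line; the difference lies in what breaks when the enum changes (a renamed member for names, a renumbered member for values).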

Worker: Method for selecting task to work on

The Worker needs a way to select which task it will run. There are a few options here; I think we should engineer things so that we can easily try alternatives, since I'm not sure what will best meet users' needs. A couple of options:

  • Priority in the task status DB. Get first available sorted by priority (on the SQL end). Should be fast, but reading doesn't block other readers, so there are potential concurrency pileup issues.
  • Make the decision in Python; could include some randomness to avoid concurrency issues (e.g., select weighted by priority). Will be slower between read and claim, but we already have safety on the claim to ensure that we're actually the only one to stake a claim to a task.
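
The second option could look roughly like the sketch below. The function name select_task and the (taskid, priority) input format are assumptions; priorities are assumed to be positive numbers, since random.choices requires a positive total weight.

```python
import random
from typing import Optional, Sequence, Tuple


def select_task(
    candidates: Sequence[Tuple[str, float]], rng: random.Random
) -> Optional[str]:
    """Pick one available taskid at random, weighted by priority.

    `candidates` holds (taskid, priority) pairs already read from the DB.
    Returns None when nothing is available.
    """
    if not candidates:
        return None
    taskids = [taskid for taskid, _ in candidates]
    weights = [priority for _, priority in candidates]
    return rng.choices(taskids, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
available = [("task-a", 5.0), ("task-b", 1.0), ("task-c", 1.0)]
choice = select_task(available, rng)
```

Because workers draw independently, two workers reading the same list are less likely to race for the same task than with a strict "first by priority" query; the claim step still has to reject the loser.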

TaskStatusDB: Methods to add a task/network of tasks to a database

Make it possible to add Tasks to the database.

  • add_task(self, task: Task, requirements: Iterable[Task]): add a single task to the database, along with edges to the tasks it depends on. A task with no dependencies should be added with status AVAILABLE; otherwise its status should be BLOCKED.
  • add_task_network(self, network: nx.DiGraph): add an entire graph (with Tasks as nodes) to the database. Initial status as with add_task.

These should probably use some internal methods, rather than having add_task_network call add_task for each task (which would require a separate transaction with the DB for each task). Adding tasks to the DB should be batched.
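
A rough sketch of batched insertion in a single transaction, using standard-library sqlite3 rather than SQLAlchemy. Several things here are assumptions: the network is passed as a dict mapping each taskid to the taskids it requires (standing in for nx.DiGraph), the status integers are made up, and a dependency row is read as "from" = prerequisite, "to" = dependent task.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Dict, Iterable

BLOCKED, AVAILABLE = 0, 1  # hypothetical integer values


def add_task_network(
    conn: sqlite3.Connection, network: Dict[str, Iterable[str]]
) -> None:
    """Insert all tasks and edges in one transaction."""
    now = datetime.now(timezone.utc).isoformat()
    reqs = {taskid: list(required) for taskid, required in network.items()}
    task_rows = [
        (taskid, BLOCKED if required else AVAILABLE, now, 0)
        for taskid, required in reqs.items()
    ]
    dep_rows = [
        (required_id, taskid)
        for taskid, required in reqs.items()
        for required_id in required
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO tasks (taskid, status, last_modified, tries) "
            "VALUES (?, ?, ?, ?)",
            task_rows,
        )
        conn.executemany(
            'INSERT INTO dependencies ("from", "to") VALUES (?, ?)', dep_rows
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (taskid TEXT PRIMARY KEY, status INTEGER,
                        last_modified TIMESTAMP, tries INTEGER);
    CREATE TABLE dependencies ("to" TEXT, "from" TEXT);
    """
)
add_task_network(conn, {"a": [], "b": ["a"], "c": ["a", "b"]})
statuses = dict(conn.execute("SELECT taskid, status FROM tasks"))
n_edges = conn.execute("SELECT COUNT(*) FROM dependencies").fetchone()[0]
```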

TaskStatusDB: Rebuild task network object

Go from the database to an nx.DiGraph of Tasks.

This isn't strictly necessary for minimal functionality, but has potential to be very useful for things like troubleshooting, debugging, and ensuring database consistency.

This should actually be done in 2 stages: going from the TaskStatusDB to a network of taskid strings, and then a second function that takes that taskid network and attaches TaskDetails from the TaskDetailsStore.
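
The two stages might be sketched as follows. This uses sqlite3 and a plain dict of sets in place of nx.DiGraph, and the function names load_taskid_network and attach_details are hypothetical. As an assumption, a dependency row is read as "from" = prerequisite, "to" = dependent task.

```python
import sqlite3
from typing import Any, Callable, Dict, Set, Tuple


def load_taskid_network(conn: sqlite3.Connection) -> Dict[str, Set[str]]:
    """Stage 1: map each taskid to the set of taskids it depends on."""
    network: Dict[str, Set[str]] = {
        taskid: set() for (taskid,) in conn.execute("SELECT taskid FROM tasks")
    }
    for prerequisite, dependent in conn.execute(
        'SELECT "from", "to" FROM dependencies'
    ):
        network[dependent].add(prerequisite)
    return network


def attach_details(
    network: Dict[str, Set[str]], load_task_details: Callable[[str], Any]
) -> Dict[str, Tuple[Any, Set[str]]]:
    """Stage 2: pair every taskid with its details from a details store."""
    return {
        taskid: (load_task_details(taskid), required)
        for taskid, required in network.items()
    }


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (taskid TEXT PRIMARY KEY, status INTEGER);
    CREATE TABLE dependencies ("to" TEXT, "from" TEXT);
    INSERT INTO tasks VALUES ('a', 1), ('b', 0);
    INSERT INTO dependencies ("from", "to") VALUES ('a', 'b');
    """
)
id_network = load_taskid_network(conn)
# A stand-in for TaskDetailsStore.load_task_details.
detailed = attach_details(id_network, lambda taskid: {"name": taskid.upper()})
```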

TaskStatusDB: Method to update task status

  • update_task_status(self, taskid, new_status, old_status)

A couple concerns/questions:

  • Do we need to pass the DB connection in here? This seems like it could be part of a more complicated sequence that we'd like to commit all at once.
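
One way to explore this question is a conditional update that takes the connection as an argument and leaves the commit to the caller. The sketch below uses sqlite3; the leading conn parameter, the boolean return value, and the status integers are assumptions, not the settled API.

```python
import sqlite3
from datetime import datetime, timezone

AVAILABLE, IN_PROGRESS = 1, 2  # hypothetical integer values


def update_task_status(
    conn: sqlite3.Connection, taskid: str, new_status: int, old_status: int
) -> bool:
    """Set new_status only if the task still has old_status.

    Does not commit, so the caller can group it with other statements.
    Returns True if exactly one row changed.
    """
    now = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "UPDATE tasks SET status = ?, last_modified = ? "
        "WHERE taskid = ? AND status = ?",
        (new_status, now, taskid, old_status),
    )
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (taskid TEXT PRIMARY KEY, status INTEGER, "
    "last_modified TIMESTAMP, tries INTEGER)"
)
conn.execute("INSERT INTO tasks VALUES ('a', ?, NULL, 0)", (AVAILABLE,))

first_claim = update_task_status(conn, "a", IN_PROGRESS, AVAILABLE)
second_claim = update_task_status(conn, "a", IN_PROGRESS, AVAILABLE)
conn.commit()
```

Passing old_status makes the update a compare-and-swap: the second claim changes no rows because the status is no longer AVAILABLE.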

ResultStore

This object stores the final results, and is specific to the client application. The retry number is passed to it when storing, and it is up to the client application to ensure that each combination of result object and retry number maps to a unique location in its storage.

  • is_failure_result(result: ResultObject) -> bool
  • store_result(result: ResultObject, retry: int)
  • load_result(label: str, retry: int): this isn't strictly necessary for Exorcist, but may be useful (and will be needed in the client code anyway)
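
As an illustration only, a toy in-memory store that satisfies these three methods might look like this. The ExampleResult class and its label/succeeded/value fields are hypothetical; a real client would define its own result object and storage.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class ExampleResult:
    """Hypothetical result object for the sketch."""
    label: str
    succeeded: bool
    value: Any = None


class InMemoryResultStore:
    """Keeps results in a dict keyed by (label, retry)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, int], ExampleResult] = {}

    def is_failure_result(self, result: ExampleResult) -> bool:
        return not result.succeeded

    def store_result(self, result: ExampleResult, retry: int) -> None:
        key = (result.label, retry)
        if key in self._results:
            # Each (result, retry) pair must have its own location.
            raise ValueError(f"result already stored for {key}")
        self._results[key] = result

    def load_result(self, label: str, retry: int) -> ExampleResult:
        return self._results[(label, retry)]


store = InMemoryResultStore()
failed = ExampleResult(label="task-a", succeeded=False)
passed = ExampleResult(label="task-a", succeeded=True, value=42)
store.store_result(failed, retry=0)
store.store_result(passed, retry=1)
loaded = store.load_result("task-a", retry=1)
```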

TaskStatusDB: Update DAG after task completion

  • mark_task_completed(self, taskid): This both marks the task as having status COMPLETED and also updates the dependencies table to mark tasks involving this one as completed, and finally marks any newly unblocked tasks as AVAILABLE.

This involves a decent bit of shuffling between SQL and Python, with a lot of writes. This is the area where we'll need to pay close attention to avoid inconsistency problems.
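
A sketch of doing this inside one transaction with sqlite3. It departs from the description above in one respect, stated plainly: instead of marking rows in the dependencies table, it derives "newly unblocked" from the statuses of each task's prerequisites, because the schema listed earlier has no column for that. The status integers, the leading conn parameter, and the edge direction ("from" = prerequisite, "to" = dependent) are also assumptions.

```python
import sqlite3

BLOCKED, AVAILABLE, COMPLETED = 0, 1, 2  # hypothetical integer values

UNBLOCK_SQL = """
UPDATE tasks SET status = :available
WHERE status = :blocked
  AND taskid IN (SELECT "to" FROM dependencies WHERE "from" = :taskid)
  AND NOT EXISTS (
      SELECT 1 FROM dependencies AS d
      JOIN tasks AS pre ON pre.taskid = d."from"
      WHERE d."to" = tasks.taskid AND pre.status != :completed
  )
"""


def mark_task_completed(conn: sqlite3.Connection, taskid: str) -> int:
    """Mark taskid COMPLETED and release dependents with no open prerequisites.

    Returns the number of tasks that became AVAILABLE.
    """
    params = {"taskid": taskid, "available": AVAILABLE,
              "blocked": BLOCKED, "completed": COMPLETED}
    with conn:  # both updates commit together or not at all
        conn.execute(
            "UPDATE tasks SET status = :completed WHERE taskid = :taskid", params
        )
        cursor = conn.execute(UNBLOCK_SQL, params)
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (taskid TEXT PRIMARY KEY, status INTEGER);
    CREATE TABLE dependencies ("to" TEXT, "from" TEXT);
    INSERT INTO tasks VALUES ('a', 1), ('b', 0), ('c', 0);
    INSERT INTO dependencies ("from", "to")
        VALUES ('a', 'b'), ('a', 'c'), ('b', 'c');
    """
)
released_after_a = mark_task_completed(conn, "a")
after_a = dict(conn.execute("SELECT taskid, status FROM tasks"))
released_after_b = mark_task_completed(conn, "b")
after_b = dict(conn.execute("SELECT taskid, status FROM tasks"))
```

In the demo, completing a releases b only, because c still waits on b; completing b then releases c.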

TaskDetailsStore

This object loads and saves details of how to run tasks. This is specific to the client application, but we define an API here that client implementations must at least match by duck typing.

In practice, our first usage will be as files on the filesystem.

  • load_task(self, taskid: str) -> Callable[[], Result]: Note that this returns a callable. All that the worker needs to do is call the function returned here.
  • store_task_details(self, taskid: str, task_details: TaskDetails): Store the task details. The nature of the TaskDetails object depends on the client application.
  • load_task_details(self, taskid: str) -> TaskDetails: This isn't strictly needed for the primary functionality, but will be useful for various tools for troubleshooting/introspection/debugging. (Is needed if we switch to the run_task model)
  • run_task(self, task_details: TaskDetails) -> Result: Run the actual task.
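
For illustration, a file-based store covering these four methods could be sketched as below. Everything concrete in it is an assumption: the class name JSONTaskDetailsStore, one JSON file per task, and task details that are a dict of two numbers to add.

```python
import json
import tempfile
from functools import partial
from pathlib import Path
from typing import Callable, Dict


class JSONTaskDetailsStore:
    """Stores each task's details as <directory>/<taskid>.json."""

    def __init__(self, directory) -> None:
        self.directory = Path(directory)

    def _path(self, taskid: str) -> Path:
        return self.directory / f"{taskid}.json"

    def store_task_details(self, taskid: str, task_details: Dict) -> None:
        self._path(taskid).write_text(json.dumps(task_details))

    def load_task_details(self, taskid: str) -> Dict:
        return json.loads(self._path(taskid).read_text())

    def run_task(self, task_details: Dict) -> int:
        # The "task" in this toy example is adding two numbers.
        return task_details["x"] + task_details["y"]

    def load_task(self, taskid: str) -> Callable[[], int]:
        # Bind the details now; the worker only has to call the result.
        return partial(self.run_task, self.load_task_details(taskid))


with tempfile.TemporaryDirectory() as tmp:
    store = JSONTaskDetailsStore(tmp)
    store.store_task_details("add-1", {"x": 2, "y": 3})
    task = store.load_task("add-1")
    details = store.load_task_details("add-1")
    result = task()
```

Here load_task is built from load_task_details and run_task, which shows how the callable-returning model and the run_task model can coexist.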

Example client code

Since TaskDetailsStore and ResultStore need to be subclassed (or duck-typed) by client code, we need to have a very simple example to show how to do this. This will also facilitate our testing, especially when doing integration tests between various units and moving toward end-to-end testing.
