hydro-project / hydroflow
Hydro's low-level dataflow runtime
Home Page: https://hydro.run/docs/hydroflow/
License: Apache License 2.0
Broken by/since fc48b71:

```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: Uncategorized, error: "failed to lookup address information: nodename nor servname provided, or not known" }', covid_tracing_dist/src/tracker.rs:117:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
To repro: the test below will echo what you type, as expected. But if you uncomment the line with `fold`, it will not.
```rust
pub async fn test_strata_with_stdin() {
    let reader = tokio::io::BufReader::new(tokio::io::stdin());
    let stdin_lines =
        tokio_stream::wrappers::LinesStream::new(tokio::io::AsyncBufReadExt::lines(reader));
    let mut hf = hydroflow_syntax! {
        recv_iter(vec![1,2,3])
            // -> fold(0, |a,b| a + 1)
            -> for_each(|x| println!("There are {} items", x));
        recv_stream(stdin_lines)
            -> map(|l: Result<std::string::String, std::io::Error>| l.unwrap())
            -> for_each(|s| println!("Echo: {:?}", s));
    };
    tokio::select! {
        _ = hf.run_async() => (),
        _ = tokio::time::sleep(std::time::Duration::from_secs(10)) => (),
    };
}
```
An interesting decision is whether we represent arithmetic as an infinite relation (pure, but hard) or as a custom extension to the language (not Datalog, but easy to implement). There are two main cases where arithmetic is used:
So I don't think there's a big immediate need for the infinite relations since the language extensions can support the above too without too much effort.
@davidchuyaya thoughts, do the above two cases cover your uses?
Rather than have the state API be aware of ticks and do clobbering, provide a simple map-esque wrapper class where the keys are ticks/epochs. Each tick advancement resets the value.
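A minimal sketch of what such a wrapper might look like (the names here are hypothetical, not an existing Hydroflow API): a cell that tags its value with the tick it was written in, and resets to the default whenever a later tick touches it.

```rust
// Hypothetical sketch of a tick-scoped state cell: the stored value is
// tagged with the tick (epoch) it was written in, and any access from a
// different tick discards the stale value and starts from `default`.
#[derive(Debug)]
pub struct TickCell<T: Default> {
    tick: u64,
    value: T,
}

impl<T: Default> TickCell<T> {
    pub fn new() -> Self {
        Self { tick: 0, value: T::default() }
    }

    /// Get a mutable reference to the value for `tick`, clobbering any
    /// value left over from an earlier tick.
    pub fn get_mut(&mut self, tick: u64) -> &mut T {
        if tick != self.tick {
            self.tick = tick;
            self.value = T::default();
        }
        &mut self.value
    }
}

fn main() {
    let mut counter: TickCell<u32> = TickCell::new();
    *counter.get_mut(0) += 1;
    *counter.get_mut(0) += 1;
    assert_eq!(*counter.get_mut(0), 2); // same tick: state accumulates
    assert_eq!(*counter.get_mut(1), 0); // new tick: state was reset
    println!("ok");
}
```

Operators would then never see a stale value from a previous tick, without the state API itself needing to know about clobbering.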
One way that might be cool to test scheduling code like this would be to introduce some kind of logging that records whenever an operator is scheduled, and then (via datadriven or some other means) compare or eyeball that against what is expected. In the past I've found that kind of thing helpful for testing behaviors whose expected results are hard to pre-define, but where you know an output is correct when you see it. (I think this test is fine for now, though.)
There's really no benefit to having users pick the indexes for these.
I found it inconvenient and unnecessarily "Rusty" to think about how to pass static config information (e.g. command line options) into flows with the right ownership. I think it would be nice to have a handy Hydroflow API where we register relevant static variables that are accessible read-only in flows.
A fancier version of this would be to have scoping of such things within the flow ... not even sure how to think about that. Global is OK by me for now.
This is an example of typical Rust gotchas that we can shield noobs from so they just focus on writing their pipelines. There are likely many others.
It should be very easy to wire up a YAML or JSON file, a Vec of internal data, a DB iterator.
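One possible shape for the read-only static registry, sketched here with std's OnceLock (the `Config` struct and `config()` accessor are hypothetical, not an existing Hydroflow API):

```rust
use std::sync::OnceLock;

// Hypothetical sketch: register static config once at startup; closures in
// flows then read it through a cheap accessor, with no ownership gymnastics.
#[derive(Debug)]
struct Config {
    addr: String,
    batch_size: usize,
}

static CONFIG: OnceLock<Config> = OnceLock::new();

fn config() -> &'static Config {
    CONFIG.get().expect("config not registered")
}

fn main() {
    // e.g. parsed from command-line options at startup:
    CONFIG
        .set(Config { addr: "127.0.0.1:8080".into(), batch_size: 64 })
        .unwrap();

    // Closures inside a flow can read the config without capturing anything:
    let filter = |n: &usize| *n < config().batch_size;
    assert!(filter(&10));
    assert!(!filter(&100));
    println!("ok");
}
```

Globals like this sidestep the ownership questions entirely; scoping within a flow would need something richer.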
-> map(|n| (n..=n+1))
vec![0] is a single-element vec literal.
However, if instead iterators were push-based
Requires generating nested joins, and we may want to investigate doing some lightweight query optimizations.
Example program: detecting triangles:
triangle(a, b, c) :- edge(a, b), edge(b, c), edge(c, a)
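The nested joins this rule expands into can be sketched in plain Rust (a hash join on the shared variables; this is illustrative, not Hydroflow code). Note that each triangle appears once per rotation of its vertices:

```rust
use std::collections::{HashMap, HashSet};

// Plain-Rust sketch of the nested joins for:
// triangle(a,b,c) :- edge(a,b), edge(b,c), edge(c,a).
fn triangles(edges: &[(u32, u32)]) -> Vec<(u32, u32, u32)> {
    // Index edges by source node for the join probes.
    let mut by_src: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(a, b) in edges {
        by_src.entry(a).or_default().push(b);
    }
    let edge_set: HashSet<(u32, u32)> = edges.iter().copied().collect();

    let mut out = Vec::new();
    for &(a, b) in edges {
        // join edge(a,b) with edge(b,c) on b ...
        if let Some(cs) = by_src.get(&b) {
            for &c in cs {
                // ... then with edge(c,a) on (c, a).
                if edge_set.contains(&(c, a)) {
                    out.push((a, b, c));
                }
            }
        }
    }
    out
}

fn main() {
    let edges = [(1, 2), (2, 3), (3, 1), (3, 4)];
    // One triangle, reported once per rotation:
    assert_eq!(triangles(&edges), vec![(1, 2, 3), (2, 3, 1), (3, 1, 2)]);
    println!("{:?}", triangles(&edges));
}
```

An optimizer could, e.g., pick the join order by estimated selectivity or dedup rotations, which is where the lightweight query optimization would come in.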
Are they meaningfully different? Can we make it easier to understand? Otherwise, document better.
for stdin, and for network input without an extra await
Right now, if you create an operator that gives you an input or output port you don't care about, you still have to attach it to something or else you get an unattached handoff error. It would be nice if there was a better way to do that than by constructing a sink that no-op drains the handoff.
For example, if we have max() or min(), a preceding handoff would only ever need to store a single max or min element, streaming. But currently we only have VecHandoffs, which store everything. So we can be more efficient. This will probably tie in to using lattice types in handoffs.
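As an illustration, a max()-specialized handoff could fold each incoming element into a single running maximum instead of buffering everything. This is a plain-Rust sketch, not the actual Handoff trait:

```rust
// Sketch of a "bounded" handoff specialized for max(): instead of buffering
// every element like a VecHandoff, it folds incoming items into a single
// running maximum, so the buffer never holds more than one value.
#[derive(Default)]
struct MaxHandoff<T: Ord + Default> {
    current: Option<T>,
}

impl<T: Ord + Default> MaxHandoff<T> {
    fn give(&mut self, item: T) {
        // Keep only the larger of the stored value and the new item.
        if self.current.as_ref().map_or(true, |cur| item > *cur) {
            self.current = Some(item);
        }
    }
    fn take(&mut self) -> Option<T> {
        self.current.take()
    }
}

fn main() {
    let mut h = MaxHandoff::default();
    for x in [3, 9, 4, 1] {
        h.give(x);
    }
    assert_eq!(h.take(), Some(9)); // only one element was ever stored
    assert_eq!(h.take(), None);
    println!("ok");
}
```

The `give` step is exactly the lattice join for the max lattice, which is why lattice-typed handoffs seem like the natural generalization.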
Will require the insertion of barriers, which I don't really know how to do in HF.
e.g. to wrap inbound_tcp_vertex_port and outbound_tcp_vertex_port
A common pattern in Dedalus programs is to persist some relation into the future via an inductive rule:
q(X)@next :- q(X).
Translated naively to Hydroflow today, this would result in draining a buffer only to re-fill it the same way on the next iteration. We probably want some mechanism to allow us to designate a relation as "persisted" and thus not drained at the end of a tick.
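A plain-Rust sketch of the contrast (not actual Hydroflow internals): the naive translation drains and rebuilds the buffer every tick, while a "persisted" relation would simply be left alone at end-of-tick.

```rust
// Sketch contrasting the naive translation of `q(X)@next :- q(X).` with a
// relation marked as persisted.
fn main() {
    // Naive: drain q at end of tick, then re-insert everything identically.
    let mut q = vec![1, 2, 3];
    let drained: Vec<i32> = q.drain(..).collect(); // pure churn
    q.extend(drained);

    // Persisted: end-of-tick leaves the buffer untouched; no work at all.
    let q_persisted = vec![1, 2, 3];

    assert_eq!(q, q_persisted); // same contents, very different cost
    println!("ok");
}
```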
This issue is to acknowledge two needs I see: recognize inductive rules like
q(X)@next :- q(X).
and translate them into a form that can be handled more efficiently.

This issue will serve as a central place for comments on the surface syntax and its usability.
While this can be achieved with reduce, it would be handy.
A list of some stuff we discussed today while working through exchange.
Having to do fancy type-level list concatenation to write an operator is Not So Much Fun. It definitely gives us a lot of power and safety, but the amount the average user is exposed to it today is unfortunate, especially because type-level lists don't feel (to me, at least) like an essential part of the type I'm describing when I write a function.
This might just be my inexperience with explicit lifetimes, but there's quite a lot of machinery to go through to get something working. This one I could see getting resolved once we have sufficiently many examples, though.
This should be its own issue for discussion, perhaps.
Can we easily thread metadata through a pipeline here?
We sort of want something like the tokio streams, kind of looking like this:
enum Msg<T> {
PartialStop,
TotalStop,
Data(T)
}
but as it is, the iterators expect Options, and it's not clear how to pass these variants through the tree. Another option would be to have a parallel tree that manages the metadata, but that seems hard and difficult to make safe.
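A small sketch of the mismatch: squeezing Msg through an Option-shaped pipeline silently drops the control-flow variants.

```rust
// Standard Iterator only yields Some(item) / None, so the control-flow
// variants of Msg have nowhere to go without either flattening them away
// or threading them out-of-band.
enum Msg<T> {
    PartialStop,
    TotalStop,
    Data(T),
}

fn main() {
    let msgs = vec![Msg::Data(1), Msg::PartialStop, Msg::Data(2), Msg::TotalStop];

    // Forcing Msg through an Option-shaped pipeline loses the metadata:
    let data: Vec<i32> = msgs
        .into_iter()
        .filter_map(|m| match m {
            Msg::Data(x) => Some(x),
            Msg::PartialStop | Msg::TotalStop => None, // punctuation lost!
        })
        .collect();
    assert_eq!(data, vec![1, 2]);
    println!("{:?}", data);
}
```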
E.g. we just added flatten and null, and they're not documented. How do we encourage this?
Right now there is not anything that reschedules generic source subgraphs. Also check how the context allows scheduling.
Would be nice for debugging to have an easy syntax to mark a hydroflow variable "debug" and have its contents teed to stdout or stderr.
I.e. suppose I have

```
message_generator = recv_iter([1,2,3]) -> ... foo(...) -> sink_async(..);
```

I would want:

```
message_generator_prep = recv_iter([1,2,3]) -> ... foo(...) -> tee();
message_generator_prep[0] -> for_each(|m| println!("message_generator: {:?}", m));
message_generator = message_generator_prep[1] -> sink_async(..);
```
An idling Hydroflow instance makes my (and Mingwei's) fans go crazy, taking up 100% CPU. We should fix that.
The hydroflow parser chooses the left and right sides of the join based on the order in which the join inputs appear in the text, not based on the input index. I.e. the two cases below differ only in the order of the lines, yet only the first parses correctly.
```rust
pub fn test_join_order() {
    let mut df_good = hydroflow_syntax! {
        yikes = join() -> for_each(|m: ((), (u32, String))| println!("{:?}", m));
        recv_iter([0,1,2]) -> map(|i| ((), i)) -> [0]yikes;
        recv_iter(["a".to_string(),"b".to_string(),"c".to_string()]) -> map(|s| ((), s)) -> [1]yikes;
    };
    let mut df_bad = hydroflow_syntax! {
        yikes = join() -> for_each(|m: ((), (u32, String))| println!("{:?}", m));
        recv_iter(["a".to_string(),"b".to_string(),"c".to_string()]) -> map(|s| ((), s)) -> [1]yikes;
        recv_iter([0,1,2]) -> map(|i| ((), i)) -> [0]yikes;
    };
}
```
It would be nice if there were a simple semantic for explicitly vs. non-explicitly stratified subgraphs. This one (where they are always at stratum 0) seems fine to me, but another approach that sounds equally reasonable is that non-explicitly stratified operators are always eligible to run. Not sure if we have a preference, but it seems like something we should decide on.
... :- foo(a, a)
implies a filter on the rows coming from foo. We need to handle this appropriately.
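A sketch of the intended translation: the repeated variable becomes an equality filter plus a projection rather than a join (plain Rust, illustrative only, not generated code):

```rust
// A repeated variable in a body atom, e.g. `p(a) :- foo(a, a)`, compiles
// to an equality filter over the rows of `foo`, then a projection.
fn main() {
    let foo = vec![(1u32, 1u32), (1, 2), (3, 3)];

    // foo(a, a): keep only rows whose two columns are equal, project `a`.
    let p: Vec<u32> = foo
        .into_iter()
        .filter(|(x, y)| x == y)
        .map(|(x, _)| x)
        .collect();

    assert_eq!(p, vec![1, 3]);
    println!("{:?}", p);
}
```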
tee_consumer? after_tee?
CanReceive<T> type: previously we used this to allow submitting multiple types, but it adds roughness since we have to specify the T when using the surface API.
flush() on inputs is an easy footgun.
flatten() instead of .flat_map(std::convert::identity).
In Bloom there is a "bootstrap" block which can be used to run code before tick 0. Similarly datalog has fixed EDB code. We have to figure out how to setup/schedule this.
Ad-hoc, this can be done with arbitrary Rust in fn main(), but maybe we want to be more principled.
Not sure why; seems to be something around reading lines from the terminal?
It's annoying to have to map things to () keys to achieve a cross-join.
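The workaround in question, sketched in plain Rust: map both sides to the unit key (), after which a keyed join on () degenerates into the full cross product.

```rust
// To cross-join two streams with a keyed join(), every row is mapped to
// the unit key (), joined, and then the key is thrown away again.
fn main() {
    let xs = [1u32, 2];
    let ys = ["a", "b"];

    // Map each side to ((), value)...
    let left: Vec<((), u32)> = xs.iter().map(|&x| ((), x)).collect();
    let right: Vec<((), &str)> = ys.iter().map(|&y| ((), y)).collect();

    // ...so a join on () produces the full cross product.
    let mut cross = Vec::new();
    for &((), x) in &left {
        for &((), y) in &right {
            cross.push((x, y));
        }
    }
    assert_eq!(cross, vec![(1, "a"), (1, "b"), (2, "a"), (2, "b")]);
    println!("{:?}", cross);
}
```

A dedicated cross_join operator would hide the dummy-key plumbing entirely.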
@davidchuyaya's protocols often include "relations" defined by a boolean expression, which can be joined with to introduce filters.
I'm imagining a syntax like:
... :- ..., { a + b > 3 }
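A sketch of how such a boolean-expression "relation" could compile: rather than an actual join, it becomes a filter over the already-joined tuples (plain Rust, illustrative only):

```rust
// A body term like `{ a + b > 3 }` applied to tuples (a, b): joining with
// the boolean "relation" is equivalent to filtering by the predicate.
fn main() {
    let rows = vec![(1, 1), (2, 2), (5, 0)];

    let kept: Vec<(i32, i32)> = rows
        .into_iter()
        .filter(|(a, b)| a + b > 3) // the "relation" as a filter
        .collect();

    assert_eq!(kept, vec![(2, 2), (5, 0)]);
    println!("{:?}", kept);
}
```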
push_to_pull()? pull_to_push()? switch_push_pull()?
Presently, while we are ticking a stratum, we call try_recv_events after each operator, which means we can receive events at basically any point: a network event not present at the beginning of a stratum can show up in the middle of it.
I think there are three obvious behaviours:
I think (2) and (3) might be indistinguishable semantically, but (1) is different. We currently implement (1). I don't have a strong opinion on what the correct behaviour is here, but my understanding was that the desired behaviour was (3). Probably something we should discuss and figure out.
Is there an expense to the mapping once the compiler has done its magic, relative to tuples that "happen to be" set up right and don't need maps? Would a closure for "key access" on each input help the compiler more than mapping? And/or should we have some fast-path that makes the "relational joins on relational data" go fast?
In the spirit of Bloom's channel, it would be nice to have a single socket handling all the streams running into a Hydroflow node. All the Hydroflow programmer should care about is the name and type of each stream, not the socket associated with it.
push_into()? then()?
Maybe write down in English what it does and find a name in that explanation.
Right now, we can only generate surface syntax when using the macro graph builder logic, which means the only way to get surface syntax out is from the proc macro writing to stdout. Ideally, we could generate surface syntax just like we generate mermaid.