fishrock123 / bob

🚰 binary data "streams+" via data producers, data consumers, and pull flow.

License: MIT License

JavaScript 57.14% C++ 31.50% Python 0.67% C 10.69%

bob node nodejs pull-streams sink source streams

bob's Introduction

Specialist in API design & Rust language implementation.

Co-Maintainer of http-rs, including Tide & Surf.

Node.js Technical Steering Committee Emeritus

www.jeremiah-senkpiel.com

bob's People

nicknaso adarshsaraogi antsmartian jasnell mirzasmrkovic

bob's Issues

Proposal: change binding

Binding is ugly and I bet it is the least comprehensible part right now.

Ideally I guess it would look something like Stream(source, transform, sink).

Streams3 adaptor / transform (s)

Heya @mcollina & @mafintosh - I think this is the next required step here, since we'd need this kinda thing to be able to do anything in node core anyways.

I've created a repo for this at https://github.com/Fishrock123/bob-streams3 and invited you both as collaborators. It contains some basic bits but no functioning code.

There are a good bit of docs and examples lying around here & in the linked-to module repos, please holler if you need any of my help.

Construct flow

Some kind of construct flow would be very useful for a couple of significant reasons:

Presently resources must either be opened upon constructor call, or upon "first pull".
- The former presents async timing issues
- The latter is complicated and messy
It would be useful to pass buffer allocation hints out-of-flow (#52)

Arguments for doing it all inline could be getting pretty long (and very variable), not even counting the first point:

pull(status_type, error, buffer, size, offset)

Maybe?

Idk, maybe separating this all out into multiple flows would be better, similar to Streams3 but just sans the dreaded EventEmitter.

(sink calls) -> (source calls)
construct(...) -> ready(...)
pull(...) -> give(...)
destroy(error) -> destroy(error)

Very related to nodejs/node#29314
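
The paired calls above could be sketched roughly as follows. This is only an illustration of the idea, not an agreed API: the method names construct, ready, give and destroy come from the list above, while LazySource, the { size } hint object, the string statuses and the log array are all hypothetical. The pulls recurse synchronously, which is fine only for a tiny example.

```javascript
// Hypothetical source that "opens" its resource in construct(), not in the constructor.
class LazySource {
  constructor(chunks) {
    this.chunks = chunks;
    this.sink = null;
    this.opened = false;
  }
  bindSink(sink) { this.sink = sink; }
  construct() {
    // A real source would open a file or socket here (possibly asynchronously).
    this.opened = true;
    // Reply out-of-flow, including a made-up buffer allocation hint.
    this.sink.ready(null, { size: 16 });
  }
  pull(buffer) {
    if (!this.opened) {
      this.sink.destroy(new Error('pull() before construct()'));
      return;
    }
    const chunk = this.chunks.shift();
    if (chunk === undefined) {
      this.sink.give('end', buffer, 0);
      return;
    }
    const bytes = buffer.write(chunk); // returns the number of bytes written
    this.sink.give('continue', buffer, bytes);
  }
  destroy(error) { this.opened = false; }
}

const log = [];
const source = new LazySource(['ab', 'cd']);
const sink = {
  ready(error, hints) {
    log.push('ready');
    source.pull(Buffer.alloc(hints.size)); // allocate using the hint
  },
  give(status, buffer, bytes) {
    if (status === 'end') {
      log.push('end');
      source.destroy(null);
      return;
    }
    log.push(buffer.toString('utf8', 0, bytes));
    source.pull(buffer);
  },
  destroy(error) { log.push('destroyed'); }
};

source.bindSink(sink);
source.construct();
```

With this shape, pull() and give() no longer need to carry the hint or any open/close state as extra arguments.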

automated tests

Finally working on this.

The npm modules (fs-source, etc) have automated tests, but this repo never did...

Potential streams3 adaptor bugs

See #35 (review)

Proposal: rename `next()`

Due to potential conflicts with iterator#next(), perhaps this part of the protocol should be renamed. Any thoughts?

Maybe something like give()?

Make Stream() an async iterator

I think Stream() would be a good place for this, what does everyone else think?

We should also be sure to handle cases such as nodejs/node#28194

clean up the source after stopping

this module implements a stop method:
https://github.com/Fishrock123/bob/blob/master/reference-extension-stop.js

but it doesn't notify the source in any way. If the source is using a heavy resource (e.g. a file descriptor), the source needs to know the sink has stopped so that it can close the resource.
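
One possible way to notify the source is sketched below, assuming that stop() sends an error upstream through pull(error, buffer). This is not what reference-extension-stop.js does; ResourceSource, StoppableSink, the resourceOpen flag and the string statuses are hypothetical stand-ins.

```javascript
// Hypothetical source holding a "resource" (a flag standing in for a file descriptor).
class ResourceSource {
  constructor() {
    this.sink = null;
    this.resourceOpen = true;
  }
  bindSink(sink) { this.sink = sink; }
  pull(error, buffer) {
    if (error) {
      // The sink has stopped: release the resource, then report the error back down.
      this.resourceOpen = false;
      this.sink.next('error', error, buffer, 0);
      return;
    }
    buffer[0] = 1;
    this.sink.next('continue', null, buffer, 1);
  }
}

class StoppableSink {
  constructor() {
    this.source = null;
    this.bytes = 0;
    this.lastError = null;
  }
  bindSource(source) {
    this.source = source;
    source.bindSink(this);
  }
  start() { this.source.pull(null, Buffer.alloc(1)); }
  stop() {
    // Tell the source we are done so it can clean up.
    this.source.pull(new Error('sink stopped'), Buffer.alloc(0));
  }
  next(status, error, buffer, bytes) {
    if (status === 'error') {
      this.lastError = error;
      return;
    }
    this.bytes += bytes; // a real sink would usually pull again here
  }
}

const resourceSource = new ResourceSource();
const stoppableSink = new StoppableSink();
stoppableSink.bindSource(resourceSource);
stoppableSink.start();
stoppableSink.stop();
```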

Progress 23/7/2018 - July

The current status is:

Completed moving sinks / sources to npm modules:

The status codes enum: bob-status
A file system source: fs-source
A file system sink: fs-sink
A zlib transform: zlib-transform

Next up:

Make a pull-streaming http / socket "duplex". Needs to use BOB streams as far down as possible.
- Likely to be based on Matteo's work of making a lighter TCP socket + http implementation.

Previous status - 1/6/2018 (Berlin Collab Summit): #11

Proposal: make `start()` standard instead of suggesting to automatically pull.

I like being explicit in intent - I feel it would be better to not include default code examples that automatically start from sinks and rather just always include start(). (And no longer have start() be an "extension".)

Transform this repo into more of a spec

So I've been realizing a couple related things...

A big one is that I think protocol enforcement needs to be done via classes / helpers somehow, but at the same time I don't think it should be 100% necessary. (Heck it isn't even in streams3)

A more formal specification than just the reference would be useful, in the same way it would be useful to have (or had) that kind of thing for streams3.

This should also allow the project at a spec and then "core classes" level to be moved into the node org without having to drag every sub module in.

Official way of doing C++ binding detection

need to do this... currently it's really just done via the C++ passthrough

Managed state

Essentially an extension of #53 with a different focus.

There are a number of reasons (#53, #52, #30, etc) for why some kind of state management that does not need to be reimplemented by every stream would be useful.

There are two primary ways to deal with this (that I can think of):

Class inheritance, do everything in a base class and then extend that
Some kind of managed state object which is persistent and passed in the pull flow
- @jasnell's idea
- Could potentially deal with the issues without requiring class inheritance?

Of course, if we inherit from a class, the obvious thing to do would be to integrate the verify transform into said class, so that guarantees are, at the absolute minimum, enforced not by convention but by code.

"Extensions"

I have added a section to the readme in this repo about "extensions": API Extension Reference

So far this has seemed the best way to deal with possible optional additional APIs, such as an explicit start, or a stop for handling timeouts.

Progress 16/2/2018 (2018 week 7)

The current status is:

Goals: merged a couple weeks ago: https://github.com/Fishrock123/bob#goals
js-only fs source/sinks: working in this repo
js-only zlib transform: working in this repo
buffering api: no work yet
JS in C++ passthrough: working in this repo - b4fafa6
C++ api: currently out of date, see nodejs/node#16414

Next up:

C++ to C++ passthrough
C++ file source or sink endpoint

Previous status - 15/1/2018 (2018 week 3): #2

Buffer allocation hints

As much as I'd like to avoid it, it seems like some kind of hint system would be useful for telling who should allocate buffers and of what size.

Ideally, this would be done out of the regular pull flow (to avoid passing like 7 arguments every time). So, probably going to be connected to another (yet to be made) issue about doing some kind of "construct" flow...

See #30 (comment)

One piece of contention in the current design of the sink API is who allocates the buffer. If I have data already in buffers that needs to be written to a socket or disk, it doesn't make sense for the sink to allocate a write buffer and tell me to copy values into it.

Where do object streams fit in?

One thing I like about node.js streams is composability. It's easy to compose pipelines that mix binary and object streams. For example, parsing a csv file, transforming the rows (objects), then writing back to a file. When raw performance isn't important (it often isn't) then node.js streams are pretty great.

Bob is binary-only, so what will replace object streams? If the answer is "async iterators", how does one compose a pipeline as easily as x.pipe(y).pipe(z)? And if each of those is an async iterator, wouldn't that hurt throughput, as you can no longer transform multiple objects in one tick?

'bob' performance discussion

So, I finally profiled this on my linux box (macOS is useless because of ___channel_get_opt, good luck).

I have documented the results so far in performance.md. I only really tried doing a very large file and have not yet made cases that make many small streams.

The results are looking good. The HDD is the limiting factor of my linux system, and the profiles show file copying has ~7x less CPU time in JS, and zlib transform has ~33% less CPU time in JS. 💥 (C++ time does not seem significantly affected for either case.)

cc @jasnell, @mcollina

potential stack overflows

currently, the bob sink calls this.source.pull() and then the source calls this.sink.next()
however, if the source calls back synchronously, and the sink calls pull again synchronously, and the stream moves enough data, this could cause a stack overflow.

I worked around this in pull-stream with this (ugly) code:
https://github.com/pull-stream/pull-stream/blob/master/sinks/drain.js#L12-L37

Basically, it checks if its next was called sync (i.e. if the last call to pull hasn't returned yet) and, if so, falls out to a loop that calls pull again. If pull() returns before next is called, then the source is async, so it exits the loop. This is the most complicated part of pull-streams. bob streams will need to have a thing like this too. You can use setImmediate too, but that's actually more overhead than the loop, and the loop means that a completely sync stream can stay completely sync.
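
A minimal sketch of that loop, adapted to the pull() / next() shape described in this issue. It is not the pull-stream code linked above: SyncCountingSource, LoopingSink, the pulling / pullAgain flags and the string statuses are hypothetical names used only for illustration.

```javascript
// Hypothetical source that answers every pull() synchronously.
class SyncCountingSource {
  constructor(total) {
    this.remaining = total;
    this.sink = null;
  }
  bindSink(sink) { this.sink = sink; }
  pull(error, buffer) {
    if (this.remaining === 0) {
      this.sink.next('end', null, buffer, 0);
      return;
    }
    this.remaining--;
    this.sink.next('continue', null, buffer, 1);
  }
}

class LoopingSink {
  constructor() {
    this.source = null;
    this.buffer = Buffer.alloc(1);
    this.received = 0;
    this.done = false;
    this.pulling = false;   // true while a pull() call is still on the stack
    this.pullAgain = false; // set when next() arrives before pull() has returned
  }
  bindSource(source) {
    this.source = source;
    source.bindSink(this);
  }
  start() { this.drain(); }
  drain() {
    do {
      this.pullAgain = false;
      this.pulling = true;
      this.source.pull(null, this.buffer);
      this.pulling = false;
      // If next() ran synchronously it set pullAgain, so loop instead of recursing.
      // Otherwise the source is async and the loop exits until next() restarts it.
    } while (this.pullAgain);
  }
  next(status, error, buffer, bytes) {
    if (error || status === 'end') {
      this.done = true;
      return;
    }
    this.received += bytes;
    if (this.pulling) {
      this.pullAgain = true; // sync answer: let drain() pull again
    } else {
      this.drain();          // async answer: start a new loop
    }
  }
}

const loopingSink = new LoopingSink();
loopingSink.bindSource(new SyncCountingSource(1000000));
loopingSink.start();
```

Because the sync case is handled by the do/while loop, the stack depth stays constant no matter how many chunks the source hands over.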

push-stream solves this a much simpler way: sinks have a paused property, which the source can check before it calls write. A sync source can just loop until the sink pauses. then wait until resume is called. this means it uses less stack memory.

Hmm, that wouldn't work with bob streams because of the way the sink allocates the buffer...
I'm not really sure about that, though. (and also forbidding object streams, but let's not discuss that in this issue, the stack overflow problem is more important)

Progress 11/12/2019 - December

The last update was fairly large (23/07/2019 - July #40), but this one is much smaller.

Notably I'm no longer employed and paid to continue this kind of work around Node and I don't really do this as my hobby or have much default motivation to continue.

I did merge a pull request that adds WritableSource and ReadableSink for streams3 interoperability: 57a78d1

I am supposed to present this initiative's status again at the Montreal collab summit. I am not quite sure what will come out of that and it may end up being more of a post-mortem.

Progress 5/10/2018 - October

The current status is:

Completed sink / source / "duplex" npm modules:

The status codes enum: bob-status
A file system source: fs-source
A file system sink: fs-sink
A zlib transform: zlib-transform
A TCP socket "duplex": socket
A TCP server of "duplex" sockets: socket

Next up:

Present this at the N+JSI 2018 Collaborator Summit: openjs-foundation/summit#110

Previous status - Progress 23/7/2018 - July: #12

Progress 30/3/2018 (2018 week 13)

The current status is:

Project goals
Project high-level approach
JS API:
- fs source/sinks: working
- zlib transform: working
- JS in C++ passthrough: working - b4fafa6
C++ API:
- C++ PassThrough with JS or C++ endpoints: working - b560aa7
- C++ File Source & File Sink: working in branch c++
buffering api: no work yet

Next up:

Updated profiles
Proper C++ error handling / passing / translation
Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 13/3/2018 (2018 week 11): #9

potential future move to nodejs org

Not really super into moving stuff into an official repo atm but once some more status has been completed (see #2), we may want to so as to get some more traction, as forming a "team" may help there.

idk if that would mean actually moving the repo or just its contents

Progress 15/1/2018 (2018 week 3)

The current status is:

Goals: almost ironed out, see #1
initial js-only fs source/sinks: working in this repo
js-only zlib transform: almost working, see aec09c2
buffering api: no work yet
C++ api: currently out of date, see nodejs/node#16414

I'm thinking of taking a swing at doing a C++ fs after the js-only zlib transform is working.

Proposal: Status Code Error Enums

For errors, any negative value should be considered an error, with multiple negative values permitted. That would obviously mean not using an enum for status, and instead using constants... e.g.

#define BOB_ERR_{whatever} -{some integer}
#define BOB_ERR_END 0
#define BOB_ERR_CONTINUE 1
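
On the JS side, the same convention might look roughly like this. Only END = 0 and CONTINUE = 1 come from the proposal above; the object name, the two negative codes and the helper functions are made up for illustration.

```javascript
// Hypothetical status constants: any negative value is an error.
const Status = Object.freeze({
  ERR_IO: -2,      // made-up error code
  ERR_GENERIC: -1, // made-up error code
  END: 0,
  CONTINUE: 1
});

// A consumer only needs a sign check to detect errors.
const isError = (status) => status < 0;

function describe(status) {
  if (isError(status)) return 'error';
  return status === Status.END ? 'end' : 'continue';
}
```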

Proposal: add `offset` to `pull()`

From the collab summit, after talking with @jasnell extensively about QUIC, I think it would be useful to have pull support an offset to ask the source to read from (the source may choose to still return whatever data it chooses).

This should allow for ACKs in a network implementation (i.e. if you need to ACK, you just request the last / desired offset again).

It also would support a level of content-addressability. A sink or downstream intermediate could be the one to inform a file source of where to read from.

Relatedly, this may also make disk-based sources require less state...
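
A rough sketch of what an offset-aware source could look like, assuming the offset is simply a third argument to pull(). MemorySource, the argument order and the string statuses are hypothetical. Note that the source keeps no read position of its own.

```javascript
// Hypothetical stateless source over an in-memory Buffer.
class MemorySource {
  constructor(data) {
    this.data = data;
    this.sink = null;
  }
  bindSink(sink) { this.sink = sink; }
  // `offset` is the proposed extra argument: where the sink wants the read to start.
  pull(error, buffer, offset) {
    if (offset >= this.data.length) {
      this.sink.next('end', null, buffer, 0);
      return;
    }
    const end = Math.min(offset + buffer.length, this.data.length);
    const bytes = this.data.copy(buffer, 0, offset, end); // returns bytes copied
    this.sink.next('continue', null, buffer, bytes);
  }
}

const seen = [];
const memorySource = new MemorySource(Buffer.from('abcdefgh'));
memorySource.bindSink({
  next(status, error, buffer, bytes) {
    if (status === 'continue') seen.push(buffer.toString('latin1', 0, bytes));
    else seen.push(status);
  }
});

const chunk = Buffer.alloc(4);
memorySource.pull(null, chunk, 0); // first four bytes
memorySource.pull(null, chunk, 4); // next four bytes
memorySource.pull(null, chunk, 4); // same offset requested again, e.g. after a lost ACK
memorySource.pull(null, chunk, 8); // past the end
```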

Progress 13/3/2018 (2018 week 11)

The current status is:

Project goals: https://github.com/Fishrock123/bob#goals
Project high-level approach: https://github.com/Fishrock123/bob#project-approach
JS API:
- fs source/sinks: working in this repo
- zlib transform: working in this repo
- JS in C++ passthrough: working in this repo - b4fafa6
C++ API:
- C++ PassThrough with JS or C++ endpoints working in #8
buffering api: no work yet

Next up:

C++ file source or sink endpoint
Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 16/2/2018 (2018 week 7): #5

Blog about this?

I probably have enough material, could be good for visibility?

Compare more to alternate ideas, such as push-stream.

See https://github.com/push-stream/push-stream

Progress 15/11/2018 - November

The current status is:

Was presented, to good reception, at the N+JSI 2018 Collaborator Summit: openjs-foundation/summit#110
- Slides online at https://fishrock123.github.io/nodejs-collab-summit-2018

Completed sink / source / "duplex" npm modules:

The status codes enum: bob-status
A file system source: fs-source
A file system sink: fs-sink
A zlib transform: zlib-transform
C++ headers: bob-base
A TCP socket "duplex": socket
A TCP server of "duplex" sockets: socket

Next up:

Address open questions: #17
Write a Streams3 adaptor: #18

Previous status - Progress 5/10/2018 - October: #13

Open Questions presented to N+JSI collab summit

From my collab summit presentation (#16), here are the open questions I presented at the end:

Is this API likeable?
- It seemed to be well received at the collab summit
Do we need a buffer pooling helper?
- (Probably.)
Is it too API-pattern focused?
How to enforce single active request?
Where to start implementing in core?
Libuv pull streams?
- The libuv folks I chatted with at N+JSI seemed open to it.

Check Deno's stance on streams

@benjamingr Mentioned Deno was discussing their approach to streams. We should check it out.

Non-single-logical-flow (multiple pulls)

Moving out from #23 (comment)

It seems that newer network protocols like QUIC desire multiple chunks of data to be in-flight at once (besides considering re-sending).

This probably violates these two core design ideas:

One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.

In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.

It may also unleash zalgo? lol.

Anyways, I think it is possible to still keep things simple and "pretend" that things are multiplexed, by doing slightly more waiting at the network sink end. I'm not really sure that perf would be considerably impacted in most cases?

Edit: See #30 (comment) for updated thoughts.

Progress 23/07/2019 - July

Ok so, a lot of stuff has happened since the last update (15/11/2018 - November).

An unfortunately incomplete list of actions since then:

Re-presented at the Berlin June 2019 collab summit.
- Slides at https://fishrock123.github.io/nodejs-collab-summit-berlin-2019
Introduced Stream() composition #31
start(cb) required for Sinks #32
Automated tests! #38
AssertionSink & AssertionSource
A Verification passthrough for API enforcement #34
Large progress on Streams3 Readable/Writable adaptors #35
This repo is now an npm module under the name bob-streams. #33
Made crc-transform as proof for an internal live coding demo
- I don't think I can make the recording public but maybe I can do a public / extended version
Many other minor fixes
Additional prototyping of interaction with async iterators
Discussion about adding an offset to pull() #23
Discussion about multi-pull flow #30

Collab summit - Progress 1/6/2018

The current status is:

Moving sinks / sources to npm modules:

The status codes enum: bob-status
A file system source: fs-source
A file system sink: fs-sink

Next up:

Move more bits to npm modules
Updated profiles
Proper C++ error handling / passing / translation

Previous status - 30/3/2018 (2018 week 13): #9

Reconsider `bind{Source|Sink}()`

I'm really not convinced the binding apis are very good.

Might be better to have a helper function like streamline(a, b, c) - ideally so that you can do streamline(streamline(a, b, c), d) too.

Gonna try to work on a PoC module this week...

What are the differences of this approach in comparison to pull-stream?

I'm excited to see this development as I am a heavy user of the pull-stream ecosystem for etl processing. This approach feels and reads extremely similar, but with obvious gains to be made by making it natively supported by node. Do these two efforts align (or differ) in any way? Is bob expected to support existing pull-stream patterns so as to benefit the variety of libraries already available on npm? Could it? 😄

For reference: https://github.com/pull-stream/pull-stream

Project Approach evaluation
Goals evaluation
API conventionality (how to enforce this?)
Naming e.g. #22, "sinks"
Multi-pull flow #30
C++ binding detection #29

ccing everyone who has shown interest so far:
@jasnell, @Raynos, @mcollina, @mafintosh, @benjamingr
