
fishrock123 / bob


🚰 binary data "streams+" via data producers, data consumers, and pull flow.

License: MIT License

JavaScript 57.14% C++ 31.50% Python 0.67% C 10.69%
bob node nodejs pull-streams sink source streams


bob's People

Contributors: antsmartian, fishrock123, jasnell

bob's Issues

Proposal: change binding

Binding is ugly and I bet it is the least comprehensible part right now.

Ideally I guess it would look something like Stream(source, transform, sink).

Streams3 adaptor / transform (s)

Heya @mcollina & @mafintosh - I think this is the next required step here, since we'd need this kinda thing to be able to do anything in node core anyways.

I've created a repo for this at https://github.com/Fishrock123/bob-streams3 and invited you both as collaborators. It contains some basic bits but no functioning code.

There are a good bit of docs and examples lying around here & in the linked-to module repos, please holler if you need any of my help.

Construct flow

Some kind of construct flow would be very useful for a couple of significant reasons:

  • Presently resources must either be opened upon constructor call, or upon "first pull".
    • The former presents async timing issues
    • The latter is complicated and messy
  • It would be useful to pass buffer allocation hints out-of-flow (#52)

Arguments for doing it all inline could be getting pretty long (and very variable), not even counting the first point:

pull(status_type, error, buffer, size, offset)

Maybe?

Idk, maybe separating this all out into multiple flows would be better, similar to Streams3 but just sans the dreaded EventEmitter.

  • (sink calls) -> (source calls)
  • construct(...) -> ready(...)
  • pull(...) -> give(...)
  • destroy(error) -> destroy(error)
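
A rough sketch of how the four paired flows listed above could fit together. Everything here is hypothetical: the class names, the string status values, the `hints` argument, and the `ready()` payload are assumptions made for illustration, not part of bob's API.

```javascript
// Hypothetical sketch only: these classes and signatures are not bob's API.
class ArraySource {
  constructor(chunks) {
    this.chunks = chunks;
    this.index = 0;
    this.sink = null;
  }
  bindSink(sink) { this.sink = sink; }
  // construct(...) -> ready(...): resources are opened here, not in the
  // constructor and not on the first pull. `hints` is unused in this sketch.
  construct(hints) {
    this.sink.ready(null, { chunkCount: this.chunks.length });
  }
  // pull(...) -> give(...)
  pull() {
    if (this.index < this.chunks.length) {
      this.sink.give('continue', this.chunks[this.index++]);
    } else {
      this.sink.give('end', null);
    }
  }
  // destroy(error) -> destroy(error)
  destroy(error) { this.sink.destroy(error); }
}

class CollectSink {
  constructor(done) {
    this.done = done;
    this.received = [];
    this.source = null;
  }
  bindSource(source) {
    this.source = source;
    source.bindSink(this);
  }
  start() { this.source.construct({}); }
  ready(error) {
    if (error) return this.done(error);
    this.source.pull();
  }
  give(status, chunk) {
    if (status === 'end') return this.done(null, this.received);
    this.received.push(chunk);
    this.source.pull();
  }
  destroy(error) { this.done(error || new Error('destroyed')); }
}

let result;
const sink = new CollectSink((err, chunks) => { result = err || chunks; });
sink.bindSource(new ArraySource(['a', 'b', 'c']));
sink.start();
```

Splitting the calls this way keeps each signature short, at the cost of more methods for every stream to implement.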

Very related to nodejs/node#29314

Proposal: rename `next()`

Due to potential conflicts with iterator#next(), perhaps this part of the protocol should be renamed. Any thoughts?

Maybe something like give()?

Progress 23/7/2018 - July

The current status is:

Completed moving sinks / sources to npm modules:

Next up:

  • Make a pull-streaming http / socket "duplex". Needs to use BOB streams as far down as possible.
    • Likely to be based on Matteo's work of making a lighter TCP socket + http implementation.

Previous status - 1/6/2018 (Berlin Collab Summit): #11


Transform this repo into more of a spec

So I've been realizing a couple related things...

A big one is that I think protocol enforcement needs to be done via classes / helpers somehow, but at the same time I don't think it should be 100% necessary. (Heck it isn't even in streams3)

A more formal specification than just the reference would be useful, in the same way it would be useful to have (or had) that kind of thing for streams3.

This should also allow the project at a spec and then "core classes" level to be moved into the node org without having to drag every sub module in.

Managed state

Essentially an extension of #53 with a different focus.

There are a number of reasons (#53, #52, #30, etc) for why some kind of state management that does not need to be reimplemented by every stream would be useful.

There are two primary ways to deal with this (that I can think of):

  • Class inheritance, do everything in a base class and then extend that
  • Some kind of managed state object which is persistent and passed in the pull flow
    • @jasnell's idea
    • Could potentially deal with the issues without requiring class inheritance?

Of course, if we inherit from a class, the obvious thing to do would be to integrate the verify transform into said class, so that the guarantees are, at the absolute minimum, enforced by code rather than by convention.

"Extensions"

I have added a section to the readme in this repo about "extensions": API Extension Reference

So far this has seemed the best way to deal with possible optional additional APIs, such as an explicit start, or a stop for handling timeouts.

Buffer allocation hints

As much as I'd like to avoid it, it seems like some kind of hint system would be useful for telling who should allocate buffers and of what size.

Ideally, this would be done out of the regular pull flow (to avoid passing like 7 arguments every time). So, probably going to be connected to another (yet to be made) issue about doing some kind of "construct" flow...

See #30 (comment)

One piece of contention in the current design of the sink API is who allocates the buffer. If I have data already in buffers that needs to be written to a socket or disk, it doesn't make sense for the sink to allocate a write buffer and tell me to copy values into it.

Where do object streams fit in?

One thing I like about node.js streams is composability. It's easy to compose pipelines that mix binary and object streams. For example, parsing a csv file, transforming the rows (objects), then writing back to a file. When raw performance isn't important (it often isn't) then node.js streams are pretty great.

Bob is binary-only, so what will replace object streams? If the answer is "async iterators", how does one compose a pipeline as easily as x.pipe(y).pipe(z)? And if each of those is an async iterator, wouldn't that hurt throughput, as you can no longer transform multiple objects in one tick?

'bob' performance discussion

So, I finally profiled this on my linux box (macOS is useless because of ___channel_get_opt, good luck).

I have documented the results so far in performance.md. I only really tried doing a very large file and have not yet made cases that make many small streams.

The results are looking good. The HDD is the limiting factor of my linux system, and the profiles show file copying has ~7x less CPU time in JS, and zlib transform has ~33% less CPU time in JS. 💥 (C++ time does not seem significantly affected for either case.)

cc @jasnell, @mcollina

potential stack overflows

currently, the bob sink calls this.source.pull() and then the source calls this.sink.next()
however, if the source calls back synchronously, and the sink calls pull again synchronously, and the stream moves enough data, this could cause a stack overflow.

I worked around this in pull-stream with this (ugly) code:
https://github.com/pull-stream/pull-stream/blob/master/sinks/drain.js#L12-L37

Basically, it checks if its next was called sync (i.e. if the last call to pull hasn't returned yet) and if so, falls out to a loop that calls pull again. If pull() returns before next is called, then the source is async, so exit the loop. This is the most complicated part of pull-streams. bob streams will need to have a thing like this too. You can use setImmediate too, but that's actually more overhead than the loop, and the loop means that a completely sync stream can stay completely sync.

push-stream solves this a much simpler way: sinks have a paused property, which the source can check before it calls write. A sync source can just loop until the sink pauses. then wait until resume is called. this means it uses less stack memory.

Hmm, that wouldn't work with bob streams because of the way the sink allocates the buffer...
I'm not really sure about that, though. (And also forbidding object streams, but let's not discuss that in this issue; the stack overflow problem is more important.)
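
The loop trick described above can be sketched against bob-style pull()/next() naming. This is a minimal illustration under assumptions, not the pull-stream drain code and not bob's implementation: the class names, the string status values, and the `pulling` / `gotSync` flags are all made up here.

```javascript
// Hypothetical synchronous source: it calls sink.next() before pull() returns.
class CountingSource {
  constructor(limit) {
    this.limit = limit;
    this.count = 0;
    this.sink = null;
  }
  pull() {
    if (this.count < this.limit) this.sink.next('continue', this.count++);
    else this.sink.next('end', null);
  }
}

class LoopingSink {
  constructor(source) {
    this.source = source;
    source.sink = this;
    this.total = 0;
    this.ended = false;
    this.pulling = false; // true while drain() is running its loop
    this.gotSync = false; // set when next() arrives during a pull() call
  }
  start() { this.drain(); }
  drain() {
    this.pulling = true;
    do {
      this.gotSync = false;
      this.source.pull();
      // If next() ran before pull() returned, the source was sync: loop
      // instead of recursing, so the stack depth stays constant.
    } while (this.gotSync && !this.ended);
    this.pulling = false;
  }
  next(status, value) {
    if (status === 'end') {
      this.ended = true;
      return;
    }
    this.total += 1;
    if (this.pulling) {
      this.gotSync = true; // let the loop in drain() issue the next pull
    } else {
      this.drain(); // async delivery: pull() already returned, restart the loop
    }
  }
}

const source = new CountingSource(100000);
const sink = new LoopingSink(source);
sink.start();
```

With a source that answers asynchronously, the loop exits after one pull and next() restarts it later, so the same sink handles both cases.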

Progress 11/12/2019 - December

The last update was fairly large (23/07/2019 - July, #40), but this one is much smaller.

Notably I'm no longer employed and paid to continue this kind of work around Node and I don't really do this as my hobby or have much default motivation to continue.

I did merge a pull request that adds WritableSource and ReadableSink for streams3 interoperability: 57a78d1

I am supposed to present this initiative's status again at the Montreal collab summit. I am not quite sure what will come out of that and it may end up being more of a post-mortem.


Progress 30/3/2018 (2018 week 13)

The current status is:

Next up:

  • Updated profiles
  • Proper C++ error handling / passing / translation
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 13/3/2018 (2018 week 11): #9

potential future move to nodejs org

Not really super into moving stuff into an official repo atm but once some more status has been completed (see #2), we may want to so as to get some more traction, as forming a "team" may help there.

idk if that would mean actually moving the repo or just its contents

Progress 15/1/2018 (2018 week 3)

The current status is:

  • Goals: almost ironed out, see #1
  • initial js-only fs source/sinks: working in this repo
  • js-only zlib transform: almost working, see aec09c2
  • buffering api: no work yet
  • C++ api: currently out of date, see nodejs/node#16414

I'm thinking of taking a swing at doing a C++ fs after the js-only zlib transform is working.

Proposal: Status Code Error Enums

For errors, any negative value should be considered an error, with multiple negative values permitted. That would obviously mean not using an enum for status, and instead using constants... e.g.

#define BOB_ERR_{whatever} -{some integer}
#define BOB_ERR_END 0
#define BOB_ERR_CONTINUE 1
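
On the JS side, the "any negative value is an error" rule could be checked roughly as below. The two negative codes and the `classify` helper are hypothetical, chosen only to show that several distinct error values fit under one check; the END and CONTINUE values mirror the constants above.

```javascript
// Hypothetical JS mirror of the proposed constants.
const Status = Object.freeze({
  BOB_ERR_TIMEOUT: -2, // made-up error code for illustration
  BOB_ERR_IO: -1,      // made-up error code for illustration
  BOB_ERR_END: 0,
  BOB_ERR_CONTINUE: 1,
});

// Any negative status is treated as an error, whichever one it is.
function classify(status) {
  if (status < 0) return 'error';
  if (status === Status.BOB_ERR_END) return 'end';
  return 'continue';
}
```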

Proposal: add `offset` to `pull()`

From the collab summit, after talking with @jasnell extensively about QUIC, I think it would be useful to have pull support an offset to ask the source to read from (the source may choose to still return whatever data it chooses).

This should allow for ACKs in a network implementation (i.e. if you need to ACK, you just request the last / desired offset again).

It also would support a level of content-addressability. A sink or downstream intermediate could be the one to inform a file source of where to read from.

Relatedly, this may also make disk-based sources require less state...
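
A small sketch of what an offset-aware pull might look like, with a sink that re-requests an offset to simulate a retransmit. The `pull(offset, size)` and `next(status, chunk, bytes, offset)` shapes, the class names, and the "drop one chunk" behaviour are assumptions for illustration only.

```javascript
// Hypothetical in-memory source: it keeps no read position of its own,
// the sink tells it where to read from on every pull.
class BufferSource {
  constructor(data) {
    this.data = data;
    this.sink = null;
  }
  pull(offset, size) {
    if (offset >= this.data.length) {
      return this.sink.next('end', null, 0, offset);
    }
    const chunk = this.data.subarray(offset, offset + size);
    this.sink.next('continue', chunk, chunk.length, offset);
  }
}

class RetryingSink {
  constructor(source, chunkSize) {
    this.source = source;
    source.sink = this;
    this.chunkSize = chunkSize;
    this.parts = [];
    this.requests = []; // every offset asked for, in order
    this.dropped = false;
  }
  start() { this.request(0); }
  request(offset) {
    this.requests.push(offset);
    this.source.pull(offset, this.chunkSize);
  }
  next(status, chunk, bytes, offset) {
    if (status === 'end') return;
    // Simulate one lost chunk: discard it and ask for the same offset again.
    if (offset === 4 && !this.dropped) {
      this.dropped = true;
      return this.request(offset);
    }
    this.parts.push(Buffer.from(chunk));
    this.request(offset + bytes);
  }
}

const source = new BufferSource(Buffer.from('hello bob!'));
const sink = new RetryingSink(source, 4);
sink.start();
const output = Buffer.concat(sink.parts).toString();
```

Note that the source holds no cursor state here, which is the "less state" point made above for disk-based sources.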

Progress 13/3/2018 (2018 week 11)

The current status is:

Next up:

  • C++ file source or sink endpoint
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 16/2/2018 (2018 week 7): #5

Blog about this?

I probably have enough material, could be good for visibility?

Progress 15/11/2018 - November

The current status is:

Completed sink / source / "duplex" npm modules:

Next up:

  • Address open questions: #17
  • Write a Streams3 adaptor: #18

Previous status - Progress 5/10/2018 - October: #13


Open Questions presented to N+JSI collab summit

From my collab summit presentation (#16), here are the open questions I presented at the end:

  • Is this API likeable?
    • It seemed to be well received at the collab summit
  • Do we need a buffer pooling helper?
    • (Probably.)
  • Is it too API-pattern focused?
  • How to enforce single active request?
  • Where to start implementing in core?
  • Libuv pull streams?
    • The libuv folks I chatted with at N+JSI seemed open to it.

Non-single-logical-flow (multiple pulls)

Moving out from #23 (comment)

It seems that newer network protocols like QUIC want multiple chunks of data to be in flight at once (and also need to consider re-sending).

This probably violates these two core design ideas:

  • One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
  • In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.

It may also unleash zalgo? lol.

Anyways, I think it is possible to still keep things simple and "pretend" that things are multiplexed, by doing slightly more waiting at the network sink end. I'm not really sure that perf would be considerably impacted in most cases?

Edit: See #30 (comment) for updated thoughts.

Progress 23/07/2019 - July

Ok so, a lot of stuff has happened since the last update (15/11/2018 - November).

An unfortunately incomplete list of actions since then:

  • Re-presented at the Berlin June 2019 collab summit.
  • Introduced Stream() composition #31
  • start(cb) required for Sinks #32
  • Automated tests! #38
  • AssertionSink & AssertionSource
  • A Verification passthrough for API enforcement #34
  • Large progress on Streams3 Readable/Writable adaptors #35
  • This repo is now an npm module under the name bob-streams. #33
  • Made crc-transform as proof for an internal live coding demo
    • I don't think I can make the recording public but maybe I can do a public / extended version
  • Many other minor fixes
  • Additional prototyping of interaction with async iterators
  • Discussion about adding an offset to pull() #23
  • Discussion about multi-pull flow #30

Reconsider `bind{Source|Sink}()`

I'm really not convinced the binding apis are very good.

Might be better to have a helper function like streamline(a, b, c) - ideally so that you can do streamline(streamline(a, b, c), d) too.
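
One way such a helper could work, as a sketch: bind each neighbouring pair, then return a composite that forwards to the first and last stage so the result can itself be passed to streamline() again. The stage classes, the `start()` method on the sink, and the string status values are hypothetical; only the `bindSource` / `bindSink` names come from the existing API under discussion.

```javascript
// Hypothetical helper: binds stages left to right and returns a composite.
function streamline(...stages) {
  for (let i = 0; i < stages.length - 1; i++) {
    stages[i].bindSink(stages[i + 1]);
    stages[i + 1].bindSource(stages[i]);
  }
  const first = stages[0];
  const last = stages[stages.length - 1];
  // Forward upstream-facing calls to the first stage and downstream-facing
  // calls to the last one, so a composite nests like any single stage.
  return {
    bindSource: (source) => first.bindSource(source),
    bindSink: (sink) => last.bindSink(sink),
    next: (...args) => first.next(...args),
    pull: (...args) => last.pull(...args),
    start: (...args) => last.start(...args),
  };
}

class ArraySource {
  constructor(items) { this.items = items.slice(); this.sink = null; }
  bindSink(sink) { this.sink = sink; }
  pull() {
    if (this.items.length > 0) this.sink.next('continue', this.items.shift());
    else this.sink.next('end', null);
  }
}

class MapTransform {
  constructor(fn) { this.fn = fn; this.source = null; this.sink = null; }
  bindSource(source) { this.source = source; }
  bindSink(sink) { this.sink = sink; }
  pull() { this.source.pull(); }
  next(status, value) {
    this.sink.next(status, status === 'continue' ? this.fn(value) : value);
  }
}

class ArraySink {
  constructor() { this.items = []; this.source = null; }
  bindSource(source) { this.source = source; }
  start() { this.source.pull(); }
  next(status, value) {
    if (status === 'end') return;
    this.items.push(value);
    this.source.pull();
  }
}

const sink = new ArraySink();
const inner = streamline(
  new ArraySource([1, 2, 3]),
  new MapTransform((x) => x * 2),
  new MapTransform((x) => x + 1)
);
streamline(inner, sink).start();
```

This sketch recurses on every synchronous chunk, so a real helper would still need some protection against deep stacks for large sync streams.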

Gonna try to work on a PoC module this week.

What are the differences of this approach in comparison to pull-stream?

I'm excited to see this development as I am a heavy user of the pull-stream ecosystem for ETL processing. This approach feels and reads extremely similar, but with obvious gains to be made by making it natively supported by node. Do these two efforts align (or differ) in any way? Is bob expected to support existing pull-stream patterns so as to benefit the variety of libraries already available on npm? Could it? 😄

For reference: https://github.com/pull-stream/pull-stream

Organize a meeting

I think it would be useful to have a voice meeting soon to discuss various unresolved conversations.

Things that should be talked about:

  • Project Approach evaluation
  • Goals evaluation
  • API conventionality (how to enforce this?)
  • Naming e.g. #22, "sinks"
  • Multi-pull flow #30
  • C++ binding detection #29

ccing everyone who has shown interest so far:
@jasnell, @Raynos, @mcollina, @mafintosh, @benjamingr
