
Comments (5)

H-Plus-Time commented on June 16, 2024

Ah, drat. I do wonder how useful (and hopefully unobtrusive) running some of the test suite through Playwright would be. Happy to contribute if it's desirable (with potential side benefits for the web examples).

from parquet-wasm.

kylebarron commented on June 16, 2024

Another thing I noticed was that making lots of different requests is quite annoying and you get latency between chunks.

One crazy idea I just had is whether it would be possible to mix the stream and async approaches... e.g. first make a ranged request for the footer metadata at the end of the file, then do a full-file streaming request. In theory, you could make the streaming read work because you know the byte ranges of every chunk in the file. But that's probably totally incompatible with the existing Parquet Rust APIs.
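A rough sketch of that hybrid idea, under the assumption that the footer metadata gives you the byte length of each row group: consume the single full-file stream and carve it into per-row-group buffers at the known boundaries. (`carveRowGroups` and the lengths-array shape are hypothetical, not parquet-wasm API.)

```javascript
// Split an incoming sequence of Uint8Array chunks (e.g. from one streaming
// full-file fetch) into buffers aligned with known row-group byte lengths.
// Assumes the row groups are contiguous and in file order.
async function* carveRowGroups(byteChunks, rowGroupLengths) {
  let pending = [];    // chunks not yet assigned to a row group
  let pendingLen = 0;
  let next = 0;        // index of the next row group to emit
  for await (const chunk of byteChunks) {
    pending.push(chunk);
    pendingLen += chunk.length;
    // Emit every row group whose bytes we now hold in full.
    while (next < rowGroupLengths.length && pendingLen >= rowGroupLengths[next]) {
      const want = rowGroupLengths[next];
      const buf = new Uint8Array(pendingLen);
      let off = 0;
      for (const c of pending) { buf.set(c, off); off += c.length; }
      yield buf.slice(0, want);      // one complete row group's bytes
      const rest = buf.slice(want);  // carry any leftover bytes forward
      pending = rest.length ? [rest] : [];
      pendingLen = rest.length;
      next++;
    }
  }
}
```

Each yielded buffer could then be handed to the decoder, so decoding overlaps with the ongoing download instead of waiting on per-chunk round trips.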

H-Plus-Time commented on June 16, 2024

Yeah, I noticed that too - if I'm reading parquet2's source correctly, it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed-size-list fields appear to get their own requests, and presumably so do the dictionary blocks). Toning that down to just one request per row group might be worthwhile. One other thing I noticed in that observable: UScounties.parquet ends up being served over HTTP/1.1, which is probably worsening the problem (HTTP/2 would still be subject to the inter-row-group latency).

kylebarron commented on June 16, 2024

it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed-size-list fields appear to get their own requests, and presumably so do the dictionary blocks).

Oh yikes! I didn't notice that before. Indeed, looking at that observable example, I see 58 total requests to the parquet file! I'm guessing the first two are for the metadata; with 7 row groups in the file, that works out to 8 requests per row group, for a file with 6 data columns plus the geometry column.
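The request arithmetic above, spelled out (numbers taken from the observable example; 8 per-row-group requests is one more than the 7 columns, which is plausibly a dictionary block given the earlier comment, though that's a guess):

```javascript
// Request count seen in devtools for UScounties.parquet
const metadataRequests = 2;     // footer length + footer bytes, presumably
const rowGroups = 7;
const requestsPerRowGroup = 8;  // 6 data columns + geometry + 1 extra
const total = metadataRequests + rowGroups * requestsPerRowGroup;  // 58
```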

Presumably parquet2 was designed for a cloud environment where latency is assumed to be quite cheap.

Toning that down to just one request per row group might be worthwhile.

Yeah that seems ideal.

H-Plus-Time commented on June 16, 2024

Update on the http1.1 vs http2 point, and cloudflare's range requests:

So it looks like HTTP/2 does help a fair bit (about 9±2 seconds vs 20±2 seconds), but it's still 3-4x slower than one big request (~2.2±0.5 seconds on a 50 Mb/s link, more or less full saturation).

Those requests are also serial at the column level (in addition to the row-group-serial behaviour we expect, given this is a pull stream). Calling readRowGroupAsync a bunch of times without awaiting (i.e. collecting an array of Promises) gets you row-group-parallel, column-serial behaviour, which might improve things a bit (wrap that in an async generator and loop through your array of promises, awaiting each in sequence, and you have a rough approximation of a push-style stream).
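The async-generator trick described above can be sketched like this: dispatch every row group's read immediately so the requests overlap, then await the promises in order so consumers still see row groups in sequence. (`readRowGroupAsync`'s exact signature here is a stand-in for the real parquet-wasm call.)

```javascript
// Kick off all row-group reads at once, yield results in file order.
async function* rowGroupStream(readRowGroupAsync, numRowGroups) {
  // Dispatch all requests immediately - no await here, so they overlap.
  const inflight = [];
  for (let i = 0; i < numRowGroups; i++) {
    inflight.push(readRowGroupAsync(i));
  }
  // Await in order; later row groups keep downloading while we wait.
  for (const p of inflight) {
    yield await p;
  }
}
```

Consumers just `for await` over the generator and get a rough push-style stream: row groups arrive in order, but the network work is parallel.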

I think, given the intention is there to do column-selective reads, it would probably be worthwhile seeing whether we could rework things to dispatch a row group's column requests concurrently, or coalesce them into a single multi-range request (which requires a preflight for CORS requests, but likely just one per file).
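One way to coalesce a row group's per-column requests, as suggested above: merge byte ranges that are adjacent, or separated by a gap small enough to be worth over-fetching, into fewer ranges. The gap threshold here is an assumption to be tuned against real latency/bandwidth numbers.

```javascript
// Merge [start, end) byte ranges whose gaps are <= maxGap bytes.
function coalesceRanges(ranges, maxGap = 8192) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const merged = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && r.start - last.end <= maxGap) {
      last.end = Math.max(last.end, r.end);  // absorb into previous range
    } else {
      merged.push({ start: r.start, end: r.end });
    }
  }
  return merged;
}

// The merged ranges could then go out as one multi-range header
// (server support permitting), e.g. `Range: bytes=0-499,1000-1499`:
function toRangeHeader(merged) {
  return "bytes=" + merged.map((r) => `${r.start}-${r.end - 1}`).join(",");
}
```

Note that multi-range responses come back as `multipart/byteranges`, which the client then has to parse, and not every server (or CDN) supports them, so falling back to one coalesced request per contiguous run may be the pragmatic option.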

Cloudflare R2's range requests also appear to be inordinately slow in several ways:

  1. Large slices (say, 20% of the file) toward the end of the file frequently took as long as, or longer than, the full request :-/
  2. ~300ms server-initiated waits on every 4th request.
  3. Glacial transfer rates on very small slices (300ms for ~2kB, so around 6kB/s).

Unusual, given Cloudflare's usually stellar perf - I'm inclined to believe the situation hasn't changed since 2021 (see "The Impossibility of Perfectly Caching HTTP Range Requests" for an interesting write-up and comparison of the major CDN players' approaches), and that they're rounding range requests up to the entire file.

Doing the same requests in parallel (all columns, all row groups) brings the total time down to that of the full-size request (because each request hits a different Cloudflare CDN worker), though the lack of ordering means the first large column is usually the last to finish (presumably something can be done with the fetch priority flag).
