
Comments (5)

H-Plus-Time commented on June 16, 2024

Ah, drat. I do wonder how useful (and hopefully unobtrusive) running some of the test suite through Playwright would be. Happy to contribute if it's desirable (with potential side benefits for the web examples).

from parquet-wasm.

kylebarron commented on June 16, 2024

Another thing I noticed was that making lots of different requests is quite annoying and you get latency between chunks.

One crazy idea I just had is whether it would be possible to mix the stream and async approaches... e.g. first make a ranged request for the footer metadata at the end of the file, then do a full-file streaming request. In theory, you could make the streaming read work because you know the byte ranges of every chunk in the file. But that's probably totally incompatible with the existing Parquet Rust APIs.
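A rough sketch of that hybrid idea, under the assumption that the footer metadata gives you the byte length of each row group: consume the single full-file stream and carve it into per-row-group buffers at the known boundaries. (`carveRowGroups` and the lengths-array shape are hypothetical, not parquet-wasm API.)

```javascript
// Split an incoming sequence of Uint8Array chunks (e.g. from one streaming
// full-file fetch) into buffers aligned with known row-group byte lengths.
// Assumes the row groups are contiguous and in file order.
async function* carveRowGroups(byteChunks, rowGroupLengths) {
  let pending = [];    // chunks not yet assigned to a row group
  let pendingLen = 0;
  let next = 0;        // index of the next row group to emit
  for await (const chunk of byteChunks) {
    pending.push(chunk);
    pendingLen += chunk.length;
    // Emit every row group whose bytes we now hold in full.
    while (next < rowGroupLengths.length && pendingLen >= rowGroupLengths[next]) {
      const want = rowGroupLengths[next];
      const buf = new Uint8Array(pendingLen);
      let off = 0;
      for (const c of pending) { buf.set(c, off); off += c.length; }
      yield buf.slice(0, want);      // one complete row group's bytes
      const rest = buf.slice(want);  // carry any leftover bytes forward
      pending = rest.length ? [rest] : [];
      pendingLen = rest.length;
      next++;
    }
  }
}
```

Each yielded buffer could then be handed to the decoder, so decoding overlaps with the ongoing download instead of waiting on per-chunk round trips.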

H-Plus-Time commented on June 16, 2024

Yeah, I noticed that too - if I'm reading parquet2's source correctly, it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed-size-list fields appear to get their own requests, and presumably so do the dictionary blocks). Toning that down to just one request per row group might be worthwhile. One other thing I noticed in that observable: UScounties.parquet ends up being served over HTTP/1.1, which is probably worsening the problem (HTTP/2 would still be subject to the inter-row-group latency).

kylebarron commented on June 16, 2024

it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed-size-list fields appear to get their own requests, and presumably so do the dictionary blocks).

Oh yikes! I didn't notice that before. Indeed, looking at that observable example, I see 58 total requests to the parquet file! I'm guessing the first two are for the metadata; with 7 row groups in the file, that works out to 8 requests per row group, for a file with 6 data columns plus the geometry column.
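The request arithmetic above, spelled out (numbers taken from the observable example; 8 per-row-group requests is one more than the 7 columns, which is plausibly a dictionary block given the earlier comment, though that's a guess):

```javascript
// Request count seen in devtools for UScounties.parquet
const metadataRequests = 2;     // footer length + footer bytes, presumably
const rowGroups = 7;
const requestsPerRowGroup = 8;  // 6 data columns + geometry + 1 extra
const total = metadataRequests + rowGroups * requestsPerRowGroup;  // 58
```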

Presumably parquet2 was designed for a cloud environment where latency is assumed to be quite cheap.

Toning that down to just one request per row group might be worthwhile.

Yeah that seems ideal.

H-Plus-Time commented on June 16, 2024

Update on the http1.1 vs http2 point, and cloudflare's range requests:

So it looks like HTTP/2 does help a fair bit (about 9±2 seconds vs 20±2 seconds), but it's still 3-4x slower than one big request (~2.2±0.5 seconds on a 50 Mb/s link, more or less full saturation).

Those requests are also serial at the column level (in addition to the row-group-serial behaviour we expect, given this is a pull stream). Calling readRowGroupAsync a bunch of times without awaiting (i.e. collecting an array of Promises) gets you row-group-parallel, column-serial behaviour, which might improve things a bit (wrap that in an async generator and loop through your array of promises, awaiting each in sequence, and you have a rough approximation of a push-style stream).
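The async-generator trick described above can be sketched like this: dispatch every row group's read immediately so the requests overlap, then await the promises in order so consumers still see row groups in sequence. (`readRowGroupAsync`'s exact signature here is a stand-in for the real parquet-wasm call.)

```javascript
// Kick off all row-group reads at once, yield results in file order.
async function* rowGroupStream(readRowGroupAsync, numRowGroups) {
  // Dispatch all requests immediately - no await here, so they overlap.
  const inflight = [];
  for (let i = 0; i < numRowGroups; i++) {
    inflight.push(readRowGroupAsync(i));
  }
  // Await in order; later row groups keep downloading while we wait.
  for (const p of inflight) {
    yield await p;
  }
}
```

Consumers just `for await` over the generator and get a rough push-style stream: row groups arrive in order, but the network work is parallel.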

I think, given the intention is there to do column-selective reads, it would probably be worthwhile seeing whether we could rework things to dispatch a row group's column requests concurrently, or coalesce them into a single multi-range request (which requires a preflight for CORS requests, but likely just one per file).
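One way to coalesce a row group's per-column requests, as suggested above: merge byte ranges that are adjacent, or separated by a gap small enough to be worth over-fetching, into fewer ranges. The gap threshold here is an assumption to be tuned against real latency/bandwidth numbers.

```javascript
// Merge [start, end) byte ranges whose gaps are <= maxGap bytes.
function coalesceRanges(ranges, maxGap = 8192) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const merged = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && r.start - last.end <= maxGap) {
      last.end = Math.max(last.end, r.end);  // absorb into previous range
    } else {
      merged.push({ start: r.start, end: r.end });
    }
  }
  return merged;
}

// The merged ranges could then go out as one multi-range header
// (server support permitting), e.g. `Range: bytes=0-499,1000-1499`:
function toRangeHeader(merged) {
  return "bytes=" + merged.map((r) => `${r.start}-${r.end - 1}`).join(",");
}
```

Note that multi-range responses come back as `multipart/byteranges`, which the client then has to parse, and not every server (or CDN) supports them, so falling back to one coalesced request per contiguous run may be the pragmatic option.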

Cloudflare R2's range requests also appear to be inordinately slow in several ways:

  1. Large slices (say, 20% of the file) toward the end of the file frequently took as long as, or longer than, the full request :-/
  2. ~300ms server-initiated waits on every 4th request.
  3. Glacial transfer rates on very small slices (300ms for ~2kB, so around 6kB/s).

Unusual, given Cloudflare's usually stellar perf - I'm inclined to believe the situation hasn't changed since 2021 (see "The Impossibility of Perfectly Caching HTTP Range Requests" for an interesting write-up and comparison of the major CDN players' approaches), and that they're rounding range requests up to the entire file.

Doing the same requests in parallel (all columns, all row groups) brings the total time down to that of the full-size request (because each request hits a different Cloudflare CDN worker), though the lack of ordering means the first large column is usually the last to finish (presumably something can be done with the fetch priority flag).
