Comments (5)
Ah, drat. I do wonder how useful (and hopefully unobtrusive) running some of the test suite through Playwright would be. Happy to contribute if it's desirable (potentially with side benefits for the web examples).
from parquet-wasm.
Another thing I noticed was that making lots of different requests is quite annoying and you get latency between chunks.
One crazy idea I just had is whether it would be possible to mix the stream and async approaches... e.g. first do a range request for the end of the file to get the metadata, but then do a full-file streaming request. In theory, you could make the streaming read work because you know the byte ranges of every chunk in the file. But that's probably totally incompatible with the existing Parquet Rust APIs.
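To make the idea concrete, here is a minimal sketch of the slicing half only, assuming the column chunk byte ranges have already been read from the footer metadata. `RangeSlicer` and `ByteRange` are hypothetical names, not part of parquet-wasm or the Rust crates, and nothing here touches the existing readers.

```typescript
// A byte range within the file; `end` is exclusive.
interface ByteRange { start: number; end: number; }

// Hypothetical helper: consumes the chunks of one full-file response, in order,
// and hands back the bytes of each known range as soon as it has fully arrived.
class RangeSlicer {
  private offset = 0;                        // file offset of the next incoming byte
  private next = 0;                          // index of the range being filled
  private buffer: Uint8Array | null = null;  // bytes collected for that range
  private filled = 0;

  // Assumption: `ranges` is sorted by `start` and the ranges do not overlap.
  constructor(private readonly ranges: ByteRange[]) {}

  push(chunk: Uint8Array): Uint8Array[] {
    const done: Uint8Array[] = [];
    let pos = 0; // read position inside `chunk`
    while (pos < chunk.length && this.next < this.ranges.length) {
      const range = this.ranges[this.next];
      const abs = this.offset + pos;
      if (abs < range.start) {
        // Bytes between ranges (or before the first one) are skipped.
        pos += Math.min(range.start - abs, chunk.length - pos);
        continue;
      }
      let buf = this.buffer;
      if (buf === null) {
        buf = new Uint8Array(range.end - range.start);
        this.buffer = buf;
        this.filled = 0;
      }
      const take = Math.min(range.end - abs, chunk.length - pos);
      buf.set(chunk.subarray(pos, pos + take), this.filled);
      this.filled += take;
      pos += take;
      if (this.filled === buf.length) {
        done.push(buf);
        this.buffer = null;
        this.next++;
      }
    }
    this.offset += chunk.length;
    return done;
  }
}
```

In a browser this would presumably sit inside a loop over `response.body.getReader()`, calling `push` on every chunk and decoding each completed range as it comes back.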
Yeah, I noticed that too - if I'm reading parquet2's source correctly, it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed-size list fields appear to get their own requests, and presumably so do the dictionary blocks).
Oh yikes! I didn't notice that before. Indeed, looking at that observable example, I see 58 total requests to the parquet file! I'm guessing the first two are for the metadata, and there are 7 row groups in the file, so that adds up to 8 requests per row group, where there are 6 columns plus the geometry column.
Presumably parquet2 was designed for a cloud environment where latency is assumed to be quite cheap.
Toning that down to just one request per row group might be worthwhile.
Yeah that seems ideal.
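One way to get down to a single request per row group is to merge the column chunk byte ranges before fetching. The sketch below is an assumption about how that could look, not what parquet2 does: `coalesceRanges` and `maxGap` are hypothetical names, and the ranges are assumed to come from the file metadata.

```typescript
// A byte range within the file; `end` is exclusive.
interface ByteRange { start: number; end: number; }

// Merges ranges whose gap is at most `maxGap` bytes, so that nearby column
// chunks are fetched together at the cost of downloading the bytes in between.
function coalesceRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const merged: ByteRange[] = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && r.start - last.end <= maxGap) {
      // Close enough to the previous range: extend it.
      last.end = Math.max(last.end, r.end);
    } else {
      // Too far away: start a new request.
      merged.push({ start: r.start, end: r.end });
    }
  }
  return merged;
}
```

Passing `Infinity` as `maxGap` collapses everything into one covering range, which for the column chunks of a single row group gives the one-request-per-row-group behaviour. A smaller `maxGap` keeps column-selective reads from downloading large unwanted columns.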
Update on the HTTP/1.1 vs HTTP/2 point, and Cloudflare's range requests:
So it looks like HTTP/2 does help a fair bit (about 9±2 seconds vs 20±2 seconds), but it's still 3x slower than one big request (~2.2±0.5 seconds on a 50Mb/s link, more or less full saturation).
Those requests are also serial at the column level (in addition to the row-group-serialized behaviour we expect, given this is a pull stream). Calling readRowGroupAsync a bunch of times without awaiting (i.e. you get an array of Promises) gets you row-group parallel, column serial, which might improve things a bit. Wrap that in an async generator that loops through your array of promises, awaiting each in sequence, and you have a rough approximation of a push-style stream.
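A minimal sketch of that pattern, with the reader passed in as a callback so nothing is assumed about the exact signature of readRowGroupAsync. `inOrder` and `read` are hypothetical names.

```typescript
// Starts every read immediately, then yields the results in index order.
// `read` is a caller-supplied function; in this thread's setting it would wrap
// whatever call fetches row group `index`.
async function* inOrder<T>(
  count: number,
  read: (index: number) => Promise<T>,
): AsyncGenerator<T> {
  const pending: Promise<T>[] = [];
  for (let i = 0; i < count; i++) {
    pending.push(read(i)); // not awaited, so all requests are in flight together
  }
  for (const p of pending) {
    yield await p; // consumers still see row groups in file order
  }
}
```

Consumers would iterate it with `for await (const rowGroup of inOrder(n, read))`. One caveat of this simple version: if a later read rejects before the loop reaches it, the runtime may report an unhandled rejection, so real code would want to attach error handling up front.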
I think, given the intention is to support column-selective reads, it would probably be worthwhile seeing if we could rework things to dispatch a row group's column requests concurrently, or coalesce them into a single multi-range request (one Range header listing several byte ranges; this requires a preflight for CORS requests, but likely that'd just be one per file).
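Both options can be sketched in a few lines. The names below (`multiRangeHeader`, `fetchColumnsConcurrently`, `fetchRange`) are hypothetical and the byte ranges are assumed to come from the file metadata; this is an illustration, not how parquet-wasm currently issues requests.

```typescript
// A byte range within the file; `end` is exclusive.
interface ByteRange { start: number; end: number; }

// Builds a Range header value listing several byte ranges at once.
// HTTP byte ranges are inclusive on both ends, hence `end - 1`.
// Assumes at least one non-empty range.
function multiRangeHeader(ranges: ByteRange[]): string {
  return "bytes=" + ranges.map((r) => `${r.start}-${r.end - 1}`).join(",");
}

// Starts every column request of a row group at once and waits for all of them.
// `fetchRange` is a caller-supplied function that performs one range request.
function fetchColumnsConcurrently(
  ranges: ByteRange[],
  fetchRange: (range: ByteRange) => Promise<Uint8Array>,
): Promise<Uint8Array[]> {
  return Promise.all(ranges.map((r) => fetchRange(r)));
}
```

A server that honours a multi-range request answers with a 206 response whose body is `multipart/byteranges`, which the client then has to split back into parts. Servers are also allowed to ignore the Range header and return the whole file with a 200, so a fallback path is needed either way.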
Cloudflare R2's range requests also appear to be inordinately slow in several ways:
- large (say, 20% of the file) slices toward the end of the file frequently took as long or longer than the full request :-/.
- ~300ms server-initiated waits every 4th request.
- Glacial transmission rates on very small slices (300ms for ~2kB - so around 6kB/s)
Unusual, given Cloudflare's usually stellar perf - I'm inclined to believe the situation hasn't changed since 2021 (see The Impossibility of Perfectly Caching HTTP Range Requests for an interesting write-up and comparison of the major CDN players' approaches), and they're rounding to the entire file.
Doing the same requests in parallel (all columns, all row groups) brings the total time down to the full-size request's (because each and every request hits a different Cloudflare CDN worker), though the lack of ordering means the first large column is usually the last to finish (presumably something can be done with the fetch priority flag).
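As a rough sketch of the priority idea, assuming the largest column chunk is the one to favour: the `priority` option of `fetch` accepts `"high"`, `"low"` or `"auto"`, and it is only a hint that browsers are free to ignore. `prioritizedRequests` and the types around it are hypothetical.

```typescript
// A byte range within the file; `end` is exclusive.
interface ByteRange { start: number; end: number; }

type Priority = "high" | "low" | "auto";

// The subset of fetch options this sketch fills in.
interface RangeRequestInit {
  headers: { Range: string };
  priority: Priority;
}

// Marks the largest range as high priority and everything else as low,
// so the big column is less likely to finish last.
function prioritizedRequests(ranges: ByteRange[]): RangeRequestInit[] {
  const largest = ranges.reduce((max, r) => Math.max(max, r.end - r.start), 0);
  return ranges.map((r): RangeRequestInit => ({
    headers: { Range: `bytes=${r.start}-${r.end - 1}` },
    priority: r.end - r.start === largest ? "high" : "low",
  }));
}
```

Each resulting object would be passed as the second argument to `fetch`. Whether this actually changes completion order against a CDN is untested here.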