kylebarron / parquet-wasm
Rust-based WebAssembly bindings to read and write Apache Parquet data
Home Page: https://kylebarron.dev/parquet-wasm/
License: Apache License 2.0
Can it not say bundler in the sidebar?
When you make a method:
#[wasm_bindgen]
pub fn version(self) -> i32 {
self.0.version()
}
the self consumes the instance, and you can't use it again. So if you try to call .version() twice from JS you see:
/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403
if (this.ptr == 0) throw new Error('Attempt to use a moved value');
^
Error: Attempt to use a moved value
at FileMetaData.version (/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403:34)
at evalmachine.<anonymous>:1:10
at Script.runInThisContext (node:vm:129:12)
at Object.runInThisContext (node:vm:305:38)
at run ([eval]:1020:15)
at onRunRequest ([eval]:864:18)
at onMessage ([eval]:828:13)
at process.emit (node:events:520:28)
at emit (node:internal/child_process:938:14)
at processTicksAndRejections (node:internal/process/task_queues:84:21)
Instead you can take a reference (?) to self:
#[wasm_bindgen]
pub fn version(&self) -> i32 {
self.0.version()
}
And now .version() doesn't consume the instance.
You might also want to consider using a mutable self here, instead of making a new instance every time.
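As a sketch of that suggestion (the names here are hypothetical, not the actual writer_properties.rs code), a setter that takes &mut self mutates the existing instance instead of consuming self and allocating a new builder on every call:

```rust
// Hypothetical builder illustrating `&mut self` mutation instead of
// consuming `self` and constructing a new instance on every call.
pub struct WriterPropsBuilder {
    compression: String,
}

impl WriterPropsBuilder {
    pub fn new() -> Self {
        WriterPropsBuilder {
            compression: "UNCOMPRESSED".to_string(),
        }
    }

    // `&mut self`: callable repeatedly from JS without moving the value.
    pub fn set_compression(&mut self, codec: &str) {
        self.compression = codec.to_string();
    }

    pub fn compression(&self) -> &str {
        &self.compression
    }
}
```

From JS, a `&mut self` method leaves the wrapper pointer valid, so repeated calls don't hit the "Attempt to use a moved value" error shown above.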
parquet-wasm/src/arrow1/writer_properties.rs
Lines 80 to 82 in 3cd4bac
For a while, there will probably be issues with the APIs, either in the wasm bindings, my bindings, or the underlying libraries. It will be necessary to debug these problems outside of the web environment, at least to the extent possible.
To that end I think it'll be very helpful to have a debug CLI, where essentially the exact same binding code is run, but locally instead of in wasm.
This means:
- Core read_parquet and write_parquet functions. They should take as input and output Rust slices and buffers, and return Rust errors, not JS errors. (Maybe read up on how method()?; works, which would make the code a lot cleaner.)
- A main.rs, which would be a CLI entry point to these four functions.

// lib.rs
#[cfg(feature = "arrow1")]
#[wasm_bindgen(js_name = readParquet1)]
pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
    // The core function would return a Rust Vec<u8> that is copied to a Uint8Array here
    match crate::arrow1::read_parquet(parquet_file) {
        Ok(buffer) => Ok(Uint8Array::from(buffer.as_slice())),
        Err(error) => Err(JsValue::from_str(format!("{}", error).as_str())),
    }
}
// main.rs
// CLI that wraps crate::arrow1::read_parquet and writes output to a local file
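A minimal sketch of the core of that main.rs (the crate-internal read_parquet call is the hypothetical function described above, stubbed here as a pass-through so the flow can run outside the crate; a fn main would parse std::env::args() and call run(&args[1], &args[2])):

```rust
// Sketch of a debug CLI's core (main.rs). Reads a local Parquet file,
// converts it (stubbed), and writes the result to a local output file.
use std::fs;
use std::io;

fn run(input_path: &str, output_path: &str) -> io::Result<()> {
    let parquet_bytes = fs::read(input_path)?;
    // Hypothetical core call, returning Arrow IPC bytes as a Rust Vec<u8>:
    // let ipc_bytes = crate::arrow1::read_parquet(&parquet_bytes)?;
    let ipc_bytes = parquet_bytes; // stub: pass-through for this sketch
    fs::write(output_path, ipc_bytes)
}
```

Because run works entirely in Rust types (paths, io::Result), a crash can be debugged with ordinary tooling (a debugger, dbg!, backtraces) instead of inside the wasm sandbox.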
Should probably also mention this in the readme, that lz4 works only with files created with pyarrow/parquet cpp v7+
Doesn't currently work because writeParquet requires a second argument:
cargo run --example parquet_read --features io_parquet,io_parquet_compression -- 1-partition-lz4.parquet
Clippy is "A collection of lints to catch common mistakes and improve your Rust code." Sounds perfect for me!
There are currently issues with wasm-bindgen until its next release. See rustwasm/wasm-bindgen#2774.
brew install llvm
export PATH="/usr/local/opt/llvm/bin/:$PATH"
export CC=/usr/local/opt/llvm/bin/clang
export AR=/usr/local/opt/llvm/bin/llvm-ar
Caused problems for arrow1 in #19. Might be good to remove the unsafe views until I understand the issue better and have a good test suite.
Might solve the bundling problems we've been facing in loaders.gl
from this issue: duckdb/duckdb-wasm#345 (comment)
Motivation: Parquet and Arrow are chunked formats. Therefore we shouldn't need to wait for the entire dataset to load/parse before getting some data back.
However I'm still not aware of a way to return an iterable or an async iterable from rust to js. To get around this, I think we can "drive" the iteration from JS. Essentially this:
import * as wasm from 'parquet-wasm';
const arr = new Uint8Array(); // Parquet bytes
// name readSchema to align with pyarrow api?
const parquetFile = new wasm.ParquetFile(arr);
const schemaIPC = parquetFile.schema();
for (let i = 0; i < parquetFile.numRowGroups; i++) {
const recordBatchIPC = parquetFile.readRowGroup(i);
}
And ideally we'll have an async version of this too
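On the Rust side, the shape that makes JS-driven iteration possible is roughly a struct that owns the Parquet bytes and exposes per-row-group reads. A sketch with placeholder internals (the footer parsing and IPC encoding are stubbed; the real binding would do both):

```rust
// Sketch of a ParquetFile-style struct that lets JS drive the iteration:
// it owns the file bytes and hands back one row group at a time.
pub struct ParquetFile {
    buffer: Vec<u8>,
    num_row_groups: usize,
}

impl ParquetFile {
    pub fn new(buffer: Vec<u8>) -> Self {
        // A real implementation would parse the Parquet footer here to count
        // row groups; this sketch hard-codes a placeholder value.
        ParquetFile { buffer, num_row_groups: 2 }
    }

    pub fn num_row_groups(&self) -> usize {
        self.num_row_groups
    }

    // In the real binding this would return Arrow IPC bytes for row group i.
    pub fn read_row_group(&self, i: usize) -> Result<Vec<u8>, String> {
        if i >= self.num_row_groups {
            return Err(format!("row group {} out of range", i));
        }
        Ok(self.buffer.clone()) // placeholder payload
    }
}
```

Each method borrows &self, so JS can call read_row_group repeatedly in a loop without the instance being moved.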
see https://docs.rs/arrow2/latest/arrow2/io/parquet/read/fn.read_columns_many.html#implementation
If you look at the implementation of read_columns, it basically just fetches the byte range described in the column chunk metadata.
The implication of this is that if we expose the metadata to JS, then we can delegate all the data fetching to JS pretty easily, including whether to merge range requests. Then pass the downloaded buffers to to_deserializer.
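As a sketch of the "merge range requests" idea (plain Rust, with hypothetical (offset, length) pairs standing in for the column chunk metadata), nearby byte ranges can be coalesced before issuing fetches:

```rust
// Coalesce (offset, length) byte ranges, e.g. from column chunk metadata,
// so that nearby ranges become a single HTTP range request. `max_gap` is
// the largest hole we're willing to over-fetch to avoid an extra request.
fn merge_ranges(mut ranges: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, len) in ranges {
        match merged.last_mut() {
            // Close enough to the previous range: extend it to cover this one.
            Some((m_start, m_len)) if start <= *m_start + *m_len + max_gap => {
                let end = (start + len).max(*m_start + *m_len);
                *m_len = end - *m_start;
            }
            _ => merged.push((start, len)),
        }
    }
    merged
}
```

Whether merging is worth it depends on the gap size versus request latency, which is exactly the kind of policy decision that's easier to tune on the JS side.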
Use an env var for "debug" and an env var for CI? Then have yarn build:prod and yarn build:dev?
Currently takes 5m45s to build all six bundles.
Looks like apache/arrow-rs#1414 should be included in 11.0.0.
I.e. each public binding should be able to change from
parquet-wasm/src/arrow1/wasm.rs
Lines 5 to 19 in bfd6943
pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
let buffer = crate::arrow1::reader::read_parquet(parquet_file)?;
Ok(crate::utils::copy_vec_to_uint8array(buffer))
}
Should be some way to make the ? work in this context? You might need to add an impl for converting from the ParquetError to a JS Error?
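That's exactly the pattern: ? calls From to convert the error type automatically. Sketched here with stand-in types, since the real JsValue conversion only runs in a JS host (in the actual bindings it would be impl From&lt;ParquetError&gt; for JsValue):

```rust
use std::fmt;

// Stand-in for parquet's ParquetError.
#[derive(Debug)]
struct ParquetError(String);

impl fmt::Display for ParquetError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Parquet error: {}", self.0)
    }
}

// Stand-in for a JS-side error value (JsValue in the real bindings).
#[derive(Debug, PartialEq)]
struct JsError(String);

impl From<ParquetError> for JsError {
    fn from(err: ParquetError) -> Self {
        JsError(format!("{}", err))
    }
}

fn inner() -> Result<Vec<u8>, ParquetError> {
    Err(ParquetError("Invalid Parquet file".to_string()))
}

// Because of the From impl above, `?` converts ParquetError -> JsError
// automatically, so no explicit match is needed in the binding.
fn binding() -> Result<Vec<u8>, JsError> {
    let buffer = inner()?;
    Ok(buffer)
}
```

With that impl in place, every public binding collapses to the two-line `let buffer = ...?; Ok(...)` form shown above.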
I.e. structs have a .free function in JS that you probably need to call when you're done with the memory.
Return an empty IPC table with only the schema but no rows?
Currently you just get an obscure:
string: 'Parquet error: Invalid Parquet file. Size is smaller than footer'
Would enable fetching _metadata from JS and then passing that buffer to read the entire FileMetaData.
For a while, before switching to arrow2/parquet2 (i.e. up until this commit), I was using the arrow and parquet crates from https://github.com/apache/arrow-rs. I repeatedly had an issue with some files, where the Parquet file would be readable in Rust, and then the generated Arrow IPC data wouldn't be readable in JS. This caused a ton of frustration, and switching to Arrow2/Parquet2 seemed to solve it, but I didn't know why.
With more debugging (crucial was logging the vector in Rust right before returning, and the Uint8Array from JS), I realized that the data wasn't being transferred back to JS correctly! E.g. when testing at this commit with the test file 1-partition-snappy.parquet, the arrays on the JS and Rust sides had the same length, but changed data.
It appears the entire issue was the reliance on unsafe { Uint8Array::view(&file) }. When I instead create a new Uint8Array and copy the file into the newly created Uint8Array, the array in JS and in Rust matches, and the file is read successfully by Arrow JS.
From the wasm-bindgen docs:
Views into WebAssembly memory are only valid so long as the backing buffer isn't resized in JS. Once this function is called any future calls to Box::new (or malloc of any form) may cause the returned value here to be invalidated. Use with caution!
Additionally the returned object can be safely mutated but the input slice isn't guaranteed to be mutable.
Finally, the returned object is disconnected from the input slice's lifetime, so there's no guarantee that the data is read at the right time.
To be honest, I'm not entirely sure where I was violating these principles (or maybe it was some internals from the arrow FileWriter). So it makes sense (at least for now) to remove the unsafe code and create a new Uint8Array buffer to solve this 🙂.
Note that creating another Uint8Array buffer would put more memory pressure on WebAssembly, which seems to run out of memory after using 1GB, but that's a problem for the future (ideally we'll be able to return a stream of record batches to JS).
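The copy-based fix, sketched as a plain-Rust stand-in for utils::copy_vec_to_uint8array (in the real bindings the destination is a js_sys::Uint8Array rather than a Vec&lt;u8&gt;, which only exists in a JS host):

```rust
// Instead of an unsafe view into wasm memory (which any later allocation may
// invalidate), allocate a fresh buffer and copy the bytes into it. The copy
// is independent of the source, so later reallocations can't corrupt it.
fn copy_to_owned(file: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(file.len());
    out.extend_from_slice(file);
    out
}
```

The cost is one extra full-file copy, which is the memory-pressure trade-off noted above.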
Hi,
I was looking for a JS library for reading Parquet files in the browser, and I found parquet-wasm.
My current setup is the following:
import { readParquet } from "parquet-wasm";
const parquetFile = files[0]; // File picked from input tag
const fileData = new Blob([parquetFile]);
const promise = new Promise(getBuffer(fileData));
promise
.then(function (data) {
const arrowStream = readParquet(data);
})
.catch(function (err) {
console.log("Error: ", err);
});
function getBuffer(fileData) {
return function (resolve) {
const reader = new FileReader();
reader.readAsArrayBuffer(fileData);
reader.onload = function () {
const arrayBuffer = reader.result;
const bytes = new Uint8Array(arrayBuffer);
resolve(bytes);
};
};
}
Unfortunately I get the following error: Error: TypeError: wasm.__wbindgen_add_to_stack_pointer is not a function.
Do you have any suggestion? Thanks for the help.
Any wasm-bindgen function annotated with /// before the function (seems to work when it's on the line before #[wasm_bindgen]) becomes a JSDoc comment when built 🙀 😍. So it will be nice to copy all documentation into the function docstrings.
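For illustration, a function documented this way (the attribute is omitted here so the sketch compiles standalone, and the body is a placeholder, not the real reader):

```rust
/// Reads a Parquet file into Arrow IPC format.
///
/// When a `///` doc comment like this one sits on the line before
/// `#[wasm_bindgen]`, wasm-bindgen emits it as a JSDoc block in the
/// generated .js and .d.ts output.
pub fn read_parquet(parquet_file: &[u8]) -> Vec<u8> {
    // Placeholder body for the sketch.
    parquet_file.to_vec()
}
```

So docs written once in Rust show up in editor tooltips for JS consumers for free.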
Would be great to add some examples, maybe a benchmark page?
js_sys::ArrayBuffer type? It might even do the ArrayBuffer/Uint8Array conversion on the JS bindings side.
Use the same underlying read/write functions but add a CLI through main.rs. This should be helpful when debugging why a file is crashing in the browser.
Looks like pyarrow v7 switched to the new lz4 compression enum value.
jorgecarleitao/parquet2#124 (comment)
Should probably also mention this in the readme, that lz4 works only with files created with pyarrow/parquet cpp v7+.
And should we change web to esm?
How much does code size improve for just a reader?