kylebarron / parquet-wasm
Rust-based WebAssembly bindings to read and write Apache Parquet data
Home Page: https://kylebarron.dev/parquet-wasm/
License: Apache License 2.0
Can it not say bundler in the sidebar?
When you make a method:
#[wasm_bindgen]
pub fn version(self) -> i32 {
self.0.version()
}
the self consumes the instance, and you can't use it again. So if you try to call .version() twice from JS you see:
/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403
if (this.ptr == 0) throw new Error('Attempt to use a moved value');
^
Error: Attempt to use a moved value
at FileMetaData.version (/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403:34)
at evalmachine.<anonymous>:1:10
at Script.runInThisContext (node:vm:129:12)
at Object.runInThisContext (node:vm:305:38)
at run ([eval]:1020:15)
at onRunRequest ([eval]:864:18)
at onMessage ([eval]:828:13)
at process.emit (node:events:520:28)
at emit (node:internal/child_process:938:14)
at processTicksAndRejections (node:internal/process/task_queues:84:21)
Instead you can take a reference (?) to self:
#[wasm_bindgen]
pub fn version(&self) -> i32 {
self.0.version()
}
And now .version() doesn't consume the instance.
You might also want to consider using a mutable self here, instead of making a new instance every time.
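As a sketch of that suggestion (the names here are hypothetical, not the actual writer_properties.rs code), a setter that takes &mut self mutates the existing instance instead of consuming self and allocating a new builder on every call:

```rust
// Hypothetical builder illustrating `&mut self` mutation instead of
// consuming `self` and constructing a new instance on every call.
pub struct WriterPropsBuilder {
    compression: String,
}

impl WriterPropsBuilder {
    pub fn new() -> Self {
        WriterPropsBuilder {
            compression: "UNCOMPRESSED".to_string(),
        }
    }

    // `&mut self`: callable repeatedly from JS without moving the value.
    pub fn set_compression(&mut self, codec: &str) {
        self.compression = codec.to_string();
    }

    pub fn compression(&self) -> &str {
        &self.compression
    }
}
```

From JS, a `&mut self` method leaves the wrapper pointer valid, so repeated calls don't hit the "Attempt to use a moved value" error shown above.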
parquet-wasm/src/arrow1/writer_properties.rs
Lines 80 to 82 in 3cd4bac
For a while, there will probably be issues with the APIs, either in the wasm bindings, my bindings, or the underlying libraries. It will be necessary to debug these problems outside of the web environment, at least to the extent possible.
To that end I think it'll be very helpful to have a debug CLI, where essentially the exact same binding code is run, but locally instead of in wasm.
This means:
- Core read_parquet and write_parquet functions. They should take as input and output Rust slices and buffers, and return Rust errors, not JS errors. (Maybe read up on how method()?; works, which would make the code a lot cleaner.)
- A main.rs, which would be a CLI entry point to these four functions.

// lib.rs
#[cfg(feature = "arrow1")]
#[wasm_bindgen(js_name = readParquet1)]
pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
    // The core function would return a Rust Vec<u8> that is copied to a Uint8Array here
    match crate::arrow1::read_parquet(parquet_file) {
        Ok(buffer) => Ok(Uint8Array::from(buffer.as_slice())),
        Err(error) => Err(JsValue::from_str(format!("{}", error).as_str())),
    }
}
// main.rs
// CLI that wraps crate::arrow1::read_parquet and writes output to a local file
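A minimal sketch of the core of that main.rs (the crate-internal read_parquet call is the hypothetical function described above, stubbed here as a pass-through so the flow can run outside the crate; a fn main would parse std::env::args() and call run(&args[1], &args[2])):

```rust
// Sketch of a debug CLI's core (main.rs). Reads a local Parquet file,
// converts it (stubbed), and writes the result to a local output file.
use std::fs;
use std::io;

fn run(input_path: &str, output_path: &str) -> io::Result<()> {
    let parquet_bytes = fs::read(input_path)?;
    // Hypothetical core call, returning Arrow IPC bytes as a Rust Vec<u8>:
    // let ipc_bytes = crate::arrow1::read_parquet(&parquet_bytes)?;
    let ipc_bytes = parquet_bytes; // stub: pass-through for this sketch
    fs::write(output_path, ipc_bytes)
}
```

Because run works entirely in Rust types (paths, io::Result), a crash can be debugged with ordinary tooling (a debugger, dbg!, backtraces) instead of inside the wasm sandbox.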
Should probably also mention this in the readme, that lz4 works only with files created with pyarrow/parquet cpp v7+
Doesn't currently work because writeParquet requires a second argument:
cargo run --example parquet_read --features io_parquet,io_parquet_compression -- 1-partition-lz4.parquet
Clippy is "A collection of lints to catch common mistakes and improve your Rust code." Sounds perfect for me!
There are currently issues with wasm-bindgen until its next release. See rustwasm/wasm-bindgen#2774.
brew install llvm
export PATH="/usr/local/opt/llvm/bin/:$PATH"
export CC=/usr/local/opt/llvm/bin/clang
export AR=/usr/local/opt/llvm/bin/llvm-ar
Caused problems for arrow1 in #19. Might be good to remove the unsafe views until I understand the issue better and have a good test suite.
Might solve the bundling problems we've been facing in loaders.gl
from this issue: duckdb/duckdb-wasm#345 (comment)
Motivation: Parquet and Arrow are chunked formats. Therefore we shouldn't need to wait for the entire dataset to load/parse before getting some data back.
However I'm still not aware of a way to return an iterable or an async iterable from rust to js. To get around this, I think we can "drive" the iteration from JS. Essentially this:
import * as wasm from 'parquet-wasm';
const arr = new Uint8Array(); // Parquet bytes
// name readSchema to align with pyarrow api?
const parquetFile = new wasm.ParquetFile(arr);
const schemaIPC = parquetFile.schema();
for (let i = 0; i < parquetFile.numRowGroups; i++) {
const recordBatchIPC = parquetFile.readRowGroup(i);
}
And ideally we'll have an async version of this too
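On the Rust side, the shape that makes JS-driven iteration possible is roughly a struct that owns the Parquet bytes and exposes per-row-group reads. A sketch with placeholder internals (the footer parsing and IPC encoding are stubbed; the real binding would do both):

```rust
// Sketch of a ParquetFile-style struct that lets JS drive the iteration:
// it owns the file bytes and hands back one row group at a time.
pub struct ParquetFile {
    buffer: Vec<u8>,
    num_row_groups: usize,
}

impl ParquetFile {
    pub fn new(buffer: Vec<u8>) -> Self {
        // A real implementation would parse the Parquet footer here to count
        // row groups; this sketch hard-codes a placeholder value.
        ParquetFile { buffer, num_row_groups: 2 }
    }

    pub fn num_row_groups(&self) -> usize {
        self.num_row_groups
    }

    // In the real binding this would return Arrow IPC bytes for row group i.
    pub fn read_row_group(&self, i: usize) -> Result<Vec<u8>, String> {
        if i >= self.num_row_groups {
            return Err(format!("row group {} out of range", i));
        }
        Ok(self.buffer.clone()) // placeholder payload
    }
}
```

Each method borrows &self, so JS can call read_row_group repeatedly in a loop without the instance being moved.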
see https://docs.rs/arrow2/latest/arrow2/io/parquet/read/fn.read_columns_many.html#implementation
If you look at the implementation of read_columns, it basically just fetches the byte range described in the column chunk metadata.
The implication of this is that if we expose the metadata to JS, then we can delegate all the data fetching to JS pretty easily, including whether to merge range requests. Then pass the downloaded buffers to to_deserializer.
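As a sketch of the "merge range requests" idea (plain Rust, with hypothetical (offset, length) pairs standing in for the column chunk metadata), nearby byte ranges can be coalesced before issuing fetches:

```rust
// Coalesce (offset, length) byte ranges, e.g. from column chunk metadata,
// so that nearby ranges become a single HTTP range request. `max_gap` is
// the largest hole we're willing to over-fetch to avoid an extra request.
fn merge_ranges(mut ranges: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, len) in ranges {
        match merged.last_mut() {
            // Close enough to the previous range: extend it to cover this one.
            Some((m_start, m_len)) if start <= *m_start + *m_len + max_gap => {
                let end = (start + len).max(*m_start + *m_len);
                *m_len = end - *m_start;
            }
            _ => merged.push((start, len)),
        }
    }
    merged
}
```

Whether merging is worth it depends on the gap size versus request latency, which is exactly the kind of policy decision that's easier to tune on the JS side.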
Use an env var for "debug" and an env var for CI? Then have yarn build:prod and yarn build:dev?
Currently takes 5m45s to build all six bundles.
Looks like apache/arrow-rs#1414 should be included in 11.0.0.
I.e. each public binding should be able to change from
parquet-wasm/src/arrow1/wasm.rs
Lines 5 to 19 in bfd6943
pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
let buffer = crate::arrow1::reader::read_parquet(parquet_file)?;
Ok(crate::utils::copy_vec_to_uint8array(buffer))
}
Should be some way to make the ? work in this context? You might need to add an impl for converting from the ParquetError to a JS Error?
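That's exactly the pattern: ? calls From to convert the error type automatically. Sketched here with stand-in types, since the real JsValue conversion only runs in a JS host (in the actual bindings it would be impl From&lt;ParquetError&gt; for JsValue):

```rust
use std::fmt;

// Stand-in for parquet's ParquetError.
#[derive(Debug)]
struct ParquetError(String);

impl fmt::Display for ParquetError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Parquet error: {}", self.0)
    }
}

// Stand-in for a JS-side error value (JsValue in the real bindings).
#[derive(Debug, PartialEq)]
struct JsError(String);

impl From<ParquetError> for JsError {
    fn from(err: ParquetError) -> Self {
        JsError(format!("{}", err))
    }
}

fn inner() -> Result<Vec<u8>, ParquetError> {
    Err(ParquetError("Invalid Parquet file".to_string()))
}

// Because of the From impl above, `?` converts ParquetError -> JsError
// automatically, so no explicit match is needed in the binding.
fn binding() -> Result<Vec<u8>, JsError> {
    let buffer = inner()?;
    Ok(buffer)
}
```

With that impl in place, every public binding collapses to the two-line `let buffer = ...?; Ok(...)` form shown above.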
I.e. structs have a .free function in JS that you probably need to call when you're done with the memory.
Return an empty IPC table with only the schema but no rows?
Currently you just get an obscure:
string: 'Parquet error: Invalid Parquet file. Size is smaller than footer'
Would enable fetching _metadata from JS and then passing that buffer to read the entire FileMetaData.
For a while, before switching to arrow2/parquet2 (i.e. up until this commit), I was using the arrow and parquet crates from https://github.com/apache/arrow-rs. I repeatedly had an issue with some files, where the Parquet file would be readable in Rust, and then the generated Arrow IPC data wouldn't be readable in JS. This caused a ton of frustration, and switching to Arrow2/Parquet2 seemed to solve it, but I didn't know why.
With more debugging (crucial was logging the vector in Rust right before returning, and the Uint8Array from JS), I realized that the data wasn't being transferred back to JS correctly! E.g. when testing at this commit with the test file 1-partition-snappy.parquet, the arrays on the JS and Rust sides had the same length, but changed data.
It appears the entire issue was the reliance on unsafe { Uint8Array::view(&file) }. When I instead create a new Uint8Array and copy the file into the newly created Uint8Array, the array in JS and in Rust matches, and the file is read successfully by Arrow JS.
From the wasm-bindgen docs:
Views into WebAssembly memory are only valid so long as the backing buffer isn't resized in JS. Once this function is called any future calls to Box::new (or malloc of any form) may cause the returned value here to be invalidated. Use with caution!
Additionally the returned object can be safely mutated but the input slice isn't guaranteed to be mutable.
Finally, the returned object is disconnected from the input slice's lifetime, so there's no guarantee that the data is read at the right time.
To be honest, I'm not entirely sure where I was violating these principles (or maybe it was some internals from the arrow FileWriter). So it makes sense (at least for now) to remove the unsafe code and create a new Uint8Array buffer to solve this 🙂.
Note that creating another Uint8Array buffer would put more memory pressure on WebAssembly, which seems to run out of memory after using 1GB, but that's a problem for the future (ideally we'll be able to return a stream of record batches to JS).
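The copy-based fix, sketched as a plain-Rust stand-in for utils::copy_vec_to_uint8array (in the real bindings the destination is a js_sys::Uint8Array rather than a Vec&lt;u8&gt;, which only exists in a JS host):

```rust
// Instead of an unsafe view into wasm memory (which any later allocation may
// invalidate), allocate a fresh buffer and copy the bytes into it. The copy
// is independent of the source, so later reallocations can't corrupt it.
fn copy_to_owned(file: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(file.len());
    out.extend_from_slice(file);
    out
}
```

The cost is one extra full-file copy, which is the memory-pressure trade-off noted above.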
Hi,
I was looking for a JS library for reading Parquet files in the browser, and I found parquet-wasm.
My current setup is the following:
import { readParquet } from "parquet-wasm";
const parquetFile = files[0]; // File picked from input tag
const fileData = new Blob([parquetFile]);
const promise = new Promise(getBuffer(fileData));
promise
.then(function (data) {
const arrowStream = readParquet(data);
})
.catch(function (err) {
console.log("Error: ", err);
});
function getBuffer(fileData) {
return function (resolve) {
const reader = new FileReader();
reader.readAsArrayBuffer(fileData);
reader.onload = function () {
const arrayBuffer = reader.result;
const bytes = new Uint8Array(arrayBuffer);
resolve(bytes);
};
};
}
Unfortunately I get the following error: Error: TypeError: wasm.__wbindgen_add_to_stack_pointer is not a function.
Do you have any suggestion? Thanks for the help.
Any wasm-bindgen function annotated with /// before the function (seems to work when it's on the line before #[wasm_bindgen]) becomes a JSDoc comment when built 🙀 😍. So it will be nice to copy all documentation into the function docstrings.
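For illustration, a function documented this way (the attribute is omitted here so the sketch compiles standalone, and the body is a placeholder, not the real reader):

```rust
/// Reads a Parquet file into Arrow IPC format.
///
/// When a `///` doc comment like this one sits on the line before
/// `#[wasm_bindgen]`, wasm-bindgen emits it as a JSDoc block in the
/// generated .js and .d.ts output.
pub fn read_parquet(parquet_file: &[u8]) -> Vec<u8> {
    // Placeholder body for the sketch.
    parquet_file.to_vec()
}
```

So docs written once in Rust show up in editor tooltips for JS consumers for free.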
Would be great to add some examples, maybe a benchmark page?
js_sys::ArrayBuffer type? It might even do the ArrayBuffer/Uint8Array conversion on the JS bindings side.
Use the same underlying read/write functions but add a CLI through main.rs. This should be helpful when debugging why a file is crashing in the browser.
Looks like pyarrow v7 switched to the new lz4 compression enum value.
jorgecarleitao/parquet2#124 (comment)
Should probably also mention this in the readme, that lz4 works only with files created with pyarrow/parquet cpp v7+.
And should we change web to esm?
How much does code size improve for just a reader?