kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data

Home Page: https://kylebarron.dev/parquet-wasm/

License: Apache License 2.0

Rust 70.38% JavaScript 6.88% HTML 0.31% Python 3.81% Shell 2.33% TypeScript 16.28%
webassembly wasm rust parquet javascript arrow apache-arrow apache-parquet

parquet-wasm's Introduction

WASM Parquet

WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet and arrow crates.

This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library.

Including read and write support and all compression codecs, the brotli-compressed WASM bundle is 1.2 MB. Refer to custom builds for how to build a smaller bundle. A minimal read-only bundle without compression support can be as small as 456 KB brotli-compressed.

Install

parquet-wasm is published to NPM. Install with

yarn add parquet-wasm

or

npm install parquet-wasm

API

Parquet-wasm has both a synchronous and asynchronous API. The sync API is simpler but requires fetching the entire Parquet buffer in advance, which is often prohibitive.

Sync API

Refer to these functions:

  • readParquet: Read a Parquet file synchronously.
  • readSchema: Read an Arrow schema from a Parquet file synchronously.
  • writeParquet: Write a Parquet file synchronously.

Async API

  • readParquetStream: Create a ReadableStream that emits Arrow RecordBatches from a Parquet file.
  • ParquetFile: A class for reading portions of a remote Parquet file. Use fromUrl to construct from a remote URL or fromFile to construct from a File handle. Note that when you're done using this class, you'll need to call free to release any memory held by the ParquetFile instance itself.

Both sync and async functions return or accept a Table class, an Arrow table in WebAssembly memory. Refer to its documentation for moving data into/out of WebAssembly.

Entry Points

| Entry point | Description | Documentation |
| --- | --- | --- |
| parquet-wasm, parquet-wasm/esm, or parquet-wasm/esm/parquet_wasm.js | ESM, to be used directly from the Web as an ES Module | Link |
| parquet-wasm/bundler | "Bundler" build, to be used in bundlers such as Webpack | Link |
| parquet-wasm/node | Node build, to be used with synchronous require in NodeJS | Link |

ESM

The esm entry point is the primary entry point. It is the default export from parquet-wasm, and is also accessible at parquet-wasm/esm and parquet-wasm/esm/parquet_wasm.js (for symmetric imports directly from a browser).

Note that when using the esm bundles, you must manually initialize the WebAssembly module before using any APIs. Otherwise, you'll get an error TypeError: Cannot read properties of undefined. There are multiple ways to initialize the WebAssembly code:

Asynchronous initialization

The primary way to initialize is by awaiting the default export.

import wasmInit, {readParquet} from "parquet-wasm";

await wasmInit();

Without any parameter, this will try to fetch a file named 'parquet_wasm_bg.wasm' at the same location as parquet-wasm. (Internally, the default input is resolved with a snippet like input = new URL('parquet_wasm_bg.wasm', import.meta.url);).

Note that you can also pass in a custom URL if you want to host the .wasm file on your own servers.

import wasmInit, {readParquet} from "parquet-wasm";

// Update this version to match the version you're using.
const wasmUrl = "https://cdn.jsdelivr.net/npm/parquet-wasm@<version>/esm/parquet_wasm_bg.wasm";
await wasmInit(wasmUrl);

Synchronous initialization

The initSync named export allows for synchronous initialization from a Wasm buffer that has already been loaded into memory.

import {initSync, readParquet} from "parquet-wasm";

// The contents of esm/parquet_wasm_bg.wasm in an ArrayBuffer
const wasmBuffer = new ArrayBuffer(...);

// Initialize the Wasm synchronously
initSync(wasmBuffer)

Async initialization should be preferred over downloading the Wasm buffer and then initializing it synchronously, as WebAssembly.instantiateStreaming is the most efficient way to both download and initialize Wasm code.
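Because every API call depends on initialization having completed, it can help to route all callers through a single memoized promise. The following is a minimal sketch of that idea and is not part of parquet-wasm: makeEnsureInit and fakeInit are hypothetical names, and fakeInit stands in for the default export (wasmInit) so that the snippet runs without the package installed.

```javascript
// Hypothetical helper (not part of parquet-wasm): wraps an async init
// function so that it runs at most once, however many callers ask for it.
function makeEnsureInit(init) {
  let pending = null;
  return function ensureInit() {
    if (pending === null) {
      pending = init();
    }
    return pending;
  };
}

// Stand-in for `wasmInit` from "parquet-wasm"; it only counts invocations.
let initCalls = 0;
async function fakeInit() {
  initCalls += 1;
}

const ensureInit = makeEnsureInit(fakeInit);

// Both callers receive the same promise, and the init function ran once.
const first = ensureInit();
const second = ensureInit();
console.log(first === second, initCalls);
```

In application code, one would presumably pass the real default export, e.g. makeEnsureInit(() => wasmInit()), and await ensureInit() before calling readParquet. Note that this sketch also caches a rejected promise, so a failed initialization is not retried.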

Bundler

The bundler entry point doesn't require manual initialization of the WebAssembly blob, but needs setup with whatever bundler you're using. Refer to the Rust Wasm documentation for more info.

Node

The node entry point can be loaded synchronously from Node.

const {readParquet} = require("parquet-wasm");

const wasmTable = readParquet(...);

Using directly from a browser

You can load the esm/parquet_wasm.js file directly from a CDN

const parquet = await import(
  "https://cdn.jsdelivr.net/npm/parquet-wasm@<version>/esm/+esm"
)
await parquet.default();

const wasmTable = parquet.readParquet(...);

This specific endpoint will minify the ESM before you receive it.

Debug functions

These functions are not present in normal builds to cut down on bundle size. To create a custom build, see Custom Builds below.

setPanicHook

setPanicHook(): void

Sets console_error_panic_hook in Rust, which provides better debugging of panics by having more informative console.error messages. Initialize this first if you're getting errors such as RuntimeError: Unreachable executed.

The WASM bundle must be compiled with the console_error_panic_hook feature for this function to exist.

Example

import * as arrow from "apache-arrow";
import initWasm, {
  Compression,
  readParquet,
  Table,
  writeParquet,
  WriterPropertiesBuilder,
} from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = arrow.tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet

// wasmTable is an Arrow table in WebAssembly memory
const wasmTable = Table.fromIPCStream(arrow.tableToIPC(rainfall, "stream"));
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();
const parquetUint8Array = writeParquet(wasmTable, writerProperties);

// Read Parquet buffer back to Arrow Table
// arrowWasmTable is an Arrow table in WebAssembly memory
const arrowWasmTable = readParquet(parquetUint8Array);

// table is now an Arrow table in JS memory
const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>

Published examples

(These may use older versions of the library with a different API).

Performance considerations

Tl;dr: When you have a Table object (resulting from readParquet), try the new Table.intoFFI API to move it to JavaScript memory. This API is less well tested than the Table.intoIPCStream API, but should be faster and have much less memory overhead (by a factor of 2). If you hit any bugs, please create a reproducible issue.

Under the hood, parquet-wasm first decodes a Parquet file into Arrow in WebAssembly memory. But then that WebAssembly memory needs to be copied into JavaScript for use by Arrow JS. The "normal" conversion APIs (e.g. Table.intoIPCStream) use the Arrow IPC format to get the data back to JavaScript. But this requires another memory copy inside WebAssembly to assemble the various arrays into a single buffer to be copied back to JS.

Instead, the new Table.intoFFI API uses Arrow's C Data Interface to be able to copy or view Arrow arrays from within WebAssembly memory without any serialization.

Note that this approach uses the arrow-js-ffi library to parse the Arrow C Data Interface definitions. This library has not yet been tested in production, so it may have bugs!

I wrote an interactive blog post on this approach and the Arrow C Data Interface if you want to read more!

Example

import * as arrow from "apache-arrow";
import { parseTable } from "arrow-js-ffi";
import initWasm, { wasmMemory, readParquet } from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// A reference to the WebAssembly memory object.
const WASM_MEMORY = wasmMemory();

const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());
const wasmArrowTable = readParquet(parquetUint8Array).intoFFI();

// Arrow JS table that was directly copied from Wasm memory
const table: arrow.Table = parseTable(
  WASM_MEMORY.buffer,
  wasmArrowTable.arrayAddrs(),
  wasmArrowTable.schemaAddr()
);

// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're done using it
// to release the Wasm memory.
// Note that any access to the pointers in this table is undefined behavior after this call.
// Calling any `wasmArrowTable` method will error.
wasmArrowTable.drop();

Compression support

The Parquet specification permits several compression codecs. This library currently supports:

  • Uncompressed
  • Snappy
  • Gzip
  • Brotli
  • ZSTD
  • LZ4_RAW
  • LZ4 (deprecated)

LZ4 support in Parquet is a bit messy. As described here, there are two LZ4 compression options in Parquet (as of version 2.9.0). The original version LZ4 is now deprecated; it used an undocumented framing scheme which made interoperability difficult. The specification now reads:

It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.

It's currently unknown how widespread the ecosystem support is for LZ4_RAW. As of pyarrow v7, it now writes LZ4_RAW by default and presumably has read support for it as well.

Custom builds

In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of parquet-wasm to keep bundle size at a minimum. If you install the Rust toolchain and wasm-pack (see Development), you can create a custom build with only the compression codecs you require.

The minimum supported Rust version in this project is 1.60. To upgrade your toolchain, use rustup update stable.

Example custom builds

Reader-only bundle with Snappy compression:

wasm-pack build --no-default-features --features snappy --features reader

Writer-only bundle with no compression support, targeting Node:

wasm-pack build --target nodejs --no-default-features --features writer

Bundle with reader and writer support, targeting Node, using arrow and parquet crates with all their supported compressions, with console_error_panic_hook enabled:

wasm-pack build \
  --target nodejs \
  --no-default-features \
  --features reader \
  --features writer \
  --features all_compressions \
  --features debug
# Or, given the fact that the default feature includes several of these features, a shorter version:
wasm-pack build --target nodejs --features debug

Refer to the wasm-pack documentation for more info on flags such as --release, --dev, and --target, and to the Cargo documentation for more info on how to use features.

Available features

By default, all_compressions, reader, writer, and async features are enabled. Use --no-default-features to remove these defaults.

  • reader: Activate read support.
  • writer: Activate write support.
  • async: Activate asynchronous read support.
  • all_compressions: Activate all supported compressions.
  • brotli: Activate Brotli compression.
  • gzip: Activate Gzip compression.
  • snappy: Activate Snappy compression.
  • zstd: Activate ZSTD compression.
  • lz4: Activate LZ4_RAW compression.
  • debug: Expose the setPanicHook function for better error messages for Rust panics.

Node <20

On Node versions before 20, you'll have to polyfill the Web Cryptography API.

Future work

  • Example of pushdown predicate filtering, to download only chunks that match a specific condition
  • Column filtering, to download only certain columns
  • More tests

Acknowledgements

A starting point of my work came from @my-liminal-space's read-parquet-browser (which is also dual licensed MIT and Apache 2).

@domoritz's arrow-wasm was a very helpful reference for bootstrapping Rust-WASM bindings.

parquet-wasm's People

Contributors

andyredhead, dependabot[bot], fspoettel, h-plus-time, jdoig, kylebarron

parquet-wasm's Issues

Helper to copy `Vec<u8>` to `Uint8Array`

I.e. each public binding should be able to change from

pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
    let buffer = match crate::arrow1::reader::read_parquet(parquet_file) {
        // This function would return a rust vec that would be copied to a Uint8Array here
        Ok(buffer) => buffer,
        Err(error) => return Err(JsValue::from_str(format!("{}", error).as_str())),
    };
    let return_len = match (buffer.len() as usize).try_into() {
        Ok(return_len) => return_len,
        Err(error) => return Err(JsValue::from_str(format!("{}", error).as_str())),
    };
    let return_vec = Uint8Array::new_with_length(return_len);
    return_vec.copy_from(&buffer);
    return Ok(return_vec);
}

to

pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
    let buffer = crate::arrow1::reader::read_parquet(parquet_file)?;
    Ok(crate::utils::copy_vec_to_uint8array(buffer))
}

Is there some way to make the ? operator work in this context? It might require adding an impl that converts ParquetError into a JS Error.

create debug cli

Use the same underlying read/write functions but add a CLI through main.rs. This should be helpful when debugging why a file is crashing in the browser.

Arrow-rs debugging (Error: Expected to read 2166784 metadata bytes, but only read 486.) [Solved]

For a while, before switching to arrow2/parquet2, (i.e. up until this commit) I was using the arrow and parquet crates from https://github.com/apache/arrow-rs. I repeatedly had an issue with some files, where the Parquet file would be readable in Rust, and then the generated Arrow IPC data wouldn't be readable in JS. This caused a ton of frustration, and switching to Arrow2/Parquet2 seemed to solve it, but I didn't know why.

With more debugging (logging the vector in Rust right before returning, and the Uint8Array in JS, was crucial), I realized that the data wasn't being transferred back to JS correctly! E.g. when testing at this commit with the test file 1-partition-snappy.parquet, the arrays on the JS and Rust sides had the same length but different contents.

It appears the entire issue was the reliance on unsafe { Uint8Array::view(&file) }. When I instead create a new Uint8Array and copy the file into the newly created Uint8Array, the array in JS and in Rust matches, and the file is read successfully by Arrow JS.

From the wasm-bindgen docs

Views into WebAssembly memory are only valid so long as the backing buffer isn’t resized in JS. Once this function is called any future calls to Box::new (or malloc of any form) may cause the returned value here to be invalidated. Use with caution!

Additionally the returned object can be safely mutated but the input slice isn’t guaranteed to be mutable.

Finally, the returned object is disconnected from the input slice’s lifetime, so there’s no guarantee that the data is read at the right time.

To be honest, I'm not entirely sure where I was violating these principles (or maybe it was some internals from the arrow FileWriter). So it makes sense (at least for now) to remove the unsafe code and create a new Uint8Array buffer to solve this 🙂 .
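One of the failure modes named in the wasm-bindgen docs, a view being invalidated when the backing buffer is resized, can be reproduced from plain JavaScript without any Rust. The sketch below uses a bare WebAssembly.Memory as a stand-in for a module's linear memory; it is an assumption-laden illustration and does not necessarily reproduce the exact corruption seen in this issue (same length, different contents).

```javascript
// One 64 KiB page of linear memory, standing in for a Wasm module's memory.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 4 });

// A zero-copy view into that memory, analogous to what `Uint8Array::view`
// hands back to JS.
const view = new Uint8Array(memory.buffer, 0, 4);
view.set([1, 2, 3, 4]);

// An owned copy in JS memory, analogous to creating a new Uint8Array and
// copying the bytes into it.
const copy = view.slice();

// Growing linear memory (which an allocation inside Wasm may trigger)
// detaches the old ArrayBuffer that `view` was created on.
memory.grow(1);

console.log(view.byteLength); // 0: the view no longer sees any data
console.log(Array.from(copy)); // the copy still holds 1, 2, 3, 4
```

The copy costs extra memory, as noted in this issue, but it stays valid no matter what the Wasm side allocates afterwards.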

Note that creating another Uint8Array buffer would put more memory pressure on WebAssembly, which seems to run out of memory after using 1GB, but that's a problem for the future (ideally we'll be able to return a stream of record batches to JS).

TypeError: wasm.__wbindgen_add_to_stack_pointer is not a function

Hi,

I was looking for a JS library for reading parquet files on the browser, and I found parquet-wasm.

My current setup is the following:

import { readParquet } from "parquet-wasm";

const parquetFile = files[0]; // File picked from input tag
const fileData = new Blob([parquetFile]);
const promise = new Promise(getBuffer(fileData));

promise
	.then(function (data) {
		const arrowStream = readParquet(data);
	})
	.catch(function (err) {
		console.log("Error: ", err);
	});

function getBuffer(fileData) {
	return function (resolve) {
		const reader = new FileReader();
		reader.readAsArrayBuffer(fileData);
		reader.onload = function () {
			const arrayBuffer = reader.result;
			const bytes = new Uint8Array(arrayBuffer);
			resolve(bytes);
		};
	};
}

Unfortunately I get the following error Error: TypeError: wasm.__wbindgen_add_to_stack_pointer is not a function.

Do you have any suggestion? Thanks for the help.

Split functions into non-wasm-bindgen helpers

For a while, there will probably be issues with the APIs, either in the wasm bindings, my bindings, or the underlying libraries. It will be necessary to debug these problems outside of the web environment, at least to the extent possible.

To that end I think it'll be very helpful to have a debug CLI, where essentially the exact same binding code is run, but locally instead of in wasm.

This means:

  • Decoupling any JS specific code out of read_parquet and write_parquet. They should take as input and output rust slices and buffers, and return rust errors, not js errors. (Maybe read up on how method()?; works, which would make the code a lot cleaner).
  • Creating an optional feature with main.rs which would be a CLI input to these four functions.

// lib.rs

#[cfg(feature = "arrow1")]
#[wasm_bindgen(js_name = readParquet1)]
pub fn read_parquet(parquet_file: &[u8]) -> Result<Uint8Array, JsValue> {
  match crate::arrow1::read_parquet(parquet_file) {
    // This function would return a rust vec that would be copied to a Uint8Array here
    Ok(buffer) => Ok(Uint8Array::from(buffer.as_slice())),
    Err(error) => Err(JsValue::from_str(format!("{}", error).as_str()))
  }
}

// main.rs
// CLI that wraps crate::arrow1::read_parquet and writes output to a local file

Document debug cli

cargo run --example parquet_read --features io_parquet,io_parquet_compression -- 1-partition-lz4.parquet

Update Readme

Update footnote

[^0]: I originally decoded Parquet files to the Arrow IPC File format, but Arrow JS occasionally produced bugs such as `Error: Expected to read 1901288 metadata bytes, but only read 644` when parsing using `arrow.tableFromIPC`. When testing the same buffer in Pyarrow, `pa.ipc.open_file` succeeded but `pa.ipc.open_stream` failed, leading me to believe that the Arrow JS implementation has some bugs to decide when `arrow.tableFromIPC` should internally use the `RecordBatchStreamReader` vs the `RecordBatchFileReader`.

after investigation from #19

Use &self on methods on wasm bindgen structs

When you make a method:

    #[wasm_bindgen]
    pub fn version(self) -> i32 {
        self.0.version()
    }

the self consumes the instance, and you can't use it again. So if you try to call .version() twice from JS you see:

/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403
        if (this.ptr == 0) throw new Error('Attempt to use a moved value');
                           ^

Error: Attempt to use a moved value
    at FileMetaData.version (/Users/kyle/github/rust/parquet-wasm/tmp/arrow1.js:403:34)
    at evalmachine.<anonymous>:1:10
    at Script.runInThisContext (node:vm:129:12)
    at Object.runInThisContext (node:vm:305:38)
    at run ([eval]:1020:15)
    at onRunRequest ([eval]:864:18)
    at onMessage ([eval]:828:13)
    at process.emit (node:events:520:28)
    at emit (node:internal/child_process:938:14)
    at processTicksAndRejections (node:internal/process/task_queues:84:21)

Instead you can take a reference (?) to self:

    #[wasm_bindgen]
    pub fn version(&self) -> i32 {
        self.0.version()
    }

And now .version() doesn't consume the instance.
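The difference can be modelled in plain JavaScript. The class below is a hypothetical, heavily simplified stand-in for a wasm-bindgen wrapper (the names FileMetaDataModel, versionByValue, and versionByRef are made up for illustration); it only mimics the pointer check visible in the stack trace above and is not the generated glue code.

```javascript
// Simplified model of a wasm-bindgen wrapper class. This is a hypothetical
// stand-in, not the generated glue code itself.
class FileMetaDataModel {
  constructor(ptr) {
    this.ptr = ptr; // non-zero while the Rust value is alive
  }

  // Models `pub fn version(self)`: ownership moves into Rust, so the
  // JS-side pointer is cleared and any later call fails.
  versionByValue() {
    if (this.ptr === 0) throw new Error("Attempt to use a moved value");
    this.ptr = 0;
    return 1;
  }

  // Models `pub fn version(&self)`: the value is only borrowed, so the
  // pointer is left untouched.
  versionByRef() {
    return 1;
  }
}

const borrowed = new FileMetaDataModel(42);
borrowed.versionByRef();
borrowed.versionByRef(); // fine: the instance is still usable

const moved = new FileMetaDataModel(42);
moved.versionByValue();
let message = null;
try {
  moved.versionByValue(); // second call on a consumed instance
} catch (err) {
  message = err.message;
}
console.log(message); // "Attempt to use a moved value"
```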

You might also want to consider using a mutable self here, instead of making a new instance every time.

    pub fn set_writer_version(self, value: WriterVersion) -> Self {
        Self(self.0.set_writer_version(value.to_arrow1()))
    }

Docstrings for exported functions

Any wasm-bindgen function annotated with /// before the function (seems to work when it's on the line before #[wasm_bindgen]) becomes a jsdoc when built 🙀 😍 . So it will be nice to copy all documentation into the function docstrings.

Return iterator of arrow record batches to JS

Motivation: Parquet and Arrow are chunked formats. Therefore we shouldn't need to wait for the entire dataset to load/parse before getting some data back.

However I'm still not aware of a way to return an iterable or an async iterable from rust to js. To get around this, I think we can "drive" the iteration from JS. Essentially this:

import * as wasm from 'parquet-wasm';

const arr = new Uint8Array(); // Parquet bytes
// name readSchema to align with pyarrow api?
const parquetFile = new wasm.ParquetFile(arr);
const schemaIPC = parquetFile.schema();
for (let i = 0; i < parquetFile.numRowGroups; i++) {
  const recordBatchIPC = parquetFile.readRowGroup(i);
}

And ideally we'll have an async version of this too
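As a rough sketch of how the JS-driven loop could still be exposed to callers as an ordinary iterable, a generator can wrap it. FakeParquetFile below is a hypothetical stand-in for the proposed ParquetFile binding (only numRowGroups and readRowGroup are assumed, following the snippet above), so the example runs without the package; iterRowGroups is likewise a made-up name.

```javascript
// Hypothetical stand-in for the proposed `ParquetFile` binding. Each
// "row group" is just a pre-made byte buffer here.
class FakeParquetFile {
  constructor(rowGroups) {
    this.rowGroups = rowGroups;
    this.numRowGroups = rowGroups.length;
  }

  readRowGroup(i) {
    return this.rowGroups[i];
  }
}

// Drives the iteration from JS and yields one row group at a time, so
// consumers can start working before the whole file has been decoded.
function* iterRowGroups(parquetFile) {
  for (let i = 0; i < parquetFile.numRowGroups; i++) {
    yield parquetFile.readRowGroup(i);
  }
}

const file = new FakeParquetFile([
  new Uint8Array([1]),
  new Uint8Array([2, 3]),
]);

const sizes = [];
for (const recordBatchIPC of iterRowGroups(file)) {
  sizes.push(recordBatchIPC.byteLength);
}
console.log(sizes); // [ 1, 2 ]
```

An async variant would presumably follow the same shape with async function* and for await, assuming an asynchronous readRowGroup.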
