libertydsnp / parquetjs
This project was forked from zjonsson/parquetjs.
Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
License: MIT License
Run npm install to update package-lock.json. Treat this as a template for future updates.
This behavior comes from the original version this repo was forked from: it writes test files to the test directory and never cleans them up. There are also pre-existing files in test/test-files, which some tests use. This is confusing for debugging, and it did in fact confuse me when trying to debug some test failures.
Secondly, the test named "reads parquet files via http" in test/reader.js depends upon the file generated by bloomFilterIntegration.ts, which is poor test practice.
Test code that reads generated files will have to be correctly distinguished from test code that opens test/test-files, and pointed at files in /tmp.
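A minimal sketch of one way to keep the two apart, assuming all generated output is routed through a helper (the TMP_OUT name and both helper functions are hypothetical):

```typescript
// Hypothetical helpers: tests that generate files write under the OS temp dir,
// so anything in test/test-files is by definition a checked-in fixture.
import os from 'os';
import path from 'path';
import fs from 'fs';

const TMP_OUT = path.join(os.tmpdir(), 'parquetjs-tests');
fs.mkdirSync(TMP_OUT, { recursive: true });

// Generated file: writer tests would create e.g. generatedFile('fruits.parquet')
export const generatedFile = (name: string) => path.join(TMP_OUT, name);
// Fixture: tests that read checked-in files keep using test/test-files
export const fixtureFile = (name: string) => path.join('test', 'test-files', name);
```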
The version of thrift currently used in util.js has issues with readString. Later patches of thrift do not have this issue and could allow us to remove the fixedTFramedTransport class altogether. The relevant thrift fixes:
https://github.com/apache/thrift/blob/master/CHANGES.md#092
https://issues.apache.org/jira/browse/THRIFT-1841
According to the Parquet type definitions, ConvertedTypes (such as TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS) are deprecated in favor of LogicalTypes (such as TIME and TIMESTAMP). However, I don't see any support for these types (such as in SchemaDefinition, ParquetType, etc.). Please add support for these types.
Useful tool for inspecting output files: https://github.com/manojkarthick/pqrs
Help needed: I am trying to generate a parquet file against an existing schema that has a field with the following datatype:
fixed_len_byte_array(16) LICENSE_TERM_IN_MONTHS (DECIMAL(38,0));
Requesting support for this data type.
Currently, only PLAIN, PLAIN_DICTIONARY, and RLE are considered valid encodings: https://github.com/LibertyDSNP/parquetjs/blob/c07e7e81847523f4d74edd0adf9b2f9b6bbd1d90/lib/codec/index.ts
Opened on behalf of our user. microsoft/AzureStorageExplorer#7506
Acceptance criteria:
Convert the files in the tasks to TypeScript without using `any` anywhere.
It seems this library uses the AWS V2 SDK. Not so much a bug per se, but rather a request to update this library to support AWS SDK V3.
This is how we solved it in our application:
// This is a hack file to support things that @dsnp/parquetjs doesn't support quite yet
import { GetObjectCommand, HeadObjectCommand, S3Client, GetObjectCommandInput } from "@aws-sdk/client-s3";
import { Readable } from "stream";
import { Blob } from "buffer";
const parquet = require("@dsnp/parquetjs");
const { ParquetReader, ParquetEnvelopeReader } = parquet;
export const openS3Reader = async (
client: S3Client,
params: GetObjectCommandInput,
options?: any
): Promise<typeof ParquetReader> => {
const fileStat = async () => {
const headObjectResult = await client.send(new HeadObjectCommand(params));
return headObjectResult.ContentLength;
};
const readFn = async (offset: number, length: number, file: string): Promise<Buffer> => {
if (file) {
return Promise.reject("external references are not supported");
}
const Range = `bytes=${offset}-${offset + length - 1}`;
const response = await client.send(new GetObjectCommand({ ...{ Range }, ...params }));
const body = response.Body;
if (body) {
return streamToBuffer(body);
}
return Buffer.of();
};
const closeFn = () => ({});
const envelopeReader = new ParquetEnvelopeReader(readFn, closeFn, fileStat, options);
return ParquetReader.openEnvelopeReader(envelopeReader, options);
};
async function streamToBuffer(body: any): Promise<Buffer> {
const blob = body as Blob;
if (blob.arrayBuffer !== undefined) {
const arrayBuffer = await blob.arrayBuffer();
const uint8Array: Uint8Array = new Uint8Array(arrayBuffer);
return Buffer.from(uint8Array); // new Buffer() is deprecated
}
//Assumed to be a Readable like object
const readable = body as Readable;
return await new Promise((resolve, reject) => {
const chunks: Uint8Array[] = [];
readable.on("data", (chunk) => chunks.push(chunk));
readable.on("error", reject);
readable.on("end", () => resolve(Buffer.concat(chunks)));
});
}
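A usage sketch for the helper above, continuing from that snippet; the bucket, key, and region values are placeholders:

```typescript
// Hypothetical usage of openS3Reader as defined above
const client = new S3Client({ region: "us-east-1" });
const reader = await openS3Reader(client, {
  Bucket: "my-bucket",          // placeholder
  Key: "data/example.parquet",  // placeholder
});
const cursor = reader.getCursor();
let record = null;
while ((record = await cursor.next())) {
  console.log(record);
}
await reader.close();
```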
Points: 3
Hi there!
Firstly, thank you for this amazing library!
I'm curious to know how to add bloom filters to LIST types.
For example, given this schema:
{
querystring: {
type: "LIST",
fields: {
list: {
repeated: true,
fields: {
element: {
fields: {
key: { type: "UTF8" },
value: { type: "UTF8" }
}
}
}
}
}
}
}
How do you add a bloom filter for the querystring.list.element.key field?
[
{
column: "querystring.list.element.key",
numDistinct: 100
}
]
I assume the above won't work? (Sorry in advance if that literally is how you do it!)
Thanks in advance!
Thanks for reporting an issue!
const reader = await parquet.ParquetReader.openFile(<path to parquet file>)
throws: invalid parquet type: DECIMAL
Expected: an error for each column that had a problem, including the column name for each error, such as: invalid parquet type: DECIMAL, for Column: quantity
Actual: you receive only this error: invalid parquet type: DECIMAL
I've created this pr #75 to implement this enhancement.
However, I don't know what process you follow for bringing in PRs from outside developers. If there is anything else I need to do to my PR to help get it merged, please let me know.
xxhash-wasm has a 3-4x performance improvement, but requires at least Node 15.
We are finally ready to drop Node 14 now that Node 20 is out.
https://github.com/jungomi/xxhash-wasm/blob/main/CHANGELOG.md
We have an integration test in our "using the Stream/Transform API" test suite that fails for users with Node version != 14:
Callback called multiple times
This failure is not happening in CI, because our remote test runner is using Node version 14.16.0.
We currently use @ts-ignore to suppress TypeScript issues with the prototypes of PageLocation and their respective read/write functions.
To Do:
Properly type PageLocation and remove the @ts-ignore comments.
Hello,
Wanted to suggest an interesting feature for this package.
One of the common compression methods for Parquet files is ZSTD.
However, this algorithm is not currently supported by this package (nor natively by Node.js).
ZSTD gives very good compression time / decompression time / compression ratio results and would make it easier to exchange Parquet files between Node.js and other programming languages or Parquet file producers.
Could you please generate a new release with the updates for AWS S3 v3?
Containerize a React application with the parquetjs dependency installed, using the Dockerfile below:
Version Info:
React Scripts: 5.0.1
Node: 16.14.2
@dsnp/parquetjs: 1.3.5
package.json build: react-scripts build
# Build stage; the FROM line is missing above, base image assumed from the version info (Node 16.14.2)
FROM node:16.14.2 AS build
WORKDIR /usr/src/app
ENV PATH /usr/src/app/node_modules/.bin:$PATH
COPY package.json ./package.json
RUN npm cache clean --force && npm install --legacy-peer-deps
COPY . ./
RUN npm run build
FROM nginx:1.21.1-alpine
COPY --from=build /usr/src/app/build /usr/share/nginx/html
RUN rm /etc/nginx/conf.d/default.conf
COPY deploy/nginx/nginx.conf /etc/nginx/conf.d
COPY docker-entrypoint.sh generate-config-js.sh /
EXPOSE 80
CMD ["/docker-entrypoint.sh"]
In the containerized React application, the code snippet below throws at the getCursor call.
Expected: reader.getCursor() should not throw an error.
Code:
import parquetjs from "@dsnp/parquetjs/dist/browser/parquet.esm";
const buffer = Buffer.from(arrayData[1], "base64");
const reader = await parquetjs.ParquetReader.openBuffer(buffer);
const cursor = reader.getCursor();
Stacktrace:
helper.ts:478 TypeError: Cannot read properties of null (reading 'includes')
at new e (parquet.esm.js:78:21246)
at e.openEnvelopeReader (parquet.esm.js:78:22667)
at async helper.ts:463:13
at async Promise.all (:8080/index 0)
// parquet.esm.js
!nP.includes(t.version))throw"invalid parquet version"
The variable nP is null, and the script is trying to read its includes property.
Running my React application with the React dev server, no issues arise. The library versions match between the local and containerized environments: Node, @dsnp/parquetjs.
What I have tried: the *.esm and *.cjs builds.
Acceptance criteria:
Convert these files to TypeScript without using `any` anywhere.
Thanks for reporting an issue!
Create a schema where a field specifies type DECIMAL and scale 0. Write a row including that field.
Works.
Fails with this error: Failed to generate test file invalid schema for type: DECIMAL, for Column: decimal, scale is required
Maybe it's not supported for writing?
Either way, I think this line of code, which checks for scale, does not take the value 0 into account:
https://github.com/LibertyDSNP/parquetjs/blob/main/lib/schema.ts#L232
According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, "scale must be zero or a positive integer less than or equal to the precision." The library should allow 0 as the scale value. It also seems scale should be optional, since the doc specifies a default value of 0.
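I haven't confirmed the exact code, but a likely culprit is a truthiness check that rejects scale: 0. A sketch of the suspected bug and a possible fix; the field-options shape here is assumed, not the library's actual types:

```typescript
type DecimalOpts = { precision: number; scale?: number };

// Suspected bug: `!scale` is true for 0, so a legal scale of 0 is rejected.
function validateScaleBuggy(name: string, opts: DecimalOpts) {
  if (!opts.scale) {
    throw new Error(`invalid schema for type: DECIMAL, for Column: ${name}, scale is required`);
  }
}

// Possible fix: default to 0 per the spec and only reject truly invalid values.
function validateScaleFixed(name: string, opts: DecimalOpts) {
  const scale = opts.scale ?? 0;
  if (!Number.isInteger(scale) || scale < 0 || scale > opts.precision) {
    throw new Error(`invalid schema for type: DECIMAL, for Column: ${name}, scale must be an integer in [0, precision]`);
  }
}
```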
Hi, thanks for the effort. I have finally found a parquet library that is in active development.
I want to say that #55 and #47 work well.
I have a parquet file with several million rows, and originally reading it row by row was painfully slow. But with the changes in these PRs, things work as fast as expected.
The change to Array.shift might not be noticeable on a small array, but on a large array that operation can be very slow depending on the engine implementation.
Please incorporate that fix; it's quick and very helpful in real-world situations.
Originally posted by @peara in #11 (comment)
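To illustrate the point (this is not the library's actual code): shift() reindexes every remaining element on each call, so draining an array with it is quadratic, while tracking a read index is linear.

```typescript
const N = 1_000_000;

// O(n^2): each shift() moves all remaining elements down by one slot.
const queue = Array.from({ length: N }, (_, i) => i);
while (queue.length > 0) {
  queue.shift();
}

// O(n): leave the array alone and advance an index instead.
const rows = Array.from({ length: N }, (_, i) => i);
let head = 0;
while (head < rows.length) {
  const row = rows[head++];
  // process row ...
}
```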
Hi,
I currently have a problem with the generated parquet file: under (as yet) unknown circumstances I get the following error:
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'null_pages' was not present! Struct: ColumnIndex(null_pages:null, min_values:[69 74 41 50 53 5F 30 31], max_values:[69 74 41 50 53 5F 30 31], boundary_order:null)
at org.apache.parquet.format.ColumnIndex.validate(ColumnIndex.java:782)
at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:918)
at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:818)
at org.apache.parquet.format.ColumnIndex.read(ColumnIndex.java:722)
at org.apache.parquet.format.Util.read(Util.java:363)
... 47 more
I've already analyzed it a bit and found a difference in the schema.
To check the schema, I used pqrs (https://github.com/manojkarthick/pqrs).
In the parquet file generated via parquetjs, the metadata information is empty:
version: 1
num of rows: 2
created by: @dsnp/parquetjs
metadata:
message root {
OPTIONAL BYTE_ARRAY change_id (UTF8);
OPTIONAL BYTE_ARRAY status (UTF8);
OPTIONAL BYTE_ARRAY approval_status (UTF8);
In the parquet file generated via pyspark, the metadata is filled:
version: 1
num of rows: 4096
created by: parquet-mr version 1.12.2 (build f2610ad5b0d33f2882d1d235f0ecbb70da391aea)
metadata:
org.apache.spark.version: 3.3.2
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"change_id","type":"string","nullable":true,"metadata":{}},{"name":"status","type":"string","nullable":true,"metadata":{}},{"name":"approval_status","type":"string","nullable":true,"metadata":{}}]}
message spark_schema {
OPTIONAL BYTE_ARRAY change_id (STRING);
OPTIONAL BYTE_ARRAY status (STRING);
OPTIONAL BYTE_ARRAY approval_status (STRING);
Here the snippet which I use to generate the parquet file:
import parquet from '@dsnp/parquetjs'
const schema = new parquet.ParquetSchema({
change_id: parquet.ParquetFieldBuilder.createStringField(true),
status: parquet.ParquetFieldBuilder.createStringField(true),
approval_status: parquet.ParquetFieldBuilder.createStringField(true),
})
const records = [
{
"change_id": "C-01",
"status": "closed",
"approval_status": "approved",
},
{
"change_id": "C-02",
"status": "closed",
"approval_status": "approved",
},
]
export default async function main() {
const writer = await parquet.ParquetWriter.openFile(
schema,
'change.parquet'
)
for (const record of records) {
await writer.appendRow(record)
}
await writer.close()
}
So far I haven't found the code path that produces the error.
Running
df = spark.read.parquet('<path_to_parquetjs_generated_file>')
df.show()
directly in a Jupyter Notebook doesn't trigger the error.
It could be somewhere in our calculations/joins which we call in the original script, but I have to analyze this.
The error isn't shown.
To be honest, I'm not sure if the missing metadata is the root cause of this issue.
In the next few days I will try to provide some example files with the relevant Python code to trigger the error, but I have to finish my work first to make sure our customer is happy :)
Update 1:
Current workaround is to have a notebook which reads the parquet file and saves it with a new name:
df_change = spark.read.parquet(f'{TEST_DATA_PATH}/change.parquet')
df_change.write.mode("overwrite").parquet(f'{TEST_DATA_PATH}/change_spark.parquet')
This solves the issue for now - but not really what I want :D
Update 2:
While trying to find someone else with the same issue, I have found the following:
https://repost.aws/questions/QUSdc0Pgo9RtSoHOSBwTi8PQ/hive-cannot-open-split-can-not-read-class-org-apache-parquet-format-columnindex
import parquetjs from "@dsnp/parquetjs"
var schema1 = new parquetjs.ParquetSchema({
age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
Expected: it should at least compile.
Actual: an error, because there is no bitWidth property on the FieldDefinition type:
Object literal may only specify known properties, and 'bitWidth' does not exist in type 'FieldDefinition'.ts(2353)
declare.d.ts(14, 5):
The expected type comes from this index signature.
(property) bitWidth: number
It would be nice to be able to use for await with the reader for easy async reading. To this end, the reader would need to implement Symbol.asyncIterator on ParquetReader. It could be something as simple as:
async* [Symbol.asyncIterator]() {
const cursor = this.getCursor();
let record = null;
while (record = await cursor.next()) {
yield record;
}
}
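Consumers could then read rows directly (usage sketch):

```typescript
// With Symbol.asyncIterator implemented as above:
for await (const record of reader) {
  console.log(record);
}
```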
There are a bunch of PRs against ZJONSSON/parquetjs, our forked source. Let's keep a list below along with the review status. We should automatically ignore anything that has failed its CI. We can then make separate PRs for each one we want to integrate.
For each PR, track: reviewed/unreviewed; integrate/ignore/use partial; notes.
- #45: fromPrimitive_TIMESTAMP_MILLIS BigInt issue
- #55: Improve reader performance
- #59: Fix for issue 58, fromPrimitive_TIMESTAMP_MICROS (TypeError when dividing a string by BigInt)
- #64: Add openS3 compatibility for AWS SDK v3
- #66: Simple select for array columns
Acceptance criteria:
If you have type "BYTE_ARRAY" in your schema and that field contains Uint8Array data, then when parquetjs goes to write the header statistics on close, the statistics data is wrong: it attempts to call copy on a Uint8Array, expecting a Buffer.
To reproduce:
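The repro steps aren't spelled out here; a minimal sketch assumed from the description above:

```typescript
import parquet from '@dsnp/parquetjs';

const schema = new parquet.ParquetSchema({
  raw: { type: 'BYTE_ARRAY' },
});

const writer = await parquet.ParquetWriter.openFile(schema, '/tmp/bytearray.parquet');
// Passing a Uint8Array rather than a Buffer is what trips the statistics code:
await writer.appendRow({ raw: new Uint8Array([1, 2, 3]) });
await writer.close(); // statistics writing attempts Buffer#copy on the Uint8Array
```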
Note: Check the other forks of Parquetjs that might have a fix for this.
Hey guys,
we're currently integrating the parquetjs package into our datalake-graphql-wrapper to provide the functionality to upload data into our data lake via GraphQL.
After some trial & error we were able to generate a parquet file which can be used in the Trino cluster.
Not sure if someone else has already had this problem or a solution... anyway, here are our helper functions plus an example.
// License Apache-2
// helpers/parquet.ts
import { FieldDefinition, ParquetType } from '@dsnp/parquetjs/dist/lib/declare'
export function createStringField({
optional = true,
}: Partial<{
optional: boolean
}>): Partial<FieldDefinition> {
return createField({ type: 'UTF8', optional })
}
export function createBooleanField({
optional = true,
}: {
optional: boolean
}): Partial<FieldDefinition> {
return createField({ type: 'BOOLEAN', optional })
}
export function createIntField({
optional = true,
}: {
optional?: boolean
}): Partial<FieldDefinition> {
return createField({ type: 'INT64', optional })
}
export function createFloatField({
optional = true,
}: Partial<{
optional: boolean
}>): Partial<FieldDefinition> {
return createField({ type: 'FLOAT', optional })
}
export function createDecimalField({
precision = 3,
optional = true,
}: Partial<{
precision?: number
optional?: boolean
}>): Partial<FieldDefinition> {
return createField({ type: 'DECIMAL', precision, optional })
}
export function createTimestampField({
optional = true,
}: Partial<{
optional?: boolean
}>) {
return createField({ type: 'TIMESTAMP_MILLIS', optional })
}
export function createRepeatableStructField({
fields,
}: {
fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
return {
optional: true,
type: 'LIST',
fields: {
list: {
optional: false,
repeated: true,
fields: {
element: {
optional: true,
repeated: false,
fields: fields,
},
},
},
},
}
}
export function createStructField({
fields,
}: {
fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
return {
optional: true,
fields: fields,
}
}
export function createArrayField({
type,
optional = true,
}: Partial<{
type: ParquetType
optional?: boolean
}>): Partial<FieldDefinition> {
return createField({
optional,
type: 'LIST',
fields: {
list: {
optional: false,
repeated: true,
fields: {
element: {
type,
optional: true,
},
},
},
},
})
}
export function createField(
definition: FieldDefinition,
): Partial<FieldDefinition> {
return definition
}
And here's a "short" example:
// License Apache-2
import path from 'path'
import parquetjs from '@dsnp/parquetjs'
import {
createArrayField,
createFloatField,
createIntField,
createRepeatableStructField,
createStringField,
createStructField,
createTimestampField,
} from './helpers/parquet'
const examplePath = path.resolve('test_parquet.parquet')
const parquetSchema = new parquetjs.ParquetSchema({
stringfield: createStringField({}),
intfield: createIntField({}),
floatfield: createFloatField({}),
timestampfield: createTimestampField({}),
arrayfield: createArrayField({ type: 'UTF8' }),
objfield: createStructField({
fields: {
sub1: createStringField({}),
sub2: createStringField({}),
},
}),
structfield: createRepeatableStructField({
fields: {
structfield_array: createArrayField({ type: 'UTF8' }),
structfield_string: createStringField({}),
structfield_struct: createStructField({
fields: {
structfield_struct_string1: createStringField({}),
structfield_struct_string2: createStringField({}),
},
}),
},
}),
})
const writer = await parquetjs.ParquetWriter.openFile(
parquetSchema,
examplePath,
)
await writer.appendRow({
stringfield: 'string value',
intfield: 10,
floatfield: 10.5,
timestampfield: new Date(),
arrayfield: {
list: [{ element: 'arrayfield val1' }, { element: 'arrayfield val2' }],
},
objfield: {
sub1: 'objfield_sub1 val',
sub2: 'objfield_sub2 val',
},
structfield: {
list: [
{
element: {
structfield_array: {
list: [{ element: 'val1' }, { element: 'val2' }],
},
structfield_string: 'structfield_string val',
structfield_struct: {
structfield_struct_string1: 'structfield_struct_string1 val',
structfield_struct_string2: 'structfield_struct_string2 val',
},
},
},
],
},
})
await writer.close()
const example_df = await parquetjs.ParquetReader.openFile(examplePath)
console.log(JSON.stringify(example_df.schema.schema, null, 2))
Hope that helps you as it helps us :)
Acceptance criteria:
Convert these files to TypeScript without using `any` anywhere.
I have TypeScript types for this package in my project: https://github.com/multiprocessio/datastation/blob/master/type-overrides/dsnp__parquetjs.d.ts. I'd rather they were upstreamed, though.
I'm making this issue to discuss including a .d.ts file with this repo so others don't have to write one themselves. If you are not interested, maybe I'll try to upstream it into the @types repo.
We are 3 major versions of Node.js behind. The xxhash-wasm upgrade would be included; we may need incremental upgrades first. See #62
Also:
WARNING: node-v15.12.0 is in LTS Maintenance mode and nearing its end of life.
As part of this, address:
npm WARN deprecated @types/[email protected]: This is a stub types definition. bson provides its own type definitions, so you do not need this installed.
npm WARN deprecated [email protected]: This package has been deprecated in favour of @sinonjs/samsam
BufferReader doesn't have any test coverage at all. We should add some, both unit and integration.
Also to cover: async on BufferReader.read.
To reproduce: try to open a Parquet file with a version other than v1.
Expected: it should open.
Actual: Error invalid parquet version
https://github.com/LibertyDSNP/parquetjs/blob/17cb5ed3533f72f199e6683cc9842935ff07595a/lib/reader.ts#LL24C1-L27C27
https://github.com/LibertyDSNP/parquetjs/blob/17cb5ed3533f72f199e6683cc9842935ff07595a/lib/reader.ts#LL169C1-L171C6
Don't merge until after: #31
We currently have a function force32 that forces 64-bit numbers into 32 bits. This is dangerous because we want parquetjs to handle 64-bit numbers. We should look into removing this function.
Review a15d62d, which is where it was added.
To Do:
Remove force32.
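To illustrate why this is dangerous (not the library's actual code): collapsing a 64-bit value into 32 bits silently corrupts anything above 2^31 - 1.

```typescript
// 4294967297 needs 33 bits; any 32-bit coercion mangles it.
const big = 4294967297;   // 2^32 + 1
const forced = big | 0;   // bitwise ops truncate to int32
console.log(forced);      // 1, data silently lost
```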
Currently this library only supports DECIMAL reading and writing when the precision is <= 18.
Quoting the Parquet spec (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal):
DECIMAL can be used to annotate the following types:
- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
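A quick sketch of that fixed_len_byte_array bound; note it yields 38 digits for 16 bytes, matching the DECIMAL(38,0) request earlier on this page:

```typescript
// floor(log10(2^(8n - 1) - 1)) base-10 digits fit in an n-byte fixed_len_byte_array.
// Using log10(2^k - 1) ≈ k * log10(2), which has the same floor here.
function maxDecimalPrecision(nBytes: number): number {
  return Math.floor((8 * nBytes - 1) * Math.log10(2));
}
console.log(maxDecimalPrecision(16)); // 38
```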
Test Files:
Related Issues:
Setup:
Typescript == 4.9.5
node == 20.0.0
theia == 1.45.0
@dsnp/parquetjs == 1.6.0
Webpack == 5.90.3
I develop a theia application where I added @dsnp/parquetjs to a theia extension with
yarn add @dsnp/parquetjs
After that, @dsnp/parquetjs version 1.6.0 was added. I implemented the ParquetReader example in the backend. The build completed without errors. However, at runtime I get an error in the backend saying that the module util could not be imported from wasm_brotli_nodejs.
To fix this, we make the following changes in wasm_brotli_nodejs.js:
// old
// const { TextDecoder } = require(String.raw`util`);
// this fixes the import error
const { TextDecoder } = require(`util`);
Subsequently, we get another runtime error that wasm_brotli_nodejs_bg.wasm could not be found in applications\theia-browser\lib\backend. This error could be solved by copying the file into the directory. As an alternative, it was also possible to solve this issue with some modifications in wasm_brotli_nodejs_bg.js.
The question is: What has to be done to consume @dsnp/parquetjs without these modifications from our node app with webpack 5?
One of the advantages of this format is the compression. Is there a way to activate compression when creating parquet files with the library?
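For reference, upstream parquetjs exposes compression as a per-column schema option; a sketch assuming this fork keeps that API:

```typescript
import parquet from '@dsnp/parquetjs';

// compression is set per column in the schema definition (assumed API,
// inherited from upstream parquetjs): e.g. GZIP or SNAPPY.
const schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', compression: 'GZIP' },
  price: { type: 'DOUBLE', compression: 'SNAPPY' },
});

const writer = await parquet.ParquetWriter.openFile(schema, 'items.parquet');
await writer.appendRow({ name: 'apple', price: 1.25 });
await writer.close();
```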
As a user of Frequency, it would be nice if Parquetjs had a helper function that took in the Parquet Schema Model data and initialized a ParquetWriter from it.
Data Model in Frequency: https://github.com/LibertyDSNP/frequency/blob/main/common/primitives/src/parquet.rs#L21
Example Schema Conversion helper function: https://github.com/LibertyDSNP/schemas/blob/main/helpers/parquet.ts#L22
- ParquetSchema from row data
- ParquetEnvelopeWriter
- ParquetWriter with all default options
There are other ways to test this, but this is the easiest:
1. Run: mocha -r ts-node/register test/integration.js
2. Change sampleColumnHeaders in test/integration.ts to point at test/test-files/fruits.parquet and not fruits.parquet (which is actually a generated file created by other tests in this file).
3. Run: mocha -r ts-node/register -f "verify statistics" test/integration.js
I would expect the statistics tests to fail, but only due to the statistics values, not in the buffer read. If you import test/test-files/fruits.parquet into https://parquetreader.com, the file is parsed just fine. That leads me to believe it's a bug in this repo.
When debugging this failure, I found the error originates from readFooter. It appears to read bytes beyond the length of the buffer. When I stepped into the code, I found the error being thrown in util.ts, function fread, line 133. The read of fruits.parquet failed here because length = 8 and bytesRead = 4. The position was -4, so it appears the read was trying to go 4 bytes past the end of the buffer.
1) Parquet
with DataPageHeaderV1
verify statistics:
Error: read failed
at ~/github/parquetjs/lib/util.ts:133:23
at FSReqCallback.wrapper [as oncomplete] (node:fs:684:5)
I haven't done more debugging to find out why this is failing, and have not checked the commits to see when the bug appeared.
Try to read a parquet file with SNAPPY compression, e.g this one:
Sample - Superstore(2018)-snappy.parquet.zip
SNAPPY is supposed to be supported, and so we're supposed to be able to read it from the browser too.
An error is thrown and cursor.next() rejects.
TypeError: e.buffer.readInt32LE is not a function
at MR (parquet.esm.js:77:35149)
at Object.QR (parquet.esm.js:77:37481)
at Tn (parquet.esm.js:77:63515)
at rD (parquet.esm.js:77:66120)
at async sb (parquet.esm.js:77:64215)
at async fb (parquet.esm.js:77:64733)
at async Cr.readRowGroup (parquet.esm.js:77:62318)
at async Ry.next (parquet.esm.js:77:54940)
The fix actually seems quite simple. I replaced parquetjs/lib/browser/compression.js line 63 (as of a2cd4ff) with:
return Buffer.from(snappy.uncompress(value));
And it seems to work.
I haven't opened a PR yet because I must admit I'm not 100% sure I included Buffer the right way in my front-end code: I linked to a browserified version of feross's Buffer from my own page, since passing an ArrayBuffer triggered a complaint that I didn't pass a correct parquet file. Is that how it's supposed to be done? It seems unfortunate that we have to load this Buffer package twice (actually 4 times, since it's also in bson and browserfs).
Also, I couldn't find unit tests for this particular script (compression.js), so I'm not sure if this change breaks anything else. I also didn't check whether deflate requires the same treatment, since I don't write files in my project yet.
Hi,
I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:
#!/usr/bin/env python3
import pandas as pd
import time
start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end = time.time()
print(f"Took {end-start}s to read with pyarrow")
start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end = time.time()
print(f"Took {end-start}s to read with fastparquet")
outputs:
Took 0.1700916290283203s to read with pyarrow
Took 0.10409688949584961s to read with fastparquet
Whereas in the browser, using this test HTML/JS:
<html>
<head>
<script type="module">
const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
const buffer_library = await import("https://esm.sh/buffer");
console.log(buffer_library)
console.log(parquet)
const URL = "test/test-files/customer.impala.parquet";
let resp = await fetch(URL)
let buffer = await resp.arrayBuffer()
console.log(buffer)
buffer = buffer_library.Buffer.from(buffer);
const reader = await parquet.ParquetReader.openBuffer(buffer);
//const reader = await parquet.ParquetReader.openUrl(URL);
window.reader = reader
console.log(reader)
var startTime = performance.now()
let cursor = reader.getCursor();
await cursor.next()
console.log(`Time to read first row: ${(performance.now() - startTime)/1000}s`)
let record = null;
while (record = await cursor.next()) {
//console.log(record);
}
var endTime = performance.now()
console.log(`Took ${(endTime - startTime)/1000}s to read ${URL}`)
</script>
</head>
</html>
The console outputs:
Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet
Which is ~10x slower than Python
Any ideas on how to improve browser read performance?
The bulk of the time seems to spent reading the first row.
XxHasher was not returning hex-encoded string values; instead it returns base-10 strings.
Actual: XxHasher.hash64("15") returns "17181926294437511708"
Expected: XxHasher.hash64("15") returns "ee7276ee58e4421c"
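For reference, the two strings are the same 64-bit value; the base-10 output just needs radix-16 formatting:

```typescript
// The reported base-10 hash rendered as hex matches the expected string.
console.log(BigInt("17181926294437511708").toString(16)); // "ee7276ee58e4421c"
```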
Steps:
- Add @dsnp/parquetjs as a dependency.
- Add const { ParquetWriter } = require('@dsnp/parquetjs'); in the main file.
Expected: the application should bundle without errors, and the @dsnp/parquetjs module should be correctly imported and functional when running the bundled code.
Actual: upon running the Webpack bundle, the application throws an error: Error: Cannot find module 'util'. This suggests that Webpack is unable to resolve the util module, a core Node.js module, which is required by @dsnp/parquetjs or its dependencies.
Error: Cannot find module 'util'
at t (index.js:2:2329285)
at 86275 (index.js:2:2327909)
... [Additional stack trace] ...
at 19785 (index.js:2:565811) {
code: 'MODULE_NOT_FOUND'
}
Attempts to provide util in the Webpack configuration did not succeed. Looking for guidance on using @dsnp/parquetjs in a Webpack-bundled Node.js environment.
Run the LZO tests in test/integration.js:
The tests should pass
The round-trip test fails with:
Error: Decompression failed with code: LZO_E_OUTPUT_OVERRUN
at Object.decompress (node_modules/lzo/index.js:59:13)
at Object.inflate_lzo [as inflate] (lib/compression.js:91:14)
at Object.inflate (lib/compression.js:75:52)
at decodeDataPageV2 (lib/reader.js:932:47)
at decodePage (lib/reader.js:710:20)
at decodePages (lib/reader.js:747:28)
at /Users/shannonwells/github.com/ProjectLiberty/parquetjs/lib/reader.js:609:85
at ParquetEnvelopeReader.readRowGroup (lib/reader.js:567:35)
at ParquetCursor.next (lib/reader.js:67:23)
at readTestFile (test/integration.js:293:24)
From: ZJONSSON#81
- RLE encoding and decoding does not work correctly if the dictionary has > 255 entries.
- A column chunk with a dictionary and data pages in both PLAIN and PLAIN_DICTIONARY decodes the PLAIN pages incorrectly.
I'm not sure if there's a problem with the parquet data I'm using, or if this is a bug in the library, but filing anyway.
Parquet file should be written to the console (in JSON?).
Node raises an exception.
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "invalid encoding: RLE_DICTIONARY".] {
code: 'ERR_UNHANDLED_REJECTION'
}
Node.js v18.17.0
parquet-tools, which uses the same parquet.thrift as parquetjs, parses the file OK.
From what I can tell, https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.ts#L704 attempts to load the codec for RLE_DICTIONARY from the parquet_codec hash, as imported via import * as parquet_codec from './codec';.
An issue to track progress and notes on converting all the remaining JS files to TypeScript.
When done this would trigger the ability to close #25 by just building the .d.ts files and including them in the package.
Long story short: it would be cool to generate timestamp fields from the JSON schema.
Currently the script just checks the type from each field definition inside the JSON schema.
For the string type, we would have to extend it to check whether a format property is defined. If the value is date-time, use createTimestampField instead of createStringField, as sketched below.
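A sketch of the proposed check, reusing the createStringField/createTimestampField helpers from the example earlier on this page; the JSON-schema field shape and the function name are assumptions:

```typescript
import { FieldDefinition } from '@dsnp/parquetjs/dist/lib/declare';

type JsonSchemaField = { type: string; format?: string };

// Hypothetical: for string fields, honor format: 'date-time' by emitting a
// timestamp column instead of a plain UTF8 one.
function fieldFromJsonSchema(def: JsonSchemaField): Partial<FieldDefinition> {
  if (def.type === 'string') {
    return def.format === 'date-time'
      ? createTimestampField({})
      : createStringField({});
  }
  // ... other types handled as before
  throw new Error(`unhandled JSON schema type: ${def.type}`);
}
```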
What do you think? If you want, I can create a PR with the changes.
Install the package and import it into a client-side app (React & TypeScript) as mentioned in the instructions:
import parquetjs from '@dsnp/parquetjs/browser/parquetjs';
It should be possible to use the parquetjs instance and all its methods, for example parquetjs.ParquetReader...
First, there's the import error:
Cannot find module '@dsnp/parquetjs/browser/parquetjs' or its corresponding type declarations.ts(2307)
Then, I tried to import it as:
import parquetjs from '@dsnp/parquetjs/dist/browser/parquet';
and the import error is gone but when I try to use it like this
const reader = await parquetjs.ParquetReader.openBuffer(fileDataBuffer);
there's this error
TypeError: _dsnp_parquetjs_dist_browser_parquet__WEBPACK_IMPORTED_MODULE_1___default().ParquetReader is undefined
How can I make it work in the browser?
Just use the default configs.
If the linter has lots of crazy errors that require large refactors, reach out for help, and also use @ts-ignore while filing bug reports.