
libertydsnp / parquetjs


This project forked from zjonsson/parquetjs


Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features

License: MIT License

JavaScript 32.16% Thrift 8.20% TypeScript 59.50% HTML 0.14%
bloom-filter javascript javascript-library parquet

parquetjs's People

Contributors

aletheios, aramikm, asmuth, dgaudet, dominictarr, dopatraman, enddynayn, harryalaw, j4ys0n, jasonyemsft, jeffbski-rga, kessler, kvalev, kyleboyer-optum, markov00, mehtaishita, mpotter, mytusshar, noxify, saraswatpuneet, saritvakrat, shannonwells, si-mw, tusharbochare, waylandli, wgalecki, wilwade, yechunan, zectbynmo, zjonsson


parquetjs's Issues

RLE Boolean Encoding Fails

Even after #112, the test for the standard file rle_boolean_encoding.parquet still fails.

// Tracked in https://github.com/LibertyDSNP/parquetjs/issues/113
it.skip('rle_boolean_encoding.parquet loads', async function() {
const data = await readData('rle/rle_boolean_encoding.parquet');
assert.deepEqual(data[0],{ datatype_boolean: true });
assert.deepEqual(data[1],{ datatype_boolean: false });
});

Upgrade to nodejs 20+ etc

Acceptance criteria

  • changed .tool-versions
  • changed package.json
  • ran npm install to update package-lock.json
  • updated all Git workflows
  • updated @types/node
  • updated README
  • CI should pass
  • thrift should build

Treat this as a template for future updates

Generated test files should be written to /tmp and not the test directory

Problem

This behavior is from the original version that this repo forked from. It writes test files to the test directory and never cleans them up. There are test files already in test/test-files, which are also used by some tests. This can be confusing for debugging and actually did confuse me when trying to debug some test failures.

Secondly, the test named "reads parquet files via http" in test/reader.js depends upon the file generated by bloomFilterIntegration.ts, which is poor test practice.

Solution

  • Write generated test files to /tmp where they will be cleaned up automatically by the system. This is the kind of thing /tmp is for, and IMO this is the best choice rather than having to create and maintain cleanup test code.
  • Replace the hard-coded file names and locations used throughout the tests with constants, and use those instead.

Test code that reads generated files will have to be correctly distinguished from test code that opens test/test-files, and pointed at the files in /tmp.
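
A minimal sketch of the proposed convention, assuming Node's os.tmpdir() and a shared constants module (the file and constant names below are illustrative, not existing repo code):

// test/paths.ts (hypothetical)
import os from "os";
import path from "path";

// Files generated by tests go to the system temp dir, which the OS cleans up.
export const GENERATED_DIR = path.join(os.tmpdir(), "parquetjs-tests");
// Checked-in fixtures stay where they are today.
export const FIXTURE_DIR = path.join(__dirname, "test-files");

export const GENERATED_FRUITS = path.join(GENERATED_DIR, "fruits.parquet");
export const FIXTURE_FRUITS = path.join(FIXTURE_DIR, "fruits.parquet");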

additional support for decimal type

help needed
I am trying to generate a parquet file against an existing schema that has a field with the following data type:
fixed_len_byte_array(16) LICENSE_TERM_IN_MONTHS (DECIMAL(38,0));

Requesting support for this data type.

TS Conversion: Independent Files

Acceptance criteria:
Convert the files in the tasks to TypeScript without using the any type anywhere

  • /lib/bufferReader.js
  • /lib/codec/*
  • /lib/compression.js
  • /lib/types.js

Upgrade to AWS SDK V3

Steps to reproduce

  1. Using https://www.npmjs.com/package/@aws-sdk/client-s3, pass that AWS Client into https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.js#L115
  2. Notice that it errors out because client.getObject() is undefined

Expected behaviour

  1. Ideally it would work with the newer AWS SDK V3

Actual behaviour

It seems to use the AWS SDK V2.

Any other comments?

Not so much a bug per se but rather a request to update this library to support AWS SDK V3

This is how we solved it in our application

// This is a hack file to support things that @dsnp/parquetjs doesn't support quite yet
import { GetObjectCommand, HeadObjectCommand, S3Client, GetObjectCommandInput } from "@aws-sdk/client-s3";
import { Readable } from "stream";
import { Blob } from "buffer";
const parquet = require("@dsnp/parquetjs");
const { ParquetReader, ParquetEnvelopeReader } = parquet;

export const openS3Reader = async (
  client: S3Client,
  params: GetObjectCommandInput,
  options?: any
): Promise<typeof ParquetReader> => {
  const fileStat = async () => {
    const headObjectResult = await client.send(new HeadObjectCommand(params));
    return headObjectResult.ContentLength;
  };

  const readFn = async (offset: number, length: number, file: string): Promise<Buffer> => {
    if (file) {
      return Promise.reject("external references are not supported");
    }
    const Range = `bytes=${offset}-${offset + length - 1}`;
    const response = await client.send(new GetObjectCommand({ ...{ Range }, ...params }));

    const body = response.Body;
    if (body) {
      return streamToBuffer(body);
    }
    return Buffer.of();
  };

  const closeFn = () => ({});

  const envelopeReader = new ParquetEnvelopeReader(readFn, closeFn, fileStat, options);

  return ParquetReader.openEnvelopeReader(envelopeReader, options);
};

async function streamToBuffer(body: any): Promise<Buffer> {
  const blob = body as Blob;
  if (blob.arrayBuffer !== undefined) {
    const arrayBuffer = await blob.arrayBuffer();
    const uint8Array: Uint8Array = new Uint8Array(arrayBuffer);
    return Buffer.from(uint8Array); // Buffer.from avoids the deprecated new Buffer() constructor
  }

  //Assumed to be a Readable like object
  const readable = body as Readable;
  return await new Promise((resolve, reject) => {
    const chunks: Uint8Array[] = [];
    readable.on("data", (chunk) => chunks.push(chunk));
    readable.on("error", reject);
    readable.on("end", () => resolve(Buffer.concat(chunks)));
  });
}
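
A usage sketch of the helper above (region, bucket, and key are placeholders):

const client = new S3Client({ region: "us-east-1" });
const reader = await openS3Reader(client, { Bucket: "my-bucket", Key: "data.parquet" });
const cursor = reader.getCursor();
console.log(await cursor.next());
await reader.close();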

Notes

  • It is OK to lose support for AWS V2 when doing this update

Points: 3

Are bloom filters supported on LIST types?

Hi there 👋

Firstly, thank you for this amazing library!

I'm curious to know how to add bloom filters to LIST types.

For example, given this schema:

{
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" }
            }
          }
        }
      }
    }
  }
}

How do you add a bloom filter for the querystring.list.element.key field?

[
  {
    column: "querystring.list.element.key",
    numDistinct: 100
  }
]

I assume the above won't work? (Sorry in advance if that literally is how you do it!)

Thanks in advance!

Feature - collect and report multiple field errors

Thanks for reporting an issue!

Steps to reproduce

  1. Download this parquet file
  2. attempt to open this parquet with this library const reader = await parquet.ParquetReader.openFile(<path to parquet file>)
  3. You will receive this error invalid parquet type: DECIMAL

Suggested improved behaviour

You should receive an error for each column that had a problem, and each error should include the column name, e.g. invalid parquet type: DECIMAL, for Column: quantity

Actual behaviour

You will receive this error invalid parquet type: DECIMAL

Any other comments?

I've created PR #75 to implement this enhancement.

However, I don't know what process you follow to bring in PRs from outside developers. If there is anything else I need to do to my PR to help get it merged, please let me know.

Fix writer.js bug regarding multiple callbacks

Description

We have an integration test in our "using the Stream/Transform API" test suite that fails for users with Node version != 14:

Callback called multiple times

This failure is not happening in CI, because our remote test runner is using Node version 14.16.0.

Util.ts - PageLocation prototypes

We currently use @ts-ignore to suppress a TypeScript issue with the prototypes of PageLocation and their respective read/write functions.

To Do:

  • Stop mutating the prototype of PageLocation
    • utils.ts
    • reader.ts (Pending #44)

Feature: ZSTD compression

Hello,

Wanted to suggest an interesting feature for this package.
One of the common compression methods for Parquet files is ZSTD.
This algorithm is, however, not currently supported by this package (and is also not supported natively by Node.js).
ZSTD gives very good compression time / decompression time / compression ratio results and would make it easier to exchange Parquet files between Node.js and other programming languages or Parquet file producers.

new release please

Could you please publish a new release with the recent AWS S3 v3 updates?

React App Reader Cursor Throws Error

Steps to reproduce

Containerize a React application with the parquetjs dependency installed, using the Dockerfile below:

Version Info:
React Scripts: 5.0.1
Node: 16.14.2
@dsnp/parquetjs: 1.3.5

package.json build: react-scripts build

# Build stage (the original FROM line was not included in the report; base image assumed from the version info above)
FROM node:16.14.2 AS build
WORKDIR /usr/src/app
ENV PATH /usr/src/app/node_modules/.bin:$PATH

COPY package.json ./package.json
RUN npm cache clean --force && npm install --legacy-peer-deps

COPY . ./

RUN npm run build
FROM nginx:1.21.1-alpine

COPY --from=build /usr/src/app/build /usr/share/nginx/html


RUN rm /etc/nginx/conf.d/default.conf
COPY deploy/nginx/nginx.conf /etc/nginx/conf.d
COPY docker-entrypoint.sh generate-config-js.sh /

EXPOSE 80
CMD ["/docker-entrypoint.sh"]

Expected behaviour

reader.getCursor() should not throw an error.

Actual behaviour

In the containerized React application, the code snippet below throws an error at the getCursor call (see the logs below):

Any logs, error output, etc?

Code:

import parquetjs from "@dsnp/parquetjs/dist/browser/parquet.esm";
const buffer = Buffer.from(arrayData[1], "base64");
const reader = await parquetjs.ParquetReader.openBuffer(buffer);
const cursor = reader.getCursor();

Stacktrace:

helper.ts:478 TypeError: Cannot read properties of null (reading 'includes')
    at new e (parquet.esm.js:78:21246)
    at e.openEnvelopeReader (parquet.esm.js:78:22667)
    at async helper.ts:463:13
    at async Promise.all (:8080/index 0)
// parquet.esm.js
!nP.includes(t.version))throw"invalid parquet version"

The variable nP is null and the script is trying to read its includes attribute.

Any other comments?

Running my React application with the React dev server, no issues arise. Library versions match between the local and containerized environments: Node, @dsnp/parquetjs.

What I have tried:

  • Disabled minification in the production build using the react-app-rewired library
    • Compared the diff between the parquet.esm.js files in the production build and in node_modules (they are equal, no name mangling)
  • Tried various import statements: *.esm, *.cjs
  • Copied the parquet.esm file into my project instead of importing from node_modules, and produced a production build

TS Conversion: Reader/Writer Files

Acceptance criteria:
Convert these files to TypeScript without using the any type anywhere

  • /lib/reader.js
  • /lib/writer.js
    • Type mismatch for 'offset' and 'row count'
    • Write and Close functions for parquet envelope writer
    • Converting row group to columns in metadata (there's no need to loop over row_group)
    • await and flush (callback): not sure about this one
    • parquet_codec encoding type
    • parquet encoding 'statistics'
  • Tests should all pass
    • Int64 Pass by Reference issue
    • BeforeEach test
  • tsc build should pass

scale for DECIMAL field cannot be 0

Thanks for reporting an issue!

Steps to reproduce

Create a schema where a field specifies type DECIMAL and scale 0. Write a row that includes that field.

Expected behaviour

Works.

Actual behaviour

Fails with this error: Failed to generate test file invalid schema for type: DECIMAL, for Column: decimal, scale is required

Any other comments?

Maybe it's not supported for writing?

Either way, I think the line of code that checks for scale is not taking the value 0 into account:
https://github.com/LibertyDSNP/parquetjs/blob/main/lib/schema.ts#L232

According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, "scale must be zero or a positive integer less than or equal to the precision." The library should allow 0 as the scale value. It also seems that scale should be optional, since the doc specifies a default value of 0.
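
The failure is consistent with a truthiness check treating 0 as "missing". A minimal sketch of that pattern and a possible fix (illustrative only, not the repo's actual schema.ts code):

// Buggy pattern: 0 is falsy, so scale: 0 is rejected as "missing".
function validateDecimalBuggy(opts: { precision: number; scale?: number }) {
  if (!opts.scale) throw new Error('scale is required');
}

// Fix: only reject when scale is actually absent or out of range, defaulting to 0.
function validateDecimalFixed(opts: { precision: number; scale?: number }) {
  const scale = opts.scale ?? 0;
  if (!Number.isInteger(scale) || scale < 0 || scale > opts.precision) {
    throw new Error('scale must be an integer between 0 and precision');
  }
  return scale;
}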

Incorporate Upstream Performance Improvements

Hi, thanks for the effort. I have finally found a parquet library that is in active development.
I want to say that #55 and #47 work well.
I have a parquet file with several million rows, and originally reading it row by row was painfully slow. With the changes in these PRs, things are as fast as expected.
The change to Array.shift might not be noticeable with a small array, but with a large array that operation can be very slow depending on the engine implementation.
Please incorporate that fix; it's quick and very helpful in real-world situations.

Originally posted by @peara in #11 (comment)
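
The Array.shift pattern referenced above is O(n) per call on many engines, because the remaining elements are reindexed every time. An illustrative before/after of the pattern (not the library's actual code):

// Before: shift() reindexes the remaining elements on every call -> O(n^2) overall.
function drainWithShift(rows: unknown[], process: (row: unknown) => void) {
  while (rows.length > 0) {
    process(rows.shift());
  }
}

// After: walk the array with an index cursor; same order, no reindexing.
function drainWithCursor(rows: unknown[], process: (row: unknown) => void) {
  for (let i = 0; i < rows.length; i++) {
    process(rows[i]);
  }
}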

TS Conversion: Types in Build & Release

Blocked: #29
Acceptance criteria:

  • Building browser & commonjs
  • Confirm no js files remaining to convert
  • Types are published with the browser and commonjs builds
  • Release Notes Updated
  • Release new version

Closes #25

Missing column index information in generated parquet file

Hi,

I currently have a problem with the generated parquet file: under (yet) unknown circumstances I get the following error:

Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'null_pages' was not present! Struct: ColumnIndex(null_pages:null, min_values:[69 74 41 50 53 5F 30 31], max_values:[69 74 41 50 53 5F 30 31], boundary_order:null)
	at org.apache.parquet.format.ColumnIndex.validate(ColumnIndex.java:782)
	at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:918)
	at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:818)
	at org.apache.parquet.format.ColumnIndex.read(ColumnIndex.java:722)
	at org.apache.parquet.format.Util.read(Util.java:363)
	... 47 more

I have already analyzed it a bit and found a difference in the schema.

To check the schema, I have used pqrs ( https://github.com/manojkarthick/pqrs ).

In the parquet file generated via parquetjs, the metadata section is empty:

version: 1
num of rows: 2
created by: @dsnp/parquetjs
metadata:
message root {
  OPTIONAL BYTE_ARRAY change_id (UTF8);
  OPTIONAL BYTE_ARRAY status (UTF8);
  OPTIONAL BYTE_ARRAY approval_status (UTF8);

In the parquet file generated via pyspark, the metadata is filled:

version: 1
num of rows: 4096
created by: parquet-mr version 1.12.2 (build f2610ad5b0d33f2882d1d235f0ecbb70da391aea)
metadata:
  org.apache.spark.version: 3.3.2
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"change_id","type":"string","nullable":true,"metadata":{}},{"name":"status","type":"string","nullable":true,"metadata":{}},{"name":"approval_status","type":"string","nullable":true,"metadata":{}},]}
message spark_schema {
  OPTIONAL BYTE_ARRAY change_id (STRING);
  OPTIONAL BYTE_ARRAY status (STRING);
  OPTIONAL BYTE_ARRAY approval_status (STRING);

Here is the snippet I use to generate the parquet file:

import parquet from '@dsnp/parquetjs'

const schema = new parquet.ParquetSchema({
  change_id: parquet.ParquetFieldBuilder.createStringField(true),
  status: parquet.ParquetFieldBuilder.createStringField(true),
  approval_status: parquet.ParquetFieldBuilder.createStringField(true),
})

const records = [
  {
    "change_id": "C-01",
    "status": "closed",
    "approval_status": "approved",
  },
  {
    "change_id": "C-02",
    "status": "closed",
    "approval_status": "approved",
  },
]

export default async function main() {
  const writer = await parquet.ParquetWriter.openFile(
    schema,
    'change.parquet'
  )

  for (const record of records) {
    await writer.appendRow(record)
  }
  await writer.close()
}

Steps to reproduce

So far I haven't found the code snippet that produces the error.

Running

df = spark.read.parquet('<path_to_parquetjs_generated_file>')
df.show()

directly in a Jupyter Notebook doesn't trigger the error.
It could be somewhere in our calculations/joins which we call in the original script, but I have to analyze this.

Expected behaviour

The error isn't shown 🙈

Any other comments?

To be honest, I'm not sure if the missing metadata is the root cause of this issue.

In the next few days I will try to provide some example files with the relevant Python code to trigger the error, but I have to finish my work first to make sure our customer is happy :)


Update 1:

The current workaround is a notebook that reads the parquet file and saves it under a new name:

df_change = spark.read.parquet(f'{TEST_DATA_PATH}/change.parquet')
df_change.write.mode("overwrite").parquet(f'{TEST_DATA_PATH}/change_spark.parquet')

This solves the issue for now - but not really what I want :D


Update 2:

While trying to find someone else with the same issue, I have found the following:
https://repost.aws/questions/QUSdc0Pgo9RtSoHOSBwTi8PQ/hive-cannot-open-split-can-not-read-class-org-apache-parquet-format-columnindex

RLE example does not work - 'bitWidth' does not exist in type 'FieldDefinition'

Steps to reproduce

  • Install parquetjs from npm
  • Try to compile the following code from the README:
import parquetjs from "@dsnp/parquetjs";
var schema1 = new parquetjs.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});

Expected behaviour

It should at least compile.

Actual behaviour

Error because there is no bitWidth property on the FieldDefinition type.

Object literal may only specify known properties, and 'bitWidth' does not exist in type 'FieldDefinition'.ts(2353)
declare.d.ts(14, 5): 
The expected type comes from this index signature.
(property) bitWidth: number
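
Until FieldDefinition declares bitWidth, one workaround is a type assertion so the README example at least compiles (a sketch only; it assumes the runtime already accepts the option, which the type declaration simply does not describe yet):

import parquetjs from "@dsnp/parquetjs";
import { FieldDefinition } from "@dsnp/parquetjs/dist/lib/declare";

// bitWidth is not part of FieldDefinition yet, so widen the literal's type explicitly.
const age = {
  type: "UINT_32",
  encoding: "RLE",
  bitWidth: 7,
} as unknown as FieldDefinition;

const schema1 = new parquetjs.ParquetSchema({ age });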

Reader `for await` Support

It would be nice to be able to use for await with the reader for easy async reading.

To this end, ParquetReader would need to implement Symbol.asyncIterator. It could be something as simple as:

  async* [Symbol.asyncIterator]() {
    const cursor = this.getCursor();
    let record = null;
    while ((record = await cursor.next())) {
      yield record;
    }
  }
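
With that iterator in place, reading becomes a plain for await loop (usage sketch):

import parquet from "@dsnp/parquetjs";

const reader = await parquet.ParquetReader.openFile("fruits.parquet");
for await (const record of reader) {
  console.log(record);
}
await reader.close();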

Review PRs to ZJONSSON and decide what we should bring in

There are a bunch of PRs against ZJONSSON/parquetjs, our forked source. Let's keep a list below along with their review status. We should automatically ignore anything that has failed its CI. We can then make separate PRs for each one we want to integrate.

Statistics = true on a schema type "BYTE_ARRAY" with Uint8Array value throws exception

If you have type "BYTE_ARRAY" in your schema and that field contains Uint8Array data, then when parquetjs goes to write the header statistics on close, the statistics data is wrong: it attempts to call "copy" on a Uint8Array, expecting a Buffer.

To reproduce:

  • Create a schema with a field of type "BYTE_ARRAY"
  • Add a row to the parquet file using Uint8Array data
  • Call parquetjs.close()
  • Should throw the error.

Note: Check the other forks of Parquetjs that might have a fix for this.
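
A minimal repro sketch following the steps above (file path and field name are arbitrary):

import parquet from "@dsnp/parquetjs";

const schema = new parquet.ParquetSchema({
  raw: { type: "BYTE_ARRAY", statistics: true },
});

const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/byte-array-repro.parquet");
await writer.appendRow({ raw: new Uint8Array([1, 2, 3]) }); // Uint8Array, not Buffer
await writer.close(); // statistics handling expects Buffer#copy here and throws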

Snippet: Generate parquet schema

Hey guys,

We're currently integrating the parquetjs package into our datalake-graphql-wrapper to provide the functionality to upload data into our data lake via GraphQL.

After some trial and error we were able to generate a parquet file which can be used in the Trino cluster.

Not sure if someone else has already had this problem or a solution... anyway, here are our helper functions plus an example.

// License Apache-2
// helpers/parquet.ts

import { FieldDefinition, ParquetType } from '@dsnp/parquetjs/dist/lib/declare'

export function createStringField({
  optional = true,
}: Partial<{
  optional: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'UTF8', optional })
}

export function createBooleanField({
  optional = true,
}: {
  optional: boolean
}): Partial<FieldDefinition> {
  return createField({ type: 'BOOLEAN', optional })
}

export function createIntField({
  optional = true,
}: {
  optional?: boolean
}): Partial<FieldDefinition> {
  return createField({ type: 'INT64', optional })
}

export function createFloatField({
  optional = true,
}: Partial<{
  optional: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'FLOAT', optional })
}

export function createDecimalField({
  precision = 3,
  optional = true,
}: Partial<{
  precision?: number
  optional?: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'DECIMAL', precision, optional })
}

export function createTimestampField({
  optional = true,
}: Partial<{
  optional?: boolean
}>) {
  return createField({ type: 'TIMESTAMP_MILLIS', optional })
}

export function createRepeatableStructField({
  fields,
}: {
  fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
  return {
    optional: true,
    type: 'LIST',
    fields: {
      list: {
        optional: false,
        repeated: true,
        fields: {
          element: {
            optional: true,
            repeated: false,
            fields: fields,
          },
        },
      },
    },
  }
}

export function createStructField({
  fields,
}: {
  fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
  return {
    optional: true,
    fields: fields,
  }
}

export function createArrayField({
  type,
  optional = true,
}: Partial<{
  type: ParquetType
  optional?: boolean
}>): Partial<FieldDefinition> {
  return createField({
    optional,
    type: 'LIST',
    fields: {
      list: {
        optional: false,
        repeated: true,
        fields: {
          element: {
            type,
            optional: true,
          },
        },
      },
    },
  })
}

export function createField(
  definition: FieldDefinition,
): Partial<FieldDefinition> {
  return definition
}

And here is a "short" example:

// License Apache-2

import path from 'path'
import parquetjs from '@dsnp/parquetjs'
import {
  createArrayField,
  createFloatField,
  createIntField,
  createRepeatableStructField,
  createStringField,
  createStructField,
  createTimestampField,
} from './helpers/parquet'

const examplePath = path.resolve('test_parquet.parquet')

const parquetSchema = new parquetjs.ParquetSchema({
  stringfield: createStringField({}),
  intfield: createIntField({}),
  floatfield: createFloatField({}),
  timestampfield: createTimestampField({}),
  arrayfield: createArrayField({ type: 'UTF8' }),

  objfield: createStructField({
    fields: {
      sub1: createStringField({}),
      sub2: createStringField({}),
    },
  }),

  structfield: createRepeatableStructField({
    fields: {
      structfield_array: createArrayField({ type: 'UTF8' }),
      structfield_string: createStringField({}),
      structfield_struct: createStructField({
        fields: {
          structfield_struct_string1: createStringField({}),
          structfield_struct_string2: createStringField({}),
        },
      }),
    },
  }),
})

const writer = await parquetjs.ParquetWriter.openFile(
  parquetSchema,
  examplePath,
)

await writer.appendRow({
  stringfield: 'string value',
  intfield: 10,
  floatfield: 10.5,
  timestampfield: new Date(),

  arrayfield: {
    list: [{ element: 'arrayfield val1' }, { element: 'arrayfield val2' }],
  },

  objfield: {
    sub1: 'objfield_sub1 val',
    sub2: 'objfield_sub2 val',
  },

  structfield: {
    list: [
      {
        element: {
          structfield_array: {
            list: [{ element: 'val1' }, { element: 'val2' }],
          },
          structfield_string: 'structfield_string val',
          structfield_struct: {
            structfield_struct_string1: 'structfield_struct_string1 val',
            structfield_struct_string2: 'structfield_struct_string2 val',
          },
        },
      },
    ],
  },
})

await writer.close()

const example_df = await parquetjs.ParquetReader.openFile(examplePath)

console.log(JSON.stringify(example_df.schema.schema, null, 2))

Hope that helps you as it helps us :)

TS Conversion: Utility Files

Acceptance criteria:
Convert these files to TypeScript without using the any type anywhere

  • /lib/util.js
  • /lib/schema.js
  • /lib/shred.js
  • /lib/compression.js

Upgrade to Node 18+

We are 3 major versions of Node.js behind. The xxwasm upgrade would be included; incremental upgrades may be needed first. See #62

Also:

WARNING: node-v15.12.0 is in LTS Maintenance mode and nearing its end of life.

As part of this, address:

npm WARN deprecated @types/[email protected]: This is a stub types definition. bson provides its own type definitions, so you do not need this installed.
npm WARN deprecated [email protected]: This package has been deprecated in favour of @sinonjs/samsam

Add Test Coverage to BufferReader

BufferReader doesn't have any test coverage at all. We should add some, both unit and integration.

  • Add unit tests to BufferReader
  • Add integration test to BufferReader through ParquetEnvelopeReader
  • Remove (if possible) redundant async on BufferReader.read

Util.ts - force32

Don't merge until after: #31

We currently have a function force32 that forces 64-bit numbers into 32 bits. This is dangerous because we want parquetjs to handle 64-bit numbers. We should look into removing this function; an example of the risk follows the To Do list below.

Review a15d62d which is where it was added.

To Do:

  • Remove the exported function force32
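
A small illustration of why the coercion is dangerous (plain JavaScript semantics, not the util.ts code):

// Bitwise operations coerce to 32 bits, silently corrupting larger values.
const big = 2n ** 40n;          // 1099511627776 fits comfortably in an INT64 column
const forced = Number(big) | 0; // "| 0" keeps only the low 32 bits
console.log(forced);            // 0 -- the original value is gone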

Decimal Support for Binary Precision

Currently this library only supports DECIMAL reading and writing when the precision is <= 18

Quoting the Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

DECIMAL can be used to annotate the following types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a
    warning
  • fixed_len_byte_array: precision is limited by the array size. Length n
    can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of
    bytes to store the unscaled value should be used.

Test Files:

Related Issues:

  • Thanks to @nirmal82 for bringing this up in #81
  • Current Decimal Read Support Added: #79
  • Current Decimal Write Support Added: #90
  • Byte array Support added: #97

Module util could not be found

Steps to reproduce

Setup:
Typescript == 4.9.5
node == 20.0.0
theia == 1.45.0
@dsnp/parquetjs == 1.6.0
Webpack == 5.90.3

I develop a theia application where I added @dsnp/parquetjs to a theia extension with

yarn add @dsnp/parquetjs

After that, @dsnp/parquetjs version 1.6.0 was added. I implemented the ParquetReader example in the backend. The build completed without any errors. However, at runtime I get an error in the backend saying that the module util could not be imported from wasm_brotli_nodejs.
To fix this, we made the following change in wasm_brotli_nodejs.js:

// old
// const { TextDecoder } = require(String.raw`util`);
// this fixes the import error
const { TextDecoder } = require(`util`);

Subsequently, we get another runtime error that wasm_brotli_nodejs_bg.wasm could not be found in applications\theia-browser\lib\backend. This error could be solved by copying the file into the directory. As an alternative, it was also possible to solve this issue with some modifications in wasm_brotli_nodejs_bg.js.

The question is: what has to be done to consume @dsnp/parquetjs from our Node app with Webpack 5 without these modifications?

Compression

One of the advantages of this format is compression. Is there a way to enable compression when creating parquet files with the library?
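
Compression is selected per column in the schema definition. A short sketch following the upstream parquetjs convention (codec availability can vary between the Node and browser builds):

import parquet from "@dsnp/parquetjs";

const schema = new parquet.ParquetSchema({
  name:  { type: "UTF8",   compression: "SNAPPY" },
  price: { type: "DOUBLE", compression: "GZIP" },
});

const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/compressed.parquet");
await writer.appendRow({ name: "banana", price: 1.25 });
await writer.close();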

Support Frequency Parquet Schema Model Helper Function

As a user of Frequency, it would be nice if Parquetjs had a helper function that took in the Parquet Schema Model data and initialized a ParquetWriter from it (a rough sketch follows the list below).

Data Model in Frequency: https://github.com/LibertyDSNP/frequency/blob/main/common/primitives/src/parquet.rs#L21
Example Schema Conversion helper function: https://github.com/LibertyDSNP/schemas/blob/main/helpers/parquet.ts#L22

  • Infer ParquetSchema from row data
  • Construct empty ParquetEnvelopeWriter
  • Return ParquetWriter with all default options
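
A very rough sketch of the requested helper's shape (the FrequencyColumn type and the type mapping are hypothetical placeholders, not the real Frequency model):

import parquet from "@dsnp/parquetjs";
import { FieldDefinition, ParquetType } from "@dsnp/parquetjs/dist/lib/declare";

// Hypothetical shape of one Frequency Parquet Schema Model column.
type FrequencyColumn = { name: string; column_type: string };

// Hypothetical mapping from Frequency column types to parquetjs types.
const TYPE_MAP: Record<string, ParquetType> = { Boolean: "BOOLEAN", Integer: "INT64", String: "UTF8" };

async function writerFromFrequencyModel(model: FrequencyColumn[], path: string) {
  const fields: Record<string, FieldDefinition> = {};
  for (const col of model) {
    fields[col.name] = { type: TYPE_MAP[col.column_type] ?? "UTF8", optional: true };
  }
  const schema = new parquet.ParquetSchema(fields);
  return parquet.ParquetWriter.openFile(schema, path); // all default writer options
}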

Valid parquet v1 file test/test-files.parquet fails to be read by tests in integration.js

Steps to reproduce

There are other ways to test this but this is the easiest:

  1. Verify that all the tests in test/integration.js pass:
    mocha -r ts-node/register test/integration.js
  2. Change function sampleColumnHeaders in test/integration.ts to point at test/test-files/fruits.parquet and not fruits.parquet (which actually is a generated file created by other tests in this file).
  3. Run one of the integration tests, for example:
    mocha -r ts-node/register -f "verify statistics" test/integration.js
  4. Note that two tests fail in util.ts with 'read failed'.

Expected behaviour

I would expect the statistics tests to fail, but only due to the statistics values, not in the buffer read. If you import test/test-files/fruits.parquet into https://parquetreader.com, the file is parsed just fine. That leads me to believe it's a bug in this repo.

Actual behaviour

When debugging this failure, I found the error originates from readFooter. It appears to read bytes beyond the length of the buffer. When I stepped into the code I found the error being thrown in util.ts, function fread, line 133. The file fruits.parquet read failed here because length = 8 and bytesRead = 4. The position was -4. So it appears the read was trying to go 4 bytes past the end of the buffer.

Any logs, error output, etc?

 1) Parquet
       with DataPageHeaderV1
         verify statistics:
     Error: read failed
      at ~/github/parquetjs/lib/util.ts:133:23
      at FSReqCallback.wrapper [as oncomplete] (node:fs:684:5)

Any other comments?

I haven't done more debugging to find out why this is failing, and have not checked the commits to see when the bug appeared.

Reading snappy compressed files doesn't seem to work in browser

Steps to reproduce

Try to read a parquet file with SNAPPY compression, e.g. this one:
Sample - Superstore(2018)-snappy.parquet.zip

Expected behaviour

SNAPPY is supposed to be supported, so we should be able to read it from the browser too.

Actual behaviour

An error is thrown and cursor.next() rejects.

Any logs, error output, etc?

TypeError: e.buffer.readInt32LE is not a function
at MR (parquet.esm.js:77:35149)
at Object.QR (parquet.esm.js:77:37481)
at Tn (parquet.esm.js:77:63515)
at rD (parquet.esm.js:77:66120)
at async sb (parquet.esm.js:77:64215)
at async fb (parquet.esm.js:77:64733)
at async Cr.readRowGroup (parquet.esm.js:77:62318)
at async Ry.next (parquet.esm.js:77:54940)

Any other comments?

The fix seems actually quite simple, I replaced

return snappy.uncompress(value);

With

    return Buffer.from(snappy.uncompress(value));

And it seems to work.

I haven't opened a PR yet because I must admit I'm not 100% sure I included Buffer the right way in my front-end code: I linked to a browserified version of feross's Buffer from my own page, since passing an ArrayBuffer complained that I didn't pass a correct parquet file. Is that how it's supposed to be done? It seems unfortunate that we have to load this Buffer package twice (actually 4 times, since it's also in bson and browserfs).
Also, I couldn't find unit tests for this particular script (compression.js), so I'm not sure if this change breaks anything else, and I didn't check whether deflate requires the same treatment; I don't write files in my project yet.

Performance of cursor.next() could be improved with typedarray

Hi,

I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:

#!/usr/bin/env python3

import pandas as pd
import time

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end = time.time()
print(f"Took {end-start}s to read with pyarrow")

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end = time.time()
print(f"Took {end-start}s to read with fastparquet")

outputs:

Took 0.1700916290283203s to read with pyarrow
Took 0.10409688949584961s to read with fastparquet

Whereas in the browser, using this test HTML/JS:

<html>
  <head>
    <script type="module">
      const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
      const buffer_library = await import("https://esm.sh/buffer");
      console.log(buffer_library)
      console.log(parquet)
      const URL = "test/test-files/customer.impala.parquet";
      let resp = await fetch(URL)
      let buffer = await resp.arrayBuffer()
      console.log(buffer)
      buffer = buffer_library.Buffer.from(buffer);
      const reader = await parquet.ParquetReader.openBuffer(buffer);
      //const reader = await parquet.ParquetReader.openUrl(URL);
      window.reader = reader
      console.log(reader)
      var startTime = performance.now()
      let cursor = reader.getCursor();
      await cursor.next()
      console.log(`Time to read first row: ${(performance.now() - startTime)/1000}s`)
      let record = null;
      while (record = await cursor.next()) {
        //console.log(record);
      }
      var endTime = performance.now()
      console.log(`Took ${(endTime - startTime)/1000}s to read ${URL}`)
    </script>
  </head>
</html>

The console outputs:

Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet

Which is ~10x slower than Python

Any ideas on how to improve browser read performance?

The bulk of the time seems to be spent reading the first row.

Xxhasher not returning hex-encoded string values

XxHasher was not returning hex-encoded string values; instead it returns the base-10 representation.

Steps to reproduce

XxHasher.hash64("15") returns "17181926294437511708"

Expected behaviour

XxHasher.hash64("15") returns "ee7276ee58e4421c"
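
For reference, the reported base-10 value is the same 64-bit hash, just not hex-encoded; the expected output falls out of a simple radix conversion:

// The decimal string from the report converts to the expected hex digest.
const decimal = "17181926294437511708";
const hex = BigInt(decimal).toString(16).padStart(16, "0");
console.log(hex); // "ee7276ee58e4421c"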

Webpack Issues

Steps to Reproduce

  1. Set up a Node.js project with @dsnp/parquetjs as a dependency.
  2. Configure Webpack for the project with the following settings:
    • Target: 'node'
    • Output library target: 'umd'
    • Entry: [Point to the main file of the test package]
  3. Include the import statement const { ParquetWriter } = require('@dsnp/parquetjs'); in the main file.
  4. Run Webpack to bundle the project.
  5. Execute the bundled code.

Expected Behaviour

The application should bundle without errors and the @dsnp/parquetjs module should be correctly imported and functional when running the bundled code.

Actual Behaviour

Upon running the Webpack bundle, the application throws an error: Error: Cannot find module 'util'. This suggests that Webpack is unable to resolve the util module, a core Node.js module, which is required by @dsnp/parquetjs or its dependencies.

Any Logs, Error Output, Etc.?

Error: Cannot find module 'util'
    at t (index.js:2:2329285)
    at 86275 (index.js:2:2327909)
    ... [Additional stack trace] ...
    at 19785 (index.js:2:565811) {
  code: 'MODULE_NOT_FOUND'
}

Any Other Comments?

  • Attempts to resolve the issue by adding a fallback configuration for util in Webpack did not succeed.
  • The project works as expected without Webpack bundling.
  • This issue seems to arise specifically when using @dsnp/parquetjs in a Webpack-bundled Node.js environment.
  • Any insights or recommendations on configuration changes or workarounds would be greatly appreciated.

LZO and LZO_RAW Support

Steps to reproduce

Run the LZO tests in test/integration.js

Expected behaviour

The tests should pass

Actual behaviour

The round-trip test fails with:

Error: Decompression failed with code: LZO_E_OUTPUT_OVERRUN
    at Object.decompress (node_modules/lzo/index.js:59:13)
    at Object.inflate_lzo [as inflate] (lib/compression.js:91:14)
    at Object.inflate (lib/compression.js:75:52)
    at decodeDataPageV2 (lib/reader.js:932:47)
    at decodePage (lib/reader.js:710:20)
    at decodePages (lib/reader.js:747:28)
    at /Users/shannonwells/github.com/ProjectLiberty/parquetjs/lib/reader.js:609:85
    at ParquetEnvelopeReader.readRowGroup (lib/reader.js:567:35)
    at ParquetCursor.next (lib/reader.js:67:23)
    at readTestFile (test/integration.js:293:24)

invalid encoding: RLE_DICTIONARY

I'm not sure if there's a problem with the parquet data I'm using, or if this is a bug in the library, but filing anyway.

Steps to reproduce

  1. Create a parquet file with RLE_DICTIONARY encoding.
  2. Parse the file with the reader example: https://github.com/LibertyDSNP/parquetjs/blob/main/examples/reader.js

Expected behaviour

Parquet file should be written to the console (in JSON?).

Actual behaviour

Node raises an exception.

Any logs, error output, etc?

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "invalid encoding: RLE_DICTIONARY".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v18.17.0

Any other comments?

parquet-tools, which uses the same parquet.thrift as parquetjs, parses the file OK.

From what I can tell, https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.ts#L704 attempts to load the codec for RLE_DICTIONARY from the parquet_codec hash, as imported via import * as parquet_codec from './codec';.

Conversion to Full TypeScript

An issue to track progress and notes on converting all the remaining JS files to TypeScript.

  • Document the list of things to convert
  • Document the chunks that are reasonable to convert in
  • Document a general estimate of time to convert

When done this would trigger the ability to close #25 by just building the .d.ts files and including them in the package.

Feature Request: Timestamp support for `schema.fromJsonSchema`

Long story short: It would be cool to generate timestamp fields from the json schema.

Currently the script just checks the type of each field definition inside the JSON schema.
For the string type we would have to extend it to check whether a format property is defined.

If the value is date-time, use createTimestampField instead of createStringField.

What do you think? If you want, I can create a PR with the changes.
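
A rough sketch of the proposed check (the property names follow JSON Schema; the return values mirror what createTimestampField and createStringField produce in the snippet further up, and the real implementation may differ):

// Pick a timestamp field when a JSON Schema string property declares format: "date-time".
type JsonSchemaProperty = { type: string; format?: string };

function fieldFromStringProperty(def: JsonSchemaProperty) {
  if (def.type === "string" && def.format === "date-time") {
    return { type: "TIMESTAMP_MILLIS", optional: true }; // as createTimestampField would
  }
  return { type: "UTF8", optional: true }; // as createStringField would
}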

Not working in the browser

Steps to reproduce

Install the package and import it into a client-side app (React & TypeScript) as mentioned in the instructions:
import parquetjs from '@dsnp/parquetjs/browser/parquetjs';

Expected behaviour

Should be possible to use the parquetjs instance and all the methods, for example parquetjs.ParquetReader...

Actual behaviour

First, there's the import error:

Cannot find module '@dsnp/parquetjs/browser/parquetjs' or its corresponding type declarations.ts(2307)

Then, I tried to import it as:
import parquetjs from '@dsnp/parquetjs/dist/browser/parquet';
and the import error is gone but when I try to use it like this

const reader = await parquetjs.ParquetReader.openBuffer(fileDataBuffer);

there's this error

TypeError: _dsnp_parquetjs_dist_browser_parquet__WEBPACK_IMPORTED_MODULE_1___default().ParquetReader is undefined

How can I make it work in the browser?

Add Linter and Formatter

  • Eslint
  • Prettier

Just use the default configs

Warning

If the linter has lots of crazy errors that require large refactors, reach out for help and also use @ts-ignore while filing bug reports
