
libertydsnp / parquetjs


This project forked from zjonsson/parquetjs


Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features

License: MIT License

JavaScript 32.16% Thrift 8.20% TypeScript 59.50% HTML 0.14%
bloom-filter javascript javascript-library parquet

parquetjs's People

Contributors

aletheios, aramikm, asmuth, dgaudet, dominictarr, dopatraman, enddynayn, harryalaw, j4ys0n, jasonyemsft, jeffbski-rga, kessler, kvalev, kyleboyer-optum, markov00, mehtaishita, mpotter, mytusshar, noxify, saraswatpuneet, saritvakrat, shannonwells, si-mw, tusharbochare, waylandli, wgalecki, wilwade, yechunan, zectbynmo, zjonsson


parquetjs's Issues

RLE Boolean Encoding Fails

Even after #112, the test for the standard file rle_boolean_encoding.parquet still fails.

// Tracked in https://github.com/LibertyDSNP/parquetjs/issues/113
it.skip('rle_boolean_encoding.parquet loads', async function() {
const data = await readData('rle/rle_boolean_encoding.parquet');
assert.deepEqual(data[0],{ datatype_boolean: true });
assert.deepEqual(data[1],{ datatype_boolean: false });
});

Upgrade to nodejs 20+ etc

Acceptance criteria

  • changed .tool-versions
  • changed package.json
  • ran npm install to update package-lock.json
  • updated all Git workflows
  • updated @types/node
  • updated README
  • CI should pass
  • thrift should build

Treat this as a template for future updates

Generated test files should be written to /tmp and not the test directory

Problem

This behavior is from the original version that this repo forked from. It writes test files to the test directory and never cleans them up. There are test files already in test/test-files, which are also used by some tests. This can be confusing for debugging and actually did confuse me when trying to debug some test failures.

Secondly, the test named "reads parquet files via http" in test/reader.js depends upon the file generated by bloomFilterIntegration.ts, which is poor test practice.

Solution

  • Write generated test files to /tmp where they will be cleaned up automatically by the system. This is the kind of thing /tmp is for, and IMO this is the best choice rather than having to create and maintain cleanup test code.
  • Replace the hard-coded file names and locations used throughout the tests with constants, and use those instead.

Test code that reads generated files will have to be correctly distinguished from test code that opens test/test-files, and pointed at the files in /tmp.
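
A minimal sketch of the proposed convention, assuming Node's os.tmpdir() and a shared constants module (the file and constant names below are illustrative, not existing repo code):

// test/paths.ts (hypothetical)
import os from "os";
import path from "path";

// Files generated by tests go to the system temp dir, which the OS cleans up.
export const GENERATED_DIR = path.join(os.tmpdir(), "parquetjs-tests");
// Checked-in fixtures stay where they are today.
export const FIXTURE_DIR = path.join(__dirname, "test-files");

export const GENERATED_FRUITS = path.join(GENERATED_DIR, "fruits.parquet");
export const FIXTURE_FRUITS = path.join(FIXTURE_DIR, "fruits.parquet");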

additional support for decimal type

help needed
I am trying to generate a parquet file against an existing schema that has a field with the following data type:
fixed_len_byte_array(16) LICENSE_TERM_IN_MONTHS (DECIMAL(38,0));

Requesting support for this data type.

TS Conversion: Independent Files

Acceptance criteria:
Convert the files in the tasks to TypeScript without using the any type anywhere

  • /lib/bufferReader.js
  • /lib/codec/*
  • /lib/compression.js
  • /lib/types.js

Upgrade to AWS SDK V3

Steps to reproduce

  1. Using https://www.npmjs.com/package/@aws-sdk/client-s3, pass that AWS Client into https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.js#L115
  2. Notice that it errors out because client.getObject() is undefined

Expected behaviour

  1. Ideally it would work with the newer AWS SDK V3

Actual behaviour

It seems to use the AWS SDK V2.

Any other comments?

Not so much a bug per se but rather a request to update this library to support AWS SDK V3

This is how we solved it in our application

// This is a hack file to support things that @dsnp/parquetjs doesn't support quite yet
import { GetObjectCommand, HeadObjectCommand, S3Client, GetObjectCommandInput } from "@aws-sdk/client-s3";
import { Readable } from "stream";
import { Blob } from "buffer";
const parquet = require("@dsnp/parquetjs");
const { ParquetReader, ParquetEnvelopeReader } = parquet;

export const openS3Reader = async (
  client: S3Client,
  params: GetObjectCommandInput,
  options?: any
): Promise<typeof ParquetReader> => {
  const fileStat = async () => {
    const headObjectResult = await client.send(new HeadObjectCommand(params));
    return headObjectResult.ContentLength;
  };

  const readFn = async (offset: number, length: number, file: string): Promise<Buffer> => {
    if (file) {
      return Promise.reject("external references are not supported");
    }
    const Range = `bytes=${offset}-${offset + length - 1}`;
    const response = await client.send(new GetObjectCommand({ ...{ Range }, ...params }));

    const body = response.Body;
    if (body) {
      return streamToBuffer(body);
    }
    return Buffer.of();
  };

  const closeFn = () => ({});

  const envelopeReader = new ParquetEnvelopeReader(readFn, closeFn, fileStat, options);

  return ParquetReader.openEnvelopeReader(envelopeReader, options);
};

async function streamToBuffer(body: any): Promise<Buffer> {
  const blob = body as Blob;
  if (blob.arrayBuffer !== undefined) {
    const arrayBuffer = await blob.arrayBuffer();
    const uint8Array: Uint8Array = new Uint8Array(arrayBuffer);
    return Buffer.from(uint8Array); // Buffer.from avoids the deprecated new Buffer() constructor
  }

  //Assumed to be a Readable like object
  const readable = body as Readable;
  return await new Promise((resolve, reject) => {
    const chunks: Uint8Array[] = [];
    readable.on("data", (chunk) => chunks.push(chunk));
    readable.on("error", reject);
    readable.on("end", () => resolve(Buffer.concat(chunks)));
  });
}
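
A usage sketch of the helper above (region, bucket, and key are placeholders):

const client = new S3Client({ region: "us-east-1" });
const reader = await openS3Reader(client, { Bucket: "my-bucket", Key: "data.parquet" });
const cursor = reader.getCursor();
console.log(await cursor.next());
await reader.close();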

Notes

  • It is OK to lose support for AWS V2 when doing this update

Points: 3

Are bloom filters supported on LIST types?

Hi there 👋

Firstly, thank you for this amazing library!

I'm curious to know how to add bloom filters to LIST types.

For example, given this schema:

{
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" }
            }
          }
        }
      }
    }
  }
}

How do you add a bloom filter for the querystring.list.element.key field?

[
  {
    column: "querystring.list.element.key",
    numDistinct: 100
  }
]

I assume the above won't work? (Sorry in advance if that literally is how you do it!)

Thanks in advance!

Feature - collect and report multiple field errors

Thanks for reporting an issue!

Steps to reproduce

  1. Download this parquet file
  2. attempt to open this parquet with this library const reader = await parquet.ParquetReader.openFile(<path to parquet file>)
  3. You will receive this error invalid parquet type: DECIMAL

Suggested improved behaviour

You should receive an error for each column that had a problem, and each error should include the column name, e.g. invalid parquet type: DECIMAL, for Column: quantity

Actual behaviour

You will receive this error invalid parquet type: DECIMAL

Any other comments?

I've created PR #75 to implement this enhancement.

However, I don't know what process you follow to bring in PRs from outside developers. If there is anything else I need to do to my PR to help get it merged, please let me know.

Fix writer.js bug regarding multiple callbacks

Description

We have an integration test in our "using the Stream/Transform API" test suite that fails for users with Node version != 14:

Callback called multiple times

This failure is not happening in CI, because our remote test runner is using Node version 14.16.0.

Util.ts - PageLocation prototypes

We currently use @ts-ignore to suppress a TypeScript issue with the prototypes of PageLocation and their respective read/write functions.

To Do:

  • Stop mutating the prototype of PageLocation
    • utils.ts
    • reader.ts (Pending #44)

Feature: ZSTD compression

Hello,

Wanted to suggest an interesting feature for this package.
One of the common compression methods for Parquet files is ZSTD.
This algorithm is, however, not currently supported by this package (and is also not supported natively by Node.js).
ZSTD gives very good compression time / decompression time / compression ratio results and would make it easier to exchange Parquet files between Node.js and other programming languages or Parquet file producers.

new release please

Could you please publish a new release with the recent AWS S3 v3 updates?

React App Reader Cursor Throws Error

Steps to reproduce

Containerize a React application with the parquetjs dependency installed, using the Dockerfile below:

Version Info:
React Scripts: 5.0.1
Node: 16.14.2
@dsnp/parquetjs: 1.3.5

package.json build: react-scripts build

# Build stage (the original FROM line was not included in the report; base image assumed from the version info above)
FROM node:16.14.2 AS build
WORKDIR /usr/src/app
ENV PATH /usr/src/app/node_modules/.bin:$PATH

COPY package.json ./package.json
RUN npm cache clean --force && npm install --legacy-peer-deps

COPY . ./

RUN npm run build
FROM nginx:1.21.1-alpine

COPY --from=build /usr/src/app/build /usr/share/nginx/html


RUN rm /etc/nginx/conf.d/default.conf
COPY deploy/nginx/nginx.conf /etc/nginx/conf.d
COPY docker-entrypoint.sh generate-config-js.sh /

EXPOSE 80
CMD ["/docker-entrypoint.sh"]

Expected behaviour

reader.getCursor() should not throw an error.

Actual behaviour

In the containerized React application, the code snippet below throws an error at the getCursor call (see the logs below):

Any logs, error output, etc?

Code:

import parquetjs from "@dsnp/parquetjs/dist/browser/parquet.esm";
const buffer = Buffer.from(arrayData[1], "base64");
const reader = await parquetjs.ParquetReader.openBuffer(buffer);
const cursor = reader.getCursor();

Stacktrace:

helper.ts:478 TypeError: Cannot read properties of null (reading 'includes')
    at new e (parquet.esm.js:78:21246)
    at e.openEnvelopeReader (parquet.esm.js:78:22667)
    at async helper.ts:463:13
    at async Promise.all (:8080/index 0)
// parquet.esm.js
!nP.includes(t.version))throw"invalid parquet version"

The variable nP is null and the script is trying to read its includes attribute.

Any other comments?

Running my React application with the React dev server, no issues arise. Library versions match between the local and containerized environments: Node, @dsnp/parquetjs.

What I have tried:

  • Disabled minification in the production build using the react-app-rewired library
    • Compared the diff between the parquet.esm.js files in the production build and in node_modules (they are equal, no name mangling)
  • Tried various import statements: *.esm, *.cjs
  • Copied the parquet.esm file into my project instead of importing from node_modules, and produced a production build

TS Conversion: Reader/Writer Files

Acceptance criteria:
Convert these files to TypeScript without using the any type anywhere

  • /lib/reader.js
  • /lib/writer.js
    • Type mismatch for 'offset' and 'row count'
    • Write and Close functions for parquet envelope writer
    • Converting row group to columns in metadata (there's no need to loop over row_group)
    • await and flush (callback): not sure about this one
    • parquet_codec encoding type
    • parquet encoding 'statistics'
  • Tests should all pass
    • Int64 Pass by Reference issue
    • BeforeEach test
  • tsc build should pass

scale for DECIMAL field cannot be 0

Thanks for reporting an issue!

Steps to reproduce

Create a schema where a field specifies type DECIMAL and scale 0. Write a row that includes that field.

Expected behaviour

Works.

Actual behaviour

Fails with this error: Failed to generate test file invalid schema for type: DECIMAL, for Column: decimal, scale is required

Any other comments?

Maybe it's not supported for writing?

Either way, I think the line of code that checks for scale is not taking the value 0 into account:
https://github.com/LibertyDSNP/parquetjs/blob/main/lib/schema.ts#L232

According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, "scale must be zero or a positive integer less than or equal to the precision." The library should allow 0 as the scale value. It also seems that scale should be optional, since the doc specifies a default value of 0.
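
The failure is consistent with a truthiness check treating 0 as "missing". A minimal sketch of that pattern and a possible fix (illustrative only, not the repo's actual schema.ts code):

// Buggy pattern: 0 is falsy, so scale: 0 is rejected as "missing".
function validateDecimalBuggy(opts: { precision: number; scale?: number }) {
  if (!opts.scale) throw new Error('scale is required');
}

// Fix: only reject when scale is actually absent or out of range, defaulting to 0.
function validateDecimalFixed(opts: { precision: number; scale?: number }) {
  const scale = opts.scale ?? 0;
  if (!Number.isInteger(scale) || scale < 0 || scale > opts.precision) {
    throw new Error('scale must be an integer between 0 and precision');
  }
  return scale;
}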

Incorporate Upstream Performance Improvements

Hi, thanks for the effort. I have finally found a parquet library that is in active development.
I want to say that #55 and #47 work well.
I have a parquet file with several million rows, and originally reading it row by row was painfully slow. With the changes in these PRs, things are as fast as expected.
The change to Array.shift might not be noticeable with a small array, but with a large array that operation can be very slow depending on the engine implementation.
Please incorporate that fix; it's quick and very helpful in real-world situations.

Originally posted by @peara in #11 (comment)
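
The Array.shift pattern referenced above is O(n) per call on many engines, because the remaining elements are reindexed every time. An illustrative before/after of the pattern (not the library's actual code):

// Before: shift() reindexes the remaining elements on every call -> O(n^2) overall.
function drainWithShift(rows: unknown[], process: (row: unknown) => void) {
  while (rows.length > 0) {
    process(rows.shift());
  }
}

// After: walk the array with an index cursor; same order, no reindexing.
function drainWithCursor(rows: unknown[], process: (row: unknown) => void) {
  for (let i = 0; i < rows.length; i++) {
    process(rows[i]);
  }
}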

TS Conversion: Types in Build & Release

Blocked: #29
Acceptance criteria:

  • Building browser & commonjs
  • Confirm no js files remaining to convert
  • Types are published with the browser and commonjs builds
  • Release Notes Updated
  • Release new version

Closes #25

Missing column index information in generated parquet file

Hi,

I currently have a problem with the generated parquet file: under (yet) unknown circumstances I get the following error:

Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'null_pages' was not present! Struct: ColumnIndex(null_pages:null, min_values:[69 74 41 50 53 5F 30 31], max_values:[69 74 41 50 53 5F 30 31], boundary_order:null)
	at org.apache.parquet.format.ColumnIndex.validate(ColumnIndex.java:782)
	at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:918)
	at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.read(ColumnIndex.java:818)
	at org.apache.parquet.format.ColumnIndex.read(ColumnIndex.java:722)
	at org.apache.parquet.format.Util.read(Util.java:363)
	... 47 more

I have already analyzed it a bit and found a difference in the schema.

To check the schema, I have used pqrs ( https://github.com/manojkarthick/pqrs ).

In the parquet file generated via parquetjs, the metadata section is empty:

version: 1
num of rows: 2
created by: @dsnp/parquetjs
metadata:
message root {
  OPTIONAL BYTE_ARRAY change_id (UTF8);
  OPTIONAL BYTE_ARRAY status (UTF8);
  OPTIONAL BYTE_ARRAY approval_status (UTF8);

In the parquet file generated via pyspark, the metadata is filled:

version: 1
num of rows: 4096
created by: parquet-mr version 1.12.2 (build f2610ad5b0d33f2882d1d235f0ecbb70da391aea)
metadata:
  org.apache.spark.version: 3.3.2
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"change_id","type":"string","nullable":true,"metadata":{}},{"name":"status","type":"string","nullable":true,"metadata":{}},{"name":"approval_status","type":"string","nullable":true,"metadata":{}},]}
message spark_schema {
  OPTIONAL BYTE_ARRAY change_id (STRING);
  OPTIONAL BYTE_ARRAY status (STRING);
  OPTIONAL BYTE_ARRAY approval_status (STRING);

Here is the snippet I use to generate the parquet file:

import parquet from '@dsnp/parquetjs'

const schema = new parquet.ParquetSchema({
  change_id: parquet.ParquetFieldBuilder.createStringField(true),
  status: parquet.ParquetFieldBuilder.createStringField(true),
  approval_status: parquet.ParquetFieldBuilder.createStringField(true),
})

const records = [
  {
    "change_id": "C-01",
    "status": "closed",
    "approval_status": "approved",
  },
  {
    "change_id": "C-02",
    "status": "closed",
    "approval_status": "approved",
  },
]

export default async function main() {
  const writer = await parquet.ParquetWriter.openFile(
    schema,
    'change.parquet'
  )

  for (const record of records) {
    await writer.appendRow(record)
  }
  await writer.close()
}

Steps to reproduce

So far I haven't found the code snippet that produces the error.

Running

df = spark.read.parquet('<path_to_parquetjs_generated_file>')
df.show()

directly in a Jupyter Notebook doesn't trigger the error.
It could be somewhere in our calculations/joins which we call in the original script, but I have to analyze this.

Expected behaviour

The error isn't shown 🙈

Any other comments?

To be honest, I'm not sure if the missing metadata is the root cause of this issue.

In the next few days I will try to provide some example files with the relevant Python code to trigger the error, but I have to finish my work first to make sure our customer is happy :)


Update 1:

The current workaround is a notebook that reads the parquet file and saves it under a new name:

df_change = spark.read.parquet(f'{TEST_DATA_PATH}/change.parquet')
df_change.write.mode("overwrite").parquet(f'{TEST_DATA_PATH}/change_spark.parquet')

This solves the issue for now - but not really what I want :D


Update 2:

While trying to find someone else with the same issue, I have found the following:
https://repost.aws/questions/QUSdc0Pgo9RtSoHOSBwTi8PQ/hive-cannot-open-split-can-not-read-class-org-apache-parquet-format-columnindex

RLE example does not work - 'bitWidth' does not exist in type 'FieldDefinition'

Steps to reproduce

  • Install parquetjs from npm
  • Try to compile the following code from the README:
import parquetjs from "@dsnp/parquetjs";
var schema1 = new parquetjs.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});

Expected behaviour

It should at least compile.

Actual behaviour

Error because there is no bitWidth property on the FieldDefinition type.

Object literal may only specify known properties, and 'bitWidth' does not exist in type 'FieldDefinition'.ts(2353)
declare.d.ts(14, 5): 
The expected type comes from this index signature.
(property) bitWidth: number
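
Until FieldDefinition declares bitWidth, one workaround is a type assertion so the README example at least compiles (a sketch only; it assumes the runtime already accepts the option, which the type declaration simply does not describe yet):

import parquetjs from "@dsnp/parquetjs";
import { FieldDefinition } from "@dsnp/parquetjs/dist/lib/declare";

// bitWidth is not part of FieldDefinition yet, so widen the literal's type explicitly.
const age = {
  type: "UINT_32",
  encoding: "RLE",
  bitWidth: 7,
} as unknown as FieldDefinition;

const schema1 = new parquetjs.ParquetSchema({ age });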

Reader `for await` Support

It would be nice to be able to use for await with the reader for easy async reading.

To this end, ParquetReader would need to implement Symbol.asyncIterator. It could be something as simple as:

  async* [Symbol.asyncIterator]() {
    const cursor = this.getCursor();
    let record = null;
    while ((record = await cursor.next())) {
      yield record;
    }
  }
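
With that iterator in place, reading becomes a plain for await loop (usage sketch):

import parquet from "@dsnp/parquetjs";

const reader = await parquet.ParquetReader.openFile("fruits.parquet");
for await (const record of reader) {
  console.log(record);
}
await reader.close();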

Review PRs to ZJONSSON and decide what we should bring in

There are a bunch of PRs against ZJONSSON/parquetjs, our forked source. Let's keep a list below along with their review status. We should automatically ignore anything that has failed its CI. We can then make separate PRs for each one we want to integrate.

Statistics = true on a schema type "BYTE_ARRAY" with Uint8Array value throws exception

If you have type "BYTE_ARRAY" in your schema and that field contains Uint8Array data, then when parquetjs goes to write the header statistics on close, the statistics data is wrong: it attempts to call "copy" on a Uint8Array, expecting a Buffer.

To reproduce:

  • Create a schema with a field of type "BYTE_ARRAY"
  • Add a row to the parquet file using Uint8Array data
  • Call parquetjs.close()
  • Should throw the error.

Note: Check the other forks of Parquetjs that might have a fix for this.
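
A minimal repro sketch following the steps above (file path and field name are arbitrary):

import parquet from "@dsnp/parquetjs";

const schema = new parquet.ParquetSchema({
  raw: { type: "BYTE_ARRAY", statistics: true },
});

const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/byte-array-repro.parquet");
await writer.appendRow({ raw: new Uint8Array([1, 2, 3]) }); // Uint8Array, not Buffer
await writer.close(); // statistics handling expects Buffer#copy here and throws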

Snippet: Generate parquet schema

Hey guys,

We're currently integrating the parquetjs package into our datalake-graphql-wrapper to provide the functionality to upload data into our data lake via GraphQL.

After some trial and error we were able to generate a parquet file which can be used in the Trino cluster.

Not sure if someone else has already had this problem or a solution... anyway, here are our helper functions plus an example.

// License Apache-2
// helpers/parquet.ts

import { FieldDefinition, ParquetType } from '@dsnp/parquetjs/dist/lib/declare'

export function createStringField({
  optional = true,
}: Partial<{
  optional: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'UTF8', optional })
}

export function createBooleanField({
  optional = true,
}: {
  optional: boolean
}): Partial<FieldDefinition> {
  return createField({ type: 'BOOLEAN', optional })
}

export function createIntField({
  optional = true,
}: {
  optional?: boolean
}): Partial<FieldDefinition> {
  return createField({ type: 'INT64', optional })
}

export function createFloatField({
  optional = true,
}: Partial<{
  optional: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'FLOAT', optional })
}

export function createDecimalField({
  precision = 3,
  optional = true,
}: Partial<{
  precision?: number
  optional?: boolean
}>): Partial<FieldDefinition> {
  return createField({ type: 'DECIMAL', precision, optional })
}

export function createTimestampField({
  optional = true,
}: Partial<{
  optional?: boolean
}>) {
  return createField({ type: 'TIMESTAMP_MILLIS', optional })
}

export function createRepeatableStructField({
  fields,
}: {
  fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
  return {
    optional: true,
    type: 'LIST',
    fields: {
      list: {
        optional: false,
        repeated: true,
        fields: {
          element: {
            optional: true,
            repeated: false,
            fields: fields,
          },
        },
      },
    },
  }
}

export function createStructField({
  fields,
}: {
  fields: { [fieldName: string]: FieldDefinition }
}): Partial<FieldDefinition> {
  return {
    optional: true,
    fields: fields,
  }
}

export function createArrayField({
  type,
  optional = true,
}: Partial<{
  type: ParquetType
  optional?: boolean
}>): Partial<FieldDefinition> {
  return createField({
    optional,
    type: 'LIST',
    fields: {
      list: {
        optional: false,
        repeated: true,
        fields: {
          element: {
            type,
            optional: true,
          },
        },
      },
    },
  })
}

export function createField(
  definition: FieldDefinition,
): Partial<FieldDefinition> {
  return definition
}

And here is a "short" example:

// License Apache-2

import path from 'path'
import parquetjs from '@dsnp/parquetjs'
import {
  createArrayField,
  createFloatField,
  createIntField,
  createRepeatableStructField,
  createStringField,
  createStructField,
  createTimestampField,
} from './helpers/parquet'

const examplePath = path.resolve('test_parquet.parquet')

const parquetSchema = new parquetjs.ParquetSchema({
  stringfield: createStringField({}),
  intfield: createIntField({}),
  floatfield: createFloatField({}),
  timestampfield: createTimestampField({}),
  arrayfield: createArrayField({ type: 'UTF8' }),

  objfield: createStructField({
    fields: {
      sub1: createStringField({}),
      sub2: createStringField({}),
    },
  }),

  structfield: createRepeatableStructField({
    fields: {
      structfield_array: createArrayField({ type: 'UTF8' }),
      structfield_string: createStringField({}),
      structfield_struct: createStructField({
        fields: {
          structfield_struct_string1: createStringField({}),
          structfield_struct_string2: createStringField({}),
        },
      }),
    },
  }),
})

const writer = await parquetjs.ParquetWriter.openFile(
  parquetSchema,
  examplePath,
)

await writer.appendRow({
  stringfield: 'string value',
  intfield: 10,
  floatfield: 10.5,
  timestampfield: new Date(),

  arrayfield: {
    list: [{ element: 'arrayfield val1' }, { element: 'arrayfield val2' }],
  },

  objfield: {
    sub1: 'objfield_sub1 val',
    sub2: 'objfield_sub2 val',
  },

  structfield: {
    list: [
      {
        element: {
          structfield_array: {
            list: [{ element: 'val1' }, { element: 'val2' }],
          },
          structfield_string: 'structfield_string val',
          structfield_struct: {
            structfield_struct_string1: 'structfield_struct_string1 val',
            structfield_struct_string2: 'structfield_struct_string2 val',
          },
        },
      },
    ],
  },
})

await writer.close()

const example_df = await parquetjs.ParquetReader.openFile(examplePath)

console.log(JSON.stringify(example_df.schema.schema, null, 2))

Hope that helps you as it helps us :)

TS Conversion: Utility Files

Acceptance criteria:
Convert these files to TypeScript without using the any type anywhere

  • /lib/util.js
  • /lib/schema.js
  • /lib/shred.js
  • /lib/compression.js

Upgrade to Node 18+

We are 3 major versions of Node.js behind. The xxwasm upgrade would be included; incremental upgrades may be needed first. See #62

Also:

WARNING: node-v15.12.0 is in LTS Maintenance mode and nearing its end of life.

As part of this, address:

npm WARN deprecated @types/[email protected]: This is a stub types definition. bson provides its own type definitions, so you do not need this installed.
npm WARN deprecated [email protected]: This package has been deprecated in favour of @sinonjs/samsam

Add Test Coverage to BufferReader

BufferReader doesn't have any test coverage at all. We should add some, both unit and integration.

  • Add unit tests to BufferReader
  • Add integration test to BufferReader through ParquetEnvelopeReader
  • Remove (if possible) redundant async on BufferReader.read

Util.ts - force32

Don't merge until after: #31

We currently have a function force32 that forces 64-bit numbers into 32 bits. This is dangerous because we want parquetjs to handle 64-bit numbers. We should look into removing this function; an example of the risk follows the To Do list below.

Review a15d62d which is where it was added.

To Do:

  • Remove the exported function force32
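
A small illustration of why the coercion is dangerous (plain JavaScript semantics, not the util.ts code):

// Bitwise operations coerce to 32 bits, silently corrupting larger values.
const big = 2n ** 40n;          // 1099511627776 fits comfortably in an INT64 column
const forced = Number(big) | 0; // "| 0" keeps only the low 32 bits
console.log(forced);            // 0 -- the original value is gone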

Decimal Support for Binary Precision

Currently this library only supports DECIMAL reading and writing when the precision is <= 18

Quoting the Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

DECIMAL can be used to annotate the following types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a
    warning
  • fixed_len_byte_array: precision is limited by the array size. Length n
    can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of
    bytes to store the unscaled value should be used.

Test Files:

Related Issues:

  • Thanks to @nirmal82 for bringing this up in #81
  • Current Decimal Read Support Added: #79
  • Current Decimal Write Support Added: #90
  • Byte array Support added: #97

Module util could not be found

Steps to reproduce

Setup:
Typescript == 4.9.5
node == 20.0.0
theia == 1.45.0
@dsnp/parquetjs == 1.6.0
Webpack == 5.90.3

I develop a theia application where I added @dsnp/parquetjs to a theia extension with

yarn add @dsnp/parquetjs

After that, @dsnp/parquetjs version 1.6.0 was added. I implemented the ParquetReader example in the backend. The build completed without any errors. However, at runtime I get an error in the backend saying that the module util could not be imported from wasm_brotli_nodejs.
To fix this, we made the following change in wasm_brotli_nodejs.js:

// old
// const { TextDecoder } = require(String.raw`util`);
// this fixes the import error
const { TextDecoder } = require(`util`);

Subsequently, we get another runtime error that wasm_brotli_nodejs_bg.wasm could not be found in applications\theia-browser\lib\backend. This error could be solved by copying the file into the directory. As an alternative, it was also possible to solve this issue with some modifications in wasm_brotli_nodejs_bg.js.

The question is: what has to be done to consume @dsnp/parquetjs from our Node app with Webpack 5 without these modifications?

Compression

One of the advantages of this format is compression. Is there a way to enable compression when creating parquet files with the library?
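
Compression is selected per column in the schema definition. A short sketch following the upstream parquetjs convention (codec availability can vary between the Node and browser builds):

import parquet from "@dsnp/parquetjs";

const schema = new parquet.ParquetSchema({
  name:  { type: "UTF8",   compression: "SNAPPY" },
  price: { type: "DOUBLE", compression: "GZIP" },
});

const writer = await parquet.ParquetWriter.openFile(schema, "/tmp/compressed.parquet");
await writer.appendRow({ name: "banana", price: 1.25 });
await writer.close();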

Support Frequency Parquet Schema Model Helper Function

As a user of Frequency, it would be nice if Parquetjs had a helper function that took in the Parquet Schema Model data and initialized a ParquetWriter from it (a rough sketch follows the list below).

Data Model in Frequency: https://github.com/LibertyDSNP/frequency/blob/main/common/primitives/src/parquet.rs#L21
Example Schema Conversion helper function: https://github.com/LibertyDSNP/schemas/blob/main/helpers/parquet.ts#L22

  • Infer ParquetSchema from row data
  • Construct empty ParquetEnvelopeWriter
  • Return ParquetWriter with all default options
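
A very rough sketch of the requested helper's shape (the FrequencyColumn type and the type mapping are hypothetical placeholders, not the real Frequency model):

import parquet from "@dsnp/parquetjs";
import { FieldDefinition, ParquetType } from "@dsnp/parquetjs/dist/lib/declare";

// Hypothetical shape of one Frequency Parquet Schema Model column.
type FrequencyColumn = { name: string; column_type: string };

// Hypothetical mapping from Frequency column types to parquetjs types.
const TYPE_MAP: Record<string, ParquetType> = { Boolean: "BOOLEAN", Integer: "INT64", String: "UTF8" };

async function writerFromFrequencyModel(model: FrequencyColumn[], path: string) {
  const fields: Record<string, FieldDefinition> = {};
  for (const col of model) {
    fields[col.name] = { type: TYPE_MAP[col.column_type] ?? "UTF8", optional: true };
  }
  const schema = new parquet.ParquetSchema(fields);
  return parquet.ParquetWriter.openFile(schema, path); // all default writer options
}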

Valid parquet v1 file test/test-files.parquet fails to be read by tests in integration.js

Steps to reproduce

There are other ways to test this but this is the easiest:

  1. Verify that all the tests in test/integration.js pass:
    mocha -r ts-node/register test/integration.js
  2. Change function sampleColumnHeaders in test/integration.ts to point at test/test-files/fruits.parquet and not fruits.parquet (which actually is a generated file created by other tests in this file).
  3. Run one of the integration tests, for example:
    mocha -r ts-node/register -f "verify statistics" test/integration.js
  4. Note that two tests fail in util.ts with 'read failed'.

Expected behaviour

I would expect the statistics tests to fail, but only due to the statistics values, not in the buffer read. If you import test/test-files/fruits.parquet into https://parquetreader.com, the file is parsed just fine. That leads me to believe it's a bug in this repo.

Actual behaviour

When debugging this failure, I found the error originates from readFooter. It appears to read bytes beyond the length of the buffer. When I stepped into the code I found the error being thrown in util.ts, function fread, line 133. The file fruits.parquet read failed here because length = 8 and bytesRead = 4. The position was -4. So it appears the read was trying to go 4 bytes past the end of the buffer.

Any logs, error output, etc?

 1) Parquet
       with DataPageHeaderV1
         verify statistics:
     Error: read failed
      at ~/github/parquetjs/lib/util.ts:133:23
      at FSReqCallback.wrapper [as oncomplete] (node:fs:684:5)

Any other comments?

I haven't done more debugging to find out why this is failing, and have not checked the commits to see when the bug appeared.

Reading snappy compressed files doesn't seem to work in browser

Steps to reproduce

Try to read a parquet file with SNAPPY compression, e.g. this one:
Sample - Superstore(2018)-snappy.parquet.zip

Expected behaviour

SNAPPY is supposed to be supported, so we should be able to read it from the browser too.

Actual behaviour

An error is thrown and cursor.next() rejects.

Any logs, error output, etc?

TypeError: e.buffer.readInt32LE is not a function
at MR (parquet.esm.js:77:35149)
at Object.QR (parquet.esm.js:77:37481)
at Tn (parquet.esm.js:77:63515)
at rD (parquet.esm.js:77:66120)
at async sb (parquet.esm.js:77:64215)
at async fb (parquet.esm.js:77:64733)
at async Cr.readRowGroup (parquet.esm.js:77:62318)
at async Ry.next (parquet.esm.js:77:54940)

Any other comments?

The fix seems actually quite simple, I replaced

return snappy.uncompress(value);

With

    return Buffer.from(snappy.uncompress(value));

And it seems to work.

I haven't opened a PR yet because I must admit I'm not 100% sure I included Buffer the right way in my front-end code: I linked to a browserified version of feross's Buffer from my own page, since passing an ArrayBuffer complained that I didn't pass a correct parquet file. Is that how it's supposed to be done? It seems unfortunate that we have to load this Buffer package twice (actually 4 times, since it's also in bson and browserfs).
Also, I couldn't find unit tests for this particular script (compression.js), so I'm not sure if this change breaks anything else, and I didn't check whether deflate requires the same treatment; I don't write files in my project yet.

Performance of cursor.next() could be improved with typedarray

Hi,

I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:

#!/usr/bin/env python3

import pandas as pd
import time

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end = time.time()
print(f"Took {end-start}s to read with pyarrow")

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end = time.time()
print(f"Took {end-start}s to read with fastparquet")

outputs:

Took 0.1700916290283203s to read with pyarrow
Took 0.10409688949584961s to read with fastparquet

Whereas in the browser, using this test HTML/JS:

<html>
  <head>
    <script type="module">
      const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
      const buffer_library = await import("https://esm.sh/buffer");
      console.log(buffer_library)
      console.log(parquet)
      const URL = "test/test-files/customer.impala.parquet";
      let resp = await fetch(URL)
      let buffer = await resp.arrayBuffer()
      console.log(buffer)
      buffer = buffer_library.Buffer.from(buffer);
      const reader = await parquet.ParquetReader.openBuffer(buffer);
      //const reader = await parquet.ParquetReader.openUrl(URL);
      window.reader = reader
      console.log(reader)
      var startTime = performance.now()
      let cursor = reader.getCursor();
      await cursor.next()
      console.log(`Time to read first row: ${(performance.now() - startTime)/1000}s`)
      let record = null;
      while (record = await cursor.next()) {
        //console.log(record);
      }
      var endTime = performance.now()
      console.log(`Took ${(endTime - startTime)/1000}s to read ${URL}`)
    </script>
  </head>
</html>

The console outputs:

Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet

Which is ~10x slower than Python

Any ideas on how to improve browser read performance?

The bulk of the time seems to be spent reading the first row.

Xxhasher not returning hex-encoded string values

XxHasher was not returning hex-encoded string values; instead it returns the base-10 representation.

Steps to reproduce

XxHasher.hash64("15") returns "17181926294437511708"

Expected behaviour

XxHasher.hash64("15") returns "ee7276ee58e4421c"
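
For reference, the reported base-10 value is the same 64-bit hash, just not hex-encoded; the expected output falls out of a simple radix conversion:

// The decimal string from the report converts to the expected hex digest.
const decimal = "17181926294437511708";
const hex = BigInt(decimal).toString(16).padStart(16, "0");
console.log(hex); // "ee7276ee58e4421c"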

Webpack Issues

Steps to Reproduce

  1. Set up a Node.js project with @dsnp/parquetjs as a dependency.
  2. Configure Webpack for the project with the following settings:
    • Target: 'node'
    • Output library target: 'umd'
    • Entry: [Point to the main file of the test package]
  3. Include the import statement const { ParquetWriter } = require('@dsnp/parquetjs'); in the main file.
  4. Run Webpack to bundle the project.
  5. Execute the bundled code.

Expected Behaviour

The application should bundle without errors and the @dsnp/parquetjs module should be correctly imported and functional when running the bundled code.

Actual Behaviour

Upon running the Webpack bundle, the application throws an error: Error: Cannot find module 'util'. This suggests that Webpack is unable to resolve the util module, a core Node.js module, which is required by @dsnp/parquetjs or its dependencies.

Any Logs, Error Output, Etc.?

Error: Cannot find module 'util'
    at t (index.js:2:2329285)
    at 86275 (index.js:2:2327909)
    ... [Additional stack trace] ...
    at 19785 (index.js:2:565811) {
  code: 'MODULE_NOT_FOUND'
}

Any Other Comments?

  • Attempts to resolve the issue by adding a fallback configuration for util in Webpack did not succeed.
  • The project works as expected without Webpack bundling.
  • This issue seems to arise specifically when using @dsnp/parquetjs in a Webpack-bundled Node.js environment.
  • Any insights or recommendations on configuration changes or workarounds would be greatly appreciated.

LZO and LZO_RAW Support

Steps to reproduce

Run the LZO tests in test/integration.js

Expected behaviour

The tests should pass

Actual behaviour

The round-trip test fails with:

Error: Decompression failed with code: LZO_E_OUTPUT_OVERRUN
    at Object.decompress (node_modules/lzo/index.js:59:13)
    at Object.inflate_lzo [as inflate] (lib/compression.js:91:14)
    at Object.inflate (lib/compression.js:75:52)
    at decodeDataPageV2 (lib/reader.js:932:47)
    at decodePage (lib/reader.js:710:20)
    at decodePages (lib/reader.js:747:28)
    at /Users/shannonwells/github.com/ProjectLiberty/parquetjs/lib/reader.js:609:85
    at ParquetEnvelopeReader.readRowGroup (lib/reader.js:567:35)
    at ParquetCursor.next (lib/reader.js:67:23)
    at readTestFile (test/integration.js:293:24)

invalid encoding: RLE_DICTIONARY

I'm not sure if there's a problem with the parquet data I'm using, or if this is a bug in the library, but filing anyway.

Steps to reproduce

  1. Create a parquet file with RLE_DICTIONARY encoding.
  2. Parse the file with the reader example: https://github.com/LibertyDSNP/parquetjs/blob/main/examples/reader.js

Expected behaviour

Parquet file should be written to the console (in JSON?).

Actual behaviour

Node raises an exception.

Any logs, error output, etc?

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "invalid encoding: RLE_DICTIONARY".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v18.17.0

Any other comments?

parquet-tools, which uses the same parquet.thrift as parquetjs, parses the file OK.

From what I can tell, https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.ts#L704 attempts to load the codec for RLE_DICTIONARY from the parquet_codec hash, as imported via import * as parquet_codec from './codec';.

Conversion to Full TypeScript

An issue to track progress and notes on converting all the remaining JS files to TypeScript.

  • Document the list of things to convert
  • Document the chunks that are reasonable to convert in
  • Document a general estimate of time to convert

When done this would trigger the ability to close #25 by just building the .d.ts files and including them in the package.

Feature Request: Timestamp support for `schema.fromJsonSchema`

Long story short: It would be cool to generate timestamp fields from the json schema.

Currently the script just checks the type of each field definition inside the JSON schema.
For the string type we would have to extend it to check whether a format property is defined.

If the value is date-time, use createTimestampField instead of createStringField.

What do you think? If you want, I can create a PR with the changes.
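
A rough sketch of the proposed check (the property names follow JSON Schema; the return values mirror what createTimestampField and createStringField produce in the snippet further up, and the real implementation may differ):

// Pick a timestamp field when a JSON Schema string property declares format: "date-time".
type JsonSchemaProperty = { type: string; format?: string };

function fieldFromStringProperty(def: JsonSchemaProperty) {
  if (def.type === "string" && def.format === "date-time") {
    return { type: "TIMESTAMP_MILLIS", optional: true }; // as createTimestampField would
  }
  return { type: "UTF8", optional: true }; // as createStringField would
}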

Not working in the browser

Steps to reproduce

Install the package and import it into a client-side app (React & TypeScript) as mentioned in the instructions:
import parquetjs from '@dsnp/parquetjs/browser/parquetjs';

Expected behaviour

Should be possible to use the parquetjs instance and all the methods, for example parquetjs.ParquetReader...

Actual behaviour

First, there's the import error:

Cannot find module '@dsnp/parquetjs/browser/parquetjs' or its corresponding type declarations.ts(2307)

Then, I tried to import it as:
import parquetjs from '@dsnp/parquetjs/dist/browser/parquet';
and the import error is gone but when I try to use it like this

const reader = await parquetjs.ParquetReader.openBuffer(fileDataBuffer);

there's this error

TypeError: _dsnp_parquetjs_dist_browser_parquet__WEBPACK_IMPORTED_MODULE_1___default().ParquetReader is undefined

How can I make it work in the browser?

Add Linter and Formatter

  • Eslint
  • Prettier

Just use the default configs

Warning

If the linter has lots of crazy errors that require large refactors, reach out for help and also use @ts-ignore while filing bug reports
