nielsbasjes / splittablegzip

Splittable Gzip codec for Hadoop

License: Apache License 2.0

Java 89.92% Shell 1.30% PigLatin 1.62% Makefile 7.15%
hadoop codec gzip splittable pig gzip-codec gzipped-files mapreduce-java spark

splittablegzip's Issues

Handle very small files

If the input file is smaller than 4 KB, the codec will fail. This is undesired behaviour.
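For anyone wanting to reproduce this, a minimal way to generate such an input (file name and contents are arbitrary placeholders, not from the project):

```python
# Create a gzip file well under 4 KB, the size at which the codec is
# reported to fail.
import gzip
import os

path = "tiny.gz"
with gzip.open(path, "wt") as f:
    f.write("a,b,c\n1,2,3\n")

print(os.path.getsize(path))  # a few dozen bytes, far below 4096
```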

Guide/readme/example for using with AWS Glue ETL job

I wonder if you could suggest how to use this in an AWS Glue job. My method does not involve spark-submit; instead I create job definitions and run jobs using the boto3 tools.

When I try to use this in my script, I get:
pyspark.sql.utils.IllegalArgumentException: Compression codec nl.basjes.hadoop.io.compress.SplittableGzipCodec not found.

I have tried passing --conf nl.basjes.hadoop.io.compress.SplittableGzipCodec, --packages nl.basjes.hadoop.io.compress.SplittableGzipCodec and other variations as job arguments, to no avail. I think I must need to put a copy of the codec jar on S3 and point to it with --extra-files or another argument?
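One possible approach, sketched below and not verified against a live Glue environment: the key point is getting the codec jar onto the job's classpath via Glue's `--extra-jars` job parameter (Maven-coordinate resolution as with spark-submit `--packages` is not available in Glue). All S3 paths, the job name, and the IAM role here are placeholders:

```python
# Sketch: job-definition kwargs for boto3's glue.create_job().
# Upload the splittablegzip jar to S3 first, then reference it in
# "--extra-jars" so it lands on the job's classpath.
create_job_kwargs = {
    "Name": "splittable-gzip-job",          # placeholder
    "Role": "MyGlueServiceRole",            # placeholder IAM role
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/splittable-gzip-job.py",
    },
    "DefaultArguments": {
        "--extra-jars": "s3://my-bucket/jars/splittablegzip-1.3.jar",
    },
}

# When run for real:
#   import boto3
#   boto3.client("glue").create_job(**create_job_kwargs)
print(create_job_kwargs["DefaultArguments"]["--extra-jars"])
```

Inside the job script itself, the codec would then be selected per read, e.g. with `.option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')` as in the Spark example further down this page.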

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

github-actions
.github/workflows/build.yml
  • actions/checkout v4
  • actions/cache v4
  • actions/setup-java v4
  • codecov/codecov-action v4.4.1
maven
Benchmark/javamr/pom.xml
  • junit:junit 4.13.2
  • org.apache.hadoop:hadoop-client 3.4.0
Benchmark/pig/pom.xml
hadoop-codec/pom.xml
  • junit:junit 4.13.2
  • org.slf4j:slf4j-api 2.0.13
  • org.apache.maven.plugins:maven-gpg-plugin 3.2.4
  • org.apache.maven.plugins:maven-source-plugin 3.3.1
  • org.apache.maven.plugins:maven-deploy-plugin 3.1.2
  • org.apache.maven.plugins:maven-javadoc-plugin 3.6.3
  • org.apache.maven.plugins:maven-pmd-plugin 3.22.0
  • org.codehaus.mojo:findbugs-maven-plugin 3.0.5
pom.xml
  • org.apache.hadoop:hadoop-client 3.4.0


Failing on Spark 3.1.2

Hi,
I get an exception when trying to start a session with spark.jars.packages='nl.basjes.hadoop:splittablegzip:1.3':
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: Could not initialize class org.slf4j.LoggerFactory when creating Hive client using classpath

Does anyone have experience running this on Spark 3+?

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: renovate.json
Error type: The renovate configuration file contains some invalid settings
Message: Invalid regExp for packageRules[0].allowedVersions: '!/^(?i).*[-_\.](Alpha|Beta|RC|M|EA|Snap|snapshot|jboss|atlassian)[-_\.]?[0-9]?.*$/'

Thank you - SplittableGzip works out of the box with Apache Spark!

I came across this project via a comment on a Spark Jira ticket, where I had been thinking about a way to split gzip files similar to what this project does. I was delighted to learn that someone had already thought through and implemented such a solution, and, from the looks of it, done a much better job of it than I could have.

So I just wanted to report here for the record, since gzipped files are a common stumbling block for Apache Spark users, that your solution works with Apache Spark without modification.

Here is an example, which I've tested against Apache Spark 2.4.4 using the Python DataFrame API:

# splittable-gzip.py
from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = (
        SparkSession.builder
        # If you want to change the split size, you need to use this config
        # instead of mapreduce.input.fileinputformat.split.maxsize.
        # I don't think Spark DataFrames offer an equivalent setting for
        # mapreduce.input.fileinputformat.split.minsize.
        .config('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))
        .getOrCreate()
    )

    print(
        spark.read
        # You can also specify this option against the SparkSession.
        .option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
        .csv(...)
        .count()
    )

Run this script as follows:

spark-submit --packages "nl.basjes.hadoop:splittablegzip:1.2" splittable-gzip.py

Here's what I see in the Spark UI when I run this script against a 20 GB gzip file on my laptop:

[Screenshot: Spark UI task list, 2019-10-04]

You can see in the task list the behavior described in the README, with each task reading from the start of the file to its target split.
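As a rough cost sketch of that behavior (my own arithmetic, not from the project's docs): with N equal splits, task i must decompress from byte 0 up to the end of split i, so the total decompression work is about N*(N+1)/2 split-lengths instead of N:

```python
# Total bytes decompressed across all tasks when each task must read
# from the start of the file to the end of its own split.
def total_work(file_size: int, num_splits: int) -> float:
    split = file_size / num_splits
    return sum(split * (i + 1) for i in range(num_splits))

size = 20 * 1024 ** 3  # the 20 GB file from the screenshot above
print(total_work(size, 8) / size)  # prints 4.5 -> 4.5x a single-pass read
```

The redundant work grows with the split count, but the wall-clock win comes from the splits running in parallel, as the Executor UI below shows.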

And here is the Executor UI, which shows every available core running concurrently against this single file:

[Screenshot: Spark Executor UI, 2019-10-04]

I will experiment some more with this project -- and perhaps ask some questions on here, if you don't mind -- and then promote it on Stack Overflow and in the Boston area.

Thank you for open sourcing this work!

Fails on spark with "The provided InputSplit bytes which is too small"

Hi! Thank you for this great library. We use it to process our large gzipped input files, but we have run into a problem.

java.lang.IllegalArgumentException: The provided InputSplit (786432000;786439029] is 7029 bytes which is too small. (Minimum is 65536)

In our company we use HDP 2.6 with Spark 2.3.
I tried to find a minimum-split parameter for Spark, but spark.hadoop.mapreduce.input.fileinputformat.split.minsize doesn't work; only the spark.sql.files.maxPartitionBytes setting really takes effect.
Could you advise what I can do? Or could this perhaps be fixed in the library?
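The numbers in the exception suggest what happened: the final split is just the file's tail, 786439029 - 786432000 = 7029 bytes, which is below the codec's 65536-byte minimum. One possible workaround (my own suggestion, not from the library docs, and a simplification of Spark's actual split planning, which also considers openCostInBytes and file packing): pick a spark.sql.files.maxPartitionBytes whose final remainder is either zero or at least the codec minimum:

```python
# Sizes taken from the reported exception.
FILE_SIZE = 786_439_029   # total file size implied by the split range
CODEC_MIN = 65_536        # minimum split size the codec accepts

def tail_split(file_size: int, max_partition_bytes: int) -> int:
    """Size of the last split Spark would cut from a single file,
    assuming splits of exactly max_partition_bytes each."""
    return file_size % max_partition_bytes

bad = 786_432_000  # 750 MB: leaves a 7029-byte tail, below CODEC_MIN
ok = 128 * 1024 ** 2  # e.g. 128 MiB: leaves a large, valid tail

print(tail_split(FILE_SIZE, bad))  # prints 7029
print(tail_split(FILE_SIZE, ok))   # well above CODEC_MIN
```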
