Comments (3)

rudolfix commented on May 27, 2024

@jorritsandbrink looks like a good first step.

I want to read files from a dataset hosted on the Hugging Face Hub as a dlt source

I'm curious, however, how the HF datasets library would behave with the data we produce. If you create a few Parquet files whose schema evolves (we append columns at the end) and push them to the same dataset, what happens when you request them with load_dataset? Can I upload CSVs and see them as a single Parquet file on the client side? Also, does streaming work, e.g.:

```python
dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
```

so:

I want to access my dataset, train and stream like any other HF datasets

from dlt.

jorritsandbrink commented on May 27, 2024

@rudolfix

  1. Schema evolution does not seem to be supported. I did a simple test with CSV and Parquet. Multiple files can be handled, but only if they contain the same column names (an error is thrown otherwise). I couldn't find a config option to enable it.
```python
from datasets import load_dataset

load_dataset("jorritsandbrink/dlt_dev", data_files=["foo.csv", "baz.csv"], sep=";")  # baz.csv has extra column "bla"
```

Error:

```
DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 1 new columns ({' bla'})

This happened while the csv dataset builder was generating data using

hf://datasets/jorritsandbrink/dlt_dev/baz.csv (at revision 6dab0737041dfeef3cc8446f61a1ecd059bec7e0)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
```

It is, however, possible to load only the initial set of columns (ignoring the new column) by specifying features:

```python
from datasets import Features, Value, load_dataset

load_dataset(
    "jorritsandbrink/dlt_dev",
    data_files=["foo.csv", "baz.csv"],
    features=Features(
        {"foo": Value(dtype="int64", id=None), "bar": Value(dtype="int64", id=None), "baz": Value(dtype="int64", id=None)}
    ),
    sep=";",
)  # baz.csv has extra column "bla"
```

Result:

```
DatasetDict({
    train: Dataset({
        features: ['foo', ' bar', ' baz'],
        num_rows: 2
    })
})
```

Specifying a column in features that is not present in all data files causes an error.
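Side note: if we did go the features route, the shared column set could be derived from the files' headers instead of being typed by hand. A stdlib-only sketch of just that intersection step (the result would then still need to be turned into a `datasets.Features` mapping with the right dtypes; the file contents below are made up):

```python
import csv
import io

def shared_columns(csv_texts, sep=";"):
    """Return the column names present in every CSV file, preserving
    the order in which they appear in the first file."""
    headers = []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text), delimiter=sep)
        headers.append(next(reader))  # first row = header
    common = set(headers[0]).intersection(*headers[1:])
    return [name for name in headers[0] if name in common]

foo_csv = "foo;bar;baz\n1;2;3\n"          # original schema
baz_csv = "foo;bar;baz;bla\n4;5;6;7\n"    # evolved schema: extra column "bla"
print(shared_columns([foo_csv, baz_csv]))  # ['foo', 'bar', 'baz']
```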

I'd say we restrict schema evolution to prevent uploading datasets that are difficult to use (we don't want to place the burden of specifying features on the user). Schema contracts provide this functionality, but do we want to rely on them while they're still in the experimental phase?
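For reference, the idea behind restricting evolution here is just a subset check on incoming columns. A minimal pure-Python illustration of a "freeze"-style contract (this is the concept only, not dlt's actual schema-contract API):

```python
def enforce_freeze(frozen_columns, incoming_columns):
    """Reject any load whose columns go beyond the frozen schema,
    mimicking a 'freeze' schema contract. Returns the columns if OK."""
    extra = [c for c in incoming_columns if c not in frozen_columns]
    if extra:
        raise ValueError(f"schema contract violated: new columns {extra}")
    return list(incoming_columns)

print(enforce_freeze(["foo", "bar", "baz"], ["foo", "bar"]))  # ['foo', 'bar']
# enforce_freeze(["foo", "bar", "baz"], ["foo", "bla"])  -> raises ValueError
```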

  2. I don't know what you mean by "see them as single parquet on the client side", but streaming from multiple CSV files seems to work.

    [screenshot: streaming from multiple CSV files]

Edit:

Okay, I learned there's a parquet-converter bot that asynchronously converts the files in the dataset to Parquet on a dedicated branch called refs/convert/parquet.


rudolfix commented on May 27, 2024

@jorritsandbrink seems that we'd need to bring the files locally, merge them using Arrow or DuckDB, and then emit a unified dataset. This seems both very useful and doable. However, we need to get more info on what the HF users need, because this is IMO quite an investment.
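The merge described above is essentially a union-by-name over the downloaded files — DuckDB exposes this as `read_parquet(..., union_by_name=true)`, and pyarrow's `concat_tables` can do it via schema promotion. A pure-Python sketch of the semantics, with lists of dicts standing in for record batches:

```python
def union_by_name(batches):
    """Merge row batches with evolving schemas: take the union of all
    column names in first-seen order and fill missing values with None."""
    columns = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)
    rows = [{name: row.get(name) for name in columns}
            for batch in batches for row in batch]
    return columns, rows

old = [{"foo": 1, "bar": 2}]            # schema before evolution
new = [{"foo": 3, "bar": 4, "bla": 5}]  # appended column "bla"
cols, rows = union_by_name([old, new])
print(cols)  # ['foo', 'bar', 'bla']
print(rows)  # [{'foo': 1, 'bar': 2, 'bla': None}, {'foo': 3, 'bar': 4, 'bla': 5}]
```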

