Comments (3)
@jorritsandbrink looks like a good first step.
I want to read files from a dataset hosted on the Hugging Face Hub as a dlt source
I'm, however, curious how the HF `datasets` library would behave with the data we produce. If you create a few Parquet files where the schema evolves (we append columns at the end) and push them to the same dataset, then request them with `load_dataset`, what is going to happen? Can I upload CSVs and see them as a single Parquet on the client side? Also streaming:
```python
dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
```
So: I want to access my dataset, train, and stream like any other HF dataset
from dlt.
- Schema evolution does not seem to be supported. I did a simple test with CSV and Parquet. Multiple files can be handled, but only if they contain the same column names (an error is thrown otherwise). I couldn't find a config option to enable it.
```python
load_dataset("jorritsandbrink/dlt_dev", data_files=["foo.csv", "baz.csv"], sep=";")  # baz.csv has an extra column "bla"
```

Error:

```
DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 1 new columns ({' bla'})
This happened while the csv dataset builder was generating data using
hf://datasets/jorritsandbrink/dlt_dev/baz.csv (at revision 6dab0737041dfeef3cc8446f61a1ecd059bec7e0)
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
```
It is, however, possible to load only the initial set of columns, ignoring the new column, by specifying `features`:

```python
from datasets import Features, Value, load_dataset

load_dataset(
    "jorritsandbrink/dlt_dev",
    data_files=["foo.csv", "baz.csv"],
    features=Features(
        {"foo": Value(dtype="int64", id=None), "bar": Value(dtype="int64", id=None), "baz": Value(dtype="int64", id=None)}
    ),
    sep=";",
)  # baz.csv has an extra column "bla"
```
Result:

```
DatasetDict({
    train: Dataset({
        features: ['foo', ' bar', ' baz'],
        num_rows: 2
    })
})
```
Specifying a column in `features` that is not present in all data files causes an error.

I'd say we restrict schema evolution to prevent uploading datasets that are difficult to use (we don't want to place the burden of specifying `features` on the user). Schema contracts provide this functionality, but do we want to use them while they're still in the experimental phase?
- I don't know what you mean by "see them as a single Parquet on the client side", but streaming from multiple CSV files seems to work.
Edit: Okay, I learned there's a `parquet-converter` bot that asynchronously converts the files in the dataset to Parquet in a dedicated branch called `refs/convert/parquet`.
@jorritsandbrink seems that we'd need to bring the files locally, merge them using Arrow or DuckDB, and then emit a unified dataset. This seems both very useful and doable. However, we need more info on what HF users actually need, because this is IMO quite an investment.