Comments (3)
@jorritsandbrink looks like a good first step.
I want to read files from a dataset hosted on the Hugging Face Hub as a dlt source
I'm, however, curious how the HF `datasets` library would behave with the data we produce. If you create a few Parquet files where the schema evolves (we append columns at the end) and push them to the same dataset, then request them with `load_dataset`, what is going to happen? Can I upload CSVs and see them as a single Parquet on the client side? Also streaming:
```python
dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
```
So: I want to access my dataset, train, and stream like any other HF dataset
from dlt.
- Schema evolution does not seem to be supported. I did a simple test with CSV and Parquet. Multiple files can be handled, but only if they contain the same column names (an error is thrown otherwise). I couldn't find a config option to enable it.
```python
load_dataset("jorritsandbrink/dlt_dev", data_files=["foo.csv", "baz.csv"], sep=";")  # baz.csv has an extra column "bla"
```

Error:

```
DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 1 new columns ({' bla'})
This happened while the csv dataset builder was generating data using
hf://datasets/jorritsandbrink/dlt_dev/baz.csv (at revision 6dab0737041dfeef3cc8446f61a1ecd059bec7e0)
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
```
It is, however, possible to load only the initial set of columns, ignoring the new column, by specifying `features`:

```python
from datasets import Features, Value, load_dataset

load_dataset(
    "jorritsandbrink/dlt_dev",
    data_files=["foo.csv", "baz.csv"],
    features=Features(
        {"foo": Value(dtype="int64", id=None), "bar": Value(dtype="int64", id=None), "baz": Value(dtype="int64", id=None)}
    ),
    sep=";",
)  # baz.csv has an extra column "bla"
```
Result:

```
DatasetDict({
    train: Dataset({
        features: ['foo', ' bar', ' baz'],
        num_rows: 2
    })
})
```
Specifying a column in `features` that is not present in all data files causes an error.

I'd say we restrict schema evolution to prevent uploading datasets that are difficult to use (we don't want to place the burden of specifying `features` on the user). Schema contracts provide this functionality, but do we want to use them while they're still in the experimental phase?
- I don't know what you mean by "see them as a single Parquet on the client side", but streaming from multiple CSV files seems to work.
Edit: Okay, I learned there's a `parquet-converter` bot that asynchronously converts the files in the dataset to Parquet in a dedicated branch called `refs/convert/parquet`.
@jorritsandbrink seems that we'd need to bring the files locally, merge them using Arrow or DuckDB, and then emit a unified dataset. This seems both very useful and doable. However, we need more info on what HF users actually need, because this is IMO quite an investment.