
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

License: Apache License 2.0

Python 91.09% Jupyter Notebook 8.90% Makefile 0.01%
nlp datasets pytorch tensorflow pandas numpy natural-language-processing computer-vision machine-learning deep-learning

datasets's Introduction

Hugging Face Datasets Library


🤗 Datasets is a lightweight library providing two main features:

  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX),
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like processed_dataset = dataset.map(process_example), efficiently prepare the dataset for inspection and ML model evaluation and training.

🎓 Documentation 🔎 Find a dataset in the Hub 🌟 Share a dataset on the Hub

🤗 Datasets is designed to let the community easily add and share new datasets.

🤗 Datasets has many additional interesting features:

  • Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations: all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
  • Smart caching: never wait for your data to be processed several times.
  • Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
  • Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.
  • Native support for audio and image data.
  • Enable streaming mode to save disk space and start iterating over the dataset immediately.

🤗 Datasets originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and tfds.

Installation

With pip

🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

pip install datasets

With conda

🤗 Datasets can be installed using conda as follows:

conda install -c huggingface -c conda-forge datasets

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation

Installation to use with PyTorch/TensorFlow/pandas

If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart

Usage

🤗 Datasets is made to be very simple to use - the API is centered around a single function, datasets.load_dataset(dataset_name, **kwargs), that instantiates a dataset.

This library can be used for text, image, audio, and other kinds of datasets. Here is a quick example that loads a text dataset:

from datasets import load_dataset

# Print all the available datasets
from huggingface_hub import list_datasets
print([dataset.id for dataset in list_datasets()])

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the πŸ€— Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)

If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset
image_dataset = load_dataset('cifar100', streaming=True)
for example in image_dataset["train"]:
    break

For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart

Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.

Main differences between 🤗 Datasets and tfds

If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and tfds:

  • the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
  • the backend serialization of 🤗 Datasets is based on Apache Arrow instead of TF Records, and leverages Python dataclasses for info and features, with some diverging choices (we mostly avoid encoding and store the raw data as much as possible in the backend serialization cache).
  • the user-facing dataset object of 🤗 Datasets is not a tf.data.Dataset but a built-in framework-agnostic dataset class with methods inspired by what we like in tf.data (like a map() method). It basically wraps a memory-mapped Arrow table cache; see the sketch after this list.
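
As an illustration of this framework-agnostic design, here is a minimal sketch (an illustration added here, not part of the original README) showing how the same Arrow-backed dataset can be handed to different tools by switching its output format:

from datasets import load_dataset

# The underlying memory-mapped Arrow table never changes; only the output
# format of __getitem__ does.
squad = load_dataset("squad", split="train")

squad.set_format(type="pandas")   # __getitem__ now returns pandas DataFrames
print(squad[:3])

squad.reset_format()              # back to plain Python dicts and lists
print(squad[0]["question"])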

Disclaimers

🤗 Datasets may run Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, we ask users to:

  • check the dataset scripts they're going to run beforehand and
  • pin the revision of the repositories they use.

If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!

BibTeX

If you want to cite our 🤗 Datasets library, you can use our paper:

@inproceedings{lhoest-etal-2021-datasets,
    title = "Datasets: A Community Library for Natural Language Processing",
    author = "Lhoest, Quentin  and
      Villanova del Moral, Albert  and
      Jernite, Yacine  and
      Thakur, Abhishek  and
      von Platen, Patrick  and
      Patil, Suraj  and
      Chaumond, Julien  and
      Drame, Mariama  and
      Plu, Julien  and
      Tunstall, Lewis  and
      Davison, Joe  and
      {\v{S}}a{\v{s}}ko, Mario  and
      Chhablani, Gunjan  and
      Malik, Bhavitvya  and
      Brandeis, Simon  and
      Le Scao, Teven  and
      Sanh, Victor  and
      Xu, Canwen  and
      Patry, Nicolas  and
      McMillan-Major, Angelina  and
      Schmid, Philipp  and
      Gugger, Sylvain  and
      Delangue, Cl{\'e}ment  and
      Matussi{\`e}re, Th{\'e}o  and
      Debut, Lysandre  and
      Bekman, Stas  and
      Cistac, Pierric  and
      Goehringer, Thibault  and
      Mustar, Victor  and
      Lagunas, Fran{\c{c}}ois  and
      Rush, Alexander  and
      Wolf, Thomas",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.21",
    pages = "175--184",
    abstract = "The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.",
    eprint={2109.02846},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
}

If you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this list.

datasets's People

Contributors

abhishekkrthakur albertvillanova alvarobartt bhavitvyamalik cahya-wirawan cstorm125 emibaylor gchhablani ghomashudson joeddav jonatasgrosman jplu julien-c lewtun lhoestq mariamabarham mariosasko patil-suraj patrickvonplaten polinaeterna rocketknight1 sbrandeis severo sgugger stevhliu tevenlescao thomasw21 thomwolf victorsanh yjernite


datasets's Issues

Error when citation is not given in the DatasetInfo

The following error is raised when the citation parameter is missing while instantiating a DatasetInfo:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jplu/dev/jplu/datasets/src/nlp/info.py", line 338, in __repr__
    citation_pprint = _indent('"""{}"""'.format(self.citation.strip()))
AttributeError: 'NoneType' object has no attribute 'strip'

I propose to do the following change in the info.py file. The method:

def __repr__(self):
        splits_pprint = _indent("\n".join(["{"] + [
                "    '{}': {},".format(k, split.num_examples)
                for k, split in sorted(self.splits.items())
        ] + ["}"]))
        features_pprint = _indent(repr(self.features))
        citation_pprint = _indent('"""{}"""'.format(self.citation.strip()))
        return INFO_STR.format(
                name=self.name,
                version=self.version,
                description=self.description,
                total_num_examples=self.splits.total_num_examples,
                features=features_pprint,
                splits=splits_pprint,
                citation=citation_pprint,
                homepage=self.homepage,
                supervised_keys=self.supervised_keys,
                # Proto add a \n that we strip.
                license=str(self.license).strip())

Becomes:

def __repr__(self):
        splits_pprint = _indent("\n".join(["{"] + [
                "    '{}': {},".format(k, split.num_examples)
                for k, split in sorted(self.splits.items())
        ] + ["}"]))
        features_pprint = _indent(repr(self.features))
        ## the strip is done only if the citation is given
        citation_pprint = self.citation

        if self.citation:
            citation_pprint = _indent('"""{}"""'.format(self.citation.strip()))
        return INFO_STR.format(
                name=self.name,
                version=self.version,
                description=self.description,
                total_num_examples=self.splits.total_num_examples,
                features=features_pprint,
                splits=splits_pprint,
                citation=citation_pprint,
                homepage=self.homepage,
                supervised_keys=self.supervised_keys,
                # Proto add a \n that we strip.
                license=str(self.license).strip())

And now it is ok. @thomwolf are you ok with this fix?

Clone not working on Windows environment

Cloning in a Windows environment is not working because of the use of the special character '?' in a folder name.
Please consider changing the folder name.
Reference to folder -
nlp/datasets/cnn_dailymail/dummy/3.0.0/3.0.0/dummy_data-zip-extracted/dummy_data/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs/dailymail/stories/

error log:
fatal: cannot create directory at 'datasets/cnn_dailymail/dummy/3.0.0/3.0.0/dummy_data-zip-extracted/dummy_data/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs': Invalid argument

Cannot upload my own dataset

I looked into nlp-cli and user.py to learn how to upload my own data.

It is supposed to work like this

  • Register to get username, password at huggingface.co
  • nlp-cli login and type username, password
  • I have a single file to upload at ./ttc/ttc_freq_extra.csv
  • nlp-cli upload ttc/ttc_freq_extra.csv

But I got this error.

2020-05-21 16:33:52.722464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
About to upload file /content/ttc/ttc_freq_extra.csv to S3 under filename ttc/ttc_freq_extra.csv and namespace korakot
Proceed? [Y/n] y
Uploading... This might take a while if files are large
Traceback (most recent call last):
  File "/usr/local/bin/nlp-cli", line 33, in <module>
    service.run()
  File "/usr/local/lib/python3.6/dist-packages/nlp/commands/user.py", line 234, in run
    token=token, filename=filename, filepath=filepath, organization=self.args.organization
  File "/usr/local/lib/python3.6/dist-packages/nlp/hf_api.py", line 141, in presign_and_upload
    urls = self.presign(token, filename=filename, organization=organization)
  File "/usr/local/lib/python3.6/dist-packages/nlp/hf_api.py", line 132, in presign
    return PresignedUrl(**d)
TypeError: __init__() got an unexpected keyword argument 'cdn'

Some error inside nlp.load_dataset()

First of all, nice work!

I am going through this overview notebook

At the simple step dataset = nlp.load_dataset('squad', split='validation[:10%]')

I get an error, which is connected with some inner code, I think:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-8-d848d3a99b8c> in <module>()
      1 # Downloading and loading a dataset
      2 
----> 3 dataset = nlp.load_dataset('squad', split='validation[:10%]')

8 frames

/usr/local/lib/python3.6/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    515         download_mode=download_mode,
    516         ignore_verifications=ignore_verifications,
--> 517         save_infos=save_infos,
    518     )
    519 

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, save_infos, dl_manager, **download_and_prepare_kwargs)
    361                 verify_infos = not save_infos and not ignore_verifications
    362                 self._download_and_prepare(
--> 363                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    364                 )
    365                 # Sync info

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    414             try:
    415                 # Prepare split will record examples associated to the split
--> 416                 self._prepare_split(split_generator, **prepare_split_kwargs)
    417             except OSError:
    418                 raise OSError("Cannot find data file. " + (self.MANUAL_DOWNLOAD_INSTRUCTIONS or ""))

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in _prepare_split(self, split_generator)
    585         fname = "{}-{}.arrow".format(self.name, split_generator.name)
    586         fpath = os.path.join(self._cache_dir, fname)
--> 587         examples_type = self.info.features.type
    588         writer = ArrowWriter(data_type=examples_type, path=fpath, writer_batch_size=self._writer_batch_size)
    589 

/usr/local/lib/python3.6/dist-packages/nlp/features.py in type(self)
    460     @property
    461     def type(self):
--> 462         return get_nested_type(self)
    463 
    464     @classmethod

/usr/local/lib/python3.6/dist-packages/nlp/features.py in get_nested_type(schema)
    370     # Nested structures: we allow dict, list/tuples, sequences
    371     if isinstance(schema, dict):
--> 372         return pa.struct({key: get_nested_type(value) for key, value in schema.items()})
    373     elif isinstance(schema, (list, tuple)):
    374         assert len(schema) == 1, "We defining list feature, you should just provide one example of the inner type"

/usr/local/lib/python3.6/dist-packages/nlp/features.py in <dictcomp>(.0)
    370     # Nested structures: we allow dict, list/tuples, sequences
    371     if isinstance(schema, dict):
--> 372         return pa.struct({key: get_nested_type(value) for key, value in schema.items()})
    373     elif isinstance(schema, (list, tuple)):
    374         assert len(schema) == 1, "We defining list feature, you should just provide one example of the inner type"

/usr/local/lib/python3.6/dist-packages/nlp/features.py in get_nested_type(schema)
    379         # We allow to reverse list of dict => dict of list for compatiblity with tfds
    380         if isinstance(inner_type, pa.StructType):
--> 381             return pa.struct(dict((f.name, pa.list_(f.type, schema.length)) for f in inner_type))
    382         return pa.list_(inner_type, schema.length)
    383 

/usr/local/lib/python3.6/dist-packages/nlp/features.py in <genexpr>(.0)
    379         # We allow to reverse list of dict => dict of list for compatiblity with tfds
    380         if isinstance(inner_type, pa.StructType):
--> 381             return pa.struct(dict((f.name, pa.list_(f.type, schema.length)) for f in inner_type))
    382         return pa.list_(inner_type, schema.length)
    383 

TypeError: list_() takes exactly one argument (2 given)

Mistaken `_KWARGS_DESCRIPTION` for XNLI metric

Hi!

The _KWARGS_DESCRIPTION for the XNLI metric uses the Args and Returns text from the BLEU metric:

_KWARGS_DESCRIPTION = """
Computes XNLI score which is just simple accuracy.
Args:
    predictions: list of translations to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.
Returns:
    'bleu': bleu score,
    'precisions': geometric mean of n-gram precisions,
    'brevity_penalty': brevity penalty,
    'length_ratio': ratio of lengths,
    'translation_length': translation_length,
    'reference_length': reference_length
"""

But it should be something like:

_KWARGS_DESCRIPTION = """
Computes XNLI score which is just simple accuracy.
Args:
    predictions: Predicted labels.
    references: Ground truth labels.
Returns:
    'accuracy': accuracy
"""

ArrowTypeError in squad metrics

squad_metric.compute is giving the following error:

ArrowTypeError: Could not convert [{'text': 'Denver Broncos'}, {'text': 'Denver Broncos'}, {'text': 'Denver Broncos'}] with type list: was not a dict, tuple, or recognized null value for conversion to struct type

This is what my predictions and references look like:

predictions[0]
# {'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
references[0]
# {'answers': [{'text': 'Denver Broncos'},
  {'text': 'Denver Broncos'},
  {'text': 'Denver Broncos'}],
 'id': '56be4db0acb8001400a502ec'}

These are structured as per the squad_metric.compute help string.

[Question] BERT-style multiple choice formatting

Hello, I am wondering what the equivalent formatting of a dataset should be to allow for multiple-choice answering prediction, BERT-style. Previously, this was done by passing a list of InputFeatures to the dataloader instead of a list of InputFeature, where InputFeatures contained lists of length equal to the number of answer choices in the MCQ instead of single items. I'm a bit confused on what the output of my feature conversion function should be when using dataset.map() to ensure similar behavior.

Thanks!
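
A minimal sketch of what such a conversion function could look like with dataset.map() (an illustrative assumption added here, not an answer from the thread; the "question" and "choices" field names and the checkpoint are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def to_multiple_choice_features(example):
    # Tokenize the question against each candidate answer, so every output
    # column is a list with one entry per choice (mirroring the old
    # list-of-InputFeatures approach).
    encodings = [tokenizer(example["question"], choice) for choice in example["choices"]]
    return {
        "input_ids": [enc["input_ids"] for enc in encodings],
        "attention_mask": [enc["attention_mask"] for enc in encodings],
    }

# dataset = dataset.map(to_multiple_choice_features)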

🐛 Trying to use ROUGE metric : pyarrow.lib.ArrowInvalid: Column 1 named references expected length 534 but got length 323

I'm trying to use the ROUGE metric.

I have two files: test.pred.tokenized and test.gold.tokenized, with each line containing a sentence.
I tried:

import nlp

rouge = nlp.load_metric('rouge')
with open("test.pred.tokenized") as p, open("test.gold.tokenized") as g:
    for lp, lg in zip(p, g):
            rouge.add(lp, lg)

But I get the following error:

pyarrow.lib.ArrowInvalid: Column 1 named references expected length 534 but got length 323


Full stack trace:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/me/.venv/transformers/lib/python3.6/site-packages/nlp/metric.py", line 224, in add
    self.writer.write_batch(batch)
  File "/home/me/.venv/transformers/lib/python3.6/site-packages/nlp/arrow_writer.py", line 148, in write_batch
    pa_table: pa.Table = pa.Table.from_pydict(batch_examples, schema=self._schema)
  File "pyarrow/table.pxi", line 1550, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 1503, in pyarrow.lib.Table.from_arrays
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named references expected length 534 but got length 323

(nlp installed from source)

Issue to read a local dataset

Hello,

As proposed by @thomwolf, I open an issue to explain what I'm trying to do without success. What I want to do is to create and load a local dataset; the script I have written is the following:

import os
import csv

import nlp


class BbcConfig(nlp.BuilderConfig):
    def __init__(self, **kwargs):
        super(BbcConfig, self).__init__(**kwargs)


class Bbc(nlp.GeneratorBasedBuilder):
    _DIR = "./data"
    _DEV_FILE = "test.csv"
    _TRAINING_FILE = "train.csv"

    BUILDER_CONFIGS = [BbcConfig(name="bbc", version=nlp.Version("1.0.0"))]

    def _info(self):
        return nlp.DatasetInfo(builder=self, features=nlp.features.FeaturesDict({"id": nlp.string, "text": nlp.string, "label": nlp.string}))

    def _split_generators(self, dl_manager):
        files = {"train": os.path.join(self._DIR, self._TRAINING_FILE), "dev": os.path.join(self._DIR, self._DEV_FILE)}

        return [nlp.SplitGenerator(name=nlp.Split.TRAIN, gen_kwargs={"filepath": files["train"]}),
                nlp.SplitGenerator(name=nlp.Split.VALIDATION, gen_kwargs={"filepath": files["dev"]})]

    def _generate_examples(self, filepath):
        with open(filepath) as f:
            reader = csv.reader(f, delimiter=',', quotechar="\"")
            lines = list(reader)[1:]

            for idx, line in enumerate(lines):
                yield idx, {"idx": idx, "text": line[1], "label": line[0]}

The dataset is attached to this issue as well:
data.zip

Now the steps to reproduce what I would like to do:

  1. unzip data locally (I know the nlp lib can detect and extract archives but I want to reduce and facilitate the reproduction as much as possible)
  2. create the bbc.py script as above at the same location as the unzipped data folder.

Now I try to load the dataset in three different ways and none of them works. The first one uses the name of the dataset, as I would do with TFDS:

import nlp
from bbc import Bbc
dataset = nlp.load("bbc")

I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 280, in load
    dbuilder: DatasetBuilder = builder(path, name, data_dir=data_dir, **builder_kwargs)
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 166, in builder
    builder_cls = load_dataset(path, name=name, **builder_kwargs)
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 88, in load_dataset
    local_files_only=local_files_only,
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/utils/file_utils.py", line 214, in cached_path
    if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):
  File "/opt/anaconda3/envs/transformers/lib/python3.7/zipfile.py", line 203, in is_zipfile
    with open(filename, "rb") as fp:
TypeError: expected str, bytes or os.PathLike object, not NoneType

But @thomwolf told me that there is no need to import the script, just to pass its path, so I tried three different ways:

import nlp
dataset = nlp.load("bbc.py")

And

import nlp
dataset = nlp.load("./bbc.py")

And

import nlp
dataset = nlp.load("/absolute/path/to/bbc.py")

These three ways give me:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 280, in load
    dbuilder: DatasetBuilder = builder(path, name, data_dir=data_dir, **builder_kwargs)
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 166, in builder
    builder_cls = load_dataset(path, name=name, **builder_kwargs)
  File "/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/nlp/load.py", line 124, in load_dataset
    dataset_module = importlib.import_module(module_path)
  File "/opt/anaconda3/envs/transformers/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'nlp.datasets.2fd72627d92c328b3e9c4a3bf7ec932c48083caca09230cebe4c618da6e93688.bbc'

Any idea what I'm missing? Or I might have spotted a bug :)

_download_and_prepare() got an unexpected keyword argument 'verify_infos'

Reproduce

In Colab,

%pip install -q  nlp
%pip install -q apache_beam mwparserfromhell

dataset = nlp.load_dataset('wikipedia')

get

Downloading and preparing dataset wikipedia/20200501.aa (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/wikipedia/20200501.aa/1.0.0...

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-6-52471d2a0088> in <module>()
----> 1 dataset = nlp.load_dataset('wikipedia')

1 frames

/usr/local/lib/python3.6/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    515         download_mode=download_mode,
    516         ignore_verifications=ignore_verifications,
--> 517         save_infos=save_infos,
    518     )
    519 

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, save_infos, dl_manager, **download_and_prepare_kwargs)
    361                 verify_infos = not save_infos and not ignore_verifications
    362                 self._download_and_prepare(
--> 363                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    364                 )
    365                 # Sync info

TypeError: _download_and_prepare() got an unexpected keyword argument 'verify_infos'

Couldn't reach CNN/DM dataset

I can't get the CNN/DailyMail dataset.

import nlp

assert "cnn_dailymail" in [dataset.id for dataset in nlp.list_datasets()]
cnn_dm = nlp.load_dataset('cnn_dailymail')

Colab notebook

gives following error :

ConnectionError: Couldn't reach https://s3.amazonaws.com/datasets.huggingface.co/nlp/cnn_dailymail/cnn_dailymail.py

[Question] How to load wikipedia ? Beam runner ?

When running nlp.load_dataset('wikipedia'), I got:

  • WARNING:nlp.builder:Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided. Please pass a nlp.DownloadConfig(beam_runner=...) object to the builder.download_and_prepare(download_config=...) method. Default values will be used.
  • AttributeError: 'NoneType' object has no attribute 'size'

Could somebody tell me what I should do?

Env

On Colab,

git clone https://github.com/huggingface/nlp
cd nlp
pip install -q .
%pip install -q apache_beam mwparserfromhell
-> ERROR: pydrive 1.3.1 has requirement oauth2client>=4.0.0, but you'll have oauth2client 3.0.0 which is incompatible.
ERROR: google-api-python-client 1.7.12 has requirement httplib2<1dev,>=0.17.0, but you'll have httplib2 0.12.0 which is incompatible.
ERROR: chainer 6.5.0 has requirement typing-extensions<=3.6.6, but you'll have typing-extensions 3.7.4.2 which is incompatible.
pip install -q apache-beam[interactive]
ERROR: google-colab 1.0.0 has requirement ipython~=5.5.0, but you'll have ipython 5.10.0 which is incompatible.

The whole message

WARNING:nlp.builder:Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided. Please pass a nlp.DownloadConfig(beam_runner=...) object to the builder.download_and_prepare(download_config=...) method. Default values will be used.

Downloading and preparing dataset wikipedia/20200501.aa (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/wikipedia/20200501.aa/1.0.0...

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.DoFnRunner.process()

44 frames

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.PerWindowInvoker.invoke_process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window()

/usr/local/lib/python3.6/dist-packages/apache_beam/io/iobase.py in process(self, element, init_result)
   1081       writer.write(e)
-> 1082     return [window.TimestampedValue(writer.close(), timestamp.MAX_TIMESTAMP)]
   1083 

/usr/local/lib/python3.6/dist-packages/apache_beam/io/filebasedsink.py in close(self)
    422   def close(self):
--> 423     self.sink.close(self.temp_handle)
    424     return self.temp_shard_path

/usr/local/lib/python3.6/dist-packages/apache_beam/io/parquetio.py in close(self, writer)
    537     if len(self._buffer[0]) > 0:
--> 538       self._flush_buffer()
    539     if self._record_batches_byte_size > 0:

/usr/local/lib/python3.6/dist-packages/apache_beam/io/parquetio.py in _flush_buffer(self)
    569       for b in x.buffers():
--> 570         size = size + b.size
    571     self._record_batches_byte_size = self._record_batches_byte_size + size

AttributeError: 'NoneType' object has no attribute 'size'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)

<ipython-input-9-340aabccefff> in <module>()
----> 1 dset = nlp.load_dataset('wikipedia')

/usr/local/lib/python3.6/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    518         download_mode=download_mode,
    519         ignore_verifications=ignore_verifications,
--> 520         save_infos=save_infos,
    521     )
    522 

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, save_infos, dl_manager, **download_and_prepare_kwargs)
    370                 verify_infos = not save_infos and not ignore_verifications
    371                 self._download_and_prepare(
--> 372                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    373                 )
    374                 # Sync info

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos)
    770         with beam.Pipeline(runner=beam_runner, options=beam_options,) as pipeline:
    771             super(BeamBasedBuilder, self)._download_and_prepare(
--> 772                 dl_manager, pipeline=pipeline, verify_infos=False
    773             )  # TODO{beam} verify infos
    774 

/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py in __exit__(self, exc_type, exc_val, exc_tb)
    501   def __exit__(self, exc_type, exc_val, exc_tb):
    502     if not exc_type:
--> 503       self.run().wait_until_finish()
    504 
    505   def visit(self, visitor):

/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py in run(self, test_runner_api)
    481       return Pipeline.from_runner_api(
    482           self.to_runner_api(use_fake_coders=True), self.runner,
--> 483           self._options).run(False)
    484 
    485     if self._options.view_as(TypeOptions).runtime_type_check:

/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py in run(self, test_runner_api)
    494       finally:
    495         shutil.rmtree(tmpdir)
--> 496     return self.runner.run_pipeline(self, self._options)
    497 
    498   def __enter__(self):

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/direct/direct_runner.py in run_pipeline(self, pipeline, options)
    128       runner = BundleBasedDirectRunner()
    129 
--> 130     return runner.run_pipeline(pipeline, options)
    131 
    132 

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in run_pipeline(self, pipeline, options)
    553 
    554     self._latest_run_result = self.run_via_runner_api(
--> 555         pipeline.to_runner_api(default_environment=self._default_environment))
    556     return self._latest_run_result
    557 

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in run_via_runner_api(self, pipeline_proto)
    563     # TODO(pabloem, BEAM-7514): Create a watermark manager (that has access to
    564     #   the teststream (if any), and all the stages).
--> 565     return self.run_stages(stage_context, stages)
    566 
    567   @contextlib.contextmanager

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in run_stages(self, stage_context, stages)
    704               stage,
    705               pcoll_buffers,
--> 706               stage_context.safe_coders)
    707           metrics_by_stage[stage.name] = stage_results.process_bundle.metrics
    708           monitoring_infos_by_stage[stage.name] = (

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in _run_stage(self, worker_handler_factory, pipeline_components, stage, pcoll_buffers, safe_coders)
   1071         cache_token_generator=cache_token_generator)
   1072 
-> 1073     result, splits = bundle_manager.process_bundle(data_input, data_output)
   1074 
   1075     def input_for(transform_id, input_id):

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in process_bundle(self, inputs, expected_outputs)
   2332 
   2333     with UnboundedThreadPoolExecutor() as executor:
-> 2334       for result, split_result in executor.map(execute, part_inputs):
   2335 
   2336         split_result_list += split_result

/usr/lib/python3.6/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/usr/local/lib/python3.6/dist-packages/apache_beam/utils/thread_pool_executor.py in run(self)
     42       # If the future wasn't cancelled, then attempt to execute it.
     43       try:
---> 44         self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
     45       except BaseException as exc:
     46         # Even though Python 2 futures library has #set_exection(),

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in execute(part_map)
   2329           self._registered,
   2330           cache_token_generator=self._cache_token_generator)
-> 2331       return bundle_manager.process_bundle(part_map, expected_outputs)
   2332 
   2333     with UnboundedThreadPoolExecutor() as executor:

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in process_bundle(self, inputs, expected_outputs)
   2243             process_bundle_descriptor_id=self._bundle_descriptor.id,
   2244             cache_tokens=[next(self._cache_token_generator)]))
-> 2245     result_future = self._worker_handler.control_conn.push(process_bundle_req)
   2246 
   2247     split_results = []  # type: List[beam_fn_api_pb2.ProcessBundleSplitResponse]

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner.py in push(self, request)
   1557       self._uid_counter += 1
   1558       request.instruction_id = 'control_%s' % self._uid_counter
-> 1559     response = self.worker.do_instruction(request)
   1560     return ControlFuture(request.instruction_id, response)
   1561 

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py in do_instruction(self, request)
    413       # E.g. if register is set, this will call self.register(request.register))
    414       return getattr(self, request_type)(
--> 415           getattr(request, request_type), request.instruction_id)
    416     else:
    417       raise NotImplementedError

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py in process_bundle(self, request, instruction_id)
    448         with self.maybe_profile(instruction_id):
    449           delayed_applications, requests_finalization = (
--> 450               bundle_processor.process_bundle(instruction_id))
    451           monitoring_infos = bundle_processor.monitoring_infos()
    452           monitoring_infos.extend(self.state_cache_metrics_fn())

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/bundle_processor.py in process_bundle(self, instruction_id)
    837         for data in data_channel.input_elements(instruction_id,
    838                                                 expected_transforms):
--> 839           input_op_by_transform_id[data.transform_id].process_encoded(data.data)
    840 
    841       # Finish all operations.

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/bundle_processor.py in process_encoded(self, encoded_windowed_values)
    214       decoded_value = self.windowed_coder_impl.decode_from_stream(
    215           input_stream, True)
--> 216       self.output(decoded_value)
    217 
    218   def try_split(self, fraction_of_remainder, total_buffer_size):

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/operations.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.worker.operations.Operation.output()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/operations.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.worker.operations.Operation.output()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/operations.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/operations.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.worker.operations.DoOperation.process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/operations.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.worker.operations.DoOperation.process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.DoFnRunner.process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

/usr/local/lib/python3.6/dist-packages/future/utils/__init__.py in raise_with_traceback(exc, traceback)
    417         if traceback == Ellipsis:
    418             _, _, traceback = sys.exc_info()
--> 419         raise exc.with_traceback(traceback)
    420 
    421 else:

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.DoFnRunner.process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.PerWindowInvoker.invoke_process()

/usr/local/lib/python3.6/dist-packages/apache_beam/runners/common.cpython-36m-x86_64-linux-gnu.so in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window()

/usr/local/lib/python3.6/dist-packages/apache_beam/io/iobase.py in process(self, element, init_result)
   1080     for e in bundle[1]:  # values
   1081       writer.write(e)
-> 1082     return [window.TimestampedValue(writer.close(), timestamp.MAX_TIMESTAMP)]
   1083 
   1084 

/usr/local/lib/python3.6/dist-packages/apache_beam/io/filebasedsink.py in close(self)
    421 
    422   def close(self):
--> 423     self.sink.close(self.temp_handle)
    424     return self.temp_shard_path

/usr/local/lib/python3.6/dist-packages/apache_beam/io/parquetio.py in close(self, writer)
    536   def close(self, writer):
    537     if len(self._buffer[0]) > 0:
--> 538       self._flush_buffer()
    539     if self._record_batches_byte_size > 0:
    540       self._write_batches(writer)

/usr/local/lib/python3.6/dist-packages/apache_beam/io/parquetio.py in _flush_buffer(self)
    568     for x in arrays:
    569       for b in x.buffers():
--> 570         size = size + b.size
    571     self._record_batches_byte_size = self._record_batches_byte_size + size

AttributeError: 'NoneType' object has no attribute 'size' [while running 'train/Save to parquet/Write/WriteImpl/WriteBundles']

[Bug] labels of glue/ax are all -1

ax = nlp.load_dataset('glue', 'ax')
for i in range(30): print(ax['test'][i]['label'], end=', ')
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 

[Tensorflow] Use something else than `from_tensor_slices()`

In the example notebook, the TF Dataset is built using from_tensor_slices() :

columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
train_tf_dataset.set_format(type='tensorflow', columns=columns)
features = {x: train_tf_dataset[x] for x in columns[:3]} 
labels = {"output_1": train_tf_dataset["start_positions"]}
labels["output_2"] = train_tf_dataset["end_positions"]
tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

But according to the official TensorFlow documentation, this will load the entire dataset into memory.

This defeats one purpose of this library, which is lazy loading.

Is there any other way to load the nlp dataset into a TF dataset lazily?


For example, is it possible to use the Arrow dataset? If yes, is there any code example?
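
One possible lazy-loading sketch, reusing the train_tf_dataset and columns names from the snippet above (an illustrative assumption added here, not a confirmed answer from the thread): feed a Python generator to tf.data.Dataset.from_generator() so examples are read on the fly from the memory-mapped Arrow table instead of being materialized with from_tensor_slices().

import tensorflow as tf

def gen():
    # Iterating over the dataset reads rows lazily from the Arrow cache.
    for ex in train_tf_dataset:
        features = {k: ex[k] for k in columns[:3]}
        labels = {"output_1": ex["start_positions"],
                  "output_2": ex["end_positions"]}
        yield features, labels

lazy_tfdataset = tf.data.Dataset.from_generator(
    gen,
    output_types=({k: tf.int32 for k in columns[:3]},
                  {"output_1": tf.int32, "output_2": tf.int32}),
).batch(8)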

Discussion on version identifier & MockDataLoaderManager for test data

Hi, I'm working on adding a dataset and ran into an error due to download not being defined on MockDataLoaderManager, but being defined in nlp/utils/download_manager.py. The readme step running this: RUN_SLOW=1 pytest tests/test_dataset_common.py::DatasetTest::test_load_real_dataset_localmydatasetname triggers the error. If I can get something to work, I can include it in my data PR once I'm done.

caching in map causes same result to be returned for train, validation and test

hello,

I am working on a program that uses the nlp library with the SST2 dataset.

The rough outline of the program is:

import nlp as nlp_datasets
...
parser.add_argument('--dataset', help='HuggingFace Datasets id', default=['glue', 'sst2'], nargs='+')
...
dataset = nlp_datasets.load_dataset(*args.dataset)
...
# Create feature vocabs
vocabs = create_vocabs(dataset.values(), vectorizers)
...
# Create a function to vectorize based on vectorizers and vocabs:

print('TS', train_set.num_rows)
print('VS', valid_set.num_rows)
print('ES', test_set.num_rows)

# factory method to create a `convert_to_features` function based on vocabs
convert_to_features = create_featurizer(vectorizers, vocabs)
train_set = train_set.map(convert_to_features, batched=True)
train_set.set_format(type='torch', columns=list(vectorizers.keys()) + ['y', 'lengths'])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=args.batchsz)

valid_set = valid_set.map(convert_to_features, batched=True)
valid_set.set_format(type='torch', columns=list(vectorizers.keys()) + ['y', 'lengths'])
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=args.batchsz)

test_set = test_set.map(convert_to_features, batched=True)
test_set.set_format(type='torch', columns=list(vectorizers.keys()) + ['y', 'lengths'])
test_loader = torch.utils.data.DataLoader(test_set, batch_size=args.batchsz)

print('TS', train_set.num_rows)
print('VS', valid_set.num_rows)
print('ES', test_set.num_rows)

I'm not sure if I'm using it incorrectly, but the results are not what I expect. Namely, .map() seems to grab the dataset from the cache and then loses track of what the specific dataset is, instead using my training data for all datasets:

TS 67349
VS 872
ES 1821
TS 67349
VS 67349
ES 67349

The behavior changes if I turn off the caching but then the results fail:

train_set = train_set.map(convert_to_features, batched=True, load_from_cache_file=False)
...
valid_set = valid_set.map(convert_to_features, batched=True, load_from_cache_file=False)
...
test_set = test_set.map(convert_to_features, batched=True, load_from_cache_file=False)

Now I get the right set of features back...

TS 67349
VS 872
ES 1821
100%|██████████| 68/68 [00:00<00:00, 92.78it/s]
100%|██████████| 1/1 [00:00<00:00, 75.47it/s]
  0%|          | 0/2 [00:00<?, ?it/s]TS 67349
VS 872
ES 1821
100%|██████████| 2/2 [00:00<00:00, 77.19it/s]

but I think it's losing track of the original training set:

Traceback (most recent call last):
  File "/home/dpressel/dev/work/baseline/api-examples/layers-classify-hf-datasets.py", line 148, in <module>
    for x in train_loader:
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 338, in __getitem__
    output_all_columns=self._output_all_columns,
  File "/home/dpressel/anaconda3/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 294, in _getitem
    outputs = self._unnest(self._data.slice(key, 1).to_pydict())
  File "pyarrow/table.pxi", line 1211, in pyarrow.lib.Table.slice
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 3: In chunk 0: Invalid: Length spanned by list offsets (15859698) larger than values array (length 100000)

Process finished with exit code 1

The full-example program (minus the print stmts) is here:
https://github.com/dpressel/mead-baseline/pull/620/files

Index outside of table length

The offset input box warns of numbers larger than a limit (like 2000) but then the errors start at a smaller value than that limit (like 1955).

ValueError: Index (2000) outside of table length (2000).
Traceback:
File "/home/sasha/.local/lib/python3.7/site-packages/streamlit/ScriptRunner.py", line 322, in _run_script
exec(code, module.__dict__)
File "/home/sasha/nlp_viewer/run.py", line 116, in <module>
v = d[item][k]
File "/home/sasha/.local/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 338, in __getitem__
output_all_columns=self._output_all_columns,
File "/home/sasha/.local/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 290, in _getitem
raise ValueError(f"Index ({key}) outside of table length ({self._data.num_rows}).")

Error with sklearn train_test_split

It would be nice if we could use sklearn train_test_split to quickly generate subsets from the dataset objects returned by nlp.load_dataset. At the moment the code:

data = nlp.load_dataset('imdb', cache_dir=data_cache)
f_half, s_half = train_test_split(data['train'], test_size=0.5, random_state=seed)

throws:

ValueError: Can only get row(s) (int or slice) or columns (string).

It's not a big deal, since there are other ways to split the data, but it would be a cool thing to have.
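
One possible workaround sketch, assuming the dataset object exposes a select() method taking a list of row indices (an assumption on my part, not confirmed in this report), with data_cache and seed as in the snippet above:

from sklearn.model_selection import train_test_split
import nlp

data = nlp.load_dataset('imdb', cache_dir=data_cache)
train = data['train']

# Split on row indices with sklearn, then materialize the two halves
# from the Arrow-backed dataset.
first_idx, second_idx = train_test_split(
    list(range(train.num_rows)), test_size=0.5, random_state=seed)
f_half = train.select(first_idx)
s_half = train.select(second_idx)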

❓ How to apply a map to all subsets ?

I'm working with CNN/DM dataset, where I have 3 subsets : train, test, validation.

Should I apply my map function on the subsets one by one ?

import nlp

cnn_dm = nlp.load_dataset('cnn_dailymail')
for corpus in ['train', 'test', 'validation']:
         cnn_dm[corpus] = cnn_dm[corpus].map(my_func)

Or is there a better way to do this ?
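
One equivalent sketch: since load_dataset returns a dict-like mapping of split names to datasets, the same function can be applied to every split in one pass (reusing the cnn_dm and my_func names from the snippet above):

cnn_dm = {split: ds.map(my_func) for split, ds in cnn_dm.items()}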

Add Spanish POR and NER Datasets

Hi guys,
In order to cover multilingual support a little step could be adding standard Datasets used for Spanish NER and POS tasks.
I can provide it in raw and preprocessed formats.

Weird-ish: Not creating unique caches for different phases

Sample code:

import nlp
dataset = nlp.load_dataset('boolq')

def func1(x):
    return x

def func2(x):
    return None

train_output = dataset["train"].map(func1)
valid_output = dataset["validation"].map(func1)
print()
print(len(train_output), len(valid_output))
# Output: 9427 9427

The map method in both cases seems to be pointing to the same cache, so the latter call based on the validation data will return the processed train data cache.

What's weird is that the following doesn't seem to be an issue:

train_output = dataset["train"].map(func2)
valid_output = dataset["validation"].map(func2)
print()
print(len(train_output), len(valid_output))
# 9427 3270
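
One possible workaround sketch, assuming map() accepts a cache_file_name argument (an assumption on my part, not confirmed in this report): give each call its own cache file so the results cannot collide.

# Force distinct cache files per split and per function.
train_output = dataset["train"].map(func1, cache_file_name="boolq_train_func1.arrow")
valid_output = dataset["validation"].map(func1, cache_file_name="boolq_validation_func1.arrow")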

[Feature request] Add Toronto BookCorpus dataset

I know the copyright/distribution of this one is complex, but it would be great to have! That, combined with the existing wikitext, would provide a complete dataset for pretraining models like BERT.

[Feature request] separate split name and split instructions

Currently, the name of an nlp.NamedSplit is parsed in arrow_reader.py and used as the instruction.

This makes it impossible to have several training sets, which can occur when:

  • A dataset corresponds to a collection of sub-datasets
  • A dataset was built in stages, adding new examples at each stage

Would it be possible to have two separate fields in the Split class: a name/instruction and a unique ID that is used as the key in the builder's split_dict?

[Manual data dir] Error message: nlp.load_dataset('xsum') -> TypeError

v 0.1.0 from pip

import nlp
xsum = nlp.load_dataset('xsum')

The issue is that dl_manager.manual_dir is None.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-8a32f066f3bd> in <module>
----> 1 xsum = nlp.load_dataset('xsum')

~/miniconda3/envs/nb/lib/python3.7/site-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    515         download_mode=download_mode,
    516         ignore_verifications=ignore_verifications,
--> 517         save_infos=save_infos,
    518     )
    519 

~/miniconda3/envs/nb/lib/python3.7/site-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, save_infos, dl_manager, **download_and_prepare_kwargs)
    361                 verify_infos = not save_infos and not ignore_verifications
    362                 self._download_and_prepare(
--> 363                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    364                 )
    365                 # Sync info

~/miniconda3/envs/nb/lib/python3.7/site-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    397         split_dict = SplitDict(dataset_name=self.name)
    398         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 399         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    400         # Checksums verification
    401         if verify_infos:

~/miniconda3/envs/nb/lib/python3.7/site-packages/nlp/datasets/xsum/5c5fca23aaaa469b7a1c6f095cf12f90d7ab99bcc0d86f689a74fd62634a1472/xsum.py in _split_generators(self, dl_manager)
    102         with open(dl_path, "r") as json_file:
    103             split_ids = json.load(json_file)
--> 104         downloaded_path = os.path.join(dl_manager.manual_dir, "xsum-extracts-from-downloads")
    105         return [
    106             nlp.SplitGenerator(

~/miniconda3/envs/nb/lib/python3.7/posixpath.py in join(a, *p)
     78     will be discarded.  An empty last part will result in a path that
     79     ends with a separator."""
---> 80     a = os.fspath(a)
     81     sep = _get_sep(a)
     82     path = a

TypeError: expected str, bytes or os.PathLike object, not NoneType
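
For reference, a hedged sketch of how the manual directory would normally be supplied, assuming load_dataset forwards data_dir to the download manager as manual_dir (the path below is a placeholder for the directory containing the "xsum-extracts-from-downloads" folder):

import nlp

xsum = nlp.load_dataset('xsum', data_dir='/path/to/xsum_manual_downloads')

Either way, a clearer error message than the os.path.join TypeError would help when the manual data dir is missing.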

ValueError when a split is empty

When a split (TRAIN, VALIDATION or TEST) is empty, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jplu/dev/jplu/datasets/src/nlp/load.py", line 295, in load
    ds = dbuilder.as_dataset(**as_dataset_kwargs)
  File "/home/jplu/dev/jplu/datasets/src/nlp/builder.py", line 587, in as_dataset
    datasets = utils.map_nested(build_single_dataset, split, map_tuple=True)
  File "/home/jplu/dev/jplu/datasets/src/nlp/utils/py_utils.py", line 158, in map_nested
    for k, v in data_struct.items()
  File "/home/jplu/dev/jplu/datasets/src/nlp/utils/py_utils.py", line 158, in <dictcomp>
    for k, v in data_struct.items()
  File "/home/jplu/dev/jplu/datasets/src/nlp/utils/py_utils.py", line 172, in map_nested
    return function(data_struct)
  File "/home/jplu/dev/jplu/datasets/src/nlp/builder.py", line 601, in _build_single_dataset
    split=split,
  File "/home/jplu/dev/jplu/datasets/src/nlp/builder.py", line 625, in _as_dataset
    split_infos=self.info.splits.values(),
  File "/home/jplu/dev/jplu/datasets/src/nlp/arrow_reader.py", line 200, in read
    return py_utils.map_nested(_read_instruction_to_ds, instructions)
  File "/home/jplu/dev/jplu/datasets/src/nlp/utils/py_utils.py", line 172, in map_nested
    return function(data_struct)
  File "/home/jplu/dev/jplu/datasets/src/nlp/arrow_reader.py", line 191, in _read_instruction_to_ds
    file_instructions = make_file_instructions(name, split_infos, instruction)
  File "/home/jplu/dev/jplu/datasets/src/nlp/arrow_reader.py", line 104, in make_file_instructions
    absolute_instructions=absolute_instructions,
  File "/home/jplu/dev/jplu/datasets/src/nlp/arrow_reader.py", line 122, in _make_file_instructions_from_absolutes
    'Split empty. This might means that dataset hasn\'t been generated '
ValueError: Split empty. This might means that dataset hasn't been generated yet and info not restored from GCS, or that legacy dataset is used.

How to reproduce:

import csv

import nlp


class Bbc(nlp.GeneratorBasedBuilder):
    VERSION = nlp.Version("1.0.0")

    def __init__(self, **config):
        self.train = config.pop("train", None)
        self.validation = config.pop("validation", None)
        super(Bbc, self).__init__(**config)

    def _info(self):
        return nlp.DatasetInfo(builder=self, description="bla", features=nlp.features.FeaturesDict({"id": nlp.int32, "text": nlp.string, "label": nlp.string}))

    def _split_generators(self, dl_manager):
        return [nlp.SplitGenerator(name=nlp.Split.TRAIN, gen_kwargs={"filepath": self.train}),
                nlp.SplitGenerator(name=nlp.Split.VALIDATION, gen_kwargs={"filepath": self.validation}),
                nlp.SplitGenerator(name=nlp.Split.TEST, gen_kwargs={"filepath": None})]

    def _generate_examples(self, filepath):
        if not filepath:
            return None, {}

        with open(filepath) as f:
            reader = csv.reader(f, delimiter=',', quotechar="\"")
            lines = list(reader)[1:]

            for idx, line in enumerate(lines):
                yield idx, {"id": idx, "text": line[1], "label": line[0]}

import nlp

dataset = nlp.load("bbc", builder_kwargs={"train": "bbc/data/train.csv", "validation": "bbc/data/test.csv"})
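
A possible workaround sketch (not an official fix): only emit a SplitGenerator for splits that actually have a file, so no empty split gets registered in the builder above.

    def _split_generators(self, dl_manager):
        # Skip splits without a file instead of registering an empty one.
        split_files = [(nlp.Split.TRAIN, self.train),
                       (nlp.Split.VALIDATION, self.validation),
                       (nlp.Split.TEST, None)]
        return [nlp.SplitGenerator(name=name, gen_kwargs={"filepath": path})
                for name, path in split_files if path]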

Scientific Papers only downloading Pubmed

Hi!

I have been playing around with this module, and I am a bit confused about the scientific_papers dataset. I thought that it would download two separate datasets, arxiv and pubmed. But when I run the following:

dataset = nlp.load_dataset('scientific_papers', data_dir='.', cache_dir='.')
Downloading: 100%|██████████| 5.05k/5.05k [00:00<00:00, 2.66MB/s]
Downloading: 100%|██████████| 4.90k/4.90k [00:00<00:00, 2.42MB/s]
Downloading and preparing dataset scientific_papers/pubmed (download: 4.20 GiB, generated: 2.33 GiB, total: 6.53 GiB) to ./scientific_papers/pubmed/1.1.1...
Downloading: 3.62GB [00:40, 90.5MB/s]
Downloading: 880MB [00:08, 101MB/s]
Dataset scientific_papers downloaded and prepared to ./scientific_papers/pubmed/1.1.1. Subsequent calls will reuse this data.

only a pubmed folder is created, and there doesn't seem to be anything for arxiv. Are these two datasets merged, or have I misunderstood something?

Thanks!
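
For what it's worth, loading each configuration explicitly does give both corpora (a sketch; the config names are taken from the dataset script):

import nlp

pubmed = nlp.load_dataset('scientific_papers', 'pubmed')
arxiv = nlp.load_dataset('scientific_papers', 'arxiv')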

Add a method to shuffle a dataset

This could, for example, be a dataset.shuffle(generator=None, seed=None) method.

Also, we could maybe give a clear indication of which methods modify the dataset in-place and which return/cache a modified dataset. I quite like the torch convention of using an underscore suffix for all methods that modify a dataset in-place. What do you think?
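
To make the proposal concrete, a hypothetical usage sketch (neither method exists at the time of writing; the names are illustrative):

# Hypothetical API following the proposal above:
shuffled = dataset.shuffle(seed=42)   # returns (and caches) a new shuffled dataset
dataset.shuffle_(seed=42)             # underscore suffix: shuffles the dataset in-place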

Meta-datasets (GLUE/XTREME/...) – Special care to attributions and citations

Meta-datasets are interesting as standardized benchmarks, but they also require special care, in particular around attribution and authorship. It's very important that each specific dataset inside a meta-dataset is properly referenced, and that its citation, homepage, etc. are clearly visible and accessible, not just the generic citation of the meta-dataset itself.

Let's take GLUE as an example:

The configuration includes the citation for each dataset (e.g. here), but it should also be copied into the dataset info so that, when people access dataset.info.citation, they get both the citation for GLUE and the citation for the specific dataset inside GLUE that they have loaded.
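
To illustrate the expectation (a hedged sketch: the comment describes what should happen, not necessarily what happens today):

import nlp

mrpc = nlp.load_dataset('glue', 'mrpc', split='train')
# dataset.info.citation should contain both the GLUE citation and the MRPC citation
print(mrpc.info.citation)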

[Question] Using/adding a local dataset

Users may want to either create/modify a local copy of a dataset, or use a custom-built dataset with the same Dataset API as externally downloaded datasets.

It appears to be possible to point to a local dataset path rather than downloading the external ones, but I'm not exactly sure how to go about doing this.

A notebook/example script demonstrating this would be very helpful.
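
A minimal sketch of the kind of example that would help, assuming the generic csv loading script accepts local files via data_files (the paths below are placeholders):

import nlp

local_dataset = nlp.load_dataset(
    'csv',
    data_files={'train': 'my_data/train.csv', 'validation': 'my_data/valid.csv'},
)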

SyntaxError with WMT datasets

The following snippet produces a syntax error:

import nlp

dataset = nlp.load_dataset('wmt14')
print(dataset['train'][0])
Traceback (most recent call last):

  File "/home/tom/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-8-3206959998b9>", line 3, in <module>
    dataset = nlp.load_dataset('wmt14')

  File "/home/tom/.local/lib/python3.6/site-packages/nlp/load.py", line 505, in load_dataset
    builder_cls = import_main_class(module_path, dataset=True)

  File "/home/tom/.local/lib/python3.6/site-packages/nlp/load.py", line 56, in import_main_class
    module = importlib.import_module(module_path)

  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)

  File "<frozen importlib._bootstrap>", line 994, in _gcd_import

  File "<frozen importlib._bootstrap>", line 971, in _find_and_load

  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked

  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked

  File "<frozen importlib._bootstrap_external>", line 678, in exec_module

  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

  File "/home/tom/.local/lib/python3.6/site-packages/nlp/datasets/wmt14/c258d646f4f5870b0245f783b7aa0af85c7117e06aacf1e0340bd81935094de2/wmt14.py", line 21, in <module>
    from .wmt_utils import Wmt, WmtConfig

  File "/home/tom/.local/lib/python3.6/site-packages/nlp/datasets/wmt14/c258d646f4f5870b0245f783b7aa0af85c7117e06aacf1e0340bd81935094de2/wmt_utils.py", line 659
    <<<<<<< HEAD
     ^
SyntaxError: invalid syntax

Python version:
3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
Running on Ubuntu 18.04, via a Jupyter notebook

Consider renaming to nld

Hey :)

Just making a thread here to record what I said on Twitter, as it's impossible to follow the discussion there, and it's really not a good medium for this sort of conversation.

The issue is that modules go into the global namespace, so you shouldn't use variable names that conflict with module names. This means the package makes nlp a bad variable name everywhere in the codebase. I've always used nlp as the canonical variable name of spaCy's Language objects, and this is a convention that a lot of other code has followed (Stanza, flair, etc). And actually, your transformers library uses nlp as the name for its Pipeline instance in your readme.

If you stick with the nlp name for this package, anyone who uses it will have to rewrite all of that code. If nlp is a bad choice of variable anywhere, it's a bad choice of variable everywhere, because you shouldn't have to check whether some other function uses a module when you're naming variables within a function. You want one convention that you can stick to everywhere.

If people use your nlp package and continue to use the nlp variable name, they'll find themselves with confusing bugs. There will be many many bits of code cut-and-paste from tutorials that give confusing results when combined with the data loading from the nlp library. The problem will be especially bad for shadowed modules (people might reasonably have a module named nlp.py within their codebase) and notebooks, as people might run notebook cells for data loading out-of-order.

I don't think it's an exaggeration to say that if your library becomes popular, we'll all be answering issues around this about once a week for the next few years. That seems pretty unideal, so I do hope you'll reconsider.

I suggest nld as a better name. It more accurately represents what the package actually does. It's pretty unideal to have a package named nlp that doesn't do any processing, and contains data about natural language generation or other non-NLP tasks. The name is equally short, and is sort of a visual pun on nlp, since a d is a rotated p.

Loading GLUE dataset loads CoLA by default

If I run:

dataset = nlp.load_dataset('glue')

The resulting dataset seems to be CoLA by default, without any error being thrown. This is in contrast to calling:

metric = nlp.load_metric("glue")

which throws an error telling the user that they need to specify a task in GLUE. Should the same apply for loading datasets?
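
For comparison, being explicit about the task works in both cases (a sketch using existing GLUE config names):

import nlp

cola = nlp.load_dataset('glue', 'cola')
metric = nlp.load_metric('glue', 'mrpc')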

πŸ› `map` not working

I'm trying to run a basic example (mapping a function that adds a prefix).
Here is the colab notebook I'm using.

import nlp

dataset = nlp.load_dataset('squad', split='validation[:10%]')

def test(sample):
    sample['title'] = "test prefix @@@ " + sample["title"]
    return sample

print(dataset[0]['title'])
dataset.map(test)
print(dataset[0]['title'])

Output :

Super_Bowl_50
Super_Bowl_50

Expected output :

Super_Bowl_50
test prefix @@@ Super_Bowl_50
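
Note that map returns a new (cached) dataset rather than modifying the original in place, so reassigning the result gives the expected behaviour:

dataset = dataset.map(test)
print(dataset[0]['title'])
# test prefix @@@ Super_Bowl_50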

Loading 'wikitext' dataset fails

Loading the 'wikitext' dataset fails with an AttributeError:

Code to reproduce (From example notebook):

import nlp
wikitext_dataset = nlp.load_dataset('wikitext')

Error:

AttributeError                            Traceback (most recent call last)
in ()
     11 
     12 # Load a dataset and print the first examples in the training set
---> 13 wikitext_dataset = nlp.load_dataset('wikitext')
     14 print(wikitext_dataset['train'][0])

/usr/local/lib/python3.6/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    518         download_mode=download_mode,
    519         ignore_verifications=ignore_verifications,
--> 520         save_infos=save_infos,
    521     )
    522 

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, save_infos, dl_manager, **download_and_prepare_kwargs)
    363                 verify_infos = not save_infos and not ignore_verifications
    364                 self._download_and_prepare(
--> 365                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    366                 )
    367                 # Sync info

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    416             try:
    417                 # Prepare split will record examples associated to the split
--> 418                 self._prepare_split(split_generator, **prepare_split_kwargs)
    419             except OSError:
    420                 raise OSError("Cannot find data file. " + (self.MANUAL_DOWNLOAD_INSTRUCTIONS or ""))

/usr/local/lib/python3.6/dist-packages/nlp/builder.py in _prepare_split(self, split_generator)
    594                 example = self.info.features.encode_example(record)
    595                 writer.write(example)
--> 596         num_examples, num_bytes = writer.finalize()
    597 
    598         assert num_examples == num_examples, f"Expected to write {split_info.num_examples} but wrote {num_examples}"

/usr/local/lib/python3.6/dist-packages/nlp/arrow_writer.py in finalize(self, close_stream)
    173     def finalize(self, close_stream=True):
    174         if self.pa_writer is not None:
--> 175             self.write_on_file()
    176             self.pa_writer.close()
    177         if close_stream:

/usr/local/lib/python3.6/dist-packages/nlp/arrow_writer.py in write_on_file(self)
    124         else:
    125             # All good
--> 126             self._write_array_on_file(pa_array)
    127         self.current_rows = []
    128 

/usr/local/lib/python3.6/dist-packages/nlp/arrow_writer.py in _write_array_on_file(self, pa_array)
     93     def _write_array_on_file(self, pa_array):
     94         """Write a PyArrow Array"""
---> 95         pa_batch = pa.RecordBatch.from_struct_array(pa_array)
     96         self._num_bytes += pa_array.nbytes
     97         self.pa_writer.write_batch(pa_batch)

AttributeError: type object 'pyarrow.lib.RecordBatch' has no attribute 'from_struct_array'

[Checksums] Error for some datasets

The checksums command works very nicely for squad. But for crime_and_punish and xnli, the same bug happens:

When running:

python nlp-cli test xnli --save_checksums

leads to:

  File "nlp-cli", line 33, in <module>
    service.run()
  File "/home/patrick/python_bin/nlp/commands/test.py", line 61, in run
    ignore_checksums=self._ignore_checksums,
  File "/home/patrick/python_bin/nlp/builder.py", line 383, in download_and_prepare
    self._download_and_prepare(dl_manager=dl_manager, download_config=download_config)
  File "/home/patrick/python_bin/nlp/builder.py", line 627, in _download_and_prepare
    dl_manager=dl_manager, max_examples_per_split=download_config.max_examples_per_split,
  File "/home/patrick/python_bin/nlp/builder.py", line 431, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/patrick/python_bin/nlp/datasets/xnli/8bf4185a2da1ef2a523186dd660d9adcf0946189e7fa5942ea31c63c07b68a7f/xnli.py", line 95, in _split_generators
    dl_dir = dl_manager.download_and_extract(_DATA_URL)
  File "/home/patrick/python_bin/nlp/utils/download_manager.py", line 246, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/patrick/python_bin/nlp/utils/download_manager.py", line 186, in download
    self._record_sizes_checksums(url_or_urls, downloaded_path_or_paths)
  File "/home/patrick/python_bin/nlp/utils/download_manager.py", line 166, in _record_sizes_checksums
    self._recorded_sizes_checksums[url] = get_size_checksum(path)
  File "/home/patrick/python_bin/nlp/utils/checksums_utils.py", line 81, in get_size_checksum
    with open(path, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not tuple

ANLI

Can I recommend the following:

For ANLI, use https://github.com/facebookresearch/anli. As that paper says, "Our dataset is not to be confused with abductive NLI (Bhagavatula et al., 2019), which calls itself αNLI, or ART."

Indeed, the paper cited under what is currently called anli says in the abstract "We introduce a challenge dataset, ART".

The current naming will confuse people :)

[Question] Create Apache Arrow dataset from raw text file

Hi guys, I have gathered and preprocessed about 2GB of COVID papers from the CORD dataset on Kaggle. I have seen that you have a text dataset, "Crime and Punishment", in Apache Arrow format. Do you have a script or guide for building such a dataset from a raw txt file (preprocessed BERT-style)?
Would it be worth sending it to you so it can be added to the NLP library?
Thanks, Manu
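
A minimal sketch of one way to do this with pyarrow directly (the file names and the one-document-per-line layout are assumptions, not an official nlp script):

import pyarrow as pa

# Read the preprocessed text, one document per line (placeholder file name).
with open("cord_papers.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]

# Build an Arrow table with a single "text" column and write it as a stream file.
table = pa.Table.from_pydict({"text": docs})
with pa.OSFile("cord_papers.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)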
