
A large-scale text-to-image prompt gallery dataset based on Stable Diffusion

Home Page: https://poloclub.github.io/diffusiondb

License: MIT License

computer-vision ai-art image-generation prompt-engineering stable-diffusion

diffusiondb's Introduction

DiffusionDB


DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.

Get Started

DiffusionDB is available at 🤗 Hugging Face Datasets.

Two Subsets

DiffusionDB provides two subsets (DiffusionDB 2M and DiffusionDB Large) to support different needs.

| Subset | Num of Images | Num of Unique Prompts | Size | Image Directory | Metadata Table |
|---|---|---|---|---|---|
| DiffusionDB 2M | 2M | 1.5M | 1.6 TB | images/ | metadata.parquet |
| DiffusionDB Large | 14M | 1.8M | 6.5 TB | diffusiondb-large-part-1/ diffusiondb-large-part-2/ | metadata-large.parquet |
Key Differences
  1. The two subsets have a similar number of unique prompts, but DiffusionDB Large has many more images. DiffusionDB Large is a superset of DiffusionDB 2M.
  2. Images in DiffusionDB 2M are stored in PNG format; images in DiffusionDB Large use a lossless WebP format.

Dataset Structure

We use a modularized file structure to distribute DiffusionDB. The 2 million images in DiffusionDB 2M are split into 2,000 folders, where each folder contains 1,000 images and a JSON file that links these 1,000 images to their prompts and hyperparameters. Similarly, the 14 million images in DiffusionDB Large are split into 14,000 folders.

# DiffusionDB 2M
./
├── images
│   ├── part-000001
│   │   ├── 3bfcd9cf-26ea-4303-bbe1-b095853f5360.png
│   │   ├── 5f47c66c-51d4-4f2c-a872-a68518f44adb.png
│   │   ├── 66b428b9-55dc-4907-b116-55aaa887de30.png
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-002000
└── metadata.parquet

# DiffusionDB Large
./
├── diffusiondb-large-part-1
│   ├── part-000001
│   │   ├── 0a8dc864-1616-4961-ac18-3fcdf76d3b08.webp
│   │   ├── 0a25cacb-5d91-4f27-b18a-bd423762f811.webp
│   │   ├── 0a52d584-4211-43a0-99ef-f5640ee2fc8c.webp
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-010000
├── diffusiondb-large-part-2
│   ├── part-010001
│   │   ├── 0a68f671-3776-424c-91b6-c09a0dd6fc2d.webp
│   │   ├── 0a0756e9-1249-4fe2-a21a-12c43656c7a3.webp
│   │   ├── 0aa48f3d-f2d9-40a8-a800-c2c651ebba06.webp
│   │   ├── [...]
│   │   └── part-010001.json
│   ├── part-010002
│   ├── part-010003
│   ├── [...]
│   └── part-014000
└── metadata-large.parquet

These sub-folders have names of the form part-0xxxxx, and each image has a unique filename generated by UUID Version 4. The JSON file in a sub-folder has the same name as the sub-folder. Each image is a PNG file (DiffusionDB 2M) or a lossless WebP file (DiffusionDB Large). The JSON file contains key-value pairs mapping image filenames to their prompts and hyperparameters. For example, below is the key-value pair for the image f3501e05-aef7-4225-a9e9-f516527408ac.png in part-000001.json.

{
  "f3501e05-aef7-4225-a9e9-f516527408ac.png": {
    "p": "geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k, ",
    "se": 38753269,
    "c": 12.0,
    "st": 50,
    "sa": "k_lms"
  }
}

The data fields are:

  • key: Unique image name
  • p: Prompt
  • se: Random seed
  • c: CFG Scale (guidance scale)
  • st: Steps
  • sa: Sampler
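
If you work with the extracted files directly, the sketch below (not part of the official scripts) pairs the images in one part folder with their prompts via the folder's JSON file; the part_dir path is an assumption based on the layout above.

import json
from pathlib import Path

from PIL import Image  # pip install Pillow

part_dir = Path('images/part-000001')  # an extracted DiffusionDB 2M folder

# The JSON file shares the folder's name and maps image filenames to metadata
with open(part_dir / f'{part_dir.name}.json') as f:
    metadata = json.load(f)

for filename, meta in list(metadata.items())[:5]:
    image = Image.open(part_dir / filename)
    print(filename, image.size, '| prompt:', meta['p'], '| cfg:', meta['c'])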

Dataset Metadata

To help you easily access prompts and other attributes of images without downloading all the Zip files, we include two metadata tables metadata.parquet and metadata-large.parquet for DiffusionDB 2M and DiffusionDB Large, respectively.

The shape of metadata.parquet is (2000000, 13), and the shape of metadata-large.parquet is (14000000, 13). The two tables share the same schema, and each row represents an image. We store these tables in the Parquet format because Parquet is column-based: you can efficiently query individual columns (e.g., prompts) without reading the entire table.
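
For instance, a minimal sketch (assuming you have already downloaded metadata.parquet, e.g., as in Method 3 below) of reading only the prompt column:

import pandas as pd

# Reading a single column is much cheaper than loading the full 13-column table
prompts = pd.read_parquet('metadata.parquet', columns=['prompt'])
print(len(prompts), prompts['prompt'].iloc[0])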

Below are three random rows from metadata.parquet.

| image_name | prompt | part_id | seed | step | cfg | sampler | width | height | user_name | timestamp | image_nsfw | prompt_nsfw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0c46f719-1679-4c64-9ba9-f181e0eae811.png | a small liquid sculpture, corvette, viscous, reflective, digital art | 1050 | 2026845913 | 50 | 7 | 8 | 512 | 512 | c2f288a2ba9df65c38386ffaaf7749106fed29311835b63d578405db9dbcafdb | 2022-08-11 09:05:00+00:00 | 0.0845108 | 0.00383462 |
| a00bdeaa-14eb-4f6c-a303-97732177eae9.png | human sculpture of lanky tall alien on a romantic date at italian restaurant with smiling woman, nice restaurant, photography, bokeh | 905 | 1183522603 | 50 | 10 | 8 | 512 | 768 | df778e253e6d32168eb22279a9776b3cde107cc82da05517dd6d114724918651 | 2022-08-19 17:55:00+00:00 | 0.692934 | 0.109437 |
| 6e5024ce-65ed-47f3-b296-edb2813e3c5b.png | portrait of barbaric spanish conquistador, symmetrical, by yoichi hatakenaka, studio ghibli and dan mumford | 286 | 1713292358 | 50 | 7 | 8 | 512 | 640 | 1c2e93cfb1430adbd956be9c690705fe295cbee7d9ac12de1953ce5e76d89906 | 2022-08-12 03:26:00+00:00 | 0.0773138 | 0.0249675 |

Metadata Schema

metadata.parquet and metadata-large.parquet share the same schema.

| Column | Type | Description |
|---|---|---|
| image_name | string | Image UUID filename. |
| prompt | string | The text prompt used to generate this image. |
| part_id | uint16 | Folder ID of this image. |
| seed | uint32 | Random seed used to generate this image. |
| step | uint16 | Step count (hyperparameter). |
| cfg | float32 | Guidance scale (hyperparameter). |
| sampler | uint8 | Sampler method (hyperparameter). Mapping: {1: "ddim", 2: "plms", 3: "k_euler", 4: "k_euler_ancestral", 5: "k_heun", 6: "k_dpm_2", 7: "k_dpm_2_ancestral", 8: "k_lms", 9: "others"}. |
| width | uint16 | Image width. |
| height | uint16 | Image height. |
| user_name | string | SHA256 hash of the Discord ID of the user who generated this image. For example, the hash for xiaohk#3146 is e285b7ef63be99e9107cecd79b280bde602f17e0ca8363cb7a0889b67f0b5ed0. "deleted_account" refers to users who have deleted their accounts. None means the image was deleted before we scraped it for the second time. |
| timestamp | timestamp | UTC timestamp of when this image was generated. None means the image was deleted before we scraped it for the second time. Note that the timestamp is not accurate for duplicate images that share the same prompt, hyperparameters, width, and height. |
| image_nsfw | float32 | Likelihood of an image being NSFW. Scores are predicted by LAION's state-of-the-art NSFW detector (range 0 to 1). A score of 2.0 means the image has already been flagged as NSFW and blurred by Stable Diffusion. |
| prompt_nsfw | float32 | Likelihood of a prompt being NSFW. Scores are predicted by the Detoxify library. Each score is the maximum of toxicity and sexual_explicit (range 0 to 1). |

Warning: Although the Stable Diffusion model has an NSFW filter that automatically blurs user-generated NSFW images, this filter is not perfect, so DiffusionDB still contains some NSFW images. Therefore, we compute and provide NSFW scores for both images and prompts using state-of-the-art models. The distribution of these scores is shown below. Please choose an appropriate NSFW score threshold to filter out NSFW images before using DiffusionDB in your projects.

Figure: NSFW score distributions.
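
As a hedged example, the sketch below keeps only rows whose scores fall below a threshold; the 0.5 values are placeholders, not a recommendation.

import pandas as pd

metadata_df = pd.read_parquet('metadata.parquet')

# Illustrative thresholds of 0.5 for both scores; pick values suited to your project
safe_df = metadata_df[
    (metadata_df['image_nsfw'] < 0.5) & (metadata_df['prompt_nsfw'] < 0.5)
]
print(f'Kept {len(safe_df)} of {len(metadata_df)} rows')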

Loading DiffusionDB

DiffusionDB is large (1.6 TB or 6.5 TB)! However, with our modularized file structure, you can easily load a desired number of images along with their prompts and hyperparameters. In the example-loading.ipynb notebook, we demonstrate three methods to load a subset of DiffusionDB. Below is a short summary.

Method 1: Use Hugging Face Datasets Loader

You can use the Hugging Face Datasets library to easily load prompts and images from DiffusionDB. We pre-defined 16 DiffusionDB subsets (configurations) based on the number of instances. You can see all subsets in the Dataset Preview.

Note: To use the Datasets loader, you also need to install Pillow (pip install Pillow).

import numpy as np
from datasets import load_dataset

# Load the dataset with the `large_random_1k` subset
dataset = load_dataset('poloclub/diffusiondb', 'large_random_1k')
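
Once loaded, each example contains the image and its metadata. Below is a brief, illustrative sketch; the `image` and `prompt` field names follow the dataset's Hugging Face configuration.

# Inspect the first example: `image` is a PIL.Image, `prompt` is a string
example = dataset['train'][0]
print(example['prompt'])
print(example['image'].size)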

Method 2. Use a downloader script

This repo includes a Python downloader, download.py, that lets you download and load DiffusionDB from the command line. Below is an example of downloading a subset of DiffusionDB.

Usage/Examples

The script is run using command-line arguments as follows:

  • -i --index - File to download or lower bound of a range of files if -r is also set.
  • -r --range - Upper bound of range of files to download if -i is set.
  • -o --output - Name of custom output directory. Defaults to the current directory if not set.
  • -z --unzip - Unzip the file/files after downloading
  • -l --large - Download from DiffusionDB Large. Defaults to DiffusionDB 2M.

Downloading a single file

The specific file to download is specified by the number at the end of its filename on Hugging Face. The script automatically zero-pads the number and generates the URL.

python download.py -i 23
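
For reference, the generated URL looks roughly like the sketch below (a hypothetical reconstruction; see download.py for the authoritative logic):

# Zero-pad the index to six digits and build the Hugging Face download URL
index = 23
url = ('https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/'
       f'images/part-{index:06}.zip')
print(url)  # ends with images/part-000023.zip
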
Downloading a range of files

The upper and lower bounds of the set of files to download are set by the -i and -r flags respectively.

python download.py -i 1 -r 2000

Note that this range will download the entire dataset. The script will ask you to confirm that you have 1.7 TB free at the download destination.

Downloading to a specific directory

By default, the script saves the dataset's part .zip files to images/. If you wish to change the download location, you should move these files as well or use a symbolic link.

python download.py -i 1 -r 2000 -o /home/$USER/datahoarding/etc

Again, the script will automatically add the / between the directory and the file when it downloads.

Setting the files to unzip once they've been downloaded

The script only unzips the files after all of them have been downloaded, since both downloading and unzipping can be lengthy processes in certain circumstances.

python download.py -i 1 -r 2000 -z

Method 3. Use metadata.parquet (Text Only)

If your task does not require images, then you can easily access all 2 million prompts and hyperparameters in the metadata.parquet table.

from urllib.request import urlretrieve
import pandas as pd

# Download the parquet table
table_url = 'https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/metadata.parquet'
urlretrieve(table_url, 'metadata.parquet')

# Read the table using Pandas
metadata_df = pd.read_parquet('metadata.parquet')
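
From here, you can query the prompts directly with pandas. For example, a simple keyword search over all 2 million prompts (continuing from the snippet above):

# Count prompts that mention a keyword (case-insensitive)
landscape_df = metadata_df[
    metadata_df['prompt'].str.contains('landscape', case=False, na=False)
]
print(len(landscape_df), 'prompts mention "landscape"')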

Dataset Creation

We collected all images from the official Stable Diffusion Discord server. Please read our research paper for details. The code is included in ./scripts/.

Data Removal

If you find any harmful images or prompts in DiffusionDB, you can use this Google Form to report them. Similarly, if you are a creator of an image included in this dataset, you can use the same form to let us know if you would like to remove your image from DiffusionDB. We will closely monitor this form and update DiffusionDB periodically.

Credits

DiffusionDB is created by Jay Wang, Evan Montoya, David Munechika, Alex Yang, Ben Hoover, and Polo Chau.

Citation

@article{wangDiffusionDBLargescalePrompt2022,
  title = {{{DiffusionDB}}: {{A}} Large-Scale Prompt Gallery Dataset for Text-to-Image Generative Models},
  author = {Wang, Zijie J. and Montoya, Evan and Munechika, David and Yang, Haoyang and Hoover, Benjamin and Chau, Duen Horng},
  year = {2022},
  journal = {arXiv:2210.14896 [cs]},
  url = {https://arxiv.org/abs/2210.14896}
}

Licensing

The DiffusionDB dataset is available under the CC0 1.0 License. The Python code in this repository is available under the MIT License.

Contact

If you have any questions, feel free to open an issue or contact Jay Wang.

diffusiondb's People

Contributors

alexanderhyang, bhoov, davidmunechika, dependabot[bot], evan-eng, outrun32, thelustriva, xiaohk, zhenglinzhou


diffusiondb's Issues

Unable to download using HuggingFace `datasets` library

I'm using the same snippet from the README to download the dataset from HF:

from datasets import load_dataset

dataset = load_dataset('poloclub/diffusiondb', 'large_random_1k')

but I'm getting this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/site-packages/datasets/load.py", line 1729, in load_dataset
    **config_kwargs,
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/site-packages/datasets/load.py", line 1498, in load_dataset_builder
    builder_cls = import_main_class(dataset_module.module_path)
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/site-packages/datasets/load.py", line 115, in import_main_class
    module = importlib.import_module(module_path)
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/minhduc0711/.cache/huggingface/modules/datasets_modules/datasets/poloclub--diffusiondb/8e4f79d20e94e3f261bfbea0101aa5047d6961c1d124920dc067889f88f5cddd/diffusiondb.py", line 50, in <module>
    "datasets/poloclub/diffusiondb", filename=f"images/part-{i:06}.zip"
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/minhduc0711/miniconda3/envs/diffusion/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 167, in validate_repo_id
    "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'datasets/poloclub/diffusiondb'. Use `repo_type` argument if needed.

Some details about my environment:

  • Python 3.7.15
  • datasets 2.7.1

How to Create Fig. 6

Many thanks for this excellent work. On page 6 and in Fig. 6 of your paper, you show

https://poloclub.github.io/diffusiondb/explorer/#prompt-embedding

which is a very good visualization. However, I cannot find the code to generate this. Could you show me where it is?

"download -z" unzips all the images to the same directory

I ran python download.py -i 1 -r 5 -z based on https://huggingface.co/datasets/poloclub/diffusiondb#downloading-to-a-specific-directory. It downloaded the five zip files as images/part-00000<N>.zip. However, the unzipped images were all in the current directory. Shouldn't they be created in five separate subdirectories? Otherwise, you end up with a single directory with 2M files. https://huggingface.co/datasets/poloclub/diffusiondb says "The 2 million images in DiffusionDB 2M are split into 2,000 folders", and download.py is not implementing that intent.

DatasetGenerationError

I ran into the following error with:

import numpy as np
from datasets import load_dataset

# Load the dataset with the `large_random_1k` subset
dataset = load_dataset('poloclub/diffusiondb', 'large_random_1k')

error msg:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
File ~/sandbox/sd-prompt-analysis/venv/lib/python3.8/site-packages/datasets/builder.py:1587, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1578     writer = writer_class(
   1579         features=writer._features,
   1580         path=fpath.replace("SSSSS", f"{shard_id:05d}").replace("JJJJJ", f"{job_id:05d}"),
   (...)
   1585         embed_local_files=embed_local_files,
   1586     )
-> 1587 example = self.info.features.encode_example(record) if self.info.features is not None else record
   1588 writer.write(example, key)

File ~/sandbox/sd-prompt-analysis/venv/lib/python3.8/site-packages/datasets/features/features.py:1800, in Features.encode_example(self, example)
   1799 example = cast_to_python_objects(example)
-> 1800 return encode_nested_example(self, example)

File ~/sandbox/sd-prompt-analysis/venv/lib/python3.8/site-packages/datasets/features/features.py:1202, in encode_nested_example(schema, obj, level)
   1200         raise ValueError("Got None but expected a dictionary instead")
   1201     return (
-> 1202         {
   1203             k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
   1204             for k, (sub_schema, sub_obj) in zip_dict(schema, obj)
   1205         }
   1206         if obj is not None
   1207         else None
...
   1605         e = e.__context__
-> 1606     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1608 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

Thanks.

24 GB GPU enough / supported?

Hi all, just curious whether a 24 GB card is enough to get this working.
Also, is the quality of the images any better than what Stability-AI/StableDiffusion or CompVis/Stable-Diffusion has achieved?

Kind regards.

OSError: could not create decoder object

Hi, I have two problems.

First,

I try

import numpy as np
from datasets import load_dataset

dataset = load_dataset('poloclub/diffusiondb', 'large_random_100k')
dataset['train']['image'][1]

but,

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[5], line 1
----> 1 dataset['train']['image'][1]

File ~\miniconda3\envs\py38\lib\site-packages\datasets\arrow_dataset.py:2590, in Dataset.__getitem__(self, key)
   2588 def __getitem__(self, key):  # noqa: F811
   2589     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2590     return self._getitem(
   2591         key,
   2592     )

File ~\miniconda3\envs\py38\lib\site-packages\datasets\arrow_dataset.py:2575, in Dataset._getitem(self, key, **kwargs)
   2573 formatter = get_formatter(format_type, features=self.features, **format_kwargs)
   2574 pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
-> 2575 formatted_output = format_table(
   2576     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   2577 )
   2578 return formatted_output

File ~\miniconda3\envs\py38\lib\site-packages\datasets\formatting\formatting.py:634, in format_table(table, key, formatter, format_columns, output_all_columns)
    632 python_formatter = PythonFormatter(features=None)
    633 if format_columns is None:
--> 634     return formatter(pa_table, query_type=query_type)
    635 elif query_type == "column":
    636     if key in format_columns:

File ~\miniconda3\envs\py38\lib\site-packages\datasets\formatting\formatting.py:408, in Formatter.__call__(self, pa_table, query_type)
    406     return self.format_row(pa_table)
    407 elif query_type == "column":
--> 408     return self.format_column(pa_table)
    409 elif query_type == "batch":
    410     return self.format_batch(pa_table)

File ~\miniconda3\envs\py38\lib\site-packages\datasets\formatting\formatting.py:447, in PythonFormatter.format_column(self, pa_table)
    445 def format_column(self, pa_table: pa.Table) -> list:
    446     column = self.python_arrow_extractor().extract_column(pa_table)
--> 447     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
    448     return column

File ~\miniconda3\envs\py38\lib\site-packages\datasets\formatting\formatting.py:228, in PythonFeaturesDecoder.decode_column(self, column, column_name)
    227 def decode_column(self, column: list, column_name: str) -> list:
--> 228     return self.features.decode_column(column, column_name) if self.features else column

File ~\miniconda3\envs\py38\lib\site-packages\datasets\features\features.py:1868, in Features.decode_column(self, column, column_name)
   1855 def decode_column(self, column: list, column_name: str):
   1856     """Decode column with custom feature decoding.
   1857 
   1858     Args:
   (...)
   1865         `list[Any]`
   1866     """
   1867     return (
-> 1868         [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
   1869         if self._column_requires_decoding[column_name]
   1870         else column
   1871     )

File ~\miniconda3\envs\py38\lib\site-packages\datasets\features\features.py:1868, in <listcomp>(.0)
   1855 def decode_column(self, column: list, column_name: str):
   1856     """Decode column with custom feature decoding.
   1857 
   1858     Args:
   (...)
   1865         `list[Any]`
   1866     """
   1867     return (
-> 1868         [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
   1869         if self._column_requires_decoding[column_name]
   1870         else column
   1871     )

File ~\miniconda3\envs\py38\lib\site-packages\datasets\features\features.py:1309, in decode_nested_example(schema, obj, token_per_repo_id)
   1306 elif isinstance(schema, (Audio, Image)):
   1307     # we pass the token to read and decode files from private repositories in streaming mode
   1308     if obj is not None and schema.decode:
-> 1309         return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
   1310 return obj

File ~\miniconda3\envs\py38\lib\site-packages\datasets\features\image.py:163, in Image.decode_example(self, value, token_per_repo_id)
    161 else:
    162     if is_local_path(path):
--> 163         image = PIL.Image.open(path)
    164     else:
    165         source_url = path.split("::")[-1]

File ~\miniconda3\envs\py38\lib\site-packages\PIL\Image.py:3268, in open(fp, mode, formats)
   3265             raise
   3266     return None
-> 3268 im = _open_core(fp, filename, prefix, formats)
   3270 if im is None:
   3271     if init():

File ~\miniconda3\envs\py38\lib\site-packages\PIL\Image.py:3254, in open.<locals>._open_core(fp, filename, prefix, formats)
   3252 elif result:
   3253     fp.seek(0)
-> 3254     im = factory(fp, filename)
   3255     _decompression_bomb_check(im.size)
   3256     return im

File ~\miniconda3\envs\py38\lib\site-packages\PIL\ImageFile.py:117, in ImageFile.__init__(self, fp, filename)
    115 try:
    116     try:
--> 117         self._open()
    118     except (
    119         IndexError,  # end of data
    120         TypeError,  # end of data (ord)
   (...)
    123         struct.error,
    124     ) as v:
    125         raise SyntaxError(v) from v

File ~\miniconda3\envs\py38\lib\site-packages\PIL\WebPImagePlugin.py:63, in WebPImageFile._open(self)
     59     return
     61 # Use the newer AnimDecoder API to parse the (possibly) animated file,
     62 # and access muxed chunks like ICC/EXIF/XMP.
---> 63 self._decoder = _webp.WebPAnimDecoder(self.fp.read())
     65 # Get info from decoder
     66 width, height, loop_count, bgcolor, frame_count, mode = self._decoder.get_info()

OSError: could not create decoder object

Second, after the download is complete, I want to load the files locally.

Thanks.

DatasetGenerationError: An error occurred while generating the dataset | ValueError: NaTType does not support utcoffset

dataset = load_dataset('poloclub/diffusiondb', '2m_random_50k', split="all")

Gives error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/env/lib/python3.8/site-packages/datasets/builder.py:1626, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1625 example = self.info.features.encode_example(record) if self.info.features is not None else record
-> 1626 writer.write(example, key)
   1627 num_examples_progress_update += 1

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:488, in ArrowWriter.write(self, example, key, writer_batch_size)
    486     self.hkey_record = []
--> 488 self.write_examples_on_file()

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:446, in ArrowWriter.write_examples_on_file(self)
    442         batch_examples[col] = [
    443             row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
    444             for row in self.current_examples
    445         ]
--> 446 self.write_batch(batch_examples=batch_examples)
    447 self.current_examples = []

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:551, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
    550 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 551 arrays.append(pa.array(typed_sequence))
    552 inferred_features[col] = typed_sequence.get_inferred_type()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
    188     trying_cast_to_python_objects = True
--> 189     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    190 # use smaller integer precisions if possible

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/env/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/env/lib/python3.8/site-packages/pandas/_libs/tslibs/nattype.pyx:67, in pandas._libs.tslibs.nattype._make_error_func.f()

ValueError: NaTType does not support utcoffset

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File ~/env/lib/python3.8/site-packages/datasets/builder.py:1635, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1634 num_shards = shard_id + 1
-> 1635 num_examples, num_bytes = writer.finalize()
   1636 writer.close()

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:582, in ArrowWriter.finalize(self, close_stream)
    581     self.hkey_record = []
--> 582 self.write_examples_on_file()
    583 # If schema is known, infer features even if no examples were written

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:446, in ArrowWriter.write_examples_on_file(self)
    442         batch_examples[col] = [
    443             row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
    444             for row in self.current_examples
    445         ]
--> 446 self.write_batch(batch_examples=batch_examples)
    447 self.current_examples = []

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:551, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
    550 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 551 arrays.append(pa.array(typed_sequence))
    552 inferred_features[col] = typed_sequence.get_inferred_type()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()

File ~/env/lib/python3.8/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
    188     trying_cast_to_python_objects = True
--> 189     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    190 # use smaller integer precisions if possible

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/env/lib/python3.8/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/env/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/env/lib/python3.8/site-packages/pandas/_libs/tslibs/nattype.pyx:67, in pandas._libs.tslibs.nattype._make_error_func.f()

ValueError: NaTType does not support utcoffset

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[2], line 5
      2 from datasets import load_dataset
      4 # Load the dataset with the `large_random_1k` subset
----> 5 dataset = load_dataset('poloclub/diffusiondb', '2m_random_50k', split="all")

File ~/env/lib/python3.8/site-packages/datasets/load.py:1782, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1779 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1781 # Download and prepare data
-> 1782 builder_instance.download_and_prepare(
   1783     download_config=download_config,
   1784     download_mode=download_mode,
   1785     verification_mode=verification_mode,
   1786     try_from_hf_gcs=try_from_hf_gcs,
   1787     num_proc=num_proc,
   1788 )
   1790 # Build dataset for splits
   1791 keep_in_memory = (
   1792     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1793 )

File ~/env/lib/python3.8/site-packages/datasets/builder.py:872, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    870     if num_proc is not None:
    871         prepare_split_kwargs["num_proc"] = num_proc
--> 872     self._download_and_prepare(
    873         dl_manager=dl_manager,
    874         verification_mode=verification_mode,
    875         **prepare_split_kwargs,
    876         **download_and_prepare_kwargs,
    877     )
    878 # Sync info
    879 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/env/lib/python3.8/site-packages/datasets/builder.py:1649, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1648 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1649     super()._download_and_prepare(
   1650         dl_manager,
   1651         verification_mode,
   1652         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1653         or verification_mode == VerificationMode.ALL_CHECKS,
   1654         **prepare_splits_kwargs,
   1655     )

File ~/env/lib/python3.8/site-packages/datasets/builder.py:967, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    963 split_dict.add(split_generator.split_info)
    965 try:
    966     # Prepare split will record examples associated to the split
--> 967     self._prepare_split(split_generator, **prepare_split_kwargs)
    968 except OSError as e:
    969     raise OSError(
    970         "Cannot find data file. "
    971         + (self.manual_download_instructions or "")
    972         + "\nOriginal error:\n"
    973         + str(e)
    974     ) from None

File ~/env/lib/python3.8/site-packages/datasets/builder.py:1488, in GeneratorBasedBuilder._prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
   1486 gen_kwargs = split_generator.gen_kwargs
   1487 job_id = 0
-> 1488 for job_id, done, content in self._prepare_split_single(
   1489     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1490 ):
   1491     if done:
   1492         result = content

File ~/env/lib/python3.8/site-packages/datasets/builder.py:1644, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1642     if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1643         e = e.__context__
-> 1644     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1646 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 13999 examples [00:45, 540.25 examples/s]

Bug in download.py

The download.py script has a few bugs: whenever I would run python download.py -i 1, it would automatically set range_max = 2000 due to the default value in parser.add_argument and would always try to download the entire dataset. Besides that, the confirmation prompt would never appear because the condition range_max - index > 1999 is always false. I have opened a PR to address these issues.

Could not create decoder object

When I try to load the dataset:

dataset = load_dataset('poloclub/diffusiondb', 'large_random_10k')
df = pd.DataFrame(dataset['train'])

I get the following error:

 60         # Use the newer AnimDecoder API to parse the (possibly) animated file,
 61         # and access muxed chunks like ICC/EXIF/XMP.
 ---> 62         self._decoder = _webp.WebPAnimDecoder(self.fp.read())
 63 
 64         # Get info from decoder

 OSError: could not create decoder object

It seems that there is an issue with an object within dataset['train']['image'].

1.5M DiffusionDB Aesthetic and Artifact Ratings

Hey there, I'm the creator of the AI Horde, and for the past year we've been using our community to help us rate the DiffusionDB output for aesthetics and artifacts.

I have created a daily export of the ratings on Hugging Face, and as of today we have reached 1.5M ratings on ~300K images from DiffusionDB. This is because we require 5 ratings per image to ensure a good average.

We started gathering these ratings as a collaboration with LAION, but they haven't been able to find anyone to process the data. I am bringing this to your attention in case you or the Polo Club would be interested in crunching this dataset in any form.

Stable Diffusion version?

Hi,
Thank you for your amazing work :)

I would like to know which version of Stable Diffusion was used to generate these images. Are there different versions or only one?

Thanks for your help ;)
Regards,

Marc-Antoine

Bug in download.py

script: python3 download.py -i 1 -r 2000
problem: only part-000001.zip is left, because every file gets saved as part-000001.zip.

Line 99 in download.py, loop_file_path = f"{output}part-{index:06}.zip", should be changed to loop_file_path = f"{output}part-{idx:06}.zip"?
