The main motivation of the SHIFT15M project is to provide a dataset that contains natural dataset shifts, collected from a web service that was in operation for several years. In addition, the SHIFT15M dataset contains several types of dataset shift, which makes it possible to evaluate the robustness of a model to each type (e.g., covariate shift and target shift).
We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.
System | Python 3.6 | Python 3.7 | Python 3.8 |
---|---|---|---|
Linux CPU | | | |
Linux GPU | | | |
Windows CPU / GPU | Status Currently Unavailable | Status Currently Unavailable | Status Currently Unavailable |
Mac OS CPU | | | |
SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.
$ pip install shift15m
$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl
You can download the SHIFT15M dataset as follows:
from shift15m.datasets import NumLikesRegression
dataset = NumLikesRegression(root="./data", download=True)
Please download the dataset as follows:
$ bash scripts/download_all.sh
To skip the test dataset for set matching (80 GB), which is not required for training, use the following script instead.
$ bash scripts/download_all_wo_set_testdata.sh
The following tasks are now available:
Task | Task type | Shift type | Input shape | Output shape |
---|---|---|---|---|
NumLikesRegression | regression | target shift | (N, 25) | (N, 1) |
SumPricesRegression | regression | covariate shift, target shift | (N, 1) | (N, 1) |
ItemPriceRegression | regression | target shift | (N, 4096) | (N, 1) |
ItemCategoryClassification | classification | target shift | (N, 4096) | (N, 7) |
Set2SetMatching | set-to-set matching | covariate shift | (N, 4096) x (M, 4096) | (1) |
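The shift types listed in the table can be made concrete with a toy check. Target shift refers to a change in the label distribution p(y) between training and test data, while covariate shift refers to a change in the input distribution p(x). The sketch below uses made-up labels (not SHIFT15M data) and two hypothetical helpers, `label_distribution` and `total_variation`, to measure how far apart the label marginals of two splits are. It is an illustration only and is not part of the `shift15m` package.

```python
from collections import Counter


def label_distribution(labels):
    """Return the empirical label marginal p(y) as a dict {label: probability}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up class labels for two splits (illustrative values only).
train_labels = [0, 0, 0, 1, 1, 2, 2, 2]
test_labels = [0, 1, 1, 1, 1, 2, 2, 2]

tv = total_variation(
    label_distribution(train_labels),
    label_distribution(test_labels),
)
print(f"total variation between label marginals: {tv:.3f}")
```

A distance of 0 means the two splits have identical label marginals; larger values indicate a stronger target shift.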
As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.
The original dataset is maintained in JSON format, and each row has the following structure:
{
"user":{"user_id":"xxxx"},
"like_num":"xx",
"set_id":"xxx",
"items":[
{"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
...
],
"publish_date":"yyyy-mm-dd"
}
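As a minimal sketch of how such a row might be read, the snippet below parses one record with the standard library and converts the string-valued fields to typed values. The sample values are made up, `parse_row` is a hypothetical helper (not part of the `shift15m` package), and treating `like_num` and `price` as integers is an assumption based on the template above.

```python
import json
from datetime import datetime

# A made-up row that follows the structure shown above.
raw_row = """
{
  "user": {"user_id": "1234"},
  "like_num": "12",
  "set_id": "567",
  "items": [
    {"price": "4980", "item_id": "100001", "category_id1": "10", "category_id2": "10001"},
    {"price": "12000", "item_id": "100002", "category_id1": "11", "category_id2": "11002"}
  ],
  "publish_date": "2015-03-14"
}
"""


def parse_row(line):
    """Parse one JSON row and cast string fields to typed values."""
    row = json.loads(line)
    return {
        "user_id": row["user"]["user_id"],
        "set_id": row["set_id"],
        # Assumption: like counts and prices are integers stored as strings.
        "like_num": int(row["like_num"]),
        "publish_date": datetime.strptime(row["publish_date"], "%Y-%m-%d").date(),
        "items": [
            {
                "item_id": item["item_id"],
                "price": int(item["price"]),
                "category_id1": item["category_id1"],
                "category_id2": item["category_id2"],
            }
            for item in row["items"]
        ],
    }


parsed = parse_row(raw_row)
total_price = sum(item["price"] for item in parsed["items"])
print(parsed["publish_date"].year, parsed["like_num"], total_price)
```

Keeping `publish_date` as a date object makes it straightforward to group rows by year, which is one way to expose the distribution shifts that accumulate over time.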
To learn more about making a contribution to SHIFT15M, please see the following materials:
The dataset itself is provided under a CC BY-NC 4.0 license. The software in this repository, by contrast, is provided under the MIT license.
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value |
---|---|
name | SHIFT15M Dataset |
alternateName | SHIFT15M |
alternateName | shift15m-dataset |
url | https://github.com/st-tech/zozo-shift15m |
sameAs | https://github.com/st-tech/zozo-shift15m |
description | SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts. |
provider | |
license | |
@misc{Kimura_SHIFT15M_Multiobjective_LargeScale_2021,
author = {Kimura, Masanari and Nakamura, Takuma and Saito, Yuki},
month = {8},
title = {SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts},
year = {2021}
}
No errata are currently available.
- [1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).