pyg-team / pytorch-frame

Tabular Deep Learning Library for PyTorch

Home Page: https://pytorch-frame.readthedocs.io

License: MIT License

Language: Python (100%)
Topics: data-frame, deep-learning, pytorch, tabular-learning

pytorch-frame's Introduction



A modular deep learning framework for building neural network models on heterogeneous tabular data.



Documentation | Paper

PyTorch Frame is a deep learning extension for PyTorch, designed for heterogeneous tabular data with different column types, including numerical, categorical, time, text, and image columns. It offers a modular framework for implementing existing and future methods. The library features implementations of state-of-the-art models, user-friendly mini-batch loaders, benchmark datasets, and interfaces for custom data integration.

PyTorch Frame democratizes deep learning research for tabular data, catering to novices and experts alike. Our goals are:

  1. Facilitate Deep Learning for Tabular Data: Historically, tree-based models (e.g., GBDTs) have excelled at tabular learning, but they have notable limitations, such as difficulty integrating with downstream models and handling complex column types such as text, sequences, and embeddings. Deep tabular models are a promising way to resolve these limitations. We aim to facilitate deep learning research on tabular data by modularizing its implementation and supporting diverse column types.

  2. Integrate with Diverse Model Architectures like Large Language Models: PyTorch Frame supports integration with a variety of architectures, including LLMs. With any downloaded model or embedding API endpoint, you can encode your text columns as embeddings and train them with deep learning models alongside other complex semantic types. We support the following providers, among others (a minimal sketch follows the list):

  • OpenAI: OpenAI Embedding Code Example
  • Cohere: Cohere Embed v3 Code Example
  • Hugging Face: Hugging Face Code Example
  • Voyage AI: Voyage AI Code Example
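
For example, a Hugging Face sentence-transformers model can be wrapped into an embedding function and handed to a dataset at materialization time. The following is a minimal sketch, not an official recipe: the TextEmbedderConfig import path and the text_embedder_cfg argument follow the documentation and the issues further below and may differ across versions, and the DataFrame and column names are made up.

from typing import List

import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

from torch_frame import stype
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.data import Dataset


class TextToEmbedding:
    def __init__(self, device: str = 'cpu'):
        self.model = SentenceTransformer('all-distilroberta-v1', device=device)

    def __call__(self, sentences: List[str]) -> torch.Tensor:
        # Return a [num_sentences, emb_dim] float tensor.
        return torch.from_numpy(self.model.encode(sentences))


# A tiny stand-in DataFrame with one free-text column.
df = pd.DataFrame({
    'review': ['crisp and dry', 'rich, oaky finish', 'too sweet for me'],
    'label': ['white', 'red', 'red'],
})

dataset = Dataset(
    df,
    col_to_stype={'review': stype.text_embedded, 'label': stype.categorical},
    target_col='label',
    text_embedder_cfg=TextEmbedderConfig(text_embedder=TextToEmbedding(),
                                         batch_size=32),
)
dataset.materialize()  # text columns are embedded once, at materialization time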

Library Highlights

PyTorch Frame builds directly upon PyTorch, ensuring a smooth transition for existing PyTorch users. Key features include:

  • Diverse column types: PyTorch Frame supports learning across various column types: numerical, categorical, multicategorical, text_embedded, text_tokenized, timestamp, image_embedded, and embedding. See here for the detailed tutorial.
  • Modular model design: Enables modular deep learning model implementations, promoting reusability, clear coding, and experimentation flexibility. Further details in the architecture overview.
  • Models: Implements many state-of-the-art deep tabular models as well as strong GBDTs (XGBoost, CatBoost, and LightGBM) with hyperparameter tuning.
  • Datasets: Comes with a collection of readily-usable tabular datasets. Also supports custom datasets to solve your own problem. We benchmark deep tabular models against GBDTs.
  • PyTorch integration: Integrates effortlessly with other PyTorch libraries, facilitating end-to-end training of PyTorch Frame with downstream PyTorch models. For example, by integrating with PyG, a PyTorch library for GNNs, we can perform deep learning over relational databases. Learn more in RelBench and example code (WIP).

Architecture Overview

Models in PyTorch Frame follow a modular design of FeatureEncoder, TableConv, and Decoder, as shown in the figure below:

In essence, this modular setup empowers users to effortlessly experiment with myriad architectures:

  • Materialization handles converting the raw pandas DataFrame into a TensorFrame that is amenable to PyTorch-based training and modeling.
  • FeatureEncoder encodes TensorFrame into hidden column embeddings of size [batch_size, num_cols, channels].
  • TableConv models column-wise interactions over the hidden embeddings.
  • Decoder generates embedding/prediction per row.

Quick Tour

In this quick tour, we showcase the ease of creating and training a deep tabular model with only a few lines of code.

Build and train your own deep tabular model

As an example, we implement a simple ExampleTransformer following the modular architecture of PyTorch Frame. In the example below:

  • self.encoder maps an input TensorFrame to an embedding of size [batch_size, num_cols, channels].
  • self.convs iteratively transforms the embedding of size [batch_size, num_cols, channels] into an embedding of the same size.
  • self.decoder pools the embedding of size [batch_size, num_cols, channels] into [batch_size, out_channels].
from torch import Tensor
from torch.nn import Linear, Module, ModuleList

import torch_frame
from torch_frame import TensorFrame, stype
from torch_frame.nn.conv import TabTransformerConv
from torch_frame.nn.encoder import (
    EmbeddingEncoder,
    LinearEncoder,
    StypeWiseFeatureEncoder,
)

class ExampleTransformer(Module):
    def __init__(
        self,
        channels, out_channels, num_layers, num_heads,
        col_stats, col_names_dict,
    ):
        super().__init__()
        self.encoder = StypeWiseFeatureEncoder(
            out_channels=channels,
            col_stats=col_stats,
            col_names_dict=col_names_dict,
            stype_encoder_dict={
                stype.categorical: EmbeddingEncoder(),
                stype.numerical: LinearEncoder()
            },
        )
        self.convs = ModuleList([
            TabTransformerConv(
                channels=channels,
                num_heads=num_heads,
            ) for _ in range(num_layers)
        ])
        self.decoder = Linear(channels, out_channels)

    def forward(self, tf: TensorFrame) -> Tensor:
        x, _ = self.encoder(tf)
        for conv in self.convs:
            x = conv(x)
        out = self.decoder(x.mean(dim=1))
        return out

To prepare the data, we can quickly instantiate a pre-defined dataset and create a PyTorch-compatible data loader as follows:

from torch_frame.datasets import Yandex
from torch_frame.data import DataLoader

dataset = Yandex(root='/tmp/adult', name='adult')
dataset.materialize()
train_dataset = dataset[:0.8]
train_loader = DataLoader(train_dataset.tensor_frame, batch_size=128,
                          shuffle=True)

Then, we just follow the standard PyTorch training procedure to optimize the model parameters. That's it!

import torch
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ExampleTransformer(
    channels=32,
    out_channels=dataset.num_classes,
    num_layers=2,
    num_heads=8,
    col_stats=train_dataset.col_stats,
    col_names_dict=train_dataset.tensor_frame.col_names_dict,
).to(device)

optimizer = torch.optim.Adam(model.parameters())

for epoch in range(50):
    for tf in train_loader:
        tf = tf.to(device)
        pred = model(tf)
        loss = F.cross_entropy(pred, tf.y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
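
A minimal evaluation sketch (not part of the original snippet), assuming the remaining rows are held out as a test split via the same fractional slicing:

test_dataset = dataset[0.8:]
test_loader = DataLoader(test_dataset.tensor_frame, batch_size=128)

model.eval()
correct = total = 0
with torch.no_grad():
    for tf in test_loader:
        tf = tf.to(device)
        pred = model(tf)
        correct += int((pred.argmax(dim=-1) == tf.y).sum())
        total += len(tf.y)
print(f'Test Acc: {correct / total:.4f}')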

Implemented Deep Tabular Models

The currently supported deep tabular models include FT-Transformer, ResNet, TabNet, TabTransformer, Trompt, and ExcelFormer, among others (see the benchmark below).

In addition, we provide XGBoost, CatBoost, and LightGBM examples with hyperparameter tuning via Optuna for users who'd like to compare their model performance against GBDTs.
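
A rough sketch of such a comparison follows; the torch_frame.gbdt class names, the tune/predict methods, and the TaskType import path are assumptions based on the documentation and may differ across versions.

from torch_frame.gbdt import XGBoost
from torch_frame.typing import TaskType

# Splits via the same fractional slicing as the Quick Tour (fractions arbitrary).
train_dataset, val_dataset, test_dataset = (dataset[:0.7], dataset[0.7:0.8],
                                            dataset[0.8:])

gbdt = XGBoost(task_type=TaskType.BINARY_CLASSIFICATION)
gbdt.tune(tf_train=train_dataset.tensor_frame,
          tf_val=val_dataset.tensor_frame, num_trials=20)
pred = gbdt.predict(tf_test=test_dataset.tensor_frame)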

Benchmark

We benchmark recent tabular deep learning models against GBDTs over diverse public datasets with different sizes and task types.

The following chart shows the performance of various models on small regression datasets, where each row corresponds to a model and each column to a dataset index (13 datasets in total). For more results on classification and larger datasets, please check the benchmark documentation.

Model Name dataset_0 dataset_1 dataset_2 dataset_3 dataset_4 dataset_5 dataset_6 dataset_7 dataset_8 dataset_9 dataset_10 dataset_11 dataset_12
XGBoost 0.247±0.000 0.077±0.000 0.167±0.000 1.119±0.000 0.328±0.000 1.024±0.000 0.292±0.000 0.606±0.000 0.876±0.000 0.023±0.000 0.697±0.000 0.865±0.000 0.435±0.000
CatBoost 0.265±0.000 0.062±0.000 0.128±0.000 0.336±0.000 0.346±0.000 0.443±0.000 0.375±0.000 0.273±0.000 0.881±0.000 0.040±0.000 0.756±0.000 0.876±0.000 0.439±0.000
LightGBM 0.253±0.000 0.054±0.000 0.112±0.000 0.302±0.000 0.325±0.000 0.384±0.000 0.295±0.000 0.272±0.000 0.877±0.000 0.011±0.000 0.702±0.000 0.863±0.000 0.395±0.000
Trompt 0.261±0.003 0.015±0.005 0.118±0.001 0.262±0.001 0.323±0.001 0.418±0.003 0.329±0.009 0.312±0.002 OOM 0.008±0.001 0.779±0.006 0.874±0.004 0.424±0.005
ResNet 0.288±0.006 0.018±0.003 0.124±0.001 0.268±0.001 0.335±0.001 0.434±0.004 0.325±0.012 0.324±0.004 0.895±0.005 0.036±0.002 0.794±0.006 0.875±0.004 0.468±0.004
FTTransformerBucket 0.325±0.008 0.096±0.005 0.360±0.354 0.284±0.005 0.342±0.004 0.441±0.003 0.345±0.007 0.339±0.003 OOM 0.105±0.011 0.807±0.010 0.885±0.008 0.468±0.006
ExcelFormer 0.302±0.003 0.099±0.003 0.145±0.003 0.382±0.011 0.344±0.002 0.411±0.005 0.359±0.016 0.336±0.008 OOM 0.192±0.014 0.794±0.005 0.890±0.003 0.445±0.005
FTTransformer 0.335±0.010 0.161±0.022 0.140±0.002 0.277±0.004 0.335±0.003 0.445±0.003 0.361±0.018 0.345±0.005 OOM 0.106±0.012 0.826±0.005 0.896±0.007 0.461±0.003
TabNet 0.279±0.003 0.224±0.016 0.141±0.010 0.275±0.002 0.348±0.003 0.451±0.007 0.355±0.030 0.332±0.004 0.992±0.182 0.015±0.002 0.805±0.014 0.885±0.013 0.544±0.011
TabTransformer 0.624±0.003 0.229±0.003 0.369±0.005 0.340±0.004 0.388±0.002 0.539±0.003 0.619±0.005 0.351±0.001 0.893±0.005 0.431±0.001 0.819±0.002 0.886±0.005 0.545±0.004

We see that some recent deep tabular models achieve performance competitive with strong GBDTs (despite being 5-100 times slower to train). Making deep tabular models even more performant with less compute is a fruitful direction for future research.

We also benchmark different text encoders on a real-world tabular dataset (Wine Reviews) with one text column. The following table shows the performance:

| Test Acc | Method | Model Name | Source |
|----------|--------|------------|--------|
| 0.7926 | Pre-trained | sentence-transformers/all-distilroberta-v1 (125M # params) | Hugging Face |
| 0.7998 | Pre-trained | embed-english-v3.0 (dimension size: 1024) | Cohere |
| 0.8102 | Pre-trained | text-embedding-ada-002 (dimension size: 1536) | OpenAI |
| 0.8147 | Pre-trained | voyage-01 (dimension size: 1024) | Voyage AI |
| 0.8203 | Pre-trained | intfloat/e5-mistral-7b-instruct (7B # params) | Hugging Face |
| 0.8230 | LoRA Finetune | DistilBERT (66M # params) | Hugging Face |

The benchmark script for Hugging Face text encoders is in this file and for the rest of text encoders is in this file.

Installation

PyTorch Frame is available for Python 3.8 to Python 3.11.

pip install pytorch_frame

See the installation guide for other options.

Cite

If you use PyTorch Frame in your work, please cite our paper (BibTeX below).

@article{hu2024pytorch,
  title={PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning},
  author={Hu, Weihua and Yuan, Yiwen and Zhang, Zecheng and Nitta, Akihiro and Cao, Kaidi and Kocijan, Vid and Leskovec, Jure and Fey, Matthias},
  journal={arXiv preprint arXiv:2404.00776},
  year={2024}
}

pytorch-frame's People

Contributors

akihironitta, anas-rz, berkekisin, damianszwichtenberg, dependabot[bot], drivanov, eliazonta, february24-lee, itsjayway, jyansir, kaidic, pre-commit-ci[bot], rishabh-ranjan, rusty1s, simonpop, toenshoff, vid-koci, weihua916, xinweihe, xnuohz, yiweny, zechengz


pytorch-frame's Issues

Error in Handling Heterogeneous Semantic Types (in documentation)

The following code from the example documentation produces the error below; the arrays end up not all being the same length. Can you please look into this?

import random

import numpy as np
import pandas as pd

# Numerical column
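# note: size=10 here, while every other column below uses 100 rows; this mismatch triggers the error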
numerical = np.random.randint(0, 100, size=10)

# Categorical column
simple_categories = ['Type 1', 'Type 2', 'Type 3']
categorical = np.random.choice(simple_categories, size=100)

# Timestamp column
time = pd.date_range(start='2023-01-01', periods=100, freq='D')

# Multicategorical column
categories = ['Category A', 'Category B', 'Category C', 'Category D']
multicategorical = [
    random.sample(categories, k=random.randint(0, len(categories)))
    for _ in range(100)
]

# Embedding column (assuming an embedding size of 5 for simplicity)
embedding_size = 5
embedding = np.random.rand(100, embedding_size)

# Create the DataFrame
df = pd.DataFrame({
    'Numerical': numerical,
    'Categorical': categorical,
    'Time': time,
    'Multicategorical': multicategorical,
    'Embedding': list(embedding)
})

Error:

 "Mixing dicts with non-Series may lead to ambiguous ordering."
ValueError: All arrays must be of the same length

Feature Importance

Feature

Support feature importance in tabular data scenarios.

  1. Understand which features are beneficial for prediction and help to develop new features
  2. Feature selection, removing features that are not helpful in prediction

Ideas

  1. GBDTs naturally expose APIs for calculating feature importance, so this is easy to add.
  2. NNs
    • Permutation: after shuffling a given feature, observe the change in the metric. The greater the change, the more important the feature. Simple (a sketch follows this list).
    • SHAP: more complex.
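
A minimal permutation-importance sketch for a trained PyTorch Frame model (not a library API). It assumes numerical features are stored as a dense [num_rows, num_cols] tensor under feat_dict[stype.numerical]; model, metric_fn, and tensor_frame are placeholders for your own objects.

import torch

from torch_frame import TensorFrame, stype


def permutation_importance(model, tensor_frame, metric_fn):
    model.eval()
    scores = {}
    with torch.no_grad():
        base = metric_fn(model(tensor_frame), tensor_frame.y)
        for i, col in enumerate(tensor_frame.col_names_dict[stype.numerical]):
            feat_dict = dict(tensor_frame.feat_dict)
            feat = feat_dict[stype.numerical].clone()
            # Shuffle a single numerical column across rows.
            perm = torch.randperm(len(feat), device=feat.device)
            feat[:, i] = feat[perm, i]
            feat_dict[stype.numerical] = feat
            tf_perm = TensorFrame(feat_dict=feat_dict,
                                  col_names_dict=tensor_frame.col_names_dict,
                                  y=tensor_frame.y)
            scores[col] = metric_fn(model(tf_perm), tensor_frame.y) - base
    return scores  # the larger the change, the more important the column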

Auto inference of `stype`

Currently, we ask our users to specify the stype of each column, but this is sometimes tedious. We may come up with simple rules to classify each column into an existing stype, e.g., based on the pandas dtype and column cardinality (see the sketch below).
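
A hypothetical heuristic for inferring an stype from a pandas column; the thresholds and rules are illustrative only, not a library feature.

import pandas as pd

from torch_frame import stype


def infer_stype(ser: pd.Series, max_categories: int = 20) -> stype:
    # Illustrative rules: timestamps, then high-cardinality numerics,
    # then list-like multicategorical values, else categorical.
    if pd.api.types.is_datetime64_any_dtype(ser):
        return stype.timestamp
    if pd.api.types.is_numeric_dtype(ser) and ser.nunique() > max_categories:
        return stype.numerical
    if ser.apply(lambda v: isinstance(v, (list, set, tuple))).all():
        return stype.multicategorical
    return stype.categorical


# Example: infer the whole mapping for a DataFrame `df`.
# col_to_stype = {col: infer_stype(df[col]) for col in df.columns}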

More stype support

We aim to expand the scope of PyTorch Frame beyond basic numerical and categorical columns.

stype.text_embedded @zechengz @weihua916

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.text_tokenized @zechengz

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.timestamp @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.sequence_numerical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.sequence_categorical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.multi_categorical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.embeddings @akihironitta

  • MultiEmbeddingTensor
  • TensorMapper
  • StypeEncoder
  • e2e example

Add overwrite logic for dataset materialization

Currently, dataset.materialize() supports a path argument from which it loads a cached object.
However, if any column of the dataset changes, or if we use a different transformer model, the default behavior is still to read from the cached object rather than re-materialize.

I think it's necessary to add logic to overwrite the cached object if the dataset changes. Currently, there is no option to do so; you have to delete the cached object manually (see the workaround sketch below).
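
A hypothetical workaround until an overwrite flag exists: delete the cached file so materialize() rebuilds the TensorFrame from the changed DataFrame. The cache path below is made up, and dataset refers to the Quick Tour object above.

import os
import os.path as osp

cache_path = '/tmp/adult/materialized.pt'  # wherever your `path=` points
if osp.exists(cache_path):
    os.remove(cache_path)  # force re-materialization
dataset.materialize(path=cache_path)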

Initial Model Implementation & Reproduction

We aim to include four popular models in the initial release. This issue keeps track of our progress.

FT-Transformer @kaidic @weihua916

  • Initial implementation #12
  • Initial implementation of LinearBucketEncoder and LinearPeriodicEncoder. #30 #31
  • Include at least 2 datasets from the paper #37
  • Reproducibility of FT-Transformer
  • Reproducibility of ResNet.
  • Reproducibility of LinearBucketEncoder and LinearPeriodicEncoder.

Trompt @weihua916

  • Initial implementation #25
  • Include at least 2 datasets from the paper #33
  • Reproducibility #39

TabNet @zechengz @weihua916

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

ExcelFormer @yiweny

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

TabTransformer @yiweny

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

Example encoding + self-attention + sum pooling @weihua916

  • example code. #54

dataset download errors

Hi,
I came across a dataset download error when I used the script provided in this README file:

from torch_frame.datasets import Yandex
from torch_frame.data import DataLoader

dataset = Yandex(root='/Users/huyu/Github/test/pytorch-frame/adult', name='adult')
dataset.materialize()
train_dataset = dataset[:0.8]
train_loader = DataLoader(train_dataset.tensor_frame, batch_size=128,
                          shuffle=True)

then got the error:

gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
Cell In[7], line 4
      1 from torch_frame.datasets import Yandex
      2 from torch_frame.data import DataLoader
----> 4 dataset = Yandex(root='/Users/huyu/Github/test/pytorch-frame/adult', name='adult')
      5 dataset.materialize()
      6 train_dataset = dataset[:0.8]

File ~/Github/test/pytorch-frame/torch_frame/datasets/yandex.py:215, in Yandex.__init__(self, root, name)
    213 self.root = root
    214 self.name = name
--> 215 path = self.download_url(osp.join(self.base_url, self.name + '.zip'),
    216                          root)
    217 df, col_to_stype = get_df_and_col_to_stype(path)
    218 if name in self.regression_datasets:

File ~/Github/test/pytorch-frame/torch_frame/data/dataset.py:472, in Dataset.download_url(url, root, filename, log)
    453 @staticmethod
    454 def download_url(
    455     url: str,
   (...)
    459     log: bool = True,
    460 ) -> str:
    461     r"""Downloads the content of :obj:`url` to the specified folder
    462     :obj:`root`.
   (...)
    470             the console. (default: :obj:`True`)
    471     """
--> 472     return torch_frame.data.download_url(url, root, filename, log=log)

File ~/Github/test/pytorch-frame/torch_frame/data/download.py:44, in download_url(url, root, filename, log)
     41 os.makedirs(root, exist_ok=True)
     43 context = ssl._create_unverified_context()
---> 44 data = urllib.request.urlopen(url, context=context)
     46 with open(path, 'wb') as f:
     47     while True:

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    214 else:
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout)
    516     req = meth(req)
    518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 519 response = self._open(req, data)
    521 # post-process response
    522 meth_name = protocol+"_response"

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:536, in OpenerDirector._open(self, req, data)
    533     return result
    535 protocol = req.type
--> 536 result = self._call_chain(self.handle_open, protocol, protocol +
    537                           '_open', req)
    538 if result:
    539     return result

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    494 for handler in handlers:
    495     func = getattr(handler, meth_name)
--> 496     result = func(*args)
    497     if result is not None:
    498         return result

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:1391, in HTTPSHandler.https_open(self, req)
   1390 def https_open(self, req):
-> 1391     return self.do_open(http.client.HTTPSConnection, req,
   1392         context=self._context, check_hostname=self._check_hostname)

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:1351, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1348         h.request(req.get_method(), req.selector, req.data, headers,
   1349                   encode_chunked=req.has_header('Transfer-encoding'))
   1350     except OSError as err: # timeout error
-> 1351         raise URLError(err)
   1352     r = h.getresponse()
   1353 except:
URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

how could I resolve this problem?

Thanks,
Yu

Try it with custom dataset

Hi all,

All the examples work fine for me, but I'm working on a custom dataset and couldn't manage to implement it using the torch_frame dataset scripts. I get an error with the DataLoader when using my dataset. I tried to write my own custom data loader, but even the model relies on outputs that come from the dataset or data loader, such as dataset.col_stats.

How did you handle it and use this nice framework for your custom datasets? Any example or tutorial for custom datasets?

Thanks,
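
A minimal sketch of wrapping a custom pandas DataFrame, following the Dataset/col_to_stype interface from the documentation; the column names and data here are made up.

import pandas as pd

from torch_frame import stype
from torch_frame.data import DataLoader, Dataset

# Hypothetical custom data; replace with your own DataFrame.
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'city': ['Paris', 'Berlin', 'Paris', 'Rome'],
    'label': [0, 1, 0, 1],
})

dataset = Dataset(
    df,
    col_to_stype={
        'age': stype.numerical,
        'city': stype.categorical,
        'label': stype.categorical,
    },
    target_col='label',
)
dataset.materialize()  # computes col_stats and builds the TensorFrame

loader = DataLoader(dataset.tensor_frame, batch_size=2, shuffle=True)
# dataset.col_stats and dataset.tensor_frame.col_names_dict can then be
# passed to a model, as in the Quick Tour above.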

Select rows in `TensorFrame` with a bool `Tensor` mask

I was wondering what the intended behavior of a TensorFrame is when passing a row-wise Boolean mask to __getitem__?
Based on pandas DataFrames and torch tensors, a user may expect Boolean masks to work for a TensorFrame object.

Currently, the result depends on the specific stypes occurring in the frame. If no _MultiTensor columns are present, the masking seems to work as expected.
In the _MultiTensor classes, row selection only works correctly for integer tensors but does not raise an error when given a Boolean mask. Instead, it yields an incorrect output, which may produce an invalid TensorFrame. Here is an example:

from torch_frame.datasets import FakeDataset
from torch_frame import stype
import torch

mask = torch.tensor([True, False, True])

# Frame without _MultiTensor Columns
stypes = [stype.categorical, stype.numerical]
tf_1 = FakeDataset(num_rows=3, stypes=stypes).materialize().tensor_frame

print(f'Trying Boolean mask for stypes {stypes}')
tf_1[mask].validate()

# Frame with _MultiTensor Column
stypes = [stype.categorical, stype.multicategorical]
tf_2 = FakeDataset(num_rows=3, stypes=stypes).materialize().tensor_frame

print(f'Trying Boolean mask for stypes {stypes}')
tf_2[mask].validate()

We get the following output:

Trying Boolean mask for stypes [<stype.categorical: 'categorical'>, <stype.numerical: 'numerical'>]
Trying Boolean mask for stypes [<stype.categorical: 'categorical'>, <stype.multicategorical: 'multicategorical'>]
Traceback (most recent call last):
  File "/home/jan/git/torch/multinestedtest.py", line 21, in <module>
    tf_2[mask].validate()
  File "/home/jan/miniconda3/envs/torch/lib/python3.10/site-packages/torch_frame/data/tensor_frame.py", line 120, in validate
    raise ValueError(
ValueError: The length of elements in feat_dict are not aligned, got 3 but expected 2.

Should Boolean masks be supported? I think this would be convenient. If not, then some error should be thrown early when this is passed as input.
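
In the meantime, a possible workaround (a sketch, not a library fix) is to convert the Boolean mask to integer indices before indexing, continuing the example above:

idx = mask.nonzero(as_tuple=True)[0]  # tensor([0, 2])
tf_2[idx].validate()  # integer indexing also works for _MultiTensor columns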

About Excelformer

Hello, thank you for considering ExcelFormer. Just a friendly reminder: ExcelFormer uses a special initialization method along with two data augmentation techniques, which might be worth incorporating into the example. Alternatively, you could add a reminder for readers at the beginning of the example file. Thanks!

sklearn-compatible interface

I think it would be great to have this feature, as sklearn is often used for tabular data. I tried to use skorch, but skorch does not accept TensorFrames and did not work well.

(examples/tutorial.py)

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(module=model, max_epochs=args.epochs, lr=args.lr, 
                            device=device, batch_size=args.batch_size, 
                            classes=dataset.num_classes, iterator_train=DataLoader,
                            iterator_valid=DataLoader, train_split=None)
net.fit(train_dataset, y=None)
Traceback (most recent call last):
  File "\examples\tutorial.py", line 346, in <module>
    net.fit(train_dataset, y=None)
  File "\site-packages\skorch\classifier.py", line 165, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1319, in fit
    self.partial_fit(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1278, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1190, in fit_loop
    self.run_single_epoch(iterator_train, training=True, prefix="train",
  File "\site-packages\skorch\net.py", line 1226, in run_single_epoch
    step = step_fn(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1105, in train_step
    self._step_optimizer(step_fn)
  File "\site-packages\skorch\net.py", line 1060, in _step_optimizer
    optimizer.step(step_fn)
  File "\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\sgd.py", line 66, in step
    loss = closure()
           ^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1094, in step_fn
    step = self.train_step_single(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 993, in train_step_single
    y_pred = self.infer(Xi, **fit_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1517, in infer
    x = to_tensor(x, device=self.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in to_tensor
    return [to_tensor_(x) for x in X]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in <listcomp>
    return [to_tensor_(x) for x in X]
            ^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 118, in to_tensor
    raise TypeError("Cannot convert this data type to a torch tensor.")
TypeError: Cannot convert this data type to a torch tensor.

I think the following changes are needed:

  • Add an ability to convert from DataFrame to TensorFrame without much prior information.
  • Create a wrapper that passes a Tensor to skorch, or create a scikit-learn compatible estimator specifically for this package (a rough sketch of such a wrapper follows below).

I am sorry, but I cannot take much time to assist in creating this feature, so if it is not possible, please close this.
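
A rough sketch of the kind of wrapper the second bullet suggests; FrameClassifier is hypothetical, not part of PyTorch Frame or skorch, and only follows the basic fit/predict conventions of scikit-learn.

import torch
import torch.nn.functional as F

from torch_frame.data import DataLoader


class FrameClassifier:
    """Minimal sklearn-style wrapper around a PyTorch Frame model (sketch)."""
    def __init__(self, model, lr=1e-3, epochs=10, batch_size=128, device='cpu'):
        self.model = model.to(device)
        self.lr, self.epochs = lr, epochs
        self.batch_size, self.device = batch_size, device

    def fit(self, tensor_frame, y=None):  # labels live inside the TensorFrame
        loader = DataLoader(tensor_frame, batch_size=self.batch_size, shuffle=True)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr)
        self.model.train()
        for _ in range(self.epochs):
            for tf in loader:
                tf = tf.to(self.device)
                loss = F.cross_entropy(self.model(tf), tf.y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return self

    def predict(self, tensor_frame):
        loader = DataLoader(tensor_frame, batch_size=self.batch_size)
        self.model.eval()
        preds = []
        with torch.no_grad():
            for tf in loader:
                preds.append(self.model(tf.to(self.device)).argmax(dim=-1).cpu())
        return torch.cat(preds).numpy()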

Fix the image of the architecture overview in our documentation

If I understand it correctly, the image below doesn't reflect our current structure precisely, because the output of FeatureEncoder should be a concatenated tensor, whereas the image illustrates the concatenation being applied outside the FeatureEncoder.

Improve the current split and support more split methods

Right now the split method for the Dataset object is very basic. We should support more flexible splits:

  • Make the current split method less error-prone through refactoring and adding more validation checks
  • Support train/val split in addition to train/val/test split
  • Support a random split function for Dataset for a better user experience
  • Support temporal split, i.e., split based on a time column in the Dataset

Implement Mixup

Mixup is used by ExcelFormer.
My current implementation of ExcelFormer doesn't have mixup, and it performs ~10% worse than the reported numbers on the helena and adult datasets.
The mixup is applied after the embedding layer and on the labels.

I am thinking of implementing it as a transform.

So the BaseTransform will inherit from Module as well:

class BaseTransform(ABC, Module):

And then it looks like

dataset = Yandex()
dataset.materialize()

transform = HiddenMix()
tensor_frame = dataset.tensor_frame.to(device)
transformed_tensor_frame = transform(tensor_frame)

model = ExcelFormer(mixup=transform)

In the ExcelFormerEncoder, we will use the transform as a post module.

Question:

  1. Does it make sense to use mixup without labels (y)?
  2. Currently, the forward function of BaseTransform maps a TensorFrame to a TensorFrame. But if we need to use it as a post module, the forward function needs to take a Tensor and output a Tensor. Perhaps we can change the types of BaseTransform to Union[Tensor, TensorFrame], and then have another class PostEmbeddingTransform that inherits from BaseTransform. Example Implementation
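
For reference, a generic sketch of mixup applied after the embedding layer (a simplification, not the library's HiddenMix transform or ExcelFormer's exact scheme):

import torch

# Mixes rows of a [batch_size, num_cols, channels] embedding and pairs the
# labels accordingly; the caller combines the two label terms in the loss as
# lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b).
def hidden_mixup(x: torch.Tensor, y: torch.Tensor, beta: float = 0.5):
    lam = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam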

Split dataset by `timestamp`

In PyTorch Frame, it may be a good choice to add a split_by_col option to split the data. Otherwise, I need to create the split IDs myself, for example (a usage sketch follows the snippet):

def _add_split(dt: int) -> int:
    if dt <= 20231220:
        return 0
    elif dt <= 20231231:
        return 1
    else:
        return 2
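
A hypothetical usage sketch continuing the snippet above: derive the split from the timestamp column and hand it to Dataset via a split column. The split_col argument, the 0/1/2 = train/val/test convention, dataset.split(), and the column names are assumptions and may differ from the released API.

import pandas as pd

from torch_frame import stype
from torch_frame.data import Dataset

df = pd.DataFrame({'dt': [20231215, 20231225, 20240101],
                   'x': [1.0, 2.0, 3.0],
                   'y': [0, 1, 0]})
df['split'] = df['dt'].apply(_add_split)

dataset = Dataset(
    df,
    col_to_stype={'x': stype.numerical, 'y': stype.categorical},
    target_col='y',
    split_col='split',
)
dataset.materialize()
# split() is assumed to return the subsets defined by split_col.
train_dataset, val_dataset, test_dataset = dataset.split()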

Add extensively tuned XGBoost and CatBoost baselines

After discussing with @weihua916 in the office today, we decided to include extensively tuned XGBoost and CatBoost baselines, so that users of pytorch-frame can easily compare the performance of deep learning approaches against GBDT approaches.

The idea is inspired by ExcelFormer, as its authors compared the performance of ExcelFormer with extensively tuned XGBoost and CatBoost.

The intended interface looks like the following:

xgboost = torch_frame.XGBoost()
xgboost.fit_tune(tf_train, tf_val, num_trials, task_type, task_kwargs, search_kwargs)
y_pred_test = xgboost.predict(tf_test)
test_acc = accuracy(y_pred_test, tf_test.y)

The fit_tune function will do a hyperparameter search by default. If num_trials is 1, then we don't do any hyperparameter search and just use the default hyperparameter values.
task_kwargs holds any keyword arguments for the task (e.g., number of classes for classification, MAE versus RMSE loss for regression).
search_kwargs holds any keyword arguments to manually specify the hyperparameter search space.

Here is the default hyperparameter tuning space in ExcelFormer, which we propose as the default search space for our XGBoost and CatBoost baselines:
[Screenshot: ExcelFormer's default hyperparameter search space]

Allow different `text_embedder_cfg` for each column

Currently, the Dataset class expects a single text_embedder_cfg object that is used globally to embed all text embedding columns.

This may not be desirable in cases where different pre-trained models should be applied to different columns.
An important example would be multi-lingual data where different text columns have different languages, say English and Chinese.

I think it would be great if one could also optionally pass a dictionary with column-wise configs, similar to the col_to_sep argument.
I would gladly add this feature if it is considered helpful.

Thanks for this project by the way, it will be very useful for many datasets.

Integration with TorchData

Hi all,

In my project, I use TorchData to read Parquet files from AWS S3 buckets. Currently, it seems that pytorch-frame cannot be integrated with TorchData. I was wondering if you have any plans to make this possible, or if you have any workaround to read Parquet files from S3 buckets using a torch_frame dataset?

Thanks,

Support `LightGBM`

I've tried LightGBM on my own dataset, and it works well, so I want to integrate it into pytorch_frame.

Remove implicit clone in stype encoders

Currently, we clone the TensorData in StypeEncoders to avoid in-place modifications. But this is bad because it is wasteful. We want to remove that clone while still avoiding in-place modification.

Problem to reproduce similar results when saving the model

I am facing a problem where, when I save the model (either the whole model or just the state dict) and load it again, the results are totally different. Is there any detail of the ExcelFormer algorithm I should be aware of in order to load the state dict into ExcelFormer correctly?

Support inductive `TensorFrame` transformation

Currently, we create a TensorFrame via dataset.materialize(), but this only handles the transductive transformation of the DataFrame. It would be useful to support inductive transformation, i.e., transforming a new DataFrame into a TensorFrame with the same schema as the existing transformed TensorFrame.

Add default NAStrategy to models

Currently, none of the models have a default NAStrategy. If a user runs any model on data containing NaNs, all the gradients can explode. (A simple pre-materialization imputation workaround is sketched below.)
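
Until a default NAStrategy exists, a simple workaround (a sketch with made-up column names) is to impute missing values in the raw DataFrame before creating the Dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [1.0, np.nan, 3.0],
                   'category': ['a', None, 'b']})

# Column mean for numerical columns, mode for categorical ones.
df['amount'] = df['amount'].fillna(df['amount'].mean())
df['category'] = df['category'].fillna(df['category'].mode().iloc[0])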

Bug in the stype encoder

In the materialize stage, we calculate the statistics of all columns, including the target.
However, in these lines, we assume that StatType.MEAN exists for all columns in col_stats.
This causes an error when the target is binary or multi-class if we directly pass dataset.col_stats.

Can it be used for relational datasets?

Is it possible for the library to be used for relational datasets? If we load the database data into multiple DataFrames, e.g. a Customer table, an Orders table, and a Products table, can we use the library to generate data that are both synthetic and related to each other? A customer could have 1-N orders, while its orders could also have 1-N products, depending on the distributions found in the database. This is the idea. So far, I see that it is only for single tables.

Problem in MutualInformationSort

I have a dataset that was working before, but after some changes to it, MutualInformationSort raises this error, specifically when checking whether there are any NaN values. What could this be related to? I really can't understand what is happening.

[Screenshot of the error]

Fix `examples/transformers_text.py`

python transformers_text.py --epochs 1
Epoch: 1:   0%|                                                                                                                                                                                                                  | 0/148 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/transformers_text.py", line 341, in <module>
    train_loss = train(epoch)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/transformers_text.py", line 294, in train
    pred = model(tf)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/torch_frame/nn/models/ft_transformer.py", line 105, in forward
    x, _ = self.encoder(tf)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/torch_frame/nn/encoder/stypewise_encoder.py", line 84, in forward
    x = self.encoder_dict[stype.value](feat, col_names)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/container.py", line 459, in __getitem__
    return self._modules[key]
KeyError: 'text_embedded'

Fix `examples/mercari.py`

python examples/mercari.py  --epochs 1
Epoch: 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:01<00:00,  8.09it/s]
Train Loss: 2503.9182, Train RMSE: 49.0190, Val RMSE: 33.1917
Best Val RMSE: 33.1917,
Traceback (most recent call last):
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/mercari.py", line 196, in <module>
    pred = np.concatenate(all_preds).flatten()
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate
