pyg-team / pytorch-frame

Tabular Deep Learning Library for PyTorch

Home Page: https://pytorch-frame.readthedocs.io

License: MIT License

Language: Python (100%)
Topics: data-frame, deep-learning, pytorch, tabular-learning

pytorch-frame's Introduction



A modular deep learning framework for building neural network models on heterogeneous tabular data.



Documentation | Paper

PyTorch Frame is a deep learning extension for PyTorch, designed for heterogeneous tabular data with different column types, including numerical, categorical, time, text, and image columns. It offers a modular framework for implementing existing and future methods. The library features implementations of state-of-the-art models, user-friendly mini-batch loaders, benchmark datasets, and interfaces for custom data integration.

PyTorch Frame democratizes deep learning research for tabular data, catering to novices and experts alike. Our goals are:

  1. Facilitate Deep Learning for Tabular Data: Historically, tree-based models (e.g., GBDTs) have excelled at tabular learning, but they have notable limitations, such as difficulty integrating with downstream models and handling complex column types such as text, sequences, and embeddings. Deep tabular models are a promising way to resolve these limitations. We aim to facilitate deep learning research on tabular data by modularizing its implementation and supporting diverse column types.

  2. Integrate with Diverse Model Architectures like Large Language Models: PyTorch Frame supports integration with a variety of architectures, including LLMs. With any downloaded model or embedding API endpoint, you can encode your text columns as embeddings and train them with deep learning models alongside other complex semantic types. We support the following providers, among others (a minimal sketch follows the list):

  • OpenAI: OpenAI Embedding Code Example
  • Cohere: Cohere Embed v3 Code Example
  • Hugging Face: Hugging Face Code Example
  • Voyage AI: Voyage AI Code Example
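
For example, a Hugging Face sentence-transformers model can be wrapped into an embedding function and handed to a dataset at materialization time. The following is a minimal sketch, not an official recipe: the TextEmbedderConfig import path and the text_embedder_cfg argument follow the documentation and the issues further below and may differ across versions, and the DataFrame and column names are made up.

from typing import List

import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

from torch_frame import stype
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.data import Dataset


class TextToEmbedding:
    def __init__(self, device: str = 'cpu'):
        self.model = SentenceTransformer('all-distilroberta-v1', device=device)

    def __call__(self, sentences: List[str]) -> torch.Tensor:
        # Return a [num_sentences, emb_dim] float tensor.
        return torch.from_numpy(self.model.encode(sentences))


# A tiny stand-in DataFrame with one free-text column.
df = pd.DataFrame({
    'review': ['crisp and dry', 'rich, oaky finish', 'too sweet for me'],
    'label': ['white', 'red', 'red'],
})

dataset = Dataset(
    df,
    col_to_stype={'review': stype.text_embedded, 'label': stype.categorical},
    target_col='label',
    text_embedder_cfg=TextEmbedderConfig(text_embedder=TextToEmbedding(),
                                         batch_size=32),
)
dataset.materialize()  # text columns are embedded once, at materialization time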

Library Highlights

PyTorch Frame builds directly upon PyTorch, ensuring a smooth transition for existing PyTorch users. Key features include:

  • Diverse column types: PyTorch Frame supports learning across various column types: numerical, categorical, multicategorical, text_embedded, text_tokenized, timestamp, image_embedded, and embedding. See here for the detailed tutorial.
  • Modular model design: Enables modular deep learning model implementations, promoting reusability, clear coding, and experimentation flexibility. Further details in the architecture overview.
  • Models: Implements many state-of-the-art deep tabular models as well as strong GBDTs (XGBoost, CatBoost, and LightGBM) with hyperparameter tuning.
  • Datasets: Comes with a collection of readily-usable tabular datasets. Also supports custom datasets to solve your own problem. We benchmark deep tabular models against GBDTs.
  • PyTorch integration: Integrates effortlessly with other PyTorch libraries, facilitating end-to-end training of PyTorch Frame with downstream PyTorch models. For example, by integrating with PyG, a PyTorch library for GNNs, we can perform deep learning over relational databases. Learn more in RelBench and example code (WIP).

Architecture Overview

Models in PyTorch Frame follow a modular design of FeatureEncoder, TableConv, and Decoder, as shown in the figure below:

In essence, this modular setup empowers users to effortlessly experiment with myriad architectures:

  • Materialization handles converting the raw pandas DataFrame into a TensorFrame that is amenable to PyTorch-based training and modeling.
  • FeatureEncoder encodes TensorFrame into hidden column embeddings of size [batch_size, num_cols, channels].
  • TableConv models column-wise interactions over the hidden embeddings.
  • Decoder generates embedding/prediction per row.

Quick Tour

In this quick tour, we showcase the ease of creating and training a deep tabular model with only a few lines of code.

Build and train your own deep tabular model

As an example, we implement a simple ExampleTransformer following the modular architecture of PyTorch Frame. In the example below:

  • self.encoder maps an input TensorFrame to an embedding of size [batch_size, num_cols, channels].
  • self.convs iteratively transforms the embedding of size [batch_size, num_cols, channels] into an embedding of the same size.
  • self.decoder pools the embedding of size [batch_size, num_cols, channels] into [batch_size, out_channels].
from torch import Tensor
from torch.nn import Linear, Module, ModuleList

import torch_frame
from torch_frame import TensorFrame, stype
from torch_frame.nn.conv import TabTransformerConv
from torch_frame.nn.encoder import (
    EmbeddingEncoder,
    LinearEncoder,
    StypeWiseFeatureEncoder,
)

class ExampleTransformer(Module):
    def __init__(
        self,
        channels, out_channels, num_layers, num_heads,
        col_stats, col_names_dict,
    ):
        super().__init__()
        self.encoder = StypeWiseFeatureEncoder(
            out_channels=channels,
            col_stats=col_stats,
            col_names_dict=col_names_dict,
            stype_encoder_dict={
                stype.categorical: EmbeddingEncoder(),
                stype.numerical: LinearEncoder()
            },
        )
        self.convs = ModuleList([
            TabTransformerConv(
                channels=channels,
                num_heads=num_heads,
            ) for _ in range(num_layers)
        ])
        self.decoder = Linear(channels, out_channels)

    def forward(self, tf: TensorFrame) -> Tensor:
        x, _ = self.encoder(tf)
        for conv in self.convs:
            x = conv(x)
        out = self.decoder(x.mean(dim=1))
        return out

To prepare the data, we can quickly instantiate a pre-defined dataset and create a PyTorch-compatible data loader as follows:

from torch_frame.datasets import Yandex
from torch_frame.data import DataLoader

dataset = Yandex(root='/tmp/adult', name='adult')
dataset.materialize()
train_dataset = dataset[:0.8]
train_loader = DataLoader(train_dataset.tensor_frame, batch_size=128,
                          shuffle=True)

Then, we just follow the standard PyTorch training procedure to optimize the model parameters. That's it!

import torch
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ExampleTransformer(
    channels=32,
    out_channels=dataset.num_classes,
    num_layers=2,
    num_heads=8,
    col_stats=train_dataset.col_stats,
    col_names_dict=train_dataset.tensor_frame.col_names_dict,
).to(device)

optimizer = torch.optim.Adam(model.parameters())

for epoch in range(50):
    for tf in train_loader:
        tf = tf.to(device)
        pred = model(tf)
        loss = F.cross_entropy(pred, tf.y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
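
A minimal evaluation sketch (not part of the original snippet), assuming the remaining rows are held out as a test split via the same fractional slicing:

test_dataset = dataset[0.8:]
test_loader = DataLoader(test_dataset.tensor_frame, batch_size=128)

model.eval()
correct = total = 0
with torch.no_grad():
    for tf in test_loader:
        tf = tf.to(device)
        pred = model(tf)
        correct += int((pred.argmax(dim=-1) == tf.y).sum())
        total += len(tf.y)
print(f'Test Acc: {correct / total:.4f}')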

Implemented Deep Tabular Models

The currently supported deep tabular models include FT-Transformer, ResNet, TabNet, TabTransformer, Trompt, and ExcelFormer, among others (see the benchmark below).

In addition, we provide XGBoost, CatBoost, and LightGBM examples with hyperparameter tuning via Optuna for users who'd like to compare their model performance against GBDTs.
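
A rough sketch of such a comparison follows; the torch_frame.gbdt class names, the tune/predict methods, and the TaskType import path are assumptions based on the documentation and may differ across versions.

from torch_frame.gbdt import XGBoost
from torch_frame.typing import TaskType

# Splits via the same fractional slicing as the Quick Tour (fractions arbitrary).
train_dataset, val_dataset, test_dataset = (dataset[:0.7], dataset[0.7:0.8],
                                            dataset[0.8:])

gbdt = XGBoost(task_type=TaskType.BINARY_CLASSIFICATION)
gbdt.tune(tf_train=train_dataset.tensor_frame,
          tf_val=val_dataset.tensor_frame, num_trials=20)
pred = gbdt.predict(tf_test=test_dataset.tensor_frame)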

Benchmark

We benchmark recent tabular deep learning models against GBDTs over diverse public datasets with different sizes and task types.

The following chart shows the performance of various models on small regression datasets, where each row corresponds to a model and each column to a dataset index (13 datasets in total). For more results on classification and larger datasets, please check the benchmark documentation.

Model Name dataset_0 dataset_1 dataset_2 dataset_3 dataset_4 dataset_5 dataset_6 dataset_7 dataset_8 dataset_9 dataset_10 dataset_11 dataset_12
XGBoost 0.247±0.000 0.077±0.000 0.167±0.000 1.119±0.000 0.328±0.000 1.024±0.000 0.292±0.000 0.606±0.000 0.876±0.000 0.023±0.000 0.697±0.000 0.865±0.000 0.435±0.000
CatBoost 0.265±0.000 0.062±0.000 0.128±0.000 0.336±0.000 0.346±0.000 0.443±0.000 0.375±0.000 0.273±0.000 0.881±0.000 0.040±0.000 0.756±0.000 0.876±0.000 0.439±0.000
LightGBM 0.253±0.000 0.054±0.000 0.112±0.000 0.302±0.000 0.325±0.000 0.384±0.000 0.295±0.000 0.272±0.000 0.877±0.000 0.011±0.000 0.702±0.000 0.863±0.000 0.395±0.000
Trompt 0.261±0.003 0.015±0.005 0.118±0.001 0.262±0.001 0.323±0.001 0.418±0.003 0.329±0.009 0.312±0.002 OOM 0.008±0.001 0.779±0.006 0.874±0.004 0.424±0.005
ResNet 0.288±0.006 0.018±0.003 0.124±0.001 0.268±0.001 0.335±0.001 0.434±0.004 0.325±0.012 0.324±0.004 0.895±0.005 0.036±0.002 0.794±0.006 0.875±0.004 0.468±0.004
FTTransformerBucket 0.325±0.008 0.096±0.005 0.360±0.354 0.284±0.005 0.342±0.004 0.441±0.003 0.345±0.007 0.339±0.003 OOM 0.105±0.011 0.807±0.010 0.885±0.008 0.468±0.006
ExcelFormer 0.302±0.003 0.099±0.003 0.145±0.003 0.382±0.011 0.344±0.002 0.411±0.005 0.359±0.016 0.336±0.008 OOM 0.192±0.014 0.794±0.005 0.890±0.003 0.445±0.005
FTTransformer 0.335±0.010 0.161±0.022 0.140±0.002 0.277±0.004 0.335±0.003 0.445±0.003 0.361±0.018 0.345±0.005 OOM 0.106±0.012 0.826±0.005 0.896±0.007 0.461±0.003
TabNet 0.279±0.003 0.224±0.016 0.141±0.010 0.275±0.002 0.348±0.003 0.451±0.007 0.355±0.030 0.332±0.004 0.992±0.182 0.015±0.002 0.805±0.014 0.885±0.013 0.544±0.011
TabTransformer 0.624±0.003 0.229±0.003 0.369±0.005 0.340±0.004 0.388±0.002 0.539±0.003 0.619±0.005 0.351±0.001 0.893±0.005 0.431±0.001 0.819±0.002 0.886±0.005 0.545±0.004

We see that some recent deep tabular models achieve performance competitive with strong GBDTs (despite being 5-100 times slower to train). Making deep tabular models even more performant with less compute is a fruitful direction for future research.

We also benchmark different text encoders on a real-world tabular dataset (Wine Reviews) with one text column. The following table shows the performance:

| Test Acc | Method | Model Name | Source |
|----------|--------|------------|--------|
| 0.7926 | Pre-trained | sentence-transformers/all-distilroberta-v1 (125M # params) | Hugging Face |
| 0.7998 | Pre-trained | embed-english-v3.0 (dimension size: 1024) | Cohere |
| 0.8102 | Pre-trained | text-embedding-ada-002 (dimension size: 1536) | OpenAI |
| 0.8147 | Pre-trained | voyage-01 (dimension size: 1024) | Voyage AI |
| 0.8203 | Pre-trained | intfloat/e5-mistral-7b-instruct (7B # params) | Hugging Face |
| 0.8230 | LoRA Finetune | DistilBERT (66M # params) | Hugging Face |

The benchmark script for Hugging Face text encoders is in this file and for the rest of text encoders is in this file.

Installation

PyTorch Frame is available for Python 3.8 to Python 3.11.

pip install pytorch_frame

See the installation guide for other options.

Cite

If you use PyTorch Frame in your work, please cite our paper (BibTeX below).

@article{hu2024pytorch,
  title={PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning},
  author={Hu, Weihua and Yuan, Yiwen and Zhang, Zecheng and Nitta, Akihiro and Cao, Kaidi and Kocijan, Vid and Leskovec, Jure and Fey, Matthias},
  journal={arXiv preprint arXiv:2404.00776},
  year={2024}
}

pytorch-frame's People

Contributors

akihironitta, anas-rz, berkekisin, damianszwichtenberg, dependabot[bot], drivanov, eliazonta, february24-lee, itsjayway, jyansir, kaidic, pre-commit-ci[bot], rishabh-ranjan, rusty1s, simonpop, toenshoff, vid-koci, weihua916, xinweihe, xnuohz, yiweny, zechengz


pytorch-frame's Issues

Error in Handling Heterogeneous Semantic Types (in documentation)

The following code from the example documentation produces the error below; the arrays end up not all being the same length. Can you please look into this?

import random

import numpy as np
import pandas as pd

# Numerical column
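# note: size=10 here, while every other column below uses 100 rows; this mismatch triggers the error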
numerical = np.random.randint(0, 100, size=10)

# Categorical column
simple_categories = ['Type 1', 'Type 2', 'Type 3']
categorical = np.random.choice(simple_categories, size=100)

# Timestamp column
time = pd.date_range(start='2023-01-01', periods=100, freq='D')

# Multicategorical column
categories = ['Category A', 'Category B', 'Category C', 'Category D']
multicategorical = [
    random.sample(categories, k=random.randint(0, len(categories)))
    for _ in range(100)
]

# Embedding column (assuming an embedding size of 5 for simplicity)
embedding_size = 5
embedding = np.random.rand(100, embedding_size)

# Create the DataFrame
df = pd.DataFrame({
    'Numerical': numerical,
    'Categorical': categorical,
    'Time': time,
    'Multicategorical': multicategorical,
    'Embedding': list(embedding)
})

Error:

 "Mixing dicts with non-Series may lead to ambiguous ordering."
ValueError: All arrays must be of the same length

Feature Importance

Feature

Support feature importance in tabular data scenarios.

  1. Understand which features are beneficial for prediction and help to develop new features
  2. Feature selection, removing features that are not helpful in prediction

Ideas

  1. GBDTs naturally expose APIs for calculating feature importance, so this is easy to add.
  2. NNs
    • Permutation: after shuffling a given feature, observe the change in the metric. The greater the change, the more important the feature. Simple (a sketch follows this list).
    • SHAP: more complex.
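
A minimal permutation-importance sketch for a trained PyTorch Frame model (not a library API). It assumes numerical features are stored as a dense [num_rows, num_cols] tensor under feat_dict[stype.numerical]; model, metric_fn, and tensor_frame are placeholders for your own objects.

import torch

from torch_frame import TensorFrame, stype


def permutation_importance(model, tensor_frame, metric_fn):
    model.eval()
    scores = {}
    with torch.no_grad():
        base = metric_fn(model(tensor_frame), tensor_frame.y)
        for i, col in enumerate(tensor_frame.col_names_dict[stype.numerical]):
            feat_dict = dict(tensor_frame.feat_dict)
            feat = feat_dict[stype.numerical].clone()
            # Shuffle a single numerical column across rows.
            perm = torch.randperm(len(feat), device=feat.device)
            feat[:, i] = feat[perm, i]
            feat_dict[stype.numerical] = feat
            tf_perm = TensorFrame(feat_dict=feat_dict,
                                  col_names_dict=tensor_frame.col_names_dict,
                                  y=tensor_frame.y)
            scores[col] = metric_fn(model(tf_perm), tensor_frame.y) - base
    return scores  # the larger the change, the more important the column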

Auto inference of `stype`

Currently, we ask our users to specify the stype of each column, but this is sometimes tedious. We may come up with simple rules to classify each column into an existing stype, e.g., based on the pandas dtype and column cardinality (see the sketch below).
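
A hypothetical heuristic for inferring an stype from a pandas column; the thresholds and rules are illustrative only, not a library feature.

import pandas as pd

from torch_frame import stype


def infer_stype(ser: pd.Series, max_categories: int = 20) -> stype:
    # Illustrative rules: timestamps, then high-cardinality numerics,
    # then list-like multicategorical values, else categorical.
    if pd.api.types.is_datetime64_any_dtype(ser):
        return stype.timestamp
    if pd.api.types.is_numeric_dtype(ser) and ser.nunique() > max_categories:
        return stype.numerical
    if ser.apply(lambda v: isinstance(v, (list, set, tuple))).all():
        return stype.multicategorical
    return stype.categorical


# Example: infer the whole mapping for a DataFrame `df`.
# col_to_stype = {col: infer_stype(df[col]) for col in df.columns}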

More stype support

We aim to expand the scope of PyTorch Frame beyond basic numerical and categorical columns.

stype.text_embedded @zechengz @weihua916

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.text_tokenized @zechengz

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.timestamp @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.sequence_numerical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.sequence_categorical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.multi_categorical @yiweny

  • TensorMapper
  • StypeEncoder
  • e2e example

stype.embeddings @akihironitta

  • MultiEmbeddingTensor
  • TensorMapper
  • StypeEncoder
  • e2e example

Add overwrite logic for dataset materialization

Currently, dataset.materialize() supports a path argument from which it loads a cached object.
However, if any column of the dataset changes, or if we use a different transformer model, the default behavior is still to read from the cached object rather than re-materialize.

I think it's necessary to add logic to overwrite the cached object if the dataset changes. Currently, there is no option to do so; you have to delete the cached object manually (see the workaround sketch below).
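
A hypothetical workaround until an overwrite flag exists: delete the cached file so materialize() rebuilds the TensorFrame from the changed DataFrame. The cache path below is made up, and dataset refers to the Quick Tour object above.

import os
import os.path as osp

cache_path = '/tmp/adult/materialized.pt'  # wherever your `path=` points
if osp.exists(cache_path):
    os.remove(cache_path)  # force re-materialization
dataset.materialize(path=cache_path)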

Initial Model Implementation & Reproduction

We aim to include four popular models in the initial release. This issue keeps track of our progress.

FT-Transformer @kaidic @weihua916

  • Initial implementation #12
  • Initial implementation of LinearBucketEncoder and LinearPeriodicEncoder. #30 #31
  • Include at least 2 datasets from the paper #37
  • Reproducibility of FT-Transformer
  • Reproducibility of ResNet.
  • Reproducibility of LinearBucketEncoder and LinearPeriodicEncoder.

Trompt @weihua916

  • Initial implementation #25
  • Include at least 2 datasets from the paper #33
  • Reproducibility #39

TabNet @zechengz @weihua916

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

ExcelFormer @yiweny

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

TabTransformer @yiweny

  • Initial implementation
  • Include at least 2 datasets from the paper
  • Reproducibility

Example encoding + self-attention + sum pooling @weihua916

  • example code. #54

dataset download errors

Hi,
I came across a dataset download error when I used the script provided in this README file:

from torch_frame.datasets import Yandex
from torch_frame.data import DataLoader

dataset = Yandex(root='/Users/huyu/Github/test/pytorch-frame/adult', name='adult')
dataset.materialize()
train_dataset = dataset[:0.8]
train_loader = DataLoader(train_dataset.tensor_frame, batch_size=128,
                          shuffle=True)

then got the error:

gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
Cell In[7], line 4
      1 from torch_frame.datasets import Yandex
      2 from torch_frame.data import DataLoader
----> 4 dataset = Yandex(root='/Users/huyu/Github/test/pytorch-frame/adult', name='adult')
      5 dataset.materialize()
      6 train_dataset = dataset[:0.8]

File ~/Github/test/pytorch-frame/torch_frame/datasets/yandex.py:215, in Yandex.__init__(self, root, name)
    213 self.root = root
    214 self.name = name
--> 215 path = self.download_url(osp.join(self.base_url, self.name + '.zip'),
    216                          root)
    217 df, col_to_stype = get_df_and_col_to_stype(path)
    218 if name in self.regression_datasets:

File ~/Github/test/pytorch-frame/torch_frame/data/dataset.py:472, in Dataset.download_url(url, root, filename, log)
    453 @staticmethod
    454 def download_url(
    455     url: str,
   (...)
    459     log: bool = True,
    460 ) -> str:
    461     r"""Downloads the content of :obj:`url` to the specified folder
    462     :obj:`root`.
   (...)
    470             the console. (default: :obj:`True`)
    471     """
--> 472     return torch_frame.data.download_url(url, root, filename, log=log)

File ~/Github/test/pytorch-frame/torch_frame/data/download.py:44, in download_url(url, root, filename, log)
     41 os.makedirs(root, exist_ok=True)
     43 context = ssl._create_unverified_context()
---> 44 data = urllib.request.urlopen(url, context=context)
     46 with open(path, 'wb') as f:
     47     while True:

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    214 else:
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout)
    516     req = meth(req)
    518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 519 response = self._open(req, data)
    521 # post-process response
    522 meth_name = protocol+"_response"

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:536, in OpenerDirector._open(self, req, data)
    533     return result
    535 protocol = req.type
--> 536 result = self._call_chain(self.handle_open, protocol, protocol +
    537                           '_open', req)
    538 if result:
    539     return result

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    494 for handler in handlers:
    495     func = getattr(handler, meth_name)
--> 496     result = func(*args)
    497     if result is not None:
    498         return result

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:1391, in HTTPSHandler.https_open(self, req)
   1390 def https_open(self, req):
-> 1391     return self.do_open(http.client.HTTPSConnection, req,
   1392         context=self._context, check_hostname=self._check_hostname)

File ~/anaconda3/envs/pytorch/lib/python3.10/urllib/request.py:1351, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1348         h.request(req.get_method(), req.selector, req.data, headers,
   1349                   encode_chunked=req.has_header('Transfer-encoding'))
   1350     except OSError as err: # timeout error
-> 1351         raise URLError(err)
   1352     r = h.getresponse()
   1353 except:
URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

how could I resolve this problem?

Thanks,
Yu

Try it with custom dataset

Hi all,

All the examples work fine for me, but I'm working on a custom dataset and couldn't manage to implement it using the torch_frame dataset scripts. I get an error with the DataLoader when using my dataset. I tried to write my own custom data loader, but even the model relies on outputs that come from the dataset or data loader, such as dataset.col_stats.

How did you handle it and use this nice framework for your custom datasets? Any example or tutorial for custom datasets?

Thanks,
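
A minimal sketch of wrapping a custom pandas DataFrame, following the Dataset/col_to_stype interface from the documentation; the column names and data here are made up.

import pandas as pd

from torch_frame import stype
from torch_frame.data import DataLoader, Dataset

# Hypothetical custom data; replace with your own DataFrame.
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'city': ['Paris', 'Berlin', 'Paris', 'Rome'],
    'label': [0, 1, 0, 1],
})

dataset = Dataset(
    df,
    col_to_stype={
        'age': stype.numerical,
        'city': stype.categorical,
        'label': stype.categorical,
    },
    target_col='label',
)
dataset.materialize()  # computes col_stats and builds the TensorFrame

loader = DataLoader(dataset.tensor_frame, batch_size=2, shuffle=True)
# dataset.col_stats and dataset.tensor_frame.col_names_dict can then be
# passed to a model, as in the Quick Tour above.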

Select rows in `TensorFrame` with a bool `Tensor` mask

I was wondering what the intended behavior of a TensorFrame is when passing a row-wise Boolean mask to __getitem__?
Based on pandas DataFrames and torch tensors, a user may expect Boolean masks to work for a TensorFrame object.

Currently, the result depends on the specific stypes occurring in the frame. If no _MultiTensor columns are present, the masking seems to work as expected.
In the _MultiTensor classes, row selection only works correctly for integer tensors but does not raise an error when given a Boolean mask. Instead, it yields an incorrect output, which may produce an invalid TensorFrame. Here is an example:

from torch_frame.datasets import FakeDataset
from torch_frame import stype
import torch

mask = torch.tensor([True, False, True])

# Frame without _MultiTensor Columns
stypes = [stype.categorical, stype.numerical]
tf_1 = FakeDataset(num_rows=3, stypes=stypes).materialize().tensor_frame

print(f'Trying Boolean mask for stypes {stypes}')
tf_1[mask].validate()

# Frame with _MultiTensor Column
stypes = [stype.categorical, stype.multicategorical]
tf_2 = FakeDataset(num_rows=3, stypes=stypes).materialize().tensor_frame

print(f'Trying Boolean mask for stypes {stypes}')
tf_2[mask].validate()

We get the following output:

Trying Boolean mask for stypes [<stype.categorical: 'categorical'>, <stype.numerical: 'numerical'>]
Trying Boolean mask for stypes [<stype.categorical: 'categorical'>, <stype.multicategorical: 'multicategorical'>]
Traceback (most recent call last):
  File "/home/jan/git/torch/multinestedtest.py", line 21, in <module>
    tf_2[mask].validate()
  File "/home/jan/miniconda3/envs/torch/lib/python3.10/site-packages/torch_frame/data/tensor_frame.py", line 120, in validate
    raise ValueError(
ValueError: The length of elements in feat_dict are not aligned, got 3 but expected 2.

Should Boolean masks be supported? I think this would be convenient. If not, then some error should be thrown early when this is passed as input.
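
In the meantime, a possible workaround (a sketch, not a library fix) is to convert the Boolean mask to integer indices before indexing, continuing the example above:

idx = mask.nonzero(as_tuple=True)[0]  # tensor([0, 2])
tf_2[idx].validate()  # integer indexing also works for _MultiTensor columns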

About Excelformer

Hello, thank you for considering ExcelFormer. Just a friendly reminder: ExcelFormer uses a special initialization method along with two data augmentation techniques, which might be worth incorporating into the example. Alternatively, you could add a reminder for readers at the beginning of the example file. Thanks!

sklearn-compatible interface

I think it would be great to have this feature, as sklearn is often used for tabular data. I tried to use skorch, but skorch does not accept TensorFrames and did not work well.

(examples/tutorial.py)

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(module=model, max_epochs=args.epochs, lr=args.lr, 
                            device=device, batch_size=args.batch_size, 
                            classes=dataset.num_classes, iterator_train=DataLoader,
                            iterator_valid=DataLoader, train_split=None)
net.fit(train_dataset, y=None)
Traceback (most recent call last):
  File "\examples\tutorial.py", line 346, in <module>
    net.fit(train_dataset, y=None)
  File "\site-packages\skorch\classifier.py", line 165, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1319, in fit
    self.partial_fit(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1278, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1190, in fit_loop
    self.run_single_epoch(iterator_train, training=True, prefix="train",
  File "\site-packages\skorch\net.py", line 1226, in run_single_epoch
    step = step_fn(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1105, in train_step
    self._step_optimizer(step_fn)
  File "\site-packages\skorch\net.py", line 1060, in _step_optimizer
    optimizer.step(step_fn)
  File "\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\sgd.py", line 66, in step
    loss = closure()
           ^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1094, in step_fn
    step = self.train_step_single(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 993, in train_step_single
    y_pred = self.infer(Xi, **fit_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1517, in infer
    x = to_tensor(x, device=self.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in to_tensor
    return [to_tensor_(x) for x in X]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in <listcomp>
    return [to_tensor_(x) for x in X]
            ^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 118, in to_tensor
    raise TypeError("Cannot convert this data type to a torch tensor.")
TypeError: Cannot convert this data type to a torch tensor.

I think the following changes are needed:

  • Add an ability to convert from DataFrame to TensorFrame without much prior information.
  • Create a wrapper that passes a Tensor to skorch, or create a scikit-learn compatible estimator specifically for this package (a rough sketch of such a wrapper follows below).

I am sorry, but I cannot take much time to assist in creating this feature, so if it is not possible, please close this.
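
A rough sketch of the kind of wrapper the second bullet suggests; FrameClassifier is hypothetical, not part of PyTorch Frame or skorch, and only follows the basic fit/predict conventions of scikit-learn.

import torch
import torch.nn.functional as F

from torch_frame.data import DataLoader


class FrameClassifier:
    """Minimal sklearn-style wrapper around a PyTorch Frame model (sketch)."""
    def __init__(self, model, lr=1e-3, epochs=10, batch_size=128, device='cpu'):
        self.model = model.to(device)
        self.lr, self.epochs = lr, epochs
        self.batch_size, self.device = batch_size, device

    def fit(self, tensor_frame, y=None):  # labels live inside the TensorFrame
        loader = DataLoader(tensor_frame, batch_size=self.batch_size, shuffle=True)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr)
        self.model.train()
        for _ in range(self.epochs):
            for tf in loader:
                tf = tf.to(self.device)
                loss = F.cross_entropy(self.model(tf), tf.y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return self

    def predict(self, tensor_frame):
        loader = DataLoader(tensor_frame, batch_size=self.batch_size)
        self.model.eval()
        preds = []
        with torch.no_grad():
            for tf in loader:
                preds.append(self.model(tf.to(self.device)).argmax(dim=-1).cpu())
        return torch.cat(preds).numpy()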

Fix the image of the architecture overview in our documentation

If I understand it correctly, the image below doesn't reflect our current structure precisely, because the output of FeatureEncoder should be a concatenated tensor, whereas the image illustrates the concatenation being applied outside the FeatureEncoder.

Improve the current split and support more split methods

Right now the split method for the Dataset object is very basic. We should support more flexible splits:

  • Make the current split method less error-prone through refactoring and adding more validation checks
  • Support train/val split in addition to train/val/test split
  • Support a random split function for Dataset for a better user experience
  • Support temporal split, i.e., split based on a time column in the Dataset

Implement Mixup

Mixup is used by ExcelFormer.
My current implementation of ExcelFormer doesn't have mixup, and it performs ~10% worse than the reported numbers on the helena and adult datasets.
The mixup is applied after the embedding layer and on the labels.

I am thinking of implementing it as a transform.

So the BaseTransform will inherit from Module as well:

class BaseTransform(ABC, Module):

And then it looks like

dataset = Yandex()
dataset.materialize()

transform = HiddenMix()
tensor_frame = dataset.tensor_frame.to(device)
transformed_tensor_frame = transform(tensor_frame)

model = ExcelFormer(mixup=transform)

In the ExcelFormerEncoder, we will use the transform as a post module.

Question:

  1. Does it make sense to use mixup without labels (y)?
  2. Currently, the forward function of BaseTransform maps a TensorFrame to a TensorFrame. But if we need to use it as a post module, the forward function needs to take a Tensor and output a Tensor. Perhaps we can change the types of BaseTransform to Union[Tensor, TensorFrame], and then have another class PostEmbeddingTransform that inherits from BaseTransform. Example Implementation
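
For reference, a generic sketch of mixup applied after the embedding layer (a simplification, not the library's HiddenMix transform or ExcelFormer's exact scheme):

import torch

# Mixes rows of a [batch_size, num_cols, channels] embedding and pairs the
# labels accordingly; the caller combines the two label terms in the loss as
# lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b).
def hidden_mixup(x: torch.Tensor, y: torch.Tensor, beta: float = 0.5):
    lam = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam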

Split dataset by `timestamp`

In PyTorch Frame, it may be a good choice to add a split_by_col option to split the data. Otherwise, I need to create the split IDs myself, for example (a usage sketch follows the snippet):

def _add_split(dt: int) -> int:
    if dt <= 20231220:
        return 0
    elif dt <= 20231231:
        return 1
    else:
        return 2
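
A hypothetical usage sketch continuing the snippet above: derive the split from the timestamp column and hand it to Dataset via a split column. The split_col argument, the 0/1/2 = train/val/test convention, dataset.split(), and the column names are assumptions and may differ from the released API.

import pandas as pd

from torch_frame import stype
from torch_frame.data import Dataset

df = pd.DataFrame({'dt': [20231215, 20231225, 20240101],
                   'x': [1.0, 2.0, 3.0],
                   'y': [0, 1, 0]})
df['split'] = df['dt'].apply(_add_split)

dataset = Dataset(
    df,
    col_to_stype={'x': stype.numerical, 'y': stype.categorical},
    target_col='y',
    split_col='split',
)
dataset.materialize()
# split() is assumed to return the subsets defined by split_col.
train_dataset, val_dataset, test_dataset = dataset.split()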

Add extensively tuned XGBoost and CatBoost baselines

After discussing with @weihua916 in the office today, we decided to include extensively tuned XGBoost and CatBoost baselines, so that users of pytorch-frame can easily compare the performance of deep learning approaches against GBDT approaches.

The idea is inspired by ExcelFormer, as its authors compared the performance of ExcelFormer with extensively tuned XGBoost and CatBoost.

The intended interface looks like the following:

xgboost = torch_frame.XGBoost()
xgboost.fit_tune(tf_train, tf_val, num_trials, task_type, task_kwargs, search_kwargs)
y_pred_test = xgboost.predict(tf_test)
test_acc = accuracy(y_pred_test, tf_test.y)

The fit_tune function will do a hyperparameter search by default. If num_trials is 1, then we don't do any hyperparameter search and just use the default hyperparameter values.
task_kwargs holds any keyword arguments for the task (e.g., number of classes for classification, MAE versus RMSE loss for regression).
search_kwargs holds any keyword arguments to manually specify the hyperparameter search space.

Here is the default hyperparameter tuning space in ExcelFormer, which we propose as the default search space for our XGBoost and CatBoost baselines:
[Screenshot: ExcelFormer's default hyperparameter search space]

Allow different `text_embedder_cfg` for each column

Currently, the Dataset class expects a single text_embedder_cfg object that is used globally to embed all text embedding columns.

This may not be desirable in cases where different pre-trained models should be applied to different columns.
An important example would be multi-lingual data where different text columns have different languages, say English and Chinese.

I think it would be great if one could also optionally pass a dictionary with column-wise configs, similar to the col_to_sep argument.
I would gladly add this feature if it is considered helpful.

Thanks for this project by the way, it will be very useful for many datasets.

Integration with TorchData

Hi all,

In my project, I use TorchData to read Parquet files from AWS S3 buckets. Currently, it seems that pytorch-frame cannot be integrated with TorchData. I was wondering if you have any plans to make this possible, or if you have any workaround to read Parquet files from S3 buckets using a torch_frame dataset?

Thanks,

Support `LightGBM`

I've tried LightGBM on my own dataset, and it works well, so I want to integrate it into pytorch_frame.

Remove implicit clone in stype encoders

Currently, we clone the TensorData in StypeEncoders to avoid in-place modifications. But this is bad because it is wasteful. We want to remove that clone while still avoiding in-place modification.

Problem to reproduce similar results when saving the model

I am facing a problem where, when I save the model (either the whole model or just the state dict) and load it again, the results are totally different. Is there any detail of the ExcelFormer algorithm I should be aware of in order to load the state dict into ExcelFormer correctly?

Support inductive `TensorFrame` transformation

Currently, we create a TensorFrame via dataset.materialize(), but this only handles the transductive transformation of the DataFrame. It would be useful to support inductive transformation, i.e., transforming a new DataFrame into a TensorFrame with the same schema as the existing transformed TensorFrame.

Add default NAStrategy to models

Currently, none of the models have a default NAStrategy. If a user runs any model on data containing NaNs, all the gradients can explode. (A simple pre-materialization imputation workaround is sketched below.)
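
Until a default NAStrategy exists, a simple workaround (a sketch with made-up column names) is to impute missing values in the raw DataFrame before creating the Dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [1.0, np.nan, 3.0],
                   'category': ['a', None, 'b']})

# Column mean for numerical columns, mode for categorical ones.
df['amount'] = df['amount'].fillna(df['amount'].mean())
df['category'] = df['category'].fillna(df['category'].mode().iloc[0])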

Bug in the stype encoder

In the materialize stage, we calculate the statistics of all columns, including the target.
However, in these lines, we assume that StatType.MEAN exists for all columns in col_stats.
This causes an error when the target is binary or multi-class if we directly pass dataset.col_stats.

Can it be used for relational datasets?

Is it possible for the library to be used for relational datasets? If we load the database data into multiple DataFrames, e.g. a Customer table, an Orders table, and a Products table, can we use the library to generate data that are both synthetic and related to each other? A customer could have 1-N orders, while its orders could also have 1-N products, depending on the distributions found in the database. This is the idea. So far, I see that it is only for single tables.

Problem in MutualInformationSort

I have a dataset that was working before, but after some changes to it, MutualInformationSort raises this error, specifically when checking whether there are any NaN values. What could this be related to? I really can't understand what is happening.

[Screenshot of the error]

Fix `examples/transformers_text.py`

python transformers_text.py --epochs 1
Epoch: 1:   0%|                                                                                                                                                                                                                  | 0/148 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/transformers_text.py", line 341, in <module>
    train_loss = train(epoch)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/transformers_text.py", line 294, in train
    pred = model(tf)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/torch_frame/nn/models/ft_transformer.py", line 105, in forward
    x, _ = self.encoder(tf)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/torch_frame/nn/encoder/stypewise_encoder.py", line 84, in forward
    x = self.encoder_dict[stype.value](feat, col_names)
  File "/home/akihiro/.conda/envs/pyf310/lib/python3.10/site-packages/torch/nn/modules/container.py", line 459, in __getitem__
    return self._modules[key]
KeyError: 'text_embedded'

Fix `examples/mercari.py`

python examples/mercari.py  --epochs 1
Epoch: 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:01<00:00,  8.09it/s]
Train Loss: 2503.9182, Train RMSE: 49.0190, Val RMSE: 33.1917
Best Val RMSE: 33.1917,
Traceback (most recent call last):
  File "/home/akihiro/work/github.com/pyg-team/pytorch-frame/examples/mercari.py", line 196, in <module>
    pred = np.concatenate(all_preds).flatten()
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate
