archettialberto / federated_survival_datasets

Build realistic heterogeneous datasets for federated survival analysis in a reproducible way.

License: MIT License


Heterogeneous Datasets for Federated Survival Analysis Simulation

This repository contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:

  • uniform_split: assigns each sample to a random client with uniform probability;
  • quantity_skewed_split: assigns each sample to a random client, with the client probabilities drawn from a Dirichlet distribution [3, 4];
  • label_skewed_split: assigns each sample to a time bin, then distributes the samples of each bin among the clients according to proportions drawn from a Dirichlet distribution [3, 4].
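
The three strategies listed above can be sketched in plain numpy. This is a minimal illustration and not the package's implementation: the function names `sketch_uniform_split`, `sketch_quantity_skewed_split`, and `sketch_label_skewed_split`, the `alpha` and `num_bins` parameters, the use of quantile-based time bins, and the choice to sample a client independently for each sample are all assumptions made for this example. Each function returns one client index per sample.

```python
import numpy as np


def sketch_uniform_split(num_clients, n_samples, rng):
    """Assign every sample to a client drawn uniformly at random."""
    return rng.integers(0, num_clients, size=n_samples)


def sketch_quantity_skewed_split(num_clients, n_samples, alpha, rng):
    """Draw client proportions from Dir(alpha), then pick a client per sample."""
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    return rng.choice(num_clients, size=n_samples, p=proportions)


def sketch_label_skewed_split(num_clients, times, num_bins, alpha, rng):
    """Bin samples by survival time, then draw separate proportions per bin."""
    # Assumption: quantile edges, so bins hold roughly equal sample counts.
    inner_quantiles = np.linspace(0, 1, num_bins + 1)[1:-1]
    edges = np.quantile(times, inner_quantiles)
    bins = np.digitize(times, edges)  # bin index in [0, num_bins - 1]

    assignment = np.empty(len(times), dtype=int)
    for b in range(num_bins):
        idx = np.flatnonzero(bins == b)
        # Each time bin gets its own Dirichlet draw over the clients.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        assignment[idx] = rng.choice(num_clients, size=len(idx), p=proportions)
    return assignment


rng = np.random.default_rng(0)
num_clients = 8
times = rng.exponential(scale=100.0, size=1000)  # synthetic survival times

uniform = sketch_uniform_split(num_clients, len(times), rng)
quantity = sketch_quantity_skewed_split(num_clients, len(times), 0.5, rng)
label = sketch_label_skewed_split(num_clients, times, 10, 0.5, rng)

# Samples per client under each strategy
for name, assignment in [("uniform", uniform), ("quantity", quantity), ("label", label)]:
    print(name, np.bincount(assignment, minlength=num_clients))
```

In this sketch, a smaller `alpha` concentrates the Dirichlet draw on fewer clients and therefore yields a more heterogeneous federation, while a large `alpha` approaches the uniform split.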

For more information, please take a look at our paper [1].

⚙️ Installation

Federated Survival Datasets is built on top of numpy and scikit-learn. To install these dependencies, run:

pip install -r requirements.txt

To import survival datasets in your project, we strongly recommend SurvSet [2], a comprehensive collection of more than 70 survival datasets.

🛠️ Usage

import numpy as np
import pandas as pd

from federated_survival_datasets import label_skewed_split

# import a survival dataset and extract the input array X and the output array y
df = pd.read_csv("metabric.csv")
X = df[[f"x{i}" for i in range(9)]].to_numpy()
y = np.array([(e, t) for e, t in zip(df["event"], df["time"])], dtype=[("event", bool), ("time", float)])

# run the splitting algorithm
client_data = label_skewed_split(num_clients=8, X=X, y=y)

# check the number of samples assigned to each client
for i, (X_c, y_c) in enumerate(client_data):
    print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")
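
Beyond array shapes, it can help to compare the event rate and the typical survival time across clients to see how heterogeneous a split is. The following is a rough sketch: `summarize_clients` is a hypothetical helper, not part of the package, and it assumes that each client's `y_c` keeps the structured dtype with `event` and `time` fields used above. The synthetic `client_data` only stands in for the output of a splitting function.

```python
import numpy as np


def summarize_clients(client_data):
    """Return per-client sample count, event rate, and median time."""
    rows = []
    for i, (X_c, y_c) in enumerate(client_data):
        n = len(y_c)
        rows.append({
            "client": i,
            "n_samples": n,
            # Guard against empty clients, which skewed splits may produce.
            "event_rate": float(y_c["event"].mean()) if n else float("nan"),
            "median_time": float(np.median(y_c["time"])) if n else float("nan"),
        })
    return rows


# Synthetic stand-in for the (X_c, y_c) pairs returned by a splitting function
rng = np.random.default_rng(0)
dtype = [("event", bool), ("time", float)]
client_data = []
for n in (50, 200, 10):
    y_c = np.empty(n, dtype=dtype)
    y_c["event"] = rng.random(n) < 0.6
    y_c["time"] = rng.exponential(100.0, size=n)
    client_data.append((rng.normal(size=(n, 9)), y_c))

for row in summarize_clients(client_data):
    print(row)
```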

We provide an example notebook to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.

📕 Bibtex Citation

@article{archetti2023heterogeneous,
  title={Heterogeneous Datasets for Federated Survival Analysis Simulation},
  author={Archetti, Alberto and Lomurno, Eugenio and Lattari, Francesco and Martin, Andr{\'e} and Matteucci, Matteo},
  journal={arXiv preprint arXiv:2301.12166},
  year={2023}
}

📚 References

[1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.

[2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.

[3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.

[4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE.
