aai-institute / mlrc22-like-shapley-love-the-core

Code for the submission to the ML Reproducibility Challenge 2022, reproducing "If you like Shapley then you'll love the core"

License: GNU Lesser General Public License v3.0

Python 100.00%
aaai2021 data-valuation machine-learning paper-reproduction transferlab least-core mlrc22 papers-with-code appliedai

MLRC 2022: If you like Shapley, then you'll love the core

This repository contains code to reproduce the paper If You Like Shapley Then You’ll Love the Core for the ML Reproducibility Challenge 2022.

Getting Started

We use Python version 3.10 for this repository.

We use Poetry, specifically version 1.2.0, for dependency management.

After installing Poetry, run the following command to create a virtual environment and install all dependencies:

poetry install

You can then activate the virtual environment using:

poetry shell

Experiments

We use DVC to run the experiments and track their results.

To reproduce all results use:

dvc repro

Feature Valuation

Least Core

To reproduce the results of this experiment use:

dvc repro feature-valuation-least-core

You can find the results under output/feature_valuation_least_core.

Data Valuation

Synthetic Data

To reproduce the results of this experiment use:

dvc repro data-valuation-synthetic

You can find the results under output/data_valuation_synthetic.

Dog vs Fish Dataset

Note: This experiment requires downloading the imagenet-1k dataset from HuggingFace Datasets. To do so, first create an account and then log in using the huggingface-cli tool.

To reproduce the results of this experiment use:

dvc repro data-valuation-dog-vs-fish

You can find the results under output/data_valuation_dog_vs_fish.

Fixing Mislabeled Data

To reproduce the results of this experiment use:

dvc repro fixing-mislabeled-data

You can find the results under output/fixing_mislabeled_data.

Noisy Data

To reproduce the results of this experiment use:

dvc repro noisy-data

You can find the results under output/noisy_data.

Contributing

Make sure to install the pre-commit hooks:

pre-commit install

Contributors

anesbenmerzoug, mdbenito

mlrc22-like-shapley-love-the-core's Issues

Noisy Data Experiment

As one more sanity check, the authors conduct an experiment studying the percentage of utility
allocated by the least core to noisy data.

They divide the dataset into two: a clean portion and a noised portion.
They increase the Gaussian noise added to the noised portion and compute the percentage of utility
allocated by the least core to the clean data.

The hypothesis is that with higher noise, the noised data will become less
“valuable” and are thus allocated a lower percentage of the overall utility by the least core.
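
A minimal sketch of the two building blocks described above, under stated assumptions: `add_gaussian_noise` and `utility_share` are hypothetical helper names (not taken from the repository), features are plain lists of floats, and `values` stands for least core values that some valuation method has already computed.

```python
import random

def add_gaussian_noise(rows, scale, rng):
    """Return a copy of `rows` with zero-mean Gaussian noise (std `scale`) added."""
    return [[x + rng.gauss(0.0, scale) for x in row] for row in rows]

def utility_share(values, indices):
    """Percentage of the total allocated utility that goes to the points in `indices`."""
    total = sum(values)
    if total == 0:
        raise ValueError("total allocated utility is zero")
    return 100.0 * sum(values[i] for i in indices) / total

rng = random.Random(0)
clean = [[0.0, 1.0], [1.0, 0.0]]
noised = add_gaussian_noise(clean, scale=0.5, rng=rng)

# Hypothetical, hand-picked values: points 2 and 3 are the noised ones.
values = [0.5, 0.25, 0.125, 0.125]
print(utility_share(values, indices=[2, 3]))  # 25.0
```

Repeating the share computation for increasing `scale` would give the curve the experiment is meant to produce.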

Separate plotting from experiments

To make it easier to generate new plots without rerunning the experiments, we should move the plotting into a separate DVC stage that depends on the experiment code.

Data Removal experiments

Compute the valuation, then gradually remove the best (respectively worst) data points and see how the utility (e.g. accuracy) decreases (respectively increases).

Use two datasets:

  • A synthetic Gaussian dataset
  • A dog-vs-fish dataset derived from ImageNet 2012 (depends on #1)

Compare the following methods:

  • Truncated Monte Carlo Shapley
  • Group Testing
  • Monte Carlo Least Core
  • Leave-One-Out
  • Random removal
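
The removal procedure can be sketched as follows. This is an illustration only: `removal_curve` is a hypothetical name, and `utility` is assumed to be any callable that takes the indices of the remaining training points and returns a score (in the real experiment this would retrain a model and evaluate its accuracy).

```python
def removal_curve(values, utility, remove_best=True, steps=None):
    """Utility after removing 0, 1, 2, ... points in order of their value.

    remove_best=True removes the highest-valued points first,
    remove_best=False removes the lowest-valued points first.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=remove_best)
    steps = len(values) if steps is None else steps
    curve = []
    for k in range(steps + 1):
        remaining = sorted(order[k:])  # indices still in the training set
        curve.append(utility(remaining))
    return curve

# Toy stand-in for a real utility: each point contributes a fixed amount.
contributions = [4, 1, 3, 2]
toy_utility = lambda remaining: sum(contributions[i] for i in remaining)

print(removal_curve(contributions, toy_utility, remove_best=True))   # [10, 6, 3, 1, 0]
print(removal_curve(contributions, toy_utility, remove_best=False))  # [10, 9, 7, 4, 0]
```

With a good valuation, the first curve should drop faster than the one obtained from random removal, and the second should drop more slowly (or rise).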

Mislabeled Data experiments

Verify that the magnitude of the least core values strongly correlates with the importance of the data point.

  • Uses 1000 data points from the Enron Spam Dataset (Link to paper: https://www2.aueb.gr/users/ion/docs/ceas2006_paper.pdf)
  • 20% of the labels are randomly flipped.
  • Trains a Naive Bayes model which takes as input a bag-of-words representation of emails.
  • Uses a budget of 5000 iterations for the solution concept.
  • The utility is the performance on the validation set.
  • The final plots represent the data valuation of the data points in the test set.
  • The method is compared against random selection.
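
A rough sketch of the label-flipping setup and of one common way to score it. The assumptions here are mine, not the issue's: labels are binary (0/1), `flip_labels` and `detection_curve` are hypothetical helper names, and points are inspected from lowest to highest value, which is one typical way to evaluate mislabel detection.

```python
import random

def flip_labels(labels, fraction, rng):
    """Flip a random `fraction` of binary labels; return new labels and flipped indices."""
    n_flip = int(round(fraction * len(labels)))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = [1 - y if i in flipped else y for i, y in enumerate(labels)]
    return noisy, flipped

def detection_curve(values, flipped):
    """Fraction of flipped points found after inspecting the k lowest-valued points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    found, curve = 0, []
    for i in order:
        found += i in flipped
        curve.append(found / len(flipped))
    return curve

rng = random.Random(0)
labels = [0, 1] * 5
noisy_labels, flipped = flip_labels(labels, fraction=0.2, rng=rng)

# Hypothetical values in which the flipped points (1 and 3) score lowest.
print(detection_curve([0.5, -1.0, 0.2, -0.3], flipped={1, 3}))  # [0.5, 1.0, 1.0, 1.0]
```

The random-selection baseline corresponds to shuffling the inspection order instead of sorting by value.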

Feature Valuation experiment

The goal of these experiments is to empirically verify Theorem 1 from Section 3 (which deals with the probable least core).

Train a Logistic Regression classifier on all possible subsets of features to compute the exact Least Core values.

The utility is the accuracy on a test set.
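
For reference, the least core is usually defined as the solution of the following linear program. The notation is an assumption on my part: $N$ is the set of features, $v(S)$ the utility of coalition $S$, $x_i$ the value assigned to feature $i$, and $e$ the deficit, whose optimal value is written $e^{*}$.

$$
\begin{aligned}
\min_{x,\, e} \quad & e \\
\text{s.t.} \quad & \sum_{i \in N} x_i = v(N), \\
& \sum_{i \in S} x_i + e \ge v(S) \quad \text{for all } S \subseteq N.
\end{aligned}
$$

The exact solution needs one constraint per coalition, which is why every subset of features has to be evaluated.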

Uses 3 UCI datasets:

The experiment goes as follows:

  • Sample a small fraction of coalitions uniformly at random from all possible coalitions.
  • Compute the least core
  • Determine what fraction of all coalitions satisfy the least core constraints with respect to the true deficit $e^{*}$ from the exact solution.
  • This gives us accuracy $1 - \delta$, which, in turn, leads to a $\delta$-probable least core.

The experiments count as reproduced if the accuracy is close to one when only a small fraction of the coalitions (< 20%) is sampled.
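
The sampling and checking steps above can be sketched as follows. This is a simplified illustration rather than the repository's code: `sample_coalitions`, `all_coalitions` and `satisfied_fraction` are hypothetical names, `utility` is any callable mapping a tuple of feature indices to a score, and the step that solves the least core on the sampled coalitions is left out.

```python
import random
from itertools import combinations

def sample_coalitions(n, m, rng):
    """Draw m coalitions uniformly at random (with replacement) from all 2**n subsets."""
    return [tuple(i for i in range(n) if rng.random() < 0.5) for _ in range(m)]

def all_coalitions(n):
    """Yield every subset of {0, ..., n-1} as a tuple of indices."""
    for size in range(n + 1):
        yield from combinations(range(n), size)

def satisfied_fraction(x, e_star, utility, tol=1e-9):
    """Fraction of all coalitions S with sum_{i in S} x_i + e_star >= utility(S)."""
    satisfied = total = 0
    for coalition in all_coalitions(len(x)):
        total += 1
        if sum(x[i] for i in coalition) + e_star >= utility(coalition) - tol:
            satisfied += 1
    return satisfied / total

# Toy game with 3 features: the utility of a coalition is its size squared.
toy_utility = lambda coalition: len(coalition) ** 2
print(satisfied_fraction([3, 3, 3], e_star=0, utility=toy_utility))  # 1.0
print(satisfied_fraction([0, 0, 0], e_star=0, utility=toy_utility))  # 0.125
```

The returned fraction plays the role of the accuracy $1 - \delta$: `x` would come from the least core computed on the sampled coalitions and `e_star` from the exact solution.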

Local ray cluster messes with random seed

Unfortunately, starting a Ray cluster directly is known to interfere with
the random seed set in the main process (see this issue for more information).

Therefore, we should include a docker-compose file in the repository
to start the ray cluster separately.
