aai-institute / mlrc22-like-shapley-love-the-core

Code for the submission to the ML Reproducibility Challenge 2022, reproducing "If you like Shapley then you'll love the core"

License: GNU Lesser General Public License v3.0

Python 100.00%

aaai2021 data-valuation machine-learning paper-reproduction transferlab least-core mlrc22 papers-with-code appliedai

mlrc22-like-shapley-love-the-core's Introduction

MLRC 2022: If you like Shapley, then you'll love the core

This repository contains code to reproduce the paper If You Like Shapley Then You’ll Love the Core for the ML Reproducibility Challenge 2022.

Getting Started

We use Python version 3.10 for this repository.

We use Poetry, specifically version 1.2.0, for dependency management.

After installing Poetry, run the following command to create a virtual environment and install all dependencies:

poetry install

You can then activate the virtual environment using:

poetry shell

Experiments

We use DVC to run the experiments and track their results.

To reproduce all results use:

dvc repro

Feature Valuation

Least Core

To reproduce the results of this experiment use:

dvc repro feature-valuation-least-core

You can find the results under output/feature_valuation_least_core.

Data Valuation

Synthetic Data

To reproduce the results of this experiment use:

dvc repro data-valuation-synthetic

You can find the results under output/data_valuation_synthetic.

Dog vs Fish Dataset

Note: This experiment requires downloading the imagenet-1k dataset from HuggingFace Datasets. To do so, first create a HuggingFace account and then log in using the huggingface-cli tool.

To reproduce the results of this experiment use:

dvc repro data-valuation-dog-vs-fish

You can find the results under output/data_valuation_dog_vs_fish.

Fixing Mislabeled Data

To reproduce the results of this experiment use:

dvc repro fixing-mislabeled-data

You can find the results under output/fixing_mislabeled_data.

Noisy Data

To reproduce the results of this experiment use:

dvc repro noisy-data

You can find the results under output/noisy_data.

Contributing

Make sure to install the pre-commit hooks:

pre-commit install

mlrc22-like-shapley-love-the-core's Issues

Noisy Data Experiment

As one more sanity check, the authors conduct an experiment studying the percentage of utility
allocated by the least core to noisy data.

They divide the dataset into two parts: a clean portion and a noised portion.
They increase the Gaussian noise added to the noised portion and compute the percentage of utility
allocated by the least core to the clean data.

The hypothesis is that with higher noise, the noised data will become less
“valuable” and is thus allocated a lower percentage of the overall utility by the least core.
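
The two quantities involved can be sketched as follows. This is a minimal illustration, not the code used in the repository: the helper names `add_gaussian_noise` and `clean_utility_share`, the toy feature rows, and the `values` list (standing in for least core values produced by some valuation method) are all assumptions made for the example.

```python
import random


def add_gaussian_noise(rows, noise_std, rng):
    """Return a copy of `rows` with zero-mean Gaussian noise added to every feature."""
    return [[x + rng.gauss(0.0, noise_std) for x in row] for row in rows]


def clean_utility_share(values, clean_indices):
    """Percentage of the total allocated value that goes to the clean points."""
    total = sum(values)
    if total == 0:
        raise ValueError("total allocated value is zero; the share is undefined")
    return 100.0 * sum(values[i] for i in clean_indices) / total


rng = random.Random(0)  # fixed seed so the noise is reproducible

# Toy dataset (assumption): four points with two features each.
features = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
clean_indices = [0, 1]
noised_indices = [2, 3]

# Only the noised portion is perturbed; the clean portion stays untouched.
noised_rows = add_gaussian_noise(
    [features[i] for i in noised_indices], noise_std=2.0, rng=rng
)

# Hypothetical valuation result, one value per data point.
values = [4.0, 3.0, 2.0, 1.0]
share = clean_utility_share(values, clean_indices)
print(f"{share:.1f}% of the utility is allocated to the clean data")
```

In the experiment itself, the values would be recomputed for each noise level, and the share would be plotted against the noise standard deviation.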

Make preprocessed dog-vs-fish indices available

To avoid regenerating for reproduction.

Regenerate the Dog-vs-Fish dataset

Download Imagenet 2012 dataset
Run the load_animals.py script used in the paper Understanding Black-box Predictions via Influence Functions

Add dockerfile for ray setup

Installing dependencies in workers is a nightmare otherwise

Prove Theorems

Theorem 1
Theorem 2
Theorem 3

Theorem 2 is the most important one and its proof can surprisingly be found in the Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning paper.

Include ray cluster config

Example yaml with basic GCP config

Separate plotting from experiments

In order to make it easier to generate new plots without having to rerun the experiments, we should put the plotting in a different dvc stage that depends on the experiment code.

Data Removal experiments

Compute the valuation, then gradually remove the best (respectively worst) data points and observe how the utility (e.g. accuracy) decreases (respectively increases).

Use two datasets:

A synthetic gaussian dataset
A dog-vs-fish dataset derived from Imagenet 2012 (depends on #1)

Compares the following methods:

Truncated Monte Carlo Shapley
Group Testing
Monte Carlo Least Core
Leave-One-Out
Random removal
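
The removal procedure described above can be sketched as below. This is a hedged illustration only: `removal_curve` and `toy_utility` are hypothetical names, and the additive toy utility replaces what would, in the real experiment, be retraining a model on the remaining points and measuring its accuracy.

```python
def removal_curve(values, utility, remove_best=True, step=1):
    """Utility of the remaining data after removing points in order of their value.

    The first entry is the utility of the full dataset; each later entry is the
    utility after removing `step` more points.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=remove_best)
    remaining = set(order)
    curve = [utility(frozenset(remaining))]
    for start in range(0, len(order), step):
        remaining -= set(order[start:start + step])
        curve.append(utility(frozenset(remaining)))
    return curve


# Toy additive utility (assumption): each point contributes a fixed amount.
contributions = [3, 1, 4, 2]


def toy_utility(subset):
    return sum(contributions[i] for i in subset)


# Pretend the valuation method recovered the true contributions exactly.
values = contributions

best_first = removal_curve(values, toy_utility, remove_best=True)
worst_first = removal_curve(values, toy_utility, remove_best=False)
print(best_first)   # [10, 6, 3, 1, 0]
print(worst_first)  # [10, 9, 7, 4, 0]
```

With a good valuation, removing the best points first should make the curve drop faster than random removal, while removing the worst points first should keep it high for longer.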

Implement Nucleolus

Mislabeled Data experiments

Verify that the magnitude of the least core values strongly correlates with the importance of the data point.

Uses 1000 data points from the Enron Spam Dataset (Link to paper: https://www2.aueb.gr/users/ion/docs/ceas2006_paper.pdf)
20% of the labels are randomly flipped.
Trains a Naive Bayes model which takes as input a bag-of-words representation of emails.
Uses a budget of 5000 iterations for the solution concept.
The utility is the performance on the validation set.
The final plots represent the data valuation of the data points in the test set.
The method is compared against random selection.
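
The label-flipping step of this setup could look like the sketch below. It assumes binary labels encoded as 0 and 1; the function name `flip_labels`, the seed, and the placeholder labels are illustrative and not taken from the repository.

```python
import random


def flip_labels(labels, fraction, seed):
    """Flip a random `fraction` of binary (0/1) labels.

    Returns the noisy labels together with the sorted indices that were flipped,
    so that the flipped points can later be compared against their values.
    """
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(labels)))
    flipped = sorted(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 1 - noisy[i]
    return noisy, flipped


# Placeholder labels (assumption) for 1000 data points.
labels = [i % 2 for i in range(1000)]
noisy_labels, flipped = flip_labels(labels, fraction=0.2, seed=0)
print(len(flipped))  # 200
```

Keeping the flipped indices makes it possible to check afterwards whether the valuation method assigns distinctive values to the mislabeled points.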

Feature Valuation experiment

The goal of these experiments is to empirically verify Theorem 1 from Section 3 (which deals with the probable least core).

Train a Logistic Regression classifier on all possible subsets of features to compute the exact Least Core values.

The utility is the accuracy on a test set.

Uses 3 UCI datasets:

The experiment goes as follows:

Sample a small fraction of coalitions uniformly at random from all possible coalitions.
Compute the least core
Determine what fraction of all coalitions satisfy the least core constraints with respect to the true deficit $e^{*}$ from the exact solution.
This gives us the accuracy $1 - \delta$, which, in turn, leads to a $\delta$-probable least core.

For the reproduction to be considered successful, the accuracy should be close to one when only a small fraction of the coalitions (< 20%) is sampled.
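
The accuracy check in the third step can be sketched as follows, assuming the usual least core constraint $\sum_{i \in S} x_i + e^{*} \geq v(S)$ for every coalition $S$. The function name `satisfied_fraction`, the three-player majority game, and the estimated value vector are assumptions chosen so that the result is easy to verify by hand; they are not part of the repository.

```python
from itertools import combinations


def satisfied_fraction(values, utility, deficit, tol=1e-9):
    """Fraction of all non-empty coalitions S with sum(values[S]) + deficit >= utility(S)."""
    n = len(values)
    total = 0
    satisfied = 0
    for size in range(1, n + 1):
        for coalition in combinations(range(n), size):
            total += 1
            allocated = sum(values[i] for i in coalition)
            # Small tolerance guards against floating-point rounding.
            if allocated + deficit >= utility(coalition) - tol:
                satisfied += 1
    return satisfied / total


def majority_utility(coalition):
    """Toy game (assumption): a coalition is worth 1 if it has at least 2 of 3 players."""
    return 1.0 if len(coalition) >= 2 else 0.0


# For this toy game the exact least core is x = (1/3, 1/3, 1/3) with deficit e* = 1/3.
true_deficit = 1 / 3
exact_values = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical estimate, as might result from solving with only a few sampled coalitions.
estimated_values = [0.5, 0.5, 0.0]

exact_accuracy = satisfied_fraction(exact_values, majority_utility, true_deficit)
estimated_accuracy = satisfied_fraction(estimated_values, majority_utility, true_deficit)
print(exact_accuracy)      # 1.0
print(estimated_accuracy)  # 5 of 7 coalitions satisfied
```

Here the estimate violates the constraints for the two pairs that include the third player, so its accuracy is 5/7, which corresponds to $\delta = 2/7$. Enumerating all coalitions is only feasible for a small number of players, which is why the experiment uses feature valuation on datasets with few features.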

Local ray cluster messes with random seed

Unfortunately, starting a ray cluster directly is known to interfere with
the random seed set in the main process (see this issue for more information).

Therefore, we should include a docker-compose file in the repository
to start the ray cluster separately.
