epfml / disco
Decentralized & federated privacy-preserving ML training, using p2p networking, in JS
License: Apache License 2.0
As our repository currently loads JS images into tensors, we do not have strict requirements on explicit image data types.
Remove these requirements from the task file.
Possibly also check special image formats (transparency in images, SVG) and add tests for them.
When I train a model on the LUS-Covid task, everything works properly and I reach a training accuracy of around 88% and a validation accuracy of around 75%. But when I leave the application and train the model again, I get a validation accuracy of 30% and a training accuracy of 50%.
I suspect there is a problem with the creation of new models, since the way I managed to fix it was to delete the models from storage, refresh the page, and restart the application workflow from the task list.
Maybe the lus-covid task is not properly linked with the memory managers that were implemented afterwards.
Support a subset of nodes that behave arbitrarily maliciously (sending arbitrary messages to neighbors instead of true gradients).
Postponed for now until the honest-workers model is implemented and evaluated, and the failure model for nodes is supported.
Switch from local storage to IndexedDB. This will allow a greater memory space for models.
Change the serialization of the model: TensorFlow.js uses a different file architecture depending on whether a model is saved to local storage or to IndexedDB.
IndexedDB should be served by a separate script.
Evaluate which bindings or deployments of libp2p would be most suitable for communicating gradients as in ML training, for supporting mobile phone OSs, and for interfacing with e.g. PyTorch / JAX or similar schemes.
This is a big one; we'll look into it a bit later.
We could add a few more small checks and error messages.
Right now the MNIST example throws a NaN error if not all image types are present, and the Titanic one sometimes complains about the last column missing even when it's not (maybe it doesn't like a newline at the end of a CSV file?). It works nicely with the provided Titanic data though.
Add a status message in the UI once data has been successfully loaded, and another once training has successfully started.
Now that we use Google App Engine (GAE) to host our app, we can no longer use different ports for different tasks, since we access the server through a domain. Instead, we can host each task as a separate service within our app; in this way, each task gets its own domain. The potential benefit is that we do not have to interrupt existing tasks to add a new one. I haven't tried this yet. Let me know what you think and whether you have any alternative ideas.
Ping: @martinjaggi, @tvogels
GAE Reference:
https://cloud.google.com/appengine/docs/standard/nodejs/an-overview-of-app-engine
Allow joint training within a smaller set of trusted participants only (in addition to public tasks).
This functionality basically recovers federated learning as a special case. It does not give the same security standard as a full PKI, and is not a replacement for a selection mechanism for helpful clients (as opposed to Byzantine ones), but it is a start (see the same discussions in the federated learning literature).
The training algorithm should support realistic changes of the communication graph, such as node failures or offline time. This issue only considers non-malicious nodes; Byzantine nodes will be discussed later in separate issues.
We can experiment with candidate algorithms from the following papers, for example, and test them on the simulator.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
https://arxiv.org/pdf/2003.10422
SwarmSGD: Scalable Decentralized SGD with Local Updates
https://arxiv.org/abs/1910.12308
Using https://libp2p.io/ or a simplified backend, build a first prototype which connects nodes and allows the exchange of dummy arrays (tensors) between nodes.
See #3 for discussion on libp2p
As an example use-case, let's build a decentralized model for browser-based images.
We can start from a stand-alone version based on
https://github.com/justadudewhohacks/face-api.js/ (uses TF.js)
see for example
https://www.codeproject.com/Articles/5276827/AI-Age-Estimation-in-the-Browser-using-face-api-an
On top of it, we only need to extract gradients to incorporate it into collaborative DeAI training. This will also be a good use-case to refine our task descriptions, image preprocessing pipelines, and UI.
A new model is created every time someone clicks the "join training" button in the description frame.
Add a variable so that once the model is created, it is never created again (at least for the current session).
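A minimal sketch of such a guard (the names getOrCreateModel and createModel are illustrative, not the actual DeAI API): the model is cached per task, so repeated clicks on "join training" reuse the existing model for the rest of the session.

```javascript
// Hypothetical sketch: cache models per task so that clicking
// "join training" again reuses the existing model for this session.
const modelCache = new Map();

function getOrCreateModel(taskName, createModel) {
  // Build a fresh model only the first time this task is joined.
  if (!modelCache.has(taskName)) {
    modelCache.set(taskName, createModel());
  }
  return modelCache.get(taskName);
}
```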
Check the MNIST model inference: prediction for test digits.
Test Procedure
Resulting CSV File output for all digits (uploaded in order smallest to largest digit) is attached to this post.
Digits used for training and testing:
Set up: tested on local machine, nvm v16.3.0, npm 7.15.1
Console log
[Log] Start: Processing Uploaded File
[Log] User File Validated. Start parsing.
[Log] Start Training
[Log] _________________________________________________________________
[Log] Layer (type) Output shape Param #
[Log] =================================================================
[Log] conv2d_Conv2D1 (Conv2D) [null,26,26,16] 448
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D1 [null,13,13,16] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D2 (Conv2D) [null,11,11,32] 4640
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D2 [null,5,5,32] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D3 (Conv2D) [null,3,3,32] 9248
[Log] _________________________________________________________________
[Log] flatten_Flatten1 (Flatten) [null,288] 0
[Log] _________________________________________________________________
[Log] dense_Dense1 (Dense) [null,64] 18496
[Log] _________________________________________________________________
[Log] dense_Dense2 (Dense) [null,10] 650
[Log] =================================================================
[Log] Total params: 33482
[Log] Trainable params: 33482
[Log] Non-trainable params: 0
[Log] _________________________________________________________________
[Log] Proxy
[Log] EPOCH (1): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (2): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (3): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (4): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (5): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (6): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (7): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (8): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (9): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (10): Train Accuracy: 100.00,
Val Accuracy: 100.00
[Log] loss 0.1054
[Log] mnist-model
[Log] Deactivated
[Log]
[Log] Loading model...
[Log] Model loaded.
[Log] Prediction Sucessful!
[Log] undefined
[Log] Object
CIFAR-10 or CIFAR-100, or even go to ImageNet directly :)
In the task description we could maybe include a link explaining how people can download an (arbitrary) part of the official dataset from somewhere. It is a bit unclear what format would be best; it is probably too large to 'upload' as individual images in the UI.
ImageNet could work too if lots of people joined and everyone only had a very small part of the data.
For tabular datasets (popular examples: adult income and titanic), normalization is critical for neural network approaches.
The most typical and very effective way to normalize is to subtract the mean and divide by the standard deviation. However, computing these statistics in a decentralized fashion is non-trivial; for DeAI to support this, additional functionality needs to be implemented.
Examples of how this can be addressed:
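One possible approach can be sketched as follows (an assumption of mine, not an existing DeAI feature): each node shares only aggregate statistics per column (count, sum, sum of squares), the aggregates are summed across nodes (that sum could itself use the same secure aggregation as the gradients), and the global mean and standard deviation are derived from the totals without any raw rows leaving a node.

```javascript
// Each node computes aggregates locally; no raw values are shared.
function localStats(values) {
  return {
    n: values.length,
    sum: values.reduce((a, v) => a + v, 0),
    sumSq: values.reduce((a, v) => a + v * v, 0),
  };
}

// Combine per-node aggregates into the global mean and std.
function globalMeanStd(statsList) {
  const n = statsList.reduce((a, s) => a + s.n, 0);
  const sum = statsList.reduce((a, s) => a + s.sum, 0);
  const sumSq = statsList.reduce((a, s) => a + s.sumSq, 0);
  const mean = sum / n;
  // Var(X) = E[X^2] - E[X]^2
  const std = Math.sqrt(sumSq / n - mean * mean);
  return { mean, std };
}
```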
Let's have the helper server running on a small instance on Google Cloud so the app can always be easily used (even with peer.js).
It's not clear yet how we want to keep the participants list updated at the moment (one server or two? maybe one on Google Cloud?).
With the modularisation changes, some components are never used. Once every PR is merged into the master branch, delete these unused files (i.e. ImageUploadFrame, GlobalTaskFrame, CSVUploadFrame, ...).
Given a task description (see #26), combined with a local dataset which was uploaded locally (see e.g. #28):
allow the user to define a local test set (button in the UI?).
for example, just remove 20% from the local train set and mark it as the test set.
implement the performance metric (accuracy) of a model on that test set.
add the performance metric description to the task description.
display the current test accuracy locally in the UI, and update it once a model is received.
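The steps above can be sketched in plain JS (framework-agnostic; the helper names splitTrainTest and accuracy are illustrative, not existing DeAI functions):

```javascript
// Hold out a fraction of the locally uploaded samples as a test set.
function splitTrainTest(samples, testFraction = 0.2) {
  const nTest = Math.floor(samples.length * testFraction);
  return {
    train: samples.slice(0, samples.length - nTest),
    test: samples.slice(samples.length - nTest),
  };
}

// Accuracy: fraction of predictions matching the true labels.
function accuracy(predictions, labels) {
  const correct = predictions.filter((p, i) => p === labels[i]).length;
  return correct / labels.length;
}
```

In the real app, accuracy would be recomputed and redisplayed each time an updated model is received.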
To achieve stronger privacy in terms of input privacy (private data but public models), we would like to avoid information leaks from the individual gradients which are communicated. To do so, the following route seems viable:
use simple additive secure aggregation (part of secure multi-party computation / MPC) over all individual gradients.
This scheme computes a public average/sum of all individual gradient vectors, while keeping each individual vector private; see e.g. https://arxiv.org/abs/2006.04747 for the federated case.
Suggestions welcome.
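The additive scheme can be sketched as follows. This is a toy illustration over floats (real secure aggregation works over a finite field with cryptographically masked shares); the function names are mine. Each node splits its gradient into n random shares that sum to the gradient and sends one share to each peer; every node publishes only the sum of the shares it received, and the sum of all published values equals the sum of all gradients.

```javascript
// Split a gradient vector into nNodes additive shares.
function makeShares(gradient, nNodes) {
  const shares = Array.from({ length: nNodes - 1 },
    () => gradient.map(() => Math.random() * 2 - 1));
  // Last share chosen so all shares sum exactly to the true gradient.
  const last = gradient.map((g, i) => g - shares.reduce((a, s) => a + s[i], 0));
  shares.push(last);
  return shares;
}

// Element-wise sum of a list of equal-length vectors.
function sumVectors(vectors) {
  return vectors.reduce((acc, v) => acc.map((a, i) => a + v[i]));
}
```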
The main problem is the following:
TFJS has a special way of handling models (and all related TFJS objects) from a memory standpoint. Vue also has a special way of handling objects stored in its data hook.
This leads to errors when a TFJS model is saved in the data part of a Vue component (i.e. we are unable to process the model).
The solution so far has been to move the model (contained in the training manager) outside of the component. However, this leads to an error when doing dynamic routing.
To make it simple: all tasks share the same frame, called MainTrainingFrame. In this frame (or the related image or CSV training frame), the training manager is located outside the definition of the component. Hence, even when we change task and a new component is created for that task, the state of the training manager is kept for the new component. This leads to unstable behavior.
So we can't store the training manager (which contains the TFJS model) outside the definition of the component, but we can't move it inside either.
The solution is to avoid having the model stored in an object at all. For instance, the training manager would not store a "model" variable. Each time the actual TFJS model is required, it is loaded by the function that needs it.
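A minimal sketch of that proposal (storage is faked with a plain object here; in DeAI the lookup would be an async load from browser storage, and all names are illustrative): the training manager never keeps a model field, and every consumer loads the model on demand for the duration of one call.

```javascript
// Stand-in for browser model storage.
const fakeStorage = { 'mnist-model': { layers: 3 } };

// Loads the model on demand; returns a fresh reference on every call,
// so neither Vue's data hook nor any manager ever holds a shared model.
function loadModel(name) {
  return { ...fakeStorage[name] };
}

function trainOneEpoch(name) {
  const model = loadModel(name); // loaded only for this call
  return model.layers;           // ...train, save back, drop the reference
}
```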
depends on #6
Get easy-to-use reference code for training (simulated decentralized, so just running locally without any p2p backend) on any given communication graph, on a standard/toy dataset. This will be useful later to compare the p2p version against.
For simplicity, we'll first assume all nodes perform one step of SGD (gradient computation and communication) per clock step, and that the underlying communication graph remains fixed (the code should allow passing an arbitrary graph as input).
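The communication half of such a clock step can be sketched as one synchronous round of neighborhood averaging (gossip) on an arbitrary fixed graph given as an adjacency list. This is a generic illustration, not the repository's code; note that the global mean is only preserved when the implied mixing matrix is doubly stochastic (e.g. uniform weights on a regular graph, as in the ring used in the test).

```javascript
// One synchronous gossip round: each node replaces its value with the
// uniform average over its closed neighborhood (itself plus its neighbors).
function gossipStep(values, neighbors) {
  return values.map((v, i) => {
    const group = [i, ...neighbors[i]];
    const sum = group.reduce((a, j) => a + values[j], 0);
    return sum / group.length;
  });
}
```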
We are currently using an all-reduce scheme for model averaging. We should add the option to use RelaySGD. I will start working on this.
both are now separate folders inside `mobile-
once merged, it's enough to only have one folder
As a fallback, if the UI finds no peers, we can also just let it do local training.
In the client UI, allow the client to see the available tasks (see #26) and select one to join.
Generalize and improve the current data processing pipeline, with the goal to allow easy utilization of data as part of the projects, facilitate technical debugging, facilitate understanding of the data science background and challenges, and separate those challenges from one another.
In a minimalistic setting, this includes:
In addition, the following features would be nice to have (now or later):
4) Ability to get data-science-style diagnostics of the data in the interface (label distribution, before/after training evaluation)
5) Capabilities to handle advanced bugs relating to data loading - handle corrupt/invalid data in an optimal way
6) Minimize RAM/VRAM usage and possibly provide an overview in the UI
As a follow-up, this would require the current tasks/projects to be tested, checked, and verified again, with possible changes to the testing functions currently set up separately. Some tasks might be better connected to the UI, and UI changes would improve the overall usability (possibly in a separate issue).
If the data loaders work similarly to the Python version, a lot of these tasks can be handled naturally, but some sub-tasks would require follow-ups.
Add more detailed descriptions and illustrations to the information page.
Landing screen explaining the basics of the app and the main privacy model (similar to the current readme).
At the bottom of it we can have a button leading to the task list, labeled 'show available tasks' or similar. We can also place a link here leading to the documentation on 'how to create a new task'.
Provide simulated decentralized code (not using any p2p backend, just running locally), which holds a communication graph and distributes a standard/toy ML dataset among the nodes.
The data distribution should support both random and heterogeneous / non-iid splits (for example, different labels for each node).
We can use standard PyTorch code examples, e.g. MNIST or CIFAR.
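The non-iid split can be sketched generically (in JS rather than the PyTorch examples mentioned above; partitionByLabel is an illustrative name): each node receives only samples whose labels map to it, which is the extreme heterogeneous case.

```javascript
// Heterogeneous, non-iid partition: samples are assigned to nodes purely
// by label, so each node sees only a subset of the classes.
function partitionByLabel(samples, nNodes) {
  const parts = Array.from({ length: nNodes }, () => []);
  for (const s of samples) {
    parts[s.label % nNodes].push(s);
  }
  return parts;
}
```

A random (iid) split would instead shuffle the samples and deal them out round-robin.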
Depends on #1. Related to #2 as well.
Modify the reference code (simulated decentralized) for a given communication graph, on a standard/toy dataset.
Incorporate a basic asynchronous model, i.e. allow node and edge failures in SGD or, in other words, a few variants of time-varying graphs. This can also be used to simulate some realistic notions of fault tolerance.
This will be used later to compare the p2p version against, and to test different algorithm variants before implementing them in the real p2p framework.
We have to look into how to allow devices behind NAT to send and receive p2p messages.
Options:
We currently have the training component for CSV files, but no testing component for them. Introduce a testing component for CSV files.
The idea is simple: the testing process can be viewed as a function that takes as input a standard CSV file for the task but without the label column. It then returns the same CSV file, but with the labels filled in.
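Viewed as a pure function, the component is tiny (labelCsvRows and predict are illustrative names; in the real component, predict would run the trained model on each parsed CSV row):

```javascript
// Take parsed CSV rows without a label column and return the same rows
// with a predicted label appended.
function labelCsvRows(rows, predict) {
  return rows.map(row => ({ ...row, label: predict(row) }));
}
```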
At the moment, port numbers are hardcoded in the communication_manager file. Inside this file we should always use the portNbr attribute passed in the constructor, and each task should pass its corresponding port number to the communication_manager.
When training in a distributed manner with 3 or more peers, there is a deadlock if one of the peers is not sharing weights and you set a threshold of 2.
The threshold should only control waiting until you receive all weights before averaging them, but we should additionally put a time limit (for example, wait at most 10 seconds).
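The proposed fix can be sketched as follows (a simulation with a tick counter standing in for wall-clock time; collectWeights and its parameters are hypothetical, not the actual DeAI code): collection stops either when the threshold count of peers has answered or when the time limit runs out, and averaging then proceeds with whatever was received, so a silent peer can no longer deadlock the round.

```javascript
// Collect peer weights until either `threshold` arrivals or `maxTicks`
// simulated time steps have passed, whichever comes first.
function collectWeights(receiveAtTick, threshold, maxTicks) {
  const received = [];
  for (let tick = 0; tick < maxTicks; tick++) {
    if (receiveAtTick[tick] !== undefined) received.push(receiveAtTick[tick]);
    if (received.length >= threshold) break; // enough peers answered
  }
  return received; // may be fewer than threshold if the time limit was hit
}
```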
The file to be fixed is helpers.js in helpers/communication_script/helpers. One first problem I have seen is that the checkArrayLen function on line 120 is useless and creates an infinite loop: the arr argument never changes, yet we loop until its length increases. There is a similar problem with dataReceivedBreak. I think the way to fix this is to have a general object accessible to all of these functions, and to add a limit on the number of tries in checkArrayLen.
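That fix might look like the following (a sketch, not the actual helpers.js: the shared state object and the onTry callback are stand-ins for however the surrounding code mutates the array): the function polls a state object that other helpers can actually modify, and gives up after a bounded number of tries instead of spinning forever.

```javascript
// Poll state.arr until it reaches wantedLen, at most maxTries times.
// onTry stands in for whatever lets the rest of the system make progress
// (e.g. processing an incoming message that appends to state.arr).
function checkArrayLen(state, wantedLen, maxTries, onTry) {
  for (let tries = 0; tries < maxTries; tries++) {
    if (state.arr.length >= wantedLen) return true;
    onTry(state);
  }
  return state.arr.length >= wantedLen; // gave up instead of looping forever
}
```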
like #28, but for images
After the client has joined a task (see issue #27), allow the client to upload (via an HTML form) a local training dataset to the browser's local storage.
As suggested by @Saipraneet, as an alternative to stand-alone application frameworks such as libp2p (see #3), we should evaluate browser-based communication frameworks. This could be advantageous in that it might be easier to get running on desktops and mobiles.
WebRTC is supported in modern browsers. On top of it, there are several JavaScript wrappers such as peerJS; in particular, simple-peer looks very suitable.
Could this be suitable for communicating gradients as in ML training, for supporting desktop or mobile phone OSs (e.g. CoreML?), and for interfacing with e.g. PyTorch or similar schemes?
If so, we could try something like this to build a p2p communication prototype (#5)? What do people think?
Seems we need some help here...
The auxiliary server needs to be expanded to host task descriptions. Each task description should contain:
To share it with the client, we could also organize them via JSON, for example, for the machine-readable variant.
Once it works, convert the dummy CSV task (adult dataset) and the MNIST (image) task to this format and make sure they still work.
Human-readable format: the task description should also be easy to visualize in human-readable form, in HTML (so it can be served by the server as well as rendered locally on the client).
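A machine-readable task description might look like the sketch below. All field names here are illustrative assumptions, not a fixed schema; the actual set of fields is what this issue is meant to decide.

```json
{
  "taskID": "mnist",
  "displayName": "MNIST",
  "description": "Handwritten digit recognition on 28x28 grayscale images",
  "dataType": "image",
  "labels": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
  "trainingInformation": {
    "epochs": 10,
    "batchSize": 32
  }
}
```

The HTML (human-readable) variant could then be rendered from this same JSON on either the server or the client.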