
commefficient's People

Contributors

dhroth, kiddyboots216, sunahhlee


commefficient's Issues

Help with the PersonaChat experiment command

Hello authors, thank you for your great paper -- I'll be citing it in my upcoming work.

I am trying to run your code, specifically the PersonaChat experiments. Would you be able to provide the exact set of commands used to generate the results in Figure 5 (i.e., ~15 ppl)? I have read the paper and all of your code (you did a lot of work on this!), and I want to confirm the command for the run. It would be extremely helpful if you could share your commands.

At the moment I am simply trying to replicate the uncompressed results. My current command is:

 python gpt2_train.py \
--dataset_name PERSONA \
--local_momentum 0 \
--dataset_dir ./dataset/personachat \
--mode uncompressed \
--seed 42 \
--local_batch_size -1 \
--num_results_train 1 --num_results_val 2 \
--num_epochs 1 \
--valid_batch_size 4 \
--port 6239 \
--num_workers 4 --num_devices 4

Is this right? Have I missed a hyperparameter that might be important?

I tried changing the learning rate to 0.16 with --lr_scale 0.16, but I quickly get NaNs. Should I set lm_coef or mc_coef to something other than the default? The results I'm getting are quite a bit worse than the paper's, so I'm trying to track down the discrepancy.

Thank you so much for your help! I really do appreciate it.

ResNet9 Pooling

Hi authors,

While training on CIFAR10 using ResNet9 (from models/resnet9.py) with the default settings (i.e., the default channel sizes), I got an error at the last pooling layer, out = self.pool(out).view(out.size()[0], -1), in BasicNet.forward.

After printing the size of the tensor entering the pooling layer, I found that the input was of size [batch_size, 512, 3, 3]. I fixed the error by changing the last pooling layer to self.pool = nn.MaxPool2d(2), which makes sense since the last linear layer expects a tensor of size [batch_size, 512].

Has this error come up before, or could something else have gone wrong? I see that the last pooling layer was hardcoded to nn.MaxPool2d(4); was the expected input of size [batch_size, 512, 5, 5], and did I do something wrong upstream?
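
For reference, here is a minimal standalone shape check (my own snippet, independent of the repo's code) showing why nn.MaxPool2d(2) works on the input I observed while nn.MaxPool2d(4) cannot:

import torch
import torch.nn as nn

x = torch.randn(8, 512, 3, 3)  # the size I observed entering the final pool

# nn.MaxPool2d(4) needs at least a 4x4 spatial input (kernel 4, stride 4),
# so on a 3x3 input it raises "RuntimeError: ... Output size is too small".
# nn.MaxPool2d(4)(x)

out = nn.MaxPool2d(2)(x)          # kernel 2, stride 2 -> [8, 512, 1, 1]
flat = out.view(out.size(0), -1)  # -> [8, 512], what the final linear layer expects
print(out.shape, flat.shape)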

Thanks,
Howard

Question about the final test_acc in the CIFAR10 experiment

Hi, I tried to reproduce the experiment results in the paper. I am using the following command:

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu

The accuracy does not look right; could you help me figure out the problem? The logs are:

MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch lr train_time train_loss train_acc test_loss test_acc total_time
1 0.0800 25.2243 2.3028 0.1038 2.3012 0.1405 30.4418
2 0.1600 23.8377 2.3025 0.1017 2.2936 0.1460 57.5426
3 0.2400 24.0562 2.2886 0.1157 2.2449 0.1507 84.8176
4 0.3200 23.0985 2.2479 0.1461 2.1887 0.1535 111.1938
5 0.4000 22.0071 2.2901 0.0944 2.2941 0.0930 136.4487
6 0.3789 21.9321 2.3150 0.1301 3.3015 0.0997 161.6546
7 0.3579 21.9460 2.3782 0.1078 2.2771 0.1324 186.8818
8 0.3368 21.8156 2.2793 0.1264 2.2281 0.1360 211.9292
9 0.3158 21.6892 2.2410 0.1775 2.2307 0.1417 236.9210
10 0.2947 21.9432 2.2989 0.1024 2.2831 0.1175 262.0983
11 0.2737 21.9095 2.2511 0.1332 2.1657 0.1901 287.2876
12 0.2526 27.3621 2.1729 0.1771 2.1231 0.1734 321.2075
13 0.2316 37.6449 2.1274 0.1580 2.1067 0.2008 365.1934
14 0.2105 32.6825 2.3116 0.1308 2.0721 0.2026 401.1535
15 0.1895 22.3018 2.1435 0.1707 2.0014 0.2332 426.7760
16 0.1684 30.7159 2.0729 0.1982 2.1173 0.2312 460.7642
17 0.1474 22.4368 2.1110 0.2006 2.0027 0.2580 489.7420
18 0.1263 39.1600 2.0538 0.1897 2.0412 0.2377 535.3520
19 0.1053 38.9138 2.0614 0.2156 2.0193 0.2655 580.7346
20 0.0842 21.9821 1.9763 0.2441 2.0301 0.2679 605.9769
21 0.0632 32.8850 1.9892 0.2655 2.0524 0.2711 645.6084
22 0.0421 38.2427 1.9478 0.2627 1.8612 0.3094 690.2626
23 0.0211 38.3396 1.9010 0.2778 1.8869 0.2993 735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24 0.0000 33.1543 1.9016 0.2929 1.8394 0.3032 771.5566
done training
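
For reference, the lr column above matches a piecewise-linear warmup/decay schedule (my own reconstruction; I am assuming knots at [0, pivot_epoch, num_epochs] with values [0, lr_scale, 0], which would also explain the "WARNING: LR is 0" lines at the final epoch):

import numpy as np

# Reconstruction (an assumption, not necessarily the repo's exact code) of the
# schedule implied by the lr column: linear warmup to lr_scale at pivot_epoch,
# then linear decay to 0 at num_epochs.
def lr_at_epoch(epoch, lr_scale=0.4, pivot_epoch=5, num_epochs=24):
    return np.interp(epoch, [0, pivot_epoch, num_epochs], [0.0, lr_scale, 0.0])

print(lr_at_epoch(1), lr_at_epoch(6), lr_at_epoch(24))  # -> 0.08, 0.3789..., 0.0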

PersonaChat (GPT2)

Hello, sorry to bother you again. I want to reproduce the experiment shown in the attached figure; could you tell me what the run parameters are?
[attached image: GPT-2 PersonaChat results figure]

Query for the type of malicious attack

Dear authors,
I am studying this paper and its source code. Could you tell me what kinds of attack the letters "A", "B", "C", and "D" (in MAL_ATTACK_TYPES in utils.py) represent?
Thank you!

Issue with fetchpgd

Hi, sorry to bother you again.
I want to run only fetchpgd on CommEfficient-attacks, to test the accuracy difference between the SparseFed and FetchSGD methods, but I encounter a problem.
My hyperparameters are --dataset_dir data/cifar10 --tensorboard --dataset_name CIFAR10 --model ResNet9 --mode fetchpgd --k 10000 --num_blocks 1 --num_rows 1 --num_cols 325000 --num_clients 200 --num_workers 10 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9
The k, num_rows, and num_cols values I use are the same as for FetchSGD.
But I got:

File "CommEfficient-attacks\CommEfficient-attacks\CommEfficient\fed_worker.py", line 177, in worker_loop
    sum_g += g
RuntimeError: The size of tensor a (500000) must match the size of tensor b (6568640) at non-singleton dimension 1
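
To illustrate what I think is going wrong (a toy reproduction of my own, not the repo's code): the accumulator seems to be sketch-sized while the incoming gradient is full-length (6,568,640 matches the ResNet9 gradient size printed by the repo), so the element-wise add fails:

import torch

sketch_numel = 500_000    # size of tensor a in my error
grad_numel = 6_568_640    # size of tensor b (the full ResNet9 gradient)

sum_g = torch.zeros(1, sketch_numel)  # sketch-sized accumulator
g = torch.randn(1, grad_numel)        # full-length gradient from a worker
# sum_g += g  # RuntimeError: The size of tensor a (500000) must match the
#             # size of tensor b (6568640) at non-singleton dimension 1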
Could you please help me fix it? Thanks a lot!

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Hello,

I'm trying to run cv_train.py with the command-line arguments "--dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1" under PyTorch 1.8.0, but I am met with the following error:

File "D:\CommEfficient\CommEfficient\cv_train.py", line 405, in
lr_scheduler = LambdaLR(opt, lr_lambda=lambda_step)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 203, in init
super(LambdaLR, self).init(optimizer, last_epoch, verbose)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 77, in init
self.step()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 152, in step
values = self.get_lr()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in get_lr
return [base_lr * lmbda(self.last_epoch)

File` "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in
return [base_lr * lmbda(self.last_epoch)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\cv_train.py", line 404, in
lambda_step = lambda step: lr_schedule(step / spe)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\utils.py", line 28, in call
return np.interp([t], self.knots, self.vals)[0]

File "<array_function internals>", line 180, in interp
File "D:\anaconda3\envs\test\lib\site-packages\numpy\lib\function_base.py", line 1570, in interp
return interp_func(x, xp, fp, left, right)

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Have you encountered this error? I've tried running the code on both Linux and Windows and hit the same issue on both. Am I missing a command-line argument?
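
For what it's worth, the same TypeError can be reproduced in isolation when one of the interpolation values is None, which makes NumPy build an object-dtype array; so my guess is that some schedule knot or value is ending up unset (a hypothesis, not a confirmed diagnosis):

import numpy as np

# Minimal reproduction: a None among the values gives the array dtype('O'),
# and np.interp then fails with exactly this TypeError.
knots = [0, 5, 24]
vals = [0.0, None, 0.0]
np.interp([1.0], knots, vals)
# TypeError: Cannot cast array data from dtype('O') to dtype('float64')
# according to the rule 'safe'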

Reproduce the results in the paper

Hi, I tried to reproduce the experiment results in the paper using the command below, but the logs do not look correct. Could you share the command lines you used for the paper? I am really interested in your work and would like to explore sketching techniques further.

python cv_train.py --dataset_name CIFAR10 --iid --num_workers 2 --lr_scale 0.4 --local_momentum=0.0 --num_devices 2 --num_clients 2

MY PID: 31280
5315 port in use, trying next...
Namespace(checkpoint_path='./checkpoint', dataset_dir='./dataset', dataset_name='CIFAR10', device='cuda', do_batchnorm=False, do_checkpoint=False, do_dp=False, do_finetune=False, do_iid=True, do_test=False, do_topk_down=False, dp_mode='worker', error_type='none', eval_before_start=False, fedavg_batch_size=-1, fedavg_lr_decay=1, finetune_path='./finetune', finetuned_from=None, k=50000, l2_norm_clip=1.0, lm_coef=1.0, local_batch_size=8, local_momentum=0.0, lr_scale=0.4, max_grad_norm=None, max_history=2, mc_coef=1.0, microbatch_size=-1, mode='sketch', model='ResNet9', model_checkpoint='gpt2', nan_threshold=999, noise_multiplier=0.0, num_blocks=20, num_candidates=2, num_clients=2, num_cols=500000, num_devices=2, num_epochs=24, num_fedavg_epochs=1, num_results_train=2, num_results_val=2, num_rows=5, num_workers=2, personality_permutations=1, pivot_epoch=5, port=5646, seed=21, share_ps_gpu=False, train_dataloader_workers=0, use_tensorboard=False, val_dataloader_workers=0, valid_batch_size=8, virtual_momentum=0, weight_decay=0.0005)
50000 625
Using BatchNorm: False
Finished initializing in 11.00 seconds
miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
CommEfficient/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1055.)
grad_vec.add_(args.weight_decay / args.num_workers, weights)
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 655.4752 2.3025 0.1009 2.3025 0.1014 0 59606 679.6477
2 0.1600 649.9156 2.3025 0.1008 2.3025 0.1014 0 59606 1343.1710
3 0.2400 649.3290 2.3025 0.1011 2.3025 0.1014 0 59606 2006.0574

Variable velocity

Does the variable velocity stand for momentum? There is no "velocity" in the paper.

If I set args.local_momentum = 0 with mode = sketch, will local_velocity always be None?
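
For context, my understanding is that "velocity" is the conventional name for the momentum buffer in SGD; a minimal sketch (standard SGD with momentum, not the authors' code):

import torch

# v accumulates gradients like a physical velocity; with momentum = 0
# there is nothing to track, so the buffer can stay None.
def momentum_step(w, grad, v, lr=0.1, momentum=0.9):
    if momentum == 0:
        return w - lr * grad, None
    v = momentum * v + grad if v is not None else grad.clone()
    return w - lr * v, v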

Could the authors provide more commands to reproduce all of the experiments in the paper?

Question about the CIFAR10 experiment command

Hello, authors. When I run "python cv_train.py", train_loss and train_acc are always NaN.
Here is my command:

python cv_train.py
--dataset_dir /home/data/cifar
--dataset_name CIFAR10
--num_results_train 1
--train_dataloader_workers 4
--val_dataloader_workers 4
--num_devices 2
--error_type virtual
--lr_scale 0.3
--num_workers 4
--num_clients 10000

Have I missed some important hyperparameters? Could you please provide the exact set of commands you used? It would be very helpful if you shared them.

Issue with sketch mode

Hi,
Sorry to bother you, but I have some problems deploying sketch mode.
I get 0 gradient updates and 0 downloaded bytes on the second iteration. Could you please tell me what the problem is?
My hyperparameters are --dataset_dir data/cifar10 --tensorboard --local_batch_size 50 --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0

I think the problem may be in sketch.accumulateVec(grad), but I don't know how to modify it.
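
For context, here is a toy Count Sketch accumulation I put together while debugging (just an illustration of what accumulateVec does conceptually; it is not the CSVec library the repo actually uses):

import torch

class ToyCountSketch:
    # Each coordinate hashes to one bucket per row with a random +/-1 sign;
    # accumulating a vector is a signed scatter-add into each row.
    def __init__(self, d, num_rows, num_cols, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(num_cols, (num_rows, d), generator=g)
        self.signs = torch.randint(2, (num_rows, d), generator=g) * 2 - 1
        self.table = torch.zeros(num_rows, num_cols)

    def accumulate_vec(self, vec):
        for r in range(self.table.size(0)):
            self.table[r].scatter_add_(0, self.buckets[r], self.signs[r] * vec)

sketch = ToyCountSketch(d=1000, num_rows=5, num_cols=200)
sketch.accumulate_vec(torch.randn(1000))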

Thanks a lot.
