I use the following to generate the Shakespeare data.


The statistics of the Shakespeare dataset are inconsistent with the paper's description

talwalkarlab commented on July 29, 2024

The statistics of the Shakespeare dataset are inconsistent with the paper's description.

from leaf.

Comments (7)

yanfangli1986 commented on July 29, 2024

After running the command, I got the following results. I wonder what is wrong here?
./preprocess.sh -s iid --sf 1.0 -k 0 -t sample -tf 0.8
DATASET: shakespeare
424 users
1992135 samples (total)
4698.43 samples per user (mean)
num_samples (std): 10122.73
num_samples (std/mean): 2.15
num_samples (skewness): 6.69

num_sam num_users
0 250
2000 66
4000 18
6000 16
8000 18
10000 13
12000 11
14000 2
16000 5
18000 3
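For anyone who wants to cross-check numbers like these independently of the repository's scripts, here is a minimal sketch of how such a summary could be computed from a list of per-user sample counts. The helper name `summarize` and the conventions it uses (population standard deviation, biased skewness, histogram bins keyed by their lower edge) are assumptions for illustration; the actual stats script may use different conventions.

```python
import math
from collections import Counter


def summarize(num_samples, bin_width):
    """Summarize per-user sample counts (hypothetical helper, not the repo's code)."""
    n = len(num_samples)
    total = sum(num_samples)
    mean = total / n
    # Population standard deviation (divide by n). This convention is an assumption.
    std = math.sqrt(sum((x - mean) ** 2 for x in num_samples) / n)
    # Biased skewness: third central moment divided by std cubed.
    m3 = sum((x - mean) ** 3 for x in num_samples) / n
    skewness = m3 / std ** 3 if std > 0 else 0.0
    # Histogram keyed by the lower edge of each bin, e.g. 0, 2000, 4000, ...
    histogram = Counter((x // bin_width) * bin_width for x in num_samples)
    return {
        "users": n,
        "total": total,
        "mean": mean,
        "std": std,
        "std_over_mean": std / mean,
        "skewness": skewness,
        "histogram": dict(sorted(histogram.items())),
    }


if __name__ == "__main__":
    # Toy input only; real input would be one count per user.
    print(summarize([2, 4, 4, 4, 5, 5, 7, 9], bin_width=2))
```

The reported mean can also be checked directly from the two totals above: 1992135 samples over 424 users gives about 4698.43 samples per user, which matches the output.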


scaldas commented on July 29, 2024

The Project Gutenberg EBook we use to extract the Shakespeare data has changed. I just updated the relevant pre-processing script to point to a similar version of the file, but the statistics have indeed changed (they will be updated in a new version of the preprint we are working on). Right now, running the same command as @chaoyanghe, I am getting:

####################################
DATASET: shakespeare
1129 users
4226158 samples (total)
3743.28 samples per user (mean)
num_samples (std): 6212.26
num_samples (std/mean): 1.66
num_samples (skewness): 3.35

num_sam num_users
0 705
2000 126
4000 72
6000 56
8000 38
10000 33
12000 31
14000 16
16000 8
18000 11


chaoyanghe commented on July 29, 2024

@scaldas Hi, thanks for your reply. I have been waiting for a long time...

I also found that the FEMNIST statistics do not align with yours:
(venv) (base) chaoyanghe-hostname:femnist chaoyanghe$ sh stats.sh
####################################
DATASET: femnist
3500 users
791913 samples (total)
226.26 samples per user (mean)
num_samples (std): 89.12
num_samples (std/mean): 0.39
num_samples (skewness): 0.77

num_sam num_users
0 1
20 4
40 11
60 5
80 15
100 65
120 122
140 392
160 1237
180 322
200 44
220 52
240 87
260 92
280 116
300 157
320 156
340 181
360 166
380 147
400 87
420 36
440 3
460 1
480 0

Could you also help check the reason? Since I will cite your paper, I need to be able to state that we use the same dataset.


scaldas commented on July 29, 2024

@chaoyanghe I will look into this, but if your work is time-sensitive, consider using the FEMNIST version hosted at TensorFlow Federated (they call it EMNIST). They host their own (slightly different) version and thus don't have the problem of mutating sources (which I believe is the issue here as well).

https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/emnist

We will look into hosting our own version of the datasets in the future.


Enehta commented on July 29, 2024

@scaldas I just tried to get a fresh FEMNIST dataset and I am only getting 1900 users instead of the 3500 I got before. Was that dataset changed as well?


scaldas commented on July 29, 2024

@Enehta Unfortunately, at this time we only host preprocessing scripts for data that is hosted elsewhere. If that data mutates, the output of our scripts mutates as well. We are actively working on solving this by hosting the datasets ourselves. In the meantime, consider using the FEMNIST version hosted at TensorFlow Federated (they call it EMNIST). They host their own (slightly different) version and thus don't have the problem of mutating sources.

https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/emnist
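Until the datasets are hosted in one place, one way to notice a mutated upstream file before comparing statistics is to record a checksum of the raw download and verify it on later runs. The following is only a sketch under that assumption: the helper names `sha256_of_file` and `source_unchanged` are hypothetical, and nothing like this is known to exist in the repository's preprocessing scripts.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large raw data files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_unchanged(path, expected_hex):
    """True if the file at `path` still matches a previously recorded digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If two people record the same digest for the raw file, differences in their statistics must come from preprocessing options rather than from the source; if the digests differ, the source itself has changed.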


future-xy commented on July 29, 2024

Interestingly, I found 3500 users and 803267 samples in total in the FEMNIST dataset.
####################################
DATASET: femnist
3500 users
803267 samples (total)
229.50 samples per user (mean)
num_samples (std): 89.03
num_samples (std/mean): 0.39
num_samples (skewness): 0.71

