It seems that the data for 2020-08-18 is not quite right.

Wrong padding multiplier/wrong number of users for 2020-08-18? about ctt HOT 5 CLOSED

PalminX commented on August 24, 2024

Wrong padding multiplier/wrong number of users for 2020-08-18?

from ctt.

Comments (5)

mh- commented on August 24, 2024

For one specific case, there was an explanation here: corona-warn-app/cwa-server#693
A user submitted twice, the original keys were accepted only once (to avoid duplicate keys), but 2x4 random padding keys were added.
If someone uses my diagnosis-keys tools, I’d suggest not always relying on the auto-detect feature, but fixing the factor at 5 for now and changing it when required.
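
To illustrate the arithmetic behind that case, here is a minimal sketch. It assumes that a padding factor of 5 means each real key is published together with 4 random ones, and that a duplicate submission adds only the extra padding. The function name is hypothetical and is not part of cwa-server or the diagnosis-keys tools.

```python
def published_key_count(real_keys: int, submissions: int = 1, factor: int = 5) -> int:
    """Keys in a package when the same real keys are submitted `submissions` times.

    Assumption: real keys are stored only once (deduplicated), but every
    submission adds (factor - 1) random padding keys per real key.
    """
    padding = submissions * (factor - 1) * real_keys
    return real_keys + padding


normal = published_key_count(10)
double = published_key_count(10, submissions=2)

print(normal, normal / 5)  # 50 10.0
print(double, double / 5)  # 90 18.0
```

Under these assumptions, dividing by a fixed factor of 5 overestimates the real keys for a double submission (18 instead of 10), which is the kind of inflated count discussed in this issue.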


PalminX commented on August 24, 2024

Hm, there are 3749 keys in the 2020-08-18 file, which obviously is not a multiple of 5. So it is maybe more a question for @micb25 of how he handles these discrepancies.
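
The check itself is a single modulo operation. A small sketch follows, with a hypothetical helper name; only the key count of 3749 comes from the file mentioned above.

```python
def is_consistent(total_keys: int, factor: int = 5) -> bool:
    """True if the key count is an exact multiple of the padding factor."""
    return total_keys % factor == 0


print(3749 % 5)             # 4
print(is_consistent(3749))  # False
```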


PalminX commented on August 24, 2024

OK, I saw that @micb25 sometimes manually corrected the multiplier in the past.
So I think you should also have some way of handling or flagging these inconsistent values here, because currently the number of users for 2020-08-18 is probably too high.


micb25 commented on August 24, 2024

> OK, I saw that @micb25 sometimes manually corrected the multiplier in the past.

Yes, I had to correct this manually for one of yesterday's hourly packages as well as for one package in the past (2020-08-04). I wonder what situation causes these issues. Fortunately, it seems to happen very rarely. However, the impact on the statistics can be quite significant, as you pointed out.

Edit: As a consequence, I manually check the statistics every day before uploading the new data. I think this will remain necessary, at least as long as fake diagnosis keys are being generated.


janpf commented on August 24, 2024

> https://ctt.pfstr.de/users/2020-08-18.txt shows a detected padding number of 1, resulting in 379 users

> The graphs for number of users and number of keys show 98 users and 952 keys, which doesn't match the numbers from approx. users file

The https://ctt.pfstr.de/X/Y.txt files are generated from the published daily package, while the graphs are based on the hourly packages, so there will be a discrepancy.
This is done because there is no use for the end user in clicking through 24 hourly files per day, but the analysis of the hourly files is of course more precise.

I've now changed it so that the https://ctt.pfstr.de/X/Y.txt files are always analysed with a fixed multiplier of 5. If the multiplier is wrongly detected, or actually changes, this will now be visible by comparing those files to the graphs (a difference of 1 or 2 users will nearly always be present).

> Is there a problem in the downloaded source data? If so, why does https://micb25.github.io/dka/ show more reasonable values?

Kind of, yes. If the padding is detected purely automatically, the value jumps all over the place for the hourly packages, as you correctly noted:

> There are 3749 keys in the 2020-08-18 file, which obviously is not a multiple of 5.

I wanted to keep the process of updating the page and analyzing new data as automated and "hands-off" as possible, so these cases were handled incorrectly on my end.
I did this to generate the data as transparently as possible, without any manual interventions.
Everybody can replicate my numbers by running the commands defined in the workflow file in that order.

> So it is maybe more a question to @micb25 how he handles these discrepancies.

It seems the only way to handle this is to set some reasonable hard-coded values like @micb25 did.

> So I think here you should also have some way of handling or flagging these inconsistent values, because currently the number of users from 2020-08-18 is probably too high

> If someone uses my diagnosis-keys tools, I’d suggest to not always use the auto detect feature, but fix the factor to 5 at the moment, and change it when required.

I've placed some safeguards, which should fix it for the moment.
I will use -n -a -m 5 (automatic detection activated, but capped at 5) on new packages every day. When an issue appears, I will manually flag the file to be reanalyzed with a fixed multiplier of 5 by adding it to this list.
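
As a rough sketch of that safeguard logic (not the actual workflow code): the detected multiplier is capped at 5, and files on a manually maintained list are always analysed with the fixed value. `FORCE_FIXED`, `choose_multiplier`, and the detected values passed in below are hypothetical names and inputs chosen for illustration.

```python
FIXED_MULTIPLIER = 5

# Hypothetical stand-in for the manually maintained list of flagged files.
FORCE_FIXED = {"2020-08-18"}


def choose_multiplier(date: str, detected: int, cap: int = FIXED_MULTIPLIER) -> int:
    """Return the padding multiplier to use for one daily package."""
    if date in FORCE_FIXED:
        # Flagged files ignore the automatic detection entirely.
        return FIXED_MULTIPLIER
    # Automatic detection stays active but never exceeds the cap.
    return min(detected, cap)


print(choose_multiplier("2020-08-18", detected=1))  # 5
print(choose_multiplier("2020-08-19", detected=9))  # 5
print(choose_multiplier("2020-08-19", detected=4))  # 4
```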

Thank you for notifying me about the issue!

