Data is being lost on read with updated versions of the pip dependencies.

Data loss about reportgen HOT 7 CLOSED

sjmf commented on July 19, 2024

Data loss

from reportgen.

Comments (7)

sjmf commented on July 19, 2024

Deps where this issue occurs: numpy-1.12.1, pandas-0.20.1 (latest)

Frozen versions with no issue: numpy-1.11.0, pandas-0.18.0

sjmf commented on July 19, 2024

Pandas 0.19.0 breaks this code. Pandas 0.18.1 works. (Both numpy 1.11.0 and 1.12.1 work, leading me to believe this is an issue with Pandas usage?)

sjmf commented on July 19, 2024

Commenting out line 257 in report.py:

    # Apply fixes to the data and diff the PIR movement
    dfs = dh.clean_data(dfs)

... yields the expected packet numbers. Checking the dh.clean_data function...

sjmf commented on July 19, 2024

Packets are discarded at line 192 of diff_pir in datahandling.py.

    dfs = {i: dfs[i].drop(dfs[i][dfs[i].PIRDiff > pir_threshold].index) for i in dfs}

A mitigation strategy would be to set the values to 0 instead of dropping them.
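
A minimal sketch of that mitigation, assuming `dfs` is a dict of DataFrames that each carry a `PIRDiff` column (as the line above implies). The helper name `zero_large_pir_diffs` is hypothetical and not part of reportgen; it only illustrates zeroing the out-of-range values so that no rows are dropped.

```python
import pandas as pd


def zero_large_pir_diffs(dfs, pir_threshold):
    """Return copies of each frame with PIRDiff values above the threshold set to 0.

    Hypothetical helper: unlike the drop-based line, every row (packet) is kept.
    """
    cleaned = {}
    for key, df in dfs.items():
        df = df.copy()  # leave the caller's frames untouched
        # NaN > threshold evaluates to False, so NaN entries are left as they are
        df.loc[df["PIRDiff"] > pir_threshold, "PIRDiff"] = 0
        cleaned[key] = df
    return cleaned
```

Because the row count is unchanged, packet totals computed afterwards should match the uncleaned data, whatever values `PIRDiff` ends up holding.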

However, what is different about the DataFrame in 0.19.0 which triggers this?

sjmf commented on July 19, 2024

PIR Diff result on Pandas 0.19.0 (with error)

DateTime
2016-02-02 17:58:31             NaN
2016-02-04 07:35:28             NaN
2016-02-08 13:55:25   -1.225261e+08
2016-02-09 07:33:33   -1.399888e+10
2016-02-09 13:22:58   -1.177617e+10
2016-02-10 15:28:20   -1.275398e+09
2016-03-23 17:52:49   -5.419100e+08
2016-03-23 19:04:56   -1.390800e+09
2016-03-23 19:35:09   -8.563218e+08
2016-03-23 19:36:07   -3.448276e+08
Name: PIRDiff, dtype: float64

Expected result (0.18.1):

DateTime
2016-02-02 17:58:31          NaN
2016-02-04 07:35:28          NaN
2016-02-08 13:55:25    -0.122526
2016-02-08 14:00:16    13.877688
2016-02-09 07:33:33   -13.998882
2016-02-09 08:31:46     9.651322
2016-02-09 09:15:27     3.928234
2016-02-09 13:22:58   -11.776167
2016-02-10 15:28:20    -1.275398
2016-03-23 17:52:49    -0.541910
Name: PIRDiff, dtype: float64

Note that the incorrect result appears to be off by the scale factor (1e9) applied on line 177, and the error propagates to subsequent values.

sjmf commented on July 19, 2024

Suspiciously, commenting out the scale factor on line 177 generates the expected result. What exactly changed about the diff() function in 0.19.0 to cause this?

http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html
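
Line 177 is not shown in this thread, so the following is a guess rather than a diagnosis. If `PIRDiff` is a rate (a value difference divided by the time elapsed between samples), one way to avoid a hand-written 1e9 factor is to convert the elapsed time to seconds explicitly. The function name `pir_rate_per_second` and the `PIR` column name are assumptions for illustration only.

```python
import pandas as pd


def pir_rate_per_second(df, column="PIR"):
    """Change in `column` per second between consecutive rows.

    Assumes df has a DatetimeIndex. Hypothetical sketch, not reportgen's code.
    """
    # Elapsed time between consecutive samples, as float seconds
    elapsed = df.index.to_series().diff().dt.total_seconds()
    # Value difference divided by elapsed seconds; the first row is NaN
    return df[column].diff() / elapsed
```

Written this way, the result does not depend on how a timedelta is converted to a number, which is the kind of detail that could plausibly differ between Pandas releases.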

sjmf commented on July 19, 2024

Next push will fix this issue: just changed the line discarding packets to zero those values instead. The mystery remains as to why the scaling factor is no longer needed after Pandas 0.19.0 (and still a mystery as to why I needed it before: I must have thought it was weird otherwise I wouldn't have used ಠ_ಠ as a variable name...).
