Data loss (reportgen issue, CLOSED)

sjmf commented on July 19, 2024
Data loss

from reportgen.

Comments (7)

sjmf commented on July 19, 2024

Dependency versions where this issue occurs: numpy-1.12.1, pandas-0.20.1 (latest).

Frozen versions with no issue: numpy-1.11.0, pandas-0.18.0.

sjmf commented on July 19, 2024

Pandas 0.19.0 breaks this code; Pandas 0.18.1 works. Both numpy 1.11.0 and 1.12.1 work, leading me to believe this is an issue with Pandas usage rather than numpy?

sjmf commented on July 19, 2024

Commenting out line 257 in report.py:

    # Apply fixes to the data and diff the PIR movement
    dfs = dh.clean_data(dfs)

... yields the expected packet numbers. Checking the dh.clean_data function...
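A quick way to confirm that a cleaning step is where rows disappear is to compare per-frame row counts before and after it. The sketch below is only an illustration: `fake_clean`, the sensor keys, the sample values and the threshold are all hypothetical stand-ins, not the real `dh.clean_data` or real data.

```python
import pandas as pd


def row_counts(frames):
    """Return {key: number of rows} for a dict of DataFrames."""
    return {key: len(df) for key, df in frames.items()}


def rows_lost(before, after):
    """Return {key: rows lost} for every key whose row count shrank."""
    return {
        key: before[key] - after.get(key, 0)
        for key in before
        if before[key] > after.get(key, 0)
    }


def fake_clean(frames, threshold=5):
    """Hypothetical stand-in for dh.clean_data: drops rows above a threshold."""
    return {key: df[df["PIRDiff"] <= threshold] for key, df in frames.items()}


# Made-up sample data, keyed the same way as the dict of DataFrames in the thread.
dfs = {
    "sensor_a": pd.DataFrame({"PIRDiff": [0.1, 13.9, -14.0, 9.7]}),
    "sensor_b": pd.DataFrame({"PIRDiff": [0.5, -1.3]}),
}

before = row_counts(dfs)
after = row_counts(fake_clean(dfs))
lost = rows_lost(before, after)
print(lost)  # {'sensor_a': 2}
```

Wrapping the real call in the same before/after comparison would show which frames lose packets without having to comment the call out.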

sjmf commented on July 19, 2024

Packets are discarded at line 192 of diff_pir in datahandling.py.

    dfs = {i: dfs[i].drop(dfs[i][dfs[i].PIRDiff > pir_threshold].index) for i in dfs}

A mitigation strategy would be to set the values to 0 instead of dropping them.
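As a rough sketch of that mitigation, the snippet below contrasts the dropping line quoted above with a version that zeroes the offending values. The threshold value, the `sensor_a` key and the sample numbers are assumptions made for illustration; only the dict-of-DataFrames shape and the `PIRDiff` column come from the thread.

```python
import pandas as pd

pir_threshold = 5.0  # hypothetical value; the real threshold is not shown in the thread

# Made-up sample frame with a DateTime index and a PIRDiff column.
index = pd.to_datetime([
    "2016-02-08 13:55:25",
    "2016-02-08 14:00:16",
    "2016-02-09 07:33:33",
    "2016-02-09 08:31:46",
])
dfs = {"sensor_a": pd.DataFrame({"PIRDiff": [-0.12, 13.88, -14.0, 9.65]}, index=index)}

# Behaviour quoted in the thread: rows above the threshold are dropped entirely.
dropped = {i: dfs[i].drop(dfs[i][dfs[i].PIRDiff > pir_threshold].index) for i in dfs}

# Possible mitigation: keep every row and set the offending values to 0.
zeroed = {}
for key, df in dfs.items():
    fixed = df.copy()  # avoid mutating the caller's frame
    fixed.loc[fixed["PIRDiff"] > pir_threshold, "PIRDiff"] = 0.0
    zeroed[key] = fixed

print(len(dropped["sensor_a"]), len(zeroed["sensor_a"]))  # 2 4
```

Zeroing keeps the packet count intact, at the cost of hiding the fact that those readings were considered invalid; a NaN or a separate flag column would be an alternative if that distinction matters downstream.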

However, what is different about the DataFrame in 0.19.0 that triggers this?

sjmf commented on July 19, 2024

PIRDiff result on Pandas 0.19.0 (incorrect):

DateTime
2016-02-02 17:58:31             NaN
2016-02-04 07:35:28             NaN
2016-02-08 13:55:25   -1.225261e+08
2016-02-09 07:33:33   -1.399888e+10
2016-02-09 13:22:58   -1.177617e+10
2016-02-10 15:28:20   -1.275398e+09
2016-03-23 17:52:49   -5.419100e+08
2016-03-23 19:04:56   -1.390800e+09
2016-03-23 19:35:09   -8.563218e+08
2016-03-23 19:36:07   -3.448276e+08
Name: PIRDiff, dtype: float64

Expected result (0.18.1):

DateTime
2016-02-02 17:58:31          NaN
2016-02-04 07:35:28          NaN
2016-02-08 13:55:25    -0.122526
2016-02-08 14:00:16    13.877688
2016-02-09 07:33:33   -13.998882
2016-02-09 08:31:46     9.651322
2016-02-09 09:15:27     3.928234
2016-02-09 13:22:58   -11.776167
2016-02-10 15:28:20    -1.275398
2016-03-23 17:52:49    -0.541910
Name: PIRDiff, dtype: float64

Note that the incorrect values differ from the expected ones by a factor of 1e9, which is the scale factor applied on line 177, and the error propagates to subsequent values.
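The factor-of-1e9 observation can be checked directly against the two listings above. The sketch below copies the values by hand from those listings and compares the timestamps they share; it also looks at which expected rows are absent from the 0.19.0 listing. It does not explain why the behaviour changed, and the variable names are made up for the example.

```python
import math

SCALE = 1e9  # scale factor the thread says is applied on line 177

# Values copied from the 0.19.0 listing (first two NaN rows and later rows omitted).
observed_0_19 = {
    "2016-02-08 13:55:25": -1.225261e+08,
    "2016-02-09 07:33:33": -1.399888e+10,
    "2016-02-09 13:22:58": -1.177617e+10,
    "2016-02-10 15:28:20": -1.275398e+09,
    "2016-03-23 17:52:49": -5.419100e+08,
}

# Values copied from the expected (0.18.1) listing, NaN rows omitted.
expected_0_18 = {
    "2016-02-08 13:55:25": -0.122526,
    "2016-02-08 14:00:16": 13.877688,
    "2016-02-09 07:33:33": -13.998882,
    "2016-02-09 08:31:46": 9.651322,
    "2016-02-09 09:15:27": 3.928234,
    "2016-02-09 13:22:58": -11.776167,
    "2016-02-10 15:28:20": -1.275398,
    "2016-03-23 17:52:49": -0.541910,
}

# Every timestamp present in both listings should differ by the scale factor.
shared = sorted(set(observed_0_19) & set(expected_0_18))
all_scaled = all(
    math.isclose(observed_0_19[t], expected_0_18[t] * SCALE, rel_tol=1e-5)
    for t in shared
)

# Expected rows that do not appear in the 0.19.0 listing at all.
missing = {t: v for t, v in expected_0_18.items() if t not in observed_0_19}
all_missing_positive = all(v > 0 for v in missing.values())

print(all_scaled, sorted(missing), all_missing_positive)
```

In this sample the missing rows are exactly the ones with positive expected values. That is consistent with (though it does not prove) the idea that positive differences inflated by 1e9 exceed `pir_threshold` and are dropped, while negative ones pass the `>` comparison and survive.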

sjmf commented on July 19, 2024

Suspiciously, commenting out the scale factor on line 177 generates the expected result. What exactly changed about the diff() function in 0.19.0 to cause this?

http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html

sjmf commented on July 19, 2024

The next push will fix this issue: I changed the line that discarded packets so that it zeroes those values instead. It remains a mystery why the scaling factor is no longer needed as of Pandas 0.19.0 (and why I needed it before: I must have thought it was weird, otherwise I wouldn't have used ಠ_ಠ as a variable name...).
