ugentbiomath / wwdata
Python package to analyse, validate, fill and visualise data acquired in the context of (waste) water treatment
License: GNU General Public License v3.0
Write a function tag_period that allows the user to tag values in a certain period. Possibly add the option to only tag values below, above or between given values within this period.
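A minimal sketch of what such a tag_period function could look like (the function name and arguments follow the issue text; they are assumptions, not the actual wwdata API):

```python
import pandas as pd

def tag_period(meta_valid, series, start, end, below=None, above=None, tag='filtered'):
    # Tag all points of `series` within [start, end]; optionally only
    # those below/above the given bounds.
    # (Sketch; names are assumptions, not the actual wwdata API.)
    mask = (series.index >= start) & (series.index <= end)
    if below is not None:
        mask &= (series < below)
    if above is not None:
        mask &= (series > above)
    meta_valid.loc[mask] = tag
    return meta_valid

# Example: tag values above 8 between index 2 and 5
s = pd.Series([1.0, 9.0, 10.0, 3.0, 12.0, 2.0])
meta = pd.Series('original', index=s.index)
meta = tag_period(meta, s, start=2, end=5, above=8)
```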
Create a different tag for the different filter functions, so the user can always find out based on which filter algorithm a datapoint was dropped. Possibly plotting can be adjusted based on this as well.
This will require a lot of small code adjustment in a lot of functions.
In the filling functions, the filtered_only argument (where it is still present) needs to be replaced with the only_checked argument, for clarity and consistency.
Currently, the wwdata module is not available on PyPI yet, as it has not been made distributable. The following link still needs to be followed through: https://packaging.python.org/distributing/, starting from Packaging your project
It would be nice to show the user visually how a reliability check is done, i.e. have a figure with the original data and the filled data, filling the artificial gaps. Can probably be done by composing such a figure based on the last iteration of the check_filling_error function.
In order to provide a better way of 'predicting' data, add a fit_function functionality, taking as arguments an independent data series, a dependent data series and a function relating the two. See also the scipy.optimize.curve_fit function. Return values would be the function parameters.
Additionally, a fill_missing_function function should be provided, making use of the determined parameters and the given function to replace data.
The workflow for this is the same as when replacing data by means of correlations.
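A sketch of the proposed fit_function / fill_missing_function pair, built on scipy.optimize.curve_fit (the function names and signatures are assumptions based on the issue text):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_function(independent, dependent, function):
    # Fit `function` relating the two series; return the fitted parameters.
    # (Sketch; not the actual wwdata API.)
    mask = ~(np.isnan(independent) | np.isnan(dependent))
    params, _ = curve_fit(function, independent[mask], dependent[mask])
    return params

def fill_missing_function(independent, dependent, function, params):
    # Replace NaNs in `dependent` with function(independent, *params)
    filled = dependent.copy()
    gaps = np.isnan(filled)
    filled[gaps] = function(independent[gaps], *params)
    return filled

# Example with a linear relation y = 2x + 1 and one artificial gap
linear = lambda x, a, b: a * x + b
x = np.arange(10, dtype=float)
y = 2 * x + 1
y[4] = np.nan
params = fit_function(x, y, linear)
y_filled = fill_missing_function(x, y, linear, params)
```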
For some of the pandas.interpolate options (spline, polynomial), you need to be able to pass an order argument. This is currently not possible, but would greatly expand the options available for interpolation.
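pandas itself already accepts the order keyword for these methods, so exposing it could be as simple as passing it through to Series.interpolate. Minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])
# 'polynomial' (and 'spline') require an explicit order argument
filled = s.interpolate(method='polynomial', order=2)
```

Since the data here lie exactly on y = x², a second-order interpolation recovers the missing value exactly.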
Use https://mybinder.org to accomplish this!
Add the only_checked argument to the calc_daily_average function, to allow the user to only use validated data for daily average calculation. This will require some checks and error mitigation (whether a self.meta_valid column exists for the data in question, whether NaN values are generated this way...)
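The core of the only_checked behaviour could look like this (a sketch with made-up data, not the actual wwdata implementation):

```python
import pandas as pd

# Hourly data over two days; one outlier tagged 'filtered' in meta_valid
index = pd.date_range('2016-05-01', periods=48, freq='h')
data = pd.Series(1.0, index=index)
meta_valid = pd.Series('original', index=index)
data.iloc[0] = 100.0
meta_valid.iloc[0] = 'filtered'

# only_checked=True: keep only points tagged 'original' before averaging
checked = data[meta_valid == 'original']
daily_avg = checked.resample('D').mean()
```

Dropping the tagged outlier keeps both daily averages at 1.0; a day with no validated points at all would yield NaN, which is one of the cases needing mitigation.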
When running dataset.calc_daily_average, an error message is shown:
TypeError: float() argument must be a string or a number, not 'Timestamp'
There are some minor errors in the same file:
when running dataset.fill_missing_model, a warning is shown: FutureWarning: 'argmin' is deprecated, use 'idxmin' instead. The behavior of 'argmin'
another warning is shown when running dataset.get_correlation: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
ARMA (Autoregressive Moving Average) can be an interesting option to fill gaps in data, as mentioned here: https://www.researchgate.net/publication/258649021_Towards_a_More_General_Method_for_Filling_Gaps_in_Time_Series
Might be something to look into. If you're getting tired of detecting drift @jorasinghr, this could give you some distraction ;)
When setting the only_checked argument to True while one (or both) of the data series haven't been checked yet, this currently gives a KeyError, which is confusing to the user, who doesn't know it concerns the self.meta_valid DataFrame rather than the original DataFrame. Improve this to warn the user that only_checked cannot be fulfilled.
To be decided: proceed with calculation for all data points and just give a warning, or throw an error and stop?
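The warn-and-proceed option could be sketched like this (helper name and structure are assumptions, not the actual wwdata code):

```python
import warnings
import pandas as pd

def _checked_or_warn(meta_valid, column, data):
    # If the column was never checked, warn and fall back to all data
    # instead of raising a KeyError on meta_valid[column].
    # (Sketch; not the actual wwdata implementation.)
    if column not in meta_valid.columns:
        warnings.warn("only_checked=True cannot be fulfilled: '{}' has no "
                      "meta_valid entries yet; using all data".format(column))
        return data[column]
    return data.loc[meta_valid[column] == 'original', column]

data = pd.DataFrame({'flow': [1.0, 2.0, 3.0]})
meta_valid = pd.DataFrame(index=data.index)  # 'flow' never checked

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = _checked_or_warn(meta_valid, 'flow', data)
```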
The initial setup worked, but all your packages are up to date. You can safely close this issue.
For the filling functions, one can currently choose a range for filling; this is not in place yet for the filtering functions, but would be very helpful, among other things for saving time.
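Restricting a filter to a period essentially means applying the tagging on a slice of the data only, e.g. (illustrative data, not the wwdata internals):

```python
import pandas as pd

index = pd.date_range('2016-05-01', periods=6, freq='D')
series = pd.Series([1.0, 50.0, 2.0, 60.0, 3.0, 70.0], index=index)
meta_valid = pd.Series('original', index=index)

# Apply a threshold filter only within the chosen period
start, end = '2016-05-02', '2016-05-04'
window = series.loc[start:end]
meta_valid.loc[window[window > 10].index] = 'filtered'
```

The high value on 2016-05-06 is left untouched because it falls outside the given range.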
The fill_missing_daybefore function works on its own in a for-loop, but not in a for-loop where another filling function is also applied. Example:
This works:

```python
for data in Q_data:
    dataset.fill_missing_interpolation(data, 50, [dt.datetime(2016,5,23), dt.datetime(2016,5,24)], clear=True)
for data in Q_data:
    dataset.fill_missing_daybefore(data, [dt.datetime(2016,5,1), dt.datetime(2016,6,1)], range_to_replace=[0,5], filtered_only=True, clear=False)
```

This doesn't:

```python
for data in Q_data:
    dataset.fill_missing_interpolation(data, 50, [dt.datetime(2016,5,23), dt.datetime(2016,5,24)], clear=True)
    dataset.fill_missing_daybefore(data, [dt.datetime(2016,5,1), dt.datetime(2016,6,1)], range_to_replace=[0,5], filtered_only=True, clear=False)
```
In the last example, no filling with the day before is done (fill_missing_interpolation does work properly).
Instead of only showing the mean filling error, also provide information on the max and/or min error. In case of e.g. a gap during wet weather (for wwtp data), some algorithms might not fill this gap appropriately at all, but this will have a small impact on the mean, while adding the max to the output gives a more realistic picture.
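The point is easy to see numerically: a single badly filled wet-weather gap barely moves the mean but dominates the max (illustrative numbers):

```python
import numpy as np

errors = np.array([0.5, 0.3, 0.4, 0.6, 12.0])  # absolute filling errors
summary = {
    'mean': errors.mean(),
    'max': errors.max(),
    'min': errors.min(),
}
```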
Write a function that tags values above or below a user-defined value as 'filtered'.
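A minimal sketch of such a threshold-tagging function (the name tag_extremes is an assumption):

```python
import pandas as pd

def tag_extremes(series, below=None, above=None):
    # Return a meta series tagging values outside the given bounds
    # as 'filtered'. (Sketch; name is an assumption.)
    meta = pd.Series('original', index=series.index)
    if below is not None:
        meta[series < below] = 'filtered'
    if above is not None:
        meta[series > above] = 'filtered'
    return meta

meta = tag_extremes(pd.Series([1.0, -5.0, 3.0, 99.0]), below=0, above=50)
```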
In the case where a column is added to the self.data DataFrame, but not by a wwdata class function, the self.columns attribute is not updated. Example:

```python
dataset.columns
```

gives ['airflow1', 'airflow2']. Executing:

```python
dataset.data['airflow_total'] = dataset.data['airflow1'] + dataset_control.data['airflow2']
dataset.columns
```

gives the same result, while

```python
dataset.data.columns
```

does give ['airflow1', 'airflow2', 'airflow_total'].
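One possible fix is to expose columns as a property derived from self.data, so manual additions are always reflected (a sketch with a stripped-down class, not the actual wwdata code):

```python
import pandas as pd

class Dataset:
    # Sketch: columns as a read-only view on self.data.columns,
    # so it can never go stale.
    def __init__(self, data):
        self.data = data

    @property
    def columns(self):
        return list(self.data.columns)

dataset = Dataset(pd.DataFrame({'airflow1': [1.0], 'airflow2': [2.0]}))
dataset.data['airflow_total'] = dataset.data['airflow1'] + dataset.data['airflow2']
```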
When calculating reliabilities, it would be good to take into account the standard deviation of data points filled in based on a correlation. A correlation calculation comes with a standard deviation. It is of little use to take this into account during the actual filling, but it does add some information on how reliable the filling is. Use that information to feed it back to the user.
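For a linear correlation, scipy.stats.linregress already reports the standard error of the fitted slope alongside the parameters, which could be passed back to the user as a reliability indicator (illustrative data):

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

fit = linregress(x, y)
# fit.stderr is the standard error on the slope: small stderr
# means the correlation-based filling is more trustworthy
slope, stderr = fit.slope, fit.stderr
```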
Due to the application of a window for the moving average, the first x datapoints of a dataset are automatically tagged, where x is the size of the window. This needs to be solved, maybe by just copying the original values without filtering them (this then also needs to be done for the meta_valid and meta_filled datasets!)
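An alternative to copying the original values could be pandas' min_periods, which keeps (partial-window) averages at the start instead of NaNs (a sketch, not the wwdata internals):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: the first window-1 points are NaN and would get tagged
plain = s.rolling(window=3).mean()
# min_periods=1 yields partial-window averages at the start instead
padded = s.rolling(window=3, min_periods=1).mean()
```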
This is currently done very rudimentarily; options include:
During the application of the reliability checking function, the warnings on rain weather and order of gap filling (small to large) are still shown, despite several attempts to turn off warnings (https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings) or own-developed flags such as self._rain_warning_issued.
No big problem, but annoying.
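One common pitfall: catch_warnings only suppresses warnings emitted while the context manager is active, so the filter must wrap the call that actually issues the warning. A minimal check:

```python
import warnings

def noisy():
    # stand-in for a function issuing the rain-weather warning
    warnings.warn("rain weather detected", UserWarning)
    return 42

# The simplefilter must be active at the moment warn() runs
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('ignore')
    result = noisy()
```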
Write a function detecting drift in the data. The idea would be that the user gives:
This function could calculate the slope of the data in a certain given period (by for example fitting a line through it) and compare it with the maximum expected slope. In first instance, it would be interesting for the user to know if drift is present, secondly it would be good to be able to correct for it.
Some additional sources of information/inspiration:
@jorasinghr could you have a look at this and let me know if things are unclear?
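The slope-comparison idea above can be sketched as follows (function name, arguments and data are assumptions for illustration):

```python
import numpy as np

def detect_drift(values, times, max_slope):
    # Fit a line through the period and flag drift if the fitted
    # slope exceeds the maximum expected slope.
    # (Sketch; not the actual wwdata API.)
    slope, _ = np.polyfit(times, values, 1)
    return abs(slope) > max_slope, slope

t = np.arange(100, dtype=float)
drifting = np.sin(t / 5.0) + 0.05 * t  # oscillation + upward drift
flag, slope = detect_drift(drifting, t, max_slope=0.01)
```

Correcting for detected drift could then be as simple as subtracting slope * times from the data, though that assumes the drift is linear over the whole period.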
When using calc_tot_prop, new columns are created, but these do not get a tag in the self.meta_valid and/or self.meta_filled dataframes. This is important, especially in the case where the filled data is further used and it's important to know what is real data and what is filled.
Include the addition of a new column to the self.meta_valid/meta_filled DataFrames in the calc_tot_prop function. This should be a combination of the tags of the columns that the proportional concentration is calculated from, so that wherever a tag is not 'original', something else is used.
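The tag combination could work like this: the new column is 'original' only where every source column is 'original' (a sketch with illustrative column names):

```python
import numpy as np
import pandas as pd

meta_filled = pd.DataFrame({
    'flow': ['original', 'filled', 'original'],
    'conc': ['original', 'original', 'filled'],
})

# New column is 'original' only where all source tags are 'original'
combined = np.where((meta_filled == 'original').all(axis=1),
                    'original', 'filled')
meta_filled['load'] = combined
```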
Hi 👊
This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.
Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.
That's it for now!
Happy merging! 🤖
Integrate the function using the Savitzky-Golay filter better into the package: use the same structure as the simple_moving_average function and include it as an option in moving_average_filter.
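The underlying smoothing is available as scipy.signal.savgol_filter; a minimal example of what the option would wrap (window length and polynomial order are illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter

# Savitzky-Golay smoothing of a noisy sine signal
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 101)
clean = np.sin(2 * np.pi * t)
noisy = clean + rng.normal(scale=0.1, size=t.size)

smooth = savgol_filter(noisy, window_length=11, polyorder=2)
```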
The tag name 'filtered' can be a bit confusing: does it mean the data point is filtered out, or is this the data that is left after filtering? Update to a clearer tag!
The clear argument present in most of the filter/tagging functions is not explained in the docstrings. This is very important, as the argument also ensures the relevant self.meta_valid DataFrame column is added the first time a filtering function is executed on that column.
A function should be available for testing the reliability of gap filling. Possible code flow:
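One possible flow, sketched: knock artificial gaps into known-good data, fill them, and compare the filled values against the originals (interpolation stands in here for whichever filling algorithm is being tested):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
s = pd.Series(np.sin(np.linspace(0, 6, 200)))

# 1. Create artificial gaps at interior positions
gap_idx = rng.choice(s.index[1:-1], size=20, replace=False)
test = s.copy()
test[gap_idx] = np.nan

# 2. Fill the gaps with the algorithm under test
filled = test.interpolate()

# 3. Compare filled values with the known originals
errors = (filled[gap_idx] - s[gap_idx]).abs()
mean_error = errors.mean()
```

Repeating this with randomised gap positions (and gap sizes) would give a distribution of filling errors per algorithm.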