ugentbiomath / wwdata
Python package to analyse, validate, fill and visualise data acquired in the context of (waste) water treatment
License: GNU General Public License v3.0
Write a function tag_period that allows the user to tag values in a certain period. Possibly add the option to only tag values below, above or between given values within this period.
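A minimal sketch of what such a tag_period function could look like (the function name and arguments follow the issue text; they are assumptions, not the actual wwdata API):

```python
import pandas as pd

def tag_period(meta_valid, series, start, end, below=None, above=None, tag='filtered'):
    # Tag all points of `series` within [start, end]; optionally only
    # those below/above the given bounds.
    # (Sketch; names are assumptions, not the actual wwdata API.)
    mask = (series.index >= start) & (series.index <= end)
    if below is not None:
        mask &= (series < below)
    if above is not None:
        mask &= (series > above)
    meta_valid.loc[mask] = tag
    return meta_valid

# Example: tag values above 8 between index 2 and 5
s = pd.Series([1.0, 9.0, 10.0, 3.0, 12.0, 2.0])
meta = pd.Series('original', index=s.index)
meta = tag_period(meta, s, start=2, end=5, above=8)
```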
Create a different tag for the different filter functions, so the user can always find out based on which filter algorithm a datapoint was dropped. Possibly plotting can be adjusted based on this as well.
This will require a lot of small code adjustment in a lot of functions.
In the filling functions, the filtered_only argument (where it is still present) needs to be replaced with the only_checked argument, for clarity and consistency.
Currently, the wwdata module is not available on PyPI yet, as it has not been made distributable. The following link still needs to be followed through: https://packaging.python.org/distributing/, starting from Packaging your project
It would be nice to show the user visually how a reliability check is done, i.e. have a figure with the original data and the filled data, filling the artificial gaps. Can probably be done by composing such a figure based on the last iteration of the check_filling_error function.
In order to provide a better way of 'predicting' data, add a fit_function functionality, taking as arguments an independent data series, a dependent data series and a function relating the two. See also the scipy.optimize.curve_fit function. Return values would be the function parameters.
Additionally, a fill_missing_function function should be provided, making use of the determined parameters and the given function to replace data.
The workflow for this is the same as when replacing data by means of correlations.
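A sketch of the proposed fit_function / fill_missing_function pair, built on scipy.optimize.curve_fit (the function names and signatures are assumptions based on the issue text):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_function(independent, dependent, function):
    # Fit `function` relating the two series; return the fitted parameters.
    # (Sketch; not the actual wwdata API.)
    mask = ~(np.isnan(independent) | np.isnan(dependent))
    params, _ = curve_fit(function, independent[mask], dependent[mask])
    return params

def fill_missing_function(independent, dependent, function, params):
    # Replace NaNs in `dependent` with function(independent, *params)
    filled = dependent.copy()
    gaps = np.isnan(filled)
    filled[gaps] = function(independent[gaps], *params)
    return filled

# Example with a linear relation y = 2x + 1 and one artificial gap
linear = lambda x, a, b: a * x + b
x = np.arange(10, dtype=float)
y = 2 * x + 1
y[4] = np.nan
params = fit_function(x, y, linear)
y_filled = fill_missing_function(x, y, linear, params)
```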
For some of the pandas.interpolate options (spline, polynomial), you need to be able to pass an order argument. This is currently not possible, but would greatly expand the options available for interpolation.
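pandas itself already accepts the order keyword for these methods, so exposing it could be as simple as passing it through to Series.interpolate. Minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])
# 'polynomial' (and 'spline') require an explicit order argument
filled = s.interpolate(method='polynomial', order=2)
```

Since the data here lie exactly on y = x², a second-order interpolation recovers the missing value exactly.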
Use https://mybinder.org to accomplish this!
Add the only_checked argument to the calc_daily_average function, to allow the user to only use validated data for daily average calculation. This will require some checks and error mitigation (whether a self.meta_valid column exists for the data in question, whether NaN values are generated this way...)
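The core of the only_checked behaviour could look like this (a sketch with made-up data, not the actual wwdata implementation):

```python
import pandas as pd

# Hourly data over two days; one outlier tagged 'filtered' in meta_valid
index = pd.date_range('2016-05-01', periods=48, freq='h')
data = pd.Series(1.0, index=index)
meta_valid = pd.Series('original', index=index)
data.iloc[0] = 100.0
meta_valid.iloc[0] = 'filtered'

# only_checked=True: keep only points tagged 'original' before averaging
checked = data[meta_valid == 'original']
daily_avg = checked.resample('D').mean()
```

Dropping the tagged outlier keeps both daily averages at 1.0; a day with no validated points at all would yield NaN, which is one of the cases needing mitigation.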
When running dataset.calc_daily_average, an error message is shown:
TypeError: float() argument must be a string or a number, not 'Timestamp'
There are some minor errors in the same file:
when running dataset.fill_missing_model, a warning is shown: FutureWarning: 'argmin' is deprecated, use 'idxmin' instead. The behavior of 'argmin'
another warning is shown when running dataset.get_correlation: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
ARMA (Autoregressive Moving Average) can be an interesting option to fill gaps in data, as mentioned here: https://www.researchgate.net/publication/258649021_Towards_a_More_General_Method_for_Filling_Gaps_in_Time_Series
Might be something to look into. If you're getting tired of detecting drift @jorasinghr, this could give you some distraction ;)
When setting the only_checked argument to True while one (or both) of the data series haven't been checked yet, this currently gives a KeyError, which is confusing to the user, who doesn't know it concerns the self.meta_valid DataFrame rather than the original DataFrame. Improve this to warn the user that only_checked cannot be fulfilled.
To be decided: proceed with calculation for all data points and just give a warning, or throw an error and stop?
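The warn-and-proceed option could be sketched like this (helper name and structure are assumptions, not the actual wwdata code):

```python
import warnings
import pandas as pd

def _checked_or_warn(meta_valid, column, data):
    # If the column was never checked, warn and fall back to all data
    # instead of raising a KeyError on meta_valid[column].
    # (Sketch; not the actual wwdata implementation.)
    if column not in meta_valid.columns:
        warnings.warn("only_checked=True cannot be fulfilled: '{}' has no "
                      "meta_valid entries yet; using all data".format(column))
        return data[column]
    return data.loc[meta_valid[column] == 'original', column]

data = pd.DataFrame({'flow': [1.0, 2.0, 3.0]})
meta_valid = pd.DataFrame(index=data.index)  # 'flow' never checked

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = _checked_or_warn(meta_valid, 'flow', data)
```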
The initial setup worked, but all your packages are up to date. You can safely close this issue.
For the filling functions, one can currently choose a range for filling; this is not in place yet for the filtering functions, but would be very helpful, among other things for saving time.
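Restricting a filter to a period essentially means applying the tagging on a slice of the data only, e.g. (illustrative data, not the wwdata internals):

```python
import pandas as pd

index = pd.date_range('2016-05-01', periods=6, freq='D')
series = pd.Series([1.0, 50.0, 2.0, 60.0, 3.0, 70.0], index=index)
meta_valid = pd.Series('original', index=index)

# Apply a threshold filter only within the chosen period
start, end = '2016-05-02', '2016-05-04'
window = series.loc[start:end]
meta_valid.loc[window[window > 10].index] = 'filtered'
```

The high value on 2016-05-06 is left untouched because it falls outside the given range.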
The fill_missing_daybefore function works on its own in a for-loop, but not in a for-loop where another filling function is also applied. Example:
This works:

```python
for data in Q_data:
    dataset.fill_missing_interpolation(data, 50, [dt.datetime(2016,5,23), dt.datetime(2016,5,24)], clear=True)
for data in Q_data:
    dataset.fill_missing_daybefore(data, [dt.datetime(2016,5,1), dt.datetime(2016,6,1)], range_to_replace=[0,5], filtered_only=True, clear=False)
```

This doesn't:

```python
for data in Q_data:
    dataset.fill_missing_interpolation(data, 50, [dt.datetime(2016,5,23), dt.datetime(2016,5,24)], clear=True)
    dataset.fill_missing_daybefore(data, [dt.datetime(2016,5,1), dt.datetime(2016,6,1)], range_to_replace=[0,5], filtered_only=True, clear=False)
```
In the last example, no filling with the day before is done (fill_missing_interpolation does work properly).
Instead of only showing the mean filling error, also provide information on the max and/or min error. In case of e.g. a gap during wet weather (for wwtp data), some algorithms might not fill this gap appropriately at all, but this will have a small impact on the mean, while adding the max to the output gives a more realistic picture.
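The point is easy to see numerically: a single badly filled wet-weather gap barely moves the mean but dominates the max (illustrative numbers):

```python
import numpy as np

errors = np.array([0.5, 0.3, 0.4, 0.6, 12.0])  # absolute filling errors
summary = {
    'mean': errors.mean(),
    'max': errors.max(),
    'min': errors.min(),
}
```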
Write a function that tags values above or below a user-defined value as 'filtered'.
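A minimal sketch of such a threshold-tagging function (the name tag_extremes is an assumption):

```python
import pandas as pd

def tag_extremes(series, below=None, above=None):
    # Return a meta series tagging values outside the given bounds
    # as 'filtered'. (Sketch; name is an assumption.)
    meta = pd.Series('original', index=series.index)
    if below is not None:
        meta[series < below] = 'filtered'
    if above is not None:
        meta[series > above] = 'filtered'
    return meta

meta = tag_extremes(pd.Series([1.0, -5.0, 3.0, 99.0]), below=0, above=50)
```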
In the case where a column is added to the self.data DataFrame, but not by a wwdata class function, the self.columns attribute is not updated. Example:

```python
dataset.columns
```

gives ['airflow1', 'airflow2']. Executing:

```python
dataset.data['airflow_total'] = dataset.data['airflow1'] + dataset_control.data['airflow2']
dataset.columns
```

gives the same result, while

```python
dataset.data.columns
```

does give ['airflow1', 'airflow2', 'airflow_total'].
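One possible fix is to expose columns as a property derived from self.data, so manual additions are always reflected (a sketch with a stripped-down class, not the actual wwdata code):

```python
import pandas as pd

class Dataset:
    # Sketch: columns as a read-only view on self.data.columns,
    # so it can never go stale.
    def __init__(self, data):
        self.data = data

    @property
    def columns(self):
        return list(self.data.columns)

dataset = Dataset(pd.DataFrame({'airflow1': [1.0], 'airflow2': [2.0]}))
dataset.data['airflow_total'] = dataset.data['airflow1'] + dataset.data['airflow2']
```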
When calculating reliabilities, it would be good to take into account the standard deviation of data points filled in based on a correlation. A correlation calculation comes with a standard deviation. It is of little use to take this into account during the actual filling, but it does add some information on how reliable the filling is. Use that information to feed it back to the user.
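For a linear correlation, scipy.stats.linregress already reports the standard error of the fitted slope alongside the parameters, which could be passed back to the user as a reliability indicator (illustrative data):

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

fit = linregress(x, y)
# fit.stderr is the standard error on the slope: small stderr
# means the correlation-based filling is more trustworthy
slope, stderr = fit.slope, fit.stderr
```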
Due to the application of a window for the moving average, the first x datapoints of a dataset are automatically tagged, where x is the size of the window. This needs to be solved, maybe by just copying the original values without filtering them (this then also needs to be done for the meta_valid and meta_filled datasets!)
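An alternative to copying the original values could be pandas' min_periods, which keeps (partial-window) averages at the start instead of NaNs (a sketch, not the wwdata internals):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: the first window-1 points are NaN and would get tagged
plain = s.rolling(window=3).mean()
# min_periods=1 yields partial-window averages at the start instead
padded = s.rolling(window=3, min_periods=1).mean()
```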
This is currently done very rudimentarily; options include:
During the application of the reliability checking function, the warnings on rain weather and order of gap filling (small to large) are still shown, despite several attempts to turn off warnings (https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings) or own-developed flags such as self._rain_warning_issued.
No big problem, but annoying.
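One common pitfall: catch_warnings only suppresses warnings emitted while the context manager is active, so the filter must wrap the call that actually issues the warning. A minimal check:

```python
import warnings

def noisy():
    # stand-in for a function issuing the rain-weather warning
    warnings.warn("rain weather detected", UserWarning)
    return 42

# The simplefilter must be active at the moment warn() runs
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('ignore')
    result = noisy()
```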
Write a function detecting drift in the data. The idea would be that the user gives:
This function could calculate the slope of the data in a certain given period (by for example fitting a line through it) and compare it with the maximum expected slope. In first instance, it would be interesting for the user to know if drift is present, secondly it would be good to be able to correct for it.
Some additional sources of information/inspiration:
@jorasinghr could you have a look at this and let me know if things are unclear?
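The slope-comparison idea above can be sketched as follows (function name, arguments and data are assumptions for illustration):

```python
import numpy as np

def detect_drift(values, times, max_slope):
    # Fit a line through the period and flag drift if the fitted
    # slope exceeds the maximum expected slope.
    # (Sketch; not the actual wwdata API.)
    slope, _ = np.polyfit(times, values, 1)
    return abs(slope) > max_slope, slope

t = np.arange(100, dtype=float)
drifting = np.sin(t / 5.0) + 0.05 * t  # oscillation + upward drift
flag, slope = detect_drift(drifting, t, max_slope=0.01)
```

Correcting for detected drift could then be as simple as subtracting slope * times from the data, though that assumes the drift is linear over the whole period.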
When using calc_tot_prop, new columns are created, but these do not get a tag in the self.meta_valid and/or self.meta_filled dataframes. This is important, especially in the case where the filled data is further used and it's important to know what is real data and what is filled.
Include the addition of a new column to the self.meta_valid/meta_filled DataFrames in the calc_tot_prop function. This should be a combination of the tags of the columns that the proportional concentration is calculated from, so that wherever a tag is not 'original', something else is used.
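The tag combination could work like this: the new column is 'original' only where every source column is 'original' (a sketch with illustrative column names):

```python
import numpy as np
import pandas as pd

meta_filled = pd.DataFrame({
    'flow': ['original', 'filled', 'original'],
    'conc': ['original', 'original', 'filled'],
})

# New column is 'original' only where all source tags are 'original'
combined = np.where((meta_filled == 'original').all(axis=1),
                    'original', 'filled')
meta_filled['load'] = combined
```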
Hi 👊
This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.
Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.
That's it for now!
Happy merging! 🤖
Integrate the function using the Savitzky-Golay filter better into the package: use the same structure as the simple_moving_average function and include it as an option in moving_average_filter.
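The underlying smoothing is available as scipy.signal.savgol_filter; a minimal example of what the option would wrap (window length and polynomial order are illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter

# Savitzky-Golay smoothing of a noisy sine signal
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 101)
clean = np.sin(2 * np.pi * t)
noisy = clean + rng.normal(scale=0.1, size=t.size)

smooth = savgol_filter(noisy, window_length=11, polyorder=2)
```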
The tag name 'filtered' can be a bit confusing: does it mean the data point is filtered out, or is this the data that is left after filtering? Update to a clearer tag!
The clear argument present in most of the filter/tagging functions is not explained in the docstrings. This is very important, as the argument also ensures the relevant self.meta_valid DataFrame column is added the first time a filtering function is executed on that column.
A function should be available for testing the reliability of gap filling. Possible code flow:
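One possible flow, sketched: knock artificial gaps into known-good data, fill them, and compare the filled values against the originals (interpolation stands in here for whichever filling algorithm is being tested):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
s = pd.Series(np.sin(np.linspace(0, 6, 200)))

# 1. Create artificial gaps at interior positions
gap_idx = rng.choice(s.index[1:-1], size=20, replace=False)
test = s.copy()
test[gap_idx] = np.nan

# 2. Fill the gaps with the algorithm under test
filled = test.interpolate()

# 3. Compare filled values with the known originals
errors = (filled[gap_idx] - s[gap_idx]).abs()
mean_error = errors.mean()
```

Repeating this with randomised gap positions (and gap sizes) would give a distribution of filling errors per algorithm.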