Pastas is an open-source Python framework for the analysis of groundwater time series.
Home Page: https://pastas.readthedocs.io
License: MIT License
I want to pass an array of parameters to the residuals function. Right now it first checks whether the method is 'lmfit', and then whether the parameters are an array. I think that should be the other way around. Also, the check whether parameters is an array also checks whether it is None. I think that is impossible, as it won't get beyond the earlier if statement.
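A minimal sketch of the reordered check (the function signature and fallback values are hypothetical, not the actual pastas code):

```python
import numpy as np

def residuals(parameters=None, method="leastsq"):
    # Check the parameters argument first, independent of the solver method.
    if parameters is None:
        parameters = np.array([1.0, 2.0])            # fall back to initial values
    elif not isinstance(parameters, np.ndarray):
        parameters = np.asarray(parameters, dtype=float)  # e.g. a plain list
    # ... compute and return the residual series from parameters ...
    return parameters
```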
When both the kind kwarg and the stresses kwarg are passed to project.get_distances, the kind argument is ignored. My current case is that I have multiple sources of meteorological data: a list of stresses from KNMI, and a list of stress names from another source. I could split my KNMI stresses into a precipitation list and an evaporation list and just use those, but it would be nice if get_distances still allowed you to pass the kind kwarg together with a subset of stresses.
From project.py get_distances:
if stresses is None and kind is None:
stresses = self.stresses.index
elif stresses is None:
stresses = self.stresses[self.stresses.kind == kind].index
I propose adding the following elif to allow both kind and stresses to be passed.
if stresses is None and kind is None:
stresses = self.stresses.index
elif stresses is None:
stresses = self.stresses[self.stresses.kind == kind].index
elif stresses is not None and kind is not None:
stresses = self.stresses.loc[stresses].loc[self.stresses.kind == kind].index
Figuring out how we deal with high-frequency oseries, meaning oseries with a higher frequency than the simulated series. Do we only compare the oseries at the indices where there are also simulations? Or do we interpolate the simulation to the indices of the oseries? I thought the latter.
Then, I would like an option to only compare to the oseries where there is also a simulation, or to be able to change the frequency of the oseries. Fitting high-frequency observations is often difficult and suffers from high autocorrelation.
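A sketch of the two options under discussion, with toy series (names and values are illustrative, and chosen so both options agree):

```python
import numpy as np
import pandas as pd

# Hourly observations, daily simulation.
obs = pd.Series(np.arange(48.0),
                index=pd.date_range("2020-01-01", periods=48,
                                    freq=pd.Timedelta(hours=1)))
sim = pd.Series([0.0, 24.0, 48.0],
                index=pd.date_range("2020-01-01", periods=3, freq="D"))

# Option 1: compare only at indices where a simulation exists.
common = obs.index.intersection(sim.index)
res_common = obs.loc[common] - sim.loc[common]

# Option 2: interpolate the simulation to the observation indices.
x = obs.index.values.astype("int64")
xp = sim.index.values.astype("int64")
res_interp = obs - pd.Series(np.interp(x, xp, sim.values), index=obs.index)
```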
I was trying to solve a model with a different frequency, but found out this does not work yet/anymore. However, simulate, residuals, etc. all support this already. It would be a nice feature to be able to solve with different frequencies.
When a model is saved as a .pas file, the model fit is not yet stored. Due to this a couple of methods do not work, e.g. ml.fit_report()
It would be nice to store the fit, including the covariance matrix.
When the frequency of a TimeSeries cannot be inferred and is not user-provided, unexpected things can happen when updating the settings / series. This becomes visible when changing the frequency: resampling automatically switches to the "sample_weighted" method in the "change_frequency" method of the TimeSeries class.
Bottom line: it should be possible to change an oseries with no freq_original from hourly values to daily values, using the "drop" option for dropping nan-values.
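A sketch of the desired behaviour using plain pandas (the actual TimeSeries settings machinery differs):

```python
import numpy as np
import pandas as pd

# Hourly observations; the daily value on the second day is missing.
oseries = pd.Series(np.arange(72.0),
                    index=pd.date_range("2020-01-01", periods=72,
                                        freq=pd.Timedelta(hours=1)))
oseries.iloc[24] = np.nan

# Desired behaviour: sample the daily values and simply drop the NaNs,
# instead of silently switching to a weighted-sampling method.
daily = oseries.asfreq("D").dropna()
```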
Methods for calculating Dutch groundwater statistics GHG and GLG are included in the Statistics class.
Why are these commented out? Can I add percentile based methods? I have forked the repo.
For the determination of the tmin and tmax for this stressmodel, the indices are compared. This goes wrong when the two series both have a daily frequency but are measured at a different hour.
Possible solution:
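The intended fix is not shown above; one possible approach (my sketch, not necessarily the suggestion meant here) is to normalize both indices to midnight before comparing them:

```python
import pandas as pd

# Two daily series measured at different hours of the day.
a = pd.date_range("2020-01-01 09:00", periods=5, freq="D")
b = pd.date_range("2020-01-03 14:00", periods=5, freq="D")

# Comparing the raw indices finds no overlap ...
raw_overlap = a.intersection(b)
# ... but comparing the dates floored to midnight does.
norm_overlap = a.normalize().intersection(b.normalize())
```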
When del_transform is run twice, for example
ml.del_transform('recharge')
ml.del_transform('well 1')
you lose the optimal parameters. After the first call the initial parameters are set to the optimal parameters, and after the second call the initial parameters are overwritten by the default initial parameters.
If a series is entered as a DataFrame with one column, pastas should change it to a Series. This is intended to be the pandas Series of the stress.
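A minimal sketch of the proposed conversion, assuming plain pandas objects:

```python
import pandas as pd

stress = pd.DataFrame({"prec": [1.0, 2.0, 3.0]},
                      index=pd.date_range("2020-01-01", periods=3))

# Accept a single-column DataFrame and convert it to a Series.
if isinstance(stress, pd.DataFrame) and stress.shape[1] == 1:
    stress = stress.iloc[:, 0]
```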
Nice library! The polder function (Bruggeman) is not included in rfunc.py. It would be nice to have as well, to be able to model surface water stresses.
Hi,
I have created a branch in which some methods of the Model class are partly replaced by existing pandas functions, see dev-model in my fork. I think it makes the code a bit more compact and (just slightly) faster.
If you see the changes and like them, I can make a pull request.
Right now it is very difficult to find out what the pre-defined settings for Stressmodels are. They are defined in TimeSeries and, as far as I know, can only be seen by looking at the code.
Implement the new tseries parameter and stress methods in this plotting function.
Current error message:
innovations = self.noisemodel.simulate(residuals, self.odelt) TypeError: simulate() takes at least 4 arguments (3 given)
To make PASTA more general, and to make it easier to generate and expand the GUI, a more general way to define Tseries would be nice (this is probably what Frans meant last week). For example, each Tseries should contain an attribute stating how many time series it needs (0 for Constant, 1 for Tseries1 and 2 for Recharge), and the series should be a list of this size, not separate inputs like in the Recharge class. With this information it is much easier to make an all-purpose import dialog for Tseries, but it is also more logical for people who use a script.
I think we should make all non-public methods known by adding a leading underscore to the method name. This is suggested in PEP8 (https://www.python.org/dev/peps/pep-0008/), and is followed by all major packages (E.g. Numpy, Scipy, Pandas, Flopy). E.g.
Model._get_odelt()
Also, we should not use private methods of other packages in Pastas, like the _base_and_stride taken from Pandas in utils.py. These methods can be dropped without notice, which can cause problems in future versions of Pastas.
This will make it clearer to Pastas users which methods they should use (they will pop up first on tab completion) and make maintenance of Pastas easier in the future.
Thoughts?
I am trying to implement the simulation function such that it can also be used when the model is not yet solved. It then uses the initial values.
When you run the example.py without solving, and then try the simulate you get the following error:
File "<ipython-input-5-5a455af4e32c>", line 1, in <module>
ml.simulate()
File "c:\python\pastas\pastas\model.py", line 191, in simulate
self.set_tmin_tmax()
File "c:\python\pastas\pastas\model.py", line 415, in set_tmin_tmax
tmin = tmin - self.get_time_offset(tmin, self.freq) + self.time_offset
File "c:\python\pastas\pastas\model.py", line 515, in get_time_offset
freq = freq.split("-", 1)[0]
AttributeError: 'NoneType' object has no attribute 'split'
Now, when you first have run ml.check_frequency() and ml.simulate() you do get the series. I'll try to fix this tomorrow.
see commit f252845
It is now possible to change the frequency of the observed (dependent) and the independent (stress) series. But the implementation is still very experimental.
Changing the frequency
Changing the frequency works fine for both series, but maybe a user option should be provided on how to resample; now forward fill is applied by default. Alternatively, a method could be written that uses only existing values and does not rely on interpolation.
The stress series now use the pandas .asfreq function, creating nan-values for each unobserved time index. These nan-values are later filled with a user-defined function.
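A sketch of this asfreq-then-fill pattern with a toy stress series (values are illustrative):

```python
import pandas as pd

stress = pd.Series([1.0, 2.0, 3.0],
                   index=pd.to_datetime(["2020-01-01", "2020-01-03",
                                         "2020-01-06"]))

# .asfreq inserts NaN for every unobserved daily index ...
daily = stress.asfreq("D")
# ... which is then filled with a user-chosen method, e.g. forward fill
# or interpolation.
filled_ffill = daily.ffill()
filled_interp = daily.interpolate()
```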
Simulating with different frequencies
The simulation of the model works fine when only the frequency of the observed series is changed. When the frequency of the stress series is changed, the residuals result in occasional nan-values and optimisation fails.
TO DO
/Users/mark/git/pastas/pastas/stressmodels.py:298: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
h = h.loc[tindex]
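Following the suggestion in the warning, the offending line could use .reindex instead of .loc; a toy sketch:

```python
import pandas as pd

h = pd.Series([1.0, 2.0], index=pd.to_datetime(["2020-01-01", "2020-01-02"]))
tindex = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])

# .loc with a missing label raises a KeyError in newer pandas versions;
# .reindex returns NaN for missing labels, matching the old behaviour.
h = h.reindex(tindex)
```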
The issue becomes apparent, for example, when a series contains only NaN. In this case freq cannot be inferred and is set to None.
freq_is_None_series = pd.Series(index=pd.to_datetime(['2009-01-01', '2010-01-01']))
Most of the tests are now failing in Travis with pandas 0.20. @tomvansteijn, can you figure out how to fix this? E.g. https://travis-ci.org/pastas/pastas/jobs/233972587
There is an error when a certain calibration period is chosen in combination with a noise model. Without the noise model the model works fine. E.g.:
n = NoiseModel()
ml.addnoisemodel(n)
ml.solve(tmin='1965', tmax='1990')
Gives:
ValueError: operands could not be broadcast together with shapes (4953,) (616,)
A pastas Model is initialized with a hard-coded daily frequency. This makes pastas resample my hourly stresses to daily values when creating the model. Then when I solve on an hourly frequency, pastas uses the original hourly series to eventually solve the model. The downsampling step is quite unnecessary in this case so how can I force pastas to skip it?
One solution would be to allow Models to be initialized with a frequency. But this probably adds to the (or just my) confusion about pastas settings.
This might lead to a whole other discussion, but currently there are three levels (that I can think of now) at which settings can be provided in pastas:
ml.settings
I've heard there was a reason not to include a settings kwarg in Model(), but in my case I wouldn't mind if the option existed... So what would be the best way to avoid pastas doing extra resampling work? I'm curious to hear your thoughts!
You cannot set the initial value or vary (I didn't try pmin and pmax) for the constant, while this can be done for the noise model (which is also automatically generated). Example code:
import pandas as pd
import pastas as ps
dates = pd.date_range('1990', '1991')
ho = pd.Series(data=1, index=dates)
ml = ps.Model(ho)
ml.parameters
ml.set_initial('noise_alpha', 77)
ml.set_initial('constant_d', 33)
ml.set_vary('noise_alpha', False)
ml.set_vary('constant_d', False)
print(ml.parameters)
Note that for the parameter of the noise model the values have been changed, but not for the constant.
I suggest to move the plot functions to a separate class to keep the Model class clean and succinct, similar to the way the Statistics class couples with the model in the dev branch.
When supplying tmin to Model.simulate, it returns values before tmin for the entire warmup period.
ml = ps.io.load("model.pas")
print(ml.fit_report())
returns:
ValueError: The model is not solved yet
This should be possible and relates to #82
Sampling at 14 and 28th day of the month, using forward fill or linear interpolation.
I will try to implement these.
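A sketch of both options with plain pandas (series and dates are illustrative):

```python
import pandas as pd

# Observations every ten days; we want values on the 14th and 28th.
obs = pd.Series([0.0, 10.0, 20.0],
                index=pd.to_datetime(["2020-01-10", "2020-01-20",
                                      "2020-01-30"]))
targets = pd.to_datetime(["2020-01-14", "2020-01-28"])

# Forward fill: take the last observed value before each target date.
sampled_ffill = obs.reindex(targets, method="ffill")

# Linear interpolation in time between the surrounding observations.
sampled_interp = (obs.reindex(obs.index.union(targets))
                     .interpolate(method="time")
                     .loc[targets])
```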
Just read this interesting blog post on Pandas 1.0 which seems to be coming up next year.
https://www.dataschool.io/future-of-pandas/
Most importantly for us the inplace argument will be removed so we should start removing those from the Pastas code. I think it would be good if we can release a stable Pastas version after the release of Pandas 1.0 as Pastas heavily depends on this package.
Is it an idea to start making a list of what we need to do for such a release?
It is now easy to change the frequency, but the way we deal with this change of frequency is not yet consistent. For now, the gain is divided by ml.settings['freq']: changing the frequency from daily to weekly divides the gain by 7.
This is not yet consistent with the get_block_response function and the plotting methods that depend on it. I think this would be a good issue to solve in the next Pastas release.
When the warmup is specified when doing a solve, the value is passed to the residuals function differently than tmin and tmax. In fact, I don't quite know how it is passed, but a different value of the warmup does give a different solution, so it gets passed in some fashion. It should probably be passed in a similar fashion to tmin and tmax.
When I install pasta using python setup.py install the 'read' and 'recharge' folders are not copied to site-packages\pasta-0.01-py2.7.egg\pasta\
When I copy the folders manually everything works fine.
logging.config is now applied when a model is created. So when a TimeSeries object is created before that, and logging occurs, this information is not printed to the console.
My test file runs with pytest. Better and more up-to-date compared with nosetests, in my opinion. Up to you!
When one of the tseries is outside the time range for which the plot is drawn, NaNs end up in the height ratios (line 168):
fig, ax = plt.subplots(1 + len(self.ml.tseriesdict), sharex=True,
gridspec_kw={'height_ratios': height_ratios})
This causes an error.
It would be nice if we could automatically wrap the (code) cells of the Jupyter notebooks when creating the docs on readthedocs, making it easier to read the notebooks online.
The log config file is not in the PyPI distribution. This causes trouble with the log level. In model.py, log_level="error" should be "ERROR", I think.
Right now the index of stresses consists of Timestamps. It is unclear whether this is the beginning or the end of the period which the timestamp represents. For example, the menyanthes-import gives data at the end of each period. So the monthly well discharge with a timestamp at the 1st of february (0:00) represents the extraction in january. For the knmi-data on the other hand, the precipitation with a timestamp on january 1st (0:00) represents the precipitation on january 1st, and so the index of the data is at the beginning of the period it represents. I would propose to always define data at the end of the period it represents, as this is also the moment the amount is registered. So the precipitation of january 1st would have an index of january 2nd. The choice also has implications for the simulation methods, which right now assume the data is defined at the end of each period (I think).
A better approach would be to use pandas Periods instead of pandas Timestamps. By defining periods, it is clear from the definition of the index which period the data represents. We need to figure out whether all our methods work with Periods as well, however.
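A toy sketch of the difference (values are illustrative):

```python
import pandas as pd

# Timestamp index: does 2020-02-01 label January's total or February's start?
stamped = pd.Series([100.0, 120.0],
                    index=pd.to_datetime(["2020-02-01", "2020-03-01"]))

# Period index: the covered interval is explicit in the index itself.
monthly = pd.Series([100.0, 120.0],
                    index=pd.period_range("2020-01", periods=2, freq="M"))
```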
The following code produces an error in optimisation when the recharge series has nan-values at the beginning or the end.
ts1 = Tseries(recharge, Gamma(), name='recharge', fillnan='interpolate')
This is because 'interpolate' borrowed from pandas does not fill nan-values at the beginning and end.
This might be a solution?
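One possible solution (my assumption; the original suggestion was not included here) is pandas' limit_direction argument for interpolate:

```python
import numpy as np
import pandas as pd

recharge = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan],
                     index=pd.date_range("2020-01-01", periods=5, freq="D"))

# Plain interpolate() leaves the leading NaN in place ...
inner = recharge.interpolate()
# ... while limit_direction="both" also fills the edges.
filled = recharge.interpolate(limit_direction="both")
```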
It seems that the built-in plotting functionality of Pastas creates double figure instances when used in an IPython Notebook. A temporary fix can be to suppress output by adding a semicolon, e.g.:
ml.plots.decomposition();
Or by storing the returned figure instance:
fig = ml.plots.decomposition()
The method's keyword argument show=False has no effect in Notebooks.
Anyone knows how to solve this issue for all plotting methods?
The current timeseries statistics functions q_ghg, q_gvg and q_glg take a quantile of the whole series. The classic definition is better approximated by taking the average of quantiles per year. Proposing to implement this using Pandas resample.
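A sketch of the per-year approach with a synthetic head series. I use groupby on the year for brevity (the proposal mentions pandas resample), and a hypothetical 94% quantile stands in for the actual GHG definition:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", "2019-12-31", freq="D")
head = pd.Series(rng.standard_normal(len(idx)).cumsum() / 100 + 1.0,
                 index=idx)

# Whole-series quantile (roughly the current q_ghg behaviour) ...
ghg_whole = head.quantile(0.94)
# ... versus the mean of per-year quantiles (proposed).
ghg_yearly = head.groupby(head.index.year).quantile(0.94).mean()
```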
User-provided names could be checked for unwanted characters that cause trouble in other methods.
E.g. name="Test/456" causes trouble when using this name for writing files (pandas read_csv) and could be changed to:
name="Test_456".
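A sketch of such a check (the character whitelist is my assumption):

```python
import re

def sanitize_name(name):
    # Replace anything outside [A-Za-z0-9_.-] with an underscore.
    return re.sub(r"[^0-9A-Za-z_.-]", "_", name)
```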
Statistics functions q_ghg, q_glg, q_gvg do not use the optional arguments tmin and tmax.
Add indexing using .loc?
tmin and tmax can be specified in ml.plot, but tmin is ignored and the plot is drawn including the warmup period. It would be nice to have an easy option to plot the results for the tmin and tmax specified for ml.solve. Now you have to specify them for the solve and then again for the plot. Some kind of option: 'use tmin and tmax from the solution' (which are stored anyway, right?) would be nice.
Same holds for stats.evp. It would be nice to have an option to compute evp for the period used in solving.
Still very impressed with this library. I thought it would be nice, for a particular application, to be able to do some postprocessing on the residuals before feeding them back to lmfit.minimize. In general I think it is better if the objective function is independent of the solver. See the branch dev-obj-functions in my fork. If you like the changes, I can make a pull request.
AttributeError: 'TimeSeries' object has no attribute 'IN'
The order of the model parameters changes after creation, caused by the pandas append method that orders columns alphabetically. As a result, pmin is now shown after pmax, which I find confusing. This should be solved somewhere in the get_init_parameters method.