chenyangkang / stemflow

A Python Package for Adaptive Spatio-Temporal Exploratory Model (AdaSTEM)

Home Page: https://chenyangkang.github.io/stemflow/

License: MIT License

Python 100.00%

bird-migration geospatial machine-learning spatio-temporal-analysis species-distribution-modeling spatiotemporal adaptive-spatio-temporal-exploratory-model biodiversity biogeography ebird

stemflow's Issues

NAs detection

Add NA detection and raise an error when the X or y input contains NAs. #43
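A minimal sketch of such a check, written in plain Python so it runs without dependencies. The names `na_columns` and `check_no_nas` and the list-of-dicts input are assumptions for illustration, not stemflow's API; inside the package the same check would more likely operate on a pandas DataFrame.

```python
import math

def _is_na(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def na_columns(X):
    """Return the sorted names of columns in X that contain a missing value.

    X is a list of dicts, one per row (a hypothetical stand-in for a DataFrame).
    """
    return sorted({col for row in X for col, val in row.items() if _is_na(val)})

def check_no_nas(X, y):
    """Raise ValueError if X or y contains a missing value."""
    bad = na_columns(X)
    if bad:
        raise ValueError(f"X contains NAs in columns: {bad}")
    if any(_is_na(val) for val in y):
        raise ValueError("y contains NAs")

X = [
    {"longitude": 10.0, "latitude": 50.0, "DOY": 120},
    {"longitude": float("nan"), "latitude": 51.0, "DOY": 121},
]
y = [1.0, 0.0]

try:
    check_no_nas(X, y)
except ValueError as err:
    print(err)
```

Failing early with the offending column names is usually easier to debug than an error raised deep inside model fitting.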

[JOSS review] Improve cartesian indexing system

This issue addresses problems in the Cartesian indexing system, following suggestions by [at]jedalong (openjournals/joss-reviews#6158).

In the mallard example you use 50 spatial and temporal blocks for the global distribution of mallards.
That is, if the data X have longitude ranging from (-180, 180), latitude ranging from (-90, 90), and whole year data (1, 366), each block will approximately contain data of 7.2 longitude (about 720km), 3.6 latitude (about 360km), and 7 days, which approximately catch the spatiotemporal scale of bird migration. These are rough estimates to get a sense of the scale.

One degree of latitude is always about 110 km, so 3.6 degrees is about 396 km. The big issue is that one degree of longitude varies from about 110 km at the equator to 0 km at the poles (7.2 degrees is about 792 km only at the equator), so the area of your blocks varies greatly from equator to pole. Gridding data on the globe without accounting for the spherical geometry of the earth is problematic (your package is not alone in ignoring this major issue). Could you instead allow users to pass in spatial objects, or define bins based on actual geometry, so that bins are more equally sized and better reflect globally distributed data?
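The distortion the reviewer describes can be checked with a few lines of arithmetic. The sketch below assumes a spherical earth with about 111.32 km per degree of arc (an approximation) and 50 equal blocks along each axis. On a sphere the north-south length of a degree of latitude is roughly constant, while the east-west length of a degree of longitude shrinks with the cosine of the latitude. The function name `block_size_km` is hypothetical.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of arc on a great circle

def block_size_km(latitude_deg, n_blocks=50):
    """Return (width_km, height_km) of one block when the globe is cut into
    n_blocks equal steps of longitude and n_blocks equal steps of latitude."""
    lon_step = 360.0 / n_blocks  # degrees of longitude per block (7.2 for 50 blocks)
    lat_step = 180.0 / n_blocks  # degrees of latitude per block (3.6 for 50 blocks)
    width = lon_step * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))
    height = lat_step * KM_PER_DEGREE
    return width, height

for lat in (0, 30, 60, 80):
    width, height = block_size_km(lat)
    print(f"latitude {lat:>2}: {width:6.0f} km wide, {height:4.0f} km tall")
```

The block height stays fixed while the width halves by 60 degrees of latitude, which is why equal-degree blocks are far from equal in area.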

TODO:

1. Edit the notebook. Change the specific number (720 km? 792 km?) and add a caveat about the distortion problem towards the poles.
2. Allow users to pass parameters of actual geometry.

To me it would be more appropriate for a user to pass in a single parameter associated with the desired output spatial resolution of the grid (e.g., grids with a size of 100 km x 100 km), and then the package would create the grid on the fly from that single parameter. You expect 4 parameters for this same gridding process? I note that your package does not force grid cells to be square in area, which is IMO unusual.

TODO:
3. Reduce the number of parameters.
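A rough sketch of what the single-parameter interface suggested by the reviewer could look like, assuming the coordinates are already in a projected CRS with metre units. The function `grid_cell` and the parameter `cell_size` are hypothetical names, not part of stemflow.

```python
import math

def grid_cell(x, y, cell_size):
    """Map a projected coordinate (same unit as cell_size, e.g. metres) to the
    integer (column, row) index of the square cell that contains it."""
    return math.floor(x / cell_size), math.floor(y / cell_size)

cell_size = 100_000  # 100 km, assuming coordinates in metres
points = [
    (4_321_000.0, 3_210_000.0),
    (4_399_999.0, 3_299_999.0),
    (4_400_000.0, 3_300_000.0),
]
for x, y in points:
    print((x, y), "->", grid_cell(x, y, cell_size))
```

With one resolution parameter every cell is square by construction. The trade-off is that this gives a fixed grid, not an adaptive one whose cells split where data are dense.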

This issue is further demonstrated in the Tips section on using a different coordinate system, where the user must pass in 1000 to 10000 m as the range for "latitude" and "longitude" values in a coordinate system that does not use latitude and longitude but rather x and y.

TODO:
4. Not related but consider: Allow concrete gridding parameters instead of only "adaptive".

I'm providing my review of stemflow here in this issue. I don't have many detailed software issues, since the package installs correctly and works as demonstrated in the documentation, for what I've tested. I have a few other comments that I believe will improve user experience.

First, thank you for preparing this package. It is a valuable contribution to the community. The goals of my comments are to make it easier for others to use this package with their own data. It is very easy to run the examples provided using pre-processed data, but there appears to be a series of assumptions regarding how these data should be formatted that are not clear.

Since this package is designed around spatial and temporal data, it would help to clarify how time and space are encoded in the model. Do input dataframes require a DOY column? Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)? I see latitude and longitude encoded in the mini test data - is that the only supported CRS? I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?

Next, I think more guidance regarding feature data could be provided. I was assuming most of the input covariates would be datasets that are temporally resolved at a similar scale as the abundance data (e.g. daily NDVI). But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates? Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.

The example notebooks provide clear usage examples, but there is little to no explanation of why certain routines are performed. There are titles for groups of code, but little contextual information. In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? What would turning these numbers up and down do? What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode? Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.

There are lots of great features in this package and in the documentation that I don't feel like I understand as well as I would like to. I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable. You mentioned using other base models from sklearn in the manuscript - when would this be advantageous? You also provide great examples for comparing learning curves and optimizing strixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize strixel size to minimize overfitting, for example?).

Overall, I think this is a great package and will make a valuable contribution to the growing ecosystem of python biogeography tools. I mostly recommend providing more guidance to users for how to best use this valuable resource.

Cheers,

[JOSS review] Documentation revision

This issue is to solve the problem suggested by [at]jedalong (openjournals/joss-reviews#6158)

What is the difference between temporal_step and temporal_bin_interval? I didn't follow the whole 'sliding' window part of this: are they bins/blocks or a moving window?

Documentation Notes:
Rephrase "stemflow have 4 important gridding parameters". There are actually only two: the maximum grid length and the minimum grid length. Each can be set separately for longitude and latitude, which gives 4.

TODO:

Add documentation on sliding window.
Add docs on the difference between temporal_step and temporal_bin_interval.
Revise the gridding params docs.
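One way to picture the two temporal parameters is sketched below. It assumes that temporal_step is how far the window start advances each time and temporal_bin_interval is the width of each window; that reading is an assumption made here for illustration, and the function `sliding_windows` is a hypothetical helper, not stemflow code.

```python
def sliding_windows(start, end, temporal_step, temporal_bin_interval):
    """Yield (window_start, window_end) pairs covering [start, end].

    temporal_step is how far the window start advances each time;
    temporal_bin_interval is the width of each window. When the step is
    smaller than the interval, consecutive windows overlap.
    """
    window_start = start
    while window_start <= end:
        yield window_start, min(window_start + temporal_bin_interval, end)
        window_start += temporal_step

windows = list(sliding_windows(start=1, end=366, temporal_step=20, temporal_bin_interval=50))
print(windows[:3])
```

Under this reading, a step equal to the interval gives non-overlapping bins, while a smaller step gives a moving window in which each day of year falls into several windows.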

Check random state for all functions

Check the random state handling in all functions. This should make the modeling results reproducible.
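A small sketch of the pattern this implies: every random draw goes through one generator seeded from a `random_state` argument. The function `split_and_sample` is a hypothetical stand-in for any routine that subsamples data per ensemble, and it uses the standard library generator so that it runs without dependencies.

```python
import random

def split_and_sample(n_points, n_ensembles, random_state=None):
    """Draw one random subsample of point indices per ensemble.

    A single random.Random seeded with random_state drives every draw, so the
    same seed always reproduces the same ensembles.
    """
    rng = random.Random(random_state)
    return [
        sorted(rng.sample(range(n_points), k=n_points // 2))
        for _ in range(n_ensembles)
    ]

first = split_and_sample(n_points=10, n_ensembles=3, random_state=42)
second = split_and_sample(n_points=10, n_ensembles=3, random_state=42)
print(first == second)
```

Any function that creates its own unseeded generator, or relies on global random state, would break this guarantee, which is why each function needs checking.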

[Feature] STEM module

The STEM module will allow users to keep the stixel size fixed, rather than adaptive. This partially addresses #23. It is quite a large project, so I am opening a separate issue.

[REVIEW] consider adding automated testing

Automated testing is a best practice for open source projects, reducing the risk of software regression through breaking changes. stemflow includes a mini_test module that appears to test many parts of the software, but this is run manually. I recommend adding automated tests in CI to verify (and to provide verification to users) that the software continues to function as expected.

[JOSS review] paper revision

This issue is opened for paper revision suggested by [at]jedalong (openjournals/joss-reviews#6158)

original comments:

L10 - I think you need to explain what AdaSTEM is more explicitly at the beginning of the abstract similar to line 32 so the reader knows what this is.
L18 consider rephrasing this sentence.
L26 - consider rephrasing 'mine its merits'
L79 - rephrase 'mounting'
L85 - rephrase this entire sentence, not sure this is what you mean, confusing 'dependency conjugated with bias in data abundance'?
L89 - rephrase 'potentials'

Add Spatial and Temporal Scale Warnings

Raise warnings when the spatial scale of the grid_length-related parameters is significantly lower or higher than the scale of the input data. Same for the temporal ones.

Related to #43
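A sketch of one possible scale check, assuming coordinates and thresholds share a unit. The function `check_spatial_scale` and its `ratio` tolerance are hypothetical, chosen only to illustrate the idea of comparing grid length thresholds with the data extent.

```python
import warnings

def check_spatial_scale(xs, grid_len_lower, grid_len_upper, ratio=100):
    """Warn if the grid length thresholds look mismatched with the data extent.

    xs are coordinates along one axis, in the same unit as the thresholds.
    ratio is a hypothetical tolerance: how many times smaller than the extent
    the largest allowed grid may be before a warning is raised.
    Returns the list of warning messages for inspection.
    """
    extent = max(xs) - min(xs)
    messages = []
    if grid_len_lower > extent:
        messages.append(
            f"grid_len_lower ({grid_len_lower}) exceeds the data extent ({extent})."
        )
    if grid_len_upper * ratio < extent:
        messages.append(
            f"grid_len_upper ({grid_len_upper}) is over {ratio}x smaller "
            f"than the data extent ({extent})."
        )
    for message in messages:
        warnings.warn(message, UserWarning)
    return messages

# Projected coordinates in metres combined with degree-sized thresholds.
xs = [2_500_000.0, 4_000_000.0, 6_500_000.0]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_spatial_scale(xs, grid_len_lower=5, grid_len_upper=25)
print([str(w.message) for w in caught])
```

A check like this would flag the common mistake of keeping degree-based defaults while supplying coordinates in metres.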

Revise the documentation for sphere indexing and mini_test

Revise documentation for two main changes:

sphere indexing
change of mini_test

For

Code docstring
Notebooks
Tips
Home page

[JOSS review] Add spherical indexing system

This issue is to solve the problem suggested by [at]jedalong (openjournals/joss-reviews#6158)

I think it is probably associated with the tools used for modelling that do not expect spatial objects, but binning the data correctly based on spatial objects and then back-transforming to the expected data structure might be a more correct way to deal with the spherical geometry issue.

TODO:

Implement spherical indexing system
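One common way to index points without the distortion of a flat longitude-latitude grid is to map them onto a sphere in 3D Cartesian coordinates and measure distances there. The sketch below illustrates that idea under a spherical-earth assumption (radius 6371 km); it is not necessarily how stemflow implements its spherical indexing, and both function names are hypothetical.

```python
import math

def lonlat_to_unit_sphere(longitude_deg, latitude_deg):
    """Convert longitude/latitude in degrees to (x, y, z) on the unit sphere."""
    lon = math.radians(longitude_deg)
    lat = math.radians(latitude_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def great_circle_km(p, q, radius_km=6371.0):
    """Great-circle distance between two unit-sphere points."""
    # Clamp the dot product to guard against rounding slightly outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    return radius_km * math.acos(dot)

# One degree of longitude, measured at the equator and at 60 degrees north.
d_equator = great_circle_km(lonlat_to_unit_sphere(0, 0), lonlat_to_unit_sphere(1, 0))
d_60north = great_circle_km(lonlat_to_unit_sphere(0, 60), lonlat_to_unit_sphere(1, 60))
print(round(d_equator, 1), round(d_60north, 1))
```

Because distances are computed on the sphere itself, a grid built on these coordinates does not shrink towards the poles the way equal-degree cells do.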

[Feature] Speed boost? Using Geo indexing dependency

As suggested during the JOSS review, I should probably use geo-indexing for tasks such as prediction.

This issue is to see if geopandas will speed up the indexing-related tasks.

Gridding params grid search

Add functions and docs to show how to do a grid search for the best gridding params, probably using sklearn.model_selection.GridSearchCV, or a faster alternative: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html.
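The idea of an exhaustive search over gridding parameters can be sketched without scikit-learn. The code below is a plain-Python stand-in, not GridSearchCV: `grid_search` and `toy_score` are hypothetical, and `toy_score` replaces what would in practice be a cross-validated model score.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid and return
    (best_params, best_score), where a higher score is better."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(grid_len_upper_threshold, grid_len_lower_threshold):
    """Stand-in for a cross-validated model score; peaks at (25, 5)."""
    return -abs(grid_len_upper_threshold - 25) - abs(grid_len_lower_threshold - 5)

param_grid = {
    "grid_len_upper_threshold": [10, 25, 50],
    "grid_len_lower_threshold": [1, 5, 10],
}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is the motivation for successive-halving variants that discard poor candidates early.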

Add Numba for numpy operation optimization

To improve the speed of modeling, consider adding Numba decorators to NumPy operations.

https://github.com/numba/numba

[REVIEW] consider updating contributor guidelines

The stemflow README contains a sentence on welcoming contributions, but does not specify how those contributions are best made. There are some nice guides available for providing a contributor guide, and I recommend adding a more detailed guide for users.

[BUG] plot_gif currently only works for global data in WGS84

[Feature] Make parallel computing available

Is your feature request related to a problem? Please describe.
Add parallel modeling for:

split
fit
predict
assign importance to points

Describe the solution you'd like
Add joblib as parallel backend.

Describe alternatives you've considered
multiprocessing with shared memory
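The per-ensemble parallelism requested here can be sketched with the standard library. The example below deliberately uses `concurrent.futures.ThreadPoolExecutor` as a dependency-free stand-in for joblib, and `fit_ensemble` and `fit_all` are hypothetical placeholders for the real fitting routines.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_ensemble(ensemble_id, data):
    """Stand-in for fitting one ensemble; here it just returns the mean of the data."""
    return ensemble_id, sum(data) / len(data)

def fit_all(ensembles, n_jobs=2):
    """Fit every ensemble using up to n_jobs workers and return {ensemble_id: result}."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [
            pool.submit(fit_ensemble, ensemble_id, data)
            for ensemble_id, data in ensembles.items()
        ]
        # Collect results in submission order so the output is deterministic.
        return dict(future.result() for future in futures)

ensembles = {0: [1.0, 2.0, 3.0], 1: [10.0, 20.0], 2: [5.0]}
print(fit_all(ensembles))
```

Ensembles are independent of each other, so they are a natural unit of work. With joblib the same structure would be expressed as `Parallel(n_jobs=n_jobs)(delayed(fit_ensemble)(i, d) for i, d in ensembles.items())`, which runs in separate processes by default and so suits CPU-bound fitting better than threads do.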


[BUG] unique_stixel_id

First, thanks for the package! It seems that it would be very useful for my analyses!

I managed to run the example with your data with no problem :) However, when I try to run it with my own data, the fitting crashes with the error: "unique_stixel_id". Specifically, when I use the function .fit() it seems to Generate Ensemble with no problem but later in the Training it crashes.
I'm working with the crs projection EPSG:3035. I tried to change the grid_len_upper_threshold and grid_len_lower_threshold to change the grid size but I always get the same error. Below is the terminal output for the error. Moreover, attached is some sample data in case you want to try yourself. Do you have any idea why the error is happening?

Thanks in advance!

Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 577, in fit
    self.SAC_training(self.ensemble_df, X_train, verbosity, njobs)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 526, in SAC_training
    for ensemble in output_generator:
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\tqdm\std.py", line 1181, in __iter__
    for obj in iterable:
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 506, in <genexpr>
    output_generator = (self.SAC_ensemble_training(index_df=ensemble[1], data=data) for ensemble in groups)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 480, in SAC_ensemble_training
    .apply(lambda stixel: self.stixel_fitting(stixel))
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\pandas\core\groupby\groupby.py", line 1846, in apply
    return self._python_apply_general(f, self._obj_with_exclusions)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\pandas\core\groupby\groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\pandas\core\groupby\ops.py", line 919, in apply_groupwise
    res = f(group)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 480, in <lambda>
    .apply(lambda stixel: self.stixel_fitting(stixel))
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\stemflow\model\AdaSTEM.py", line 403, in stixel_fitting
    unique_stixel_id = stixel["unique_stixel_id"].iloc[0]
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\XXXX\AppData\Local\miniconda3\envs\r-reticulate\lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'unique_stixel_id'

Sample data:
traindata_sample.csv
