climate-explorer-backend's Issues
generate_climos fails on files with 360_day calendar
Example: Processing /storage/data/climate/CMIP5/output/NIMR-KMA/HadGEM2-AO/historical/day/atmos/day/r1i1p1/v20130422/tasmin/tasmin_day_HadGEM2-AO_historical_r1i1p1_18600101-20051230.nc
:
2017-05-09 22:02:58 INFO: Generating climo period 1961-01-01 to 1990-12-30
2017-05-09 22:02:58 INFO: Selecting temporal subset
2017-05-09 22:03:10 INFO: Forming climatological means
Traceback (most recent call last):
File "generate_climos.py", line 372, in <module>
main(args)
File "generate_climos.py", line 356, in main
convert_longitudes=args.convert_longitudes, split_vars=args.split_vars)
File "generate_climos.py", line 179, in create_climo_files
update_climo_time_meta(climo_means)
File "generate_climos.py", line 314, in update_climo_time_meta
climo_bnds_var[:] = date2num(climo_bounds, cf.time_var.units, cf.time_var.calendar)
File "netCDF4/_netCDF4.pyx", line 5265, in netCDF4._netCDF4.date2num (netCDF4/_netCDF4.c:65355)
File "netcdftime/_netcdftime.pyx", line 795, in netcdftime._netcdftime.utime.date2num (netcdftime/_netcdftime.c:14358)
ValueError: there are only 30 days in every month with the 360_day calendar
This occurs for all model output files under /storage/data/climate/CMIP5/output/NIMR-KMA/ and /storage/data/climate/CMIP5/output/MOHC/.
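The failure comes from passing standard-library datetimes (which allow a day 31) to date2num for a 360_day file. Below is a minimal sketch of a calendar-safe way to build the climatological bounds, assuming the modern cftime package (the successor to netcdftime); the dates and units shown are illustrative only:

```python
# Hedged sketch: use calendar-aware cftime datetimes so a 360_day file never
# sees an impossible day-of-month (e.g. December 31).
import cftime
from netCDF4 import date2num

calendar = "360_day"
units = "days since 1850-01-01"  # illustrative; use cf.time_var.units in practice

climo_bounds = [
    cftime.datetime(1961, 1, 1, calendar=calendar),
    cftime.datetime(1990, 12, 30, calendar=calendar),  # last day of December in a 360_day year
]
print(date2num(climo_bounds, units, calendar))
```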
unit consistency checks are not ensemble-constrained
This issue is not causing any current problems, but it is unintuitive behavior and can be quite confusing when debugging.
The data API collects all files that match the parameters passed to it by URL:
- model
- ensemble
- timescale
- variable
And then constructs a separate timeseries for each run that matches the parameters. Before constructing each run's timeseries, it checks that all variables in that run have the same units, using get_units_from_run_object, filtering datasets by the following parameters:
- run
- variable
Notably, get_units_from_run_object does not filter by ensemble. This means that while timeseries are constructed only from the datasets contained in an ensemble, an error will be thrown if any file in the database, even one not in the ensemble, uses incorrect or conflicting units for that variable. This is pretty unintuitive.
Practically speaking, this issue would be 95% resolved if the backend accepted synonymous units using a units library (#105), though get_units_from_run_object and its helpers could also be modified to accept an ensemble and search only within it, to make them behave more intuitively when used by within-ensemble API endpoints to construct timeseries.
Configurable cache size
The backend currently allocates 10MB for caching masks (so they don't have to be recalculated for new files) and 100MB for caching masked data arrays (so they don't have to be recalculated for new timeslices). These should be configurable parameters that can be changed according to where the server is being run and how big the data files it is working with are.
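A minimal sketch of one way to do this, assuming plain environment variables (the variable names below are hypothetical) with the current sizes as defaults:

```python
import os

# Hypothetical settings module: cache budgets in bytes, overridable per deployment.
MASK_CACHE_SIZE = int(os.environ.get("CE_MASK_CACHE_SIZE", 10 * 1024 * 1024))          # 10 MB default
MASKED_DATA_CACHE_SIZE = int(os.environ.get("CE_DATA_CACHE_SIZE", 100 * 1024 * 1024))  # 100 MB default
```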
/metadata endpoint has confusing parameter names
A legacy from when all time resolutions (monthly, yearly, seasonal) were in a single file, the /metadata API endpoint takes a model_id parameter that is actually what all the other endpoints call id_, the unique ID of the datafile. Other queries use model to represent the GCM that generated the datafile.
Once upon a time, model_id and id_ were synonymous, as all the data from a model was in a single datafile, but they are not now.
(If we fix this, we should be sure to let the MOTI folks developing for our API know, since it might change under them.)
Calls to timeseries/data without a polygon raise max recursion depth
To reproduce
Start a dev server with real world data using the production mddb
MDDB_DSN=postgresql://[email protected]/pcic_meta python scripts/devserver.py -p 8004
Timeseries
Request
http://<root_url>/api/timeseries?id_=tmax_monClim_PRISM_historical_run1_198101-201012&variable=tmax&area=
Response
RuntimeError: maximum recursion depth exceeded in __instancecheck__
Full stack trace
Data
Request
http://atlas.pcic.uvic.ca:8004/api/data?model=PRISM&emission=historical&time=0&variable=tmax&area=
Response
RuntimeError: maximum recursion depth exceeded while calling a Python object
Full stack trace
Clipping by polygon greatly increases timeseries generation time
Timeseries calls show significantly more slowdown than the other endpoints when querying a polygon region.
Using this polygon: POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))
Endpoint | Time w/out poly | Time w/ Poly | Times slower |
---|---|---|---|
Timeseries | 0m0.093s | 0m9.163s | 98.5 |
Data | 0m0.105s | 0m1.128s | 10.7 |
Multistats | 0m0.069s | 0m1.136s | 16.5 |
Endpoint | Calls to geo.polygonToMask |
---|---|
Timeseries | 17 |
Data | 2 |
Multistats | 1 |
Data
Timeseries w/out polygon
time curl "http://<backend_url>/api/timeseries?id_=tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231&variable=tasmax&area="{"data": {"1986-01-16T00:00:00Z": 280.3414522058824, "1986-02-15T00:00:00Z": 277.9441223144531, "1986-03-16T00:00:00Z": 278.25262451171875, "1986-04-16T00:00:00Z": 279.67120361328125, "1986-05-16T00:00:00Z": 281.3408203125, "1986-06-16T00:00:00Z": 282.7239990234375, "1986-07-16T00:00:00Z": 283.23187255859375, "1986-08-16T00:00:00Z": 282.784912109375, "1986-09-16T00:00:00Z": 281.61785888671875, "1986-10-16T00:00:00Z": 280.39532470703125, "1986-11-16T00:00:00Z": 279.31549072265625, "1986-12-16T00:00:00Z": 278.6036071777344, "1986-04-17T00:00:00Z": 278.25830078125, "1986-07-17T00:00:00Z": 279.75579833984375, "1986-10-17T00:00:00Z": 282.9156494140625, "1987-01-15T00:00:00Z": 280.4423828125, "1986-07-02T00:00:00Z": 280.35418701171875}, "id": "tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231", "units": "K"}
real 0m0.093s
Timeseries w/ polygon
time curl "http://<backend_url>/api/timeseries?id_=tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231&variable=tasmax&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))"
{"data": {"1986-01-16T00:00:00Z": 264.9690372242647, "1986-02-15T00:00:00Z": 275.4643249511719, "1986-03-16T00:00:00Z": 276.2695007324219, "1986-04-16T00:00:00Z": 278.6673583984375, "1986-05-16T00:00:00Z": 283.45538330078125, "1986-06-16T00:00:00Z": 288.6134033203125, "1986-07-16T00:00:00Z": 293.27569580078125, "1986-08-16T00:00:00Z": 293.18048095703125, "1986-09-16T00:00:00Z": 288.39788818359375, "1986-10-16T00:00:00Z": 281.8310546875, "1986-11-16T00:00:00Z": 277.7283630371094, "1986-12-16T00:00:00Z": 275.1181335449219, "1986-04-17T00:00:00Z": 275.0066833496094, "1986-07-17T00:00:00Z": 279.4727478027344, "1986-10-17T00:00:00Z": 291.7232971191406, "1987-01-15T00:00:00Z": 282.6434326171875, "1986-07-02T00:00:00Z": 282.24981689453125}, "id": "tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231", "units": "K"}
real 0m9.163s
Data w/out poly
time curl "http://<backend_url>/api/data?model=CanESM2&variable=tasmax&emission=historical&time=0&area="
{"r1i1p1": {"data": {"1986-01-16T00:00:00Z": 280.3414522058824, "1977-01-16T00:00:00Z": 280.09547334558823}, "units": "K"}}
real 0m0.105s
Data w/ poly
time curl "http://<backend_url>/api/data?model=CanESM2&variable=tasmax&emission=historical&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))&time=0"
{"r1i1p1": {"data": {"1986-01-16T00:00:00Z": 264.9690372242647, "1977-01-16T00:00:00Z": 264.9567440257353}, "units": "K"}}
real 0m1.128s
Multistats w/out poly
time curl "http://<backend_url>/api/multistats?variable=tasmax&emission=historical&area=&time=0"{"tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231": {"median": 282.08111572265625, "time": "1986-07-15T21:10:35Z", "units": "K", "max": 314.3248291015625, "mean": 278.1967468261719, "ncells": 8192, "stdev": 21.291064986096575, "min": 234.7781524658203}, "tasmax_Amon_CanESM2_historical_r1i1p1_19610101-19901231": {"median": 281.9470520019531, "time": "1977-07-15T21:10:35Z", "units": "K", "max": 314.5687561035156, "mean": 277.9090270996094, "ncells": 8192, "stdev": 21.46124322907459, "min": 234.5634307861328}}
real 0m0.069s
Multistats w/ poly
time curl "http://<backend_url>/api/multistats?variable=tasmax&emission=historical&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))&time=0"
{"tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231": {"median": 274.51959228515625, "time": "1986-07-15T21:10:35Z", "units": "K", "max": 281.11529541015625, "mean": 274.48187255859375, "ncells": 16, "stdev": 5.350947123320987, "min": 266.13177490234375}, "tasmax_Amon_CanESM2_historical_r1i1p1_19610101-19901231": {"median": 273.93328857421875, "time": "1977-07-15T21:10:35Z", "units": "K", "max": 280.8470764160156, "mean": 274.06365966796875, "ncells": 16, "stdev": 5.470714281718363, "min": 265.55706787109375}}
real 0m1.136s
Travis CI tests are failing on master
Files without runs?
In adding tests for the streamflow/watershed API, I discovered that if a file does not have a run associated with it, then the models API endpoint fails. The key code is this:
{dfv.file.run.model.short_name for dfv in ensemble.data_file_variables}
where ensemble is a pycds.Ensemble object.
Overlooking the fact that there are likely alternative queries more robust to absent links between ensemble and model, there's the question of whether we already do or plan to allow for files without runs. I am not sure how we resolved this question, so this issue may just be a placeholder for retrieving a previous decision that resolves this problem while I continue working on the watershed API.
multistat API call doesn't handle split climatology files
Currently the multistat API call accepts an integer timeidx parameter, along with a set of filtering parameters: ensemble, model, emission, and variable.
It returns a set of statistics calculated from the timeidx-th slice of each file that fits the rest of the parameters.
It would be nice if it were possible to filter on timescale (monthly, seasonal, yearly) in addition to ensemble, model, emission, and variable, so that, for example, you could pass timescale=monthly&timeidx=0 to get only the statistical calculation corresponding to January, or timescale=yearly&index=0 to get annual statistics.
It's completely possible to sort out in the front end which timeidx=0 results are Januaries, which are winters, and which are annual, but I think it would make more sense to have the API do that filtering. It would more closely match the revised functioning of the data query, and be more intuitive to a hypothetical new programmer, I imagine.
Use modelmeta database version 12f290b63791
modelmeta will shortly be enhanced to support both gridded (existing) and discrete sampling geometry (new) data files.
This will change the modelmeta ORM and database contents. CE backend needs to handle this.
Questions:
- Is it possible and worth it to handle both old and new database versions? Probably looks like:
- Install variant version of modelmeta in your CE backend environment
- In code, inspect contents of table alembic_version (see the sketch after this list).
- In code, import and use variant ORM classes depending on version discovered. This sounds tricky.
- Would this be worth the trouble?
- If not, how do we want to handle the asynchronous changeover from old to new database version? What are the consequences (apart from the obvious) of a mismatch between database and CE backend?
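A minimal sketch of the version check mentioned above, assuming SQLAlchemy is available (the helper function and DSN handling are hypothetical):

```python
# Hedged sketch: read the alembic revision of the target modelmeta database
# so the backend could decide which ORM variant (old or new) to use.
from sqlalchemy import create_engine, text

def modelmeta_revision(dsn):
    """Return the alembic version string stored in the database at `dsn`."""
    engine = create_engine(dsn)
    with engine.connect() as conn:
        return conn.execute(text("SELECT version_num FROM alembic_version")).scalar()

# e.g.: if modelmeta_revision(dsn) == "12f290b63791": import the new ORM classes
```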
Multimeta API call should expose time bounds
The single-file metadata API call now includes start_date, end_date, and multi_year_mean metadata about each dataset, as requested in #44. It would be convenient if the multi-file multimeta API call exposed them too, but I'm open to discussion about the necessity of this.
At present, the climate explorer frontend calls multimeta once on loading (line 17) to get information on all available datasets, which it uses to populate selectors and user-facing widgets with what datasets are available for examination.
In order to be able to show the user what each dataset's climatological period is, either multimeta could provide that information, or climate-explorer-frontend could hit the single-file metadata call once for each file in the user-selected current group (i.e., filtered by model, variable, and sometimes experiment). I think having multimeta provide the information up front would be a little nicer.
Data API call doesn't handle split climatology files
The data API call collects and returns data from all files that match a set of parameters (model, experiment, ensemble, variable, etc).
Previously, it took a time index argument between 0 and 16 to indicate the time resolution and position sought, and would return the nth time slice in the file, which worked when all the files were 17-point chronologies.
With split data files, with separate monthly (0-11), seasonal (0-3), and yearly (0) files, this API call returns a mix of monthly, seasonal, and annual data when you call it with &time=0, and an IndexError if you call it with any other time index.
Here's the result for GET /api/data?ensemble_name=downscaled&model=bcc-csm1-1-m&variable=tasmax&emission=historical,+rcp45&area=&time=0:
{
"r1i1p1": {
"units": "degC",
"data": {
"2085-07-02T00:00:00Z": -0.5808153910727394,
"2025-07-02T00:00:00Z": -0.8622292459716103,
"1977-01-16T00:00:00Z": -18.78877913696885,
"1997-01-16T00:00:00Z": -17.92588300616977,
"2055-07-02T00:00:00Z": -0.6485861922775469,
"2055-01-16T00:00:00Z": -15.875643052269846,
"1986-07-02T00:00:00Z": -2.4333102893767307,
"1997-07-02T00:00:00Z": -2.077599634237314,
"2025-01-16T00:00:00Z": -16.226023053533392,
"2025-01-15T00:00:00Z": -18.02569051912951,
"2085-01-16T00:00:00Z": -15.60062807695439,
"1986-01-15T00:00:00Z": -19.950983475060585,
"1977-07-02T00:00:00Z": -2.671051067797724,
"1977-01-15T00:00:00Z": -20.599000150601793,
"2055-01-15T00:00:00Z": -17.825752320828578,
"1986-01-16T00:00:00Z": -18.487454919078537,
"1997-01-15T00:00:00Z": -19.534196834187902,
"2085-01-15T00:00:00Z": -17.498223073165622
}
}
}
This appears to be a mix of annual data (July 2 dates), monthly data (January 15), and seasonal data (January 16).
I don't have strong preferences about how exactly this query should behave in the split-file context. It doesn't need to function identically to the old query as long as it's possible to get the data in some reasonably straightforward way. For example, if given the new file structure it makes sense to expect the front end to pass a time resolution (monthly, seasonal, annual) along with the time index, that would be no trouble.
generate_climo scales precip units incorrectly for packed files
generate_climo can (under control of a flag) convert precipitation units from per-second to per-day. Currently this conversion is done by multiplying by a scale factor of 86400 (s/day). This is correct for absolute (unpacked) files, but fails for packed files where the offset value is non-zero.
Solution is to recognize packed files and apply a different computation to them taking into account the scale and offset values.
Question: Is there a standard for how packed files are defined that we can adhere to, or is it ad hoc?
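A minimal sketch of what that computation might look like with netCDF4 (not the actual generate_climos code; attribute handling is simplified and the variable name is illustrative):

```python
# Hedged sketch: for packed variables (unpacked = packed * scale_factor + add_offset),
# scaling both attributes converts the unpacked values without touching the stored
# integers; unpacked variables are scaled directly.
from netCDF4 import Dataset

SECONDS_PER_DAY = 86400

def convert_pr_to_per_day(nc_path, var_name="pr"):
    with Dataset(nc_path, "r+") as nc:
        var = nc.variables[var_name]
        if hasattr(var, "scale_factor") or hasattr(var, "add_offset"):
            # Packed file: adjust the packing attributes only.
            var.scale_factor = getattr(var, "scale_factor", 1.0) * SECONDS_PER_DAY
            var.add_offset = getattr(var, "add_offset", 0.0) * SECONDS_PER_DAY
        else:
            # Absolute (unpacked) file: scale the stored values.
            var[:] = var[:] * SECONDS_PER_DAY
        var.units = "kg m-2 d-1"
```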
Optimize cache code
In preparation for using the climate-explorer-backend to drive Plan2Adapt (P2A), I've been taking a look at the performance of the various queries.
P2A makes a large number of calls (slightly over a thousand) to the /stats API, so small slowdowns add up to a lot.
It turns out that the cache is slow. Really slow. Slow to the point where it would be faster if we simply turned off the cache.
Why? In order to control the size of the cache, it recursively counts the size of every object it contains. Doing this over and over adds up to a huge performance penalty. Over the course of 1 P2A run, simply measuring the size of the cache took around 43% of the run time while fetching the actual data from disk only took 21% of the run time.
I think that it would be fairly easy to have the cache keep track of its own size and just increment/decrement when something is added/expired. That would presumably give us an almost 2x speedup.
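A minimal sketch of that idea (a hypothetical stand-in, not the existing ce cache code):

```python
from collections import OrderedDict
from sys import getsizeof

class SizeTrackedCache:
    """LRU cache that maintains a running byte count instead of re-measuring every entry."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._store = OrderedDict()  # key -> (value, size); insertion order doubles as LRU order

    def get(self, key):
        item = self._store.pop(key, None)
        if item is None:
            return None
        self._store[key] = item  # re-insert as most recently used
        return item[0]

    def put(self, key, value):
        if key in self._store:
            self.current_bytes -= self._store.pop(key)[1]
        size = getsizeof(value)  # value.nbytes would be more accurate for numpy arrays
        self._store[key] = (value, size)
        self.current_bytes += size
        # Evict least recently used entries until we are back under budget.
        while self.current_bytes > self.max_bytes and len(self._store) > 1:
            _, (_, evicted_size) = self._store.popitem(last=False)
            self.current_bytes -= evicted_size
```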
Some calls to data endpoint return all zeros
Some data API calls are still returning incorrect information.
Some of the requests that @jameshiebert and I noted as returning zeros now return expected data, but others still don't.
Using a constant polygon:
POLYGON="POLYGON+((-127.9296875+45.76171875%2C+-127.9296875+54.6484375%2C+-118.65234375000001+54.6484375%2C+-118.65234375000001+45.76171875%2C+-127.9296875+45.76171875))"
Works with January:
curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp85&area='$POLYGON'&time=0' | python -m json.tool
{
"r1i1p1": {
"data": {
"2025-01-16T00:00:00Z": 94.53663330078125,
"2055-01-16T00:00:00Z": 101.20841064453126,
"2085-01-16T00:00:00Z": 111.72054443359374
},
"units": "mm"
},
"r2i1p1": {
"data": {
"2025-01-16T00:00:00Z": 98.94828491210937,
"2055-01-16T00:00:00Z": 102.28649291992187,
"2085-01-16T00:00:00Z": 114.10343017578126
},
"units": "mm"
},
"r3i1p1": {
"data": {
"2025-01-16T00:00:00Z": 93.01469116210937,
"2055-01-16T00:00:00Z": 103.7731689453125,
"2085-01-16T00:00:00Z": 112.4370849609375
},
"units": "mm"
},
"r4i1p1": {
"data": {
"2025-01-16T00:00:00Z": 95.85071411132813,
"2055-01-16T00:00:00Z": 102.0443115234375,
"2085-01-16T00:00:00Z": 115.915771484375
},
"units": "mm"
}
}
Not for any other timestep:
curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp85&area='$POLYGON'&time=1' | python -m json.tool
{
"r1i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
},
"r2i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
},
"r3i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
},
"r4i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
}
}
Calls for the same model with rcp26 appear to work for other times of year:
curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp26&area='$POLYGON'&time=1' | python -m json.tool
{
"r1i1p1": {
"data": {
"2025-02-15T00:00:00Z": 55.08775024414062,
"2055-02-15T00:00:00Z": 58.02188720703125,
"2085-02-15T00:00:00Z": 54.56318969726563
},
"units": "mm"
},
"r2i1p1": {
"data": {
"2025-02-15T00:00:00Z": 59.523980712890626,
"2055-02-15T00:00:00Z": 59.92850952148437,
"2085-02-15T00:00:00Z": 58.3666015625
},
"units": "mm"
},
"r3i1p1": {
"data": {
"2025-02-15T00:00:00Z": 53.345208740234376,
"2055-02-15T00:00:00Z": 57.764373779296875,
"2085-02-15T00:00:00Z": 52.764739990234375
},
"units": "mm"
},
"r4i1p1": {
"data": {
"2025-02-15T00:00:00Z": 56.85017700195313,
"2055-02-15T00:00:00Z": 59.44209594726563,
"2085-02-15T00:00:00Z": 54.10572509765625
},
"units": "mm"
},
"r5i1p1": {
"data": {
"2025-02-15T00:00:00Z": 54.548590087890624,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
}
}
But some returned results are 0 when they should not be.
The same issue persists for other variables:
curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx1dayETCCDI&emission=rcp26&area='$POLYGON'&time=1' | python -m json.tool
{
"r1i1p1": {
"data": {
"2025-02-15T00:00:00Z": 19.7260986328125,
"2055-02-15T00:00:00Z": 22.774565124511717,
"2085-02-15T00:00:00Z": 21.199964904785155
},
"units": "mm"
},
"r2i1p1": {
"data": {
"2025-02-15T00:00:00Z": 20.879574584960938,
"2055-02-15T00:00:00Z": 22.38013153076172,
"2085-02-15T00:00:00Z": 21.877273559570312
},
"units": "mm"
},
"r3i1p1": {
"data": {
"2025-02-15T00:00:00Z": 21.111979675292968,
"2055-02-15T00:00:00Z": 22.47863006591797,
"2085-02-15T00:00:00Z": 19.134806823730468
},
"units": "mm"
},
"r4i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
},
"r5i1p1": {
"data": {
"2025-02-15T00:00:00Z": 0.0,
"2055-02-15T00:00:00Z": 0.0,
"2085-02-15T00:00:00Z": 0.0
},
"units": "mm"
}
}
The back end does not show any errors for these requests.
generate_climos.py should create one file per climo averaging period
Add an option in generate_climos.py to output one file per climatological averaging period (month, season, year).
Currently they are all in one file, with a single variable with a non-monotonic time dimension (12 months, 4 seasons, 1 year). This is a problem: ncWMS2 won't accept files with non-monotonic dimensions. Separating into different files solves this, and is formally more correct (the different averaging periods technically are distinct random variables).
Add multi_year_mean as parameter to `search_for_unique_ids`
This was removed in 10022a8 to quickly deploy annual data; however, the functionality is still required for creating climatological portals.
cell_method filter match() vs fullmatch()
I've discovered that the cell_method filter is not working as intended. It uses re.match() to check for desired cell_methods but will accept a partial (prefix) match. Change it to re.fullmatch() to ensure we aren't getting mixed data.
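A quick illustration of the difference (the pattern and strings below are made up, not actual cell_methods values from our files):

```python
import re

pattern = r"time: mean"
print(bool(re.match(pattern, "time: mean")))                                 # True
print(bool(re.match(pattern, "time: mean within days time: maximum")))       # True: prefix match accepted
print(bool(re.fullmatch(pattern, "time: mean within days time: maximum")))   # False: whole string required
```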
This will need to be tagged and released, likely 1.1.1.
Write a `multistats` API call
Takes a model_id and emissions scenario, searches for unique_ids and delegates to the stats() API call.
Memoizing geo masking by model_id and polygon not guaranteed to be unique
Problem:
As per #10 (comment), while model_id is currently adequate to differentiate between model grids, this will break with other data.
Solution:
Modify memoize_mask to key on grid properties rather than model_id.
Provide canonical units in API endpoints
While datasets of each variable in active use have matching units, intervariable unit names vary:
1. Temperature: 1177 datasets have the units degrees_C; 2056 have degC.
2. Precipitation: 95 datasets have the units mm d-1; 141 have the units mm day-1; 2319 datasets have the units kg m-2 d-1, which may not be automatically convertible.
Currently, nothing in the front end explicitly requires that units from different variables be formatted similarly in order to compare them (see pacificclimate/climate-explorer-frontend#107 - we had this requirement once, but changed it).
Still, there are several ways that providing canonically formatted units using a package like udunits might be helpful in the future, and it is worth doing:
- users directly accessing the API or exporting data from the front end could more easily compare data with their own tools and applications
- possible future frontend development might want to compare variables directly
- easier formatting of variable units in the variable config file if units have canonical forms
- easier to integrate data from new sources with different unit conventions
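A minimal sketch of what canonicalization might look like, assuming the pint library (the issue only says "a package like udunits", so this is one possible choice; the spelling map below is hypothetical):

```python
import pint

ureg = pint.UnitRegistry()
# Spellings pint does not recognize out of the box are normalized first.
SPELLING_FIXES = {"degrees_C": "degC", "degrees_c": "degC", "mm d-1": "mm/day", "mm day-1": "mm/day"}

def canonical_units(unit_string):
    """Return one canonical spelling for any of the unit strings found in our files."""
    unit_string = SPELLING_FIXES.get(unit_string, unit_string)
    return str(ureg.parse_units(unit_string))

print(canonical_units("degC"), canonical_units("degrees_C"))  # both map to the same canonical form
```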
stats api doesn't handle non-multi-year-mean datasets
The stats query is written to select a single timestep of data and calculate statistics about it. This is the correct behaviour for multi-year-mean (MYM) datasets, where each timestep actually represents something like thirty years of January, thirty years of spring, or thirty years of annual data, depending on the resolution and query parameters: a time index and a resolution.
This is wrong for non-MYM datasets. At present, all the non-MYM datasets we're supporting are annual. So the stats query will receive timeres=yearly&timeidx=0, and return statistics about the 0th timeslice of an annual non-MYM dataset, which would (usually) be 1950, which is not representative of the whole period.
Stats needs to detect when it is working with a non-MYM file and, in the case of annual time resolutions, return statistics across all time slices. For completeness, it should probably return stats from every twelfth timeslice for monthly non-MYM files, but we aren't using them now and don't plan to.
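A minimal sketch of the proposed behaviour (a hypothetical helper, not the existing stats code):

```python
import numpy as np

def stats_for(values, multi_year_mean, timeidx):
    """`values` is a (time, lat, lon) array already clipped to the requested area."""
    # MYM files keep the single-timestep behaviour; annual non-MYM files are
    # summarized across all timesteps so one year (e.g. 1950) is not taken to
    # represent the whole period.
    data = values[timeidx] if multi_year_mean else values
    return {
        "min": float(np.min(data)),
        "max": float(np.max(data)),
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "stdev": float(np.std(data)),
    }
```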
Provide spatial coverage information
At present, the backend provides no information about the spatial coverage of a file to the frontend. Therefore, if there are two datasets identical in every way except spatial coverage (for example, if one dataset is all-Canada and the other is BC-only), the metadata the backend provides to the front end makes it impossible for the front end to distinguish between them.
When the frontend selects a dataset to display for a user, it uses the parameters selected by the user: start date, end date, model, experiment, variable. It is possible that different display components would select different datasets if they had the same parameters but a different spatial extent. So the map might show a different area than the graphs, or graphs visualize different datasets, leading to silently incorrect data.
TravisCI doesn't run data prep tests
Explanation in .travis.yml:
We only run the ce tests because installing cdo is just too hard: NetCDF support isn't built into the
debian cdo pacakges. Once that is resolved (i.e., install the NetCDF support before installing cdo),
we [can] run the data prep tests too.
Investigate possible memory leak in cache code
When we run the climate-explorer-backend in docker, its RAM usage begins to slowly creep up over a number of days:
# docker stats --no-stream
e4d78bc8801e ceb-latest 0.02% 6.361GiB / 9.606GiB 66.23% 9.55MB / 99.1MB 66.2MB / 0B 11
We do have some code around that intentionally caches results to RAM. It's supposed to limit the cache size to 100 MB; however, it turns out that our tests don't actually cover the cache eviction branch in ce/api/geo.py, lines 112-115:
(env) james@basalt:~/code/git/climate-explorer-backend$ py.test -v --cov=ce --cov-report=term-missing ce/tests
=========================================================== test session starts ============================================================
...
----------- coverage: platform linux, python 3.5.2-final-0 -----------
Name Stmts Miss Cover Missing
-----------------------------------------------------
ce/api/geo.py 128 7 95% 21, 33, 44, 112-115, 163
...
That section of the code is highly suspect. For this task, write tests to cover cache eviction and investigate whether or not the code is actually working.
Migrate data prep to separate project/repo
Timeseries API call sometimes returns all-zero data when given an area parameter
This bug is intermittent and hard to reproduce, but sometimes passing an area polygon as a parameter to the API results in all-zeroes data. It affects multiple data files and data sets.
Here's a link to an example API call demonstrating the error.
The results of that call are:
{
"id": "tasmin_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada",
"data": {
"1977-01-15T00:00:00Z": 0.0,
"1977-02-15T00:00:00Z": 0.0,
"1977-03-15T00:00:00Z": 0.0,
"1977-04-15T00:00:00Z": 0.0,
"1977-05-15T00:00:00Z": 0.0,
"1977-06-15T00:00:00Z": 0.0,
"1977-07-15T00:00:00Z": 0.0,
"1977-08-15T00:00:00Z": 0.0,
"1977-09-15T00:00:00Z": 0.0,
"1977-10-15T00:00:00Z": 0.0,
"1977-11-15T00:00:00Z": 0.0,
"1977-12-15T00:00:00Z": 0.0
},
"units": "degC"
}
The same area parameter passed to a call for a different id (tasmax instead of tasmin) also returns all zeroes:
{
"id": "tasmax_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada",
"data": {
"1977-01-15T00:00:00Z": 0.0,
"1977-02-15T00:00:00Z": 0.0,
"1977-03-15T00:00:00Z": 0.0,
"1977-04-15T00:00:00Z": 0.0,
"1977-05-15T00:00:00Z": 0.0,
"1977-06-15T00:00:00Z": 0.0,
"1977-07-15T00:00:00Z": 0.0,
"1977-08-15T00:00:00Z": 0.0,
"1977-09-15T00:00:00Z": 0.0,
"1977-10-15T00:00:00Z": 0.0,
"1977-11-15T00:00:00Z": 0.0,
"1977-12-15T00:00:00Z": 0.0
},
"units": "degC"
}
Giving this area as a parameter to the third and final data file in the test collection has the same result.
{"id": "pr_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada", "data": {
"1977-01-15T00:00:00Z": 0.0,
"1977-02-15T00:00:00Z": 0.0,
"1977-03-15T00:00:00Z": 0.0,
"1977-04-15T00:00:00Z": 0.0,
"1977-05-15T00:00:00Z": 0.0,
"1977-06-15T00:00:00Z": 0.0,
"1977-07-15T00:00:00Z": 0.0,
"1977-08-15T00:00:00Z": 0.0,
"1977-09-15T00:00:00Z": 0.0,
"1977-10-15T00:00:00Z": 0.0,
"1977-11-15T00:00:00Z": 0.0,
"1977-12-15T00:00:00Z": 0.0
},
"units": "kg m-2 d-1"
}
No error message is generated by these API calls.
There's reasonably good evidence that the values are not zero in the files themselves, in that the ncWMS maps created from the same file show colour variation over the affected polygon.
Passing the same area parameter to an identical copy of the climate explorer backend running against a different metadata database and different data files does not reproduce the error:
{
"units": "K",
"id": "tasmax_mClim_CanESM2_rcp45_r1i1p1_20100101-20391231",
"data": {
"2025-01-15T00:00:00Z": 279.19142659505206,
"2025-02-15T00:00:00Z": 280.26951090494794,
"2025-03-15T00:00:00Z": 280.49456787109375,
"2025-04-15T00:00:00Z": 279.8186442057292,
"2025-05-15T00:00:00Z": 278.9789225260417,
"2025-06-15T00:00:00Z": 278.13330078125,
"2025-07-15T00:00:00Z": 277.2965494791667,
"2025-08-15T00:00:00Z": 276.41054280598956,
"2025-09-15T00:00:00Z": 276.2767740885417,
"2025-10-15T00:00:00Z": 276.3365071614583,
"2025-11-15T00:00:00Z": 276.72873942057294,
"2025-12-15T00:00:00Z": 277.8912353515625
}
}
Some polygons return all 0's in timeseries call
Some calls to the timeseries backend return all zeros. [This](http://docker1.pcic.uvic.ca:20003/api/timeseries?id_=prcptotETCCDI_yr_BCCAQ-ANUSPLIN300-CanESM2_historical-rcp26_r1i1p1_1950-2100&variable=prcptotETCCDI&area=POLYGON+%28%28-123+48,+-123+49.80,+-122+49.80,+-122+48,+-123+48%29%29) works as expected.
area= POLYGON+((-123+48,+-123+49.80,+-122+49.80,+-122+48,+-123+48))
But [this one](http://docker1.pcic.uvic.ca:20003/api/timeseries?id_=prcptotETCCDI_yr_BCCAQ-ANUSPLIN300-CanESM2_historical-rcp26_r1i1p1_1950-2100&variable=prcptotETCCDI&area=POLYGON+%28%28-123+49.8,+-123+49.9,+-122+49.9,+-122+49.8,+-123+49.8%29%29) returns all zeros.
area= POLYGON+((-123+49.8,+-123+49.9,+-122+49.9,+-122+49.8,+-123+49.8))
This appears to have to do with how low the northern extent of the polygon is, not its size or orientation.
`multimeta` API call is too slow for moderate number of files
I have an sqlite database which has indexed 660 ClimDEX files, and the multimeta API call takes 30 seconds to return.
$ time curl -s http://localhost:9000/api/multimeta
real 0m29.051s
user 0m0.016s
sys 0m0.000s
PostgreSQL may be more reasonable, but if not, we'll need to do some query optimization for the multimeta call.
create script to split merged climo files into separate monthly, seasonal, and annual files
See #45 for some background.
This script will process files generated by older versions of generate_climos.py (or other programs) into separate files for each averaging period.
metadata API call should expose time bounds
The metadata API request returns information about a dataset's time values. For certain datasets, however (e.g. climatologies and temporal averages), the actual time value is representative of a time range. In these cases, CF-metadata compliant files are attributed with a time_bnds (time bounds) variable.
If time bounds are available, we should expose them in this API call.
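A minimal sketch of how the bounds could be read with netCDF4 (the file name below is illustrative):

```python
from netCDF4 import Dataset

with Dataset("some_climatology_file.nc") as nc:
    time = nc.variables["time"]
    # Climatological files reference their bounds via time:climatology;
    # ordinary temporal averages use time:bounds.
    bounds_name = getattr(time, "climatology", None) or getattr(time, "bounds", None)
    if bounds_name is not None:
        time_bnds = nc.variables[bounds_name][:]  # shape (n_times, 2)
```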
Timing tests fail intermittently on Travis CI
Several tests check time of execution: ce/tests/test_api.py#L194, ce/tests/test_util.py#L39, ce/tests/test_geo.py#L40.
These tests intermittently fail on Travis CI, with execution times from a few percent to several hundred percent larger than asserted. (I should not have restarted a recent failing build, so I can't link to one here. Trust me.)
Make using the API easier
There are a few minor things about the API code that can trip up a new developer writing code that runs against the API and could be easily fixed:
- The default ensemble on queries that require one is ce, which no longer exists. It should be ce_files.
- Documentation on many queries still describes the time argument as a value from 1 to 17, when queries have been updated to accept a resolution and an index.
- Overall documentation review; make sure it's all up to date.
Inconsistent units associated with txxETCCDI variable
Some txxETCCDI files have degrees_c as their unit; others degC. The /data query quite sensibly refuses to construct a timeseries from files with different units, and returns a 500 error.
Solutions might be:
- use a list of "synonymous" units (degC/degrees_c for temperature and mm/kg d-1 m-2 for precipitation)
- standardize the units in existing files and reindex them, if it's only a few
I haven't investigated thoroughly, and it's possible that this error is actually just a side effect of the /data query's present erroneous attempt to construct a timeseries from both nominal-time and multi-year-mean datasets, in which case this bug will go away when PR 82 is merged.
EDIT: PR 82 has been merged, but did not solve this bug.
Generate ensemble statistic files
One comment from the first round of feedback with the MoTI engineers was that they were not interested in individual model runs and would prefer information from an ensemble average (e.g. the "PCIC 12").
We should probably create a simple script that can generate an ensemble average, given a list of some number of files as arguments. It would presumably be a simple call to CDO.
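A minimal sketch of such a script, assuming CDO's ensmean operator is the right tool (the issue only says "a simple call to CDO"); file names come from the command line and this is not an existing script:

```python
import subprocess
import sys

def ensemble_mean(member_files, output_file):
    """Compute the cell-wise ensemble mean of the given NetCDF files with CDO."""
    subprocess.run(["cdo", "ensmean", *member_files, output_file], check=True)

if __name__ == "__main__":
    # usage: python ensemble_mean.py member1.nc member2.nc ... output.nc
    ensemble_mean(sys.argv[1:-1], sys.argv[-1])
```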
Odd behaviour in make_mask_grid()
make_mask_grid() generates a unique key for a combination of map grid and polygon; it's used to determine whether a new polygon request has been previously made and doesn't need to be recalculated. The key consists of a tuple with:
- minimum longitude of the grid (float)
- maximum longitude of the grid (float)
- number of longitude steps (float)
- minimum latitude of the grid (float)
- maximum latitude of the grid (float)
- number of latitude steps (float)
- text (wkt) representation of the polygon to be computed (string)
However, during testing today, @nikola-rados was seeing singleton numpy arrays instead of numpy floats for the minimums and maximums, which is not the intended or previously observed behaviour. The obvious cause would seem to be a numpy update - numpy is not pinned in this repository - but we tried testing the steps in the python console with the newest version of numpy and they behaved as I would have expected.
This issue can be easily fixed by switching to np.min() instead of np_array[0] to determine the minimum of each coordinate, but I'd like to know why it's happening and if there are any other side effects first, rather than blindly patching it. Needs further investigation.
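For illustration, a hypothetical stand-in for the key construction showing the suggested np.min()/np.max() approach (cast to plain floats so the key is hashable either way):

```python
import numpy as np

def grid_key(lons, lats, polygon_wkt):
    """Hypothetical version of the cache key built by make_mask_grid()."""
    lons, lats = np.asarray(lons), np.asarray(lats)
    return (
        float(np.min(lons)), float(np.max(lons)), float(lons.size),
        float(np.min(lats)), float(np.max(lats)), float(lats.size),
        polygon_wkt,
    )
```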
Non-canonical test data files
The test data files for ce are each in their own way(s) different from what nchelpers, index_netcdf, and split_merged_climos expect. One effect of this is that some of the tests in climate-explorer-backend are failing now that we have established a more correct and desirable labelling for files containing seasonal and yearly means.
Most if not all these datasets appear to be multi-year means, but human expertise has to be applied to determine that.
Question: Which of them, if any, represent datasets the above repos/scripts will have to deal with?
Depending on the answer to this question, either the test files should be replaced with more representative ones, or else the codebase needs some fixing to handle these types of files. Or both.
Details (files in climate-explorer-backend/ce/tests/data/):
anuspline_na.nc:
- 1 time step: 1950-01-02 00:00:00
- time variable has no attribute climatology (nor any climatology bounds variable) (therefore nchelpers does not recognize it as a multi-year mean, if that's what it is)
- what is it???
CanESM2-rcp85-tasmax-r1i1p1-2010-2039.nc, indexed as file3:
- 17 time steps, mid-month, mid-season, mid-year (though not the same as those we use now [CF standard], nor the same as those previously computed by generate_climos)
- does contain a time bounds variable, properly referenced by the time:climatology attribute
- this file is probably OK (can be split)
cgcm.nc, indexed as file0:
- does contain a time bounds variable, but referenced by a time:bounds attribute instead of time:climatology (therefore nchelpers does not recognize it as a multi-year mean)
- 12 time variables, end of each month
- history attribute shows it was formed by cdo ymonmean
cgcm-tmin.nc, indexed as file4:
- history indicates it is a slightly mutated version of cgcm.nc; ditto all comments for it
prism_pr_small.nc:
- 13 time steps, mid-month, mid-year (no seasons)
- does have a variable named climatology_bounds, but the time variable has no attribute climatology referencing it (therefore nchelpers does not recognize it as a multi-year mean)
Grid test is too verbose
This commit added a test that precisely checks the values of test input.
This is too verbose and would need to be rewritten any time the test input data changes. Please loosen the assertions here. Other alternatives, for example, could be:
- Just check that the keys "latitudes" and "longitudes" exist in the dict
- Just check that the length of the lists is > 0
- Just check that some of the values in the list are > 0
Essentially, the application should be guaranteed to work with any input data, so having the test assertions too closely coupled with the test input data makes the tests annoying to read and difficult to maintain over time.
Missing statistical data
The multistats query is returning blank JSON objects for datasets that seem to be valid and produce reasonable results for other queries. A couple examples:
This is affecting a lot of datasets.
Flat data from timeseries API call when called with a spatial extent
The timeseries API call, when passed a spatial extent to calculate values over, returns values that do not vary over the year:
{
"id": "tasmin_mClim_BCCAQv2_CanESM2_historical-rcp26_r1i1p1_19610101-19901231_Canada",
"units": "degC",
"data": {
"1977-01-15T00:00:00Z": 4.026229503566938,
"1977-02-15T00:00:00Z": 4.026229503566938,
"1977-03-15T00:00:00Z": 4.026229503566938,
"1977-04-15T00:00:00Z": 4.026229503566938,
"1977-05-15T00:00:00Z": 4.026229503566938,
"1977-06-15T00:00:00Z": 4.026229503566938,
"1977-07-15T00:00:00Z": 4.026229503566938,
"1977-08-15T00:00:00Z": 4.026229503566938,
"1977-09-15T00:00:00Z": 4.026229503566938,
"1977-10-15T00:00:00Z": 4.026229503566938,
"1977-11-15T00:00:00Z": 4.026229503566938,
"1977-12-15T00:00:00Z": 4.026229503566938
}
}
The same query run over the entire spatial extent returns expected seasonal variation:
{
"id": "tasmin_mClim_BCCAQv2_CanESM2_historical-rcp26_r1i1p1_19610101-19901231_Canada",
"units": "degC",
"data": {
"1977-01-15T00:00:00Z": -29.574449355153153,
"1977-02-15T00:00:00Z": -28.212806823529412,
"1977-03-15T00:00:00Z": -23.460362097196857,
"1977-04-15T00:00:00Z": -14.638448772434755,
"1977-05-15T00:00:00Z": -5.053808769799956,
"1977-06-15T00:00:00Z": 2.690855450688279,
"1977-07-15T00:00:00Z": 6.272098836979196,
"1977-08-15T00:00:00Z": 5.254183921397172,
"1977-09-15T00:00:00Z": -0.013880960970580513,
"1977-10-15T00:00:00Z": -8.014392844233534,
"1977-11-15T00:00:00Z": -17.73349189331108,
"1977-12-15T00:00:00Z": -25.708370434857155
}
}
Polygon clipping is still slow
Polygon clipping on large grids (BCCAQ scale... 10km over Canada) remains pretty slow (on the order of 5-10 seconds). This needs another round of optimization.
`data` endpoint does not produce multiple climo periods
A call to <backend_url>/data?model=CanESM2&variable=tasmax&emission=rcp26&area=&time=16 results in:
{"r1i1p1": {"units": "K", "data": {"2025-07-02T00:00:00Z": 282.42840576171875}}}
This should produce something more like:
{"r1i1p1": {
"units": "K", "data": {
"2025-07-02T00:00:00Z": 282,
"2055-07-02T00:00:00Z": 283,
"2075-07-02T00:00:00Z": 284,
}
}}
This is caused by:
return {
    run.name: {
        'data': {
            timeval.strftime('%Y-%m-%dT%H:%M:%SZ'): getdata(file_)
            for file_ in get_files_from_run_variable(run, variable)
        },
        'units': get_units_from_run_object(run, variable)
    } for run, timeval in results
}
Which does the following related bad things:
1. overwrites run.name on each iteration
2. increments timeval in the outer loop
Provide HTTP cache support
There are numerous methods in HTTP to provide support for caching.
Our data changes almost never, so we could easily provide some very aggressive and particularly accurate caching control, using the mod times of the underlying data files. If we do this, we could provide much better performance and reduce the system load by a ton.
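A minimal sketch of one option, assuming a Flask view (the endpoint, file lookup, and payload below are placeholders): set Last-Modified from the underlying data file's mtime and let conditional requests return 304.

```python
import os
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/data")
def data():
    nc_path = "/path/to/underlying/file.nc"   # placeholder for the real file lookup
    payload = {"example": True}                # placeholder for the real query result
    response = jsonify(payload)
    response.last_modified = os.path.getmtime(nc_path)
    response.cache_control.public = True
    response.cache_control.max_age = 86400
    response.make_conditional(request)         # answers If-Modified-Since with 304 when unchanged
    return response
```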
Area query on climdex file causes server crash
Timeseries query and multistats query crash the entire backend when made against a climdex file with an area specified.
metadata API endpoint should include climatological information
The metadata API endpoint should be extended with the following properties:
- multi_year_mean: from TimeSet.multi_year_mean
- start_date, end_date: from TimeSet.start_date, TimeSet.end_date, which now contain climo bounds for multi-year mean files (see pacificclimate/modelmeta#12)
- timescale: from TimeSet.timescale (see also issue pacificclimate/modelmeta#9)
All values derived from TimeSet are None if there is no such record associated with the DataFileVariable for the requested file.
Handle differently formatted experiment strings
Climdex files typically have experiment strings in the form "historical, rcp26", but GCM output files typically have experiment strings in the form "historical,rcp26".
At present, an exact match for the emissions scenario string is required in order to use the data API endpoint, but the Climate Explorer frontend also needs to be able to compare metadata for two files and tell if they were run with the same emissions scenario. So the Climate Explorer frontend converts back and forth between the exact experiment string and a standardized, easily comparable experiment string as needed, which is not great.
It would be much more straightforward if the backend:
- Always gave the front end a standardized experiment string in the multimeta endpoint and the metadata endpoint or
- Was able to accept an experiment string in either form at the data endpoint,
or both.
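For illustration, a tiny sketch of the standardization in question (a hypothetical helper, not existing backend code):

```python
def standardize_experiment(experiment):
    """Normalize whitespace around commas so the two spellings compare equal."""
    return ", ".join(part.strip() for part in experiment.split(","))

assert standardize_experiment("historical,rcp26") == standardize_experiment("historical, rcp26")
```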
Travis CI build is very slow
It takes Travis > 10 min to create the build. The vast majority of the time is taken up by the installation of GDAL 2.0.2, for which there is no prepackaged Ubuntu apt-get install (only GDAL 1.11.3 has one).
Suggest we create our own GDAL 2.x package for Ubuntu. We can publish it to a PPA if we don't want to jump through the hoops to have it included in the public Ubuntu offerings.
Update README for installing dp
Installation instructions needed for installing data processing extras.
Add streamflow API
- Define streamflow API in OpenAPI/Swagger
- Implement API
- This will require accompanying changes to the modelmeta database.
Add security features to streamflow API
Security risks:
- Orders trigger significant computation. It would be easy to mount a DoS attack by submitting a lot of orders.
- Order notification could become a spam bot if emails are not verified.
- Malicious users could cancel other users' orders.
Possible responses (numbers do not correspond to risk enumeration):
- Have a user authorization system. Don't let users modify (or even perhaps see) sensitive resources owned by other users, e.g., orders. This would cover all risks.
- Throttle orders by originating IP address?
- Obfuscate order ids to prevent spoofing of order URLs for cancellation?
  - That would only work if we did not expose the /orders list resource. And that resource is necessary if the app(s) are to be able to recover from various fails (by, amongst other things, reloading lists of orders issued).
  - Maybe we can require the user to provide their email and list only orders notified to that email. This would still leave the door ajar to malicious users who know other users' emails, but it is less of an opening.
See some parts of this discussion for a bit more on this.
More research and discussion needed.