
climate-explorer-backend's People

Contributors

adasungar, basilveerman, cairosanders, corviday, eyvorchuk, helfy18, jameshiebert, nikola-rados, qsparks, rod-glover, sum1lim

Forkers

okanji

climate-explorer-backend's Issues

Some polygons return all 0's in timeseries call

Some calls to the timeseries backend return all zeros. [This](http://docker1.pcic.uvic.ca:20003/api/timeseries?id_=prcptotETCCDI_yr_BCCAQ-ANUSPLIN300-CanESM2_historical-rcp26_r1i1p1_1950-2100&variable=prcptotETCCDI&area=POLYGON+%28%28-123+48,+-123+49.80,+-122+49.80,+-122+48,+-123+48%29%29) works as expected.
area= POLYGON+((-123+48,+-123+49.80,+-122+49.80,+-122+48,+-123+48))

But [this one](http://docker1.pcic.uvic.ca:20003/api/timeseries?id_=prcptotETCCDI_yr_BCCAQ-ANUSPLIN300-CanESM2_historical-rcp26_r1i1p1_1950-2100&variable=prcptotETCCDI&area=POLYGON+%28%28-123+49.8,+-123+49.9,+-122+49.9,+-122+49.8,+-123+49.8%29%29) returns all zeros.
area= POLYGON+((-123+49.8,+-123+49.9,+-122+49.9,+-122+49.8,+-123+49.8))

This appears to be related to the northern extent of the polygon, not to its size or orientation.

Handle differently formatted experiment strings

Climdex files typically have experiment strings in the form of "historical, rcp26", but GCM output files typically have experiment strings in the form "historical,rcp26".

At present, using the data API endpoint requires an exact match on the emissions scenario string, but the Climate Explorer frontend also needs to compare metadata for two files and tell whether they were run with the same emissions scenario. As a result, the frontend converts back and forth between the exact experiment string and a standardized, easily comparable experiment string as needed, which is not great.

It would be much more straightforward if the backend standardized the experiment strings in the metadata it serves, or accepted either format in API queries, or both.

Configurable cache size

The backend currently allocates 10MB for caching masks (so they don't have to be recalculated for new files) and 100MB for caching masked data arrays (so they don't have to be recalculated for new timeslices). These should be configurable parameters that can be changed according to where the server is being run and how big the data files it is working with are.
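A minimal sketch of one way to make these budgets configurable, assuming environment variables are an acceptable mechanism; the variable names below are hypothetical, with defaults matching the current hard-coded values:

import os

# Hypothetical environment-variable overrides for the two cache budgets.
MASK_CACHE_BYTES = int(os.environ.get('CE_MASK_CACHE_BYTES', 10 * 1024 * 1024))
DATA_CACHE_BYTES = int(os.environ.get('CE_DATA_CACHE_BYTES', 100 * 1024 * 1024))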

cell_method filter match() vs fullmatch()

I've discovered that the cell_method filter is not working as intended. It uses re.match() to check for desired cell_methods, but returns True even on a partial match. Change it to re.fullmatch() to ensure we aren't getting mixed data.

This will need to be tagged and released, likely 1.1.1.
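For illustration, a minimal example of the difference between the two functions (the cell_methods string below is made up):

import re

pattern = r"time: mean"
cell_methods = "time: mean within days time: standard_deviation over days"

# re.match() succeeds as long as the pattern matches at the start of the
# string, so this record would (incorrectly) pass the filter.
print(bool(re.match(pattern, cell_methods)))      # True

# re.fullmatch() requires the entire string to match.
print(bool(re.fullmatch(pattern, cell_methods)))  # False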

Missing statistical data

generate_climos.py should create one file per climo averaging period

Add an option in generate_climos.py to output one file per climatological averaging period (month, season, year).

Currently they are all in one file, with a single variable with a non-monotonic time dimension (12 months, 4 seasons, 1 year). This is a problem: ncWMS2 won't accept files with non-monotonic dimensions. Separating into different files solves this, and is formally more correct (the different averaging periods technically are distinct random variables).

metadata API call should expose time bounds

The metadata API request returns information about a dataset's time values. For certain datasets, however (e.g. climatologies and temporal averages), the actual time value represents a time range. In these cases, CF-metadata-compliant files carry a time_bnds (time bounds) variable.

If time bounds are available, we should expose them in this API call.

Files without runs?

In adding tests for the streamflow/watershed API, I discovered that if a file does not have a run associated with it, then the models API endpoint fails. The key code is this:

{dfv.file.run.model.short_name for dfv in ensemble.data_file_variables}

where ensemble is a pycds.Ensemble object.

Setting aside the fact that there are likely alternative queries more robust to missing links between ensemble and model, there is the question of whether we already allow, or plan to allow, files without runs. I am not sure how we resolved this question, so this issue may just be a placeholder for retrieving a previous decision that resolves the problem while I continue working on the watershed API.
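As a stopgap, a hedged sketch of a more defensive version of that comprehension, which simply skips files that have no run; whether skipping is the right behaviour is exactly the open question above:

{
    dfv.file.run.model.short_name
    for dfv in ensemble.data_file_variables
    if dfv.file.run is not None  # skip files with no associated run
}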

Data API call doesn't handle split climatology files

The data API call collects and returns data from all files that match a set of parameters (model, experiment, ensemble, variable, etc).

Previously, it took a time index argument between 0 and 16 to indicate time resolution and position sought, and would return the nth time slice in the file, which worked when all the files were 17-point chronologies.

Using split data files, with separate monthly (0-11), seasonal (0-3), and yearly (0) files, this API call returns a mix of monthly, seasonal, and annual data when called with &time=0, and an IndexError when called with any other time index.

Here's the result for GET /api/data?ensemble_name=downscaled&model=bcc-csm1-1-m&variable=tasmax&emission=historical,+rcp45&area=&time=0:

{
  "r1i1p1": {
    "units": "degC",
    "data": {
      "2085-07-02T00:00:00Z": -0.5808153910727394,
      "2025-07-02T00:00:00Z": -0.8622292459716103,
      "1977-01-16T00:00:00Z": -18.78877913696885,
      "1997-01-16T00:00:00Z": -17.92588300616977,
      "2055-07-02T00:00:00Z": -0.6485861922775469,
      "2055-01-16T00:00:00Z": -15.875643052269846,
      "1986-07-02T00:00:00Z": -2.4333102893767307,
      "1997-07-02T00:00:00Z": -2.077599634237314,
      "2025-01-16T00:00:00Z": -16.226023053533392,
      "2025-01-15T00:00:00Z": -18.02569051912951,
      "2085-01-16T00:00:00Z": -15.60062807695439,
      "1986-01-15T00:00:00Z": -19.950983475060585,
      "1977-07-02T00:00:00Z": -2.671051067797724,
      "1977-01-15T00:00:00Z": -20.599000150601793,
      "2055-01-15T00:00:00Z": -17.825752320828578,
      "1986-01-16T00:00:00Z": -18.487454919078537,
      "1997-01-15T00:00:00Z": -19.534196834187902,
      "2085-01-15T00:00:00Z": -17.498223073165622
    }
  }
}

This appears to be a mix of annual data (July 2 dates), monthly data (January 15), and seasonal data (January 16).
[screenshot from 2017-07-19 15-49-44]

I don't have strong preferences about how exactly this query should behave in the split-file context. It doesn't need to function identically to the old query as long as it's possible to get the data in some reasonably straightforward way. For example, if given the new file structure it makes sense to expect the front end to pass a time resolution (monthly, seasonal, annual) along with the time index, that would be no trouble.

`data` endpoint does not produce multiple climo periods

A call to <backend_url>/data?model=CanESM2&variable=tasmax&emission=rcp26&area=&time=16

results in:

{"r1i1p1": {"units": "K", "data": {"2025-07-02T00:00:00Z": 282.42840576171875}}}

This should produce something more like:

{"r1i1p1": {
  "units": "K", "data": {
    "2025-07-02T00:00:00Z": 282,
    "2055-07-02T00:00:00Z": 283,
    "2075-07-02T00:00:00Z": 284,
  }
}}

This is caused by:

return {
    run.name: {
        'data': {
            timeval.strftime('%Y-%m-%dT%H:%M:%SZ'): getdata(file_) for file_ in get_files_from_run_variable(run, variable)
        },
        'units': get_units_from_run_object(run, variable)
    } for run, timeval in results
}

Which does the following related bad things:

  1. overwrites run.name on each iteration of the outer comprehension
  2. uses the single timeval from the outer loop for every file in the inner dict
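A hedged sketch of one possible restructuring, assuming the query can be made to yield the matching file alongside each (run, timeval) row; getdata and get_units_from_run_object are the helpers used above, while results_with_files is hypothetical:

from collections import defaultdict

# Group every (timeval, file) pair under its run, so repeated runs
# accumulate data points instead of overwriting each other.
rows_by_run = defaultdict(list)
for run, timeval, file_ in results_with_files:  # hypothetical query result
    rows_by_run[run].append((timeval, file_))

return {
    run.name: {
        'data': {
            timeval.strftime('%Y-%m-%dT%H:%M:%SZ'): getdata(file_)
            for timeval, file_ in rows
        },
        'units': get_units_from_run_object(run, variable)
    } for run, rows in rows_by_run.items()
}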

Add security features to streamflow API

Security risks:

  1. Orders trigger significant computation. It would be easy to mount a DoS attack by submitting a lot of orders.
  2. Order notification could become a spam bot if emails are not verified.
  3. Malicious users could cancel other users' orders.

Possible responses (numbers do not correspond to risk enumeration):

  1. Have a user authorization system. Don't let users modify (or even perhaps see) sensitive resources owned by other users, e.g., orders. This would cover all risks.

  2. Throttle orders by originating IP address?

  3. Obfuscate order ids to prevent spoofing of order URLs for cancellation?

    • That would only work if we did not expose the /orders list resource, and that resource is necessary if the app(s) are to be able to recover from various failures (by, among other things, reloading lists of orders issued).
    • Maybe we can require the user to provide their email and list only orders notified to that email. This would still leave the door ajar to malicious users who know other users' emails, but it is less of an opening.

See some parts of this discussion for a bit more on this.

More research and discussion needed.

Generate ensemble statistic files

One comment from the first round of feedback with the MoTI engineers was that they were not interested in individual model runs and would prefer information from an ensemble average (e.g. the "PCIC 12").

We should probably create a simple script that can generate an ensemble average, given a list of some number of files as arguments. It would presumably be a simple call to CDO.
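A minimal sketch of such a script, assuming CDO's ensmean operator is the appropriate aggregation; output path and input list are taken from the command line:

#!/usr/bin/env python
import subprocess
import sys

# Usage: generate_ensemble_mean.py OUTPUT.nc INPUT1.nc INPUT2.nc ...
# Delegates the averaging itself to CDO's ensmean operator.
def main(output_file, input_files):
    subprocess.check_call(['cdo', 'ensmean'] + list(input_files) + [output_file])

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2:])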

Calls to timeseries/data without a polygon raise max recursion depth

To reproduce

Start a dev server with real world data using the production mddb
MDDB_DSN=postgresql://[email protected]/pcic_meta python scripts/devserver.py -p 8004

Timeseries

Request
http://<root_url>/api/timeseries?id_=tmax_monClim_PRISM_historical_run1_198101-201012&variable=tmax&area=
Response
RuntimeError: maximum recursion depth exceeded in __instancecheck__
Full stack trace

Data

Request
http://atlas.pcic.uvic.ca:8004/api/data?model=PRISM&emission=historical&time=0&variable=tmax&area=
Response
RuntimeError: maximum recursion depth exceeded while calling a Python object
Full stack trace

Grid test is too verbose

This commit added a test that checks values precisely against the test input data.

This is too verbose and would need to be rewritten any time the test input data changes. Please loosen the assertions here; some alternatives:

  • Just check that the keys "latitudes" and "longitudes" exist in the dict
  • Just check that the length of the lists is > 0
  • Just check that some of the values in the list are > 0

Essentially, the application should be guaranteed to work with any input data, so having the test assertions too closely coupled with the test input data makes the tests annoying to read and difficult to maintain over time.
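A hedged sketch of what a loosened test might look like; the grid_response fixture is hypothetical and stands in for whatever the existing test uses to obtain the endpoint's result:

def test_grid_shape(grid_response):
    # Check structure and plausibility rather than exact values, so the
    # test survives changes to the test input data.
    assert "latitudes" in grid_response
    assert "longitudes" in grid_response
    assert len(grid_response["latitudes"]) > 0
    assert len(grid_response["longitudes"]) > 0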

Investigate possible memory leak in cache code

When we run the climate-explorer-backend in Docker, its RAM usage slowly creeps up over a number of days:

# docker stats --no-stream
e4d78bc8801e        ceb-latest                         0.02%               6.361GiB / 9.606GiB   66.23%              9.55MB / 99.1MB     66.2MB / 0B         11

We do have some code that intentionally caches results to RAM. It's supposed to limit the cache size to 100 MB; however, it turns out that our tests don't actually cover the cache eviction branch in ce/api/geo.py lines 112-115:

(env) james@basalt:~/code/git/climate-explorer-backend$ py.test -v --cov=ce --cov-report=term-missing ce/tests
=========================================================== test session starts ============================================================
...
----------- coverage: platform linux, python 3.5.2-final-0 -----------
Name                    Stmts   Miss  Cover   Missing
-----------------------------------------------------
ce/api/geo.py             128      7    95%   21, 33, 44, 112-115, 163
...

That section of the code is highly suspect. For this task, write tests to cover cache eviction and investigate whether or not the code is actually working.
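A hedged sketch of the kind of eviction test that is missing; the put/get/size interface shown here is hypothetical and would need to be adapted to whatever ce/api/geo.py actually exposes:

import numpy as np

def test_cache_evicts_when_over_limit(cache):  # hypothetical cache fixture
    # Insert arrays whose combined size exceeds the configured limit...
    big = np.zeros(10 * 1024 * 1024 // 8)  # roughly 10 MB of float64
    for i in range(15):
        cache.put("key{}".format(i), big)
    # ...then check that the cache stayed within its budget and that the
    # oldest entries were actually evicted.
    assert cache.size <= cache.max_size
    assert cache.get("key0") is None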

Timeseries API call sometimes returns all-zero data when given an area parameter

This bug is intermittent and hard to reproduce, but sometimes passing an area polygon as a parameter to the API results in all-zeroes data. It affects multiple data files and data sets.

Here's a link to an example API call demonstrating the error.

The results of that call are:

{
"id": "tasmin_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada", 
"data": {
  "1977-01-15T00:00:00Z": 0.0, 
  "1977-02-15T00:00:00Z": 0.0, 
  "1977-03-15T00:00:00Z": 0.0, 
  "1977-04-15T00:00:00Z": 0.0,  
  "1977-05-15T00:00:00Z": 0.0, 
  "1977-06-15T00:00:00Z": 0.0, 
  "1977-07-15T00:00:00Z": 0.0, 
  "1977-08-15T00:00:00Z": 0.0, 
  "1977-09-15T00:00:00Z": 0.0, 
  "1977-10-15T00:00:00Z": 0.0, 
  "1977-11-15T00:00:00Z": 0.0, 
  "1977-12-15T00:00:00Z": 0.0
  }, 
"units": "degC"
}

The same area parameter passed to a call with a different id (tasmax instead of tasmin) also returns all zeroes:

{
"id": "tasmax_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada", 
"data": {
  "1977-01-15T00:00:00Z": 0.0, 
  "1977-02-15T00:00:00Z": 0.0, 
  "1977-03-15T00:00:00Z": 0.0, 
  "1977-04-15T00:00:00Z": 0.0, 
  "1977-05-15T00:00:00Z": 0.0, 
  "1977-06-15T00:00:00Z": 0.0, 
  "1977-07-15T00:00:00Z": 0.0, 
  "1977-08-15T00:00:00Z": 0.0,
  "1977-09-15T00:00:00Z": 0.0, 
  "1977-10-15T00:00:00Z": 0.0, 
  "1977-11-15T00:00:00Z": 0.0, 
  "1977-12-15T00:00:00Z": 0.0
  },
 "units": "degC"
}

Giving this area as a parameter to the third and final data file in the test collection has the same result.

{"id": "pr_mClim_BCCAQv2_bcc-csm1-1-m_historical-rcp45_r1i1p1_19610101-19901231_Canada", "data": {
  "1977-01-15T00:00:00Z": 0.0, 
  "1977-02-15T00:00:00Z": 0.0, 
  "1977-03-15T00:00:00Z": 0.0, 
  "1977-04-15T00:00:00Z": 0.0, 
  "1977-05-15T00:00:00Z": 0.0, 
  "1977-06-15T00:00:00Z": 0.0, 
  "1977-07-15T00:00:00Z": 0.0, 
  "1977-08-15T00:00:00Z": 0.0, 
  "1977-09-15T00:00:00Z": 0.0, 
  "1977-10-15T00:00:00Z": 0.0, 
  "1977-11-15T00:00:00Z": 0.0, 
  "1977-12-15T00:00:00Z": 0.0
  }, 
"units": "kg m-2 d-1"
}

No error message is generated by these API calls.

There's reasonably good evidence that the values are not zero in the files themselves, in that the ncWMS maps created from the same file show colour variation over the affected polygon.
[screenshot from 2017-07-05 15-54-40]

Passing the same area parameter to an identical copy of the climate explorer backend running against a different metadata database and different data files does not reproduce the error:

{
"units": "K", 
"id": "tasmax_mClim_CanESM2_rcp45_r1i1p1_20100101-20391231", 
"data": {
  "2025-01-15T00:00:00Z": 279.19142659505206, 
  "2025-02-15T00:00:00Z": 280.26951090494794, 
  "2025-03-15T00:00:00Z": 280.49456787109375, 
  "2025-04-15T00:00:00Z": 279.8186442057292, 
  "2025-05-15T00:00:00Z": 278.9789225260417, 
  "2025-06-15T00:00:00Z": 278.13330078125, 
  "2025-07-15T00:00:00Z": 277.2965494791667, 
  "2025-08-15T00:00:00Z": 276.41054280598956, 
  "2025-09-15T00:00:00Z": 276.2767740885417, 
  "2025-10-15T00:00:00Z": 276.3365071614583, 
  "2025-11-15T00:00:00Z": 276.72873942057294, 
  "2025-12-15T00:00:00Z": 277.8912353515625
  }
}

Provide canonical units in API endpoints

While datasets of each variable in active use have matching units, intervariable unit names vary:

1. Temperature
1177 datasets have the units degrees_C; 2056 have degC

2. Precipitation
95 datasets have the units mm d-1; 141 have the units mm day-1; 2319 datasets have the units kg m-2 d-1, which may not be automatically convertible.

Currently, nothing in the front end explicitly requires that units from different variables be formatted similarly in order to compare them (see pacificclimate/climate-explorer-frontend#107 - we had this requirement once, but changed it).

Still, there are several ways that providing canonically formatted units using a package like udunits might be helpful in the future, so it is worth doing.
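A hedged sketch of what a udunits-based normalization could look like, using the cf_units package as one possible wrapper; whether udunits parses every spelling found in our files (e.g. degrees_C) would need to be verified:

from cf_units import Unit

def canonicalize(units_str, canonical='degC'):
    """Return the canonical spelling if udunits considers the two units
    convertible, otherwise return the original string unchanged."""
    try:
        if Unit(units_str).is_convertible(Unit(canonical)):
            return canonical
    except ValueError:
        pass  # udunits could not parse the unit string
    return units_str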

Non-canonical test data files

The test data files for ce are each in their own way(s) different from what nchelpers, index_netcdf, and split_merged_climos expect. One effect of this is that some of the tests in climate-explorer-backend are failing now that we have established a more correct and desirable labelling for files containing seasonal and yearly means.

Most if not all of these datasets appear to be multi-year means, but human expertise has to be applied to determine that.

Question: Which of them, if any, represent datasets the above repos/scripts will have to deal with?

Depending on the answer to this question, either the test files should be replaced with more representative ones, or else the codebase needs some fixing to handle these types of files. Or both.

Details: (files in climate-explorer-backend/ce/tests/data/)

  • anuspline_na.nc:
    • 1 time step: 1950-01-02 00:00:00
    • time variable has no attribute climatology (nor any climatology bounds variable) (therefore nchelpers does not recognize it as a multi-year mean, if that's what it is)
    • what is it???
  • CanESM2-rcp85-tasmax-r1i1p1-2010-2039.nc, indexed as file3:
    • 17 time steps, mid-month, mid-season, mid-year (though not the same as those we use now [CF standard], nor the same as those previously computed by generate_climos)
    • does contain a time bounds variable, properly referenced by time:climatology attribute
    • this file is probably OK (can be split)
  • cgcm.nc, indexed as file0:
    • does contain a time bounds variable, but referenced by time:bounds attribute instead of time:climatology (therefore nchelpers does not recognize it as a multi-year mean)
    • 12 time steps, end of each month
    • history attribute shows formed by cdo ymonmean
  • cgcm-tmin.nc, indexed as file4:
    • history indicates it is a slightly mutated version of cgcm.nc; ditto all comments for it
  • prism_pr_small.nc:
    • 13 time steps, mid-month, mid-year (no seasons)
    • does have a variable named climatology_bounds, but time variable has no attribute climatology referencing it (therefore nchelpers does not recognize it as a multi-year mean)

`multimeta` API call is too slow for moderate number of files

I have an sqlite database which has indexed 660 ClimDEX files, and the multimeta API call takes 30 seconds to return.

$ time curl -s http://localhost:9000/api/multimeta
real    0m29.051s
user    0m0.016s
sys 0m0.000s

PostgreSQL may be more reasonable, but if not, we'll need to do some query optimization for the multimeta call.
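A common cause of this kind of slowdown is an N+1 query pattern: one query for the list of files, then one query per file for its run, variables, timeset, and so on. A hedged sketch of one possible optimization using SQLAlchemy eager loading; the model and relationship names below are illustrative rather than the actual modelmeta ORM:

from sqlalchemy.orm import joinedload

# Pull the related rows in the same query instead of lazily, one file at
# a time, while building the multimeta response.
query = (
    session.query(DataFile)
    .options(
        joinedload(DataFile.run),
        joinedload(DataFile.data_file_variables),
    )
)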

Odd behaviour in make_mask_grid()

make_mask_grid() generates a unique key for a combination of map grid and polygon; it's used to determine whether a new polygon request has been previously made and doesn't need to be recalculated. The key consists of a tuple with:

  • minimum longitude of the grid (float)
  • maximum longitude of the grid (float)
  • number of longitude steps (float)
  • minimum latitude of the grid (float)
  • maximum latitude of the grid (float)
  • number of latitude steps (float)
  • text (wkt) representation of the polygon to be computed (string)

However, during testing today, @nikola-rados was seeing singleton numpy arrays instead of numpy floats for the minimums and maximums, which is not the intended or previously observed behaviour. The obvious cause would seem to be a numpy update (numpy is not pinned in this repository), but we tried testing the steps in the Python console with the newest version of numpy and they behaved as I would have expected.

This issue can be easily fixed by switching to np.min() instead of np_array[0] to determine the minimum of each coordinate, but I'd like to know why it's happening and if there are any other side effects first, rather than blindly patching it. Needs further investigation.

unit consistency checks are not ensemble-constrained

This issue is not causing any current problems, but it is unintuitive behavior and can be quite confusing when debugging.

The data API collects all files that match the parameters passed to it by URL:

  • model
  • ensemble
  • timescale
  • variable

It then constructs a separate timeseries for each run that matches the parameters. Before constructing each run's timeseries, it checks that all variables in that run have the same units, using get_units_from_run_object, filtering datasets by the following parameters:

  • run
  • variable

Notably, get_units_from_run_object does not filter by ensemble. This means that while timeseries are constructed only from the datasets contained in an ensemble, an error will be thrown if any file in the database, even one not in the ensemble, uses incorrect or conflicting units for that variable. This is pretty unintuitive.

Practically speaking, this issue would be 95% resolved if the backend accepted synonymous units using a units library (#105), though get_units_from_run_object and its helpers could also be modified to accept an ensemble and search only within it, to make them behave more intuitively when used by within-ensemble API endpoints to construct timeseries.

Clipping by polygon greatly increases timeseries generation time

Timeseries calls slow down significantly when querying a polygon region.

Using this polygon: POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))

Endpoint     Time w/out poly   Time w/ poly   Times slower
Timeseries   0m0.093s          0m9.163s       98.5
Data         0m0.105s          0m1.128s       10.7
Multistats   0m0.069s          0m1.136s       16.5

Endpoint     Calls to geo.polygonToMask
Timeseries   17
Data         2
Multistats   1

Data

Timeseries w/out polygon

time curl "http://<backend_url>/api/timeseries?id_=tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231&variable=tasmax&area="{"data": {"1986-01-16T00:00:00Z": 280.3414522058824, "1986-02-15T00:00:00Z": 277.9441223144531, "1986-03-16T00:00:00Z": 278.25262451171875, "1986-04-16T00:00:00Z": 279.67120361328125, "1986-05-16T00:00:00Z": 281.3408203125, "1986-06-16T00:00:00Z": 282.7239990234375, "1986-07-16T00:00:00Z": 283.23187255859375, "1986-08-16T00:00:00Z": 282.784912109375, "1986-09-16T00:00:00Z": 281.61785888671875, "1986-10-16T00:00:00Z": 280.39532470703125, "1986-11-16T00:00:00Z": 279.31549072265625, "1986-12-16T00:00:00Z": 278.6036071777344, "1986-04-17T00:00:00Z": 278.25830078125, "1986-07-17T00:00:00Z": 279.75579833984375, "1986-10-17T00:00:00Z": 282.9156494140625, "1987-01-15T00:00:00Z": 280.4423828125, "1986-07-02T00:00:00Z": 280.35418701171875}, "id": "tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231", "units": "K"}
real    0m0.093s

Timeseries w/ polygon

time curl "http://<backend_url>/api/timeseries?id_=tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231&variable=tasmax&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))"
{"data": {"1986-01-16T00:00:00Z": 264.9690372242647, "1986-02-15T00:00:00Z": 275.4643249511719, "1986-03-16T00:00:00Z": 276.2695007324219, "1986-04-16T00:00:00Z": 278.6673583984375, "1986-05-16T00:00:00Z": 283.45538330078125, "1986-06-16T00:00:00Z": 288.6134033203125, "1986-07-16T00:00:00Z": 293.27569580078125, "1986-08-16T00:00:00Z": 293.18048095703125, "1986-09-16T00:00:00Z": 288.39788818359375, "1986-10-16T00:00:00Z": 281.8310546875, "1986-11-16T00:00:00Z": 277.7283630371094, "1986-12-16T00:00:00Z": 275.1181335449219, "1986-04-17T00:00:00Z": 275.0066833496094, "1986-07-17T00:00:00Z": 279.4727478027344, "1986-10-17T00:00:00Z": 291.7232971191406, "1987-01-15T00:00:00Z": 282.6434326171875, "1986-07-02T00:00:00Z": 282.24981689453125}, "id": "tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231", "units": "K"}
real    0m9.163s

Data w/out poly

time curl "http://<backend_url>/api/data?model=CanESM2&variable=tasmax&emission=historical&time=0&area="
{"r1i1p1": {"data": {"1986-01-16T00:00:00Z": 280.3414522058824, "1977-01-16T00:00:00Z": 280.09547334558823}, "units": "K"}}
real    0m0.105s

Data w/ poly

time curl "http://<backend_url>/api/data?model=CanESM2&variable=tasmax&emission=historical&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))&time=0"
{"r1i1p1": {"data": {"1986-01-16T00:00:00Z": 264.9690372242647, "1977-01-16T00:00:00Z": 264.9567440257353}, "units": "K"}}
real    0m1.128s

Multistats w/out poly

time curl "http://<backend_url>/api/multistats?variable=tasmax&emission=historical&area=&time=0"{"tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231": {"median": 282.08111572265625, "time": "1986-07-15T21:10:35Z", "units": "K", "max": 314.3248291015625, "mean": 278.1967468261719, "ncells": 8192, "stdev": 21.291064986096575, "min": 234.7781524658203}, "tasmax_Amon_CanESM2_historical_r1i1p1_19610101-19901231": {"median": 281.9470520019531, "time": "1977-07-15T21:10:35Z", "units": "K", "max": 314.5687561035156, "mean": 277.9090270996094, "ncells": 8192, "stdev": 21.46124322907459, "min": 234.5634307861328}}
real    0m0.069s

Multistats w/ poly

time curl "http://<backend_url>/api/multistats?variable=tasmax&emission=historical&area=POLYGON+((-135.05859375+52.98828125%2C+-131.05468750000003+49.1796875%2C+-125.97656250000001+46.640625%2C+-122.36328125000001+46.640625%2C+-122.4609375+48.7890625%2C+-114.2578125+48.984375%2C+-119.62890625000001+53.37890625%2C+-135.05859375+52.98828125))&time=0"
{"tasmax_Amon_CanESM2_historical_r1i1p1_19710101-20001231": {"median": 274.51959228515625, "time": "1986-07-15T21:10:35Z", "units": "K", "max": 281.11529541015625, "mean": 274.48187255859375, "ncells": 16, "stdev": 5.350947123320987, "min": 266.13177490234375}, "tasmax_Amon_CanESM2_historical_r1i1p1_19610101-19901231": {"median": 273.93328857421875, "time": "1977-07-15T21:10:35Z", "units": "K", "max": 280.8470764160156, "mean": 274.06365966796875, "ncells": 16, "stdev": 5.470714281718363, "min": 265.55706787109375}}
real    0m1.136s

Optimize cache code

In preparation for using the climate-explorer-backend to drive Plan2Adapt (P2A), I've been taking a look at the performance of the various queries.

P2A makes a large number of calls (slightly over a thousand) to the /stats API, so small slowdowns add up to a lot.

Turns out that the cache is slow. Like really slow. To the point where it would be faster if we simply turned off the cache.

Why? In order to control the size of the cache, it recursively counts the size of every object it contains. Doing this over and over adds up to a huge performance penalty. Over the course of 1 P2A run, simply measuring the size of the cache took around 43% of the run time while fetching the actual data from disk only took 21% of the run time.

I think that it would be fairly easy to have the cache keep track of its own size and just increment/decrement when something is added/expired. That would presumably give us an almost 2x speedup.
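A hedged sketch of the bookkeeping described above; the backend's real cache has its own interface, so the class below is illustrative only:

import sys
from collections import OrderedDict

class SizeTrackedCache:
    """Cache that tracks its total size incrementally instead of
    recursively re-measuring every entry on each access."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._store = OrderedDict()

    def put(self, key, value, nbytes=None):
        # Measure the entry once on insertion (numpy arrays expose .nbytes).
        nbytes = nbytes if nbytes is not None else sys.getsizeof(value)
        self._store[key] = (value, nbytes)
        self.current_bytes += nbytes
        # Evict oldest entries until we are back under budget.
        while self.current_bytes > self.max_bytes and self._store:
            _, (_, evicted_bytes) = self._store.popitem(last=False)
            self.current_bytes -= evicted_bytes

    def get(self, key):
        entry = self._store.get(key)
        return entry[0] if entry is not None else None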

Flat data from timeseries API call when called with a spatial extent

The timeseries API call, when passed a spatial extent to calculate values over, returns values that do not vary over the year:

{
  "id": "tasmin_mClim_BCCAQv2_CanESM2_historical-rcp26_r1i1p1_19610101-19901231_Canada",
  "units": "degC",
  "data": {
    "1977-01-15T00:00:00Z": 4.026229503566938,
    "1977-02-15T00:00:00Z": 4.026229503566938,
    "1977-03-15T00:00:00Z": 4.026229503566938,
    "1977-04-15T00:00:00Z": 4.026229503566938,
    "1977-05-15T00:00:00Z": 4.026229503566938,
    "1977-06-15T00:00:00Z": 4.026229503566938,
    "1977-07-15T00:00:00Z": 4.026229503566938,
    "1977-08-15T00:00:00Z": 4.026229503566938,
    "1977-09-15T00:00:00Z": 4.026229503566938,
    "1977-10-15T00:00:00Z": 4.026229503566938,
    "1977-11-15T00:00:00Z": 4.026229503566938,
    "1977-12-15T00:00:00Z": 4.026229503566938
  }
}

The same query run over the entire spatial extent returns expected seasonal variation:

{
  "id": "tasmin_mClim_BCCAQv2_CanESM2_historical-rcp26_r1i1p1_19610101-19901231_Canada",
  "units": "degC",
  "data": {
    "1977-01-15T00:00:00Z": -29.574449355153153,
    "1977-02-15T00:00:00Z": -28.212806823529412,
    "1977-03-15T00:00:00Z": -23.460362097196857,
    "1977-04-15T00:00:00Z": -14.638448772434755,
    "1977-05-15T00:00:00Z": -5.053808769799956,
    "1977-06-15T00:00:00Z": 2.690855450688279,
    "1977-07-15T00:00:00Z": 6.272098836979196,
    "1977-08-15T00:00:00Z": 5.254183921397172,
    "1977-09-15T00:00:00Z": -0.013880960970580513,
    "1977-10-15T00:00:00Z": -8.014392844233534,
    "1977-11-15T00:00:00Z": -17.73349189331108,
    "1977-12-15T00:00:00Z": -25.708370434857155
  }
}

Provide spatial coverage information

At present, the backend provides no information about the spatial coverage of a file to the frontend. Therefore, if there are two datasets identical in every way except spatial coverage (for example, if one dataset is all-Canada and the other is BC-only), the metadata the backend provides to the front end makes it impossible for the front end to distinguish between them.

When the frontend selects a dataset to display for a user, it uses the parameters selected by the user: start date, end date, model, experiment, variable. It is possible that different display components would select different datasets that share these parameters but have different spatial extents, so the map might show a different area than the graphs, or the graphs might visualize different datasets, leading to silently incorrect data.

TravisCI doesn't run data prep tests

Explanation in .travis.yml:

We only run the ce tests because installing cdo is just too hard: NetCDF support isn't built into the
debian cdo packages. Once that is resolved (i.e., install the NetCDF support before installing cdo),
we [can] run the data prep tests too.

generate_climos fails on files with 360_day calendar

Example: Processing /storage/data/climate/CMIP5/output/NIMR-KMA/HadGEM2-AO/historical/day/atmos/day/r1i1p1/v20130422/tasmin/tasmin_day_HadGEM2-AO_historical_r1i1p1_18600101-20051230.nc:

2017-05-09 22:02:58 INFO: Generating climo period 1961-01-01 to 1990-12-30
2017-05-09 22:02:58 INFO: Selecting temporal subset
2017-05-09 22:03:10 INFO: Forming climatological means
Traceback (most recent call last):
  File "generate_climos.py", line 372, in <module>
    main(args)
  File "generate_climos.py", line 356, in main
    convert_longitudes=args.convert_longitudes, split_vars=args.split_vars)
  File "generate_climos.py", line 179, in create_climo_files
    update_climo_time_meta(climo_means)
  File "generate_climos.py", line 314, in update_climo_time_meta
    climo_bnds_var[:] = date2num(climo_bounds, cf.time_var.units, cf.time_var.calendar)
  File "netCDF4/_netCDF4.pyx", line 5265, in netCDF4._netCDF4.date2num (netCDF4/_netCDF4.c:65355)
  File "netcdftime/_netcdftime.pyx", line 795, in netcdftime._netcdftime.utime.date2num (netcdftime/_netcdftime.c:14358)
ValueError: there are only 30 days in every month with the 360_day calendar

This occurs for all model output files under /storage/data/climate/CMIP5/output/NIMR-KMA/ and /storage/data/climate/CMIP5/output/MOHC/.

generate_climo scales precip units incorrectly for packed files

generate_climo can (under control of a flag) convert precipitation units from per-second to per-day. Currently this conversion is done by multiplying by a scale factor of 86400 (s/day). This is correct for absolute (unpacked) files, but fails for packed files where the offset value is non-zero.

Solution is to recognize packed files and apply a different computation to them taking into account the scale and offset values.
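A hedged sketch of the arithmetic involved: for a packed variable, the true value is scale_factor * packed + add_offset, so converting per-second values to per-day means scaling both attributes rather than the stored integers. The helper below assumes a netCDF4 variable carrying those two attributes (which is also one way to recognize a packed file):

SECONDS_PER_DAY = 86400

def rescale_packed_to_per_day(var):
    # true_per_day = 86400 * (scale_factor * packed + add_offset)
    #              = (86400 * scale_factor) * packed + (86400 * add_offset)
    var.scale_factor = var.scale_factor * SECONDS_PER_DAY
    var.add_offset = var.add_offset * SECONDS_PER_DAY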

Question: Is there a standard for how packed files are defined that we can adhere to, or is it ad hoc?

Provide HTTP cache support

There are numerous methods in HTTP to provide support for caching:

  1. Last-Modified header
  2. ETags
  3. max-age

Our data changes almost never, so we could easily provide some very aggressive and particularly accurate caching control, using the mod times of the underlying data files. If we do this, we could provide much better performance and reduce the system load by a ton.
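A hedged sketch of what that could look like for a Flask/Werkzeug response, assuming the handler knows which data file backs the result; the helper name is hypothetical:

import os
from datetime import datetime, timezone

def add_cache_headers(response, data_file_path):
    # Use the data file's modification time as an accurate validator.
    mtime = os.path.getmtime(data_file_path)
    response.last_modified = datetime.fromtimestamp(mtime, timezone.utc)
    response.set_etag(str(mtime))
    # The data changes almost never, so a long max-age is safe.
    response.cache_control.max_age = 86400
    return response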

Travis CI build is very slow

It takes Travis > 10 min to create the build. The vast majority of the time is taken up by installing GDAL 2.0.2, for which there is no prepackaged Ubuntu apt package (only GDAL 1.11.3 has one).

Suggest we create our own GDAL 2.x package for Ubuntu. We can publish it to a PPA if we don't want to jump through the hoops to have it included in the official Ubuntu repositories.

Polygon clipping is still slow

Polygon clipping on large grids (BCCAQ scale... 10km over Canada) remains pretty slow (on the order of 5-10 seconds). This needs another round of optimization.

Some calls to data endpoint return all zeros

Some data API calls are still returning incorrect information.

Some of the requests that @jameshiebert and I noted as returning zeros now return expected data, but others still don't.

Using a constant polygon:

POLYGON="POLYGON+((-127.9296875+45.76171875%2C+-127.9296875+54.6484375%2C+-118.65234375000001+54.6484375%2C+-118.65234375000001+45.76171875%2C+-127.9296875+45.76171875))"

Works with January:

curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp85&area='$POLYGON'&time=0' | python -m json.tool
{
    "r1i1p1": {
        "data": {
            "2025-01-16T00:00:00Z": 94.53663330078125,
            "2055-01-16T00:00:00Z": 101.20841064453126,
            "2085-01-16T00:00:00Z": 111.72054443359374
        },
        "units": "mm"
    },
    "r2i1p1": {
        "data": {
            "2025-01-16T00:00:00Z": 98.94828491210937,
            "2055-01-16T00:00:00Z": 102.28649291992187,
            "2085-01-16T00:00:00Z": 114.10343017578126
        },
        "units": "mm"
    },
    "r3i1p1": {
        "data": {
            "2025-01-16T00:00:00Z": 93.01469116210937,
            "2055-01-16T00:00:00Z": 103.7731689453125,
            "2085-01-16T00:00:00Z": 112.4370849609375
        },
        "units": "mm"
    },
    "r4i1p1": {
        "data": {
            "2025-01-16T00:00:00Z": 95.85071411132813,
            "2055-01-16T00:00:00Z": 102.0443115234375,
            "2085-01-16T00:00:00Z": 115.915771484375
        },
        "units": "mm"
    }
}

Not for any other timestep:

curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp85&area='$POLYGON'&time=1' | python -m json.tool
{
    "r1i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    },
    "r2i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    },
    "r3i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    },
    "r4i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    }
}

Calls for the same model with rcp26 appear to work for other times of year:

curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx5dayETCCDI&emission=rcp26&area='$POLYGON'&time=1' | python -m json.tool
{
    "r1i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 55.08775024414062,
            "2055-02-15T00:00:00Z": 58.02188720703125,
            "2085-02-15T00:00:00Z": 54.56318969726563
        },
        "units": "mm"
    },
    "r2i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 59.523980712890626,
            "2055-02-15T00:00:00Z": 59.92850952148437,
            "2085-02-15T00:00:00Z": 58.3666015625
        },
        "units": "mm"
    },
    "r3i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 53.345208740234376,
            "2055-02-15T00:00:00Z": 57.764373779296875,
            "2085-02-15T00:00:00Z": 52.764739990234375
        },
        "units": "mm"
    },
    "r4i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 56.85017700195313,
            "2055-02-15T00:00:00Z": 59.44209594726563,
            "2085-02-15T00:00:00Z": 54.10572509765625
        },
        "units": "mm"
    },
    "r5i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 54.548590087890624,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    }
}

But some returned results are 0 when they should not be.

The same issue persists for other variables:

curl -s 'http://docker1.pcic.uvic.ca:9000/api/data?model=CSIRO-Mk3-6-0&variable=rx1dayETCCDI&emission=rcp26&area='$POLYGON'&time=1' | python -m json.tool
{
    "r1i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 19.7260986328125,
            "2055-02-15T00:00:00Z": 22.774565124511717,
            "2085-02-15T00:00:00Z": 21.199964904785155
        },
        "units": "mm"
    },
    "r2i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 20.879574584960938,
            "2055-02-15T00:00:00Z": 22.38013153076172,
            "2085-02-15T00:00:00Z": 21.877273559570312
        },
        "units": "mm"
    },
    "r3i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 21.111979675292968,
            "2055-02-15T00:00:00Z": 22.47863006591797,
            "2085-02-15T00:00:00Z": 19.134806823730468
        },
        "units": "mm"
    },
    "r4i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    },
    "r5i1p1": {
        "data": {
            "2025-02-15T00:00:00Z": 0.0,
            "2055-02-15T00:00:00Z": 0.0,
            "2085-02-15T00:00:00Z": 0.0
        },
        "units": "mm"
    }
}

The back end does not show any errors for these requests.

Add streamflow API

  1. Define streamflow API in OpenAPI/Swagger
  2. Implement API
    • This will require accompanying changes to the modelmeta database.

metadata API endpoint should include climatological information

The metadata API endpoint should be extended with the following properties:

  • multi_year_mean: from TimeSet.multi_year_mean
  • start_date, end_date: from TimeSet.start_date, TimeSet.end_date, which now contain climo bounds for multi-year mean files (see pacificclimate/modelmeta#12)
  • timescale: from TimeSet.timescale (see also issue pacificclimate/modelmeta#9)

All values derived from TimeSet are None if there is no such record associated with the DataFileVariable for the requested file.

Use modelmeta database version 12f290b63791

modelmeta will shortly be enhanced to support both gridded (existing) and discrete sampling geometry (new) data files.

This will change the modelmeta ORM and database contents. CE backend needs to handle this.

Questions:

  1. Is it possible and worth it to handle both old and new database versions? Probably looks like:
    1. Install variant version of modelmeta in your CE backend environment
    2. In code, inspect contents of table alembic_version.
    3. In code, import and use variant ORM classes depending on version discovered. This sounds tricky.
    4. Would this be worth the trouble?
  2. If not, how do we want to handle the asynchronous changeover from old to new database version? What are the consequences (apart from the obvious) of a mismatch between database and CE backend?

multistat API call doesn't handle split climatology files

Currently the multistat API call accepts an integer timeidx parameter, along with a set of filtering parameters: ensemble, model, emission, and variable.

It returns a set of statistics calculated from the timeidx-th slice of each file that fit the rest of the parameters.

It would be nice if it were possible to filter on timescale (monthly, seasonal, yearly) in addition to ensemble, model, emission, and variable, so that, for example, you could pass timescale=monthly&timeidx=0 to get only the statistical calculations corresponding to January, or timescale=yearly&timeidx=0 to get annual statistics.

It's completely possible to sort out which timeidx=0 results are Januaries, which are winters, and which are annual in the front end, but I think it would make more sense to have the API do that filtering. It would more closely match the revised functioning of the data query, and be more intuitive to a hypothetical new programmer, I imagine.

stats api doesn't handle non-multi-year-mean datasets

The stats query is written to select a single timestep of data and calculate statistics about it. This is the correct behaviour for multi-year-mean (MYM) datasets, where each timestep actually represents something like thirty years of January, thirty years of spring, or thirty years annual data depending on the resolution and query parameters: a time index and a resolution.

This is wrong for non-MYM datasets. At present, all the non-MYM datasets we're supporting are annual. So the stats query will receive timeres=yearly&timeidx=0, and return statistics about the 0th timeslice of an annual nonMYM dataset, which would (usually) be 1950, not representative of the whole period.

Stats needs to detect when it is working with a nonMYM file and, in the case of annual time resolutions, return statistics across all time slices. For completeness, it should probably return stats from every twelfth timeslice for monthly nonMYM files, but we aren't using them now and don't plan to.

Multimeta API call should expose time bounds

The single-file metadata API call now includes start_date, end_date, and multi_year_mean metadata about each dataset, as requested in #44. It would be convenient if the multi-file multimeta API call exposed them too, but I'm open to discussion about the necessity of this.

At present, the climate explorer frontend calls multimeta once on loading (line 17) to get information on all available datasets, which it uses to populate selectors and user-facing widgets with what datasets are available for examination.

In order to be able to show the user what each dataset's climatological period is, either multimeta could provide that information, or climate-explorer-frontend could hit the single-file metadata call once for each file in the user-selected current group (i.e., filtered by model, variable, and sometimes experiment). I think having multimeta provide the information up front would be a little nicer.

/metadata endpoint has confusing parameter names

As a legacy from when all time resolutions (monthly, yearly, seasonal) were in a single file, the /metadata API endpoint takes a model_id parameter that is actually what all the other endpoints call id_, the unique ID of the datafile. Other queries use model to represent the GCM that generated the datafile.

Once upon a time, model_id and id_ were synonymous, as all the data from a model was in a single datafile, but they are not now.

(If we fix this, we should be sure to let the MOTI folks developing for our API know, since it might change under them.)

Make using the API easier

There are a few minor things about the API code that can trip up a new developer writing code that runs against the API, and they could be easily fixed:

  • The default ensemble on queries that require one is ce, which no longer exists. Should be ce_files
  • Documentation on many queries still describes the time argument as a value from 1 to 17, when queries have been updated to accept a resolution and an index
  • Overall documentation review, make sure it's all up to date

Inconsistent units associated with txxETCCDI variable

Some txxETCCDI files have degrees_c as their unit; others degC. The /data query quite sensibly refuses to construct a timeseries from files with different units, and returns a 500 error.

Solutions might be:

  • use a list of "synonymous" units (degC / degrees_c for temperature and mm / kg d-1 m-2 for precipitation)
  • standardize the units in existing files and reindex them, if it's only a few

I haven't investigated thoroughly, and it's possible that this error is actually just a side effect of the /data query's present erroneous attempt to construct a timeseries from both nominal-time and multi-year-mean datasets, in which case this bug will go away when PR 82 is merged.

EDIT: PR 82 has been merged, but did not solve this bug.
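A hedged sketch of the synonymous-units option suggested above; the table below is illustrative, not exhaustive:

# Map known alternative spellings onto a canonical form before comparing.
UNIT_SYNONYMS = {
    'degrees_c': 'degC',
    'degrees_C': 'degC',
    'kg d-1 m-2': 'mm',
}

def canonical_units(units):
    return UNIT_SYNONYMS.get(units, units)

def units_match(a, b):
    return canonical_units(a) == canonical_units(b)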
