
climate-explorer-data-prep's People

Contributors

basilveerman, cairosanders, corviday, eyvorchuk, jameshiebert, nikola-rados, rod-glover, sum1lim


climate-explorer-data-prep's Issues

Precompute GDD, HDD, FFD, Snowfall from GCM outputs

Precompute data files of the following variables, for each model output file available (raw GCM and downscaled). Descriptions below are taken from Plan2Adapt:

  1. Growing Degree-Days (GDDs) is a derived variable that indicates the amount of heat energy available for plant growth, useful for determining the growth potential of crops in a given area. It is calculated by multiplying the number of days that the mean daily temperature exceeded 5°C by the number of degrees above that threshold. For example, if a given day saw an average temperature of 8°C (3°C above the 5°C threshold), that day contributed 3 GDDs to the total. If a month had 15 such days, and the rest of the days had mean temperatures below the 5°C threshold, that month would result in 45 GDDs. (A computational sketch of this arithmetic follows this list.)

  2. Heating Degree-Days (HDDs) is a derived variable that can be useful for indicating energy demand (i.e. the need to heat homes, etc.). It is calculated by multiplying the number of days that the average (mean) daily temperature is below 18°C by the number of degrees below that threshold. For example, if a given day saw an average (mean) temperature of 14°C (4°C below the 18°C threshold), that day contributed 4 HDDs to the total. If a month had 15 such days, and the rest of the days had average (mean) temperatures above the 18°C threshold, that month would result in 60 HDDs.

  3. Frost-free days is a derived variable referring to the number of days that the minimum daily temperature stayed above 0°C, useful for determining the suitability of growing certain crops in a given area. The method used to compute this on a monthly basis is from (Wang et al, 2006).

  4. 'Precipitation as Snow' is a derived variable, calculated from GCM projected total precipitation (rain and snow) as well as temperature as per (Wang et al, 2006).
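A minimal sketch of the degree-day arithmetic described in items 1 and 2 above, assuming a plain array of daily mean temperatures in °C; the function and variable names are illustrative, not part of any existing script:

import numpy as np

def degree_days(tmean_daily, threshold, kind):
    # kind='growing': sum of degrees above the threshold (e.g. 5 C for GDD).
    # kind='heating': sum of degrees below the threshold (e.g. 18 C for HDD).
    tmean_daily = np.asarray(tmean_daily, dtype=float)
    if kind == 'growing':
        excess = tmean_daily - threshold
    elif kind == 'heating':
        excess = threshold - tmean_daily
    else:
        raise ValueError("kind must be 'growing' or 'heating'")
    return np.clip(excess, 0, None).sum()

# The worked examples from the descriptions above:
# one 8 C day contributes 3 GDDs; fifteen such days give 45 GDDs.
assert degree_days([8.0], 5.0, 'growing') == 3.0
assert degree_days([8.0] * 15 + [2.0] * 15, 5.0, 'growing') == 45.0
# one 14 C day contributes 4 HDDs; fifteen such days give 60 HDDs.
assert degree_days([14.0] * 15 + [20.0] * 15, 18.0, 'heating') == 60.0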

References

Wang, T.L., Hamann, A., Spittlehouse, D.L. and Aitken, S.N., 2006. "Development of scale-free climate data for Western Canada for use in resource management", International Journal of Climatology, 26: 383-397. Details the ClimateBC empirical downscaling tool.

prsn data should be in cm, not mm

According to Trevor, precipitation as snow is normally given in centimeters, so having our prsn data in centimeters as well (instead of mm or kg/m2/day) will better communicate that we are dealing with snow.

Generate backwards-compatible frequency values

We've updated the frequency attribute of multi-year climatologies to include the operation used to aggregate data for the climatology, so files that might previously have had the frequency value sClim will now be generated with sClimMean or sClimSD.

I can see this posing a problem when we need to re-create an already-existing climatology. For example, if a mistake is discovered in the datafile txxETCCDI_sClim_BCCAQ_MRI-CGCM3_historical-rcp85_r1i1p1_20700101-20991231 and it needs to be recreated, the recreation will have the unique_id txxETCCDI_sClimMean_BCCAQ_MRI-CGCM3_historical-rcp85_r1i1p1_20700101-20991231 and the indexer won't realize this file is an update of the previous one, possibly resulting in weird bugs when they are both present in the database.

Possible options:

  • Modify the indexer to understand that a unique_id with the frequency value whateverMean matches one with the frequency value whatever (see the sketch below)
  • Add a generate-backwards-compatible-frequency-values flag to the generate_climos script
  • Do nothing, because this problem probably won't come up very often, and either of those solutions is more headache than it's worth
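For the first option, a minimal sketch of how the indexer might normalize frequency values when matching unique_ids; the helper name and regex are assumptions, not existing code:

import re

def normalize_frequency(frequency):
    # Treat e.g. 'sClimMean' and 'sClimSD' as equivalent to the legacy 'sClim'.
    return re.sub(r'(Clim)(Mean|SD)$', r'\1', frequency)

assert normalize_frequency('sClimMean') == 'sClim'
assert normalize_frequency('sClimSD') == 'sClim'
assert normalize_frequency('sClim') == 'sClim'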

generate_prsn not producing correct fill values

The files output by the snowfall generation script (generate_prsn) have a strange issue with their fill values. The values themselves are the same as they are in the parent pr netCDF, -32767. However, they do not mask appropriately, displaying the number rather than an _. Furthermore, the metadata states that the fill value should be 32768.

I assume the issue is occurring somewhere in create_prsn_netcdf_from_source(...) but cannot say for sure. This needs to be explored further.

Non-monotonic longitudes in netCDF file

NetCDF files are sometimes generated with longitudes that go from 0 to 180 and then from -180 to 0, like this:

ncdump -v lon tasmax_aClim_CanESM2_historical_r3i1p1_19610101-19901231.nc 

netcdf tasmax_aClim_CanESM2_historical_r3i1p1_19610101-19901231 {
dimensions:
    ...

// global attributes:
   ...

data:

 lon = 0, 2.8125, 5.625, 8.4375, 11.25, 14.0625, 16.875, 19.6875, 22.5, 
    25.3125, 28.125, 30.9375, 33.75, 36.5625, 39.375, 42.1875, 45, 47.8125, 
    50.625, 53.4375, 56.25, 59.0625, 61.875, 64.6875, 67.5, 70.3125, 73.125, 
    75.9375, 78.75, 81.5625, 84.375, 87.1875, 90, 92.8125, 95.625, 98.4375, 
    101.25, 104.0625, 106.875, 109.6875, 112.5, 115.3125, 118.125, 120.9375, 
    123.75, 126.5625, 129.375, 132.1875, 135, 137.8125, 140.625, 143.4375, 
    146.25, 149.0625, 151.875, 154.6875, 157.5, 160.3125, 163.125, 165.9375, 
    168.75, 171.5625, 174.375, 177.1875, -180, -177.1875, -174.375, 
    -171.5625, -168.75, -165.9375, -163.125, -160.3125, -157.5, -154.6875, 
    -151.875, -149.0625, -146.25, -143.4375, -140.625, -137.8125, -135, 
    -132.1875, -129.375, -126.5625, -123.75, -120.9375, -118.125, -115.3125, 
    -112.5, -109.6875, -106.875, -104.0625, -101.25, -98.4375, -95.625, 
    -92.8125, -90, -87.1875, -84.375, -81.5625, -78.75, -75.9375, -73.125, 
    -70.3125, -67.5, -64.6875, -61.875, -59.0625, -56.25, -53.4375, -50.625, 
    -47.8125, -45, -42.1875, -39.375, -36.5625, -33.75, -30.9375, -28.125, 
    -25.3125, -22.5, -19.6875, -16.875, -14.0625, -11.25, -8.4375, -5.625, 
    -2.8125 ;
}

There is no geographic discontinuity in this file, but there is a numerical discontinuity. Some software tools, such as ncWMS and CDO, have trouble working with the bounding boxes of polygons that span the -180/180 longitude line, which end up with positive longitude minimums and negative longitude maximums.

ncWMS's response to requesting a map for an area that crosses the numerical discontinuity is:

<ServiceExceptionReport version="1.3.0" xsi:schemaLocation="http://www.opengis.net/ogc http://schemas.opengis.net/wms/1.3.0/exceptions_1_3_0.xsd"><ServiceException>
        Invalid bounding box format
    </ServiceException></ServiceExceptionReport>

Currently we anticipate no concrete repercussions from this issue, since all our use cases involve displaying maps of Canada, which does not go anywhere near the antimeridian. So it's low priority.
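If a fix is ever needed, one option would be to remap the coordinate to a monotonic 0–360 range. A rough numpy sketch, not part of any existing script:

import numpy as np

def is_monotonic(lon):
    lon = np.asarray(lon)
    return bool(np.all(np.diff(lon) > 0) or np.all(np.diff(lon) < 0))

def to_0_360(lon):
    # Map longitudes from [-180, 180) to [0, 360); for a 0..180 then -180..0
    # sequence like the one above, this is monotonic without reordering data.
    lon = np.asarray(lon)
    return np.where(lon < 0, lon + 360.0, lon)

lon = np.array([0.0, 2.8125, 177.1875, -180.0, -177.1875, -2.8125])
assert not is_monotonic(lon)
assert is_monotonic(to_0_360(lon))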

Excluded tests are broken

The test suite passes on Travis but will fail on local machines. This is because only 3 out of 5 test files are run by Travis (test_units_helpers.py, test_update_metadata.py, test_decompose_flow_vectors.py). Running pytest with the excluded files (test_split_merged_climos.py, test_create_climo_files.py) breaks with a collection of similar errors:

tests/test_split_merged_climos.py:69: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dp/split_merged_climos.py:51: in split_merged_climos
    output_filepath = os.path.join(outdir, cf.cmor_filename)
/tmp/venv/lib/python3.6/site-packages/nchelpers/decorators.py:26: in wrapper
    res = func(*args, **kwargs)
/tmp/venv/lib/python3.6/site-packages/nchelpers/__init__.py:351: in __getattribute__
    value = super(CFDataset, self).__getattribute__(name)
/tmp/venv/lib/python3.6/site-packages/nchelpers/__init__.py:1568: in cmor_filename
    extension='.nc', **self._cmor_type_filename_components()
/tmp/venv/lib/python3.6/site-packages/nchelpers/__init__.py:1494: in _cmor_type_filename_components
    components.update(ensemble_member=self.ensemble_member)
/tmp/venv/lib/python3.6/site-packages/nchelpers/decorators.py:26: in wrapper
    res = func(*args, **kwargs)
/tmp/venv/lib/python3.6/site-packages/nchelpers/__init__.py:351: in __getattribute__
    value = super(CFDataset, self).__getattribute__(name)
/tmp/venv/lib/python3.6/site-packages/nchelpers/__init__.py:562: in ensemble_member
    components[component] = getattr(self.gcm, attr)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <nchelpers.CFDataset.AutoGcmPrefixedAttribute object at 0x7fb510b52748>, attr = 'realization'

    def __getattr__(self, attr):
        prefixed_attr = self._prefixed(attr)
        try:
            return getattr(self.dataset, prefixed_attr)
        except AttributeError:
            raise CFAttributeError(
                "Expected file to contain attribute '{}' but no such "
>               "attribute exists".format(self._prefixed(attr)))
E           nchelpers.exceptions.CFAttributeError: Expected file to contain attribute 'GCM__realization' but no such attribute exists

In general, an attribute named [some prefix]__realization is being accessed but does not exist.

One of three things could be done:

  1. Fix the broken tests
  2. Remove the broken tests
  3. Comment/Explain why the tests are broken/excluded

Correct computation of climatologies of min/max climate index variables

We have been computing climatologies for climate index variables involving minimum and maximum values incorrectly. Specifically, this is known to be a problem for the variables rx1day and rx5day. There may be others.

The climatology script presently computes multi-decadal monthly, seasonal, and annual means by applying the CDO operators ymonmean, yseasmean, and timmean (respectively) to the data file, regardless of the variable. All three of these operators take averages of the variable values both within the intra-year interval (month, season, year) and across the multi-decadal period for each such interval.

For climate index variables such as rx1day and rx5day, it is incorrect to take averages within the intra-year interval. Instead, an operator appropriate to the type of value must be applied, namely the maximum of the values within the interval. For other variables, the intra-year interval operator may have to be different; e.g., for a variable that involves minimum values, the operator would likely have to take minimums within the intra-year interval.

(Note: As our base datasets for rx1day and rx5day have monthly resolution, the problem only arises for the seasonal and annual climatologies; mean and max are the same for a 1-item (1-month) interval. But we should fix it generically, because there is no guarantee that this won't be applied to some dataset with sub-monthly temporal resolution.)
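A rough sketch of how the climatology step could pick an intra-year operator per variable, chaining standard CDO operators via the command line; the mapping and function name are assumptions, not the current generate_climos code:

import subprocess

# Hypothetical mapping from a variable's statistic to the intra-year operator.
INTRA_YEAR_OP = {'mean': 'mean', 'maximum': 'max', 'minimum': 'min'}

def seasonal_climatology(infile, outfile, statistic='mean'):
    # First aggregate within each season of each year with the appropriate
    # operator (e.g. seasmax for rx1day/rx5day), then average those per-year
    # seasonal values across the climatological period (yseasmean).
    op = INTRA_YEAR_OP[statistic]
    subprocess.run(
        ['cdo', 'yseasmean', '-seas{}'.format(op), infile, outfile],
        check=True,
    )

# e.g. seasonal_climatology('rx5day_19610101-19901231.nc',
#                           'rx5day_sClimMean.nc', statistic='maximum')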

The following is an excerpt from an email chain discussing the problem.

[[REG]] Then we take climatological means of both these variables, meaning we take 30-year means of monthly, seasonal, and annual averages of the variables. It's not clear to me which of these values is meaningful and/or useful to our clients.

[[TQM]] Almost. We don’t take 30-year means of monthly, seasonal, and annual AVERAGES rather we take 30-year means of monthly, seasonal, or annual MAXIMA

[[REG]] Maybe that's what we SHOULD be doing, but our data preparation script currently forms 30-year means of monthly, seasonal, and annual AVERAGES. Specifically, for a time series with delta-t = 1 month, the seasonal and annual means are 30-year means of the MEANS of that time series over the indicated period (seasonal, annual) within each year. I think that you are saying (and it makes sense to me), that this should be 30-year means of the MAXIMA over the indicated period. If so, we need to change our climatological-values script to process these variables correctly. And we need to establish exactly which ones get this treatment, which get the "means of means" treatment, and which, if any, get some other treatment.

[[TQM]] It’s fine to take a climatological mean in the sense of averaging over 30 years – as long as you’re doing that last. But annual RX1day is the maximum of the 12 monthly maxima, summer RX1day is the maximum of the 3 monthly maxima. If you are instead averaging where I’m saying that you should take a maximum, that variable isn’t a thing – there’s nothing else we can call it. It should never be calculated that way and certainly never be displayed – it’s quite misleading since it’s similar but different to something we do produce on a regular basis. The reason that this thing that is now being computed doesn’t have a name is because from a user perspective it’s meaningless. RX1day June is the wettest day in June, RX1day July is wettest day in July, RX1day August is wettest day in August. RX1day summer is wettest day in summer – that HAS to be the maximum of the three monthly maximums. The average of those 3 values just doesn’t measure anything since individual months in the same season can be quite different from each other. The average of all months’ RX1day values doesn’t tell us anything.

update_metadata: Need to correct attributes of CLIMDEX variables

This is an epic (overarching issue or user story).

Problem:

All of the CLIMDEX files formed from BCCAQ (ver 1) downscaled GCM data have the following two problems. It seems likely that other CLIMDEX files generated at PCIC do as well.

  1. attribute cell_methods is absent or else = "time: maximum" (this is incorrect for many indices)
  2. attribute long_name = CLIMDEX index abbreviation, not the long name

In other BCCAQ CLIMDEX files there are likely similar problems.

Proposed solution:

Add features to update_metadata that make it possible to write updates along the following lines:

<dependent variable name>:
    cell_methods: = cell_method_for(<dependent variable name>)
    standard_name: = standard_name_for(<dependent variable name>)

This innocent bit of specification requires the following features:

Degree day annual data should be the sum of degree day seasonal data

I generated degree day climatologies, but I think the usual climatology approach, where an annual value is the mean of the seasonal values, is meaningless for an accumulative value like degree days. I think we want annual values to be the sum of seasonal values. I need to update generate_climos and redo that data.
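A one-line illustration of the difference, using made-up seasonal degree-day values for a single cell:

import numpy as np

# Hypothetical seasonal GDD climatology values (DJF, MAM, JJA, SON):
seasonal_gdd = np.array([10.0, 250.0, 900.0, 300.0])
annual_gdd = seasonal_gdd.sum()   # 1460.0 -- not seasonal_gdd.mean() == 365.0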

Form climo means of streamflows

Currently we can form climatological means from files containing variables defined over spatiotemporal grids, such as the outputs of GCMs, but not from streamflow output files.

Streamflow, however, is not defined on a grid. A streamflow for a given spatial location is a time series at that location, called an outlet. The collection of outlets does not form a uniform grid -- instead they are distributed essentially at random. Outlets are addressed by an outlet index, with several dependent variables defining the spatial location, name, and streamflow at that outlet.

We need to handle this case too.
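A rough sketch of what such a climatology could look like, assuming a file with a streamflow(time, outlet) variable; the variable names and layout are assumptions about the streamflow outputs, not a known format:

import xarray as xr

ds = xr.open_dataset('streamflow_outlets.nc')  # hypothetical file name
# Outlets are not gridded, so the climatology is simply a reduction along
# the time axis, independently for each outlet index.
monthly_climo = ds['streamflow'].groupby('time.month').mean('time')
annual_climo = ds['streamflow'].mean('time')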

Precompute multi-model ensemble statistics

Precompute files of statistics across ensembles for all available variables.

Ensembles:

  • All runs
  • All available models

Statistics:

  • minimum
  • maximum
  • ? average
  • percentiles:
    • 10th
    • 25th
    • 50th (median)
    • 75th
    • 90th

Variables:

  • tasmin
  • tasmax
  • pr
  • CLIMDEX indices (all?)

Wrong units in datafiles

According to Trevor, the following variables have the wrong units:

Variable     Current Units   Correct Units
rp20pr       mm/day          mm
rp50pr       mm/day          mm
rp5pr        mm/day          mm
sdiiETCCDI   mm/day          mm

Units are extracted directly from the datafiles, so the solution should just be updating the affected datafiles and re-indexing them, perhaps also letting whoever generated the datafiles know that they ended up with the wrong units.

This is distinct from the issue of variables with scientifically correct, but non-user-friendly units.

Add a climatological periods argument to generate_climos

By default, the generate_climos script creates climatologies for all of the periods available in the input file. This is great when starting from scratch. However, there are use cases when we're infilling climatologies (e.g. after a failure mid-script) where we want to generate some of the periods, but not all.

We should add a command line flag (like the one that is already sketched out here) that allows the user to select only the periods that they want. All by default.
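A minimal argparse sketch of such a flag; the option name and period tokens are placeholders, not the existing generate_climos interface:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '-p', '--periods', nargs='*', default=None,
    help='Climatological periods to generate; default is all periods '
         'available in the input file',
)
args = parser.parse_args(['--periods', '6190', '8100'])
# args.periods == ['6190', '8100']; None would mean "all available periods".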

update_metadata: Add for ... in ... iteration syntax

YAML syntax:

"for <variables> in <expression>":
   <update key>:
       <etc>: ...

Semantics: Execute the specified updates in a context that includes the value of the <variables> for each result of <expression>.

OK, it's Friday night, this is way over the top, but it is certainly doable.

The most general and elegant approach (for certain values of 'elegant') is to use exec to build a generator from the for expression, then iterate that generator and evaluate the subsidiary update directives in a context with the <variables> set according to the yielded values. Something like so:

def make_for_generator(variables, expression):
    # Build the generator function source from the "for" expression, exec it,
    # then retrieve the function from the exec namespace and call it.
    source = (
        'def _for_gen():\n'
        '    for {v} in {e}:\n'
        '        yield {v}\n'
    ).format(v=variables, e=expression)
    namespace = {}
    exec(source, namespace)
    return namespace['_for_gen']()

# ... parse YAML "for" key into variables, expression ...
for_generator = make_for_generator(variables, expression)

for vars in for_generator:
    execute_updates(updates, vars)

Isn't Python just fucking awesome?

Generate Precipitation as Snow

Create a script that will generate precipitation as snow data using precipitation, tasmin, and tasmax. Ensure that the result has all the necessary metadata such that it can then be run through generate_climos.

update_metadata: Don't delete on rename from absent attribute

When reprocessing a file with the same updates, a rename will cause an already-renamed attribute to be deleted (because the attribute under its old name is now missing, so looking it up returns None, which in turn causes netCDF4 to remove the target attribute). Don't do this. It's inconvenient and it doesn't have any useful application.
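A minimal sketch of the guard, assuming target is a netCDF4 Dataset or Variable; the helper name is illustrative, not the existing update_metadata code:

def rename_attribute(target, old_name, new_name):
    # Only move the attribute if the old name is actually present, so that
    # re-running the same updates is a no-op instead of deleting new_name.
    if not hasattr(target, old_name):
        return
    setattr(target, new_name, getattr(target, old_name))
    target.delncattr(old_name)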

Multiple fill values in one file

An example file:
/storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CNRM-CM5_historical+rcp26_r1i1p1/rx5dayETCCDI_mon_BCCAQ_CNRM-CM5_historical-rcp26_r1i1p1_19500116-21001216.nc

The fill value for rx5dayETCCDI is listed as 1e+20:

$ ncdump -h rx5dayETCCDI_mon_BCCAQ+ANUSPLIN300+CNRM-CM5_historical+rcp26_r1i1p1_195001-210012.nc

	float rx5dayETCCDI(time, lat, lon) ;
		rx5dayETCCDI:units = "mm" ;
		rx5dayETCCDI:_FillValue = 1.e+20f ;
		rx5dayETCCDI:long_name = "Monthly Maximum Consecutive 5-day Precipitation" ;
		rx5dayETCCDI:cell_methods = "time: maximum" ;
		rx5dayETCCDI:history = "Created by climdex.pcic 1.1.1 on Wed Jun  4 10:09:38 2014" ;

But large chunks of the array are filled with 2.945782e+34 instead. The backend yields the following not-very-graphable timeseries from this file:

{
  "units": "mm", 
  "id": "rx5dayETCCDI_mon_BCCAQ_CNRM-CM5_historical-rcp26_r1i1p1_19500116-21001216",
  "data": {
    "1950-01-16T00:00:00Z": Infinity, 
    "1950-02-14T12:00:00Z": Infinity, 
    "1950-03-16T00:00:00Z": Infinity, 
    "1950-04-15T12:00:00Z": Infinity, 
    "1950-05-16T00:00:00Z": Infinity, 
    "1950-06-15T12:00:00Z": Infinity, 
    "1950-07-16T00:00:00Z": Infinity, 
  }
}

Not entirely sure if this is a data prep issue or a backend issue.
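A small diagnostic sketch (netCDF4 + numpy) for listing fill-like values that don't match the declared _FillValue; the 1e19 threshold is an arbitrary assumption:

import numpy as np
import netCDF4

fname = 'rx5dayETCCDI_mon_BCCAQ+ANUSPLIN300+CNRM-CM5_historical+rcp26_r1i1p1_195001-210012.nc'
with netCDF4.Dataset(fname) as nc:
    var = nc.variables['rx5dayETCCDI']
    var.set_auto_mask(False)              # see raw numbers, not a masked array
    data = var[:]
    suspicious = np.unique(data[np.abs(data) > 1e19])
    print('declared _FillValue:', var._FillValue)
    print('fill-like values present:', suspicious)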

Generate Climatologies with Snowfall Data

Snowfall data needs to be added as a variable that can be accepted by generate_climos. Furthermore, convert_pr_var_units(...) needs to be extended to handle prsn data.

Create (or modify) script to rename variables

Motivation: pacificclimate/modelmeta#46

Task: Script similar to (or extension of) update_metadata that can rename a variable in a NetCDF file.

Definitely tending towards extending update_metadata, which already contains 90% of the machinery necessary for a nice implementation of this. Should rename it to something like update_netcdf.

Add copy and function value features to update_metadata

This is in support of making existing CLIMDEX files indexable, by standardizing their metadata.

Add the following features to update_metadata:

  1. Copy assignment:

    1. Copy the value of one attribute to another
    2. Syntax: <name1>: =<name2>
    3. Semantics: set the value of the attribute named <name1> to the value of the attribute named <name2>
  2. Function value assignment:

    1. Assign an attribute a value computed by an arbitrary function of the value of another attribute
    2. So far can only see a need for passing the value of 1 other attribute, but if it is easy, extend to multiple arguments.
    3. Don't handle constants as arguments.
    4. Syntax: <name1>: =<func>(<name2>, <name3>, ...)
    5. Semantics: Set the value of the attribute named <name1> to the value of function <func> applied to the values of attributes <name2>, <name3>, .... <func> is defined, by name, in the update_metadata code; if we need to add functions, that's another PR.
  3. Functions (sketched below):

    1. realization(ensemble_member): extract realization m from r<m>i<n>p<l> ensemble member code
    2. initialization_method(ensemble_member): extract initialization method n from r<m>i<n>p<l> ensemble member code
    3. physics_version(ensemble_member): extract physics version l from r<m>i<n>p<l> ensemble member code
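A minimal sketch of the three proposed functions, assuming ensemble member codes of the form r<m>i<n>p<l>; not necessarily the eventual implementation:

import re

_ENSEMBLE_CODE = re.compile(r'^r(\d+)i(\d+)p(\d+)$')

def _component(ensemble_member, index):
    match = _ENSEMBLE_CODE.match(ensemble_member)
    if not match:
        raise ValueError('invalid ensemble member code: ' + ensemble_member)
    return match.group(index)

def realization(ensemble_member):
    return _component(ensemble_member, 1)

def initialization_method(ensemble_member):
    return _component(ensemble_member, 2)

def physics_version(ensemble_member):
    return _component(ensemble_member, 3)

assert (realization('r1i2p3'), initialization_method('r1i2p3'),
        physics_version('r1i2p3')) == ('1', '2', '3')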

Ensembles of Degree Day Data

It looks like the PCIC12 is only available for T & P right now.

We just had a question from a user about comparing Cooling Degree Days on the explorer to other tools and getting different values (partly because the other tools do a bad job of showing change from baseline, which we do a better job of, but CDD only shows up for individual models).

Is the PCIC12 for climdex indices in process, and do we have an ETA for it, or is there a stumbling block to computing it?

Frost Days - missing?

I don't seem to be able to find the frost days variable. There's freezing degree days but that isn't the same thing. FD counts the # of days below freezing.

dtrETCCDI has inconsistent units

The dtrETCCDI data uses both degC and degrees_C as units.

2019-10-13 20:19:56 [2085] [INFO] 172.18.0.1 - - [13/Oct/2019:20:19:56 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=dtrETCCDI&emission=historical,+rcp85&timescale=yearly&time=0&area= HTTP/1.1" 500 291 "https://services.pacificclimate.org/pcex/app/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
2019-10-13 20:19:57 [2073] [ERROR] Exception on /api/data [GET]
Traceback (most recent call last):
  File "/app/ce/api/data.py", line 142, in data
    run_result = result[data_file_variable.file.run.name]
KeyError: 'r1i1p1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ce/views.py", line 11, in api_request
    return ce.api.call(db.session, *args, **kwargs)
  File "/app/ce/api/__init__.py", line 75, in call
    rv = func(session, **args)
  File "/app/ce/api/data.py", line 147, in data
    data_file_variable.file.run, variable),
  File "/app/ce/api/util.py", line 31, in get_units_from_run_object
    raise Exception("File list {} does not have consistent units {}".format(run.files, units))
Exception: File list [<modelmeta.v2.DataFile object at 0x7f59c0f395f8>, <modelmeta.v2.DataFile object at 0x7f59c0f39668>, <modelmeta.v2.DataFile object at 0x7f59c0f39780>, <modelmeta.v2.DataFile object at 0x7f59c0f39898>, <modelmeta.v2.DataFile object at 0x7f59c0f399b0>, <modelmeta.v2.DataFile object at 0x7f59c0f39ac8>, <modelmeta.v2.DataFile object at 0x7f59c0f39be0>, <modelmeta.v2.DataFile object at 0x7f59c0f39cf8>, <modelmeta.v2.DataFile object at 0x7f59c0f39e10>, <modelmeta.v2.DataFile object at 0x7f59c0f39f28>, <modelmeta.v2.DataFile object at 0x7f59c0f42080>, <modelmeta.v2.DataFile object at 0x7f59c0f42198>, <modelmeta.v2.DataFile object at 0x7f59c0f422b0>, <modelmeta.v2.DataFile object at 0x7f59c0f424e0>, <modelmeta.v2.DataFile object at 0x7f59c0f42668>, <modelmeta.v2.DataFile object at 0x7f59c0f427f0>, <modelmeta.v2.DataFile object at 0x7f59c0f42978>, <modelmeta.v2.DataFile object at 0x7f59c0f42b00>, <modelmeta.v2.DataFile object at 0x7f59c0f42c88>......... File object at 0x7f59c1245748>, <modelmeta.v2.DataFile object at 0x7f59c1245eb8>, <modelmeta.v2.DataFile object at 0x7f59c1245358>, <modelmeta.v2.DataFile object at 0x7f59c1245860>]
does not have consistent units {'degC', 'degrees_C'}

Fix the data to be consistent.

  • generate new climatologies
  • upload new data to compute canada
  • add new data to ncWMS
  • replace old data with new data in database

generate multi year means for annual-only climdex indices

There are 480 climdex datasets that are annual-only non-climatology datasets in active use by Climate Explorer.

Historically, we did not support annual-only climatologies, but we do now, and our analysis tools are much nicer for climatologies than non-climatologies, so it makes sense to generate climatologies and replace the non-MYM datasets with them in the database.

SELECT DISTINCT 
  data_files.unique_id
FROM 
  ce_meta.ensemble_data_file_variables, 
  ce_meta.ensembles, 
  ce_meta.data_files, 
  ce_meta.time_sets, 
  ce_meta.data_file_variables
WHERE 
  ensemble_data_file_variables.ensemble_id = ensembles.ensemble_id AND
  ensemble_data_file_variables.data_file_variable_id = data_file_variables.data_file_variable_id AND
  data_files.time_set_id = time_sets.time_set_id AND
  data_file_variables.data_file_id = data_files.data_file_id AND
  ensembles.ensemble_id > 13 AND 
  time_sets.multi_year_mean = FALSE;

update_metadata: Add access to attributes of variables

Extend the local variables context with a dict of variables:

  • variable name: variables
  • value: dict of variables
    • key: variable name
    • value: dict of attributes

Example expression in updates file:
= variables['tasmax']['units'] => value of attribute units of variable tasmax
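A minimal sketch of building that context from an open netCDF4 dataset; the function name is illustrative:

import netCDF4

def variables_context(dataset):
    # {variable name: {attribute name: attribute value}}
    return {
        name: {attr: var.getncattr(attr) for attr in var.ncattrs()}
        for name, var in dataset.variables.items()
    }

# e.g. variables_context(dataset)['tasmax']['units']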

Standardize experiment strings

Climdex variables have experiment strings of the format "historical, rcp26", but downscaled GCM outputs use the format "historical,rcp26". This is a low-priority issue, since there's a workaround in the CE frontend.

Recalculate return period climatologies

The return period climatologies we are using in Climate Explorer do not follow Climate Explorer's conventions on time formatting. Every other annual climatology in Climate Explorer "assigns" the value for the entire climatology to a date in the central year of the climatology. The return period datasets, at least some of them, assign the value to the last day of the climatology.

The nicest way to solve this problem would be to see if Stephen has nominal versions of this data and generate our own climatologies from it.

I did write a script to supply missing time values for this data collection; a lot of the files had timestamps of 0, and units of "days since 01-01-01". The script may be defective, or the error may be in files the script wasn't run on.

Review cell_methods

Recent work has uncovered information that suggests that we may not have been setting cell_methods correctly in our files.

In particular,

  • cell_methods in input data files frequently don't record the spacing (interval) of the original data. This may or may not be a real issue, but it does seem to have some relevance when we form climatological statistics, as they form the basis for the climo statistics cell_methods.

  • cell_methods are probably not correct for climatological outputs. The CF Metadata Conventions are very clear about what cell_methods values are considered permissible and correct for climatological statistics (see the example after this list). We are not, I believe, following these.

Therefore: Review the content of cell_methods, both what we receive in input files and what we generate for output files, and determine:

  1. what they should be
    1. possibly extending the CF Metadata Conventions if they do not seem to fit our case(s) properly -- but be skeptical of this impulse too
    2. documenting this in detail for our cases, probably in PCIC Metadata Standards
  2. what they currently are
  3. how to handle the differences between (1) and (2), which may include
    1. rewriting file contents
    2. updating modelmeta database contents
    3. asking scientists to modify their data-generation code
    4. ripping our collective hair out
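For reference, CF-style climatological cell_methods use the within/over form; for a multi-year mean of annual maxima it would read something like the following (the exact wording for our outputs is still to be determined):

# Hypothetical netCDF4 variable holding a climatology of annual maxima:
climo_var.cell_methods = 'time: maximum within years time: mean over years'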

ACCESS1-0 model outputs have not been indexed

While there are derived outputs from ACCESS1-0 in climate explorer (climdex indices and degree days), the ACCESS1-0 model output climatologies do not seem to be in the climate explorer database.

update_metadata cannot handle invalid blank attributes

I'm not sure if fixing this issue is actually possible.

I attempted to use the update_metadata script to remove an invalid global attribute from some netCDF files. Specifically, some files had a blank string for global: history, which results in all sorts of weird errors.

I wasn't able to remove the invalid attribute with update metadata, and got this traceback:

2018-12-20 16:17:39 INFO: Processing file: /storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmin_day_BCCAQv2+ANUSPLIN300_BNU-ESM_historical+rcp45_r1i1p1_19500101-21001231.nc
2018-12-20 16:17:39 INFO: Global attributes:
Traceback (most recent call last):
  File "climate-explorer-data-prep/scripts/update_metadata", line 31, in <module>
    main(args)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 247, in main
    process_updates(dataset, updates)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 227, in process_updates
    apply_attribute_updates(dataset, target, update)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 202, in apply_attribute_updates
    apply_attribute_updates(dataset, target, element)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 196, in apply_attribute_updates
    modify_attribute(dataset, target, *attr_updates)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 181, in modify_attribute
    return delete_attribute(target, name)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/dp/update_metadata.py", line 145, in delete_attribute
    if hasattr(target, name):
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/nchelpers/decorators.py", line 26, in wrapper
    res = func(*args, **kwargs)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/nchelpers/__init__.py", line 353, in __getattribute__
    is_indirected, indirected_property = _indirection_info(value)
  File "/local_temp/lzeman/climate-explorer-data-prep/venv/lib64/python3.4/site-packages/nchelpers/__init__.py", line 148, in _indirection_info
    if isinstance(value, six.string_types) and value[0] == '@':
IndexError: string index out of range

As a workaround, I first set the attribute to a valid string with update_metadata. Then I ran update_metadata a second time to delete the attribute.

Not a very high priority, since there is a workaround.

Climatological time bounds should be closed intervals, not half-open

Currently, climatological time bounds are calculated as half-open intervals, which is to say that the end date is computed as the day after the last day averaged; more specifically hour 00:00 of the day after. Because of calendar variations, this is much simpler to compute, but in fact the end date should be the last day averaged.
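A minimal sketch of the adjustment, assuming the existing half-open upper bound has already been computed as a datetime at 00:00 of the following day:

from datetime import timedelta

def closed_upper_bound(half_open_end):
    # The half-open bound is 00:00 of the day after the last day averaged;
    # the closed bound is one day earlier.  Works for cftime datetimes in
    # non-standard calendars too, since they support timedelta arithmetic.
    return half_open_end - timedelta(days=1)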

data corrupted by update_metadata script

I ran the update_metadata script on the giant BCCAQ2 files to rename some metadata attributes. This resulted in an error in the file data. Affected files have normal data for the first few thousand timesteps, but subsequently have a weird data offset, resulting in maps that look like this:
[screenshot from 2018-11-01 15-14-09: map showing the offset data]

My best guess for the mechanism here is that the offset is caused by a failure to correctly move the data further down the file when adding length (longer attribute names?) to the metadata header. Perhaps because these are netCDF Classic files of size 56 G, and netCDF classic is designed for files smaller than 2G.

You are, I think, allowed to have netCDF classic files longer than 2G if all but one of the variables fits completely within the first 2G, which would be the case here. But that may be a grey area that some libraries don't work well with, or something. Maybe only 2G of the data was "scooted down"?

Diagnose issue, and have update_metadata warn the user / refuse to run if it seems like it applies.
