rigoudyg / climaf
CliMAF - a Climate Model Analysis Framework - doc at: http://climaf.readthedocs.org/
License: Other
period='first_XXY' is the last option left (period='last_XXY' and period='*' are already available) that could be transferred from the CESMEP modules (time_manager) to cdataset.explore.
This would show that CliMAF is mature enough for most use cases.
It would be useful to let datasets be multi-variable, in order to cope with data file organizations where variables are grouped in files (such as for NEMO model diagnostics). This would save file operations when such groups of variables are used together by some operator. It would, however, break the regularity of the CliMAF dataset model.
Currently one has to edit the Python module site_settings.py to change or add the data archives used by CliMAF. It would be nice if the data archives could be configured without editing the source.
When exploring the dataloc functionality with the example available in cmip5drs.py, I got two files for the following request:
urls_CMIP5_Ciclad=["/prodigfs/esg"]
dataloc(organization="CMIP5_DRS", url=urls_CMIP5_Ciclad)
cdef("frequency","monthly") ; cdef("project","CMIP5")
tas1pc=ds(model="IPSL-CM5A-MR", experiment="historical", variable="pr", period="1860-1961")
files=tas1pc.selectFiles()
print files
Here is what I get:
/prodigfs/esg/CMIP5/merge/IPSL/IPSL-CM5A-MR/historical/mon/atmos/Amon/r1i1p1/v20111119/pr/pr_Amon_IPSL-CM5A-MR_historical_r1i1p1_185001-200512.nc
/prodigfs/esg/CMIP5/merge/IPSL/IPSL-CM5A-MR/historical/mon/ocean/Omon/r1i1p1/v20111119/pr/pr_Omon_IPSL-CM5A-MR_historical_r1i1p1_185001-200512.nc
I've tried adding cdef("realm","atmos"), but it didn't change the result:
cdef("frequency","monthly") ; cdef("project","CMIP5") ; cdef("realm","atmos")
If you confirm that this is relevant, I'll try to make my first contribution to CliMAF by adding "realm" to dataloc.
I've encountered difficulties reaching daily datasets on the CMIP5 archive:
summary(ds(project = 'CMIP5', variable='pr', model = 'GFDL-CM3', experiment = 'historical', frequency = 'daily', period = '19900101-19901231'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-3564c6456b32> in <module>()
9 #crm(pattern='ensemble_ts_plot')
10 #ncdump(clim_average(ds(variable='tas', **cmip_dict), 'JJA'))
---> 11 summary(ds(variable='pr', **cmip_dict))
12 #if 'daily' in test.crs:
13 # print 'ok'
/home/jservon/Evaluation/CliMAF/climaf_installs/climaf_1.0.3_CESMEP/climaf/functions.pyc in summary(dat)
348 print '--'
349 elif isinstance(dat,classes.cdataset):
--> 350 if not dat.baseFiles():
351 print '-- No file found for:'
352 else:
/home/jservon/Evaluation/CliMAF/climaf_installs/climaf_1.0.3_CESMEP/climaf/classes.pyc in baseFiles(self, force)
446 if filenameVar : dic["filenameVar"]=filenameVar
447 clogger.debug("Looking with dic=%s"%`dic`)
--> 448 self.files=dataloc.selectLocalFiles(**dic)
449 return self.files
450
/home/jservon/Evaluation/CliMAF/climaf_installs/climaf_1.0.3_CESMEP/climaf/dataloc.py in selectLocalFiles(**kwargs)
196 rep.extend(selectEmFiles(**kwargs2))
197 elif (org == "CMIP5_DRS") :
--> 198 rep.extend(selectCmip5DrsFiles(urls,**kwargs2))
199 elif (org == "generic") :
200 rep.extend(selectGenericFiles(urls, **kwargs2))
/home/jservon/Evaluation/CliMAF/climaf_installs/climaf_1.0.3_CESMEP/climaf/dataloc.py in selectCmip5DrsFiles(urls, **kwargs)
517 #if freqd in ['daily','day']:
518 # regex=r'^.*([0-9]{4}[0-9]{2}[0-9]{2}-[0-9]{4}[0-9]{2}[0-9]{2}).nc$'
--> 519 fileperiod=init_period(re.sub(regex,r'\1',f))
520 if (fileperiod and period.intersects(fileperiod)) :
521 rep.append(f)
/home/jservon/Evaluation/CliMAF/climaf_installs/climaf_1.0.3_CESMEP/climaf/period.pyc in init_period(dates)
160 start=(4-len(start))*"0"+start
161 # TBD : check that start actually matches a date
--> 162 syear =int(start[0:4])
163 smonth =int(start[4:6]) if len(start) > 5 else 1
164 sday =int(start[6:8]) if len(start) > 7 else 1
ValueError: invalid literal for int() with base 10: '/pro'
It actually comes from dataloc.py: the provided 'regex' is only valid for monthly datasets. I've tried this patch in dataloc.py (around line 516) and it works:
replace:
regex=r'^.*([0-9]{4}[0-9]{2}-[0-9]{4}[0-9]{2}).nc$'
with:
if freqd in ['monthly','mo']:
    regex=r'^.*([0-9]{4}[0-9]{2}-[0-9]{4}[0-9]{2}).nc$'
elif freqd in ['daily','day']:
    regex=r'^.*([0-9]{4}[0-9]{2}[0-9]{2}-[0-9]{4}[0-9]{2}[0-9]{2}).nc$'
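For illustration, the frequency-dependent regexes can be exercised standalone (the function name and wrapper are just for this sketch; the real code lives in selectCmip5DrsFiles):

```python
import re

def file_period(filename, freqd):
    """Extract the period string from a CMIP5 DRS filename, using the
    frequency-dependent regexes from the patch above (sketch only)."""
    if freqd in ['monthly', 'mo']:
        regex = r'^.*([0-9]{4}[0-9]{2}-[0-9]{4}[0-9]{2}).nc$'
    elif freqd in ['daily', 'day']:
        regex = r'^.*([0-9]{4}[0-9]{2}[0-9]{2}-[0-9]{4}[0-9]{2}[0-9]{2}).nc$'
    else:
        return None
    m = re.match(regex, filename)
    return m.group(1) if m else None
```

Note that the monthly regex does not match daily filenames at all, which is exactly why init_period ended up receiving a full path and failed with ValueError.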
I will set up a PR as soon as possible.
When the cache becomes heavily loaded with results, the time spent scanning the index can be of the same order as the time spent actually computing the result (especially when producing big atlases).
Here are some ideas for implementing automatic, smart cache management (applied, for instance, at the end of a CliMAF routine script).
Each time we do a cfile (only when it leads to a new result), we keep:
Using this information, we could clean the cache (say, with user-provided instructions at the end of the CliMAF script) by removing:
We could also cap the number of files in the cache (to limit disk usage) at some number XX: keep the XX files with the longest 'execution time' and remove all the others.
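The last idea could be sketched as follows (the index format here is hypothetical, not CliMAF's real cache index):

```python
def prune_by_count(index, max_files):
    """Keep the max_files entries with the longest execution time;
    return the paths of the entries to remove.
    `index` is a list of dicts like {'path': ..., 'exec_time': ...}."""
    by_cost = sorted(index, key=lambda e: e['exec_time'], reverse=True)
    return [e['path'] for e in by_cost[max_files:]]
```

Removing the cheap-to-recompute results first preserves the cache entries that save the most wall-clock time.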
This would allow using e.g. 'ncdump -h' from inside CliMAF. The text output wouldn't be managed by CliMAF, only displayed.
Tick marks should be smartly adapted to the duration of the time period. When datasets do not cover the same time period, the user should be able to choose whether the time axes should be aligned to the same origin or simply span the union of all time periods.
We need to add (or confirm) a check in explore that ensures the CliMAF dataset actually covers the period requested by the user.
If the period is not fully available, it would be very useful to update the .period attribute of the ds object (from explore('resolve'), for instance).
This is needed at least for the 'plot' operator, for a secondary scalar input field.
Today, we can easily build an ensemble over multiple values of one attribute using cdataset.explore('ensemble').
It would definitely be interesting to build ensembles over multiple attributes: model and realization; institute; driving_model and model (for CORDEX/RCM projects).
One issue is the naming of the members; one answer is to name each member with the values of its attributes, separated by '_' (or a user-provided separator?).
Example: CNRM-CM5_r1i1p1
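A naming helper along those lines might look like this sketch (function and argument names are hypothetical):

```python
def member_label(attrs, keys, sep='_'):
    """Build a member name by joining the values of several attributes
    with a separator, e.g. model and realization."""
    return sep.join(str(attrs[k]) for k in keys)
```

With attrs={'model': 'CNRM-CM5', 'realization': 'r1i1p1'} and keys=['model', 'realization'], this yields the member name 'CNRM-CM5_r1i1p1' from the example above.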
The behaviour of explore('ensemble') could be:
I'm using CliMAF in a web processing service, and the CliMAF executable is started in a directory where it might not have write permission (it runs as an unprivileged service user). CliMAF writes log files (and probably more) to the current directory, which fails in my service use case.
Output paths with expected write permission should be made configurable (logs/, temp/, outputs/, ...).
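One way to make such paths configurable is via environment variables with sensible defaults (a sketch; the variable name CLIMAF_LOG_DIR is hypothetical, not an existing CliMAF setting):

```python
import os

def writable_dir(envvar, default):
    """Return the directory configured via `envvar` (falling back to
    `default`), creating it if it does not exist yet."""
    path = os.environ.get(envvar, default)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# e.g. logdir = writable_dir('CLIMAF_LOG_DIR',
#                            os.path.expanduser('~/.climaf/logs'))
```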
This is to cover cases where, depending on the input parameters, an operator will or won't output a given, secondary, field.
A short notebook showing how to use cMA, or how to get the file name and open it in Python with netCDF4 (or any other NetCDF library).
This is due to a limitation of the HDF5 installation at CNRM, which is not thread-safe.
It would be preferable that alternate packages be supported too (e.g. netCDF4, scipy.io.netcdf, ...).
Python 3.x (starting with 3.5) is used more and more. CliMAF currently supports only Python 2.7. It would be nice if CliMAF could support both 2.7 and 3.x (>=3.6).
Compatibility can be achieved by using six:
https://pythonhosted.org/six/
One can also use a compat.py module to handle 2.7/3.x compatibility, for example:
https://github.com/geopython/pywps/blob/master/pywps/_compat.py
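In the spirit of pywps' _compat.py, a compat module isolates the 2.7/3.x differences in one place (a sketch, not CliMAF code):

```python
# compat.py sketch: gather the 2.7/3.x differences behind stable names.
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    string_types = (str, unicode)  # noqa: F821 -- name exists on py2 only
    from StringIO import StringIO
else:
    string_types = (str,)
    from io import StringIO
```

The rest of the codebase then imports string_types and StringIO from compat instead of branching on the interpreter version everywhere.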
climaf currently has no setup.py and cannot be installed using pip.
I just quickly hacked a setup.py for climaf in my fork:
https://github.com/cehbrecht/climaf/blob/pingudev/setup.py
This should be done in a cleaner way (scripts?).
I'm using the fork to build a conda package:
Data users should be provided with an easy way to query the ES-Doc errata system for the datasets they are using.
CliMAF allows deriving basic and advanced results, and it can cache them. When the data is already computed and held in the cache, the main part of the response time is due to loading the software. Implementing a CliMAF server with a light client communicating through RPC would significantly improve the response time.
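A minimal sketch of the idea (Python 3, stub only): a long-lived server that has already paid the import cost, answering evaluation requests over RPC. The evaluate function below is a stand-in; a real server would call CliMAF's evaluation machinery and return the cache file path.

```python
from xmlrpc.server import SimpleXMLRPCServer

def evaluate(crs_expression):
    """Stub standing in for a real CliMAF evaluation: a real server would
    evaluate the CRS expression and return the cache file path."""
    return 'cachefile-for-' + crs_expression

if __name__ == '__main__':
    # The light client then only needs xmlrpc.client, which loads fast.
    server = SimpleXMLRPCServer(('localhost', 8765))
    server.register_function(evaluate)
    server.serve_forever()
```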
I've been working on the possibility of mixing CDAT and CliMAF. But for the moment I have difficulties converting a CliMAF object to a MaskedArray (MA):
jservon@ciclad-ng:~/Evaluation/CliMAF> python
Python 2.7.4 (default, Apr 22 2014, 14:55:23)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> from climaf.api import *
Cache set to /data/jservon/climaf_cache
>>> cdef('project','CMIP5')
>>> cdef('experiment','historical')
>>> cdef('frequency','monthly')
>>> dataloc(organization="CMIP5_DRS",url=['/prodigfs/esg/'])
<climaf.dataloc.dataloc instance at 0x7f47c10a1c68>
>>> dat=ds(model='IPSL-CM5A-LR',
... rip='r1i1p1',
... variable='tas',
... period='1980-2000',
... )
>>> dat.baseFiles()
'/prodigfs/esg/CMIP5/merge/IPSL/IPSL-CM5A-LR/historical/mon/atmos/Amon/r1i1p1/v20110406/tas/tas_Amon_IPSL-CM5A-LR_historical_r1i1p1_185001-200512.nc'
>>> test = cMA(dat)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ssenesi/climaf/climaf/api.py", line 160, in cMA
return climaf.driver.ceval(obj,format='MaskedArray',deep=deep)
File "/home/ssenesi/climaf/climaf/driver.py", line 180, in ceval
rep=ceval(extract,userflags=userflags,format=format,deep=deep,recurse_list=recurse_list)
File "/home/ssenesi/climaf/climaf/driver.py", line 267, in ceval
return cread(file)
File "/home/ssenesi/climaf/climaf/driver.py", line 545, in cread
if varname is None: varname=varOfFile(datafile)
File "/home/ssenesi/climaf/climaf/netcdfbasics.py", line 14, in varOfFile
if (filevar not in fileobj.dimensions) and not re.findall("^time_",filevar) :
NameError: global name 're' is not defined
>>>
@senesis any thought?
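The NameError points to a missing `import re` at the top of climaf/netcdfbasics.py; a sketch of the failing check with the import in place (the helper name is hypothetical, the check mirrors the line from the traceback):

```python
import re  # the import apparently missing from netcdfbasics.py

def is_time_variable(filevar):
    """Mirrors the check that raised the NameError in varOfFile."""
    return bool(re.findall('^time_', filevar))
```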
The computation of EOFs comes with an issue: an EOF decomposition normally has two outputs (the eigenvectors and the principal components), and we might also need access to the eigenvalues (explained variance).
Therefore two strategies are possible:
The variable msftyz in CMIP6 is 5-dimensional (x, y, olevel, time, and basin).
At the moment mcdo.sh (and consequently ds()) can't work on a 5-dimensional dataset (not supported by CDO, and not in their short-term plans).
Adding an ncks step to collapse the basin dimension would allow reducing to 4 dimensions from mcdo, but the feasibility is yet to be explored...
We should add some simple example scripts in the different languages (Python, NCL, R, Ferret, ...) to provide a simple basis for developing a script that can be plugged into CliMAF.
Olivier: 4.5 s for plotting a 2D field is too long.
The html_table_line(s) functions are fine for tables. There is, however, a need for a more basic function, which would take two arguments, a CliMAF object of type figure and a label, and return the HTML code for a link from that label to the CliMAF cache file for the figure.
CliMAF should interface to the following Drakkar CDFTools : cdfmean, cdfheatc, cdftransport, cdfsection, cdfmxlheatc, cdfstd
In that case, it should provide a single file with variable names suffixed by the member label.
This is in order to cope with cases where, for a given variable, the corresponding filename may be formed either using the variable name or using 'filenameVar', another string which is declared using 'calias()'.
We have to find a way to make the script gplot.ncl less sensitive to the names of the dimensions, notably the spatial dimensions 'lat' and 'lon'. At the moment, if the input file has dimensions called 'LON' and 'LAT', the script returns an error:
fatal:["Execute.c":5861]:variable (lat) is not in file (ffile)
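One hedged approach (sketched in Python rather than NCL, with a hypothetical alias table) is to normalize dimension names before the file reaches gplot.ncl:

```python
# Hypothetical alias table; extend it as new naming conventions appear.
DIM_ALIASES = {
    'lat': ('lat', 'latitude', 'nav_lat'),
    'lon': ('lon', 'longitude', 'nav_lon'),
}

def canonical_dim(name):
    """Map a file's dimension name onto the canonical 'lat'/'lon',
    case-insensitively; other names pass through unchanged."""
    lowered = name.lower()
    for canon, aliases in DIM_ALIASES.items():
        if lowered in aliases:
            return canon
    return name
```

A pre-processing step could rename dimensions according to this mapping (e.g. with ncrename), so gplot.ncl always sees 'lat' and 'lon'.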
Having access to the pattern that matches the file(s) found with ds() would allow:
I ran into an issue with temp folders in mcdo.sh:
https://github.com/senesis/climaf/blob/3e1762ec788674b470d895b15aa398184c77bb4a/scripts/mcdo.sh#L25
It creates the temp folder in the current folder, which might be write-protected. The following patch worked for me:
$ mktemp -t climaf_mcdo -d
OR
$ mktemp -d /tmp/climaf_mcdo_XXXXXX
See also #85.
There is a need to read a dataset from a file and work with it at once, without configuring a 'CliMAF project' for that (issue originally reported by Jerome).
CliMAF should record, in the history attribute of NetCDF files, the list of basic data files used upstream of the computation, together with their creation dates and maybe their tracking IDs and checksums.
There is extensive documentation and many examples for CliMAF use. It is, however, mostly reference documentation, which is dry at first glance. The front page of the doc should give access to a short document, the HTML version of a punchy notebook, which would exemplify CliMAF's most unique features from the point of view of a scientist user. It could also link to the various chapters of the doc for further reference on those features.
We should develop a way to read a file-system 'audit file', in order to avoid globbing very large directories such as /bdd on Ciclad.