ltelab / disdrodb

A global database of disdrometer measurements

Home Page: https://disdrodb.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 87.44% Jupyter Notebook 12.51% Makefile 0.05%
disdrometer disdrometer-data dsd parsivel-data psd parsivel

disdrodb's People

Contributors

charlottegiseleweil, dependabot[bot], ghiggi, jacgraz, kimcandolfi, pre-commit-ci[bot], regislon, saveriogzz, sphamba


disdrodb's Issues

[FEATURE] Creation of disdrodb_archive repository for sample data and metadata

This issue proposes creating a separate repository, called disdrodb_archive, to host sample data and metadata for each station and reader.

The main goals are to:

  • share with the community the DISDRODB stations metadata in order to collaboratively edit it;
  • share with the community the DISDRODB stations issue files in order to collaboratively improve the quality of the dataset;
  • facilitate the testing of the L0 readers.

Related to this point, in the DISDRODB full archive we could better enforce the format and structure of the raw files within each station directory, for example by banning heterogeneous archive/compression formats (tar, gzip, bz2, ...) and nested directory structures.
For portability of the DISDRODB raw archive, it might be useful to have all files within a station zipped in a single directory ... especially for stations where the deployment has terminated and we do not expect a further data stream. A minimal sketch of such bundling is shown below.
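As an illustration, a minimal sketch (the paths are hypothetical) of bundling a terminated station's raw directory into a single zip archive:

import shutil

# Hypothetical paths: zip all raw files of a terminated station into a
# single archive to ease portability of the DISDRODB raw archive.
station_dir = "/path/to/DISDRODB/Raw/NETHERLANDS/DELFT/data/PAR001"
shutil.make_archive(station_dir, "zip", root_dir=station_dir)  # creates PAR001.zip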

This feature request will require the development of code to synchronize the metadata between this repository and the DISDRODB Full Archive.

In the context of this PR ... please deprecate the use of station_id in favor of station_name.
The station directory must have the same (hopefully self-explanatory) name as the corresponding station_name key in the metadata.

[DOCS] Contributing Guidelines index is outdated

The index at the beginning of https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst is outdated.
This portion of the doc should be updated:

Before submitting your contribution, please make sure to take a moment and read through the following guidelines:

[Code of Conduct](https://github.com/ltelab/disdrodb/blob/main/CODE_OF_CONDUCT.md)
[Issue Reporting Guidelines](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#issue-reporting-guidelines)
[Pull Request Guidelines](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#pull-request-guidelines)
[Development Setup](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#development-setup)
[Project Structure](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#project-structure)
[Github Flow](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#github-flow)
[Commit Lint](https://github.com/ltelab/disdrodb/blob/main/CONTRIBUTING.rst#commit-lint)

[BUG] Replace get_L0_dtype_standards by get_L0A_dtype

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We removed the get_L0_dtype_standards function and replaced it with disdrodb.l0.standards.get_L0A_dtype.
Yes, you should pass the sensor_name argument.
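A minimal usage sketch of the replacement call described above (the sensor name value is illustrative):

from disdrodb.l0.standards import get_L0A_dtype

# get_L0_dtype_standards() has been removed; get_L0A_dtype() replaces it
# and requires the sensor_name argument.
dtype_dict = get_L0A_dtype(sensor_name="OTT_Parsivel")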

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[FEATURE] Clarify the branch name definition in the doc

The branch naming convention in the contributors_guidelines is not clear to me.

It is currently defined as follows:

reader-<institute>-<campaign>

On the reader page it is defined as follows:

Guidelines for the name of the institution or country folder:

  • We use the institution name when campaign data spans more than 1 country.

  • We use country when all campaigns (or sensor networks) are inside a given country.

which is a bit clearer to me.

Can I update the contributors_guidelines with the explanations from the reader page?


I'm working on the Delft dataset. If I open one station YAML file, I see that campaign = DELFT and country = Netherlands.

I've therefore named my branch reader-Netherland-delft. Am I right @ghiggi? I would rather consider Delft the institution...

What if the Delft university operates outside the Netherlands?

  • there may be a branch named reader-delft-blabla
  • the same remark applies to the raw file structure: there would be delft both under Raw and under Raw/NETHERLANDS

Why not always use the following structure: Raw/Country/Institution/...?

To be honest, I find mixing country and institution a bit confusing.

[FEATURE] Define and document installation process

Is your feature request related to a problem? Please describe.
Installing the pip/conda environment on macOS does not seem to work properly.

Describe the solution you'd like
Define and document the correct installation process.

Describe alternatives you've considered
It will work

Additional context
None

[FEATURE] update conda env. file

The current environment.yml contains lots of unused packages. The aim here is to keep only the packages needed to run DISDRODB and to generate the documentation.

[DESIGN] Design automated tests (maybe requires : Enforce typing)

Do you want us to implement automated tests on new readers provided by the community? If so, do we need contributors to add their sample data for a specific reader in order to test it? We could require them to upload a data_sample with each new reader.
What do you think?

[BUG] empty date in parquet file

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

While working on the reader for the Netherlands, I've noticed that some dates are empty in the Parquet file. The exact same data frame written as CSV outputs no empty dates. These lost dates seem to be formatted correctly.
The processed folder on the LTE NAS contains these empty dates, which is, in my opinion, wrong.

Expected Behavior

If the source file has correctly formatted dates, they should be replicated in the Parquet file.

Steps To Reproduce

Pickle file here

import pandas as pd

# Raw strings avoid backslash escapes in the Windows paths
df = pd.read_pickle(r"<your_path>\sample_df.pkl")
df.to_parquet(r"<your_path>\sample_df.parquet")
df.to_csv(r"<your_path>\sample_df.csv")
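A quick diagnostic worth running (a sketch assuming the problem lies in the column dtype, which the issue does not confirm): check whether the date column is stored with object dtype and convert it explicitly before writing.

import pandas as pd

# Hypothetical check (column name is illustrative): object-dtype date columns
# are worth converting to datetime64 explicitly before writing to Parquet.
df = pd.DataFrame({"date": ["2021-10-08 00:00:00", "2021-10-08 00:01:00"]})
print(df["date"].dtype)                           # object
df["date"] = pd.to_datetime(df["date"], errors="coerce")
print(df["date"].dtype, df["date"].isna().sum())  # datetime64[ns] 0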

Environment

- OS: Windows
- Python: 3.8.10

Anything else?

@ghiggi please react if you are already aware of this behaviour. I'm working on it now.

[FEATURE] Definition of DISDRODB product names structure.

The names of files produced by DISDRODB currently have the following structure:
<campaign_name>_s<station_id>_<optional_suffix>.<file_extension>
which results, for example, in EPFL_2011_s1.nc

I suggest changing the file name structure to something more appropriate and informative, similar to the following:
DISDRODB.<product_level>.<product_name>.<campaign_name>.<station_name>.<sensor_name>.s<start_time>.e<end_time>.p<production_time>.<version>.<file_extension>.

DISDRODB.L0B.Raw.EPFL2011.Campus1.OTT_Parsivel2.s20220125000000.e20220130000000.p20220130000000.V01.nc

This file structure will make it possible to perform relevant filtering operations without the need to open the files.

To implement such a file structure, we might want to enforce and clearly document that:

  • campaign_name, station_name and sensor_name cannot contain the delimiter .
  • for readability, we might want to enforce that sensor_name uses - instead of _

For the time components, we could choose between the YYYYMMDDhhmmss and the YYYYDOYhhmmss formats.
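To illustrate the benefit, a hypothetical helper (not part of disdrodb) that splits the proposed name into its components, so files can be filtered by station or time range without being opened:

# Hypothetical sketch: parse the proposed DISDRODB file name into components.
def parse_disdrodb_filename(filename):
    stem, extension = filename.rsplit(".", 1)
    keys = ["prefix", "product_level", "product_name", "campaign_name",
            "station_name", "sensor_name", "start_time", "end_time",
            "production_time", "version"]
    info = dict(zip(keys, stem.split(".")))
    info["extension"] = extension
    return info

info = parse_disdrodb_filename(
    "DISDRODB.L0B.Raw.EPFL2011.Campus1.OTT_Parsivel2."
    "s20220125000000.e20220130000000.p20220130000000.V01.nc"
)
print(info["station_name"])  # Campus1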

Suggestions are very welcome!!!

[BUG] <NETCDF: nc_def_var_deflate fails with string variables: Filter error: bad id or parameters or duplicate filter>

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When creating a netCDF file from an xarray object in L0B_processing.py, the code crashes as soon as you try to add compression to string variables.

Expected Behavior

Set the compression level to 0 for weather_code_metar_4678 and weather_code_nws in L0B_encoding.yml.
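For illustration, a self-contained sketch of the workaround (variable names taken from the issue; in disdrodb the actual change belongs in L0B_encoding.yml):

import xarray as xr

# netCDF4 deflate compression fails on string variables, so disable it
# (zlib False / complevel 0) for the string variables named above.
ds = xr.Dataset({
    "weather_code_metar_4678": ("time", ["RA", "SN"]),
    "weather_code_nws": ("time", ["R", "S"]),
})
for var in ["weather_code_metar_4678", "weather_code_nws"]:
    ds[var].encoding.update({"zlib": False, "complevel": 0})
ds.to_netcdf("/tmp/no_string_compression.nc")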

Steps To Reproduce

No response

Environment

- OS:
- python:

Anything else?

No response

[ENHANCEMENT] Fix the time epoch in the dimension encodings

Is your feature request related to a problem? Please describe.

We need to ensure that the time epoch does not vary between netCDF files.
I suggest fixing it to the Unix epoch.

The code should be added either in L0.standards.get_L0B_encodings_dict or in the L0.L0B_processing.write_L0B function.

Describe the solution you'd like

EPOCH = "seconds since 1970-01-01 00:00:00"  # define once, near the top of the encoding code
# ...
encoding = ds["time"].encoding
encoding["units"] = EPOCH
encoding["calendar"] = "proleptic_gregorian"
 

metadata entry "sensor_name" does not accept integers in its value

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When using Parsivel2 as the value of the metadata's sensor_name key, I receive a TypeError and the script stops executing.

Expected Behavior

It should ideally be possible to use any string as the sensor_name value.

Steps To Reproduce

  • Use the metadata here
  • Convert them to YAML, as it is the only format accepted at the moment
  • See the difference in behaviour between Parsivel (finishes execution) and Parsivel2 (error!)

Anything else?

The full error traceback:

There are the following metadata files without corresponding data: ['PAR002', '20', 'testconki', 'PAR003']
 - L0 processing of station_id PAR001 has started.
 - 1 files to process in /home/sguzzo/Parsivel/RAW_TELEGRAM
 - Conversion to Apache Parquet started.
 - Conversion to Apache Parquet ended.
Traceback (most recent call last):
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 487, in <module>
    main()
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 407, in main
    check_L0_standards(fpath=fpath,
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/check_standards.py", line 48, in check_L0_standards
    if not df[column].between(*dict_field_value_range[column]).all():
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/series.py", line 5110, in between
    lmask = self >= left
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/common.py", line 69, in new_method
    return method(self, other)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/arraylike.py", line 52, in __ge__
    return self._cmp_method(other, operator.ge)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/series.py", line 5502, in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 284, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 73, in comp_method_OBJECT_ARRAY
    result = libops.scalar_compare(x.ravel(), y, op)
  File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'
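Based on the failing line in check_standards.py, a hypothetical guard (an illustrative sketch, not the confirmed fix) would apply the value-range check only to numeric columns:

import pandas as pd

# Hypothetical sketch: skip range checks on non-numeric columns so that
# string values such as 'Parsivel2' do not trigger a str/int comparison.
def column_in_range(df, column, value_range):
    if not pd.api.types.is_numeric_dtype(df[column]):
        return True  # non-numeric (string) columns pass the check
    return df[column].between(*value_range).all()

df = pd.DataFrame({"sensor_name": ["Parsivel2"], "rainfall_rate": [1.5]})
print(column_in_range(df, "sensor_name", (0, 100)))    # True (check skipped)
print(column_in_range(df, "rainfall_rate", (0, 100)))  # True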

What python version are you using?

3.9.*

[FEATURE] Tutorial enhancements

  • Include plotting functions within the reader_preparation.ipynb notebook for visual data exploration
  • Include examples of handling measurement errors/flawed data within reader_preparation.ipynb that use the error-handling config file (in data/.. issue ../ yml)

[FEATURE] Create a wrapper for L0 readers click commands

Is your feature request related to a problem? Please describe.

I think the code readability of the L0 readers would improve if we manage to compact the list of click commands that precede the reader function definition.

Describe the solution you'd like

Follow this approach

@click.command()  # options_metavar='<options>'
@click.argument('raw_dir', type=click.Path(exists=True), metavar='<raw_dir>')
@click.argument('processed_dir', metavar='<processed_dir>')
@click.option('-L0A', '--L0A_processing', type=bool, show_default=True, default=True, help="Perform L0A processing")
@click.option('-L0B', '--L0B_processing', type=bool, show_default=True, default=True, help="Perform L0B processing")
@click.option('-k', '--keep_L0A', type=bool, show_default=True, default=True, help="Whether to keep the L0A Parquet file")
@click.option('-f', '--force', type=bool, show_default=True, default=False, help="Force overwriting")
@click.option('-v', '--verbose', type=bool, show_default=True, default=False, help="Verbose")
@click.option('-d', '--debugging_mode', type=bool, show_default=True, default=False, help="Switch to debugging mode")
@click.option('-l', '--lazy', type=bool, show_default=True, default=True, help="Use dask if lazy=True")
@click.option('-s', '--single_netcdf', type=bool, show_default=True, default=True, help="Produce single netCDF")
def main(raw_dir,
         processed_dir,
         L0A_processing=True,
         L0B_processing=True,
         keep_L0A=False,
         force=False,
         verbose=False,
         debugging_mode=False,
         lazy=True,
         single_netcdf=True,
         ):
    ...

would become

# Define this in some file
def readers_click_options(function):
    # Note: click reverses the collected parameters when building the command,
    # so apply the decorators in reverse of the desired order
    # (raw_dir is applied last so that it stays the first positional argument).
    function = click.option('-s', '--single_netcdf', type=bool, show_default=True, default=True, help="Produce single netCDF")(function)
    function = click.option('-l', '--lazy', type=bool, show_default=True, default=True, help="Use dask if lazy=True")(function)
    function = click.option('-d', '--debugging_mode', type=bool, show_default=True, default=False, help="Switch to debugging mode")(function)
    function = click.option('-v', '--verbose', type=bool, show_default=True, default=False, help="Verbose")(function)
    function = click.option('-f', '--force', type=bool, show_default=True, default=False, help="Force overwriting")(function)
    function = click.option('-k', '--keep_L0A', type=bool, show_default=True, default=True, help="Whether to keep the L0A Parquet file")(function)
    function = click.option('-L0B', '--L0B_processing', type=bool, show_default=True, default=True, help="Perform L0B processing")(function)
    function = click.option('-L0A', '--L0A_processing', type=bool, show_default=True, default=True, help="Perform L0A processing")(function)
    function = click.argument('processed_dir', metavar='<processed_dir>')(function)
    function = click.argument('raw_dir', type=click.Path(exists=True), metavar='<raw_dir>')(function)
    return function

## In each reader:
#  add the import of readers_click_options
#  and modify as follows
@click.command()
@readers_click_options
def main(raw_dir,
         processed_dir,
         L0A_processing=True,
         L0B_processing=True,
         keep_L0A=False,
         force=False,
         verbose=False,
         debugging_mode=False,
         lazy=True,
         single_netcdf=True,
         ):
    ...



@regislon can you take care of that?
Related to this refactor ... do we maybe want to rename the function from main to reader?

[REFACTOR] Refactor Project structure + Gh repo structure

  • Finalize Contributing guidelines PR
  • Refactor project structure

disdrodb/
├── processing
├── L0
├── L1
├── L2
├── pipelines
├── api
├── utils
├── configs
├── data
├── docs
└── references
.gitignore
LICENSE
CONTRIBUTING.md
README.md
requirements.txt

[BUG] Remove get_OTT_Parsivel_dict functions

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

These imports and code lines are outdated and should be removed.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[DOC] Document alternative L0 netCDF production of a single raw file

We should document how a single raw file can alternatively be processed to L0 netCDF.
Some people might want to exploit just part of the functionality provided by disdrodb.

I think the following example could be useful to a lot of people:

# Import the relevant packages
from disdrodb.L0 import read_raw_data, cast_column_dtypes, create_L0B_from_L0A, set_encodings

# Specify the filepath of a single raw text file
filepath = "/file/path/to/your/raw/text/file.txt"

# Define the sensor type
sensor_name = "OTT_Parsivel"

# Define the processing mode
lazy = False

# Define a (dummy) attribute dictionary to enable further processing
# --> The attrs dictionary will be attached to the output xr.Dataset (and netCDF4)
attrs = {}
attrs["sensor_name"] = sensor_name
attrs["latitude"] = "-9999"
attrs["longitude"] = "-9999"
attrs["altitude"] = "-9999"
attrs["crs"] = "dummy"

# Specify here the required reader_kwargs, column_names and df_sanitizer_fun
# --> You can copy them from a specific reader
reader_kwargs = {}
column_names = []
def df_sanitizer_fun(df, lazy=False):
    return df

# Read the raw file
df = read_raw_data(filepath, column_names, reader_kwargs, lazy=lazy)

# Sanitize the dataframe to meet the DISDRODB standard columns
df = df_sanitizer_fun(df, lazy=lazy)
print(df)

# Cast the column dtypes to match the DISDRODB standards
df = cast_column_dtypes(df, sensor_name)
print(df)

# Derive the corresponding xr.Dataset
ds = create_L0B_from_L0A(df, attrs, lazy=lazy, verbose=False)
print(ds)

# Set the dataset encodings
# - This also converts object dtype into string
# - This also chunks the arrays in blocks
ds_encoded = set_encodings(ds.copy(), sensor_name)
print(ds_encoded)

# Write your DISDRODB L0 netCDF4 (write the encoded dataset)
ds_encoded.to_netcdf("/tmp/dummy.nc")

Before adding this example to the docs, we need to add the following imports to the disdrodb.L0.__init__ file:

from .L0A_processing import read_raw_data, cast_column_dtypes
from .L0B_processing import retrieve_L0B_arrays, create_L0B_from_L0A, set_encodings

problems using nco with netcdf files created by disdrodb

Hi guys, I'm opening the issue here so we can have a public discussion!

Recently @mschleiss experienced an issue while trying to use ncdump on netCDF files converted with the [latest version of] parser_RASPBERRY.py.
The error Marc was receiving was

NetCDF: HDF error
Location: file ; line 1705

which was pretty uninformative.
The header dump Marc got (marc_headers.txt) was incomplete, the first failing line being string weather_code_METAR_4678(time) ;

My colleague Rob also faced a similar issue using Matlab (see the attached screenshot).

I was able to open the "corrupted" files with every method (nco, xarray, etc.) on my machine.
When I finally ran either the nccopy operator (without any additional option) or ncks -4 -L 5 notworking.nc good.nc to create new files out of the corrupted ones, both Marc and Rob were finally able to use them.

Please find in PAR001.tar.gz a few example files, both not-working and working ones.

I'm curious to know what you guys think of this!

Thanks :)

[BUG] <replace import statement in run_DELFT_processing>

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The import statements for the parser module need to be updated.

Current line (points to the wrong folder):

from disdrodb.utils.parser import get_parser_cmd

Replace it with:

from disdrodb.pipeline.utils_cmd import get_parser_cmd

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- python:

Anything else?

No response

[DOCS] install instructions

  • User install instructions
  • Developer install instructions: not reproducible successfully on Mac.
  • Conda environment: the reason conda install is so slow is that environment.yml should specify specific versions. Update environment.yml with specific versions.

[BUG] Error due to processed folder deletion and recreation

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

While working with a Jupyter notebook, the log created under data\DISDRODB\Processed\NETHERLANDS\DELFT blocks the deletion and recreation of the processed folder.

Expected Behavior

In raw_dir, processed_dir = check_directories(raw_dir, processed_dir, force=False), if force = False no error should be raised if the folder already exists.

Steps To Reproduce

Just run the cell under "2. Initialization" twice.

Environment

No response

Anything else?

No response

[REFACTOR] Remove templates folder

@ghiggi Okay, we're ready for the big leap - now that the notebook is a line-by-line replacement of reader_template.py, people can just copy it with their own data to avoid confusion. There should be no reason to keep this "templates" folder!

Let's be consistent with collaborative open-source good practices: if you want to see what people are doing, you can always go to their forks, but they shouldn't have to commit half-baked WIP reader development into the main repo.

I'm going ahead and making the bold move - so we can then rename every "parser" to "reader" to avoid confusion (see #46).

encountering `KeyError: 'sensor_temperature_PCB'` when using DELFT_processing.py

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Hi guys,
I'll start keeping public track of some of the common errors I encounter; I guess it will be useful in the future!

When running the run_DELFT_processing.py script on the files for the PAR002 instrument, I get a KeyError while the program is trying to convert the Parquet file to netCDF. Here is the log:

There are the following metadata files without corresponding data: ['PAR001']
 - L0 processing of station_id PAR002 has started.
 - 79 files to process in /home/sguzzo/Parsivel/RAW_TELEGRAM/CABAUW
 - 0 of 1 have been skipped.
 - Conversion to Apache Parquet started.
 - Conversion to Apache Parquet ended.
 - L0 processing of station_id PAR002 ended in 1.34s
 - L1 processing of station_id PAR002 has started.
 - Reading L0 Apache Parquet file at /home/sguzzo/Parsivel/Processed/CABAUW/L0/CABAUW_sPAR002_20211008.parquet started
 - Reading L0 Apache Parquet file at /home/sguzzo/Parsivel/Processed/CABAUW/L0/CABAUW_sPAR002_20211008.parquet ended
 - Retrieval of L1 data matrix started.
 - Retrieval of L1 data matrix finished.
Traceback (most recent call last):
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 504, in <module>
    main()
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sguzzo/miniconda3/envs/disdrodb/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/readers/DELFT/parser_RASPBERRY.py", line 466, in main
    write_L1_to_netcdf(ds, fpath=fpath, sensor_name=sensor_name)
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/L1_proc.py", line 319, in write_L1_to_netcdf
    encoding_dict = {k: encoding_dict[k] for k in ds.data_vars}
  File "/home/sguzzo/PycharmProjects/disdrodb/disdrodb/L1_proc.py", line 319, in <dictcomp>
    encoding_dict = {k: encoding_dict[k] for k in ds.data_vars}
KeyError: 'sensor_temperature_PCB'
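A hypothetical workaround for L1_proc.py (an assumption, not the confirmed fix) would subset the encoding dictionary to the variables it actually defines, so an unexpected variable such as sensor_temperature_PCB does not raise a KeyError:

# Hypothetical sketch with illustrative contents: keep only the encodings
# whose variables are known, instead of indexing blindly.
encoding_dict = {"rainfall_rate": {"zlib": True}}         # assumed content
data_vars = ["rainfall_rate", "sensor_temperature_PCB"]   # assumed ds.data_vars
encoding = {k: encoding_dict[k] for k in data_vars if k in encoding_dict}
print(encoding)  # {'rainfall_rate': {'zlib': True}}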

Expected Behavior

No response

Steps To Reproduce

Example of raw data file that can be used to reproduce the error: 20211007.zip

Anything else?

No response

What python version are you using?

3.9.*
