wis2-topic-hierarchy's Introduction

wis2-topic-hierarchy's People

Contributors

6a6d74, amienshxq, amilan17, antje-s, efucile, golfvert, josusky, maaikelimper, solson-nws, tomkralidis

wis2-topic-hierarchy's Issues

uploading of changes fails on GET requests

GET request that is used to check existence of a registry/code ends with the following error:

HTTP Status 415 – Unsupported Media Type
Type Status Report

Message Unsupported Media Type

Description The origin server is refusing to service the request because the payload is in a format not supported by this method on the target resource.

Level 9+ subcategories scope and management

As discussed, I think that maintenance (GitHub), publication (WMO Codes Registry) and validation (Broker) of the subcategories shall apply to core data only. Subcategories associated with recommended data are not in scope.

If we are in agreement, then we should clearly document this.
However, there are use cases where a particular domain might want to define subcategories for certain types of recommended data, e.g. NWP highly-recommended, and we should consider options for supporting the normalization of these codes in a feasible manner.

Having version in the top levels of the topic hierarchy is potentially a problem as it can be really disruptive for all users

Having the version at the top of the tree is a problem because any change anywhere in the topic hierarchy that requires a version increment will force all topic publishers to update their configuration.

See below an example where it is decided to rename Ocean to "Marine" as one of the sub-domains. In that case the version would be incremented from a to b, and all message publishers would be required to update their destination topic configuration at the same time because the old topic paths would no longer be valid.

This means that a centre publishing only messages in the hydrology domain would have to update its topic paths because the Ocean sub-tree has been modified. Many message consumers could also be impacted.

Here is an example with the original tree:

[Image: original-tree]

Here would be the modified tree with Marine:

[Image: Screenshot 2023-10-06 at 08 56 57]

Because the version is incremented and sits at the top of the tree, all publishers will have to update their configuration in order to publish messages to the correct new topics (their topic paths have changed).

This is an unnecessary big-bang approach when only the publishers (and consumers) impacted by the change should have to modify their systems.
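
The effect can be illustrated with a minimal Python sketch. The topic paths, the `ocean`/`hydrology`/`weather` branches and the helper names (`bump_version`, `rename_node`, `changed`) are assumptions made up for illustration; they are not taken from the actual hierarchy files.

```python
# Illustrative topics only; not the real WIS2 hierarchy.
TOPICS = [
    "origin/a/wis2/data/core/ocean/buoys",
    "origin/a/wis2/data/core/hydrology/streamflow",
    "origin/a/wis2/data/core/weather/synop",
]


def bump_version(topic: str, old: str, new: str) -> str:
    """Replace the version held at level 2 of the topic."""
    levels = topic.split("/")
    if levels[1] == old:
        levels[1] = new
    return "/".join(levels)


def rename_node(topic: str, old: str, new: str) -> str:
    """Rename one node wherever it appears as a complete level."""
    return "/".join(new if level == old else level for level in topic.split("/"))


def changed(before: list[str], after: list[str]) -> int:
    """Count the topic paths that differ between two trees."""
    return sum(b != a for b, a in zip(before, after))


# Version inside the tree: renaming ocean -> marine also bumps a -> b.
with_version = [bump_version(rename_node(t, "ocean", "marine"), "a", "b") for t in TOPICS]
# Version managed outside the tree: only the renamed branch changes.
without_version = [rename_node(t, "ocean", "marine") for t in TOPICS]

print(changed(TOPICS, with_version))     # 3: every publisher must reconfigure
print(changed(TOPICS, without_version))  # 1: only the ocean publisher is affected
```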

My proposal would be to remove the version number from the topic hierarchy. The changes and version number should be managed outside of the tree implementation and announced/communicated on the main WIS2 web presence (Global Catalogue, the web site around the Global Broker if there is one). There would then be a parallel phase in which both the old and new trees are supported by the Broker for a given period, after which the dead branch is removed.

Cases to manage are:

  • modification/addition of a node: in that case it can be inserted in the existing tree.
  • deletion of a node, with or without reattaching its sub-topic nodes.

Below is an example where core would be removed from the tree and all its sub-topics reattached to data.

Here is the original tree with the new structure (sub-topics attached to data) maintained in parallel.

[Image: Screenshot 2023-10-06 at 09 18 53]

Then the core topic is deleted at the end of the parallel period:

[Image: Screenshot 2023-10-06 at 09 05 04]

In that case all the publishers not impacted by the tree branch change can continue publishing without changing their configuration.
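
As a rough sketch of the parallel period, assuming the Broker simply accepts both the old and the new path until the dead branch is removed. The function names (`drop_level`, `accepted_topics`) and the example paths are hypothetical.

```python
def drop_level(topic: str, level: str) -> str:
    """Remove one level and re-attach its sub-topics to the parent."""
    return "/".join(part for part in topic.split("/") if part != level)


def accepted_topics(old_topic: str, removed_level: str, parallel_period: bool) -> set[str]:
    """Topics accepted for a publisher that used `old_topic` before the change."""
    new_topic = drop_level(old_topic, removed_level)
    # During the parallel period both trees are valid; afterwards only the new one.
    return {old_topic, new_topic} if parallel_period else {new_topic}


old = "origin/wis2/data/core/weather/synop"  # hypothetical path, no version level
print(sorted(accepted_topics(old, "core", parallel_period=True)))
print(sorted(accepted_topics(old, "core", parallel_period=False)))
```

A publisher whose topic does not contain the removed level gets the same single path back in both phases, which is the point of the proposal.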

Did I forget something? Thoughts?
@tomkralidis @jsieland @amilan17 @josusky @antje-s @solson-nws ....

In general, it seems to me that we are going to have a lot of surprises with the topic hierarchy once it is implemented, and we should not yet make it part of any "standards" or official WMO paper, from which it would be very difficult to change afterwards.

This should be done once the Pilot or first implementation phase has been completed.

Document basic business rules for the Topic Hierarchy

Through the development of sub-discipline categories many questions are arising and I think that the domain specific communities would benefit from more guidance on TH rules for the Level 9+ values.

https://github.com/wmo-im/wis2-topic-hierarchy/wiki/Change-Management
https://github.com/wmo-im/wis2-topic-hierarchy/wiki/Business-rules

Scope

#18

Extensibility

#30 

Relationship to scope of metadata record

Repeatability

Versioning

#25

Other Related issues:
wmo-im/wis2-guide#38
#21

merge country and centre-id

As discussed at TT-WISMD 2023-09 face-to-face, as well as W2AT 2023-09-18

Proposal for adapting WTH to merge country and centre-id

Overview

WTH levels 4/5 describe country and centre-id as follows:

  • country: Lower case representation of ISO3166 3-letter code. Includes extensions for partner organizations
  • centre-id: Acronym as specified by member and endorsed by the PR of the country and by WMO

Issues with country

  • ambiguous meaning/role
    • issuing centre? PR representation/member? location of data?
  • ambiguous coupling / relationship of centres and countries (one to many)
  • can create numerous permutations of topics
  • increases the difficulty of finding the datasets users are interested in, as they first have to understand which country and then which centre the data comes from. For instance, the country code for international organizations is really not obvious to non-expert users.

Proposal

Remove country and define the centre-id in reverse hostname notation (starting with the TLD) as a single compound level. In other words, the citation authority is based on the Internet domain name of the issuing centre.

Examples:

  • ca.gc.ec.msc
  • uk.gov.metoffice
  • fr.meteo
  • int.eumetsat
  • de.dwd
  • gov.noaa
  • cn.gov.cma
  • test.wis2node1
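
A minimal sketch of the conversion, assuming the centre-id is derived mechanically from a domain name. The function name `reverse_hostname` and the input hostnames are illustrative assumptions, obtained by inverting the examples above.

```python
def reverse_hostname(hostname: str) -> str:
    """Turn an Internet domain name into reverse (TLD-first) notation."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    if not all(labels):
        raise ValueError(f"empty label in hostname: {hostname!r}")
    return ".".join(reversed(labels))


print(reverse_hostname("metoffice.gov.uk"))  # uk.gov.metoffice
print(reverse_hostname("msc.ec.gc.ca"))      # ca.gc.ec.msc
```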

Benefits

  • clearer attribution to the issuing centre
  • reduces WTH by one level
  • WTH validation becomes easier for runtime
  • Same notation and practice as in WIS1 and on the Web. Users are already familiar with it, and WIS1 users already know the centre ids.

Implications

  • dotted notation may or may not be problematic for AMQP-based implementations
    • however, MQTT is the requirement for WIS2, so this is an implementation detail

Change management

  • Update to WIS2 Nodes, Global Brokers, and clients

remove reports from notification type

W2AT 2023-11-14:

In the context of monitoring and metrics, it was decided that reports/metrics/alerts would occur in a WIS2 system message bus.

As a result, it was decided to remove report from the notification-type level of WTH.

Topic Hierarchy Structure: The extreme complexity of the Topic hierarchy could potentially lead to a limited adoption of the service or very large performance issues

I hope that I have misunderstood something and that this can be resolved easily by updating my understanding of the WIS architecture, but I have a couple of points to raise on the topic hierarchy.

I have been looking at the WIS2 topic hierarchy structure, which is meant to help users find datasets and filter data topics by subject. Thinking about how it could be implemented, it looks to me that its complexity will be a very large barrier to entry, or that it could lead users to ignore it completely.
Another point is that the topic hierarchy could require a very complex system for the main broker, reflecting the entire hierarchy, and maintaining good performance could be extremely challenging.
Below are the points that I have been trying to develop:

Large Discovery/Domain information in the topic hierarchy will be counter-productive in helping users understand what data is available and how to find relevant data

A quick calculation, taking the first 8 levels and assuming around 195 countries and 20 centres per country on average (which is probably below the real number):
I end up with 2x1x1x195x20x4x2x8 = 499,200 branches for the first 8 levels, and for the total tree, taking 3 levels of 5 sub-disciplines each: 2x1x1x195x20x4x2x8x5x5x5 = 62,400,000 topics. The assumptions may be too generous, but reducing the problem by a factor of 100 leads to the same conclusion.
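
The arithmetic can be checked with a few lines of Python. The per-level counts are the assumptions stated in the estimate, not measured values.

```python
from math import prod

# Assumed number of values per level, as stated in the estimate.
first_eight_levels = [2, 1, 1, 195, 20, 4, 2, 8]
sub_discipline_levels = [5, 5, 5]

branches = prod(first_eight_levels)
topics = branches * prod(sub_discipline_levels)

print(branches)  # 499200
print(topics)    # 62400000
```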

From the discovery/usability point of view, this is a large obstacle for users if the intention is to have them understand the topic hierarchy and use it to find the data they are interested in.
Users will most probably not find their way and might simply use + or # wildcards at many levels to receive some data.
They could then be overwhelmed by the number of messages received, and the main brokers could be overloaded by such subscriptions and by the number of clients subscribing to many topics.

This is why I am questioning the purpose of providing so much semantic and discovery information in the topic hierarchy and making it so deep.

Additionally, if the intention is to help users understand what data is available, why do we have 8 levels of technical (version, WIS2) and political information before the domain information?

At the least the topic hierarchy should be reversed but, in my opinion, it should mostly be simplified.

If the answer to the questions above is that the catalogue will provide the discovery services to find the data, then there is no need to create such a complex topic hierarchy structure, which will make the implementation very complex and challenging for the users.

Potential performance issues and challenges for implementation

Another point is the performance of a system that will have to replicate and manage for distribution 62 million topics, some of which have a very high distribution frequency. This certainly leads to the implementation of a large-scale system, and tests at that scale should be performed to assess whether the products on the market (HiveMQ, RabbitMQ, Mosquitto, Amazon MQTT service) can cope easily with it.
It should also be noted that this complex hierarchy forces users to use wildcards (+, #), which will make the system even more demanding in terms of resources (tables in memory, on disk or in databases are needed to resolve the wildcards and maintain the multiple subscriptions of thousands of users).
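
For reference, here is a simplified sketch of how MQTT wildcard filters match topics: `+` matches exactly one level and `#` matches the parent level and everything below it. It ignores the special handling of topics starting with `$` and does not check that `#` is the last level of the filter. The topic names and centre names are made up.

```python
def matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, f in enumerate(filter_levels):
        if f == "#":
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(topic_levels):
            return False
        if f != "+" and f != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


topics = [
    "origin/a/wis2/fra/centre-a/data/core/weather/synop",
    "origin/a/wis2/fra/centre-b/data/core/ocean/buoys",
    "origin/a/wis2/deu/centre-c/data/core/weather/synop",
]
narrow = "origin/a/wis2/fra/centre-a/data/core/weather/synop"
wide = "origin/a/wis2/+/+/data/core/#"

print(sum(matches(narrow, t) for t in topics))  # 1
print(sum(matches(wide, t) for t in topics))    # 3
```

A broker has to evaluate every wide filter against each published topic, which is where the resource concern comes from.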

Proposal for a way forward

I would propose to re-think the topic hierarchy and go back to the initial requirements:

  • Remy said that the intention was to use it to help users avoid subscribing to too many topics and being overwhelmed by the number of messages received.

How should the topic hierarchy be organised to focus on this requirement?

Here are some leads that could help solve the issue without leading to a difficult full-scale implementation:

  • The discovery services of the catalogue shall be used to provide the topics to which users will want to subscribe. We then do not provide semantics in the topic hierarchy (it is not a discovery service). Use arbitrary names to avoid any misinterpretation, and minimize their number.
  • Limit the number of levels in the topic hierarchy.
  • The originator of the notification is in the messages, so the political structure might not be needed in the hierarchy. The domain structure might also not be needed, to minimize complexity, as it will be available from the discovery catalogue.
  • It might be that only a limited set of data/messages needs such a deep topic hierarchy, and it should be built only for that limited purpose.
  • A practical organisation might be to limit the number of levels and let centres define a simple technical/logical structure while limiting the number of topics.
  • Rules on how many topics may exist at each level should be created and enforced. Exceptions should be reviewed and approved by a WMO body.
  • What about multi-purpose datasets? How are centres going to classify this type of data and respond to users' queries? Currently one topic domain category will be chosen for a dataset, and a user using this data for another purpose will have difficulty finding it. Then again, what is the purpose of providing a wrong semantic structure for that user? On the other hand, datasets can be classified under multiple domain categories in a discovery catalogue.

Another proposal would be to implement a large scale prototype simulating the load and number of topics to be created and reflected on the main brokers.

What do you think? Comments?

rename level 3 name from wis2 to network

Level 3 of the topic hierarchy is named wis2 and defines a fixed value of wis2.

Should we consider renaming the name of the level (not the fixed / single value of wis2) itself to something like project or system for some "future proofness" (essentially renaming topic-hierarchy/wis2.csv to topic-hierarchy/project.csv, topic-hierarchy/system.csv, or something else)?

add topics for hydrology

proposal circa Feb 2024, see branch

  • prediction
    • hydrological-hazards
      • short-range-flash
      • long-range-riverine
      • nine-months-hdi
      • three-months-phdi
      • spei
    • water-resources-groundwater
      • monthly-well-water-levels
      • three-month-springs-levels
    • water-resources-soil
      • seasonal-soil-moisture
    • water-resources-surface
      • monthly-streamflow
      • stream-stage
      • monthly-reservoir-levels
      • seasonal-snowmelt
    • water-resources
      • monthly-streamflow
    • space-based-observations
      • earth-systems
      • space-based-observations
        • soil
          • ...
        • groundwater-quality
          • ...
        • groundwater-quantity
          • ...
    • subsurface-based-observations
      • water-quantity
      • water-quality
      • land-atmospheric
      • sediment-surface-water
    • surface-based-observations
      ...

add centre-id for ECMWF

According to issue #1, ECMWF will be associated with RAVI, but we also need to have the centre id in the topic hierarchy.

"Country code" for WMO and Partner Organisations

(from @golfvert; manually moved from https://github.com/wmo-cop/wis2-topic-hierarchy/issues/1)

According to https://www.iso.org/glossary-for-iso-3166.html

User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters AA, QM to QZ, XA to XZ, and ZZ, and the series AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available.
NOTE: Please be advised that the above series of codes are not universal, those code elements are not compatible between different entities.

For partner organisations without a country code (e.g. ECMWF, EUMETSAT, ESA, ...) we can choose a 3-letter acronym from the list of user-assigned codes.

XPO seems to be used by Interpol (XPO is used for Interpol travel documents), so what do we choose: APO? AAA? ZZZ? That would then mean (e.g.) ZZZ.ECMWF in the topic tree.

centre-id and notification-type transposed in example, section 7.1.1.

See standard/sections/clause_7_normative_text.adoc line 19:

The representation is encoded as a simple text string of values in each topic level separated by a /. For example, origin/a/wis2/data/ca-eccc-msc/core/weather/surface-based-observations/synop or origin/a/wis2/data/ca-eccc-msc/recommended/atmospheric-composition/experimental/space-based-observation/geostationary/solar-flares.

Should be centre-id before notification-type
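
A small sketch of the corrected ordering, with centre-id placed before notification-type. The level names follow those used elsewhere in these issues; the `LEVELS` list and `build_topic` helper are illustrative assumptions, not part of the standard.

```python
# Assumed ordering of the fixed levels, with centre-id before notification-type.
LEVELS = [
    "channel",
    "version",
    "system",
    "centre-id",
    "notification-type",
    "data-policy",
    "earth-system-discipline",
]


def build_topic(values: dict[str, str], subcategories: list[str]) -> str:
    """Join the fixed levels in order, followed by any sub-category levels."""
    fixed = [values[name] for name in LEVELS]
    return "/".join(fixed + subcategories)


topic = build_topic(
    {
        "channel": "origin",
        "version": "a",
        "system": "wis2",
        "centre-id": "ca-eccc-msc",
        "notification-type": "data",
        "data-policy": "core",
        "earth-system-discipline": "weather",
    },
    ["surface-based-observations", "synop"],
)
print(topic)  # origin/a/wis2/ca-eccc-msc/data/core/weather/surface-based-observations/synop
```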

add guidance on centre-id

WTH should provide guidance (requirements or recommendations) on how centres should craft their centre-id value. Examples:

  • dashes, not underscores
  • no special characters
  • no infrastructure (e.g. not my-centre-wis2node)
  • additional guidance?
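
One way such guidance could be checked automatically is sketched below. The exact rule set (lower-case letters and digits separated by single dashes) and the list of infrastructure terms are assumptions for illustration, not agreed WTH rules.

```python
import re

# Assumed rule: lower-case letters and digits, separated by single dashes.
CENTRE_ID_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
# Hypothetical list of infrastructure terms to reject.
INFRASTRUCTURE_TERMS = ("wis2node", "wis2box")


def check_centre_id(centre_id: str) -> list[str]:
    """Return a list of problems found; an empty list means the id passes."""
    problems = []
    if "_" in centre_id:
        problems.append("use dashes, not underscores")
    if not CENTRE_ID_PATTERN.fullmatch(centre_id):
        problems.append("only lower-case letters, digits and single dashes are allowed")
    if any(term in centre_id for term in INFRASTRUCTURE_TERMS):
        problems.append("do not name infrastructure in the centre-id")
    return problems


print(check_centre_id("ca-eccc-msc"))         # []
print(check_centre_id("my-centre-wis2node"))  # ['do not name infrastructure in the centre-id']
```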

cc @golfvert

Add guidance/rules about including non-core/undefined topics in the TH tree

As I understand it so far, 

  1. All metadata records and notification messages for core and recommended data will include a topic hierarchy.
  2. Level 9+ sub-categories will be defined by the appropriate community for core data. 
  3. Values for core data will be proposed to and maintained by TT-WISMD. 
  4. All records for core data will use this defined TH.
  5. Records for recommended data will include as much as possible of the defined TH that applies.

The questions are:

  1. Can the TH include undefined topics at the end of the tree? e.g. "origin/../.../../data/recommended/weather/space-based-observations/lightning" or e.g. "origin/../.../../data/recommended/weather/ship-based-observations"
  2. A sub-category may change from recommended to core. It would be useful to have this already defined for consistency. Where does the TT-WISMD draw the line between managing potentially all sub-categories and only core-subcategories? Is it important to make this clear?
  3. If we allow these undefined values, what happens to them in GB and GC?

These questions arose during the NWPMetadata workshop in January.

rename resource-type.csv

As discussed at TT-WISMD 2023-09-13, rename https://github.com/wmo-im/wis2-topic-hierarchy/blob/main/topic-hierarchy/resource-type.csv to https://github.com/wmo-im/wis2-topic-hierarchy/blob/main/topic-hierarchy/notification-type.csv given the clash of the "resource-type" concept with WCMP2 codelists. This is a working level change to the inner workings of the CSV files here on GitHub.

On further thought, I think we should rename it to simply type.csv. Thoughts?

cc @gaubert @jsieland @antje-s @josusky @solson-nws @Amienshxq @McDonald-Ian @amilan17 @david-i-berry

clarify centre-id management

The WIS2 Global Registry manages WIS2 services from members, part of which includes centre-id definitions.

As part of WTH and WIS2 development, WTH manages centre-id's in https://github.com/wmo-im/wis2-topic-hierarchy/blob/main/topic-hierarchy/centre-id.csv

#129 attempts to sync with GR; however, we can see some inconsistencies (Description wording, etc.).

We need to establish a clear working level workflow that results in consistent and quality centre-id entries with a single source of truth which is synchronized accordingly.

Principles

(TBD whether principles are put forth in the Guide or internal workings of GR).

  • centre-id's:
    • are managed by WMO Secretariat as part of the Global Registry (GR) role
    • follow centre-id naming rules of WTH
    • are populated with fulsome and descriptive Description metadata
      • Global services formatted like <centre name>,<global service type>, for example: Deutscher Wetterdienst (Germany), Global Cache Service
      • Country name visible as required

Workflow

  • Creation of GitHub Action in GR, when GR is updated:
    • run script that adds new centre-id's (in alphabetical order by Name column)
    • pushes update commit to WTH GitHub
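
The script step could look roughly like the sketch below. It assumes both the GR export and the WTH file are CSVs with `Name` and `Description` columns; the function name `merge_centre_ids` and the sample rows are hypothetical, and the GitHub Action wiring and the commit step are left out.

```python
import csv
import io


def merge_centre_ids(wth_csv: str, gr_csv: str) -> str:
    """Add centre-ids present in the GR export but missing from the WTH CSV."""
    wth_rows = list(csv.DictReader(io.StringIO(wth_csv)))
    gr_rows = list(csv.DictReader(io.StringIO(gr_csv)))
    known = {row["Name"] for row in wth_rows}
    merged = wth_rows + [row for row in gr_rows if row["Name"] not in known]
    # Keep the file in alphabetical order by the Name column.
    merged.sort(key=lambda row: row["Name"])

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["Name", "Description"],
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(merged)
    return out.getvalue()


WTH_CSV = "Name,Description\nca-eccc-msc,Example description\n"
GR_CSV = (
    "Name,Description\n"
    "de-dwd,Deutscher Wetterdienst (Germany)\n"
    "ca-eccc-msc,Example description\n"
)
print(merge_centre_ids(WTH_CSV, GR_CSV))
```

Existing WTH rows win over GR rows with the same Name in this sketch; whether GR should instead overwrite them depends on which side is chosen as the single source of truth.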

For @wmo-im/tt-wismd discussion.

Organize level 8+ sub-topics

Levels 1 - 7 are organized as flat CSVs, but at some point the sub-topics will need to branch off. For discussion, below are some screenshots of the current organization and some alternative options.

Version management of Topic Hierarchy

During the TT-NWPMetadata there were questions about how the versions of the topic hierarchy are managed. We should have it documented and also discuss with architecture team.

updated: 31 May 2023

DECISION (draft)

  • additions are expected (FT) as minor (x.y)
  • updates/removals are breaking changes (not bound to FT) (x)
  • domain independent, following pattern in TT-TDCF, for example
  • how many versions to support?
  • support n-1 (current/previous)
  • Actual version notation will be addressed in #11
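
A sketch of how the draft rules could be applied mechanically. The numeric (major, minor) notation is only a placeholder, since the actual notation is still to be addressed in #11; the function names are hypothetical.

```python
def classify_change(old_topics: set[str], new_topics: set[str]) -> str:
    """Classify a change between two sets of topics under the draft rules."""
    if old_topics - new_topics:
        return "major"  # updates/removals are breaking changes
    if new_topics - old_topics:
        return "minor"  # pure additions
    return "none"


def next_version(version: tuple[int, int], change: str) -> tuple[int, int]:
    """Increment a placeholder (major, minor) version for the given change."""
    major, minor = version
    if change == "major":
        return (major + 1, 0)
    if change == "minor":
        return (major, minor + 1)
    return version


def supported_majors(current_major: int) -> list[int]:
    """n-1 support: the previous and the current major version."""
    return [v for v in (current_major - 1, current_major) if v >= 1]


old = {"weather/synop", "ocean/buoys"}
print(classify_change(old, old | {"hydrology/streamflow"}))       # minor
print(classify_change(old, {"weather/synop", "marine/buoys"}))    # major
print(supported_majors(3))                                        # [2, 3]
```

A rename shows up as one removal plus one addition, so it is treated as breaking.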

clarify role of country

Level 4 (country) is currently defined as:

Lower case representation of ISO3166 3-letter code. Includes extensions for partner organizations

We need to clarify the role of the country in the documentation so that the topic hierarchy's country level is clearly defined to users. Options:

  • originator: party who created the resource
  • distributor: party who distributes the resource
  • other options?

add experimental level

W2AT 2023-11-14:

  • add a topic level experimental for each earth-system-discipline-subcategory, where subtopics are not to be validated by WIS2 global services
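
A minimal sketch of what "not validated" could mean for a validator: levels are checked against the controlled vocabulary until `experimental` is reached, and everything below it is accepted as-is. The `VOCABULARY` slice and the function name are hypothetical.

```python
# Hypothetical slice of the controlled vocabulary, as a nested dict.
VOCABULARY = {
    "weather": {"surface-based-observations": {"synop": {}}},
    "atmospheric-composition": {},
}


def validate_subtopics(levels: list[str]) -> bool:
    """Validate discipline and sub-category levels, skipping all levels below `experimental`."""
    node = VOCABULARY
    for depth, level in enumerate(levels):
        if level == "experimental" and depth > 0:
            return True  # sub-topics below this level are not validated
        if level not in node:
            return False
        node = node[level]
    return True


print(validate_subtopics(
    ["atmospheric-composition", "experimental", "space-based-observation", "geostationary", "solar-flares"]
))  # True
print(validate_subtopics(["weather", "made-up-topic"]))  # False
```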

Monitor Topic

In WIS-Guide a monitor topic is mentioned, e.g. under 2.7.3.1 "...Global Broker will not discard the message but will send a message on the monitor topic hierarchy to inform the originating centre and its GISC." and under 2.7.4.1 "...Global Cache decides not to cache data it should behave as though the cache property is set to false and send a message on the monitor topic hierarchy to inform the originating centre and its GISC.". Should we add the monitor value to WTH even if it is a separate subtree, so that everyone is aware of it and for clarity?

version (level 2) - version of topic hierarchy or of message format?

Currently the notes state that it would be "Alphabetic version of the topic hierarchy", but originally it was for the version of the message format, as far as I know. I think it would be good to use the version of the message format, because then it is possible to introduce a new message format and all consumers can switch independently.

clarify what is a valid topic

As discussed with @golfvert, we need to clarify whether "partial" topics are deemed valid and can/should be used or not.

For example, origin/a/wis2/ca-eccc-msc/data/core/weather/surface-based-observations/synop is a valid topic.

Should origin/a/wis2/ca-eccc-msc/data/core/weather/surface-based-observations be considered a valid topic as well? This means a topic without a leaf?

We would need to update the specification to be clear in this regard (and decide whether Requirement 1B needs to be updated/augmented). In addition, we would need to update the artefacts made available on schemas.wmo.int for the WTH CV bundle/lookup.
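
The two possible answers can be expressed as a single flag in a validator. This is only a sketch over a hypothetical slice of the sub-category tree; `TREE`, `is_valid` and `allow_partial` are illustrative names.

```python
# Hypothetical slice of the tree below the earth-system-discipline level.
TREE = {
    "weather": {
        "surface-based-observations": {"synop": {}},
    },
}


def is_valid(levels: list[str], allow_partial: bool) -> bool:
    """Check a topic's trailing levels, optionally accepting topics without a leaf."""
    node = TREE
    for level in levels:
        if level not in node:
            return False
        node = node[level]
    # A leaf has no children; a partial topic stops at an inner node.
    return allow_partial or not node


partial = ["weather", "surface-based-observations"]
print(is_valid(partial + ["synop"], allow_partial=False))  # True
print(is_valid(partial, allow_partial=False))              # False
print(is_valid(partial, allow_partial=True))               # True
```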

Normalization of WTH references to artifacts such as schemas, codes...

Ensure consistency of WTH URLs and resources outside of the Manual on WIS. Below are lists of what exists today in WTH.

consistent hierarchy levels in centre-id?

Dear colleagues, when looking at the centre-id.csv I noticed that at first it seems to separate hierarchy levels by hyphens, e.g. for DWD:

de-dwd: <country>-<institution>

But that doesn't seem to be the case in, for example,

de-dwd-gts-to-wis2 where the last 3 items seem to be one name, but suggest further hierarchy levels via the hyphens.

or in
fr-meteo-france: after the country, only "meteo" would be the institution when machine-parsing with a hyphen as separator.

As far as I understood, the scheme urn:wmo:md:{centre_id}:{local_identifier} offers the opportunity to parse the origin of a dataset without opening it. In the examples above, hyphens as hierarchy level separators are mixed with hyphens as part of names. That will make automatic parsing of the data source ambiguous.

Best regards,
Hella Riede (DWD)
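
The ambiguity is easy to reproduce. In the sketch below, splitting on every hyphen gives misleading levels, while treating only the first hyphen as a separator keeps the remainder opaque. That the first element is always the country/TLD is an assumption, and the function names are illustrative.

```python
def naive_levels(centre_id: str) -> list[str]:
    """Split on every hyphen, as a naive machine parser would."""
    return centre_id.split("-")


def split_country(centre_id: str) -> tuple[str, str]:
    """Treat only the first hyphen as a separator: (country, rest)."""
    country, _, rest = centre_id.partition("-")
    return country, rest


print(naive_levels("fr-meteo-france"))      # ['fr', 'meteo', 'france']
print(split_country("fr-meteo-france"))     # ('fr', 'meteo-france')
print(split_country("de-dwd-gts-to-wis2"))  # ('de', 'dwd-gts-to-wis2')
```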

provide publication artifacts

Currently, we have a GitHub Actions CI that uses pywcmp's bundle workflow to publish a first pass/working level JSON file of all topics.

There are a few issues with this workflow:

  • centres should be bound to countries only (per #29 (comment))
  • the list will explode as domain topics begin to populate
  • this workflow should be part of this repository, not pywcmp. Note that there will be some relation to #26 (which in contrast would provide different inputs to codes.wmo.int)

Assigning to @antje-s and @josusky; additional help is welcome (cc @wmo-im/tt-wismd)

update definition of country

It needs to be clear that the 'country' is for the location of the data center, not the geographic location of the dataset.

define dedicated GTS topic

As part of transition from GTS to WIS2, it is important to be able to clearly articulate what data is coming from GTS vs. WIS2.

create top level CSV for all levels with descriptions?

Right now this content is just a table in the readme

| Level | Name | Notes |
|---|---|---|
| 1 | channel | Location of where the data originates from (data providers [origin] or global services [cache]) |
| 2 | version | Alphabetical version of the topic hierarchy |
| 3 | system | Fixed value of wis2 for WIS2 |
| 4 | country | Lower case representation of ISO3166 3-letter code. Includes extensions for partner organizations |
| 5 | centre-id | Acronym as specified by member and endorsed by the PR of the country and by WMO |
| 6 | resource-type | WIS2 resources types (data, metadata, reports [from monitoring activities]) |
| 7 | data-policy | Data policy as defined by the WMO Unified Data Policy. core data are available from the Global Caches with open access on a free and unrestricted basis. Notifications for core and recommended data are available by subscription to Global Brokers. recommended data are downloaded from the original NC/DCPC and may require authentication/authorisation |
| 8 | earth-system-discipline | As per Annex 1 of resolution 1 Cg-Ext-2021 |
| 9 | earth-system-discipline-subcategory | As proposed by domain experts and further approved by INFCOM |

Add satellite topics

Add all non-commercial satellites with a status of operational, standby or commissioning, as defined in the OSCAR Satellite DB, using the following topics.

  • weather > space-based-observations > satellite > instrument
  • space-weather > space-based-observations > satellite > instrument

satellite-topics-final-draft.xlsx

  • weather > space-based-observations > orbit-type > sensor-type > satellite-name > data-type
  • space-weather > space-based-observations > orbit-type > sensor-type > satellite-name > data-type

add parent column for country to centre-id

There is a constraint between countries and centre-id's where one country may have 0..n associated centre-id's.

We need to represent this relationship to prevent possibilities such as:
origin/a/wis2/usa/eccc-msc/...

Representing it would also let us map centre-ids to countries.

One option is to add a parent column to centre-id.csv to express this constraint between these two levels.

  1. are there other options we should consider?
  2. are there similar situations between other parent/child levels?
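
A sketch of how a parent column could be used at validation time. The CSV content, the `Parent` column name and the helper names are assumptions; the topic layout with country at level 4 and centre-id at level 5 follows the example above.

```python
import csv
import io

# Hypothetical centre-id.csv content with an added `Parent` column.
CENTRE_ID_CSV = """Name,Parent
eccc-msc,can
dwd,deu
"""


def load_parents(text: str) -> dict[str, str]:
    """Map each centre-id to its parent country code."""
    return {row["Name"]: row["Parent"] for row in csv.DictReader(io.StringIO(text))}


def country_matches(topic: str, parents: dict[str, str]) -> bool:
    """Check that the centre-id (level 5) belongs to the country (level 4)."""
    levels = topic.split("/")
    if len(levels) < 5:
        return False
    country, centre_id = levels[3], levels[4]
    return parents.get(centre_id) == country


parents = load_parents(CENTRE_ID_CSV)
print(country_matches("origin/a/wis2/usa/eccc-msc/data", parents))  # False
print(country_matches("origin/a/wis2/can/eccc-msc/data", parents))  # True
```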

Add topics for climate earth system discipline

Proposed initial topics for climate domain (and mapping of GTS headers, see #147).

Climate

| name | description | source |
|---|---|---|
| monthly | Monthly values from land stations, e.g. CLIMAT | |
| daily | Daily values from land stations, e.g. DAYCLI | |
| sub-daily | Reprocessed hourly and other sub-daily observations from land stations | |
