sdmx-twg / sdmx-im
SDMX Information Model - UML model and functional description, definition of classes, associations and attributes
Greatly improve usability of SDMX information model and maintainability of SDMX artefacts through a strengthened and simplified versioning and dependency management based on semantic versioning principles (semver.org).
In https://sdmx.org/wp-content/uploads/SDMX_3-0-0_SECTION_2_FINAL-1_0.pdf, starting at line 2116, the document lists the different objects related to constraints, which do not seem consistent with the definitions in SDMX-ML. The KeySet is defined as:
I believe that this should be clarified and made consistent.
The specs also say in line 2092: "Note that in all cases the "operator" on the value is deemed to be "equals", unless the wildcard character is used '%'. In the latter case the "operation" is a partial matching, where the percentage character ('%') may match zero or more characters." It is not sufficiently explicit whether this applies only to CubeRegions, which are mentioned just before, or also to KeySets.
This issue refers to #4 (Multiple measures) and #5 (Multi-valued attributes).
There have been cases in SDGs and other Statistical Domains where multiple measures are available for the same indicator.
Currently SDMX does not support secondary measures.
The current workaround is to create additional attributes and name them according to a convention.
This would need to be addressed by the TWG to formalise the modeling of multiple measures.
(on behalf of Eurostat)
Initial request:
Domain SDMX request - Multiple flagging.pdf
A short summary of the request:
The request is about adding some form of support for data attributes to have more than one code in datasets.
So if we have a Codelist with 3 codes {r, u, e}, as an example, it should be possible to have, in a dataset, within an observation attribute (e.g. OBS_FLAG), for one specific observation, one of the following values:
r
{r,u}
{r,e}
{u,e}
{r,u,e}
See the example below, where the OBS_FLAG attribute uses 2 codes, r and u, for a specific observation, using a space (" ") as the separator:
<generic:Obs>
  <generic:ObsDimension value="2014" />
  <generic:ObsValue value="5.06944075E8" />
  <generic:Attributes>
    <generic:Value id="OBS_FLAG" value="r u" />
  </generic:Attributes>
</generic:Obs>
After the last TWG physical meeting (Dec 2019 - LU), this issue (renamed to "Arrays for Attribute values" to better reflect the use case) is being considered together with #4, since it impacts the dataset message and needs to be combined with the multiple measures issue, also considering attaching Attributes to Measures.
The proposed idea, at the time of this writing, is to introduce an extension to the current way SDMX Attributes are reported. This means that SDMX Attributes will be reported as now when a single value is reported; when an array of values needs to be reported for an Attribute, e.g. many status values for Attribute OBS_STATUS, an XML array named after the Attribute will be included under the dataset element where the Attribute is reported. For the OBS_STATUS example, when reported under the <Obs> element:
<Obs ... >
  <OBS_STATUS>
    <value>D</value>
    <value>B</value>
    <value>Q</value>
  </OBS_STATUS>
</Obs>
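A reader of such a message would need to accept both forms. The following is a minimal sketch, not part of the proposal: it assumes the single-value form is an XML attribute on the observation element, that the array form is a child element named after the Attribute, and that no XML namespaces are involved. The helper name `attribute_values` is hypothetical.

```python
import xml.etree.ElementTree as ET


def attribute_values(obs: ET.Element, attribute_id: str) -> list:
    """Return the values reported for one attribute on an observation element."""
    # Assumed single-value form: the attribute is an XML attribute on the element.
    if attribute_id in obs.attrib:
        return [obs.attrib[attribute_id]]
    # Assumed array form: a child element named after the attribute holds <value> items.
    child = obs.find(attribute_id)
    if child is None:
        return []
    return [value.text or "" for value in child.findall("value")]


single = ET.fromstring('<Obs TIME_PERIOD="2014" OBS_STATUS="D"/>')
array = ET.fromstring(
    '<Obs TIME_PERIOD="2014"><OBS_STATUS><value>D</value><value>B</value>'
    "<value>Q</value></OBS_STATUS></Obs>"
)

print(attribute_values(single, "OBS_STATUS"))  # ['D']
print(attribute_values(array, "OBS_STATUS"))   # ['D', 'B', 'Q']
```

Returning a list in both cases lets downstream code treat a single value as an array of length one.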
The possibility of defining sub-domains (subsets of a Codelist) of a particular domain (Codelist) and being able to refer to it is very appealing and eases modeling in many situations.
This issue may be related to #2 in the sense that a subset could potentially be used to define a new codelist extension.
The HCL are not to be used for this as what we intend to model is not a hierarchy of any kind but individual groups with no hierarchical relations to each other.
Allow for many to one and many to many mappings.
Enable code list maps that can be cross-referenced between different dataset mappings.
More details attached.
To consider renaming the HCL artefact to something more fitting, because it's not a code list.
Be clear on difference between HCL and StructureSet.
An HCL can be useful to define code sets that are used for data dissemination.
It would be useful to be able to attach these to a Dataflow so that we can define hierarchies and additional visualisation codes that are not in the Codelist.
It would also enable standard codelists to be used with extensions (codes from other codelists).
This will facilitate the use of common codelists.
Introduction
This proposal advocates for the deprecation of nested metadata attributes within the SDMX Information Model.
Metadata attributes in a metadata structure definition are references to concepts. They can come from any number of concept schemes and can form a hierarchy that is specific to the metadata attribute.
This feature introduces unnecessary complexity that can be addressed through alternative solutions if it is truly needed at all.
Arguments for Deprecation
CONTACT_EMAIL as recommended by the Metadata Common Vocabulary and its successor. Rendering software would needlessly need to support variations like CONTACT_EMAIL, CONTACT.CONTACT_EMAIL, and CONTACT.EMAIL.
Proposed Alternatives
Benefits of Deprecation
Deprecating nested metadata attributes will simplify the SDMX Information Model, leading to:
Conclusion
Nested metadata attributes introduce unnecessary complexity into the SDMX Information Model. This proposal advocates for their deprecation in favor of alternative solutions that leverage concept scheme references and flattened attributes. These changes will promote interoperability, improve data clarity, and reduce development burdens within the SDMX community.
(In doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)
Following the deprecation of the MeasureDimension in the context of issue #6, the idea of simplifying the DSD by using only simple Dimensions with roles has been put on the table.
In the case of the MeasureDimension, the role MEASURE is proposed; similarly, in the case of the TimeDimension, the role TIME could be used.
Doing that, and considering also #6, the Components of a DSD may be either of the following:
The impact of this decision would be:
Hello,
In the SDMX 3.0.0 technical specifications at Figure 6: Example Metadata Set, the XML uses tags that are not in the SDMX-ML 3.0.0 XSD.
There have been cases in SDGs and other Statistical Domains where multiple measures are available for the same indicator. Currently SDMX does not support secondary measures.
The workaround at the moment is to create additional attributes.
The documentation for FixedValueMap (p. 114) provides the following description: "Links a Component (source or target) to a fixed value."
However, the Relationship diagram on page 111 has the following cardinality between FixedValueMap and Component: 0..*. Should it not be 1 instead?
The XML schemas are in line with the description (i.e. only one component is referenced):
<xs:complexType name="FixedValueMapType">
  <xs:annotation>
    <xs:documentation>FixedValueMapType defines the structure for providing a fixed value for a source or target component.</xs:documentation>
  </xs:annotation>
  <xs:complexContent>
    <xs:extension base="common:AnnotableType">
      <xs:sequence>
        <xs:choice>
          <xs:element name="Source" type="common:IDType"/>
          <xs:element name="Target" type="common:IDType"/>
        </xs:choice>
        <xs:element name="Value" type="xs:string" minOccurs="0" maxOccurs="unbounded">
          <xs:annotation>
            <xs:documentation>The fixed value for the component.</xs:documentation>
          </xs:annotation>
        </xs:element>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
Thanks!
New features in SDMX 3.X will provide microdata support. This is useful for NSOs gathering information based on SDMX structures, which can be used to validate the data during field work. The problem is the following:
A data constraint which contains a DataKeySet allows for the definition of constrained keys or wildcard keys, for example
Constrained Series
A.UK.EMP.M
A.FR.EMP.M
Constrained Wildcard Series
A..EMP.M
The problem comes when the constraint needs to constrain multiple values in a dimension but not all values (i.e. a wildcard is too wide a scope). For example:
//constrain a set of countries for employed male
A.UK.EMP.M
A.FR.EMP.M
A.DE.EMP.M
//constrain a set of countries for employed female
A.UK.EMP.F
A.FR.EMP.F
A.DE.EMP.F
When multiple dimensions are involved, with multiple values in each dimension, the number of series grows as a Cartesian product. It is better to have a series constraint with multiple values in the key part, which can collapse multiple rules into one:
A.UK+FR+DE.EMP.M+F
The SDMX Schema defines a Key as having one to many KeyValues, and a KeyValue is an id/value pair. If the KeyValue were modified to allow one or more values, this use case would be supported.
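To make the equivalence concrete, here is a minimal sketch that expands a multi-valued key back into the individual series keys it stands for. It assumes the `.` and `+` separators used in the example above; the function name `expand_key` is hypothetical.

```python
from itertools import product


def expand_key(key: str) -> list:
    """Expand a key such as 'A.UK+FR+DE.EMP.M+F' into single-valued series keys."""
    # Each dimension position holds one or more values separated by '+'.
    # An empty position (wildcard, as in 'A..EMP.M') is kept as an empty string.
    positions = [position.split("+") for position in key.split(".")]
    # The Cartesian product of the positions yields every individual key.
    return [".".join(combination) for combination in product(*positions)]


for series_key in expand_key("A.UK+FR+DE.EMP.M+F"):
    print(series_key)
```

The single rule above expands to the six keys listed earlier (three countries times two sexes), which is the growth the multi-valued KeyValue would avoid.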
The main use case is the ability to extend a Cross-Domain Code List (CDCL) with extra codes. The ability to override CDCL codes should also be considered (though it may be considered an invalid practice).
For information:
It is possible to do code list inheritance in HCLs but it can’t be used for data exchange in SDMX 2.1 as a DSD cannot reference an HCL, therefore;
COLLECTIONOFPOINTERSTOCODESINCODELISTSWITHHIERARCHICALINFORMATIONTHINGY
Allow "H" code in time period for half-yearly/semi-annual data. This would require a change in the common schema to permit flexibility on being able to specify either "S" or "H" for half-yearly data in time period.
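As an illustration of the requested flexibility, the sketch below accepts either letter when parsing a half-yearly period. It assumes the period is written as a four-digit year, a hyphen, the letter, and the half number (for example 2020-S1); the pattern and the function name `parse_half_year` are illustrative and not taken from the common schema.

```python
import re

# Assumed layout: YYYY-S1 / YYYY-S2, with "H" accepted as an alternative letter.
HALF_YEAR = re.compile(r"^(\d{4})-([SH])([12])$")


def parse_half_year(period: str) -> tuple:
    """Return (year, half) for a half-yearly period written with 'S' or 'H'."""
    match = HALF_YEAR.match(period)
    if match is None:
        raise ValueError(f"not a half-yearly period: {period!r}")
    return int(match.group(1)), int(match.group(3))


print(parse_half_year("2020-S1"))  # (2020, 1)
print(parse_half_year("2020-H2"))  # (2020, 2)
```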
Received from IMF
The header carries basic information about the SDMX file such as when it was created and the source.
Despite serving the same function in all SDMX file types there are 8 different header types leading to needless complications in building and reading SDMX files.
We recommend reducing the number of header types.
We have noticed that in the registry, a schema has a duplication of attributes at sibling and series level even when the DSD (in 2.0 SDMX-ML format) actually mentions a group level attachment:
SDMX-ML 2.0 DSD
<str:Attribute assignmentStatus="Mandatory" attachmentLevel="Group" codelist="CL_UNIT" codelistAgency="ECB" codelistVersion="1.0" conceptSchemeAgency="ECB" conceptSchemeRef="ECB_CONCEPTS" conceptRef="UNIT" conceptVersion="1.0">
<str:AttachmentGroup>Group</str:AttachmentGroup>
</str:Attribute>
When generating the SDMX-ML 2.1 DSD and schema, the attributes are attached to both the series and the group at the same time. This was apparently done to optimise for streaming when exchanging data.
The problem is that it is subject to interpretation in the specification and should (/would need to) be clarified.
It appears to be popular to use HTML fragments in information model items that are supposed to be plain text, and some existing implementations happily render them as HTML instead of plain text. Random example:
SDMX implementations would struggle to interoperate on this kind of message. There is nothing unusual about plain text that looks like it might be HTML: say, "% a<b and b>c" could be rendered as-is or as "% ac", depending on any given SDMX implementation's preference or configuration.
There should be a feature for this, with suitable encoding in the SDMX formats, to allow interpretation of such texts as intended and interoperably where the format is properly indicated.
This probably also applies especially to annotations, which so far unfortunately lack format.dataType, which would allow setting the format to XHTML for this, as can be done for dimensions, attributes, and measures.
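The two renderings of the sample text can be reproduced with a short sketch. The tag-stripping regular expression is a deliberately naive stand-in for an HTML renderer, and the function name `strip_tags` is hypothetical.

```python
import html
import re

text = "% a<b and b>c"


def strip_tags(value: str) -> str:
    """Naive stand-in for HTML rendering: drop anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", value)


# Treating the text as plain text: escape it so a browser shows it as-is.
as_plain_text = html.escape(text, quote=False)
# Treating the text as HTML: "<b and b>" is taken for a tag and disappears.
as_html = strip_tags(text)

print(as_plain_text)  # % a&lt;b and b&gt;c
print(as_html)        # % ac
```

Without a declared format, a consumer has no way to tell which of the two results the author intended.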
(In doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)
A user highlighted an inconsistency in the documentation between the IM and the message format implementations (SDMX-ML and SDMX-JSON), see: sdmx-twg/sdmx-json#124. Indeed, when constructing the message formats, sometimes some changes are made to generalise the approach and make it more coherent. Also, sometimes certain aspects may not have been seen when writing the IM document.
More specifically, you can see here and here that the maxOccurs and minOccurs parameters were moved from the component definitions (as described in the IM) into the SDMX-ML RepresentationType, thus into CoreRepresentation (inside concepts) and LocalRepresentation (inside the component definitions in the DSD). The same was done in SDMX-JSON.
It would seem to me that this inconsistency should be addressed by updating the IM document.
By the way, there is also an inconsistency about the meaning of minOccurs. In some places, the IM still says wrongly that minOccurs=0 means that a component is optional. In other places it mentions the finally agreed approach where mandatory/optional is a separate property.
Reported by Abdulla Gozalov 20 Dec 2023
SDMX 3.0.0 Section 6 guide line 541 currently reads "Limit: 1" but should be "Limit: 12".
In the SDMX STANDARDS: SECTION 6 TECHNICAL NOTES Version 3.0, page 28, line 897 it currently says:
A Constraint may reference many Dataflows or Metadataflows, the addition of more references to flow objects does not version the Constraint. This is because the Constraints are not properties of the flows – they merely make references to them.
I believe that the statement "the addition of more references to flow objects does not version the Constraint" is incorrect. The given justification, "they merely make references to them", would imply that any artefact that references other artefacts would not be versionable, which runs entirely counter to the purpose of semantic versioning.
In fact, whenever a constraint changes (even because it references (constrains) other artefacts), it absolutely needs to increase its version, because this is the only means of indicating to others that a change has happened and that another system wishing to synchronise its artefact copies needs to get the new version!
The SDMX-TWG TF2 working document on "Reorganising Constraints" actually stated:
Any changes that include the following, result into minor version increase:
– Adding new Constrainables. ...
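The rule quoted from the working document could be applied mechanically, roughly as follows. This is a sketch under assumptions: versions are three-part semantic versions, only the "adding new Constrainables" case is modelled, and the function names and example URNs are hypothetical.

```python
def bump_minor(version: str) -> str:
    """Increase the minor part of a MAJOR.MINOR.PATCH version and reset the patch."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"


def next_constraint_version(version: str, old_refs: set, new_refs: set) -> str:
    """Return the version a constraint should carry after its references change."""
    # Only the rule quoted above is modelled: newly added references to
    # constrainable artefacts lead to a minor version increase.
    if new_refs - old_refs:
        return bump_minor(version)
    return version


old = {"Dataflow=EXAMPLE:DF_A(1.0.0)"}  # hypothetical references
new = old | {"Dataflow=EXAMPLE:DF_B(1.0.0)"}
print(next_constraint_version("1.2.3", old, new))  # 1.3.0
```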
data:
  dataConstraints:
    - id: CN_SERIES_SDG_GLC
      name: SDG Series Level content constraints
      description: SDG Series Level content constraints for the Country Global
        Dataflow. They define which dimensions/codes are enabled for each
        individual series.
      version: "1.17"
      agencyID: IAEG-SDGs
      role: Allowed
      constraintAttachment:
        dataflows:
          - urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=IAEG-SDGs:DF_SDG_GLC(1.17)
      dataKeySets:
        - isIncluded: true
          keys:
            - keyValues:
                - id: SERIES
                  value: SI_POV_DAY1
                - id: UNIT_MEASURE
                  value: PT
                - id: UNIT_MULT
                  value: "0"
                - id: COMPOSITE_BREAKDOWN
                  value: _T
                - id: INCOME_WEALTH_QUANTILE
                  value: _T
                - id: PRODUCT
                  value: _T
                - id: ACTIVITY
                  value: _T
            - keyValues:
                - id: SERIES
                  value: SI_POV_DAY1
                - id: UNIT_MEASURE
                  value: PT
                - id: UNIT_MULT
                  value: "0"
                - id: COMPOSITE_BREAKDOWN
                  value: MS_MIGRANT
                - id: INCOME_WEALTH_QUANTILE
                  value: _T
                - id: PRODUCT
                  value: _T
                - id: ACTIVITY
                  value: _T
            - keyValues:
                - id: SERIES
                  value: SI_POV_DAY1
                - id: UNIT_MEASURE
                  value: PT
                - id: UNIT_MULT
                  value: "0"
                - id: COMPOSITE_BREAKDOWN
                  value: MS_NOMIGRANT
                - id: INCOME_WEALTH_QUANTILE
                  value: _T
                - id: PRODUCT
                  value: _T
                - id: ACTIVITY
                  value: _T
            - keyValues:
                - id: SERIES
                  value: SI_POV_DAY1
                - id: UNIT_MEASURE
                  value: PT
                - id: UNIT_MULT
                  value: "0"
                - id: COMPOSITE_BREAKDOWN
                  value: MS_EUMIGRANT
                - id: INCOME_WEALTH_QUANTILE
                  value: _T
                - id: PRODUCT
                  value: _T
                - id: ACTIVITY
                  value: _T
            - keyValues:
                - id: SERIES
                  value: SI_POV_DAY1
                - id: UNIT_MEASURE
                  value: PT
                - id: UNIT_MULT
                  value: "0"
                - id: COMPOSITE_BREAKDOWN
                  value: MS_NONEUMIGRANT
                - id: INCOME_WEALTH_QUANTILE
                  value: _T
                - id: PRODUCT
                  value: _T
                - id: ACTIVITY
                  value: _T
In order to allow multiple values for COMPOSITE_BREAKDOWN, everything else is also repeated.
That's silly and impossible to maintain by hand.
There must be a better way to encode this information, like allowing an array of values?
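One possible shape for such an encoding is sketched below: keys that are identical except for one dimension are merged into a single key whose value for that dimension is an array. This is only an illustration of the idea, not an existing SDMX feature; the function name `collapse_keys` is hypothetical and the keys are abbreviated to three dimensions.

```python
def collapse_keys(keys: list, dimension: str) -> list:
    """Merge keys that differ only in `dimension` into one key with a list of values."""
    grouped = {}
    for key in keys:
        # Everything except the chosen dimension identifies the group.
        rest = tuple(sorted((k, v) for k, v in key.items() if k != dimension))
        grouped.setdefault(rest, []).append(key[dimension])
    return [{**dict(rest), dimension: values} for rest, values in grouped.items()]


keys = [
    {"SERIES": "SI_POV_DAY1", "COMPOSITE_BREAKDOWN": "_T", "PRODUCT": "_T"},
    {"SERIES": "SI_POV_DAY1", "COMPOSITE_BREAKDOWN": "MS_MIGRANT", "PRODUCT": "_T"},
    {"SERIES": "SI_POV_DAY1", "COMPOSITE_BREAKDOWN": "MS_NOMIGRANT", "PRODUCT": "_T"},
]

for collapsed in collapse_keys(keys, "COMPOSITE_BREAKDOWN"):
    print(collapsed)
```

With this shape, the three keys above become one entry whose COMPOSITE_BREAKDOWN holds all three values.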
(In doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)
The decision has been taken to migrate the documentation to readthedocs.org.
We should take this opportunity to consolidate and improve the current documentation.
This repository will detail the Information Model and serve as the glue with the other repositories under sdmx-twg.
Several sets of category items (a variant) can be associated with one classification, but only one is used in a DF or PA, depending on the context (e.g. ISIC versions in CL_ACTIVITY).
The way this can be achieved is using content constraints attached to the dataflow or provision agreement.
However, the constraint specification must be done “by extension”, thus enumerating all the codes to be included or excluded. The length of this list can be a problem.
For example, the list of codes to be included in a content constraint to include the items corresponding to the ISIC Rev. 4 variant from the CL_ACTIVITY code list is composed of approximately 770 items.
Using a common prefix to identify all codes belonging to a variant helps in managing long code lists with many variants, like CL_ACTIVITY (e.g. ISIC4_, ISIC3_, NACE2_, AGG_, etc.)
It has been proposed to allow the use of regular expressions in the creation of constraints in order to reduce the length and complexity of their definition.
Nevertheless, to deal with the issue of multiple variants, and considering the adoption of the “prefixed codes” practice, a single “wildcard” character would be enough; regular expressions are not advisable, as they would “overload” the solution.
It is suggested to simply use the percentage sign (%) as the wildcard character.
Following the same example mentioned above, the more than 770 items can be reduced to a single wildcarded element.
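The selection a wildcarded element would perform can be sketched as follows, assuming "%" matches zero or more characters and every other character is literal. The function names are hypothetical, and the sample codes are invented for illustration using the prefixes mentioned above.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '%' wildcard pattern into a compiled regular expression."""
    # Escape the literal parts so characters such as '.' keep their literal meaning,
    # then join them with '.*' where the '%' wildcards were.
    literal_parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(literal_parts), re.DOTALL)


def select_codes(codes: list, pattern: str) -> list:
    """Return the codes that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    return [code for code in codes if regex.fullmatch(code)]


# Invented sample codes; only the prefixes come from the text above.
codes = ["ISIC4_A", "ISIC4_A01", "ISIC3_A", "NACE2_A", "AGG_X"]
print(select_codes(codes, "ISIC4_%"))  # ['ISIC4_A', 'ISIC4_A01']
```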
This is a proposal to add a feature that would allow maintaining translations independent of nameable-and-identifiable artefacts.
While supranational statistics organisations might have official translations for their artefacts that are maintained as part of an artefact definition, even they might want to sidestep questions about how adding or updating a translation should affect the version number of an artefact (and possible downstream consequences). They might also want to have different access control restrictions for translation maintenance and artefact maintenance. Both use cases are easier to implement if some or all translations are maintained apart from the artefact definition.
Smaller or less international organisations might have to rely on external resources for translations, like a user community or machine translation services, but might want third-party contributions kept separate from official text (for instance, to make clear that translations are provided on a best-effort basis, which they cannot with the current scheme because there is no option to annotate translations, short of using inline disclaimers, or something like comments in SDMX-ML messages, a feature not available in SDMX-JSON).
Consider this use case: Wikipedia contains many lists, tables, and charts that are based on official statistics. They are often out of date and unmaintained. If the Wikipedia community were to automatically update them using data from SDMX web services, they would face the issue that translations for many languages are not provided by the web services, and they might want to use their existing translation infrastructure to add them, perhaps with custom logic that merges SDMX messages with official data with their community translations.
Furthermore, maintaining all translations as part of the artefact definition precludes grouping translations by language, which you might want to do for reviews by language experts or a translation service.
As it is, these groups of SDMX users cannot use existing SDMX features (short of duplicating all artefacts into different namespaces) to address this.
Proposed Syntax:
In an SDMX-JSON structure message, it could look like this:
data:
  translations:
    - target: urn:...(2.0+.0).Example
      lang: en-US
      name: ...
      description: ...
      annotations: ...
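A client could merge such separately maintained translations into an artefact along these lines. This is a sketch only: the `urn`, `names`, and `descriptions` keys on the artefact are an assumed in-memory shape, the function name `apply_translations` is hypothetical, and the URN and texts in the example are placeholders.

```python
def apply_translations(artefact: dict, translations: list) -> dict:
    """Return a copy of the artefact with matching external translations merged in."""
    names = dict(artefact.get("names", {}))
    descriptions = dict(artefact.get("descriptions", {}))
    for translation in translations:
        if translation["target"] != artefact["urn"]:
            continue
        lang = translation["lang"]
        # setdefault keeps any text already maintained with the artefact itself.
        if "name" in translation:
            names.setdefault(lang, translation["name"])
        if "description" in translation:
            descriptions.setdefault(lang, translation["description"])
    return {**artefact, "names": names, "descriptions": descriptions}


artefact = {"urn": "urn:example:artefact", "names": {"en": "Example"}}  # placeholder
translations = [
    {"target": "urn:example:artefact", "lang": "fr", "name": "Exemple"},
    {"target": "urn:example:artefact", "lang": "en", "name": "Community name"},
    {"target": "urn:example:other", "lang": "de", "name": "Beispiel"},
]

print(apply_translations(artefact, translations)["names"])
# {'en': 'Example', 'fr': 'Exemple'}
```

Letting the artefact's own text win over external translations is one possible policy; the opposite choice would be a one-line change.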
Benefits:
(In doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)
Hello,
Could you please clarify the meaning of TimeRangeValueType/BeforePeriod and TimeRangeValueType/AfterPeriod especially versus TimeRangeValueType/StartPeriod and TimeRangeValueType/EndPeriod in ContentConstraints?
The IM documentation says:
Also see: https://github.com/sdmx-twg/sdmx-ml/blob/import21files/schemas/SDMXCommon.xsd#L903
What justifies the existence of these different approaches? Could this be simplified (by deprecating one of the two)?
Hello!
Working with the Infomodel, we have stumbled upon one particular thing:
The IM describes (section 5.3) data attributes and measures of a DSD as having both minOccurs/maxOccurs and usage attributes, and it is the usage attribute that dictates whether the measure (or data attribute) is mandatory. The question that arises is how implementors should handle a structure that declares, say, minOccurs=0 and usage=mandatory. They seem to contradict each other.
At the same time, in the description of the metadata structure definition, and of metadata attributes specifically, there is no usage property, and whether the attribute is required is derived from the minOccurs value, i.e. 0 standing for optional and >=1 standing for mandatory. Should the DSD be similar to this?
Thanks!
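Until the specification settles the question, an implementation could at least detect the combination and report it rather than silently choosing an interpretation. The sketch below does only that; the component dictionaries, the sample component ids, and the function name `find_ambiguous_components` are hypothetical, and whether the combination is actually an error remains the open question raised above.

```python
def find_ambiguous_components(components: list) -> list:
    """Return ids of components declaring usage=mandatory together with minOccurs=0."""
    return [
        component["id"]
        for component in components
        if component.get("usage") == "mandatory" and component.get("minOccurs") == 0
    ]


components = [
    {"id": "OBS_VALUE", "usage": "mandatory", "minOccurs": 1, "maxOccurs": 1},
    {"id": "OBS_STATUS", "usage": "mandatory", "minOccurs": 0, "maxOccurs": 3},
    {"id": "COMMENT", "usage": "optional", "minOccurs": 0, "maxOccurs": 1},
]

print(find_ambiguous_components(components))  # ['OBS_STATUS']
```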