magda-csw-connector's Introduction

Magda CSW Connector

Magda connectors go out to external datasources and copy their metadata into the Registry, so that they can be searched and have other aspects attached to them. A connector is simply a docker-based microservice that is invoked as a job. It scans the target datasource (usually an open-data portal), then completes and shuts down.

Magda CSW Connector crawls data from CSW (Catalog Service for the Web) data sources.

Release Registry

Since v2.0.0, we use GitHub Container Registry as our official Helm Chart & Docker Image release registry.

It's recommended to deploy connectors as dependencies of a Magda helm deployment.

dependencies:
  - name: magda-csw-connector
    version: "2.0.0"
    alias: connector-xxx
    repository: "oci://ghcr.io/magda-io/charts"
    tags:
      - connectors
      - connector-xxx

Requirements

Kubernetes: >= 1.21.0

| Repository | Name | Version |
|------------|------|---------|
| oci://ghcr.io/magda-io/charts | magda-common | 2.1.1 |

Values

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| config.basicAuthEnabled | bool | false | Whether or not to send a basic auth header. |
| config.basicAuthPassword | string | nil | Basic auth password. You can also pass this value via a secret. To do so, set basicAuthSecretName to the secret name. |
| config.basicAuthSecretName | string | nil | Name of a secret that supplies the basic auth username & password. The secret must have two keys: username & password. |
| config.basicAuthUsername | string | nil | Basic auth username. You can also pass this value via a secret. To do so, set basicAuthSecretName to the secret name. |
| config.id | string | "default-csw-connector" | |
| config.name | string | nil | Friendly readable name. Compulsory. |
| config.outputSchema | string | http://www.isotc211.org/2005/gmd | Desired output schema to be requested from the CSW service, as a URI. |
| config.pageSize | int | 100 | Number of records to request per page when crawling through from beginning to end. |
| config.schedule | string | "0 14 * * 6" (i.e. 12am Sydney time on Sunday) | Crontab schedule for how often this should happen. |
| config.sourceUrl | string | nil | The base URL of the place to source data from. Compulsory. |
| config.typeNames | string | gmd:MD_Metadata | Record type expected to be returned from the CSW service, as an XML tag name. |
| config.usePostRequest | bool | false | Whether or not to use a POST request to call the getRecords API. |
| defaultImage.imagePullSecret | bool | false | |
| defaultImage.pullPolicy | string | "IfNotPresent" | |
| defaultImage.repository | string | "ghcr.io/magda-io" | |
| defaultSettings.includeCronJobs | bool | true | |
| defaultSettings.includeInitialJobs | bool | false | |
| defaultTenantId | int | 0 | |
| global.connectors.image | object | {} | |
| global.image | object | {} | |
| image.name | string | "magda-csw-connector" | |
| resources.limits.cpu | string | "100m" | |
| resources.requests.cpu | string | "50m" | |
| resources.requests.memory | string | "30Mi" | |
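
To illustrate how config.pageSize affects a crawl: CSW GetRecords requests address a page with a 1-based startPosition and a maxRecords count. The sketch below is not taken from the connector's source; pageStartPositions is a hypothetical helper that lists the startPosition of each request for a given total record count.

```typescript
// Hypothetical helper (not part of magda-csw-connector): returns the
// 1-based startPosition of every page needed to cover totalRecords.
function pageStartPositions(totalRecords: number, pageSize: number): number[] {
  if (pageSize <= 0) {
    throw new Error("pageSize must be positive");
  }
  const starts: number[] = [];
  // CSW startPosition counts from 1, so the first page starts at 1.
  for (let start = 1; start <= totalRecords; start += pageSize) {
    starts.push(start);
  }
  return starts;
}
```

With the default pageSize of 100, a catalogue of 250 records would need three requests, starting at positions 1, 101 and 201.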

magda-csw-connector's Issues

Dataset should have publisher

Describe the bug
Some datasets harvested by this connector have an empty publisher property, e.g. those from the following sources:

  • Australian Urban Research Infrastructure Network
  • Australian Oceans Data Network
  • Mineral Resources Tasmania

To Reproduce
For example, the response to the following request has publisher = "".
https://data.gov.au/api/v0/registry/records/ds-aurin-aurin:datasource-AU_Govt_ABS-UoM_AURIN_DB_3_abs_ihad_lga_2016?optionalAspect=source&optionalAspect=dcat-dataset-strings&optionalAspect=dcat-distribution-strings&dereference=true

{
  "aspects": {
    "dcat-dataset-strings": {
      "contactPoint": "GeoServer",
      "description": "...",
      "keywords": [
        "socio-economic"
      ],
      "languages": [],
      "publisher": "",
      "spatial": "POLYGON((96.81 -43.75000004080834, 159.11000000000004 -43.75000004080834, 159.11000000000004 -9.140000000990348, 96.81 -9.140000000990348, 96.81 -43.75000004080834))",
      "themes": [],
      "title": "ABS - Index of Household Advantage and Disadvantage (IHAD) (LGA) 2016"
    },
    "source": {
      "id": "aurin",
      "name": "Australian Urban Research Infrastructure Network",
      "type": "csw-dataset",
      "url": "https://openapi.aurin.org.au/public/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=aurin%3Adatasource-AU_Govt_ABS-UoM_AURIN_DB_3_abs_ihad_lga_2016"
    }
  },
  "id": "ds-aurin-aurin:datasource-AU_Govt_ABS-UoM_AURIN_DB_3_abs_ihad_lga_2016",
  "name": "ABS - Index of Household Advantage and Disadvantage (IHAD) (LGA) 2016",
  "sourceTag": "60eda22a-11ff-4ae9-9def-0f12bef8f179",
  "tenantId": 0
}

In addition, if optionalAspect=dataset-distributions is added to the above query, every accessURL value has a double slash after the hostname.

accessURL: "https://openapi.aurin.org.au//public/wfs?request=getFeature&version=1.0.0...

Expected behavior

  • The publisher property should be populated, because the organisation search API relies on it.
  • An accessURL should be correct (without the extra slash).
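
The cause of the double slash is not stated in this issue. Assuming it comes from concatenating a base URL that ends in "/" with a path that begins with "/", a minimal sketch of a safer join could look like this. joinUrl is a hypothetical helper, not a function from the connector.

```typescript
// Hypothetical helper: joins a base URL and a path with exactly one "/"
// between them, regardless of trailing/leading slashes on either side.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes
  return `${trimmedBase}/${trimmedPath}`;
}
```

For example, joinUrl("https://openapi.aurin.org.au/", "/public/wfs") yields "https://openapi.aurin.org.au/public/wfs".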

magda-csw-connector generates date string that doesn't match JSON schema

Describe the bug

magda-csw-connector generates a date string that doesn't match the JSON schema for the dcat-dataset-strings aspect:

Failed to PUT data registry record with ID "dist-aims-ef67ebe0-61ac-4a0a-a5a5-52a12b4d727c-1". 1 retries left. Status code: 400, body:
{
  "message": "#/issued: [2019-09-12] is not a valid date-time. Expected [yyyy-MM-dd'T'HH:mm:ssZ, yyyy-MM-dd'T'HH:mm:ss.[0-9]{1,9}Z, yyyy-MM-dd'T'HH:mm:ss[+-]HH:mm, yyyy-MM-dd'T'HH:mm:ss.[0-9]{1,9}[+-]HH:mm]"
}

Make sure the validateJsonSchema option is set to true so that JSON validation is turned on.

To Reproduce

Deploy magda-csw-connector with config:

id: aims
name: Australian Institute of Marine Science
sourceUrl: https://geo.aims.gov.au/geonetwork/srv/eng/csw
pageSize: 100
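
One possible way to address the mismatch reported above is to normalise date-only values into a full date-time before writing the aspect. The sketch below is an illustration only: toDateTime is a hypothetical helper, and treating a date-only value as midnight UTC is an assumption that may not suit every source.

```typescript
// Hypothetical helper: converts a source date string into a date-time
// string that matches the formats listed in the registry error message.
// Returns undefined when the input cannot be parsed as a date.
function toDateTime(value: string): string | undefined {
  const trimmed = value.trim();
  const parsed = new Date(trimmed);
  if (isNaN(parsed.getTime())) {
    return undefined;
  }
  // Date-only input such as "2019-09-12": assume midnight UTC.
  if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) {
    return `${trimmed}T00:00:00Z`;
  }
  // Anything else that parses is re-serialised as UTC with milliseconds.
  return parsed.toISOString();
}
```

With this sketch, the value from the error message, "2019-09-12", becomes "2019-09-12T00:00:00Z".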

Basic Auth support & access getRecords via POST request

Why

Some CSW endpoints might require basic auth, and we might also want to (optionally) send the getRecords request via HTTP POST when the server doesn't support HTTP GET well.

Sample POST request body:

<?xml version="1.0"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:gmd="http://www.isotc211.org/2005/gmd"
                service="CSW" version="2.0.2"
                outputSchema="http://standards.iso.org/iso/19115/-3/mdb/2.0"
                resultType="results"
                startPosition="10"
                maxRecords="20">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:Constraint version="1.1.0">
      <Filter xmlns="http://www.opengis.net/ogc"/>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>

These parameters are currently sent via the query string, as defined here:

public GetRecordsParameters = {
    service: "CSW",
    version: "2.0.2",
    request: "GetRecords",
    constraintLanguage: "FILTER",
    constraint_language_version: "1.1.0",
    resultType: "results",
    elementsetname: "full",
    outputschema: "http://www.isotc211.org/2005/gmd",
    typeNames: "gmd:MD_Metadata"
};

They should be included in the POST body instead.
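
As a rough sketch of how the sample body above could be generated from the paging and schema settings, the following builds the same XML from a small options object. GetRecordsBodyOptions, escapeXmlAttr and buildGetRecordsBody are hypothetical names, and the output mirrors only the sample body shown in this issue, not necessarily what the connector ends up sending.

```typescript
// Hypothetical options object; field names are illustrative only.
interface GetRecordsBodyOptions {
  outputSchema: string;
  typeNames: string;
  startPosition: number;
  maxRecords: number;
}

// Escapes characters that are unsafe inside a double-quoted XML attribute.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds a GetRecords POST body shaped like the sample in this issue.
function buildGetRecordsBody(opts: GetRecordsBodyOptions): string {
  return [
    '<?xml version="1.0"?>',
    '<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"',
    '                xmlns:gmd="http://www.isotc211.org/2005/gmd"',
    '                service="CSW" version="2.0.2"',
    `                outputSchema="${escapeXmlAttr(opts.outputSchema)}"`,
    '                resultType="results"',
    `                startPosition="${opts.startPosition}"`,
    `                maxRecords="${opts.maxRecords}">`,
    `  <csw:Query typeNames="${escapeXmlAttr(opts.typeNames)}">`,
    '    <csw:Constraint version="1.1.0">',
    '      <Filter xmlns="http://www.opengis.net/ogc"/>',
    "    </csw:Constraint>",
    "  </csw:Query>",
    "</csw:GetRecords>"
  ].join("\n");
}
```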

Acceptance Criteria

  • Add a new helm chart config getRecordsWithPostRequest.
    • By default, the connector still sends requests via GET, but users should be able to set this option to true to send requests via POST.
  • All current GET parameters should be moved to the POST request body.
  • Add a new helm chart config basicAuthSecretName to allow users to specify the name of the basic auth secret.
    • The secret is expected to have two fields: "username" & "password".
    • When basicAuthSecretName is empty, no auth headers will be sent.
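
The criteria above could translate into request options roughly as follows. This is a sketch under assumptions: CswRequestSettings and buildRequestOptions are hypothetical names, the username and password are assumed to have already been read from the secret, and the "application/xml" content type is a guess rather than something this issue specifies. It relies on the global btoa, which is available in browsers and in recent Node.js versions and only handles Latin-1 characters.

```typescript
// Hypothetical settings; in a real deployment the credentials would come
// from the secret named by basicAuthSecretName.
interface CswRequestSettings {
  usePost: boolean;
  username?: string;
  password?: string;
}

interface CswRequestOptions {
  method: "GET" | "POST";
  headers: Record<string, string>;
}

function buildRequestOptions(settings: CswRequestSettings): CswRequestOptions {
  const headers: Record<string, string> = {};
  if (settings.usePost) {
    // Assumption: the XML body is sent as application/xml.
    headers["Content-Type"] = "application/xml";
  }
  // Only send an Authorization header when both credentials are present.
  if (settings.username && settings.password) {
    const token = btoa(`${settings.username}:${settings.password}`);
    headers["Authorization"] = `Basic ${token}`;
  }
  return { method: settings.usePost ? "POST" : "GET", headers };
}
```

Keeping GET as the default when usePost is false matches the first criterion, and omitting the header when credentials are missing matches the last one.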

Many connectors create invalid datasets

Describe the bug
At the moment I'm noticing that all the following connectors are going very slowly because they're creating datasets that the registry doesn't think are valid:

connector-actmapi-1584194400-9qzjk           1/1     Running     0          6h35m
connector-aims-1584194400-8wr9b              1/1     Running     0          6h35m
connector-aodn-1584194400-4gknb              1/1     Running     0          6h35m
connector-dap-1584799200-84cht               1/1     Running     0          6h35m
connector-ga-1584194400-nlpdq                1/1     Running     0          6h35m
connector-logan-1584194400-99z7b             1/1     Running     1          16h
connector-marlin-1584194400-57xz9            1/1     Running     0          6h35m
connector-sdinsw-1585404000-44hd6            1/1     Running     0          6h35m

Test support output schema http://standards.iso.org/iso/19115/-3/mdb/2.0 for some servers

Is your feature request related to a problem? Please describe.

Some GeoNetwork servers support a newer output schema: http://standards.iso.org/iso/19115/-3/mdb/2.0

e.g. aodn:

https://catalogue.aodn.org.au/geonetwork/srv/eng/csw?service=CSW&version=2.0.2&request=GetRecords&constraintLanguage=FILTER&constraint_language_version=1.1.0&resultType=results&elementsetname=full&outputschema=http%3A%2F%2Fstandards.iso.org%2Fiso%2F19115%2F-3%2Fmdb%2F2.0&typeNames=gmd%3AMD_Metadata&startPosition=1&maxRecords=3

The default output schema used in our config, http://www.isotc211.org/2005/gmd, works. However, some key fields (e.g. mdb:metadataLinkage) won't be available with it.

We need to:

  • write a few test cases for the new output schema: http://standards.iso.org/iso/19115/-3/mdb/2.0
  • fill the schema gaps, if any
  • test it in the dev cluster
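
For the test cases, it may help to generate request URLs for either output schema from one function. The sketch below rebuilds the AODN example URL from this issue using the standard URLSearchParams API. buildGetRecordsUrl is a hypothetical helper, not the connector's own code.

```typescript
// Hypothetical helper: builds a GetRecords GET URL with the same
// parameters, in the same order, as the AODN example in this issue.
function buildGetRecordsUrl(
  baseUrl: string,
  outputSchema: string,
  startPosition: number,
  maxRecords: number
): string {
  const params = new URLSearchParams({
    service: "CSW",
    version: "2.0.2",
    request: "GetRecords",
    constraintLanguage: "FILTER",
    constraint_language_version: "1.1.0",
    resultType: "results",
    elementsetname: "full",
    outputschema: outputSchema,
    typeNames: "gmd:MD_Metadata",
    startPosition: String(startPosition),
    maxRecords: String(maxRecords)
  });
  // URLSearchParams percent-encodes ":" and "/" in the values.
  return `${baseUrl}?${params.toString()}`;
}
```

Calling it with the AODN endpoint, the newer schema URI, a startPosition of 1 and a maxRecords of 3 should reproduce the query string of the example request.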

Test Data

Didn't capture license info correctly from AODN CSW registry

The CSW connector currently can't correctly capture license info from the AODN CSW registry.

Sample response: https://catalogue.aodn.org.au/geonetwork/srv/eng/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=bc6bb3ec-bfa0-4a2a-ab01-8c3e337a9013

Notice that the response doesn't tag the license info with a codeListValue attribute equal to "license", which is probably why our code missed it.

Having said that, it seems it's still possible to extract the license info with hardcoded logic.

Failed to capture license info for some datasets

Org ID generation issue

For some CSW servers, we are unable to generate an org ID from the source metadata ID.

Instead, we have to create IDs from the org name, like:

  • org-aodn-Australian Oceans Data Centre Joint Facility
  • org-aodn-Australian Ocean Data Centre Joint FacilityData Centre Joint Facility
  • org-aodn-Australian Ocean Data Centre Joint Facility
  • org-aodn-Australian Ocean Data Centre joint Facility

This might cause issues in the registry, as some of these IDs are identical when compared case-insensitively.

Some relevant errors:

Failed to PUT data registry record with ID "org-aodn-Australian Ocean Data Centre Joint Facility". 7 retries left. Status code: 500, body:
[ERROR] [06/02/2023 04:34:52.609] [registry-api-akka.actor.default-dispatcher-3] [RecordsService(akka://registry-api)] Encountered an exception when putting a record
scalikejdbc.TooManyRowsException
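
One direction to explore for the case-sensitivity clash described above is to normalise the org name before building the ID, so that names differing only by letter case or spacing map to the same record. The sketch below is illustrative: orgIdFromName is a hypothetical helper, and adopting it would change the IDs of existing records, which this issue does not address.

```typescript
// Hypothetical helper: derives an org ID from a name after trimming,
// collapsing runs of whitespace, and lowercasing.
function orgIdFromName(sourceId: string, orgName: string): string {
  const normalised = orgName.trim().replace(/\s+/g, " ").toLowerCase();
  return `org-${sourceId}-${normalised}`;
}
```

Under this sketch, "Australian Ocean Data Centre Joint Facility" and "Australian Ocean Data Centre joint Facility" both map to "org-aodn-australian ocean data centre joint facility". Names that differ in wording, such as "Oceans" versus "Ocean", would still produce separate IDs.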
