umwelt-info's Issues

Add an Elasticsearch harvester

We have some data sources, like geoportal.de, which expose Elasticsearch's /_search endpoint that we could harvest directly.
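
Since /_search is a plain HTTP/JSON API, a harvester for it could look roughly like the following sketch. reqwest, serde_json and anyhow are assumptions about the dependencies, not a description of the existing code, and from/size paging may have to be replaced by search_after for large sources.

```rust
use serde_json::{json, Value};

// Sketch only: fetch one page of documents from an Elasticsearch /_search endpoint.
async fn fetch_page(
    client: &reqwest::Client,
    base_url: &str,
    from: usize,
    size: usize,
) -> anyhow::Result<Value> {
    let body = json!({
        "query": { "match_all": {} },
        // Plain from/size paging; sources with many documents may require
        // search_after or the scroll API instead.
        "from": from,
        "size": size,
    });

    let response = client
        .post(format!("{base_url}/_search"))
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    // The individual documents are found under hits.hits[]._source.
    Ok(response.json().await?)
}
```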

Add a SmartFinder harvester

Some sources like geoportal.bafg.de or gis.uba.de expose the /select endpoint of the SmartFinder server, which we could use to harvest metadata.

Add basic scraper for Undine and WasserBLIcK

Undine provides a large collection of continuously updated measurements and WasserBLIcK provides a large collection of documents. Both are complex, hierarchically structured websites which will most likely need correspondingly complex scrapers to extract individual datasets.

Make tags into a facet

After merging #66, the tags field of the datasets should also be handled as a facet and exposed via a suitable extension of the UI.
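
Tantivy already supports hierarchical facets, so the change could amount to declaring the field as a facet field and writing one facet value per tag. A minimal sketch, assuming the schema is built roughly like this (the FacetOptions API differs between Tantivy versions):

```rust
use tantivy::schema::{Facet, FacetOptions, Field, Schema, STORED, TEXT};
use tantivy::Document;

// Sketch: declare tags as a Tantivy facet field.
fn schema_with_tags_facet() -> Schema {
    let mut builder = Schema::builder();
    builder.add_text_field("title", TEXT | STORED);
    builder.add_facet_field("tags", FacetOptions::default());
    builder.build()
}

// During indexing, each tag becomes one facet value below a common root,
// which later allows counting and filtering by tag.
fn add_tags(document: &mut Document, tags: Field, values: &[String]) {
    for value in values {
        document.add_facet(tags, Facet::from(format!("/tags/{value}").as_str()));
    }
}
```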

Send harvest summary via electronic mail

The harvester should send an electronic mail summarizing its results after each run. This should include at least the fields begin, end, duration, datasets, transmitted and failed for each source and summed over all sources, as well as the number of sources considered.
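
A minimal sketch of what the summary record and the mail sending could look like; the lettre crate and all addresses are placeholders, not a decision about the actual mail setup.

```rust
use lettre::{Message, SmtpTransport, Transport};

// Hypothetical per-source summary; mirrors the fields listed above.
struct SourceSummary {
    source: String,
    begin: String,
    end: String,
    duration: String,
    datasets: usize,
    transmitted: usize,
    failed: usize,
}

fn send_summary(report: String) -> anyhow::Result<()> {
    let email = Message::builder()
        .from("harvester@example.org".parse()?)
        .to("team@example.org".parse()?)
        .subject("Harvest summary")
        .body(report)?;

    // Placeholder relay; the real deployment would configure its own SMTP host.
    SmtpTransport::relay("mail.example.org")?.build().send(&email)?;

    Ok(())
}
```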

Extract snippets from the results

Extracting snippets to preview where the query terms were found can often help humans determine whether a search result is actually useful.
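
Tantivy ships a SnippetGenerator that produces highlighted fragments from stored fields, which could likely be reused here. A rough sketch, assuming the snippets are taken from a stored text field such as the description (the exact module path of SnippetGenerator differs between Tantivy versions):

```rust
use tantivy::query::Query;
use tantivy::schema::Field;
use tantivy::{DocAddress, Searcher, SnippetGenerator};

// Sketch: produce one HTML snippet per hit, with matched terms wrapped in <b>.
fn snippets(
    searcher: &Searcher,
    query: &dyn Query,
    field: Field,
    hits: &[DocAddress],
) -> anyhow::Result<Vec<String>> {
    let generator = SnippetGenerator::create(searcher, query, field)?;

    hits.iter()
        .map(|&address| {
            let document = searcher.doc(address)?;
            Ok(generator.snippet_from_doc(&document).to_html())
        })
        .collect()
}
```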

Use LZ4 compression for datasets

This is currently not worth it without an actual metadata schema for the datasets but might become worthwhile when the sizes of the datasets increase. It should be a simple change as Tantivy already depends on lz4_flex internally and the Dataset type already encapsulates serialization.
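
A sketch of how the (de)serialization helpers of Dataset could gain LZ4 framing via lz4_flex; serde_json stands in for whatever encoding the Dataset type actually uses.

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use serde::{de::DeserializeOwned, Serialize};

// Compress the serialized bytes, prepending the uncompressed size so that
// decompression can allocate the right buffer up front.
fn write_compressed<T: Serialize>(value: &T) -> anyhow::Result<Vec<u8>> {
    Ok(compress_prepend_size(&serde_json::to_vec(value)?))
}

fn read_compressed<T: DeserializeOwned>(bytes: &[u8]) -> anyhow::Result<T> {
    Ok(serde_json::from_slice(&decompress_size_prepended(bytes)?)?)
}
```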

Consider transparent fallback for response replay

While #57 provides a functional implementation of "reharvesting" via response replay, it is not clear whether the usability gain of transparently falling back to the network when a response is not stored on disk outweighs the added complexity of sourcing some responses from disk and some from the network, cf. #57 (comment)
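
For reference, the fallback being discussed would look roughly like the following: prefer the response stored on disk, otherwise fetch over the network and persist the body for later replays. The cache layout, key derivation and client are placeholders, not the actual implementation.

```rust
use std::path::Path;
use tokio::fs;

// Sketch of the transparent fallback under discussion; error handling is simplified.
async fn response_body(
    client: &reqwest::Client,
    cache_dir: &Path,
    key: &str,
    url: &str,
) -> anyhow::Result<Vec<u8>> {
    let path = cache_dir.join(key);

    // Replay from disk if the response was stored by a previous harvest.
    if let Ok(body) = fs::read(&path).await {
        return Ok(body);
    }

    // Otherwise fall back to the network and store the body for next time.
    let body = client.get(url).send().await?.error_for_status()?.bytes().await?;
    fs::write(&path, &body).await?;

    Ok(body.to_vec())
}
```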

Deduplicate contact information

Some datasets (at Wasser-DE) seem to have redundant contact information which we should either deduplicate or amend by role information.
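
A sketch of the deduplication variant, with a hypothetical contact shape; alternatively the duplicates could be kept and distinguished by their role.

```rust
use std::collections::HashSet;

// Hypothetical contact shape; the real type lives in the metadata schema.
struct Contact {
    name: String,
    email: String,
    role: Option<String>,
}

// Keep only the first contact for each (name, email) pair.
fn deduplicate(contacts: Vec<Contact>) -> Vec<Contact> {
    let mut seen = HashSet::new();

    contacts
        .into_iter()
        .filter(|contact| seen.insert((contact.name.clone(), contact.email.clone())))
        .collect()
}
```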

How to handle changes to index schema?

Currently, re-indexing fails when the schema used by the indexer does not correspond to that stored on disk. We should determine how to handle this, e.g. by deleting or renaming the index directory.
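
One possible policy, sketched below: open the existing index, compare its schema against the current one, and recreate the index directory when they differ. Comparing the serialized forms avoids relying on schema equality; the function is an illustration, not the existing indexer code.

```rust
use std::path::Path;
use tantivy::{schema::Schema, Index};

// Sketch: recreate the index directory whenever its stored schema does not
// match the schema the indexer was compiled with.
fn open_or_recreate(path: &Path, schema: Schema) -> anyhow::Result<Index> {
    if path.exists() {
        if let Ok(index) = Index::open_in_dir(path) {
            // Compare the serialized schemas to detect a mismatch.
            if serde_json::to_value(index.schema())? == serde_json::to_value(&schema)? {
                return Ok(index);
            }
        }

        std::fs::remove_dir_all(path)?;
    }

    std::fs::create_dir_all(path)?;
    Ok(Index::create_in_dir(path, schema)?)
}
```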

Add faceted search

To improve the search experience, we should add faceted search that allows simple filter-by-click refinement of the results.

As a starter, we can use the license and source fields, as these are already provided and have a limited number of options.
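
On the query side, Tantivy's FacetCollector can count the values of e.g. the license field for the current result set. A sketch only; depending on the Tantivy version, for_field takes the Field itself or its name as a string.

```rust
use tantivy::collector::FacetCollector;
use tantivy::query::Query;
use tantivy::schema::{Facet, Field};
use tantivy::Searcher;

// Sketch: count license facet values so the UI can offer filter-by-click.
fn license_counts(
    searcher: &Searcher,
    query: &dyn Query,
    license: Field,
) -> anyhow::Result<Vec<(Facet, u64)>> {
    let mut collector = FacetCollector::for_field(license);
    collector.add_facet("/license");

    let counts = searcher.search(query, &collector)?;

    Ok(counts
        .get("/license")
        .map(|(facet, count)| (facet.clone(), count))
        .collect())
}
```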

Retry requests during harvesting

Requests which fail not due to malformed responses but due to network or server errors should be retried a finite number of times with exponential back-off to improve consistency.
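
A sketch of the retry loop; the attempt count and base delay are placeholders, and reqwest/tokio are assumptions about the harvester's HTTP stack.

```rust
use std::time::Duration;

// Sketch: retry transport and server errors a bounded number of times with
// exponential back-off; client errors and successes are returned immediately.
async fn get_with_retry(client: &reqwest::Client, url: &str) -> anyhow::Result<reqwest::Response> {
    let mut delay = Duration::from_secs(1);

    for attempt in 0..5 {
        match client.get(url).send().await {
            // Success or a client error: no point in retrying.
            Ok(response) if !response.status().is_server_error() => return Ok(response),
            // Server or network error with attempts left: back off and retry.
            Ok(_) | Err(_) if attempt < 4 => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            // Last attempt: surface the error.
            Ok(response) => return Ok(response.error_for_status()?),
            Err(error) => return Err(error.into()),
        }
    }

    unreachable!("the loop either returns or retries")
}
```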

Extend CSW harvester to fetch record details

We currently use GetRecords requests to fetch summary records, but for an extended schema we should fetch the record details using a separate GetRecordById request for each record, or extend our GetRecords request to fetch the full metadata by specifying a different Query element.
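
For reference, the per-record request is just another KVP GET against the same endpoint. A sketch of how the URL could be assembled; the outputSchema value targets ISO 19139 metadata and is an assumption that may vary by source.

```rust
// Sketch: the KVP form of a GetRecordById request asking for the full element
// set; the record identifier should be percent-encoded before insertion.
fn get_record_by_id_url(endpoint: &str, id: &str) -> String {
    format!(
        "{endpoint}?service=CSW&version=2.0.2&request=GetRecordById\
         &id={id}&elementSetName=full\
         &outputSchema=http://www.isotc211.org/2005/gmd"
    )
}
```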

Fold indexer into harvester

For initial development, it is useful to have the indexer as a separate step that can be executed independently of the long-running harvester, but w.r.t. efficiency it would seem preferable to handle the indexing during harvesting itself, which is network-bound and hence leaves us with CPU cycles to spare.

Identify duplicate datasets

It is quite likely that we will harvest datasets from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from govdata.de and opendata.leipzig.de under different IDs.

The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field, which in this case forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)

Since this will only work for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself, e.g. https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv in this case which should identify the dataset independently of any intermediaries publishing and identifying it.
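
A sketch of the URL-based variant, with a hypothetical dataset shape; normalization is kept deliberately light since query parameters are often significant.

```rust
use std::collections::HashMap;

// Hypothetical shape: `access_url` points at the data itself, e.g. the
// statistik.leipzig.de CSV endpoint above.
struct Dataset {
    source: String,
    id: String,
    access_url: String,
}

// Group harvested datasets by their data URL so that the same dataset reached
// via govdata.de and opendata.leipzig.de collapses into one group.
fn group_by_data_url(datasets: Vec<Dataset>) -> HashMap<String, Vec<Dataset>> {
    let mut groups: HashMap<String, Vec<Dataset>> = HashMap::new();

    for dataset in datasets {
        let key = dataset.access_url.trim_end_matches('/').to_owned();
        groups.entry(key).or_default().push(dataset);
    }

    groups
}
```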

Extend persistent stats with term frequencies

Using the extracted terms introduced in #70, we should extend server::stats::Stats with a histogram of how often each term was included in queries passed to the /search route.

This should then be displayed on the metrics page in addition to the accesses.
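
A sketch of the extension; field and method names are illustrative, not the existing server::stats::Stats API.

```rust
use std::collections::HashMap;

// Illustrative shape only: `accesses` stands in for the existing counters,
// `term_frequencies` is the proposed addition.
#[derive(Default)]
struct Stats {
    accesses: HashMap<String, u64>,
    term_frequencies: HashMap<String, u64>,
}

impl Stats {
    // Called once per /search request with the terms extracted from the query.
    fn record_query_terms<'a>(&mut self, terms: impl Iterator<Item = &'a str>) {
        for term in terms {
            *self.term_frequencies.entry(term.to_owned()).or_insert(0) += 1;
        }
    }
}
```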

Consider reharvesting

It might be nice to store the raw data within each dataset so that we can repeat the harvesting process without repeating the network access.

It is not yet clear whether this is really worth it: harvesting a single dataset may involve more network requests than the one transmitting the metadata itself, and it could become complicated to store the "raw data" for a dataset which is e.g. actually a subtree of XML elements from a larger XML document.

A middle ground might be to store the response bodies, which can then be parsed as if they were transmitted over the network. This might be more complicated w.r.t. the implementation though, as ideally all HTTP requests would be replayed automatically while driving the same code.

Add group setting to sources

Add a group setting to sources on which the harvester will filter them, e.g. when given a group via a command line argument or environment variable. This would allow different periodically triggered harvester systemd services to be defined so that e.g. some sources are harvested daily while others are harvested only weekly, without having to implement a general-purpose scheduler (and the accompanying need for persistent state) in the harvester.
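
A sketch of the filtering step; the group field, the environment variable name and the source shape are placeholders.

```rust
use serde::Deserialize;

// Hypothetical source shape with the proposed optional group setting.
#[derive(Deserialize)]
struct Source {
    name: String,
    url: String,
    #[serde(default)]
    group: Option<String>,
}

// Keep only the sources matching the group passed via the environment;
// if no group is given, all sources are harvested.
fn filter_by_group(sources: Vec<Source>) -> Vec<Source> {
    match std::env::var("HARVESTER_GROUP") {
        Ok(group) => sources
            .into_iter()
            .filter(|source| source.group.as_deref() == Some(group.as_str()))
            .collect(),
        Err(_) => sources,
    }
}
```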
