umwelt-info's Issues

Add an Elasticsearch harvester

We have some data sources, like geoportal.de, which expose Elasticsearch's /_search endpoint that we could harvest directly.
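
Since /_search is a plain HTTP/JSON API, a harvester for it could look roughly like the following sketch. reqwest, serde_json and anyhow are assumptions about the dependencies, not a description of the existing code, and from/size paging may have to be replaced by search_after for large sources.

```rust
use serde_json::{json, Value};

// Sketch only: fetch one page of documents from an Elasticsearch /_search endpoint.
async fn fetch_page(
    client: &reqwest::Client,
    base_url: &str,
    from: usize,
    size: usize,
) -> anyhow::Result<Value> {
    let body = json!({
        "query": { "match_all": {} },
        // Plain from/size paging; sources with many documents may require
        // search_after or the scroll API instead.
        "from": from,
        "size": size,
    });

    let response = client
        .post(format!("{base_url}/_search"))
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    // The individual documents are found under hits.hits[]._source.
    Ok(response.json().await?)
}
```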

Add a SmartFinder harvester

Some sources like geoportal.bafg.de or gis.uba.de expose the /select endpoint of the SmartFinder server, which we could use to harvest metadata.

Add basic scraper for Undine and WasserBLIcK

Undine provides a large collection of continuously updated measurements and WasserBLIcK provides a large collection of documents. Both are complex, hierarchically structured websites which will most likely need correspondingly complex scrapers to extract individual datasets.

Make tags into a facet

After merging #66, the tags field of the datasets should also be handled as a facet and exposed via a suitable extension of the UI.
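
Tantivy already supports hierarchical facets, so the change could amount to declaring the field as a facet field and writing one facet value per tag. A minimal sketch, assuming the schema is built roughly like this (the FacetOptions API differs between Tantivy versions):

```rust
use tantivy::schema::{Facet, FacetOptions, Field, Schema, STORED, TEXT};
use tantivy::Document;

// Sketch: declare tags as a Tantivy facet field.
fn schema_with_tags_facet() -> Schema {
    let mut builder = Schema::builder();
    builder.add_text_field("title", TEXT | STORED);
    builder.add_facet_field("tags", FacetOptions::default());
    builder.build()
}

// During indexing, each tag becomes one facet value below a common root,
// which later allows counting and filtering by tag.
fn add_tags(document: &mut Document, tags: Field, values: &[String]) {
    for value in values {
        document.add_facet(tags, Facet::from(format!("/tags/{value}").as_str()));
    }
}
```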

Send harvest summary via electronic mail

The harvester should send an electronic mail summarizing its results after each run. This should include at least the fields begin, end, duration, datasets, transmitted and failed for each source and summed over all sources, as well as the number of sources considered.
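
A minimal sketch of what the summary record and the mail sending could look like; the lettre crate and all addresses are placeholders, not a decision about the actual mail setup.

```rust
use lettre::{Message, SmtpTransport, Transport};

// Hypothetical per-source summary; mirrors the fields listed above.
struct SourceSummary {
    source: String,
    begin: String,
    end: String,
    duration: String,
    datasets: usize,
    transmitted: usize,
    failed: usize,
}

fn send_summary(report: String) -> anyhow::Result<()> {
    let email = Message::builder()
        .from("harvester@example.org".parse()?)
        .to("team@example.org".parse()?)
        .subject("Harvest summary")
        .body(report)?;

    // Placeholder relay; the real deployment would configure its own SMTP host.
    SmtpTransport::relay("mail.example.org")?.build().send(&email)?;

    Ok(())
}
```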

Extract snippets from the results

Extracting snippets to preview where the query terms were found can often help humans determine whether a search result is actually useful.
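
Tantivy ships a SnippetGenerator that produces highlighted fragments from stored fields, which could likely be reused here. A rough sketch, assuming the snippets are taken from a stored text field such as the description (the exact module path of SnippetGenerator differs between Tantivy versions):

```rust
use tantivy::query::Query;
use tantivy::schema::Field;
use tantivy::{DocAddress, Searcher, SnippetGenerator};

// Sketch: produce one HTML snippet per hit, with matched terms wrapped in <b>.
fn snippets(
    searcher: &Searcher,
    query: &dyn Query,
    field: Field,
    hits: &[DocAddress],
) -> anyhow::Result<Vec<String>> {
    let generator = SnippetGenerator::create(searcher, query, field)?;

    hits.iter()
        .map(|&address| {
            let document = searcher.doc(address)?;
            Ok(generator.snippet_from_doc(&document).to_html())
        })
        .collect()
}
```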

Use LZ4 compression for datasets

This is currently not worth it without an actual metadata schema for the datasets but might become worthwhile when the sizes of the datasets increase. It should be a simple change as Tantivy already depends on lz4_flex internally and the Dataset type already encapsulates serialization.
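
A sketch of how the (de)serialization helpers of Dataset could gain LZ4 framing via lz4_flex; serde_json stands in for whatever encoding the Dataset type actually uses.

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use serde::{de::DeserializeOwned, Serialize};

// Compress the serialized bytes, prepending the uncompressed size so that
// decompression can allocate the right buffer up front.
fn write_compressed<T: Serialize>(value: &T) -> anyhow::Result<Vec<u8>> {
    Ok(compress_prepend_size(&serde_json::to_vec(value)?))
}

fn read_compressed<T: DeserializeOwned>(bytes: &[u8]) -> anyhow::Result<T> {
    Ok(serde_json::from_slice(&decompress_size_prepended(bytes)?)?)
}
```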

Consider transparent fallback for response replay

While #57 provides a functional implementation of "reharvesting" via response replay, it is not clear whether the usability gain of transparently falling back to the network when a response is not stored on disk outweighs the added complexity of sourcing some responses from disk and some from the network, cf. #57 (comment)
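
For reference, the fallback being discussed would look roughly like the following: prefer the response stored on disk, otherwise fetch over the network and persist the body for later replays. The cache layout, key derivation and client are placeholders, not the actual implementation.

```rust
use std::path::Path;
use tokio::fs;

// Sketch of the transparent fallback under discussion; error handling is simplified.
async fn response_body(
    client: &reqwest::Client,
    cache_dir: &Path,
    key: &str,
    url: &str,
) -> anyhow::Result<Vec<u8>> {
    let path = cache_dir.join(key);

    // Replay from disk if the response was stored by a previous harvest.
    if let Ok(body) = fs::read(&path).await {
        return Ok(body);
    }

    // Otherwise fall back to the network and store the body for next time.
    let body = client.get(url).send().await?.error_for_status()?.bytes().await?;
    fs::write(&path, &body).await?;

    Ok(body.to_vec())
}
```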

Deduplicate contact information

Some datasets (at Wasser-DE) seem to have redundant contact information which we should either deduplicate or amend by role information.
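
A sketch of the deduplication variant, with a hypothetical contact shape; alternatively the duplicates could be kept and distinguished by their role.

```rust
use std::collections::HashSet;

// Hypothetical contact shape; the real type lives in the metadata schema.
struct Contact {
    name: String,
    email: String,
    role: Option<String>,
}

// Keep only the first contact for each (name, email) pair.
fn deduplicate(contacts: Vec<Contact>) -> Vec<Contact> {
    let mut seen = HashSet::new();

    contacts
        .into_iter()
        .filter(|contact| seen.insert((contact.name.clone(), contact.email.clone())))
        .collect()
}
```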

How to handle changes to index schema?

Currently, re-indexing fails when the schema used by the indexer does not correspond to that stored on disk. We should determine how to handle this, e.g. by deleting or renaming the index directory.
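
One possible policy, sketched below: open the existing index, compare its schema against the current one, and recreate the index directory when they differ. Comparing the serialized forms avoids relying on schema equality; the function is an illustration, not the existing indexer code.

```rust
use std::path::Path;
use tantivy::{schema::Schema, Index};

// Sketch: recreate the index directory whenever its stored schema does not
// match the schema the indexer was compiled with.
fn open_or_recreate(path: &Path, schema: Schema) -> anyhow::Result<Index> {
    if path.exists() {
        if let Ok(index) = Index::open_in_dir(path) {
            // Compare the serialized schemas to detect a mismatch.
            if serde_json::to_value(index.schema())? == serde_json::to_value(&schema)? {
                return Ok(index);
            }
        }

        std::fs::remove_dir_all(path)?;
    }

    std::fs::create_dir_all(path)?;
    Ok(Index::create_in_dir(path, schema)?)
}
```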

Add faceted search

To improve the search experience, we should add faceted search that allows simple filter-by-click refinement of the results.

As a starter, we can use the license and source fields, as these are already provided and have a limited number of options.
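
On the query side, Tantivy's FacetCollector can count the values of e.g. the license field for the current result set. A sketch only; depending on the Tantivy version, for_field takes the Field itself or its name as a string.

```rust
use tantivy::collector::FacetCollector;
use tantivy::query::Query;
use tantivy::schema::{Facet, Field};
use tantivy::Searcher;

// Sketch: count license facet values so the UI can offer filter-by-click.
fn license_counts(
    searcher: &Searcher,
    query: &dyn Query,
    license: Field,
) -> anyhow::Result<Vec<(Facet, u64)>> {
    let mut collector = FacetCollector::for_field(license);
    collector.add_facet("/license");

    let counts = searcher.search(query, &collector)?;

    Ok(counts
        .get("/license")
        .map(|(facet, count)| (facet.clone(), count))
        .collect())
}
```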

Retry requests during harvesting

Requests which fail not due to malformed responses but due to network or server errors should be retried a finite number of times with exponential back-off to improve consistency.
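
A sketch of the retry loop; the attempt count and base delay are placeholders, and reqwest/tokio are assumptions about the harvester's HTTP stack.

```rust
use std::time::Duration;

// Sketch: retry transport and server errors a bounded number of times with
// exponential back-off; client errors and successes are returned immediately.
async fn get_with_retry(client: &reqwest::Client, url: &str) -> anyhow::Result<reqwest::Response> {
    let mut delay = Duration::from_secs(1);

    for attempt in 0..5 {
        match client.get(url).send().await {
            // Success or a client error: no point in retrying.
            Ok(response) if !response.status().is_server_error() => return Ok(response),
            // Server or network error with attempts left: back off and retry.
            Ok(_) | Err(_) if attempt < 4 => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            // Last attempt: surface the error.
            Ok(response) => return Ok(response.error_for_status()?),
            Err(error) => return Err(error.into()),
        }
    }

    unreachable!("the loop either returns or retries")
}
```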

Extend CSW harvester to fetch record details

We currently use GetRecords requests to fetch summary records, but for an extended schema we should fetch the record details using a separate GetRecordById request for each record, or extend our GetRecords request to fetch the full metadata by specifying a different Query element.
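
For reference, the per-record request is just another KVP GET against the same endpoint. A sketch of how the URL could be assembled; the outputSchema value targets ISO 19139 metadata and is an assumption that may vary by source.

```rust
// Sketch: the KVP form of a GetRecordById request asking for the full element
// set; the record identifier should be percent-encoded before insertion.
fn get_record_by_id_url(endpoint: &str, id: &str) -> String {
    format!(
        "{endpoint}?service=CSW&version=2.0.2&request=GetRecordById\
         &id={id}&elementSetName=full\
         &outputSchema=http://www.isotc211.org/2005/gmd"
    )
}
```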

Fold indexer into harvester

For initial development, it is useful to have the indexer as a separate step that can be executed independently of the long-running harvester, but w.r.t. efficiency it would seem preferable to handle the indexing during harvesting itself, which is network-bound and hence leaves us with CPU cycles to spare.

Identify duplicate datasets

It is quite likely that we will harvest datasets from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from govdata.de and opendata.leipzig.de under different IDs.

The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field, which in this case forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)

Since this will only work for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself, e.g. https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv in this case which should identify the dataset independently of any intermediaries publishing and identifying it.
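
A sketch of the URL-based variant, with a hypothetical dataset shape; normalization is kept deliberately light since query parameters are often significant.

```rust
use std::collections::HashMap;

// Hypothetical shape: `access_url` points at the data itself, e.g. the
// statistik.leipzig.de CSV endpoint above.
struct Dataset {
    source: String,
    id: String,
    access_url: String,
}

// Group harvested datasets by their data URL so that the same dataset reached
// via govdata.de and opendata.leipzig.de collapses into one group.
fn group_by_data_url(datasets: Vec<Dataset>) -> HashMap<String, Vec<Dataset>> {
    let mut groups: HashMap<String, Vec<Dataset>> = HashMap::new();

    for dataset in datasets {
        let key = dataset.access_url.trim_end_matches('/').to_owned();
        groups.entry(key).or_default().push(dataset);
    }

    groups
}
```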

Extend persistent stats with term frequencies

Using the extracted terms introduced in #70, we should extend server::stats::Stats with a histogram of how often each term was included in queries passed to the /search route.

This should then be displayed on the metrics page in addition to the accesses.
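
A sketch of the extension; field and method names are illustrative, not the existing server::stats::Stats API.

```rust
use std::collections::HashMap;

// Illustrative shape only: `accesses` stands in for the existing counters,
// `term_frequencies` is the proposed addition.
#[derive(Default)]
struct Stats {
    accesses: HashMap<String, u64>,
    term_frequencies: HashMap<String, u64>,
}

impl Stats {
    // Called once per /search request with the terms extracted from the query.
    fn record_query_terms<'a>(&mut self, terms: impl Iterator<Item = &'a str>) {
        for term in terms {
            *self.term_frequencies.entry(term.to_owned()).or_insert(0) += 1;
        }
    }
}
```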

Consider reharvesting

It might be nice to store the raw data within each dataset so that we can repeat the harvesting process without repeating the network access.

It is not yet clear whether this is really worth it: harvesting a single dataset may involve more network requests than the one transmitting the metadata itself, and it could become complicated to store the "raw data" for a dataset which is e.g. actually a subtree of XML elements from a larger XML document.

A middle ground might be to store the response bodies, which can then be parsed as if they were transmitted over the network. This might be more complicated w.r.t. the implementation though, as ideally all HTTP requests would be replayed automatically while driving the same code.

Add group setting to sources

Add a group setting to sources on which the harvester will filter them, e.g. when given a group via a command line argument or environment variable. This would allow different periodically triggered harvester systemd services to be defined so that e.g. some sources are harvested daily while others are harvested only weekly, without having to implement a general-purpose scheduler (and the accompanying need for persistent state) in the harvester.
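
A sketch of the filtering step; the group field, the environment variable name and the source shape are placeholders.

```rust
use serde::Deserialize;

// Hypothetical source shape with the proposed optional group setting.
#[derive(Deserialize)]
struct Source {
    name: String,
    url: String,
    #[serde(default)]
    group: Option<String>,
}

// Keep only the sources matching the group passed via the environment;
// if no group is given, all sources are harvested.
fn filter_by_group(sources: Vec<Source>) -> Vec<Source> {
    match std::env::var("HARVESTER_GROUP") {
        Ok(group) => sources
            .into_iter()
            .filter(|source| source.group.as_deref() == Some(group.as_str()))
            .collect(),
        Err(_) => sources,
    }
}
```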
