CKAN extension for Öppnadata.se, the Swedish data management platform

License: GNU Affero General Public License v3.0

Shell 1.27% Python 68.87% JavaScript 2.47% HTML 20.57% CSS 4.25% Makefile 2.58%

ckanext-sweden

CKAN extension for Öppnadata.se, the Swedish data management platform.

Blog Plugin

To enable, activate your CKAN virtual environment and then:

  1. Add sweden_blog to ckan.plugins.

  2. Install the blog plugin's requirements:

     pip install -r ckanext/sweden/blog/requirements.txt
    
  3. Run the paster command to initialize the blog's database tables:

     paster --plugin=ckan sweden_blog_init -c /etc/ckan/default/development.ini
    
  4. Restart CKAN.

DCAT Harvesting

To enable, activate your CKAN virtual environment and then:

  1. Install Redis, gcc and libffi-dev:

     sudo apt-get install redis-server build-essential libffi-dev
    
  2. Install the sweden_dcat_rdf_harvester requirements:

     pip install -r ckanext/sweden/dcat/requirements.txt
    
  3. Install ckanext-harvest:

     git clone https://github.com/ckan/ckanext-harvest
     cd ckanext-harvest
     git checkout stable
     pip install -r pip-requirements.txt
     python setup.py develop
    
  4. Install ckanext-dcat:

     git clone https://github.com/ckan/ckanext-dcat
     cd ckanext-dcat
     pip install -r requirements.txt
     # tmp
     pip install lxml
     python setup.py develop
    
  5. Add dcat_rdf_harvester, sweden_dcat_rdf_harvester and harvest to ckan.plugins, ensuring that harvest is listed after sweden_dcat_rdf_harvester.

  6. Restart CKAN.

You should see the harvest pages at /harvest and Generic DCAT RDF Harvester listed as a type on /harvest/new.
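
The ordering requirement in step 5 is easy to get wrong, so a small check can help. The following is a minimal sketch, not part of the extension: the function name harvest_order_ok is hypothetical, and it assumes ckan.plugins is a whitespace-separated list of plugin names.

```python
def harvest_order_ok(plugins_line):
    """Return True if the three harvesting plugins are present and
    'harvest' is listed after 'sweden_dcat_rdf_harvester'.

    plugins_line is the value of ckan.plugins (assumed to be a
    whitespace-separated string of plugin names).
    """
    plugins = plugins_line.split()
    required = ["dcat_rdf_harvester", "sweden_dcat_rdf_harvester", "harvest"]
    if any(name not in plugins for name in required):
        return False
    return plugins.index("harvest") > plugins.index("sweden_dcat_rdf_harvester")


# Example value for ckan.plugins (other plugins may be listed as well).
print(harvest_order_ok("stats dcat_rdf_harvester sweden_dcat_rdf_harvester harvest"))
```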

Configuration options

The following configuration options can be used with regards to the validation of remote DCAT documents:

  • ckanext.sweden.harvest.use_validation (default: True): Whether to use validation at all
  • ckanext.sweden.harvest.validation_service (default: http://validator.dcat-editor.com/service): The URL of the validation service to use. The harvester will POST the contents of the remote DCAT file to this endpoint.
  • ckanext.sweden.harvest.stop_on_validation_errors (default: False): Whether to stop the dataset import if validation errors are found.
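
As an illustration of how these three options and their defaults fit together, here is a minimal sketch that reads them from a plain dictionary. It is not the extension's own code: validation_settings and as_bool are hypothetical helpers, and the dictionary stands in for CKAN's config object.

```python
# Option names and defaults as documented above.
DEFAULTS = {
    "ckanext.sweden.harvest.use_validation": True,
    "ckanext.sweden.harvest.validation_service":
        "http://validator.dcat-editor.com/service",
    "ckanext.sweden.harvest.stop_on_validation_errors": False,
}


def as_bool(value):
    """Interpret ini-style strings such as 'true' or 'False' as booleans."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "yes", "on", "1")


def validation_settings(config):
    """Return the three options, falling back to the documented defaults."""
    settings = {}
    for key, default in DEFAULTS.items():
        value = config.get(key, default)
        # Boolean options arrive as strings when read from an ini file.
        settings[key] = as_bool(value) if isinstance(default, bool) else value
    return settings


settings = validation_settings(
    {"ckanext.sweden.harvest.stop_on_validation_errors": "true"}
)
```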

Theme

To enable the theme:

  1. Add sweden_theme to ckan.plugins

To modify the ckanext-sweden theme you'll need to:

  1. Install Node (apt-get install node) and Bower (npm install -g bower)

  2. Install the front-end dependencies: cd ./ckanext/sweden/theme/ && npm i && bower update

  3. Re-compile assets: gulp (gulp watch will regenerate them whenever a change happens).

  4. Once you've made your changes, make sure you commit the changes in ./ckanext/theme/resources.

Sweden Plugin and Eurovoc categories

To enable Eurovoc categories:

  1. Install ckanext-eurovoc:

    pip install ckanext-eurovoc

  2. Enable the Eurovoc and Sweden plugins by adding eurovoc and sweden to ckan.plugins.

Custom API endpoints

DCAT related API Endpoints

The sweden plugin adds the following API endpoints:

  • dcat_organization_list: returns a list of all organizations that have DCAT harvesting set up. Returns a list of objects, one per organization, each of them with the following keys:

    - `id`: CKAN organization id
    - `url`: Organization website (unique across organizations)
    - `dcat_metadata_url`: DCAT output for the organization datasets (generated by CKAN)
    - `original_dcat_metadata_url`: The remote DCAT datasets that were harvested into Öppnadata.se
    - `dcat_validation`: Boolean showing whether the DCAT validation passed or not
    - `dcat_validation_date`: Date and time at which the DCAT validation last took place
    - `dcat_validation_url`: URL to the DCAT validation results
    
  • dcat_validation: returns the validation output for the last harvest job of the organization harvest source. Requires an id parameter with the organization name or id.
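
A client could use dcat_organization_list to find organizations whose last validation failed. The sketch below assumes the standard CKAN action API envelope (a JSON object with success and result keys) and uses only the keys listed above. The function name and the sample values are made up for illustration.

```python
import json


def organizations_failing_validation(response_text):
    """Return the ids of organizations whose dcat_validation is False.

    response_text is the JSON body returned by the dcat_organization_list
    action. Organizations without a dcat_validation value are skipped.
    """
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN action call did not succeed")
    return [
        org["id"]
        for org in payload["result"]
        if org.get("dcat_validation") is False
    ]


# Made-up sample response; real ids and URLs will differ.
SAMPLE = json.dumps({
    "success": True,
    "result": [
        {"id": "org-a", "url": "http://a.example", "dcat_validation": True},
        {"id": "org-b", "url": "http://b.example", "dcat_validation": False},
    ],
})

failing = organizations_failing_validation(SAMPLE)
```

For each id returned, the dcat_validation action (called with that id) gives the detailed validation output.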

Dataset Stats API Endpoints

The sweden_theme extension adds a number of additional API endpoints to retrieve data about datasets in the site.

  • total_datasets_by_week: the cumulative total number of datasets by week.
  • weekly_dataset_activity: the number of updates to datasets per week.
  • weekly_dataset_activity_new: the number of new datasets per week.

e.g.:

curl http://127.0.0.1:5000/api/3/action/weekly_dataset_activity -H "Authorization:<your-api-key>"
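
The same request can be built from Python with the standard library. This is a minimal sketch equivalent to the curl call above. build_action_request is a hypothetical helper, and the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_action_request(site_url, action, api_key):
    """Build a GET request for a CKAN action, authenticated by API key."""
    url = "{}/api/3/action/{}".format(site_url.rstrip("/"), action)
    return Request(url, headers={"Authorization": api_key})


request = build_action_request(
    "http://127.0.0.1:5000", "weekly_dataset_activity", "<your-api-key>"
)
# To send it, pass the request to urllib.request.urlopen() and decode the
# JSON body of the response.
```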

Hide 'Groups'

Groups aren't used and can be hidden with the ckanext-hidegroups extension:

pip install -e 'git+git://github.com/okfn/ckanext-hidegroups.git#egg=ckanext-hidegroups'

Then add hidegroups to ckan.plugins.

Script for automated organizations creation

The extension includes a standalone script to automate the creation of organizations on the portal. For details, check the scripts folder.

Tests

To run the tests, first install the dev requirements (and Redis, see above):

pip install -r dev-requirements.txt

Then do:

nosetests --nologcapture --ckan --with-pylons=test.ini

To run the tests with coverage, first install coverage (pip install coverage) then do:

nosetests --nologcapture --ckan --with-pylons=test.ini --with-coverage --cover-package=ckanext.sweden --cover-inclusive --cover-erase --cover-tests

ckanext-sweden's People

Contributors

amercader, brew, joetsoi, seanh, johnmartin, nigelbabu

ckanext-sweden's Issues

Theme issues after 2.3 upgrade

  • "close" link at the bottom of the facets on the search page
  • "Filter results" button on dataset listings (search page, organization page, ...)

Translations

  • Update with latest on Transifex
  • Combine extensions translations

DCAT Harvesting

Harvest datasets from DCAT sources into CKAN. We'll be following http://spec.datacatalogs.org/ and using https://github.com/ckan/ckanext-dcat

We can assume a /datasets/dcat endpoint to harvest from if we want.

Checklist

  • DCAT fields (not just standard CKAN dataset fields) should be harvested. We may also need to harvest DCAT-AP fields not just DCAT ones. Exactly what dataset fields do we need to harvest?
  • Harvesting should happen automatically once per week
  • How frequently automatic harvesting happens should be configurable. On a per-harvest source basis or just site-wide?
  • Datasets that have been deleted from the source site should be deleted from CKAN the next time that site is harvested
  • Harvested datasets should be immutable - can only be changed by re-harvesting (if the source dataset has changed)
  • Users need to be able to add new URLs to harvest from (the CKAN harvester already supports this)
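
The deletion requirement in the checklist comes down to comparing the identifiers found at the source with those already in CKAN. A minimal sketch of that comparison follows. plan_harvest is a hypothetical name, and how identifiers are obtained from either side is left out.

```python
def plan_harvest(remote_ids, local_ids):
    """Split dataset identifiers into create, update and delete sets.

    remote_ids: identifiers present in the source's DCAT document.
    local_ids: identifiers previously harvested from that source.
    """
    remote, local = set(remote_ids), set(local_ids)
    return {
        "create": sorted(remote - local),   # new at the source
        "update": sorted(remote & local),   # candidates for re-harvesting
        "delete": sorted(local - remote),   # gone from the source
    }


plan = plan_harvest(["a", "b", "c"], ["b", "c", "d"])
```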

Show DCAT validation output

Context

On the Swedish open data portal datasets are harvested from DCAT metadata dumps like this one. This is parsed by the ckanext-dcat harvester and CKAN datasets are created.

There is a CKAN organization and a CKAN harvest source for each remote organization that has its datasets imported into CKAN.

The DCAT files are validated using an external validation service:

https://validator.dcat-editor.com/

This service only supports POST requests. For example, when called with the DCAT file linked above, it returns this output.

curl -X POST [email protected] https://validator.dcat-editor.com/service

We are hooking up with the validation service at this point:

https://github.com/okfn/ckanext-sweden/blob/master/ckanext/sweden/dcat/plugin.py#L21

This is called after the remote file is downloaded and before the contents are parsed and datasets created. Note that we are returning an array with validation errors. These are stored as harvest errors, more specifically GatherErrors, linked to a Harvest Job (which is linked to a Harvest Source, linked to an Organization). For instance, these errors are displayed on the harvest report page.
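
To make the "array with validation errors" concrete, here is a minimal sketch of turning a validation service response into a list of error strings. It is not the code in plugin.py. The function name is hypothetical, and the only response shape assumed is the {"errors": true, "rdfError": ...} body that appears in the harvester log reported in the issues.

```python
import json


def validation_errors(status_code, body):
    """Return a list of error messages for one validation response.

    status_code: HTTP status returned by the validation service.
    body: response body as text (expected, but not required, to be JSON).
    """
    errors = []
    if status_code != 200:
        errors.append(
            "The validation service returned an error: {}".format(status_code)
        )
    try:
        report = json.loads(body)
    except ValueError:
        # Not JSON, so there is nothing more to extract.
        return errors
    if isinstance(report, dict) and report.get("errors") and report.get("rdfError"):
        errors.append(report["rdfError"])
    return errors


errors = validation_errors(
    400, '{"errors": true, "rdfError": "No RDF detected."}'
)
```

An empty list means nothing was reported. Each string in a non-empty list could be stored as one gather error.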

What's needed

On the custom dcat_organization_list action we need a dcat_validation key with the value http://{host}/organization/{id}/dcat_validation

This endpoint should point to a custom action that returns the validation errors for the last harvest done for this organization (more precisely, the errors that occurred during the last harvest job of the organization's harvest source).

The actual output can be:

  1. Something generated from the harvest errors we are already storing. Cons: we need some cumbersome queries to get the relevant harvest gather errors with just the org id (we need to link org id > harvest source > last job > gather errors). Pros: @joetsoi already did some work to store the validation errors as JSON, so that might make things easier to build the whole output.
  2. Store the whole output coming from the validation service in the database at this point and just dump whatever was returned. Pros: we don't need to worry about parsing it. Cons: we need to create a custom db table, linked to the owner org, and make sure it doesn't grow too much (e.g. by keeping only the most recent report per organization).
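
Option 2 can be sketched with one row per organization, so the table never holds more than the latest report. The example below uses an in-memory SQLite database purely as a stand-in for CKAN's database. The table name, column names and function names are all hypothetical.

```python
import sqlite3


def create_table(conn):
    # owner_org is the primary key, so there is at most one row per org.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dcat_validation_report ("
        " owner_org TEXT PRIMARY KEY,"
        " created TEXT NOT NULL,"
        " report TEXT NOT NULL)"
    )


def save_report(conn, owner_org, created, report_json):
    """Store the raw validation output, replacing any earlier report."""
    conn.execute(
        "INSERT OR REPLACE INTO dcat_validation_report"
        " (owner_org, created, report) VALUES (?, ?, ?)",
        (owner_org, created, report_json),
    )
    conn.commit()


def latest_report(conn, owner_org):
    """Return the stored report text for an organization, or None."""
    row = conn.execute(
        "SELECT report FROM dcat_validation_report WHERE owner_org = ?",
        (owner_org,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_table(conn)
save_report(conn, "org-1", "2015-07-20T19:18:41", '{"errors": true}')
save_report(conn, "org-1", "2015-07-21T19:18:41", '{"errors": false}')
```

Keying the table on the owner organization avoids the multi-step lookup described in option 1, at the cost of discarding older reports.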

harvester error

2015-07-20 19:18:41,307 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2015-07-20 19:18:41,310 DEBUG [ckanext.harvest.model] Harvest tables already exist
2015-07-20 19:18:41,403 DEBUG [ckanext.harvest.queue] Gather queue consumer registered
2015-07-20 19:18:41,404 DEBUG [ckanext.harvest.queue] Received harvest job id: 64887a99-18c8-436f-bd4b-e580d3ed632d
2015-07-20 19:18:41,410 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:41,410 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arjang.se/datasets/dcat
2015-07-20 19:18:41,620 ERROR [ckanext.harvest.harvesters.base] Could not get content. Server responded with 404
2015-07-20 19:18:41,621 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:41,922 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:41,932 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:41,947 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:41,960 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:41,961 DEBUG [ckanext.harvest.queue] Received harvest job id: eed84020-529c-4bf5-a7f3-e2b1923ad122
2015-07-20 19:18:41,967 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:41,967 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arjeplog.se/datasets/dcat
2015-07-20 19:18:45,088 ERROR [ckanext.harvest.harvesters.base] Could not get content. Server responded with 404
2015-07-20 19:18:45,089 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:45,255 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:45,258 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:45,273 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:45,278 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:45,279 DEBUG [ckanext.harvest.queue] Received harvest job id: 6ae38d4d-5f60-462e-abd6-e882cb5e7319
2015-07-20 19:18:45,286 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:45,287 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arkitekturmuseet.se/datasets/dcat
2015-07-20 19:18:45,396 ERROR [ckanext.harvest.harvesters.base] Could not get content because a
                                connection error occurred. HTTPConnectionPool(host='www.arkitekturmuseet.se', port=80): Max retries exceeded with url: /datasets/dcat (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)
2015-07-20 19:18:45,397 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:45,570 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:45,572 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:45,580 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:45,588 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:45,590 DEBUG [ckanext.harvest.queue] Received harvest job id: 0b674a60-fccd-43a5-8c73-20b4cf687389
2015-07-20 19:18:45,599 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:45,600 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arn.se/datasets/dcat
2015-07-20 19:18:45,811 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:46,056 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:46,058 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:46,063 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 135, in command
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 230, in gather_callback
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters/rdf.py", line 167, in gather_stage
    parser.parse(content)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/processors.py", line 129, in parse
    self.g.parse(data=data, format=_format)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/graph.py", line 1035, in parse
    parser.parse(source, self, **args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 577, in parse
    self._parser.parse(source)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 349, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
    self.current.end(name, qname)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 331, in node_element_end
    self.error("Repeat node-elements inside property elements: %s"%"".join(name))
TypeError: sequence item 0: expected string, NoneType found
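
The last frame of the traceback fails while building an error message: "".join(name) raises because the first item of name is None. One plausible reading, stated here as an assumption, is that name is a SAX (namespace URI, local name) pair for an element without a namespace. The sketch below reproduces that failure and shows a tolerant way to join such a pair. format_qname is a hypothetical helper, not an rdflib function.

```python
def format_qname(name):
    """Join a (namespace_uri, local_name) pair, skipping a missing namespace."""
    return "".join(part for part in name if part is not None)


# Reproduce the failure from the traceback with a namespace-less element.
try:
    "".join((None, "Dataset"))
except TypeError as exc:
    failure = str(exc)

print(failure)
print(format_qname((None, "Dataset")))
```

Note that the log lines before the traceback show the harvester carrying on after the download and validation errors, so the parser may have been handed a response body that was not RDF at all.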

Harvest now trigger

An owner of a DCAT resource (which we harvest from) should be able to log in and trigger immediate harvesting ("Harvest now").

This is of course only possible if the user has a user account on the CKAN instance and the user's email is the same as the DCAT maintainer email.

@matthiaspalmer @ebner did I understand this correctly?

Set up a CKAN repository we can use for staging

Since we might need to make changes to CKAN core as part of our job, we need a forked CKAN repository to work from (via feature branches). Changes would be merged into a development branch, from which we can run staging with our automatic deployment.

  • Fork CKAN for Sweden clone with improvements
  • Set up staging to point to the development branch of the new fork (@nigelbabu takes care of that when @seanh has found a good place for it)
