
ckanext-dcat's Introduction

CKAN: The Open Source Data Portal Software


CKAN is the world’s leading open-source data portal platform. CKAN makes it easy to publish, share and work with data. It's a data management system that provides a powerful platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. Read more at ckan.org.

Installation

See the CKAN Documentation for installation instructions.

Support

If you need help with CKAN or want to ask a question, use either the ckan-dev mailing list, the CKAN chat on Gitter, or the CKAN tag on Stack Overflow (try searching the Stack Overflow and ckan-dev archives for an answer to your question first).

If you've found a bug in CKAN, open a new issue on CKAN's GitHub Issues (try searching first to see if there's already an issue for your bug).

If you find a potential security vulnerability please email [email protected], rather than creating a public issue on GitHub.

Contributing to CKAN

For contributing to CKAN or its documentation, see CONTRIBUTING.

Mailing List

Subscribe to the ckan-dev mailing list to receive news about upcoming releases and future plans as well as questions and discussions about CKAN development, deployment, etc.

Community Chat

If you want to talk about CKAN development say hi to the CKAN developers and members of the CKAN community on the public CKAN chat on Gitter. Gitter is free and open-source; you can sign in with your GitHub, GitLab, or Twitter account.

The logs for the old #ckan IRC channel (2014 to 2018) can be found here: https://github.com/ckan/irc-logs.

Wiki

If you've figured out how to do something with CKAN and want to document it for others, make a new page on the CKAN wiki and tell us about it on the ckan-dev mailing list or on Gitter.

Copying and License

This material is copyright (c) 2006-2023 Open Knowledge Foundation and contributors.

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0 whose full text may be found at:

http://www.fsf.org/licensing/licenses/agpl-3.0.html


ckanext-dcat's Issues

Allow a closed set of fields to be incorporated in output

As our static file is currently about 17 MB, it would be useful to have a way to cut down the number of fields contained in the output. For a catalog record it's possible to get away with a lot less, using the fields as simple pointers to the actual data.

Writing own profile

Hi, I need to write my own RDF profile, in particular the graph_from_dataset part for the RDF endpoint. In which file do I have to save it?

Thank you
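
A profile does not go in a fixed file inside ckanext-dcat: it lives in your own extension and is registered under the `[ckan.rdf.profiles]` entry point group in that extension's setup.py, then enabled with the `ckanext.dcat.rdf.profiles` config option. A minimal sketch, assuming a hypothetical `ckanext-myext` package (module and class names are illustrative):

```python
# ckanext/myext/profiles.py (path and names are illustrative)
from rdflib import Literal
from rdflib.namespace import Namespace

from ckanext.dcat.profiles import RDFProfile

DCT = Namespace('http://purl.org/dc/terms/')


class MyDCATProfile(RDFProfile):

    def graph_from_dataset(self, dataset_dict, dataset_ref):
        # self.g is the rdflib Graph being serialized; add custom
        # triples for the dataset here.
        self.g.add((dataset_ref, DCT.title,
                    Literal(dataset_dict.get('title'))))
```

The corresponding setup.py entry point would look like `my_dcat_profile=ckanext.myext.profiles:MyDCATProfile` under `[ckan.rdf.profiles]`.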

Missing dcat:Catalog

There is no dcat:Catalog equivalent in the JSON representation of the datasets. Perhaps this would be a useful place to store information about the total number available to also solve #6

Querying the catalog with SPARQL

Is it possible, or planned for this extension, to query the catalog using SPARQL?

The readme explains how to get the metadata using the catalog endpoint, but filtering seems to be limited to modified_date. It would be great to be able to query the catalog using SPARQL.

RDF based parsing, supporting profiles

The current parsers for RDF require a particular format for both XML and JSON serializations. In the XML case the parsing is based on xpath which when dealing with RDF is both very limiting and fragile.

As part of work on the upcoming new version of the Swedish Open Data portal Öppnadata.se, a new RDF based parsing supporting multiple profiles has been developed, which will be ported to this extension. Current code is here:

https://github.com/okfn/ckanext-sweden/tree/master/ckanext/sweden/dcat

A full DCAT to CKAN mapping has been defined, and a base profile that supports all its fields implemented. The parsing is done using rdflib. This has been hooked up to a harvester to allow remote importing, but it can also be accessed to integrate in other scripts and applications. The base profile is mostly based on the DCAT AP for data portals in Europe, but is generic enough to be used out of the box or extended with custom profiles.

All these areas will be fully documented and tested.

This will also address previous issues #1 and #13.

JSON-LD not advertised in HTML

In all the templates at https://github.com/ckan/ckanext-dcat/tree/master/ckanext/dcat/templates there is no link to the JSON-LD resources. Is that just an oversight? If so, I'm happy to create a pull request to add that.

I also noticed that in home/index.html there are two application/rdf+xml links, one going to .rdf and the other to .xml, but their content is identical. In the package/read_base.html template this is not the case; but then again, in the package/search.html template there are two XML ones. This should be fixed as well, right?

Unhandled NotFound exception

On demo.ckan.org, NotFound is raised but not handled, which results in a crash.

The fix is probably to catch and handle it at https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/controllers.py#L58

  File "/usr/lib/ckan/demo/src/ckanext-dcat/ckanext/dcat/controllers.py", line 35, in read_dataset
    'format': _format})
  File "/usr/lib/ckan/demo/src/ckan/ckan/logic/__init__.py", line 429, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/demo/src/ckanext-dcat/ckanext/dcat/logic.py", line 24, in dcat_dataset_show
    dataset_dict = toolkit.get_action('package_show')(context, data_dict)
  File "/usr/lib/ckan/demo/src/ckan/ckan/logic/__init__.py", line 429, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/demo/src/ckan/ckan/logic/action/get.py", line 923, in package_show
    raise NotFound
NotFound

Organization required while harvesting with create_unowned_dataset=true

I removed all organizations, set ckan.auth.create_unowned_dataset = true, and could successfully unassign the organization from a dataset (I tried that before I had removed the organizations). But now, when I try to harvest from a DCAT source again, it complains with Create validation Error: {'Owner org': 'A organization must be supplied'}. This must be a bug, but where?

JSON interface page count?

It appears that because of the limit on the number of datasets returned per request, paging is in place. The only way to tell how many requests need to be made is to keep incrementing the page number until an empty list is returned.

There's currently no place in the JSON for a total count, number of pages, etc. Is there another way (short of package_list) of determining how many results are to be expected, and therefore how many pages to retrieve?
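
The empty-page stop condition described above can be sketched as a simple loop; `fetch_page` here is a hypothetical callable standing in for an HTTP request against the paged endpoint:

```python
def fetch_all_datasets(fetch_page):
    """Collect every dataset from a paged JSON endpoint.

    `fetch_page` returns the list of datasets for a given 1-based
    page number. Since the output carries no total count, the only
    stop condition is an empty page.
    """
    datasets = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        datasets.extend(batch)
        page += 1
    return datasets
```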

rdf harvester takes title as name, potentially non unique

When harvesting a DCAT feed, the harvester takes the dcat:Dataset title as the CKAN name (cf rdf.py#187).
But the name is unique in the CKAN model, while titles potentially are not, leading to harvest errors.

The _gen_new_name function apparently takes care of that (appending a random suffix to ensure uniqueness), but this does not work, as it only checks against already committed datasets, not those that are part of the same harvest job.
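
One way to cover names produced within the same job is to thread a single "taken" set through the whole harvest. This is an illustrative sketch, not the actual _gen_new_name implementation:

```python
import re


def unique_name(title, taken):
    """Derive a CKAN-style name from a title and make it unique
    against both committed datasets and names already generated in
    the current harvest job, by sharing one `taken` set per job."""
    base = re.sub(r'[^a-z0-9-]+', '-', title.lower()).strip('-')
    name, counter = base, 1
    while name in taken:
        counter += 1
        name = '%s-%d' % (base, counter)
    taken.add(name)
    return name
```

Before the job starts, `taken` would be seeded with the names of already committed datasets.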

Error parsing distribution formats

When I parse a DCAT file with distribution formats, the parser doesn't read the format properly and assigns a different (and strange) value to it.

For the example file below, instead of application/rdf

First time I get: "format": "N13e858a1e0e048d8b093e52f2b310c6a"
Next time I get: "format": "N37e70a18f243496fa0a9b95e49a188e6"
(...)

My DCAT file is https://www.dropbox.com/s/p6n2xf4gm6p8hao/dcat-mini.rdf

Thanks in advance

Missing Owner Org when harvesting RDF source

While running an import from a DCAT RDF source, the harvester reports that the owner org is missing:

2015-03-26 01:54:24,757 INFO [ckanext.harvest.queue] Received harvest object id: 68d9d538-313c-40f8-83e4-711bcc91afd7
2015-03-26 01:54:24,793 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester import_stage
2015-03-26 01:54:24,836 DEBUG [ckanext.harvest.harvesters.base] Create validation Error: {'Owner org': 'Missing value'}

I am trying to import the RDF example : https://github.com/ckan/ckanext-dcat/blob/master/examples/catalog_datasets_list.rdf

When configuring the harvester, I used the "organization" field in the form to specify an organization in CKAN. At some point I also tried to put owner_org: my-organization in the configuration textarea.

now my configuration fields look like that (I don't know if it's of any use...):
{ "default_groups":["a-group"], "user":"admin", "read_only": false }

After looking at the code, I don't see any place where the fetch process tries to read the configuration.

Query API with DCAT responses

It seems the main thing missing from CKAN regarding DCAT is a query API that actually returns a DCAT catalogue in some RDF format, preferably JSON-LD.

Does CKAN provide a way to integrate that via extensions? If yes, how much effort would it be?

Unit vs integration tests

Since I'd like to make some contributions, I was looking for an easy way to run unit tests, but it seems this whole testing setup requires a running CKAN instance, including Postgres, etc. I find this far too complex, since this is the kind of setup that integration tests may require, but not simple unit tests, which can live with mocks most of the time.

Was this ever raised as an issue and could this be improved somehow? It would lower the entry barrier in my opinion. Also, could ckan run off an in-memory database that is just created dynamically before running the tests?

KeyError: 'user' when harvesting ESRI DCAT document

I'm not completely sure if this is a ckanext-dcat issue or further upstream, so I'm starting at the bottom :)

This is running CKAN 2.2 on a 64-bit AWS Linux AMI base, and the ckanext-harvest commands are being invoked by hand, e.g.

paster --plugin=ckanext-harvest harvester fetch_consumer --config='/etc/ckan/default/production.ini'

et cetera

In using ckanext-dcat to harvest a DCAT document (http://vicroadsopendata.vicroadsmaps.opendata.arcgis.com/data.json) from ESRI (see: http://doc.arcgis.com/en/open-data/provider/federating-with-ckan.htm) ckanext-harvest's gather_consumer is retrieving 18 records successfully.

...
2015-05-14 06:06:31,794 DEBUG [ckanext.dcat.harvesters.base] Getting file http://vicroadsopendata.vicroadsmaps.opendata.arcgis.com/data.json?page=2
2015-05-14 06:06:32,787 DEBUG [ckanext.dcat.harvesters.base] Empty document, no more records
2015-05-14 06:06:32,790 DEBUG [ckanext.harvest.queue] Received from plugin gather_stage: 18 objects (first: [u'305d43cb-951d-4b1b-9bf8-40fa62d5f184'] last: [u'991585f9-ef83-4c1c-9922-4f19bc5c7c03'])
2015-05-14 06:06:32,793 DEBUG [ckanext.harvest.queue] Sent 18 objects to the fetch queue

But I'm running into this error from ckanext-harvest's fetch_consumer component.

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 136, in command
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 294, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 311, in fetch_and_import_stages
    success_import = harvester.import_stage(obj)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters/base.py", line 347, in import_stage
    package_id = p.toolkit.get_action('package_create')(context, package_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py", line 420, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/action/create.py", line 187, in package_create
    model.repo.commit()
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/vdm/sqlalchemy/tools.py", line 102, in commit
    self.session.commit()
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 114, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 656, in commit
    self.transaction.commit()
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 314, in commit
    self._prepare_impl()
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 290, in _prepare_impl
    self.session.dispatch.before_commit(self.session)
  File "/usr/lib/ckan/default/lib/python2.6/site-packages/sqlalchemy/event.py", line 291, in __call__
    fn(*args, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/model/extension.py", line 112, in before_commit
    methodcaller('before_commit', session)
  File "/usr/lib/ckan/default/src/ckan/ckan/model/extension.py", line 92, in notify_observers
    func(observer)
  File "/usr/lib/ckan/default/src/ckan/ckan/model/modification.py", line 47, in before_commit
    self.notify(obj, domain_object.DomainObjectOperation.new)
  File "/usr/lib/ckan/default/src/ckan/ckan/model/modification.py", line 79, in notify
    observer.notify(entity, operation)
  File "/usr/lib/ckan/default/src/ckan/ckanext/datapusher/plugin.py", line 103, in notify
    'resource_id': entity.id
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py", line 420, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckanext/datapusher/logic/action.py", line 51, in datapusher_submit
    user = p.toolkit.get_action('user_show')(context, {'id': context['user']})
KeyError: 'user'

I'm not nearly deep enough into how CKAN extensions knit together with the CKAN core to determine which extension should be supplying the user in this instance.

Relation to ckanext-spatial

Does ckanext-dcat work together with ckanext-spatial, especially considering the supported formats of the spatial fields in both? I think this should be noted somewhere in the README.

Manage downloadURL or accessURL

CKAN resources have just a single url property. We could rely on the resource type or format to decide whether it is a direct download.
For harvested datasets, we can probably store what kind of URL was defined and reuse it later on.
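
The format-based decision could be sketched as a simple heuristic; the set of "direct download" formats below is an assumption for illustration, not something the extension defines:

```python
# Formats that suggest a directly downloadable file (illustrative).
DOWNLOAD_FORMATS = {'csv', 'xls', 'xlsx', 'zip', 'json', 'geojson'}


def dcat_url_predicate(resource_dict):
    """Map a CKAN resource's single `url` to dcat:downloadURL when
    the format looks like a direct file, else to dcat:accessURL."""
    fmt = (resource_dict.get('format') or '').lower()
    return 'dcat:downloadURL' if fmt in DOWNLOAD_FORMATS else 'dcat:accessURL'
```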

DCAT theme != literal string

The themes of a DCAT dataset are represented in this extension as a list of URLs only. On export, you get something like:

    "http://www.w3.org/ns/dcat#theme": [
      {
        "@value": "http://inspire.ec.europa.eu/theme/of"
      }
    ]

This is wrong. A theme is not a literal value but a skos:Concept, which means it has to be:

    "http://www.w3.org/ns/dcat#theme": [
      {
        "@id": "http://inspire.ec.europa.eu/theme/of"
      }
    ]
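
In JSON-LD terms the fix is to emit themes as node references rather than literals. A small sketch (function names are hypothetical) that rewrites already-serialized literal theme nodes:

```python
def theme_as_node_ref(theme_uri):
    """Emit a theme as a JSON-LD node reference ('@id', i.e. a
    resource such as a skos:Concept) rather than a plain literal
    ('@value')."""
    return {"@id": theme_uri}


def fix_theme_nodes(themes):
    # Accepts either form and normalizes to node references.
    return [theme_as_node_ref(t.get("@value", t.get("@id"))) for t in themes]
```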

DCAT and CKAN disagree on license location

DCAT defines the license element as a Distribution property (i.e. part of a Resource in CKAN terms), but CKAN defines the license at the Dataset level.
This is a structural mismatch that's probably beyond the scope of this component, but it is still an issue in its usage. Can ckanext-dcat offer a pragmatic workaround, like supporting a dct:license field on dcat:Dataset, or taking the first license of a resource as the dataset license?

Allow publisher to be the organization in ckan_to_dcat

Currently publisher info is retrieved from the dcat_ extras and then from the maintainer field if not found.

It should also allow for the publisher to be a link to a CKAN organization (possibly in preference to the maintainer).

Install only the DCAT Harvester

I'm very interested in this extension, but especially in the DCAT harvester.
Would it be possible to install only that component, and how can I do it?
Thanks in advance.

Distribution's format type in DCAT XML

Hello again... Now that it's all working great, we see that there are a lot of distribution formats like application/rdf+xml, application/rss+xml, or application/gzip.

Then, when it harvests the datasets, the resources' formats are the same in CKAN. So when we try to preview or search the resources, the format does not look "good".

Would it be possible to add some kind of mapping for these formats? Or maybe to ignore the "application/" prefix and the "+xml" suffix on insert?

Thanks.
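
The prefix/suffix stripping requested above, plus a small override table, could look like this. The mappings are illustrative assumptions, not part of the extension:

```python
import re

# Explicit overrides for media types whose short form isn't obvious
# (illustrative values only).
FORMAT_OVERRIDES = {
    'application/gzip': 'GZIP',
    'application/rdf+xml': 'RDF',
}


def friendly_format(media_type):
    """Strip the 'application/' prefix and any '+xml'/'+json' suffix
    from an IANA media type so CKAN shows a short format label."""
    if media_type in FORMAT_OVERRIDES:
        return FORMAT_OVERRIDES[media_type]
    fmt = media_type.split('/', 1)[-1]
    fmt = re.sub(r'\+(xml|json)$', '', fmt)
    return fmt.upper()
```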

Support for Latin characters.

Is there support for Latin characters? I had some issues while working with texts in Spanish that contain áéíóúÁÉÍÓÚ.

The errors happen with these two identifiers, in the same order as the errors in the log:

identifier: "valoración-programas"
identifier: "Cartografía de las oficinas"

Error Log

2014-06-23 15:57:11,232 DEBUG [ckanext.dcat.harvesters] In DCATHarvester gather_stage
2014-06-23 15:57:11,235 DEBUG [ckanext.dcat.harvesters] Getting file http://xxxxx/catalogo.json
2014-06-23 15:57:13,564 DEBUG [ckanext.dcat.harvesters] Got identifier: inventario-federal
2014-06-23 15:57:13,581 ERROR [ckanext.harvest.harvesters.base] Error parsing file: 'ascii' codec can't encode character u'\xf3' in position 8: ordinal not in range(128)
2014-06-23 15:57:13,585 ERROR [ckanext.harvest.queue] Gather stage failed
2014-06-23 16:28:13,642 DEBUG [ckanext.harvest.queue] Received harvest job id: 06e4906e-eab1-4069-8fd5-1bc4a17a262c
2014-06-23 16:28:13,653 DEBUG [ckanext.dcat.harvesters] In DCATHarvester gather_stage
2014-06-23 16:28:13,655 DEBUG [ckanext.dcat.harvesters] Getting file http://xxxxx/catalogo.json
2014-06-23 16:28:15,801 DEBUG [ckanext.dcat.harvesters] Got identifier: IVF
2014-06-23 16:28:15,824 ERROR [ckanext.harvest.harvesters.base] Error parsing file: 'ascii' codec can't encode character u'\xed' in position 9: ordinal not in range(128)
2014-06-23 16:28:15,829 ERROR [ckanext.harvest.queue] Gather stage failed
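
The "'ascii' codec can't encode character" error is the classic Python 2 symptom of coercing a unicode value to a byte string with the default codec. A sketch of the usual fix: keep text as unicode internally and encode to UTF-8 explicitly at the byte boundary (logging, hashing, file output):

```python
title = u'Cartograf\xeda de las oficinas'  # "Cartografía de las oficinas"
as_bytes = title.encode('utf-8')           # safe regardless of content
roundtrip = as_bytes.decode('utf-8')       # lossless round trip
```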

"Fall-back" mapping for dcat generation

The existing mapping relies a lot on extras. For example, dct:issued maps to extra:issued, while the internal CKAN schema has a "creation date" attribute.

I understand that for harvesting one might not want to override an important internal field with harvested data (e.g. we want to keep track of the creation date in CKAN on top of the creation date of the original remote).

However, for DCAT generation it might not be that relevant to use all these extras, and it will be quite painful to create them when they are not needed (e.g. no DCAT harvesting). So I would propose a fallback mechanism for DCAT generation: whenever possible, search for the extra (e.g. extra:issued) and, if it is not available, fall back on the corresponding core attribute.

Several fields could follow the same approach: dcat:theme (use the group), dct:identifier (use the internal identifier), dct:issued and dct:modified (creation and last update), contact name and email (author and author email). For the publisher info, there would probably be some fields at the organisation level.
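
The proposed fallback can be sketched as a small lookup helper; the extra/core key pairing is illustrative (e.g. 'issued' falling back to 'metadata_created'):

```python
def dcat_value(dataset_dict, extra_key, core_key):
    """Prefer the dedicated extra (e.g. 'issued'); fall back to the
    corresponding core field (e.g. 'metadata_created') when the
    extra is absent or empty."""
    for extra in dataset_dict.get('extras', []):
        if extra.get('key') == extra_key and extra.get('value'):
            return extra['value']
    return dataset_dict.get(core_key)
```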

Add schemas for ckanext-scheming for DCAT-AP 2.1

We could provide one or two predefined schemas covering all DCAT and DCAT-AP fields ready to be used or customized by instance maintainers.

This will also help in handling multilingual metadata (#55)

We could have:

  • Basic schema with most common fields (e.g. mandatory and recommended properties in DCAT-AP)
  • Full schema with all fields (that can later be slimmed down)
  • Presets file with validators if needed

What we need:

  • JSON schema files in the relevant location
  • API and functional tests
  • Docs

Support for DCAT-AP 1.1

Options:

  • Update current default profile
  • Create a new profile

Update: I had a more in-depth look, and the changes are fully backwards compatible, essentially just adding new properties (and updating the controlled vocabularies, but the extension does not handle these for now), so we can just update the default profile.

Harvester fails for JSON-LD

I have a source of DCAT JSON-LD but always get:

Error parsing the RDF file: No plugin registered for (application/ld+json, <class 'rdflib.parser.Parser'>)

Why is this happening? I thought JSON-LD was one of the supported formats?

Add missing DCAT core fields

For datasets:

  • dct:accrualPeriodicity (dct:Frequency, how do you encode those?)
  • dct:spatial
  • dct:temporal
  • dcat:theme

For catalogs:

  • dct:accrualPeriodicity (dct:Frequency, how do you encode those?)
  • dct:spatial
  • dct:temporal
  • dcat:theme

RDF harvester doesn't seem to support paged catalogs?

While the ckanext-dcat extension does implement Hydra based paging in the exposed catalogs, the DCAT RDF harvester doesn't seem to support paged catalogs. Is this correct? And if so, is there a way to work around this issue?

Harvester picks random language

I have a JSON-LD DCAT catalog which includes translations for some strings (like the dataset title). The harvester seems to randomly pick any of those translations. I know that CKAN doesn't support multilingual fields yet, but the harvester could at least pick the translation that matches the default language of the CKAN instance, and fall back to a random choice if there's no match.

Numeric values in keywords

Can't seem to find another issue addressing this, but I've been getting this error with numeric values in keywords, which crashes the harvest process:

raise ValidationError(errors)
ckan.logic.ValidationError: {'tags': [{}, {}, {}, {}, u'Tag "1" length is less than minimum 2', {}, {}, {}, {}]}

Any way around this apart from writing a custom profile that I'm missing?
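
A pragmatic workaround sketch (function name is hypothetical): filter out keywords that would fail CKAN's tag length validation before creating the dataset, rather than letting the whole harvest fail. A custom profile could apply this in its parsing step:

```python
MIN_TAG_LENGTH = 2  # the minimum the validation error above reports


def usable_tags(keywords):
    """Drop keywords shorter than CKAN's minimum tag length so a
    single bad tag doesn't abort the dataset import."""
    return [k for k in keywords if len(str(k)) >= MIN_TAG_LENGTH]
```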

Support for multilingual RDF

Right now, neither the parsers nor the serializers take multilingual metadata into account.

For instance given the following document, a random title among the three will be picked up during parsing time:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix adms:    <http://www.w3.org/ns/adms#> .

<http://data.london.gov.uk/dataset/Abandoned_Vehicles>
      a       dcat:Dataset ;
      dct:title "Abandoned Vehicles"@en ;
      dct:title "Vehículos Abandonados"@es ;
      adms:versionNotes "Some version notes"@en ;
      adms:versionNotes "Notas de la versión"@es ;

      ...

Parsing

The standard way of dealing with this seems to be to create metadata during the parsing that can be handled by ckanext-fluent when creating or updating the datasets.
This essentially means storing a dict instead of a string, with the keys being the language codes:

{

    "version_notes": {
        "en": "Some version notes",
        "es": "Notas de la versión"
    }
    ...

}


For core fields like title or notes, we need to add an extra field suffixed with _translated:

    "title": "",
    "title_translated": {
        "en": "Abandoned Vehicles",
        "es": "Vehiculos Abandonados"
    }
    ...

TODO: what to put in title?

To support this we can probably have a variant of _object_value that handles the language tags and returns a dict accordingly (RDFLib will return a different triple for each language).
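
Such a variant could be sketched as follows, assuming the (value, language) pairs have already been extracted from the graph (one triple per language, as RDFLib yields them):

```python
def object_value_multilingual(literals, default_lang='en'):
    """Build the {lang: value} dict expected by ckanext-fluent from
    (value, language) pairs. Untagged literals fall back to the
    default language key (an assumption for this sketch)."""
    values = {}
    for value, lang in literals:
        values[lang or default_lang] = value
    return values
```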

Serializing

Similarly, the serializing code could check the fields marked as multilingual to see if they are a string or a dict and create triples accordingly, probably via a helper function.

Things to think about:

  • Should this be the default, or enabled via a config option?
  • This will probably require using ckanext-scheming as well, otherwise multilingual fields won't be properly stored (#56).

Add license field in most of the examples

The main example: ckan_dataset.json / dataset.json / dataset.rdf could do with a license field to show how that is translated between them. It's a pretty key field.

Dereference URIs while harvesting

Currently (correct me if I'm wrong) the DCAT harvester reads exactly one file. Now, with the advent of JSON-LD and the exposure of such catalogs as simple web APIs, it will be the case that not all DCAT entries are in a single file. For example:

The following may live at http://my.domain/datasets (when requested with the proper content type):

{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "title":
    {
      "@id": "dct:title"
    },
    "datasets":
    {
      "@id": "dcat:dataset",
      "@type": "@id"
    }
  },
  "@id": "http://my.domain/datasets",
  "@type": "dcat:Catalog",
  "title": "My datasets",
  "datasets": [
    "http://my.domain/datasets/1",
    "http://my.domain/datasets/2",
    "http://my.domain/datasets/3"
  ]
}

And the actual dcat:Dataset entries are accessible by following the given URIs. So at http://my.domain/datasets/1 you might find:

{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "locn": "http://www.w3.org/ns/locn#",
    "geometry": { "@id": "locn:geometry", "@type": "gsp:wktLiteral" },
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "schema": "http://schema.org/",
    "startDate": { "@id": "schema:startDate", "@type": "xsd:date" },
    "endDate": { "@id": "schema:endDate", "@type": "xsd:date" },
    "title": { "@id": "dct:title" },
    "description": { "@id": "dct:description" },
    "issued": { "@id": "dct:issued", "@type": "http://www.w3.org/2001/XMLSchema#dateTime" },
    "spatial": { "@id": "dct:spatial" },
    "temporal": { "@id": "dct:temporal" },
    "distributions": { "@id": "dcat:distribution" },
    "accessURL": { "@id": "dcat:accessURL", "@type": "@id" },
    "downloadURL": { "@id": "dcat:downloadURL", "@type": "@id" },
    "mediaType": { "@id": "dcat:mediaType" }
  },
  "@id": "http://my.domain/datasets/1",
  "@type": "dcat:Dataset",
  "title": "My first dataset",
  "description": "This is a dataset.",
  "issued": "2015-06-02",
  "spatial": {
    "@type": "dct:Location",
    "geometry": "POLYGON((-10.58 70.09,34.59 70.09,34.59 34.56,-10.58 34.56, -10.58 70.09))"
  },
  "temporal": {
    "@type": "dct:PeriodOfTime",
    "startDate": "2005-12-31",
    "endDate": "2006-12-31"
  },
  "distributions": [
    {
      "@type": "dcat:Distribution",
      "title": "GeoSPARQL endpoint",
      "accessURL": "http://my.domain/datasets/1/geosparql",
      "mediaType": "application/sparql-query"
    },
    {
      "@type": "dcat:Distribution",
      "title": "OpenDAP endpoint",
      "accessURL": "http://my.domain/datasets/1/opendap",
      "mediaType": "application/vnd.opendap.org.capabilities+json"
    }
  ]
}

So although there might be value in being able to provide a dump of everything, it may not always be easily possible, and a harvester should support both approaches and follow at least the relevant DCAT terms (not everything, obviously). Does that make sense?

EDIT: I guess the same applies if you have a small version of the datasets inlined (some fields missing) but provide the full version only when following the dataset URL ("@id" field). I'm not sure how a crawler would know if the embedded dataset is complete or not, it's a bit tricky.
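
The two-step harvest described above (fetch the catalog, then dereference each dcat:dataset URI) can be sketched like this; `get_jsonld` is a hypothetical callable that requests a URL with the proper content type and returns the parsed JSON-LD document:

```python
def harvest_with_dereferencing(get_jsonld, catalog_url):
    """Fetch the catalog document, then follow each entry listed
    under 'datasets' when it is a bare URI; keep inlined entries
    as-is (possibly partial, per the note above)."""
    catalog = get_jsonld(catalog_url)
    datasets = []
    for entry in catalog.get('datasets', []):
        if isinstance(entry, str):
            # A bare URI: dereference it for the full description.
            datasets.append(get_jsonld(entry))
        else:
            # Already inlined in the catalog document.
            datasets.append(entry)
    return datasets
```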
