gsa / ckanext-datajson

This project forked from hhs/ckanext-datajson

A CKAN extension for US-DCAT and /data pages in Project Open Data implementation

Home Page: https://resources.data.gov/schemas/dcat-us/v1.1/

License: Other

Python 97.60% HTML 1.27% Shell 0.26% Dockerfile 0.14% Makefile 0.72%

ckanext-datajson's Introduction

ckanext-datajson

A CKAN extension containing the datajson plugins. It is used by http://catalog.data.gov/ to harvest data sources from remote /data.json files according to the U.S. Project Open Data metadata specification (https://resources.data.gov/schemas/dcat-us/v1.1/).

The datajson_harvest plugin provides a harvester to import datasets from other remote /data.json files. See below for setup instructions.

The extension also provides a view to validate /data.json files at http://ckanhostname/dcat-us/validator.

Features

  • [:heavy_check_mark:] datajson provides data.json export and DCAT-US metadata UI integration
  • [:heavy_check_mark:] datajson_harvest extends ckanext-harvest to collect metadata from remote data.json sources
  • [:warning:] cmsdatanav_harvest extends ckanext-harvest to collect metadata from the CMS Data Navigator catalog
  • [:heavy_check_mark:] datajson_validator provides a web form to validate data.json files for DCAT-US metadata compliance.

Usage

Requirements

All requirements are tracked in setup.py when possible. Some CKAN extensions are not on PyPI, so they (and their dependencies) must be tracked in requirements.txt.

CKAN version Compatibility
<=2.7
2.8 ⚠️
2.9.5 ✔️
2.9.6 ✔️

Installation

To install, activate your CKAN virtualenv, install dependencies, and install the module in develop mode, which just puts the directory in your Python path.

. path/to/pyenv/bin/activate
pip install -r requirements.txt
python setup.py develop

Then in your CKAN .ini file, add datajson to your ckan.plugins line:

ckan.plugins = (other plugins here...) datajson

That's the plugin for /data.json output. To make the harvester available, also add:

ckan.plugins = (other plugins here...) harvest datajson_harvest

To make the datajson validator route and web form available, also add:

ckan.plugins = (other plugins here...) datajson_validator

Development

Setup

Build the docker containers.

$ make build

Start the docker containers.

$ make up

CKAN will start at localhost:5000.

Clean up any containers and volumes.

$ make clean

Open a shell to run commands in the container.

$ docker-compose exec app /bin/bash

If you're unfamiliar with docker-compose, see our cheatsheet and the official docs.

For additional make targets, see the help.

$ make help

Testing

The tests follow the guidelines for testing CKAN extensions.

To run the extension tests, start the containers with make up, then:

$ make test

Lint the code.

$ make lint

Matrix builds

The test development environment drops as many dependencies as possible. It is not meant to have feature parity with GSA/catalog.data.gov. Tests should mock external dependencies where possible.

In order to support multiple versions of CKAN, or even upgrade to new versions of CKAN, we support development and testing through the CKAN_VERSION environment variable.

$ make CKAN_VERSION=2.9.5 test
$ make CKAN_VERSION=2.9 test

Note: When testing patch versions of CKAN, the services may not have patch releases. So, take note of the SERVICES_VERSION variable which tracks the minor release to pull for the db and solr images.

Credit / Copying

Original work written by the HealthData.gov team. It has been modified in support of Data.gov.

As a work of the United States Government, this package is in the public domain within the United States. Additionally, we waive copyright and related rights in the work worldwide through the CC0 1.0 Universal public domain dedication (which can be found at http://creativecommons.org/publicdomain/zero/1.0/).

Ways to Contribute

We're so glad you're thinking about contributing to ckanext-datajson!

Before contributing to ckanext-datajson we encourage you to read our CONTRIBUTING guide, our LICENSE, and our README (you are here), all of which should be in this repository. If you have any questions, you can email the Data.gov team at [email protected].

ckanext-datajson's People

Contributors

adborden, alanswx, anuveyatsu, avdata99, btylerburton, chris-macdermaid, dano-reisys, fuhuxia, georgethomas, hkdctol, jbrown-xentity, jin-sun-tts, joshdata, mogul, mwengren, nickumia-reisys, philipashlock, pjsharpe07, robert-bryson, rshewitt, thejuliekramer, zwmcfarland

ckanext-datajson's Issues

Alembic error when adding this extension to ckan-docker

Hello GSA Team,

I am encountering an Alembic "script_location" error while trying to add this extension to the basic ckan-docker setup. The build completes successfully, but the error occurs when running the container. Previously, I managed to get CKAN running with the Harvest and DCAT extensions, but now I am facing issues with the Datajson extension.

Any help would be appreciated. I am new to CKAN, so please let me know if there are any steps I might be missing to get this working. For a bit of background, NASA is transitioning to CKAN, and I'm hoping to leverage this extension to generate a data.json file so that catalog.data.gov can continue harvesting. Below, I have posted the error I am seeing and my current CKAN Dockerfile.

Error:

ckan-dev-1    | Traceback (most recent call last):
ckan-dev-1    |   File "/usr/bin/ckan", line 8, in <module>
ckan-dev-1    |     sys.exit(ckan())
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
ckan-dev-1    |     return self.main(*args, **kwargs)
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 1055, in main
ckan-dev-1    |     rv = self.invoke(ctx)
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
ckan-dev-1    |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
ckan-dev-1    |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
ckan-dev-1    |     return ctx.invoke(self.callback, **ctx.params)
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/click/core.py", line 760, in invoke
ckan-dev-1    |     return __callback(*args, **kwargs)
ckan-dev-1    |   File "/srv/app/src/ckan/ckan/cli/db.py", line 66, in upgrade
ckan-dev-1    |     _run_migrations(plugin, version)
ckan-dev-1    |   File "/srv/app/src/ckan/ckan/cli/db.py", line 124, in _run_migrations
ckan-dev-1    |     repo.upgrade_db(version)
ckan-dev-1    |   File "/srv/app/src/ckan/ckan/model/__init__.py", line 350, in upgrade_db
ckan-dev-1    |     alembic_upgrade(self.alembic_config, version)
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/alembic/command.py", line 302, in upgrade
ckan-dev-1    |     script = ScriptDirectory.from_config(config)
ckan-dev-1    |   File "/usr/lib/python3.10/site-packages/alembic/script/base.py", line 156, in from_config
ckan-dev-1    |     raise util.CommandError(
ckan-dev-1    | alembic.util.exc.CommandError: No 'script_location' key found in configuration.
ckan-dev-1 exited with code 0

Current dockerfile.dev:

FROM ckan/ckan-dev:2.10

RUN pip3 install -e 'git+https://github.com/ckan/ckanext-harvest.git@master#egg=ckanext-harvest'
RUN pip3 install -r ${APP_DIR}/src/ckanext-harvest/pip-requirements.txt

RUN pip3 install -e 'git+https://github.com/GSA/ckanext-datajson#egg=ckanext-datajson'
RUN pip3 install -r ${APP_DIR}/src/ckanext-datajson/requirements.txt
RUN pip3 install -r ${APP_DIR}/src/ckanext-datajson/dev-requirements.txt
RUN cd ${APP_DIR}/src/ckanext-datajson && python setup.py develop

COPY docker-entrypoint.d/* /docker-entrypoint.d/

COPY patches ${APP_DIR}/patches

RUN for d in $APP_DIR/patches/*; do \
        if [ -d $d ]; then \
            for f in `ls $d/*.patch | sort -g`; do \
                cd $SRC_DIR/`basename "$d"` && echo "$0: Applying patch $f to $SRC_DIR/`basename $d`"; patch -p1 < "$f" ; \
            done ; \
        fi ; \
    done

Use CKAN dataset URI as identifier

It would be nice to have the POD identifier be a URL to the dataset in CKAN. This would need to be generated with a Wrapper function, as it's not stored in the package.

It should probably use the raw dataset ID from CKAN. Even if a custom URL or alias has been created for the dataset, CKAN still honors the URL with the raw ID.
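A minimal sketch of such a wrapper, assuming the site URL is passed in by the caller and the package is a plain CKAN package dict. The function name dataset_uri is hypothetical and not part of the extension.

```python
def dataset_uri(site_url, package):
    """Build a POD identifier from the CKAN site URL and the raw dataset ID.

    Hypothetical helper: the raw ID is used instead of the name (slug)
    because the ID stays the same even if the dataset is renamed.
    """
    return "{}/dataset/{}".format(site_url.rstrip("/"), package["id"])


# Example with made-up values.
package = {"id": "0123abcd", "name": "my-dataset"}
print(dataset_uri("https://ckan.example.org/", package))
```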

Upgrade this extension to CKAN 2.8

Upgrade this extension to CKAN 2.8 and test it.

  • Do a manual test to check if it's working
    • Create a DataJSON harvest source
    • Run the harvester job
    • Analyze whether the harvested datasets are what we expect.
  • Import tests from datajson fork
  • Add CircleCI configuration (re-use from datopian)

FYI @adborden

False positive validation error for URLs

The validator is throwing an error for what looks like a well-formed URL (from the DOJ's data.json):

### ERROR #1: 
'landingpage':'http://www.icpsr.umich.edu/icpsrweb/NACJD/studies/3074?archive=NACJD&q=3074&permit[0]=AVAILABLE&x=15&y=11' 
is not valid under any of the given schemas.
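One possible factor, offered here as an assumption rather than a confirmed diagnosis: RFC 3986 does not allow unencoded square brackets in the query component, so a strict URI pattern may reject this URL even though browsers accept it. The sketch below percent-encodes the query with the standard library; the helper name encode_query_brackets is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_query_brackets(url):
    """Percent-encode characters such as [ and ] in the query string.

    '=', '&' and '%' are left alone so existing key/value pairs and
    already-encoded sequences are preserved.
    """
    parts = urlsplit(url)
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


url = ("http://www.icpsr.umich.edu/icpsrweb/NACJD/studies/3074"
       "?archive=NACJD&q=3074&permit[0]=AVAILABLE&x=15&y=11")
print(encode_query_brackets(url))
```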

Use CKAN tags as POD keyword

It would be nice to have a predefined wrapper for the POD keyword field. CKAN ships with a tagging system, and it would be nice to not have to do double entry for tags.
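A rough sketch of what such a wrapper could look like, assuming the package dict carries its tags as a list of dicts with a "name" key (the shape CKAN returns for a dataset). The function name tags_to_keywords is hypothetical.

```python
def tags_to_keywords(package):
    """Return the CKAN tag names of a package as a POD keyword list."""
    return [tag["name"] for tag in package.get("tags", []) if tag.get("name")]


# Example with a made-up package dict.
package = {"tags": [{"name": "health"}, {"name": "medicare"}]}
print(tags_to_keywords(package))
```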

Non Fork ckanext-datajson GitHub Project

GitHub does not search forked projects. This means that searching for 'ckanext-datajson' does not find this project. Yet this project is currently maintained, while the original HHS project was last updated on Mar 15, 2015.

Please consider creating an unforked version of this project to make it easier to find.

Below is the GitHub search that does not show this GSA project.

https://github.com/search?utf8=%E2%9C%93&q=ckanext-datajson

404 errors for DCAT-US are not showing errors properly

Related to Multi#356
Similar to GSA/data.gov#1765

DCAT-US sources with faulty URLs (404 errors) do not show errors properly

How to reproduce

Add a DCAT-US source pointing to http://www.cpsc.gov/data.json (or any other url failing with 404 error)

Expected behavior

The job should be marked as finished and show information about the error.

Actual behavior

Failing silently

==> /var/log/gather-consumer.log <==
2020-08-06 13:17:18,492 DEBUG [ckanext.harvest.queue] Received harvest job id: 4ac567eb-b64d-4370-a29c-a7c72d213703
2020-08-06 13:17:18,506 DEBUG [ckanext.datajson.datajson_ckan_28] In <Plugin DataJsonHarvester 'datajson_harvest'> gather_stage (http://www.cpsc.gov/data.json)

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 207, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 322, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 354, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 412, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/datajson_ckan_28.py", line 122, in gather_stage
    source_datasets, catalog_values = self.load_remote_catalog(harvest_job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/harvester_datajson.py", line 35, in load_remote_catalog
    datasets = json.loads(lstrip_bom(urllib2.urlopen(req).read()))
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 467, in error
    result = self._call_chain(*args)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 654, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python2.7.16/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

The error is not reported in the dashboard job.
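As an illustration of the expected behavior, the fetch could catch the HTTP error and hand a message back to the caller, which would then record it against the harvest job. This is a Python 3 sketch using only the standard library, not the extension's actual code; the function name, the return shape and the opener parameter are assumptions made for the example.

```python
import urllib.error
import urllib.request


def load_remote_catalog(url, opener=urllib.request.urlopen):
    """Fetch a remote data.json.

    Returns (raw_bytes, error_message); exactly one of the two is None.
    The opener argument exists only so the fetch can be replaced in tests.
    """
    try:
        with opener(url) as response:
            return response.read(), None
    except urllib.error.HTTPError as exc:
        # e.g. "HTTP Error 404 fetching <url>: Not Found"
        return None, "HTTP Error {} fetching {}: {}".format(exc.code, url, exc.reason)
    except urllib.error.URLError as exc:
        return None, "Could not reach {}: {}".format(url, exc.reason)
```

A gather stage built on this could report the message and end the job instead of letting the exception escape the consumer.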

404 on /data.jsonld and internal server error on /pod/validate (wrong param count for do_validation())

Ckan version: 2.9.a
Python version: 2.7.12

I installed the plugin according to the readme, but I'm getting a 404 on /data.jsonld, and an error on the /pod/validate page.

On /pod/validate I get the following error:

Internal error
Something bad happened: do_validation() takes exactly 3 arguments (2 given)

There's no log output in any file in the /var/log/apache2/ directory, and I don't know where else I might find logs from the datajson extension.

The /data.json path works fine.

ImportError: cannot import name Feature

ci/circleci failed at install step with this error:

      File "/tmp/pip-build-hPa1iv/Jinja2/setup.py", line 40, in <module>
        from setuptools import setup, Extension, Feature
    ImportError: cannot import name Feature

ImportError: 'module' object has no attribute 'DataJsonPlugin'

Update: I was using an older clone of datajson, upon update and rebuild the "DataJson" error was fixed, but the "DataJsonHarvester" error in the following comment is still an issue. The noted fix when using WSGI does not resolve the issue.

I am getting the following error when datajson plugin is enabled. I can provide more information if needed.
err.datajson.txt

CKAN version 2.6.0
OS: CentOS 7
Python 2.7
Setuptools 30.0.0

I believe another individual is getting a similar error for 'DataJsonHarvester'. He is running Ubuntu.

data.json harvest improvements

  • currently we check the individual dataset json object hash to identify if there are any new updates to the dataset.

  • as a first step, we should check the entire json file content hash to make sure there are some updates compared to the previous version, and the next step would be to go into the individual datasets

This additional check improves the harvesting process: if the whole-file hash did not change, there is no need to go into the individual dataset hashes.
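The two-step check described above can be sketched as follows. This is an illustration only: the function names, the choice of SHA-1, and the idea of keeping previous hashes in a dict keyed by dataset identifier are assumptions, not the harvester's actual storage.

```python
import hashlib
import json


def content_hash(raw):
    """Hex digest of a bytes payload."""
    return hashlib.sha1(raw).hexdigest()


def datasets_to_process(raw, previous_file_hash, previous_dataset_hashes):
    """Return (file_hash, changed_datasets) for a data.json payload.

    Step 1: compare the hash of the whole file with the previous run.
    Step 2: only if it differs, compare each dataset's own hash.
    """
    file_hash = content_hash(raw)
    if file_hash == previous_file_hash:
        return file_hash, []  # nothing changed, skip per-dataset work

    catalog = json.loads(raw.decode("utf-8"))
    changed = []
    for dataset in catalog.get("dataset", []):
        serialized = json.dumps(dataset, sort_keys=True).encode("utf-8")
        if previous_dataset_hashes.get(dataset["identifier"]) != content_hash(serialized):
            changed.append(dataset)
    return file_hash, changed
```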

Separate wrapper functions into CKAN core and extra

In order to make it easier for agencies to adapt their own fields, it might be nice to split up the wrapper functions.

Once we start reading from CKAN core fields as much as possible for POD fields, it might make sense to have core CKAN field wrappers and agency-defined wrappers for any extra fields they define. It might be worth a little bit of reorganizing class files in the source code to make it more obvious where to go to extend the core extension fields.
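One way the split could look, sketched with hypothetical class and method names (CoreWrappers, AgencyWrappers and bureau_code are illustrative and do not exist in the extension): core wrappers read only CKAN core fields, and an agency subclass adds wrappers for its own extras.

```python
class CoreWrappers:
    """Wrappers that derive POD fields from CKAN core package fields."""

    @staticmethod
    def title(package):
        return package.get("title")

    @staticmethod
    def keyword(package):
        return [tag["name"] for tag in package.get("tags", [])]


class AgencyWrappers(CoreWrappers):
    """Agency-defined wrappers for extra fields; extends the core set."""

    @staticmethod
    def bureau_code(package):
        # Extras are assumed to be a list of {"key": ..., "value": ...} dicts.
        extras = {e["key"]: e["value"] for e in package.get("extras", [])}
        return extras.get("bureau_code")
```

With this layout an agency only touches its own subclass, while the core field wrappers stay in one obvious place.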

Remove 'tags' and 'extrasRollup' fields

We are getting data.json output with the two fields 'tags' and 'extrasRollup' un-decoded. These are stored in the PackageExtra model within the database (i.e. the 'package_extra' table), and they appear to be written straight to the JSON output.
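A minimal sketch of the requested filtering, assuming extras arrive as a list of {"key": ..., "value": ...} dicts and that dropping the two keys before export is acceptable. The names HIDDEN_EXTRAS and visible_extras are hypothetical.

```python
# Keys that should not be copied into the data.json output (assumption).
HIDDEN_EXTRAS = {"tags", "extrasRollup"}


def visible_extras(extras):
    """Return the extras list without the internal bookkeeping keys."""
    return [extra for extra in extras if extra.get("key") not in HIDDEN_EXTRAS]


extras = [
    {"key": "tags", "value": "raw,un-decoded"},
    {"key": "extrasRollup", "value": "{...}"},
    {"key": "publisher", "value": "Example Agency"},
]
print(visible_extras(extras))
```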

Initialization Requires HTTPS access to project-open-data.cio.gov

For security reasons, I don't want software that I install to reach out to remote web sites during the installation or initialization process. In release v1.1, there is a file called ckanext/datajson/datajsonvalidator.py. This file has the following code starting at line 90:

omb_burueau_codes = set()
for row in csv.DictReader(urllib.urlopen("https://project-open-data.cio.gov/data/omb_bureau_codes.csv")):
    omb_burueau_codes.add(row["Agency Code"] + ":" + row["Bureau Code"])

Can this code be changed to use a local file, perhaps specified using an environment variable? The HTTPS fetch can be moved into the installation instructions. Letting this file be pulled from a local file system should make it easier to test as well.
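A sketch of the local-file variant, written for Python 3. The environment variable name OMB_BUREAU_CODES_PATH and the default file name are assumptions for illustration; the column names come from the snippet quoted above.

```python
import csv
import os


def load_bureau_codes(path=None):
    """Load "Agency Code:Bureau Code" pairs from a local CSV file.

    The path comes from the argument, else from the (hypothetical)
    OMB_BUREAU_CODES_PATH environment variable, else a default file name.
    """
    path = path or os.environ.get("OMB_BUREAU_CODES_PATH", "omb_bureau_codes.csv")
    codes = set()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            codes.add(row["Agency Code"] + ":" + row["Bureau Code"])
    return codes
```

Because nothing is fetched at import time, a test can point the function at a small fixture file.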

As a side note, omb_burueau_codes seems to be misspelled.

Wrappers.mime_type_it not identifying formats

Wrappers.mime_type_it never has any formats and thus cannot look up the mimetype for an extension. As a result, every resource throws a warning:

Missing mediaType for resource in package
