theodi / octopub

Publish data easily, quickly and correctly

Home Page: https://octopub.io/

License: Other

Ruby 71.08% JavaScript 6.99% CSS 0.11% HTML 19.06% SCSS 2.76%

data-publication ruby pusher rails ruby-on-rails

octopub's Introduction

Octopub

Octopub is a Ruby on Rails application that provides a simple and frictionless way for users to publish data easily, quickly and correctly on GitHub.

Summary of features

More information is in the announcement blog post

The live instance of Octopub is running at http://octopub.io/

Follow the public feature roadmap for Octopub

Requirements

These are the tools and services required to get Octopub fully working for development, testing and production environments. We'll explain how to set these up in the next section.

Ruby 2.4
PostgreSQL
Redis/Sidekiq
GitHub account
AWS account
Pusher account
Open Data Certificates account

Setup

Redis/Sidekiq

Sidekiq is used for managing the background processing of data uploads. To use Sidekiq, install Redis by following its installation instructions or, if you use Homebrew, run brew install redis.

Environment variables

For development Octopub uses the dotenv gem to load environment variables. Create a file called .env in your project root and paste in the variables below. We'll fill these in as we go along.

# GitHub App Client ID & secret
GITHUB_KEY=
GITHUB_SECRET=

# OAuth access token for GitHub API access
GITHUB_TOKEN=

S3_BUCKET=

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

PUSHER_APP_ID=
PUSHER_KEY=
PUSHER_SECRET=
PUSHER_CLUSTER=

BASE_URI=
ODC_API_KEY=
ODC_USERNAME=

# production only
SMTP_USERNAME=
SMTP_PASSWORD=
SMTP_SERVER=
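
With this many variables it is easy to leave one blank. The following is a minimal sketch of a check you could run before starting the app. It assumes only the variable names from the template above; REQUIRED_ENV_VARS and missing_env_vars are hypothetical names and are not part of Octopub. The production-only SMTP variables are left out on purpose.

```ruby
# Hypothetical helper, not part of Octopub: reports which variables from the
# .env template above are unset or blank.
REQUIRED_ENV_VARS = %w[
  GITHUB_KEY GITHUB_SECRET GITHUB_TOKEN
  S3_BUCKET AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
  PUSHER_APP_ID PUSHER_KEY PUSHER_SECRET PUSHER_CLUSTER
  BASE_URI ODC_API_KEY ODC_USERNAME
].freeze

# Returns the names that are missing or empty in the given environment.
# `env` can be ENV itself or any Hash-like object keyed by variable name.
def missing_env_vars(env = ENV, required = REQUIRED_ENV_VARS)
  required.select { |name| env[name].to_s.strip.empty? }
end
```

Calling missing_env_vars from a Rails console (after dotenv has loaded the .env file) should return an empty array once everything is filled in.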

GitHub

Create a GitHub application:

Log in to GitHub.
In Settings -> Developer settings -> OAuth applications, create a new OAuth application with a unique name. You can use http://octopub.io for the homepage and for the callback URL use your local server address, i.e. http://localhost:3000. Click on your OAuth application to see your Client ID and Client Secret, and update your .env file:

GITHUB_KEY=<YOUR CLIENT ID>
GITHUB_SECRET=<YOUR CLIENT SECRET>

In Settings -> Developer settings -> Personal access tokens, generate a new token with a sensible description, e.g. octopub_dev_token, and update your .env file:

GITHUB_TOKEN=<Your token>

AWS

Create an S3 bucket:

In AWS go to the S3 service and create a bucket with a sensible name. Make sure the region is set to EU(Ireland) since Octopub uses this.
Click on your bucket and go to the Permissions tab. Click on CORS Configuration and paste in the configuration below. This will allow your local development version of Octopub to make requests to your S3 bucket.

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>http://localhost:3000</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>

Grant permissions to your bucket:

In AWS go to the IAM (Identity and Access Management) service page.
Click Users.
Add a new user and give it a name, e.g. octopub-development, and for Access Type select Programmatic Access.
For permissions, select Attach existing policies directly - this will open a new tab in your browser.
Click create your own policy and give it a name, e.g. octopub-dev-permissions. Then for the policy document, use the following template, replacing <YOUR BUCKET NAME> with your bucket name.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "AllowAdminAccessToBucketOnly",
           "Action": [
               "s3:*"
           ],
           "Effect": "Allow",
           "Resource": [
               "arn:aws:s3:::<YOUR BUCKET NAME>",
               "arn:aws:s3:::<YOUR BUCKET NAME>/*"
           ]
       }
   ]
}

Click validate policy just to be sure you've not made a typo, then confirm.
Back on the Set permissions page, select the policy you've just created in the table by selecting the checkbox, then click Review and then click Create user.
Download the CSV file containing your Access key ID and Secret access key and update your .env file:

AWS_ACCESS_KEY_ID=<YOUR ACCESS KEY ID>
AWS_SECRET_ACCESS_KEY=<YOUR SECRET ACCESS KEY>
S3_BUCKET=<YOUR BUCKET NAME>
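
If you would rather not edit the policy template by hand, it can be generated. This is a sketch under stated assumptions: bucket_policy is a hypothetical helper (not part of Octopub or the AWS SDK), and my-octopub-dev-bucket is a placeholder bucket name. The output has the same structure as the policy document shown earlier in this section.

```ruby
require 'json'

# Hypothetical helper: fills a bucket name into the policy template from
# this section and returns it as a plain Ruby hash.
def bucket_policy(bucket_name)
  raise ArgumentError, 'bucket name must not be empty' if bucket_name.to_s.strip.empty?

  {
    'Version' => '2012-10-17',
    'Statement' => [
      {
        'Sid' => 'AllowAdminAccessToBucketOnly',
        'Action' => ['s3:*'],
        'Effect' => 'Allow',
        'Resource' => [
          "arn:aws:s3:::#{bucket_name}",   # the bucket itself
          "arn:aws:s3:::#{bucket_name}/*"  # every object in the bucket
        ]
      }
    ]
  }
end

# Placeholder bucket name; paste the printed JSON into the IAM policy editor.
puts JSON.pretty_generate(bucket_policy('my-octopub-dev-bucket'))
```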

Pusher

Log in to https://pusher.com or create a free account.
Create a new application and call it something sensible.
Select the App Keys tab and use the relevant values there to update your .env file:

PUSHER_APP_ID=
PUSHER_KEY=
PUSHER_SECRET=
PUSHER_CLUSTER=

ODC (open data certificate) setup

Log in to https://certificates.theodi.org/ or create a free account.
Go to your profile page, copy your API token and update your .env file:

ODC_API_KEY=<API TOKEN>
ODC_USERNAME=<YOUR USERNAME (email address you used when signing up)>

Running the full application locally

Assuming you have completed the setup instructions above...

Start Redis with redis-server.
Start Sidekiq with bundle exec sidekiq in the application directory.
Create the PostgreSQL databases specified in config/database.yml and run rails db:migrate.
Start Octopub with rails s in the application directory.
Navigate to the home page.
Sign in to Octopub with your GitHub account.

Congratulations, you should be signed in! Now try adding some data.

Checking the Sidekiq queue

Start a rails console session and then...

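require 'sidekiq/api'
# Number of jobs waiting in the default queue
Sidekiq::Queue.new.size
# First waiting job in the default queue (nil if the queue is empty)
Sidekiq::Queue.new.first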

Tests

Octopub uses the RSpec test framework and requires a .env file to be present. See the Environment variables section above for details; you can reuse your development variables.*

The test suite can be run with bundle exec rspec.

* Note - the tests use VCR or mocking to allow the tests to be run offline without interfacing with the services.

Deployment

A commit to master will trigger a Travis CI run; if it succeeds, the build is automatically deployed to Heroku.

Caching

The GitHub organisations are cached for the logged-in user. They can be cleared from a Rails console with Rails.cache.clear, which empties the whole cache.

octopub's Issues

Add a configuration page for non-hosted instances of octopub

Allowing users to do things like:

Specify custom URL
Add custom header and footer (see #6)
And more!

Expose through API for Comma Chameleon

Allow users to add files to existing datasets via the API

Data Package resource has both "path" and "url" but can only have one

Hello :). I noticed an issue with my Data Package:

...
 "resources": [
    {
      "url": "http://danfowler.github.io/my-first-dataset/data/data.csv",
      "name": "Periodic Table",
      "mediatype": "text/csv",
      "description": "Periodic Table of the Elements",
      "path": "data/data.csv",
      "schema": {
        "fields": [
...

There's been a recent change in the Data Package spec: a resource can only have ONE of path, url, or data. Not to worry, though:

The path attribute may also be used for Data Packages located online – in this case it determines the resource file’s URL relative to the datapackage.json’s URL.
http://dataprotocols.org/data-packages/#required-fields-1
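
A minimal Ruby sketch of the fix this implies: keep only "path" and derive the URL from it when needed. normalise_resource and resource_url are hypothetical helpers, not Octopub code, and the datapackage.json location used in the example is an assumption inferred from the resource URL above.

```ruby
require 'uri'

# Hypothetical helper: when a resource carries both "path" and "url",
# keep "path" only, following the rule quoted above.
def normalise_resource(resource)
  return resource unless resource.key?('path') && resource.key?('url')

  resource.reject { |key, _value| key == 'url' }
end

# Resolves a resource's relative "path" against the datapackage.json URL.
def resource_url(datapackage_url, resource)
  URI.join(datapackage_url, resource.fetch('path')).to_s
end

resource = {
  'url' => 'http://danfowler.github.io/my-first-dataset/data/data.csv',
  'name' => 'Periodic Table',
  'path' => 'data/data.csv'
}
clean = normalise_resource(resource)

# Assumed location of datapackage.json, for illustration only.
puts resource_url('http://danfowler.github.io/my-first-dataset/datapackage.json', clean)
# => http://danfowler.github.io/my-first-dataset/data/data.csv
```

Nothing is lost by dropping "url" here, since the same address can be rebuilt from "path".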

@rgrp @pwalsh

Allow users to alter header / footer and logo

At the moment, the webpages are very heavily ODI branded. It'd be nice to allow users to upload their own header / footer, or just specify a logo, so it can tie into their branding.

Github Jekyll build sometimes fails randomly

I'm not sure why, and I don't get an email like I'd expect, so might have to handle the Jekyll page build server side before pushing to Github, then add a .nojekyll file to the repo

Organisation support

Allow a user to optionally add a repo to an organisation that they belong to, rather than their own personal account.

Similarly, allow users to edit datasets in organisations they belong to.

Allow user to search data

Help users find the answer to their question in the data preview more easily, and avoid having to download the data, by providing a search function.

Here's an example from OKI's DataPackage Viewer.

Preview data structure from data package

Show the structure of the data in a human-readable form.
E.g. In a table show for each column:

the column name
title
type
format
description
constraints

Something similar here:

OKI Data Package Viewer
example view

Rename to Simple Data Publisher

It uses Github, but it's not really about Git at all...

Limit filetypes to CSV

Any non-CSV file can cause 'issues' 😕

Generated data from CSV causes "Structural problem: Assumed header" validation of CSVLint

The data package generated using octopub: http://leowmjw.github.io/selangor_adun_2016/data/573ef1d66373766db400013b.html

causes it to fail the "Structural problem: Assumed header" validation: http://csvlint.io/validation/573ef4906373766db400013c

as per described here: Data-Liberation-Front/csvlint.io#176

Solution:

...
Content-Type: text/csv;header=present if your file has a header row or Content-Type: text/csv;header=absent if it does not
...
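
As a small illustration of the suggested header value, here is a Ruby sketch. csv_content_type is a hypothetical helper name; how and where Octopub (or whatever serves the file) would set the header is not covered here.

```ruby
# Hypothetical helper: builds the Content-Type value suggested above,
# depending on whether the CSV file has a header row.
def csv_content_type(has_header)
  "text/csv;header=#{has_header ? 'present' : 'absent'}"
end

puts csv_content_type(true)   # => text/csv;header=present
puts csv_content_type(false)  # => text/csv;header=absent
```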

Submission form wipes all responses

When trying to fill in the metadata fields, if Octopub doesn't like something you have entered - it clears the whole form and you have to start again. Time consuming and frustrating. It also doesn't tell you what the problem was.

Add more information to generated README

Even if it's just Created with Git Data Publisher and a link

Unable to update existing data file

When trying to update https://octopub.io/datasets/80/edit, I am able to change a file in an existing (published) dataset, but the change is not reflected in the GitHub repo, even after waiting a while.

When I add the same CSV as a new file in the dataset, it works (after a short delay).

Guidance for how to fill in the upload form

Add DCAT to HTML readme for certification

Change how we add files

After chatting to Github support, it seems that the issue described in #42 is caused by each file push triggering a build, so things get confused. The correct solution to creating files is therefore to use the Git Data API and:

Create git blobs for each file (using the Git Blobs API)
Create a new tree (using the Git Trees API)
Create a new commit which points to that tree and the previous commit on the branch (using the Git Commits API)
Update the branch to point to the new commit (using the Git Refs API)

Also a sensible failsafe is to make sure a build isn't happening first using the pages API
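
The four steps above could be sketched in Ruby roughly as follows. This is an illustration under assumptions, not the change that was actually made: client is assumed to be an Octokit-style client, the method names (ref, commit, create_blob, create_tree, create_commit, update_ref) are assumed to match Octokit's helpers and should be checked against the version in use, and commit_files is a hypothetical name. The Pages build check mentioned above is not included.

```ruby
# Sketch only: pushes several files in ONE commit, so only one page build
# is triggered. `client` is assumed to behave like an Octokit client.
#
# files: hash of repository path => file content, e.g. { "data/a.csv" => "..." }
def commit_files(client, repo, branch, files, message)
  ref = "heads/#{branch}"

  # Look up the current tip of the branch and the tree it points to.
  parent_sha = client.ref(repo, ref).object.sha
  base_tree_sha = client.commit(repo, parent_sha).commit.tree.sha

  # 1. Create a git blob for each file.
  entries = files.map do |path, content|
    blob_sha = client.create_blob(repo, content)
    { path: path, mode: '100644', type: 'blob', sha: blob_sha }
  end

  # 2. Create a new tree layered on top of the existing one.
  tree = client.create_tree(repo, entries, base_tree: base_tree_sha)

  # 3. Create a commit pointing at that tree and at the previous commit.
  commit = client.create_commit(repo, message, tree.sha, parent_sha)

  # 4. Move the branch to the new commit (no force push).
  client.update_ref(repo, ref, commit.sha, false)

  commit.sha
end
```

Because every file ends up in a single commit, the repository sees one push rather than one per file.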

Create CSVw schemas for FSA POC

Abattoir sites
Assessment code
Full data release

Make embeddable view

Make embeddable view of data that can be easily added to common publishing platforms like WordPress.

Minimum requirement would be:

data table
licence
links to one or more of:
- download csv
- GitHub repository
- open data certificate

Make Publisher mandatory

Looking at the example dataset from the Comma-Chameleon home page, it does not have a Publisher set. When I try to earn an Open Data Certificate for this dataset it just misses out on earning a Bronze certificate because this value is missing.

Consider making Publisher a mandatory value.

Publicly accessible list of all datasets

A publicly accessible list of all datasets added using the tool, with JSON and Atom views

Regenerating files

As mentioned in #116 - I've made some changes to the datapackage to be more in line with the spec. This will fix any newly generated datasets, but not any that were created previously. This raises an interesting question - should we be using Octokit to just generate the files and let the user deal with the repo after that, or should we be regenerating things like the index and the datapackage when we make changes?

For example, there's a few datasets that use an old template, and there are others that are flat out broken. Do we be helpful and regenerate them, or do we just leave them and assume the user will fix?

Proxy CSVs via RawGit

https://rawgit.com/

This will mean CSVs get served with the correct content-type headers

Scheduled job to refresh a user's data repos

More user research

Validate files on upload

Uploading files that aren't CSVs will cause the Github page build to fail, so we need to add some validation to make sure the CSVs are kosher before sending the dataset to Github.
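
A very rough idea of what such a pre-upload check might look like, using Ruby's standard CSV library. plausible_csv? is a hypothetical helper, not Octopub's actual validation, and it is far weaker than a full CSVlint run: it only checks that the content parses, is not empty, and has the same number of columns in every row.

```ruby
require 'csv'

# Hypothetical pre-upload check: returns true only if the content parses as
# CSV, contains at least one row, and every row has the same column count.
def plausible_csv?(content)
  rows = CSV.parse(content, skip_blanks: true)
  return false if rows.empty?

  rows.map(&:length).uniq.size == 1
rescue CSV::MalformedCSVError, ArgumentError
  # Unparseable input (e.g. an unclosed quote or invalid bytes) is rejected.
  false
end
```

A check like this could run before anything is pushed to GitHub, so that obviously broken files never reach the page build.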

Improve metadata for certification

Brand

Find a name and get a domain

Support CSVw schemas

including links to other CSVs (published by octopub)

[feature]: use a repo also as container of more than one dataset

Hi,
your great git-data-publisher creates a repo for every uploaded dataset.

It would be great to also be able to use a repo as a general container (e.g. my city's open data repo), with one folder for each dataset (the city bike sharing parks, the city museums, the city bus stops, etc.).

That way we would have something close to a good data portal.

Thank you

Integration with certificates.theodi.org

Auto generate a certificate for the dataset using the http://certificates.theodi.org API

Use Github issues template

Create an issues template for data published via octopub/comma-chameleon.

Template could include items expected by open data certificates / open data maturity model. E.g.

missing/incorrect documentation
- code list
- data quality statement / provenance / or other metadata
- no attribution guidance
licence not open
data not compliant with standard (e.g. 360Giving)
privacy / sensitive data breach
crowd-source data contribution/correction
etc.

Please add a license to this repository!

Please help other people know how they can use your work: http://choosealicense.com/

Add publish support to comma chameleon using API

Integration with CSVlint

Validate newly added CSV files using http://csvlint.io

Allow users to edit datasets

If a user wants to edit a dataset, they have to go into Github manually and edit, which probably isn't very user friendly. We should allow users to edit their datasets and push the results back up to Github.

license lookup failing

Using odlifier, it's failing to find licenses for some reason. I suspect the license name changes from a while ago, but I don't know why it's only just happened. For some reason it's causing the build to suddenly fail, but even if I roll back a few versions, it still fails, so... I dunno.

Allow user to sort data in each column

Help users find the answer to their question in the data preview more easily and avoid having to download the data.

Leverage Jekyll for final output

Awesome stuff! 🤘

Reading the announcement post, one thing struck me: the project seems to be reinventing the wheel a bit, and could knock out some open issues (e.g., #6 and #4) by leveraging Jekyll's built-in data capabilities.

It also generates an HTML representation of the dataset (with DCAT metadata embedded inside), which is accessed via GitHub pages

I don't know the full context, but if I were to publish data via GitHub, keeping the same exact end-user-facing result, here's how I'd do it:

Store dataset metadata in _config.yml. This way, project metadata is human readable without additional software, easily editable by non-developers, and is more easily diffed if changes are proposed via pull request.
Have Jekyll generate the existing datapackage.json file via a liquid template, pulling in the human-readable information from _config.yml. Again, the output will be identical to an end user.
Store each dataset in the _data folder as a .csv file. Jekyll will automatically read each file in, and expose it via the site.data namespace.
Have Jekyll generate the index.html file via a liquid template (pulling from the _data files). Again, the final output would remain identical. The big advantage here, is as the data changes (e.g., collaboratively via pull request and issues) the output is automatically kept up to date, without the need to re-run the original software.
Expose JSON representations of each file (as simple as {{ site.data.[FILENAME] | jsonify }})

For non-GitHub use cases, you could even package jekyll within the app and simply run jekyll build as an option before publishing (again, resulting in identical output as a fallback).

TL;DR: Jekyll will allow you to save human-readable source data, not machine-readable rendered output, and can empower publishers and consumers to adopt a more open-source workflow.

Re-assuring statements regarding Octopub GitHub permissions

In signing up for Octopub, this stopped me in my tracks:

Can you make some reassuring statements about what Octopub does and why it needs all these permissions?

Allow editing of datasets

Allow editing of datasets within the Git Data Publisher interface, so users can make changes without using GitHub

Octopub claims my CSV is not valid

CSVlint says that the file is valid, Octopub says it's not. Valid data: http://csvlint.io/validation/575acdd463737604c50001b7

Make sure Octopub doesn't break customised Jekyll sites

Allow users to add a file to a dataset via the API

Allow users to specify a schema

Allow specification of a schema on a per-file basis and also validate against this

Support organisations

Allow a user to push to an organisation, if they want to

Stop using Github Pages

Github is great, but using Github pages and Jekyll gives us a point of failure that we have no control over (see #42). Page creation is also not always instant, so could cause user confusion. We could get around this by publishing the pages on our own site (at the url git-data-publisher.herokuapp.com/user/repo), and use Github as the backend file / history store.