theodi / octopub

Publish data easily, quickly and correctly

Home Page: https://octopub.io/

License: Other

Ruby 71.08% JavaScript 6.99% CSS 0.11% HTML 19.06% SCSS 2.76%
data-publication ruby pusher rails ruby-on-rails

octopub's Introduction

Octopub

Octopub is a Ruby on Rails application that provides a simple and frictionless way for users to publish data easily, quickly and correctly on GitHub.

Summary of features

More information is in the announcement blog post

The live instance of Octopub is running at http://octopub.io/

Follow the public feature roadmap for Octopub

Requirements

These are the tools and services required to get Octopub fully working for development, testing and production environments. We'll explain how to set these up in the next section.

Setup

Redis/Sidekiq

Sidekiq is used for managing the background processing of data uploads. Sidekiq needs Redis, so install Redis by following the instructions here or, if you are using Homebrew, run brew install redis.

Environment variables

For development Octopub uses the dotenv gem to load environment variables. Create a file called .env in your project root and paste in the variables below. We'll fill these in as we go along.

# GitHub App Client ID & secret
GITHUB_KEY=
GITHUB_SECRET=

# OAuth access token for GitHub API access
GITHUB_TOKEN=

S3_BUCKET=

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

PUSHER_APP_ID=
PUSHER_KEY=
PUSHER_SECRET=
PUSHER_CLUSTER=

BASE_URI=
ODC_API_KEY=
ODC_USERNAME=

# production only
SMTP_USERNAME=
SMTP_PASSWORD=
SMTP_SERVER=
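
Since these variables are filled in over several steps, it is easy to leave one blank. The following is a minimal sketch of a check you could run before starting the app. The helper name missing_env_keys is hypothetical and not part of Octopub; the variable names are copied from the template above, and the production-only SMTP_* variables are deliberately left out.

```ruby
# Variable names copied from the .env template; SMTP_* is production only,
# so it is not included in this development check.
REQUIRED_ENV_KEYS = %w[
  GITHUB_KEY GITHUB_SECRET GITHUB_TOKEN
  S3_BUCKET AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
  PUSHER_APP_ID PUSHER_KEY PUSHER_SECRET PUSHER_CLUSTER
  BASE_URI ODC_API_KEY ODC_USERNAME
].freeze

# Hypothetical helper (not part of Octopub): returns the names of required
# variables that are unset or blank in the given environment hash.
def missing_env_keys(env = ENV)
  REQUIRED_ENV_KEYS.select { |key| env[key].to_s.strip.empty? }
end
```

Calling missing_env_keys with no arguments inspects the real environment, so it only sees the .env values if something such as dotenv has already loaded them.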

GitHub

Create a GitHub application:

  1. Log in to GitHub.
  2. In Settings -> Developer settings -> OAuth applications, create a new OAuth application with a unique name. You can use http://octopub.io for the homepage and for the callback URL use your local server address, i.e. http://localhost:3000. Click on your OAuth application to see your Client ID and Client Secret, and update your .env file:
GITHUB_KEY=<YOUR CLIENT ID>
GITHUB_SECRET=<YOUR CLIENT SECRET>
  3. In Settings -> Developer settings -> Personal access tokens, generate a new token with a sensible description, e.g. octopub_dev_token, and update your .env file:
GITHUB_TOKEN=<Your token>

AWS

Create an S3 bucket:

  1. In AWS go to the S3 service and create a bucket with a sensible name. Make sure the region is set to EU (Ireland) since Octopub uses this.
  2. Click on your bucket and go to the Permissions tab. Click on CORS Configuration and paste in the configuration below. This will allow your local development version of Octopub to make requests to your S3 bucket.
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>http://localhost:3000</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>

Grant permissions to your bucket:

  1. In AWS go to the IAM (Identity and Access Management) service.
  2. Click Users.
  3. Add a new user and give it a name, e.g. octopub-development, and for Access Type select Programmatic Access.
  4. For permissions, select Attach existing policies directly - this will open a new tab in your browser.
  5. Click create your own policy and give it a name, e.g. octopub-dev-permissions. Then for the policy document, use the following template, but add your bucket name in place of <YOUR BUCKET NAME>.
{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "AllowAdminAccessToBucketOnly",
           "Action": [
               "s3:*"
           ],
           "Effect": "Allow",
           "Resource": [
               "arn:aws:s3:::<YOUR BUCKET NAME>",
               "arn:aws:s3:::<YOUR BUCKET NAME>/*"
           ]
       }
   ]
}
  6. Click validate policy just to be sure you've not made a typo, then confirm.
  7. Back on the Set permissions page, select the policy you've just created in the table by selecting the checkbox, then click Review and then click Create user.
  8. Download the CSV file containing your Access key ID and Secret access key and update your .env file:
AWS_ACCESS_KEY_ID=<YOUR ACCESS KEY ID>
AWS_SECRET_ACCESS_KEY=<YOUR SECRET ACCESS KEY>
S3_BUCKET=<YOUR BUCKET NAME>
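
If you would rather not edit the policy template by hand, the substitution can be scripted. This is a sketch only, using Ruby's standard json library. The helper names bucket_policy and bucket_policy_json are hypothetical; the policy structure is the one shown in the template above.

```ruby
require 'json'

# Hypothetical helper: builds the policy document from the template above,
# with the bucket name substituted into both resource ARNs.
def bucket_policy(bucket_name)
  raise ArgumentError, 'bucket name must not be empty' if bucket_name.to_s.strip.empty?

  {
    'Version' => '2012-10-17',
    'Statement' => [
      {
        'Sid' => 'AllowAdminAccessToBucketOnly',
        'Action' => ['s3:*'],
        'Effect' => 'Allow',
        'Resource' => [
          "arn:aws:s3:::#{bucket_name}",   # the bucket itself
          "arn:aws:s3:::#{bucket_name}/*"  # every object inside it
        ]
      }
    ]
  }
end

# Returns the policy as formatted JSON, ready to paste into the IAM console.
def bucket_policy_json(bucket_name)
  JSON.pretty_generate(bucket_policy(bucket_name))
end
```

For example, puts bucket_policy_json('my-octopub-dev') prints a document you can paste into the policy editor, where my-octopub-dev is a placeholder bucket name.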

Pusher

  1. Log in to https://pusher.com or create a free account.
  2. Create a new application and call it something sensible.
  3. Select the App Keys tab and use the relevant values there to update your .env file:
PUSHER_APP_ID=
PUSHER_KEY=
PUSHER_SECRET=
PUSHER_CLUSTER=

ODC (open data certificate) setup

  1. Log in to https://certificates.theodi.org/ or create a free account.
  2. Go to your profile page, copy your API token and update your .env file:
ODC_API_KEY=<API TOKEN>
ODC_USERNAME=<YOUR USERNAME (email address you used when signing up)>

Running the full application locally

Assuming you have completed the setup instructions above...

  • Start Redis with redis-server.
  • Start Sidekiq with bundle exec sidekiq in the application directory.
  • Create the PostgreSQL databases specified in config/database.yml and run rails db:migrate.
  • Start Octopub with rails s in the application directory.
  • Navigate to the home page.
  • Sign in to Octopub with your GitHub account.

Congratulations, you should be signed in! Now try adding some data.

Checking the Sidekiq queue

Start a rails console session and then...

require 'sidekiq/api'
Sidekiq::Queue.new.size
Sidekiq::Queue.new.first

Tests

Octopub uses the RSpec test framework and requires a .env file to be present. See the Environment variables section above for details; you can reuse your development variables.*

The test suite can be run with bundle exec rspec.

* Note: the tests use VCR or mocking, so they can be run offline without contacting the external services.

Deployment

A commit to master will trigger a Travis CI run; if successful, it will automatically deploy to Heroku.

Caching

The GitHub organisations are cached for the logged-in user. They can be cleared from a console with Rails.cache.clear, which empties the whole cache.

octopub's People

Contributors

andylolz, caiwilliamson, davetaz, dependabot-support, dependabot[bot], floppy, jamesjefferies, langphil, odi-robot, olivierthereaux, pezholio, quadrophobiac, rachelwilson

octopub's Issues

Data Package resource has both "path" and "url" but can only have one

Hello :). I noticed an issue with my Data Package:

...
 "resources": [
    {
      "url": "http://danfowler.github.io/my-first-dataset/data/data.csv",
      "name": "Periodic Table",
      "mediatype": "text/csv",
      "description": "Periodic Table of the Elements",
      "path": "data/data.csv",
      "schema": {
        "fields": [
...

There's been a recent change in the Data Package spec: a resource can only have ONE of path, url, or data. Not to worry, though:

The path attribute may also be used for Data Packages located online – in this case it determines the resource file’s URL relative to the datapackage.json’s URL.
http://dataprotocols.org/data-packages/#required-fields-1

@rgrp @pwalsh
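
The rule described above (exactly one of path, url or data per resource) is simple to check mechanically. The following is a rough sketch, not Octopub's actual code; the helper names location_keys and single_location? are hypothetical, and the resource is assumed to be a Hash with string keys, as produced by JSON.parse.

```ruby
# The three mutually exclusive ways a resource can point at its data,
# as named in the issue above.
LOCATION_KEYS = %w[path url data].freeze

# Hypothetical helper: lists which location keys a resource hash declares.
def location_keys(resource)
  LOCATION_KEYS.select { |key| resource.key?(key) }
end

# True only when the resource declares exactly one of path, url or data.
def single_location?(resource)
  location_keys(resource).size == 1
end
```

Run against the resource quoted in this issue, location_keys would report both path and url, which is the conflict being described.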

Allow users to alter header / footer and logo

At the moment, the webpages are very heavily ODI branded. It'd be nice to allow users to upload their own header / footer, or just specify a logo, so it can tie into their branding.

Github Jekyll build sometimes fails randomly

I'm not sure why, and I don't get an email like I'd expect, so might have to handle the Jekyll page build server side before pushing to Github, then add a .nojekyll file to the repo

Organisation support

Allow a user to optionally add a repo to an organisation that they belong to, rather than their own personal account.

Similarly, allow users to edit datasets in organisations they belong to.

Generated data from CSV causes "Structural problem: Assumed header" validation of CSVLint

The data package generated using octopub: http://leowmjw.github.io/selangor_adun_2016/data/573ef1d66373766db400013b.html

causes it to fail the "Structural problem: Assumed header" validation: http://csvlint.io/validation/573ef4906373766db400013c

as per described here: Data-Liberation-Front/csvlint.io#176

Solution:

...
Content-Type: text/csv;header=present if your file has a header row or Content-Type: text/csv;header=absent if it does not
...

Submission form wipes all responses

When trying to fill in the metadata fields, if Octopub doesn't like something you have entered - it clears the whole form and you have to start again. Time consuming and frustrating. It also doesn't tell you what the problem was.

Change how we add files

After chatting to Github support, it seems that the issue described in #42 is caused by each file push triggering a build, so things get confused. The correct solution to creating files is therefore to use the Git Data API and:

  • Create git blobs for each file (using the Git Blobs API)
  • Create a new tree (using the Git Trees API)
  • Create a new commit which points to that tree and the previous commit on the branch (using the Git Commits API)
  • update the branch to point to the new commit (using the Git Refs API)

Also a sensible failsafe is to make sure a build isn't happening first using the pages API
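
The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: push_files_in_one_commit is a hypothetical name, client is assumed to be an Octokit::Client (or any object answering the same calls), and files is assumed to be a Hash of repository path to file content. The pages-build check mentioned above is not covered here.

```ruby
# Hypothetical helper: pushes several files as a single commit so that
# only one Pages build is triggered. `client` is assumed to behave like
# Octokit::Client; `files` maps repository paths to file contents.
def push_files_in_one_commit(client, repo, branch, files, message)
  ref = "heads/#{branch}"

  # Find the commit the branch currently points to, and that commit's tree.
  parent_sha = client.ref(repo, ref).object.sha
  base_tree_sha = client.commit(repo, parent_sha).commit.tree.sha

  # 1. Create a git blob for each file.
  tree_entries = files.map do |path, content|
    blob_sha = client.create_blob(repo, content, 'utf-8')
    { path: path, mode: '100644', type: 'blob', sha: blob_sha }
  end

  # 2. Create a new tree layered on top of the existing one.
  tree = client.create_tree(repo, tree_entries, base_tree: base_tree_sha)

  # 3. Create a commit pointing at the new tree and the previous commit.
  commit = client.create_commit(repo, message, tree.sha, [parent_sha])

  # 4. Move the branch to the new commit (no force push).
  client.update_ref(repo, ref, commit.sha, false)

  commit.sha
end
```

Passing the client in as an argument keeps the sketch free of credentials and makes it possible to exercise the call sequence with a stub in tests.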

Make embeddable view

Make embeddable view of data that can be easily added to common publishing platforms like WordPress.

Minimum requirement would be:

  • data table
  • licence
  • links to one or more of:
    • download csv
    • GitHub repository
    • open data certificate

Regenerating files

As mentioned in #116 - I've made some changes to the datapackage to be more in line with the spec. This will fix any newly generated datasets, but not any that were created previously. This raises an interesting question - should we be using Octokit to just generate the files and let the user deal with the repo after that, or should we be regenerating things like the index and the datapackage when we make changes?

For example, there's a few datasets that use an old template, and there are others that are flat out broken. Do we be helpful and regenerate them, or do we just leave them and assume the user will fix?

Validate files on upload

Uploading files that aren't CSVs will cause the Github page build to fail, so we need to add some validation to make sure the CSVs are kosher before sending the dataset to Github.
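
As a rough idea of what such a check could look like, here is a minimal sketch using only Ruby's standard csv library. It is an assumption-laden illustration rather than the validation Octopub ended up with: the helper name csv_problems is hypothetical, and the only checks are that the text parses as CSV and that every row has the same number of columns as the first.

```ruby
require 'csv'

# Hypothetical helper: returns a list of human-readable problems found in
# the given text, or an empty list when it looks like well-formed CSV.
def csv_problems(text)
  rows = CSV.parse(text)
  return ['file is empty'] if rows.empty?

  # Treat the first row as the reference width and flag rows that differ.
  width = rows.first.size
  ragged = rows.each_index.reject { |i| rows[i].size == width }
  ragged.map { |i| "row #{i + 1} has #{rows[i].size} columns, expected #{width}" }
rescue CSV::MalformedCSVError => e
  ["not parseable as CSV: #{e.message}"]
end
```

Returning the list of problems, rather than a bare true or false, would also make it possible to tell the user what went wrong before anything is sent to GitHub.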

Brand

Find a name and get a domain

[feature]: use a repo also as container of more than one dataset

Hi,
your great git-data-publisher creates a repo for every uploaded dataset.

It would be great to have also the possibility to use a repo as general container (my city open data repo) and to have one folder for every dataset (the city bike sharing parks, the city museums, the city bus stops, etc.).

In this way we will have in some way a good data portal.

Thank you

Use Github issues template

Create an issues template for data published via octopub/comma-chameleon.

Template could include items expected by open data certificates / open data maturity model. E.g.

  • missing/incorrect documentation
    • code list
    • data quality statement / provenance / or other metadata
    • no attribution guidance
  • licence not open
  • data not compliant with standard (e.g. 360Giving)
  • privacy / sensitive data breach
  • crowd-source data contribution/correction
  • etc.

Allow users to edit datasets

If a user wants to edit a dataset, they have to go into Github manually and edit, which probably isn't very user friendly. We should allow users to edit their datasets and push the results back up to Github.

license lookup failing

Using odlifier, it's failing to find licenses for some reason. I suspect the license name changes from a while ago, but I don't know why it's only just happened. For some reason it's causing the build to suddenly fail, but even if I roll back a few versions, it still fails, so... I dunno.

Leverage Jekyll for final output

Awesome stuff! 🤘

Reading the announcement post, one thing that struck me, the project seems to be reinventing the wheel a bit, and can knock out some open issues (e.g., #6 and #4), by leveraging Jekyll's built in data capabilities.

It also generates an HTML representation of the dataset (with DCAT metadata embedded inside), which is accessed via GitHub pages

I don't know the full context, but if I were to publish data via GitHub, keeping the same exact end-user-facing result, here's how I'd do it:

  1. Store dataset metadata in _config.yml. This way, project metadata is human readable without additional software, easily editable by non-developers, and is more easily diffed if changes are proposed via pull request.
  2. Have Jekyll generate the existing datapackage.json file via a liquid template, pulling in the human-readable information from _config.yml. Again, the output will be identical to an end user.
  3. Store each dataset in the _data folder as a .csv file. Jekyll will automatically read each file in, and expose it via the site.data namespace.
  4. Have Jekyll generate the index.html file via a liquid template (pulling from the _data files). Again, the final output would remain identical. The big advantage here, is as the data changes (e.g., collaboratively via pull request and issues) the output is automatically kept up to date, without the need to re-run the original software.
  5. Expose JSON representations of each file (as simple as {{ site.data.[FILENAME] | jsonify }})

For non-GitHub use cases, you could even package jekyll within the app and simply run jekyll build as an option before publishing (again, resulting in identical output as a fallback).

TL;DR: Jekyll will allow you to save human-readable source data, not machine-readable rendered output, and can empower publishers and consumers to adopt a more open-source workflow.

Allow editing of datasets

Allow editing of datasets within the Git Data Publisher interface, so users can make changes without using GitHub

Stop using Github Pages

Github is great, but using Github pages and Jekyll gives us a point of failure that we have no control over (see #42). Page creation is also not always instant, so could cause user confusion. We could get around this by publishing the pages on our own site (at the url git-data-publisher.herokuapp.com/user/repo), and use Github as the backend file / history store.
