Coder Social home page Coder Social logo

neven7 / data-quality-dashboard Goto Github PK

View Code? Open in Web Editor NEW

This project forked from frictionlessdata/data-quality-dashboard

0.0 2.0 0.0 421 KB

Data Quality Dashboards display statistics on a collection of published data.

License: Other

JavaScript 52.71% CSS 47.29%

data-quality-dashboard's Introduction

Data Quality Dashboard

Data Quality Dashboard provides access to, and displays statistics on, a collection of published data. This collection of data is logically related: for example, data published by a single government department, or a group of departments.

The Data Quality Dashboard has been developed in order to display data quality information on the 25K spend data published by the UK Government on data.gov.uk. You can see and interact with this instance of Data Quality Dashboard here. It is powered by a static database generated with Data Quality CLI that you can find here.

The Dashboard can be used for any published collection of data by following a few key steps.

Local development

# Get the code
git clone https://github.com/okfn/data-quality-dashboard.git

# Install the dependencies
npm install

# Just build the sources
npm run build

# Just run the server
npm run start

# View the app in your browser
open http://localhost:3000/

See the scripts section in package.json for more available commands.

Read on for details.

Application

The Data Quality Dashboard is a Node.js application written in ES6, largely using Express and React.

The app.backend module renders the basic views (using React on the server) and is responsible for preparing the data as JSON by parsing the CSV database. It also provides some simple routes for standard pages like FAQ and About.

The app.ui module is a React-Redux application for displaying the data to the user.

The codebase is written in Node.js-style CommonJS, using ES6 syntax. The app.ui code is bundled by (Webpack)[http://webpack.github.io/], and app.backend is transformed using Babel at runtime.

Remote deployment

We push to Heroku, and a postinstall script ensures that app.ui is bundled before the app is served. Make sure you set NPM_CONFIG_PRODUCTION=false to include devDependencies on Heroku.

Data

The Data Quality Dashboard reads data from a flat file storage, with data written to CSV and JSON. Any publicly available file storage will do, as long as the file naming and data structure of the files is consistent.

Currently, we run the database for the UK Spend Publishing Dashboard from a public repository on GitHub. This gives easy access to the files, and enables a version history of the database.

As GitHub does not support CORS, we then use a proxy that does - RawGit.

When the application loads, it reads the data from the database, parses the content to JSON, and stores the new data representation as JSON. This JSON representation is accessible via an API endpoint that the frontend app uses.

To configure the database, the application needs to know the base path as a URL.

For example:

  • https://rawgit.com/okfn/data-quality-uk-25k-spend/master/data

By default, the application expects to find at that base the following files:

  • instance.json: Basic metadata for the instance
  • sources.csv: The list of data sources that are assessed for quality
  • publishers.csv: The list of publishers that produce these datasources
  • results.csv: The results as found by SPD-Admin
  • performance.csv: The performance as found by SPD-Admin
  • runs.csv: A log of the results run against these resources

Of course, each of these files must conform to a certain datastructure - think of them as tables in a database. As long as you conform to the structure and expected data within that structure, it does not matter how the database is actually produced.

For how to change the database see the Configure database section.

Schema

The Data Quality Dashboard expects the following schema.

instance.json

A single object with the following fields:

  • name: The name of this dashboard
  • admin: The email address of the administrator of this dashboard
  • validator_url: The URL to a GoodTables API endpoint (eg: https://goodtables.okfnlabs.org/api/run)
  • last_modified: Time when the data was last modified. Should be updated before each database deploy.

sources.csv

A CSV with the following columns:

  • id: A unique identifier for this data source.
  • publisher_id: The unique identifier of the publisher this data source belongs to.
  • title: A title for this data source.
  • data: The permalink URL for this data source.
  • format: The file format for this data source.
  • last_modified: The timestamp that indicates when this data source was last modified.
  • period_id: The publication period of the data source.
  • schema: The permalink URL for the schema that this data source should be validated against (if any).

publishers.csv

A CSV with the following columns:

  • id: A unique identifier for this publisher.
  • title: A proper title for this publisher.
  • type: A signifying type for this publisher.
  • homepage: The homepage of this publisher as a URL.
  • contact: The contact person for this publisher.
  • email: The contact email for this publisher.
  • parent_id: The parent publisher for this publisher (nested publishers).

results.csv

A CSV with the following columns:

  • id: A unique identifier for this result.
  • source_id: The identifier for the data source in this result.
  • publisher_id: The identifier for the publisher in this result.
  • period_id: The publication period of this result's data source.
  • score: The score for this result.
  • data: The permalink URL for this result's data source.
  • schema: The permalink URL for this result's data source schema (if any).
  • summary: A summary of this result.
  • run_id: The identifier of the run in which this result was generated.
  • timestamp: The timestamp for this result.
  • report: The base URL to a more detailed report

performance.csv

A CSV with the following columns:

  • publisher_id: The identifier for the publisher.
  • period_id: The time span for the analysis.
  • files_count: Number of files published during the above mentioned time span.
  • score: Score for the above mentioned files.
  • valid: How many of the above mentioned files are valid.
  • files_count_to_date: Total number of files published up to this period.
  • score_to_date: Score of all the files published up to this period.
  • valid_to_date: Number of valid files from all published up to this period.

runs.csv

A CSV with the following columns:

  • id
  • timestamp
  • total_score

Configure database

The database can be configured through the following environment variables:

  • DATABASE_LOCATION: Base URL for the files.
  • PUBLISHER_TABLE: Name of the file containing the publishers (relative to the DATABASE_LOCATION).

Following this pattern, you can also configure SOURCE_TABLE, RUN_TABLE, PERFORMANCE_TABLE and INSTANCE_TABLE.

Tooling

In order to generate the result set for a Data Quality Dashboard, we build a command line utility that is designed to be run by a developer at regular intervals (as relevant for the data being assessed). This tool, Data Quality CLI is configurable to use in assessing data quality based on metrics of:

  • Timeliness
  • Structural Validity
  • Schema Validity

Note that, like the Data Quality Dashboard itself, the CLI has currently only been tested on the UK 25K spend data.

data-quality-dashboard's People

Contributors

danfowler avatar georgiana-b avatar hdurand avatar pwalsh avatar rufuspollock avatar stevage avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.