
Project Open Data Dashboard

Home Page: http://labs.data.gov/dashboard/

License: Other



ARCHIVE NOTICE

This is a repository for a deprecated dashboard formerly maintained by the Data.gov team at GSA. The dashboard, formerly called the “Project Open Data Dashboard,” originated in 2015 with the initial implementation of the Federal Open Data Policy, which required federal agencies to maintain comprehensive metadata inventories to be harvested by the federal data catalog at Data.gov. The dashboard provided information on how agencies were complying with the new policy by crawling agency harvesting locations and displaying metrics about the number and frequency of datasets added. The Office of Management and Budget (OMB) expanded the information in the dashboard by reviewing agency performance on several factors on a quarterly basis.

Over time, usage and involvement with the dashboard decreased, and there were technical shortcomings with the dashboard that were not addressed due to higher priorities for the limited resources available to the Data.gov team. In 2023, the Data.gov team, in consultation with OMB, launched a new report for agency dataset publication on the Data.gov catalog and also an alternate version of a validator previously available on the old dashboard. The report and validator provide basic functions, and the Data.gov team intends to build additional capabilities into these new tools.

Project Open Data Dashboard

CircleCI

The Project Open Data Dashboard provides a variety of tools and capabilities to help manage the implementation of Project Open Data. It is primarily used by Federal agencies, but it also provides tools and resources for other entities such as state and local governments.

The primary place for the user-facing documentation is https://labs.data.gov/dashboard/docs

Features

  • Dashboard overview of the status of each federal agency's implementation of Project Open Data for each milestone.
  • Permissioned content editing for the fields in the dashboard that can't be automated. The fields are stored as JSON objects, so the data model is very flexible and can be customized without database changes. User accounts are handled via GitHub.
  • Automated crawls for each agency to report metrics from Project Open Data assets (data.json, digitalstrategy.json, /data page, etc). This includes reporting on the number of data sets and validation against the Project Open Data metadata schema.
  • A validator to validate Project Open Data data.json files via URL, file upload, or text input. This can be used for testing both data.json Public Data Listing files as well as the Enterprise Data Inventory JSON. The validator can be used both by Federal agencies as well as non-federal entities by specifying the Non-Federal schema.
  • Converters to export existing data from data.gov
  • Changeset viewer to see how a data.json file for an agency compares to metadata currently hosted on data.gov

CLI Interface

In addition to the web interface, there's also a command line interface to manage the crawls of data.json, digitalstrategy.json, and /data pages. This is helpful for running specific updates, but its primary use is via a cron job.

From the root of the application, you can update the status of agencies using a few different options on the campaign controller. The syntax is:

$ php public/index.php campaign status [id] [component]

If you wanted to update all components (data.json, digitalstrategy.json, /data) for all agencies, you'd run this command:

$ php public/index.php campaign status all all

If you just wanted to update the data.json status for CFO Act agencies you'd run:

$ php public/index.php campaign status cfo-act datajson

If you just wanted to update the data.json status for agencies being monitored by the OMB you'd run:

$ php public/index.php campaign status omb-monitored datajson

If you just wanted to update the digitalstrategy.json status for the Department of Agriculture you'd run:

$ php public/index.php campaign status 49015 digitalstrategy

There are agencies whose crawls take a long time to complete. These are identified with the id of long-running. You can find a current list of these in this db migration. To initiate a full-scan for these agencies, you'd run:

$ php public/index.php campaign status long-running full-scan

The options for [id] are: all, cfo-act, omb-monitored, long-running, or the ID provided by the USA.gov Federal Agency Directory API.

The options for [component] are: all, datajson, datapage, digitalstrategy, download, full-scan.

  • The datajson component captures the basic characteristics of a request to an agency's data.json file (like whether it returns an HTTP 200) and then attempts to parse the file, validate against the schema, and provide other reporting metrics like the number of datasets listed.
  • The digitalstrategy component captures the basic characteristics of a request to an agency's digitalstrategy.json file (like whether it returns an HTTP 200).
  • The datapage component captures the basic characteristics of a request to an agency's /data page (like whether it returns an HTTP 200).
  • The download component downloads an archive of the data.json and digitalstrategy.json files.
  • The full-scan component performs further validation based on the content of the response.
  • As you'd expect, all does all of these things at once.
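Since the CLI's primary use is scheduled crawls, a crontab sketch may help; the schedule and the application path /var/www/dashboard are assumptions, not values from this repository:

```shell
# crontab fragment (sketch): nightly full refresh of all agencies,
# plus a weekly full-scan of the slow "long-running" agencies.
# Assumed install path: /var/www/dashboard
0 2 * * *  cd /var/www/dashboard && php public/index.php campaign status all all
0 4 * * 0  cd /var/www/dashboard && php public/index.php campaign status long-running full-scan
```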

Development

This is a CodeIgniter PHP application. We use Docker and Docker Compose for local development, and cloud.gov for testing and production (pending migration from BSP).

Prerequisites:

By default, the environment is set to production so that error messages are not displayed. To display them while developing, edit your .env file to set CI_ENV to anything other than production. See index.php for more details.
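For example, a minimal .env for local development might contain (CI_ENV is the variable the text above refers to; the value shown is an assumption):

```shell
# .env (sketch) — any value other than "production" enables error display
CI_ENV=development
```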

Setup

Install application dependencies

make install-dev-dependencies

Start up the application and database

make up

Run tests

make test

Open your browser to localhost:8000.

Restoring database dumps

If you need a database dump, you can create one by following the instructions in the Runbook. Clean up the dump by removing any USE or CREATE DATABASE statements. Then:

cat cleaned_database.sql |
  docker-compose run --rm database mysql \
  --host=database --user=root --password=mysql dashboard
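The cleanup step can be scripted; a minimal sketch using sed, where dump.sql is a stand-in name for your raw dump (the sample dump contents here are purely for illustration):

```shell
# Sketch: drop USE and CREATE DATABASE statements from a raw dump
# before restoring it into the dev database.
printf 'CREATE DATABASE dashboard;\nUSE dashboard;\nCREATE TABLE t (id INT);\n' > dump.sql  # sample dump for illustration
sed -E '/^(USE|CREATE DATABASE)[[:space:]]/d' dump.sql > cleaned_database.sql
cat cleaned_database.sql
```

The pattern also matches mysqldump's backquoted forms (e.g. USE `dashboard`;), since it only anchors on the statement keyword at the start of the line.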

After a database restore, test by viewing a USDA detail page:

curl http://localhost:8000/offices/detail/49015/2017-08-31

Making database schema changes

To update the schema

Add a new numbered migration class, then change the configured version number to match. To perform the migration, CodeIgniter will automatically run up() migration methods until the schema version in the database matches the configured version.

If you want to invoke the migration explicitly to test that it's working, you can run php public/index.php migrate. Otherwise, expect the migration to be invoked automatically before CodeIgniter handles any other requests.

To revert the schema

Change the configured version number to match the schema version you want to revert to. CodeIgniter will automatically run down() migration methods until the schema version in the database matches the configured version.

You can invoke the reversion as described for updates above.

Migration requirements

The dashboard uses MySQL for the backend database. MySQL doesn't support transactions around schema-altering statements. If any problems are encountered during a migration, the app is likely to wind up in a confused state where schema-altering statements have been applied, but the version of the schema in the database remains at the previous version. The migration will be attempted over and over again, often exhibiting user-visible errors or other bad behavior until manual intervention happens.

To avoid this, we need to be careful to write migrations that are both idempotent and reversible. (That is, we should be able to run them again without generating errors, and we should be able to downgrade to previous schema versions automatically.)

This requires some care because there's no guaranteed way to make it happen. Whenever we do a PR review that includes a schema change, the answer should be "yes" to all of these questions:

  • Does each of the schema-altering statements happen in its own migration?
  • Does the down() method exist on the migration, and does it undo any schema-changing action performed in the up() method?
  • Does every CREATE TABLE statement use IF NOT EXISTS?
  • Does every DROP TABLE statement use IF EXISTS?
  • Does every ADD/CHANGE/ALTER COLUMN happen via a call to the idempotent add_column_if_not_exists helper?
  • Does every DROP COLUMN happen via a call to the idempotent drop_column_if_exists helper?
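A quick way to spot-check the CREATE TABLE and DROP TABLE questions during review is to grep the migrations for non-idempotent statements. A sketch, assuming the conventional CodeIgniter application/migrations/ path:

```shell
# Flag CREATE TABLE statements missing IF NOT EXISTS, and DROP TABLE
# statements missing IF EXISTS (prints nothing when all are idempotent).
grep -rn 'CREATE TABLE' application/migrations/ 2>/dev/null | grep -v 'IF NOT EXISTS' || true
grep -rn 'DROP TABLE' application/migrations/ 2>/dev/null | grep -v 'IF EXISTS' || true
```

This won't catch ADD/CHANGE/ALTER COLUMN statements that bypass the idempotent helpers, so it supplements review rather than replaces it.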

CircleCI testing

All pushes to GitHub are integration tested with our CircleCI tests.

Updating composer dependencies

Edit version constraints in composer.json.

make update-dependencies

Commit the updated composer.json and composer.lock.

Deploying to cloud.gov

Quickstart with an empty database

Copy the vars.yml.template file and rename it to vars.yml. Edit any values following the comments in the file.

If you are not logged in to the Cloud Foundry CLI, follow the steps in this guide.

Assuming you're logged in, run the following commands, replacing ${app_name} with the value from your vars.yml file.

$ cf create-service aws-rds small-mysql-redundant ${app_name}-db

$ cf create-service s3 basic-public ${app_name}-s3

$ cf create-user-provided-service ${app_name}-secrets -p '{
  "ENCRYPTION_KEY": "long-random-string"
}'

$ cf set-env ${app_name} NEWRELIC_LICENSE license-key-obtained-from-newrelic-account

$ cf push --vars-file vars.yml
Waiting for app to start...

name:              app
requested state:   started
routes:            app-boring-sable.app.cloud.gov
last uploaded:     Wed 28 Aug 10:02:06 EDT 2019
stack:             cflinuxfs3
buildpacks:        php_buildpack

type:            web
instances:       1/1
memory usage:    256M
start command:   $HOME/.bp/bin/start
     state     since                  cpu    memory          disk             details
#0   running   2019-08-28T14:02:25Z   0.3%   24.3M of 256M   301.7M of 512M

You should be able to visit https://<ROUTE>/offices/qa, where <ROUTE> is the route reported by cf push.
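A minimal post-deploy smoke check, using the example route from the cf push output above (substitute your own route):

```shell
# Build the QA URL from the deployed route; ROUTE here is the example
# value reported by cf push above, not a value from your deployment.
ROUTE=app-boring-sable.app.cloud.gov
QA_URL="https://${ROUTE}/offices/qa"
echo "$QA_URL"
# Then fetch it (requires network access to the deployed app):
# curl -fsS "$QA_URL"
```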

Restoring a database backup to cloud.gov:

If you need a database dump, you can create one by following the instructions in the Runbook. Clean up the dump by removing any USE or CREATE DATABASE statements; we'll call the result cleaned_database.sql below. Then:

Install the cf-service-connect plugin; e.g., for version 1.1.0 of the plugin on a macOS system:

cf install-plugin https://github.com/18F/cf-service-connect/releases/download/1.1.0/cf-service-connect-darwin-amd64

Open up a tunnel to the database, and leave the tunnel open for the next step:

$ cf connect-to-service --no-client app database
Host: localhost
Port: NNNN
Username: randomuser
Password: randompass
Name: cgawsbrokerrandomname

In a separate terminal session, use the connection information to make a MySQL connection and restore cleaned_database.sql. When prompted, paste in the password (e.g., randompass in this example).

cat cleaned_database.sql | 
  mysql -h 127.0.0.1 -PNNNN -u randomuser -p cgawsbrokerrandomname

After a restore, you should be able to view an agency's detail page, such as: https://<ROUTE>/offices/detail/49015/2017-08-31

CI configuration

Create a GitHub environment for each application you're deploying. Each GH environment should be configured with secrets from a ci-deployer service account.

Secret name        Description
CF_SERVICE_AUTH    The service key password.
CF_SERVICE_USER    The service key username.

Known issues

The agency hierarchy is designed to be populated from the contacts API at https://www.usa.gov/api/USAGovAPI/contacts.json/contact, but that is no longer available, so the following steps no longer work:

  • Federal agencies were seeded using the USA.gov Federal Agency Directory API and the IDs provided by that resource are used as the primary IDs on this dashboard.
  • First populate the top of the agency hierarchy: $ php public/index.php import
  • Second, populate all the subagencies: $ php public/index.php import children
  • If the offices table in the database is empty, you'll also want to seed it with agency data by running the import script (/application/controllers/import.php) from the command line. You'll also need to temporarily set the import_active option in config.php to true.

Currently this tool does not handle large files in a memory-efficient way. If you are unable to allocate a large amount of memory and are at risk of timeouts, you should set the maximum file size the application will handle so that it avoids large files and fails more gracefully. The maximum size of JSON files to parse can be set with the max_remote_size option in config.php.

What about S3?

S3 is used in a few places when config[use_local_storage] is false:

  • for archiving data.json and digitalstrategy (public)

The use_local_storage setting does not affect all uses of the upload class, just the cases above.

The archive_file function honors config[use_local_storage] whenever it's called, but that logic doesn't apply when datajson_lines is set as the filetype.

Here's an outline of where S3 is used in the code:

models/Campaign_model.php:

  • archive_file which calls archive_to_s3 when use_local_storage is false
    • the validate_datajson function calls archive_file but sets filetype to datajson-lines so the archive_file function does not store it in S3, regardless of use_local_storage setting.
  • archive_to_s3 which calls put_to_s3 and stores with a PUBLIC acl
  • put_to_s3 which stores private by default
  • get_from_s3 previously used by csv_to_json; unused now

views/office_detail.php:

  • Builds a URL based on values of config/s3_bucket for displaying the "Analyze archive copies" line of Automated Metrics.

S3 changes for cloud.gov

  • There's a need for one public S3 bucket for archiving data.json files from crawls and fetching them in office_detail.php.


project-open-data-dashboard's Issues

Provide validation and analysis of accessURL links in Public Data Listing

  • identify broken accessURL links
  • identify accessURL links that point to HTML or PDFs rather than raw data
  • identify accessURL links that don't match the file format specified by format
  • identify links that point to APIs as format
  • break out proportions of file types to help characterize machine readability. A pie chart would be helpful to visualize this
  • gauge overall data freshness (e.g., percentage of datasets modified within last # months). A chart/histogram would be helpful to visualize this

This is also being tracked for data.gov with GSA/data.gov#459 for broken links and GSA/data.gov#471 and GSA/data.gov#55 for format validation

Simplify and consolidate fields in dashboard table

Current plan is to show these column headings

a. Datasets - #
b. Valid Metadata - #
c. Enterprise Data Inventory - Color based on overall progress of this milestone
d. Public Data Listing - Color based on overall progress of this milestone
e. Public Engagement - Color based on overall progress of this milestone
f. Privacy & Security - Color based on overall progress of this milestone
g. Human Capital - Color based on overall progress of this milestone

Display all agencies, not just CFO-Act agencies

Provide a view that displays all agencies and schedule the crawler to refresh them all regularly.

This is actually already in place, but some additional formatting is needed so the dashboard looks correct for agencies that don't have the OMB leading indicators data. We'll also need to update the existing cron job or create a new one to make sure all agencies are getting updated regularly.

Show bureau coverage as percentage

Under the Enterprise Data Inventory leading indicators tab, there's an item for "Bureaus Represented." This should include the total number of bureaus for the agency and be shown as a percentage.

Empty items

A number of items are totally empty unless you hover over them

headings on dashboard

From @damondavis:

On the dashboard does “Inventory” = Public Data Listing, and “Inventory Superset” = Enterprise Data Listing? If so it might be helpful to just call them those names to match with the open data policy language.

difference between the validators?

It appears that there's a difference in the validation schemes between the inventory.data.gov validator and data.civicagency.org. The USDA ERS datasets passed the former but not the latter. We had to go back and make additional changes to pass the OMB/GSA review.

Provide some background

I assume this is the homepage for the tracker?

http://labs.data.gov/dashboard/offices

If so, it's really unclear to the first time user what this is for.

Should we add text like:

"This is a public dashboard showing how Federal agencies are performing on the Open Data Policy -- formally known as OMB M-13-13"

error when generating a csv

http://datafarm.civicagency.org/datagov/csv?orgs=vba-gov is breaking for me. I get the following errors instead of the csv.


A PHP Error was encountered

Severity: Notice
Message: Undefined offset: 0
Filename: controllers/campaign.php
Line Number: 121

A PHP Error was encountered

Severity: Warning
Message: array_keys() expects parameter 1 to be array, null given
Filename: controllers/campaign.php
Line Number: 121

A PHP Error was encountered (repeated for Line Numbers 142 through 148)

Severity: Warning
Message: Cannot modify header information - headers already sent by (output started at /home/philaestheta/datafarm.civicagency.org/system/core/Exceptions.php:185)
Filename: controllers/campaign.php

A PHP Error was encountered

Severity: Warning
Message: fputcsv() expects parameter 2 to be array, null given
Filename: controllers/campaign.php
Line Number: 128

"Records"

It would be great to switch this header to "data sets" so that, at first glance, a newbie would know what it means.

Inventory superset

Concerned about the clarity of this header. Are there other ideas on phrasing?

Add "Use and Impact" leading indicators tab

To go along with the new CAP Goal requirements for "use and impact"

  • Primary uses of agency data
  • Value or impact of data
  • Primary data discovery channels
  • User suggestions on improving data usability
  • User suggestions on additional data releases

I think these would all be text fields, along with a checkmark for submission of 5 users.

Invalid JSON format issue

When using the dashboard, it states our file (www.ssa.gov/data.json) is in an invalid JSON format. However, pasting the contents of our file into JSONLint validates perfectly. Using the URL, it gives me back an error of "Invalid Character" with no further details. Using Firefox, I was able to determine the file is in UTF-8 format. I have asked our web server team to update the MIME type for JSON to "application/json; charset=utf-8".

Could anyone provide me with any direction to take next to resolve the issue?

Support large data.json files

The current approach to parsing data.json files doesn't scale with large files (e.g., over 5 MB), so we need to switch to more of a file-streaming or chunking strategy.

Show percentage of schema compliance rather than simple pass/fail

Rather than showing valid/invalid for compliance with the JSON schema, the status should show the percentage of fields across all entries that are compliant with the schema. So if every field was invalid in one entry of a data.json file with 100 entries, the status would read "99% compliant with schema"; if only one field in that entry was invalid, it would be closer to 99.9%.

Harvesting from HealthData.gov

For HHS, there's an X for "harvesting" on the dashboard. Why? Most of the content in the health domain comes from healthdata.gov, so I'm confused.

Status refresh clears out previous values if they aren't refreshed

The status refresh operation allows you to specify what you want to refresh, but since it doesn't start with the existing object, it might clear out previous values that aren't part of the refresh request. This is based on the order of the refresh stages in the function:

https://github.com/philipashlock/farm-server/blob/master/application/controllers/campaign.php#L347

The fix is to request the current object and merge all updates with it.

Make clear that boxes are clickable

It's not clear to the new user that they can click on boxes to get more information

Maybe put some text up top like: "Click on a box to get more information"

Color

Two options to help clarify the status in the tracker:

  • Add a key to explain what the color means
  • Remove the colors
