smithsonian / osprey

Dashboard that displays the file validation results in mass digitization projects. Digitization Program Office, OCIO, Smithsonian.

License: Apache License 2.0

R 14.31% Python 53.14% JavaScript 2.69% HTML 28.55% CSS 1.30% Batchfile 0.01%
digitization-workflows museum-collections python3 digitization mass-digitization

osprey's Introduction

Osprey

Osprey is a system that checks the images produced by vendors in mass digitization projects by the Collections Digitization program of the Digitization Program Office, OCIO, Smithsonian.

https://dpo.si.edu/

The system checks that the files pass a number of tests and displays the results in a web dashboard. This allows the vendor, the project manager, and the unit to monitor the progress and detect problems early.

Osprey Dashboard

This repo hosts the code for the dashboard, which presents the progress in each project and highlights any issues in the files.

File Checks

The Osprey Worker runs on Linux and updates the dashboard via an API (see below). The Worker can be configured to run one or more of these checks:

  • unique_file - Unique file name in the project
  • raw_pair - There is a raw file paired in a subfolder (e.g. tifs and raws (.eip/.iiq) subfolders)
  • jhove - The file is a valid image according to JHOVE
  • tifpages - The tif files don't contain an embedded thumbnail, or more than one image per file
  • magick - The file is a valid image according to ImageMagick
  • tif_compression - The tif file is compressed using LZW to save disk space

Other file checks can be added. Documentation to be added.
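
As an illustration of the kind of logic behind unique_file, here is a minimal sketch that groups file paths by name and reports the names that occur more than once. It is not the Worker's code: the function name find_duplicate_names is hypothetical, and comparing lowercase stems (so that the extension and letter case are ignored) is an assumption about what "unique" means in a project.

```python
from collections import defaultdict
from pathlib import PurePath


def find_duplicate_names(paths):
    """Group file paths by file name stem and return only the stems that repeat."""
    by_stem = defaultdict(list)
    for p in paths:
        # Assumption: names are compared without extension and ignoring case.
        by_stem[PurePath(p).stem.lower()].append(p)
    return {stem: found for stem, found in by_stem.items() if len(found) > 1}
```

Calling it with every path in a project returns a dict that is empty when all names are unique.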

Setup

The app runs in Python using the Flask module and requires a MySQL database. Install and populate the database according to the instructions in database/tables.sql.

To install the required environment and modules to the default location (/var/www/app):

mkdir /var/www/app
cd /var/www/app
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

Then, test the API by running the main file:

./app.py

or:

python3 app.py

which will start the service at http://localhost:5000/.

Update permissions:

deactivate
sudo chown -R apache:apache /var/www/app

Set up apache2/httpd as described in the web_server folder.

API

The application includes an API with these routes:

  • /api/: Print available routes in JSON
  • /api/files/<file_id>: Get the details of a file by its file_id
  • /api/folders/<folder_id>: Get the details of a folder and the list of files
  • /api/folders/qc/<folder_id>: Get the details of a folder and the list of files from QC
  • /api/projects/: Get the list of projects in the system
  • /api/projects/<project_alias>: Get the details of a project by specifying the project_alias
  • /api/reports/<report_id>/: Get the data from a project report
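
A minimal sketch of a client for the routes above, using only the Python standard library. It assumes the test server from the Setup section is running at http://localhost:5000/ and that the routes answer plain GET requests without authentication, which the README does not state. The helper names api_url and get_json and the alias my_project are hypothetical.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen


def api_url(base, *parts):
    """Build a URL such as http://localhost:5000/api/projects/<project_alias>."""
    path = "/".join(quote(str(p), safe="") for p in parts)
    return urljoin(base.rstrip("/") + "/", "api/" + path)


def get_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (assumes an unauthenticated GET works)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    base = "http://localhost:5000"
    print(get_json(api_url(base)))                            # list of routes
    print(get_json(api_url(base, "projects", "my_project")))  # hypothetical alias
```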

Components

The system has two related repos:

  • Osprey Worker - Python tool that runs a series of checks on folders. Results are sent to the dashboard via an HTTP API to be saved to the database.
  • Osprey Misc - Database and scripts.

License

Available under the Apache License 2.0. Consult the LICENSE file for details.

osprey's People

Contributors

villanueval

osprey's Issues

Generalize folder flags

From SO request:

  • A flag under the Folder names, that would indicate that the folder images have been sent [...] for transcription
  • Possibly a flag that would indicate that the folder had been transcribed
  • 'Ready for DAMS', etc, made generic

Add alt attributes to images

All img tags must have alt attributes. For collection images, review generic tag such as "image preview of file name XX.jpg" since the description does not exist at time of QC.

MD5 files not recognized

Tested with IS, created the md5 files after all files passed checks. System still said MD5 files were missing.

Can't see preview from bottom of file list

Feedback from JC (SG):

Production page: the further down the page you get, the further away an entry gets from the thumbnail at the top of the page so you have to scroll up the page to see the preview.

Keep contents of right-side panel lined up or scroll only the file list in place.

TIFF files with multiple pages

Need to add a check for tif files with more than 1 page. Castle project results have 2 pages, one a small thumbnail that is causing ImageMagick to export 2 images instead of 1.

Navigate from preview image

Feedback from JC (SG):

Production page: any way to advance down the list using the down arrow key? (User has to click on each entry to open the preview image page).

Add navigation in the preview to go to the next/prev file.

Set easier way to see dupes

A duplicate filename says where the other file is, but there should be a button that shows them in a single place, probably the search files page.

QC needs to provide more info

From JPC, the staff thought there was only a single folder left. Need to provide more context.

  • Add processing detail, including folders left to do
  • Add preview of the image so it displays completely on first opening the page, keeping the full-size for zoom in details if needed

Add search to Lightbox

Feedback from JC (SG):

Production page: Search box is terrific—very robust! Add same search box to Lightbox page??

Add search to lightbox. However, these are different since the table search is driven by DT and the lightbox is just a series of divs.

Add option for external image server

Add the URL for images that are hosted in an external production system (IDS, in SI's case).

This can help save local disk by deleting the preview files once the QC/project is completed.

Provide additional instructions

Create and add additional instructions throughout to support guideline criteria 1.3.3 and 1.4.1

  • New project request form
  • Edit project form
  • Navigation and review of dashboard
  • QC completion with requirements at completion

Add check for metadata fields

Need to check that boilerplate metadata meets the requirements.

Other metadata might use a regex to check if it meets some expected values.
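
One possible shape for such a check, as a sketch only: each required field is paired with a regex, and the function reports fields that are missing or do not match. The field names and patterns below are placeholders, not the project's actual requirements.

```python
import re

# Hypothetical rules: field names and patterns are placeholders.
RULES = {
    "Copyright": re.compile(r"Smithsonian Institution"),
    "DateCreated": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def check_metadata(metadata, rules=RULES):
    """Return a dict of field -> problem for fields that are missing or do not match."""
    problems = {}
    for field, pattern in rules.items():
        value = metadata.get(field)
        if value is None:
            problems[field] = "missing"
        elif not pattern.fullmatch(value):
            problems[field] = f"unexpected value: {value!r}"
    return problems
```

An empty result means every listed field is present and matches its pattern.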

Clicking on file in the dashboard sometimes hangs the system

Clicking on a file to see the preview and details results in the row being selected, but nothing shows up in the right-hand panel. When forcing a refresh, the server returns a 502 Proxy Error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /villanueval/mdpp129/.

Reason: Error reading from remote server

Seems shiny crashed. Need to diagnose the source since this is an important use of the dashboard.

Add QC process

Add QC process to the Dashboard that is based on ISO 2859-1. Will sample the files and have a pass/fail for the set.

Relevant text from virtual workflow for SG MDPP:

2.2 Image Quality Inspection

The Image Quality Inspection is based on ISO 2859-1, “Sampling procedures for inspection by attributes.” This international standard is applied when working with a continuing series of lots (large volumes of items over long periods of time), whereby the following parameters for Image Quality Inspection performed by museum staff are observed:

2.2.1 Lot Size

For the first 10% of the project, the Lot Size shall be a single full day’s worth of production throughput, as estimated by the project Gantt chart, which is 900 items. For the remaining 90% of the project, the Lot Size shall be two days’ worth of production throughput, as estimated by the project Gantt chart, which is 1800 items.

2.2.2 Sample Size

During the first 10% of the project, the sample size, based on Table 2-A in ISO 2859 standard, shall be 80 items, randomly selected. For the remaining 90% of the project, the Sample Size shall be 125 items, randomly selected.
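
For the dashboard, the random selection could be sketched as below. The function name sample_lot is hypothetical, and returning the whole lot when it is smaller than the sample size is an assumption, not something the workflow text specifies.

```python
import random


def sample_lot(items, sample_size, seed=None):
    """Pick sample_size items at random, without replacement, from one lot."""
    items = list(items)
    if sample_size >= len(items):
        return items  # assumption: a small lot is inspected in full
    # A seed makes the selection reproducible, e.g. for auditing a QC run.
    return random.Random(seed).sample(items, sample_size)
```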

2.2.3 Acceptable Quality Level

The Acceptable Quality Level (AQL) is defined as the worst tolerable process average (mean) that is still considered acceptable. The AQL for critical nonconformities is 0%; for major nonconformities is 1.5%; for minor nonconformities is 4%.

For a sample size of 80 items (from a lot size of 900 items), this determines:

  • Critical nonconformities: 0 items
  • Major nonconformities: 3 items or fewer
  • Minor nonconformities: 7 or fewer

For a sample size of 125 items (from a lot size of 1800 items), this determines:

  • Critical nonconformities: 0 items
  • Major nonconformities: 5 items or fewer
  • Minor nonconformities: 10 items or fewer

For a sample size of 200 items (from a lot size of 4500 items), this determines:

  • Critical nonconformities: 0 items
  • Major nonconformities: 7 or fewer items
  • Minor nonconformities: 14 items or fewer
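
The acceptance numbers in the three lists above can be expressed as a lookup table with a pass/fail function. This is a sketch for the dashboard's QC feature, not an implementation of ISO 2859-1: the table holds only the three sample sizes quoted here, and the names ACCEPTANCE and lot_accepted are hypothetical.

```python
# Acceptance numbers copied from the lists above: sample size -> max allowed per class.
ACCEPTANCE = {
    80: {"critical": 0, "major": 3, "minor": 7},
    125: {"critical": 0, "major": 5, "minor": 10},
    200: {"critical": 0, "major": 7, "minor": 14},
}


def lot_accepted(sample_size, critical=0, major=0, minor=0):
    """Return True when every nonconformity count is within its acceptance number."""
    limits = ACCEPTANCE[sample_size]  # KeyError for sample sizes not quoted above
    found = {"critical": critical, "major": major, "minor": minor}
    return all(found[kind] <= limits[kind] for kind in limits)
```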

2.2.3.1 Nonconformities

“Nonconformities” are defined as non-fulfillment of a specified requirement, including, but not limited to, the following:

  • Critical: Incorrect filenaming
  • Major: incorrect image orientation or rotation, overly cropped images
  • Minor: Uncropped or skewed

2.2.4 Inspection

Inspection specifications are determined by museum staff and performed visually. After vendor performs technical quality assurance as defined in the Imaging Workflow Design Document, assets will be made available for inspection through the MDPP Validate VFCU Dashboard. For the first 10% of the project, museum staff will inspect daily, as noted above. After the first 10% of the project is complete and accepted, museum staff will inspect every other day, as noted above.

2.2.4.1 Inspection Level

The base inspection level shall be “Normal Inspection,” which is used when the quality level is assumed to be acceptable. This is also known as Level II inspection. Normal inspection for the duration of this project is performed every other day during production (Monday, Wednesday, Friday). Museum staff shall be responsible for 2 of the 3 inspection days, while DPO shall be responsible for 1 inspection day.

2.2.5 Switching Rules

Inspection levels will switch if there is a marked change in nonconformities, as follows:

2.2.5.1 Normal to Tightened

If 2 of 5 consecutive lots are not accepted, inspection will rise from Normal to Tightened, whereby the criteria for inspection are increased to daily inspection of items, based on Lot and Sample Size. This is also known as Inspection Level III. If 5 lots are not accepted under Tightened Inspection, production is discontinued until the vendor can improve quality.

2.2.5.2 Tightened to Normal

If 5 consecutive lots are accepted under Tightened Inspection, inspection will resume to Normal Inspection.
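
The two rules in 2.2.5.1 and 2.2.5.2 can be sketched as a small state machine. The class name InspectionSwitch is hypothetical. Two details are assumptions because the text does not settle them: the 5 rejected lots that stop production are counted cumulatively since entering Tightened Inspection, and the history of recent lots is reset on every switch. Reduced Inspection is left out.

```python
from collections import deque


class InspectionSwitch:
    """Track lot results and switch between Normal and Tightened inspection."""

    def __init__(self):
        self.level = "normal"
        self.recent = deque(maxlen=5)  # up to the last 5 lots at the current level
        self.tightened_rejections = 0  # assumption: cumulative while tightened

    def record(self, accepted):
        """Record one lot result and return the level that applies to the next lot."""
        self.recent.append(accepted)
        if self.level == "normal":
            # 2.2.5.1: 2 of 5 consecutive lots not accepted -> Tightened.
            if list(self.recent).count(False) >= 2:
                self._switch("tightened")
        elif self.level == "tightened":
            if not accepted:
                self.tightened_rejections += 1
            if self.tightened_rejections >= 5:
                self._switch("discontinued")
            # 2.2.5.2: 5 consecutive accepted lots -> back to Normal.
            elif len(self.recent) == 5 and all(self.recent):
                self._switch("normal")
        return self.level

    def _switch(self, level):
        self.level = level
        self.recent.clear()
        self.tightened_rejections = 0
```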

2.2.5.3 Normal to Reduced

If 5 consecutive lots are accepted under Normal Inspection, inspection will drop to Reduced Inspection, also known as Inspection Level I, whereby the criteria for inspection are decreased to weekly inspection of items based on the following Lot and Sample Size:

  • Lot Size: 4500 items
  • Sample Size: 200 items
  • Critical nonconformities: 0 items
  • Major nonconformities: 7 or fewer items
  • Minor nonconformities: 14 items or fewer

2.2.6 Remediation

All nonconformities will be noted and tracked on the Remediation Tracking page in the project’s Confluence space. In the event that critical nonconformities, or an excess of major and/or minor nonconformities are discovered (50% or more), the entire lot is rejected. Upon rejection, museum staff will review the entire lot, and track nonconformities on the Remediation page.

The Remediation Tracking will index all nonconformities, the necessary steps to remediate the issue, and the status of resubmission.

The vendor will be notified of the nonconformities and will remediate as necessary.

2.2.7 Resubmission

After remediation is complete, the vendor shall resubmit corrected items for inspection.

Export reports and images for QC to Dropbox

Since some partners don't have access to our systems, will re-enable:

  • export to spreadsheets of the tables for each folder
  • sample of images per folder for visual QC

To be exported to Dropbox, but any other similar tool can be used (OneDrive, GDrive, S3, etc).

Change table of files to server-side

Folders with 3k+ files render very slowly. Need to change the DataTables process to use server-side processing and the API.

The complication will be keeping the 'Fail' entries at the top and maintaining navigation.
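
A sketch of how a server-side endpoint could build one page while keeping failures first. The response keys (draw, recordsTotal, recordsFiltered, data) and the request parameters (draw, start, length) follow the DataTables server-side protocol. The row layout with a "status" key and the function name datatables_page are assumptions for illustration.

```python
def datatables_page(rows, draw, start, length):
    """Return one page in the shape DataTables expects in server-side mode."""
    # Stable sort: failed files first, original order kept within each group.
    ordered = sorted(rows, key=lambda r: r["status"] != "Fail")
    # DataTables sends length = -1 when the user asks to show all rows.
    end = None if length < 0 else start + length
    return {
        "draw": int(draw),
        "recordsTotal": len(rows),
        "recordsFiltered": len(rows),  # no search filter applied in this sketch
        "data": ordered[start:end],
    }
```

Because the ordering is applied before slicing, the 'Fail' rows stay on the first pages regardless of page size.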

Add option for JHOVE to return OK even with WB error

Due to the known limitation of JHOVE, validation fails if the white balance has a value outside the expected. Will add an option to check the messages and ignore the WB and set it as passed unless it finds other issues.
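
The option could work along these lines: if JHOVE does not report the file as valid, look at the messages and pass the file only when all of them are about white balance. This is a sketch under stated assumptions. The function name jhove_passes is hypothetical, and the marker text "WhiteBalance" is a placeholder that should be checked against the messages JHOVE actually emits.

```python
def jhove_passes(status, messages, ignore_markers=("WhiteBalance",)):
    """Treat a failed validation as passed when every message matches an ignored marker."""
    if status == "Well-Formed and valid":
        return True
    remaining = [
        m for m in messages
        if not any(marker.lower() in m.lower() for marker in ignore_markers)
    ]
    # Pass only if there were messages and all of them matched an ignored marker.
    return bool(messages) and not remaining
```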

Pending Tasks for Production

To do in DEV:

  • Remediate issues detected by accessibility scan (#43)
  • Remediate issues detected by device vulnerability scan (Nexpose)
  • Remediate issues detected by web vulnerability scan (Qualys; Request RITM0135889)
  • Update Tech Store

Setup Production:

  • Deliver high-level diagram
  • Setup Archer package (Request RITM0135942)
  • Get and mount NFS share for image previews
  • Request VM's
  • Access VM's
  • Request prod db
  • Setup httpd, Python env, load balancer
  • Install and run Osprey using PROD db
  • Setup TLS 1.2 Cert

Finishing PROD of Internal Instance:

  • Add to Web Site Listing table
  • Setup DNS for URL

Finishing External (Public) Instance:

  • Use stats of Internal to plan and tweak config
  • Request VM's
  • Access VM's
  • Setup httpd, Python env, load balancer
  • Install and run Osprey
  • Setup TLS 1.2 Cert
  • Add to Web Site Listing table
  • Setup DNS for URL

Add check for files that disappear

Once a file passes the checks, the script ignores it. Need to add a check that the files are still in the folder, in case a file was pulled deliberately or by mistake.
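
A minimal sketch of such a check: compare the file names already recorded for a folder with what is on disk now. The function name missing_files is hypothetical, and how the recorded names are obtained (for example, from the database) is left out.

```python
from pathlib import Path


def missing_files(folder, known_names):
    """Return the previously recorded file names that are no longer in the folder."""
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    return sorted(set(known_names) - present)
```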

Migrate database to MySQL/generic

To use the SI MySQL servers, will move the database from PostgreSQL to MySQL. Need to test connectors that work in RHEL+Ubuntu.

Maybe SQLAlchemy?

Fit image to screen

JB (ento) suggested that it would be easier to run QC if the image is completely visible without having to scroll.

Close preview button on top-right

Feedback from JC (SG):

Production page: any way of adding a (close file) ‘x’ button at the top of the Preview Image box

The preview modal has 'Dismiss' at the bottom. Add 'x' to close on top-right corner. Seems shiny's showModal doesn't allow for this and would need to add manually.

Accessibility Scan

The new accessibility scan (ticket RITM0090813) found four issues:

  • The tool buttons on the graph of the Summary tab of the homepage do not receive any keyboard focus. This still does not work for keyboard-only users. The screen readers read each of the graph tools as “graphic” or “link” without giving any context as to what those tools are
  • The keyboard focus disappears after the “Summary” tab of the homepage until the footer. The keyboard focus should never disappear from sight
  • For the graph on the Summary tab of the homepage, add a long description or a caption that gives a very brief overview such as the starting data point and the overall trend one sees when looking at a graph. Please add this before the table begins. See this page for more information [https://www.w3.org/WAI/tutorials/images/complex/]
  • Hidden table “Cummulative number of images captured and objects digitized by the Digitization Program Office ...“ on https://dt-vmdpoqc01.si.edu/

Large images trigger Pillow warning

Been seeing warnings from the Pillow module because the files are bigger than the expected default:

osprey      : INFO     Running checks on file USNMENT00335801.tif
/home/villanueval/.local/lib/python3.6/site-packages/PIL/Image.py:2766: DecompressionBombWarning: Image size (101082464 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
  DecompressionBombWarning,

The value can be set using: PIL.Image.MAX_IMAGE_PIXELS

Or the warning disabled using: Image.MAX_IMAGE_PIXELS = None

https://pillow.readthedocs.io/en/latest/reference/Image.html#functions

Documentation: what are the actual "tests" this software performs?

I've read through all the readme files in this project but I can't find the answer to a simple question: what are the tests that are run against the images?

I'm trying to determine if it's worth the effort installing and running this software but there is not enough information provided (that I could easily find) to allow me to make this call.

All I found were a number of instances of:

The system checks that the files pass a number of tests and displays the results in a Shiny dashboard.

It would be very useful to have some indication of what these tests are and what sort of errors they identify. Either in terms of listing the various classes of tests or listing them all. I have no idea how many or how detailed they are.

Background: 700K herbarium sheets scanned by Picturae that have not been QA'ed or processed into any downline system yet.

Filechecks need to be case insensitive

file_pair_check() takes the value of settings.raw_files, which could be the wrong case and it won't find the paired file. Need to change this to a case-insensitive check to find the file correctly.
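
A case-insensitive lookup could be sketched as follows. This is not the existing file_pair_check(): the function name find_paired_file and its parameters are hypothetical, and it assumes the raw files sit directly in one folder and share the image's file name stem.

```python
from pathlib import Path


def find_paired_file(image_path, raw_dir, raw_extensions):
    """Return the raw file matching image_path's stem, ignoring case, or None."""
    stem = Path(image_path).stem.lower()
    wanted = {ext.lower().lstrip(".") for ext in raw_extensions}
    for candidate in Path(raw_dir).iterdir():
        # Compare both the stem and the extension in lowercase.
        if (candidate.stem.lower() == stem
                and candidate.suffix.lower().lstrip(".") in wanted):
            return candidate
    return None
```

With this approach a setting of "iiq" still matches a file named ABC001.IIQ.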

QC - Page ‘bounces’ slightly on occasion

Feedback from JC (SG):

Page ‘bounces’ slightly on occasion. Not sure if this has something to do with loading the assets for preview or some other reason.

I'm guessing it's while shiny loads the components of the page.

Add email notice for errors

Research methods that can send an email with a summary of errors that won't get blocked or tagged as spam.

Add sorting to Lightbox

Feedback from JC (SG):

Lightbox page- Thumbnails appear in reverse order, though (last to first).

Add sorting to the lightbox features.

Re-design homepage

The plotly graph doesn't meet accessibility requirements and haven't found other libs that do. Will re-design the homepage and add links to pages that can hold the details for each team and statistics for each project.

Dupe file tagging

The process is tagging 2 files as having errors: the original and the new one. It should not change the original.
