usnationalarchives / digital-preservation

NARA digital preservation file format risk analysis and preservation plans

License: Other

digital-preservation's Issues

Plain-text preferred formats

I noticed that .odt files are listed as a preferred text-based format, but other versions like .rtf and .txt aren't included. Why is that?

Attachments held within PST and EML

In your email guidelines you say attachments may be held as MIME content within the EML. Surely this is incompatible with the rest of the guidelines, where preservation actions are required on the attachment files themselves; that is impossible if the attachment is embedded within the EML. I would recommend always taking a copy outside the EML for separate preservation.

Similar logic applies to a PST file. I would recommend unpacking the PST into a folder hierarchy mimicking the Outlook view, then processing each message as an EML as above. This makes individual messages discoverable and preservable.
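To illustrate the "copy outside the EML" suggestion, here is a minimal sketch using only Python's standard `email` package. The function name `extract_attachments` and the sample message are hypothetical and not part of the NARA guidance; PST unpacking is not covered, since it needs a third-party tool.

```python
import tempfile
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import Path


def extract_attachments(eml_bytes: bytes, out_dir: Path) -> list[Path]:
    """Write each attachment of an EML message to out_dir and return the paths."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, part in enumerate(msg.iter_attachments(), start=1):
        # Keep only the final path component; fall back to a generated name
        # when the MIME part carries no filename.
        name = Path(part.get_filename() or f"attachment-{index}.bin").name
        target = out_dir / name
        # decode=True undoes base64 or quoted-printable transfer encoding.
        # Nested messages (message/rfc822) yield None and are written empty here.
        target.write_bytes(part.get_payload(decode=True) or b"")
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical sample message built in memory for the demonstration.
    sample = EmailMessage()
    sample["Subject"] = "Sample"
    sample.set_content("Body text")
    sample.add_attachment(b"hello", maintype="application",
                          subtype="octet-stream", filename="note.bin")
    with tempfile.TemporaryDirectory() as tmp:
        for path in extract_attachments(sample.as_bytes(), Path(tmp)):
            print(path.name, path.stat().st_size)
```

The original EML stays untouched, so the extracted copies can be characterised and migrated separately while the message itself remains intact.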

What revisions can you suggest to the proposed processing and preservation actions for the formats? Are the Essential Characteristics for each record type comprehensive enough for digital preservation? Are the proposed preservation actions for the formats technically appropriate?

More detail on transform tools

The list of transform tools is useful and comprehensive, thanks.

However, much of the cleverness is in the parameters used, for example in FFmpeg. Will you also be publishing recommended transform parameters or arguments, for example the recommended compression when producing JPEG from TIFF for access?

Are the Essential Characteristics for each record type comprehensive enough for digital preservation?

Machine-Readability

The framework should take into account the direction set forth in the OPEN Government Data Act (OGDA) to make records machine-readable by default, based upon standardized schemas. https://www.linkedin.com/pulse/open-gov-data-act-machine-readable-records-owen-ambur/

Digital Cinema preservation files - reference there could be a sound element as well as picture

DPX is listed as a preservation format for digital cinema. However, that is only a picture format. There should be a reference stating that preservation audio files could be associated with the picture DPX, if the digital cinema content has sound (e.g., not a silent motion picture).

Make the policy accessible via PAR

To add to the Machine Readable comment, it would be great if the policy could be read using the Preservation Action Registries protocol. This would ensure it can be read straight into other preservation systems for evaluation and ensure some of the ambiguous statements are clarified.

As a test case we are trying to see how much of this NARA policy can be encoded using PAR which could then be used to send the policy for those organisations that want to copy it or use it as the basis for their own policy. The results so far are promising.

Office Open XML File Formats references

There should be a reference to the ISO standard for Office Open XML:

ISO/IEC 29500-1:2016
Information technology — Document description and processing languages — Office Open XML File Formats — Part 1: Fundamentals and Markup Language Reference
https://www.iso.org/standard/71691.html?browse=tc

and further Parts 2-4.
Part 2: https://www.iso.org/standard/61796.html?browse=tc
Part 3: https://www.iso.org/standard/65533.html?browse=tc
Part 4: https://www.iso.org/standard/71692.html?browse=tc

Also: https://www.iso.org/committee/45374/x/catalogue/

Also, this is simultaneously published as ECMA-376:
http://www.ecma-international.org/publications/standards/Ecma-376.htm

Database and Structured Data Preservation is Missing Many DB Types (3 issues)

Multiple issues/concerns:

The documents I reviewed on database and structured data preservation left out a lot of database formats and types in use today in Federal Agencies, including Microsoft SQL Server, MarkLogic (NoSQL), and more.
The guidance for Access says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.
The guidance for MySQL says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.

There is an option for MySQL, Access, and other types of structured data archival called SIARD - "Software Independent Archiving of Relational Databases", developed originally by the Swiss Federal Archives. Has NARA looked at SIARD? [https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html]

According to their documentation, SIARD permits archiving of these database types:

Oracle
Microsoft SQL Server
MySQL
DB/2
Microsoft Access

Clearer distinction between Public Access and archival holding migration suggestions

In the digital audio section, the expressed preference of format is BWF or FLAC, with MP3 being listed as a Public Access format. In some of your format recommendations below the action is "Transform" to BWF or MP3. I think the implication here is that transformation should be to BWF for long term retention, or MP3 for Public Access. If my interpretation is correct, it might be useful to make a clearer distinction between the purposes of the recommendations; if not, then it would be useful to suggest some minimum quality standards for migrating to MP3 as the "archival" copy.

Release documentation in machine readable form

Are the original documents PDFs or is there a more parse friendly source, e.g. Markdown text, or XML? PDFs could be produced from them but it would allow:

direct contributions as PRs (git is horrible at handling PDFs); and
devs to load some of the data into an automated system, e.g. decision support system or registry.

Overall the work is good. I won't be able to finish reviewing today due to personal circumstances, but will finish by 8th November 2019. I trust the feedback will still prove useful.

TIFF

The use of TIFF is high risk in that it is proprietary to a single company that is not supporting the format and has not done so for years. No common rendering engine exists, and variances in header data probably mean no common rendering engine will ever exist.

Are the proposed preservation actions for the formats technically appropriate?

Are there appropriate tools for processing and preservation of specific formats that we do not have listed?

Are there other high priority formats we haven’t created plans for yet?

Recommendations on Video Migration

Although AVI is a long-established format, migrating to AVI does not seem to be in line with research conducted in Europe under the auspices of the PRE-FORMA project, where the emerging recommendation and focus of work is MKV as a container for the lossless FFV1 video codec (http://www.preforma-project.eu/media-type-and-standards.html). Similarly, MXF seems to be an industry standard on the basis of being an SMPTE-supported standard.

In either case, the fact that all suggested migrations in the Moving Image section seem to be "to AVI" surely makes AVI a de-facto "preferred" format rather than one of several "accepted"?
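As an illustration of what an FFV1-in-MKV migration could look like, here is a sketch that builds an FFmpeg command line. The parameter choices (FFV1 version 3, intra-only frames, per-slice CRCs, audio copied unchanged) are my own assumptions, not values taken from the NARA plans or from PRE-FORMA, and the helper name is hypothetical.

```python
import shlex
from pathlib import Path


def ffv1_mkv_command(source: Path) -> list[str]:
    """Build (but do not run) an FFmpeg command that re-encodes video to FFV1 in MKV."""
    target = source.with_suffix(".mkv")
    return [
        "ffmpeg",
        "-i", str(source),
        "-c:v", "ffv1",    # lossless FFV1 video codec
        "-level", "3",     # FFV1 version 3
        "-g", "1",         # every frame is a keyframe (intra-only)
        "-slicecrc", "1",  # embed a CRC per slice for fixity checking
        "-c:a", "copy",    # leave the audio stream as it is
        str(target),
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(ffv1_mkv_command(Path("example.avi"))))
```

Publishing the argument list alongside the tool name would let other institutions reproduce the same transformation.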

Essential characteristics

Thank you for sharing your digital preservation framework, which is both informative and truly inspiring in terms of community involvement. We sometimes struggle to narrow down essential characteristics. Can you elaborate on how you proceed with this? In short, how do you do it?

Paper records - which format category do they fit under?

There are many formats discussed, but it's not clear which one applies to paper records to be scanned/digitized. They aren't simply "still images" - besides being multi-page, there is the issue of optical character recognition (OCR) and metadata (titles and more). PDF is one format, currently used for the online JFK records, but not the only choice. PDF itself is also an umbrella format which has many internal choices: internal image format, scanning resolution and bit depth, etc.

So it seems clear that scanned documents should be considered as their own topic.

Are there appropriate tools for processing and preservation of specific formats that we do not have listed?

Structured Query Language (SQL)

Structured Query Language (SQL) is listed in the matrix as a low-risk format, but it does not appear to be listed as a file format in the Preservation Action Plan for Structured Data/Database Records. Many Relational Database Management Systems (RDBMS) allow the export of data as SQL, often for backup purposes.

While there are vendor-specific extensions to SQL, it seems that SQL should be mentioned as a data interchange and preservation format for relational databases. It is easier to deal with than binary formats such as MySQL's frm, myd, myi and ibd files, which are not always backwards compatible (and are mentioned in the Database Action Plan). Also, I imagine that NARA must have accessioned quite a few .sql files already?

One advantage that SQL has over CSV is that key relations are preserved. These are very important for trying to piece together later how the various tables are connected, and queried.
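A small sketch of that difference, using SQLite from the Python standard library as a stand-in for a full RDBMS. The table names and rows are made up for illustration; a real export would come from the database's own dump tool.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    agency_id INTEGER NOT NULL REFERENCES agency(id)
);
INSERT INTO agency VALUES (1, 'Example Agency');
INSERT INTO record VALUES (10, 'Example Record', 1);
""")

# SQL export: the schema, including the foreign key, travels with the data.
sql_dump = "\n".join(conn.iterdump())

# CSV export of the same table: only the column names and values survive.
buffer = io.StringIO()
cursor = conn.execute("SELECT * FROM record")
writer = csv.writer(buffer)
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
csv_dump = buffer.getvalue()

print(sql_dump)
print(csv_dump)
```

The SQL dump still contains the `REFERENCES agency(id)` clause, while the CSV carries `agency_id` as a bare number with nothing to say which table it points to.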

Feedback from your pals at Library of Congress

First and foremost, many congrats on this truly massive achievement! So much work went into this and it shows. And we especially applaud the commitment to transparency both in the level of detail but also using GitHub as a feedback mechanism. Kudos all around!

I'm submitting comments in no particular order on behalf of a small review group at the Library of Congress for you to do with what you will but our overall message to our NARA friends is - keep up the good work! You are definitely pushing the envelope in all the right ways and we appreciate you sharing out to the community.

It might be worth including references for some of the foundational/related work on things like Significant Properties work from JISC and/or Gareth Knight as well as for format evaluation and sustainability (such as http://www.loc.gov/preservation/digital/formats/index.html which is near and dear to my heart). Even if NARA put its own spin on things, it would be worthwhile to show that this is part of a larger ecosystem.
It also might be good to include links to other resources when giving specific examples of technical characteristics. Examples might be IASA TC-04 for audio and the FADGI Still Image Guidelines for Still Images.
Can you explain a bit more about the rationale for transformation? For example, the plan for QuickTime with AAC suggests transforming to either BWF or MP3, but those would have very different implications - the former presumably for preservation and the latter for access.
There's a bit of unevenness across the action plans. Some are quite detailed for each entry while others have references across a broader scope - at just the wrapper level or even the content category level.
I'm curious about the explicit naming of specific software, especially proprietary tools, as preferred tools. As feds, we have usually shied away from this for public docs and instead highlight functionality rather than brand names. Was there a testing protocol to determine what is a preferred tool? Also, we love that there is some emphasis on open source tools.
How often would the action plans be updated? I understand from the SOW that the matrix is every 2 years but is it the same for the action plans? With my experience with the Sustainability of Digital Format site, I can attest that this is a really hard row to hoe and keeping up this level of effort and detail is an enormous effort. It will take a village.
What happens when a new format arrives? Would it be added to the matrix/action plans on the 2 year cycle?
Sometimes, PDF/A is the target transformation format (for NF00374 for example) but other times, PDF is recommended. Are there criteria for this, or maybe this is still to come?
WikiData has a nice mapping from PUIDs to FDDs to the WikiData QID. This could be another data point for the NARA unique number or maybe add the mapping here.

Again - such a nice resource and we look forward to seeing it evolve and grow!

Best from Kate

Kate Murray (she/her)
Digital Collections Management & Services
Library of Congress
202-707-4894

Software and Code plan - where to locate a copy if not possessed by NARA/this agency

Would you consider it worthwhile to add a section to the Software and Code plan related to "where a working copy of the software may be found" if a copy is not in this agency's collections? For NARA, it may be that LoC or NIST or a Presidential library may have secured a contemporaneous copy of software related to the object(s) being described. It may be worthwhile to know either that resource exists, or that a copy may need to be secured by this or another agency.

What can you suggest in terms of appropriate public access versions of the formats?

Consider using IETF "MIME types" to describe document formats

There is an international standard for MIME types, and a public registry maintained by IANA.
MIME provides a mechanism for labelling content as stored or transmitted by email or HTTP.

MIME is widely used and understood. Using MIME type labels offers some kinds of categories and a mechanism for extension and a greater likelihood of adoption and availability of software for interpretation and conversion.

In addition, if there is need for finer granularity of format description, there are even IETF specifications for additional labelling attributes for file formats.

The current NARA descriptions of file formats could use a framework for categorization of media types.
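As a rough sketch of how MIME type labels could serve as a categorization layer, the Python standard library's `mimetypes` module maps file extensions to registered media types. The helper names below are hypothetical, and the lookup is by extension only; it does not inspect file content, so it is no replacement for signature-based identification.

```python
import mimetypes


def media_type(filename: str) -> str:
    """Return the media type guessed from the extension, or a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def top_level_category(filename: str) -> str:
    """Return the top-level media type, e.g. 'image' for 'image/png'."""
    return media_type(filename).split("/", 1)[0]


if __name__ == "__main__":
    for name in ["report.pdf", "scan.png", "notes.txt"]:
        print(name, media_type(name), top_level_category(name))
```

The top-level type (text, image, audio, video, application and so on) gives a coarse category, and the subtype gives the specific format label.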

Action plans per format rather than PDF of category

For linking and reference purposes, it would be helpful to have the action plan for each format saved to a separate file. It would also be helpful to have a text-based format option like Markdown or HTML.

EPUB?

I find it odd that the EPUB specification is not at all referenced in this document. As a W3C Recommendation, as a forthcoming ISO standard, and as a publishing/web format that is widely used in publication distribution, I am surprised to see it not listed in this document.

More information about the current EPUB version 3.2 is available at: https://www.w3.org/publishing/epub32/epub-spec.html

The ISO standardization project (based on 3.0.1): ISO/IEC TS 30135-1:2014
Information technology — Digital publishing — EPUB3 — Part 1: EPUB3 Overview - https://www.iso.org/standard/53255.html

application programming interface (API)

Is an API available for browsing archive data, for example, in Python?

When we reach the point that we are releasing marked up versions of the plans, what type of mark up would be the most useful for the community?

Proper packaging for archival use of Adobe InDesign Files, Architecture files, etc.

I don't see any guidance on maintaining Adobe InDesign files, CAD architecture files, etc. that may have multiple other files or links, i.e., fonts and images for InDesign, or plumbing drawings and Maya snips for fly-throughs of architecture projects. Keeping these files as PDF proxies does not solve the problem, because it breaks the links that make the file useful for future use or access. Any thoughts?

Request for scalable/machine actionable tools

The recommended tool sets are comprehensive and very useful; however, in many areas there seems to be a leaning towards "fat client" GUI tools. These are incredibly useful for practitioners dealing with individual items or small batches, and where there is scope for some human resource to perform characterisations/migrations/inspections. From a systems implementation point of view, it would be incredibly useful if these could be supplemented with automatable/batch processing tools that can be machine actioned and scaled to run over much larger datasets. In some cases, the same base tool can be used for either approach (LibreOffice and MediaInfo, for example, both have GUI and CLI implementations), and again, it would be incredibly useful if it could be indicated where this is true.
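To show the kind of batch use meant here, a minimal sketch that drives LibreOffice's command-line mode from Python. The function names are hypothetical, the executable is assumed to be on the PATH as `soffice`, and the commands are only printed in the demonstration rather than run.

```python
import subprocess
from pathlib import Path


def convert_commands(sources: list[Path], out_dir: Path,
                     target_format: str = "pdf") -> list[list[str]]:
    """Build one headless LibreOffice conversion command per source file."""
    return [
        ["soffice", "--headless", "--convert-to", target_format,
         "--outdir", str(out_dir), str(source)]
        for source in sources
    ]


def run_batch(commands: list[list[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands for review; call run_batch() to execute them.
    for command in convert_commands([Path("a.docx"), Path("b.odt")], Path("out")):
        print(" ".join(command))
```

The same pattern scales from a handful of files to a queue-driven workflow, which is why it would help to know which listed tools expose a CLI.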

Geospatial Formats: GeoPackage

I work for the National Park Service where we are working to publish our digital assets in open, machine-readable formats that can be easily transferred to NARA. In the geospatial domain we are moving content from proprietary formats like ESRI personal geodatabases and aiming for the Preferred or Acceptable formats on the NARA website: https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html#geospatialformats

One missing format that has gained a lot of traction across the Department of the Interior (and beyond) is the GeoPackage (https://www.geopackage.org/). First approved by the Open Geospatial Consortium in 2014, this open standard supports vector features, tiled imagery, and non-spatial data in a single container, and we've seen it used for many products where a shapefile or similar format is quite limiting.

Has the GeoPackage received any discussion within NARA? As we provide data publication guidance it would be most useful to know if this format is on the cusp of being an acceptable format (or not). Thank you!
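For background on the "single container" point: a GeoPackage is an SQLite database file, so a first-pass identification can be done with standard tooling. The sketch below checks two markers as I understand them from the specification (the `GPKG` application ID used from version 1.2 onward, and the mandatory `gpkg_contents` table). The function names are hypothetical, the stub writer only mimics those markers and does not produce a valid GeoPackage, and none of this replaces full validation.

```python
import sqlite3
import tempfile
from pathlib import Path

# "GPKG" in ASCII, stored in the SQLite header field application_id.
# Assumption: files written to GeoPackage 1.0 used "GP10" and would not match.
GPKG_APPLICATION_ID = int.from_bytes(b"GPKG", "big")


def looks_like_geopackage(path: Path) -> bool:
    """Cheap header-level check for the two GeoPackage markers."""
    conn = sqlite3.connect(str(path))
    try:
        app_id = conn.execute("PRAGMA application_id").fetchone()[0]
        has_contents = conn.execute(
            "SELECT 1 FROM sqlite_master "
            "WHERE type = 'table' AND name = 'gpkg_contents'"
        ).fetchone() is not None
    except sqlite3.DatabaseError:
        # Not an SQLite database at all.
        return False
    finally:
        conn.close()
    return app_id == GPKG_APPLICATION_ID and has_contents


def write_stub(path: Path, mark_as_gpkg: bool) -> None:
    """Hypothetical fixture: an SQLite file that only mimics the markers."""
    conn = sqlite3.connect(str(path))
    try:
        if mark_as_gpkg:
            conn.execute(f"PRAGMA application_id = {GPKG_APPLICATION_ID}")
            conn.execute("CREATE TABLE gpkg_contents (table_name TEXT PRIMARY KEY)")
        else:
            conn.execute("CREATE TABLE notes (body TEXT)")
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, flag in [("stub.gpkg", True), ("plain.sqlite", False)]:
            target = Path(tmp) / name
            write_stub(target, flag)
            print(name, looks_like_geopackage(target))
```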

What revisions can you suggest to the proposed processing and preservation actions for the formats?

Digital Video - reference the codec as an underlying vulnerability factor

Digital video formats are listed (MOV, AVI, etc.). However, the video codec stream inside that format also has risk. For example, ProRes, found inside MOV files, is a proprietary codec owned by Apple. While preferred formats are not listed for digital video, there should be a reference to the video codec as an underlying vulnerability factor.

Which identifier to use for formats?

The policy, and especially the spreadsheet, use NARA identifiers, but the de facto standard is the PRONOM Unique Identifier (PUID). Sometimes a NARA identifier maps onto several PUIDs, which can be confusing. It would be great if the policy could reference the relevant PUID at all times.