usnationalarchives / digital-preservation
NARA digital preservation file format risk analysis and preservation plans
License: Other
I noticed that .odt files are listed as a preferred text-based format, but other versions like .rtf and .txt aren't included. Why is that?
In your email guidelines you say attachments may be held as MIME content within the EML. Surely this is incompatible with the rest of the guidelines, where preservation actions are required on the attachment files themselves; that is impossible if the attachment is embedded within the EML. I would recommend always taking a copy outside the EML for separate preservation.
Similar logic applies to a PST file. I would recommend unpacking the PST into a folder hierarchy mimicking the Outlook view, then processing each message as an EML as above. This makes individual messages discoverable and preservable.
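As a rough illustration of taking a copy of each attachment outside the EML, here is a minimal sketch using only Python's standard `email` package. The function names, the sample message and the addresses are made up for the example. PST unpacking is not covered, since it needs a third-party tool.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import Path
import tempfile


def extract_attachments(eml_bytes, out_dir):
    """Write each attachment of an EML message to out_dir; return the paths."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        payload = part.get_payload(decode=True)
        if payload is None:
            # Multipart attachments (e.g. a forwarded message) have no single
            # byte payload; this sketch skips them.
            continue
        # Keep only the base name so a crafted filename cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment-{index}").name
        target = out_dir / name
        target.write_bytes(payload)
        saved.append(target)
    return saved


def build_sample_eml():
    """Build a small, made-up message with one attachment for the demo."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["To"] = "archive@example.org"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Report attached.")
    msg.add_attachment(
        b"id,value\n1,42\n",
        maintype="application",
        subtype="octet-stream",
        filename="report.csv",
    )
    return msg.as_bytes()


def demo():
    """Extract the sample attachment into a temporary folder."""
    with tempfile.TemporaryDirectory() as folder:
        paths = extract_attachments(build_sample_eml(), Path(folder))
        return [(p.name, p.read_bytes()) for p in paths]
```

The extracted files can then go through the same characterisation and transformation steps as any other file, while the original EML is kept unchanged alongside them.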
The list of transform tools is useful and comprehensive, thanks.
However, much of the cleverness is in the parameters used, for example in FFMPEG. Will you also be publishing recommended transform parameters or arguments, for example the recommended compression when producing JPEG from TIFF for access?
The framework should take into account the direction set forth in the OPEN Government Data Act (OGDA) to make records machine-readable by default, based upon standardized schemas. https://www.linkedin.com/pulse/open-gov-data-act-machine-readable-records-owen-ambur/
DPX is listed as a preservation format for digital cinema. However, that is only a picture format. There should be a reference stating that preservation audio files can be associated with the DPX picture files if the digital cinema content has sound (e.g., it is not a silent motion picture).
To add to the Machine Readable comment, it would be great if the policy could be read using the Preservation Action Registries protocol. This would ensure it can be read straight into other preservation systems for evaluation and ensure some of the ambiguous statements are clarified.
As a test case we are trying to see how much of this NARA policy can be encoded using PAR which could then be used to send the policy for those organisations that want to copy it or use it as the basis for their own policy. The results so far are promising.
There should be a reference to the ISO standard for Office Open XML:
ISO/IEC 29500-1:2016
Information technology — Document description and processing languages — Office Open XML File Formats — Part 1: Fundamentals and Markup Language Reference
https://www.iso.org/standard/71691.html?browse=tc
and further Parts 2-4.
Part 2: https://www.iso.org/standard/61796.html?browse=tc
Part 3: https://www.iso.org/standard/65533.html?browse=tc
Part 4: https://www.iso.org/standard/71692.html?browse=tc
Also: https://www.iso.org/committee/45374/x/catalogue/
Also, this is simultaneously published as ECMA-376:
http://www.ecma-international.org/publications/standards/Ecma-376.htm
Multiple issues/concerns:
The documents I reviewed on database and structured data preservation left out a lot of database formats and types in use today in Federal Agencies, including Microsoft SQL Server, MarkLogic (NoSQL), and more.
The guidance for Access says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.
The guidance for MySQL says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.
There is an option for MySQL, Access, and other types of structured data archival called SIARD - "Software Independent Archiving of Relational Databases", developed originally by the Swiss Federal Archives. Has NARA looked at SIARD? [https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html]
According to their documentation, SIARD permits archiving of these database types:
In the digital audio section, the expressed preference of format is BWF or FLAC, with MP3 being listed as a Public Access format. In some of your format recommendations below the action is "Transform" to BWF or MP3. I think the implication here is that transformation should be to BWF for long term retention, or MP3 for Public Access. If my interpretation is correct, it might be useful to make a clearer distinction between the purposes of the recommendations; if not, then it would be useful to suggest some minimum quality standards for migrating to MP3 as the "archival" copy.
Are the original documents PDFs or is there a more parse friendly source, e.g. Markdown text, or XML? PDFs could be produced from them but it would allow:
Overall the work is good. I won't be able to finish reviewing today due to personal circumstances, but I will finish by 8th November 2019. I trust the feedback will still prove useful.
The use of TIFF is high risk in that it is proprietary to a single company that is not supporting the format and has not done so for years. No common rendering engine exists, and variances in header data probably mean no common rendering engine will ever exist.
Although AVI is a long-established format, migrating to AVI does not seem to be in line with research conducted in Europe under the auspices of the PRE-FORMA project, where the emerging recommendation and focus of work is MKV as a container for the lossless FFV1 video codec (http://www.preforma-project.eu/media-type-and-standards.html). Similarly, MXF seems to be an industry standard on the basis of being an SMPTE-supported standard.
In either case, the fact that all suggested migrations in the Moving Image section seem to be "to AVI" surely makes AVI a de-facto "preferred" format rather than one of several "accepted"?
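To make the MKV/FFV1 option concrete, here is a minimal sketch that only builds an FFmpeg command line for a lossless FFV1 transcode into a Matroska container. The function name is invented for the example, and the parameter values (FFV1 level 3, intra-only GOP, 16 slices, slice CRCs) are illustrative assumptions, not settings recommended by NARA or by PRE-FORMA.

```python
import shlex


def ffv1_mkv_command(source, target):
    """Return an ffmpeg argument list for an FFV1-in-Matroska transcode."""
    return [
        "ffmpeg",
        "-i", source,
        "-c:v", "ffv1",    # lossless FFV1 video encoder
        "-level", "3",     # FFV1 version 3
        "-g", "1",         # every frame is a keyframe (intra-only)
        "-slices", "16",   # split each frame into slices
        "-slicecrc", "1",  # store a CRC per slice for error detection
        "-c:a", "copy",    # leave the audio stream untouched
        target,            # a .mkv extension selects the Matroska muxer
    ]


command = ffv1_mkv_command("input.avi", "output.mkv")
print(shlex.join(command))
```

Publishing a command like this next to each tool entry would let other institutions reproduce a transformation exactly, rather than only knowing which tool was used.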
Thank you for sharing your digital preservation framework which is both informative and truly inspiring in terms of community involvement. We sometimes struggle narrowing down essential characteristics. Can you elaborate on how you proceed with this? In short, how do you do it?
There are many formats discussed, but it's not clear which one applies to paper records to be scanned/digitized. They aren't simply "still images" - besides being multi-page, there is the issue of optical character recognition (OCR) and metadata (titles and more). PDF is one format, currently used for the online JFK records, but not the only choice. PDF itself is also an umbrella format which has many internal choices: internal image format, scanning resolution and bit depth, etc.
So it seems clear that scanned documents should be considered as their own topic.
Structured Query Language (SQL) is listed in the matrix as a low-risk format, but it does not appear to be listed as a file format in the Preservation Action Plan for Structured Data/Database Records. Many Relational Database Management Systems (RDBMS) allow the export of data as SQL, often for backup purposes.
While there are vendor-specific extensions to SQL, it seems that SQL should be mentioned as a data interchange and preservation format for relational databases. It is easier to deal with than binary formats such as MySQL's frm, myd, myi and ibd files, which are not always backwards compatible (and are mentioned in the Database Action Plan). Also, I imagine that NARA must have accessioned quite a few .sql files already?
One advantage that SQL has over CSV is that key relations are preserved. These are very important for trying to piece together later how the various tables are connected, and queried.
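The difference can be shown with a small sketch that uses Python's built-in `sqlite3` module as a stand-in for any RDBMS. The two tables and their contents are made up. The same table is exported once as SQL and once as CSV, and only the SQL export still carries the foreign-key declaration.

```python
import csv
import io
import sqlite3


def export_both(conn, table):
    """Return (sql_dump, csv_dump): full SQL dump and CSV of one table."""
    sql_dump = "\n".join(conn.iterdump())
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # The table name comes from trusted demo code, not from user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return sql_dump, buffer.getvalue()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        agency_id INTEGER NOT NULL REFERENCES agency(id)
    );
    INSERT INTO agency VALUES (1, 'Example Agency');
    INSERT INTO record VALUES (10, 'Sample record', 1);
    """
)
sql_dump, csv_dump = export_both(conn, "record")
print(sql_dump)
print(csv_dump)
```

In the CSV, `agency_id` is just a column of numbers. Nothing in the file says it points at the `agency` table, so that knowledge has to be documented separately or it is lost.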
First and foremost, many congrats on this truly massive achievement! So much work went into this and it shows. And we especially applaud the commitment to transparency both in the level of detail but also using GitHub as a feedback mechanism. Kudos all around!
I'm submitting comments in no particular order on behalf of a small review group at the Library of Congress for you to do with what you will but our overall message to our NARA friends is - keep up the good work! You are definitely pushing the envelope in all the right ways and we appreciate you sharing out to the community.
Again - such a nice resource and we look forward to seeing it evolve and grow!
Best from Kate
Kate Murray (she/her)
Digital Collections Management & Services
Library of Congress
[email protected]
202-707-4894
Would you consider it worthwhile to add a section to the Software and Code plan related to "where a working copy of the software may be found" if a copy is not in this agency's collections? For NARA, it may be that LoC or NIST or a Presidential library may have secured a contemporaneous copy of software related to the object(s) being described. It may be worthwhile to know either that resource exists, or that a copy may need to be secured by this or another agency.
There is an international standard for MIME types, and a public registry maintained by IANA.
MIME provides a mechanism for labelling content as stored or transmitted by email or HTTP.
MIME is widely used and understood. Using MIME type labels offers some kinds of categories and a mechanism for extension and a greater likelihood of adoption and availability of software for interpretation and conversion.
In addition, if there is need for finer granularity of format description, there are even IETF specifications for additional labelling attributes for file formats.
The current NARA descriptions of file formats could use a framework for categorization of media types.
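As a small sketch of what MIME labelling gives you for free, the example below uses Python's standard `mimetypes` module to label files by extension and to derive a top-level category from the label. The helper names are invented for the example. Guessing from extensions alone is an assumption of the sketch; real identification would also inspect file contents.

```python
import mimetypes


def label_media_type(filename):
    """Guess a MIME type from the file extension, with a generic fallback."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type or "application/octet-stream"


def category(media_type):
    """Return the top-level MIME category, e.g. 'image' for 'image/tiff'."""
    return media_type.split("/", 1)[0]


for name in ["report.pdf", "notes.txt", "unknown.zzzunknown"]:
    label = label_media_type(name)
    print(name, label, category(label))
```

The top-level types (`text`, `image`, `audio`, `video`, `application` and so on) already provide a coarse categorization, and the IANA registry supplies the extension mechanism.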
For linking and reference purposes, it would be helpful to have action plans for each format saved to a separate file. It would also be helpful to have a text based format option like markdown or html.
I find it odd that the EPUB specification is not at all referenced in this document. As a W3C Recommendation, as a forthcoming ISO standard, and as a publishing/web format that is widely used in publication distribution, I am surprised to see it not listed in this document.
More information about the current EPUB version 3.2 is available at: https://www.w3.org/publishing/epub32/epub-spec.html
The ISO standardization project (based on 3.0.1) ISO/IEC TS 30135-1:2014
Information technology — Digital publishing — EPUB3 — Part 1: EPUB3 Overview - https://www.iso.org/standard/53255.html
Is an API available for browsing archive data, for example, in Python?
I don't see any guidance on maintaining Adobe InDesign files, CAD architecture files, etc. that may depend on multiple other files or links, i.e. fonts and images for InDesign, or plumbing drawings and Maya snips for fly-throughs of architecture projects. Keeping these files as PDF proxies does not solve the issue, because it breaks the links that make the file useful for possible future use or access. Any thoughts?
The recommended tool sets are comprehensive and very useful, however in many areas there seems to be a leaning to "fat client" GUI tools. These are incredibly useful for practitioners dealing with individual items, or small batches, and where there is scope for some human resource to perform characterisations/migrations/inspections. From a systems implementation point of view, it would be incredibly useful if these could be supplemented with automatable/batch processing tools that can be machine actioned and scaled to run over much larger datasets. In some cases, the same base tool can be used for either approach (LibreOffice and MediaInfo for example both have GUI and CLI implementations), and again, it would be incredibly useful if it could be indicated where this is true.
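To illustrate the batch approach, here is a minimal sketch that builds one headless LibreOffice conversion command per input file. It only constructs the commands and does not run them. The function name and file paths are invented, and it assumes the LibreOffice `soffice` executable is on the PATH.

```python
def libreoffice_commands(paths, out_dir, target_format="pdf"):
    """Build one headless LibreOffice conversion command per input path."""
    return [
        [
            "soffice",
            "--headless",                    # run without the GUI
            "--convert-to", target_format,
            "--outdir", str(out_dir),
            str(path),
        ]
        for path in paths
    ]


commands = libreoffice_commands(["a.doc", "b.doc"], "converted")
for command in commands:
    print(" ".join(command))
```

Each command could then be handed to `subprocess.run` or a job queue, which is what lets the same base tool scale from a single item to a large dataset.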
I work for the National Park Service where we are working to publish our digital assets in open, machine-readable formats that can be easily transferred to NARA. In the geospatial domain we are moving content from proprietary formats like ESRI personal geodatabases and aiming for the Preferred or Acceptable formats on the NARA website: https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html#geospatialformats
One missing format that has gained a lot of traction across the Department of Interior (and beyond) is the GeoPackage (https://www.geopackage.org/). First approved by the Open Geospatial Consortium in 2014, this open standard supports vector features, tiled imagery, and non-spatial data in a single container, and we've seen it used for many products where a shapefile or similar formats are quite limiting.
Has the GeoPackage received any discussion within NARA? As we provide data publication guidance it would be most useful to know if this format is on the cusp of being an acceptable format (or not). Thank you!
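For context on what identifying this format might involve: a GeoPackage is an SQLite database file, and recent versions of the standard mark it with the application ID `GPKG` in the SQLite header (earlier versions used `GP10` and `GP11`). Below is a minimal sketch of reading that marker with the Python standard library. The function names are invented, and the demo file is an empty SQLite database with the ID set, not a valid GeoPackage.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"
GPKG_TAGS = {b"GPKG", b"GP10", b"GP11"}


def read_application_id(path):
    """Return the 4-byte application ID of an SQLite file, or None."""
    with open(path, "rb") as handle:
        header = handle.read(100)
    if len(header) < 100 or not header.startswith(SQLITE_MAGIC):
        return None
    # The SQLite header stores the application ID at byte offset 68.
    return header[68:72]


def looks_like_geopackage(path):
    """True if the file is SQLite and carries a GeoPackage application ID."""
    return read_application_id(path) in GPKG_TAGS


def demo():
    """Create a throwaway SQLite file tagged GPKG and read the tag back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.gpkg")
        conn = sqlite3.connect(path)
        app_id = int.from_bytes(b"GPKG", "big")
        conn.execute(f"PRAGMA application_id = {app_id}")
        conn.execute("CREATE TABLE placeholder (id INTEGER PRIMARY KEY)")
        conn.commit()
        conn.close()
        return read_application_id(path), looks_like_geopackage(path)
```

Because the container is plain SQLite, the contents stay readable with widely available tooling even without GIS software.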
Digital video formats are listed (MOV, AVI, etc.). However, the video codec stream inside that format also has risk. For example, ProRes, found inside MOV files, is a proprietary codec owned by Apple. While preferred formats are not listed for digital video, there should be a reference to the video codec as an underlying vulnerability factor.
The policy, and especially the spreadsheet, use NARA identifiers, but the de-facto standard is the PRONOM Unique Identifier (PUID). Sometimes a NARA identifier maps onto several PUIDs, which can be confusing. It would be great if the policy could reference the relevant PUID at all times.