JSON-to-MODS XSLT Transformation

Summary: This file contains information about transforming JSON to XML. Once converted to XML, the metadata is mapped into the MODS 3.7 schema.

Task: Transform Treesearch JSON files to MODS XML records for ingest into the Unified Repository.

Metadata: Treesearch metadata in JSON format from the United States Forest Service (USFS).

Transformation process

The json_to_mods.xsl stylesheet uses the following formats and schemas to transform the JSON format into MODS.

JSON: (JavaScript Object Notation)
XPath 3.1: (XML Path Language)
XSLT 3.0: (Extensible Stylesheet Language Transformations)
MODS 3.7: (Metadata Object Description Schema)

*Each JSON file is first transformed to XML; the resulting XML is then mapped to its respective MODS elements.

Preparing JSON Files for XSLT Transformation

Download and unzip the file to "C:\users\{your-profile-name}\desktop". Open the "json-to-mods" folder.
Locate the "working_directory" folder and open it.
Locate the three shell scripts. These will be run at different times throughout the preparation and procedure.

Shell Scripts List
- (a) merge-json.sh - merges all *.json files within this directory. This is useful for creating a merged view of the JSON data for analysis. Run this file first.
- (b) merge_xml.sh - merges all *.xml files contained within this directory. This is useful after transformation because it makes it easier to check whether a specific element is present in every file. Run this file last.
- (c) correct-invalid-chars.sh - corrects several of the invalid characters found in the original Treesearch metadata. It should be run after (a) and before (b).

Run merge_json.sh.
- If data analysis is desired, the output generated by this script may be imported into OpenRefine.
Run correct-invalid-chars.sh.
- This corrects all invalid character usage except for the misuse of the less-than symbol (i.e. "<") in the abstract of file A-29760.json.
The files are now prepared for transformation.

Preview test (optional)

Use the debugger to preview a result.

Open Oxygen.
Select the debugger view.
Open a "prepared JSON file".
Verify that the file is wrapped in <data> tags:
- <data> before the first opening curly brace: <data> {
- </data> after the last closing curly brace: }</data>
Suggested file for preview testing: A-3077.json.
Open the XSLT to test.
- json-to-mods_11-16-2021.xsl or json-to-mods_11-22-2021.xsl
Use the "transformation tool set" to transform the file into MODS XML.
The setup and result should look like the image above.
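The <data> wrapper check described above can also be done from the command line before opening a file in Oxygen. The following is a minimal sketch, not part of the provided scripts: the function name has_data_wrapper is hypothetical, and it assumes the file begins directly with <data> (no leading whitespace or byte-order mark).

```shell
# Hypothetical helper (not one of the provided scripts): succeeds only if
# the file's content starts with <data> and ends with </data>.
has_data_wrapper() {
  # Command substitution drops trailing newlines, so a final newline
  # after </data> is tolerated.
  wrapper_content=$(cat -- "$1")
  case "$wrapper_content" in
    "<data>"*"</data>") return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (the path is an assumption):
# for f in working_directory/A-*.json; do
#   has_data_wrapper "$f" || printf 'missing <data> wrapper: %s\n' "$f"
# done
```

A file that fails this check has most likely not been run through the preparation scripts yet.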

Transform Batch Test

Batch Transform JSON Files to MODS XML

Open the project view. Within the working-directory, highlight all of the files starting with "A" that match the pattern
- "A-#####.json"
- (e.g., A-4077.json)
Right click and mouse over "Transformation Scenarios", select "Configure Transformation Scenarios"
Select "New", then select "XML transformation with XSLT"
Choose the json-to-mods.xsl as the transformation stylesheet.

Give the scenario a name.
Choose your transformer from the dropdown list.
Select OK.

*No output should be set.

Setup is now complete. Click "Apply Selected" to begin the batch transformation.
Review the results.
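Before highlighting files in the project view, it may help to list which files actually match the "A-#####.json" pattern. The sketch below is an illustration only: list_treesearch_json is a hypothetical function name, and it assumes that "#####" stands for one or more digits, since the examples in this document use both four- and five-digit numbers.

```shell
# Hypothetical helper: print the names of files in a directory that look
# like A-<digits>.json. Assumes "#####" means "one or more digits".
list_treesearch_json() {
  for json_path in "$1"/A-*.json; do
    [ -e "$json_path" ] || continue        # glob matched nothing
    json_name=${json_path##*/}             # strip the directory part
    json_digits=${json_name#A-}            # strip the "A-" prefix
    json_digits=${json_digits%.json}       # strip the ".json" suffix
    case "$json_digits" in
      ''|*[!0-9]*) ;;                      # empty or non-numeric: skip
      *) printf '%s\n' "$json_name" ;;
    esac
  done
}

# Example use (the directory name is an assumption):
# list_treesearch_json working_directory | wc -l
```

Comparing the count from this listing with the number of MODS files produced is one simple way to review the batch results.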

JSON to MODS Workflow

A visualization of the transformation process written in Mermaid.

flowchart LR
A[JSON] --> B((XSLT3.0))
B-->B.1((XPath 3.1))-->C
B --transforms_to--> C{XML}
C --maps to--> D[MODS 3.7]
style A fill:#f9f,stroke:#333,stroke-width:4px 
style B fill:#bbf,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
style B.1 fill:#bbf,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
style C fill:#9fc5e8, stroke:#f66, stroke-dasharray: 5 5
style D fill:#9fc5e8,stroke:#333,stroke-width:4px

*If the UML does not render, the image below is how the code above would render using Mermaid.

Discussion of element transformation

MODS: Identifier and Location: The primary identifiers found in the Treesearch metadata are: doi, product_id, and treesearch_pub_id.
- From these identifiers, location elements pointing to the surrogate record and to the resource itself are built to provide access.

Issues

ISSUE #1: Page Numbering

Task Complete? Yes. Resolved. Page numbers are not consistently correct.
When the following JSON keys are present, there are no issues with page numbers:
- pub_start_page and pub_end_page,
- pub_page.
When they are not present, page numbers must be derived from the “pub_publicaton” or “citation” key values.
- Both of these fields are strings of text with inconsistent formatting.
- While they mostly do contain some pagination information, it is difficult to extract the correct data from a string of text.
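To illustrate why free-text pagination is hard to extract, here is a rough heuristic sketch. It is not the method used by the stylesheet: extract_page_range is a hypothetical function, the sample citation is invented, and the rule "take the last digits-hyphen-digits run" is an assumption that will misfire on strings containing year ranges or similar numeric spans.

```shell
# Hypothetical heuristic: print the last "digits-digits" run found in a
# citation string, with any spaces around the hyphen removed.
# Prints nothing when no such run exists.
extract_page_range() {
  printf '%s\n' "$1" |
    grep -oE '[0-9]+[[:space:]]*-[[:space:]]*[0-9]+' |
    tail -n 1 |
    tr -d '[:space:]'
}

# Example use with an invented citation string:
# range=$(extract_page_range 'Forest Science. 45(2): 123-145.')
# start=${range%-*}   # text before the hyphen
# end=${range#*-}     # text after the hyphen
```

Any value recovered this way would still need manual review, which is consistent with the inconsistent formatting noted above.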

ISSUE #2: random "station_id" acronym

Task Complete? Yes. Resolved. Description: An extraneous “station_id” acronym appears just outside the last author name tag.
- Its origin has not been determined.
- See screenshot below

Issue 3: invalid character usage

Issue: Several Treesearch files contain invalid characters that cause errors when the XML processor attempts to transform them. The filenames containing the issues are listed below. Some files contain more than one issue, so they are listed twice.

Characters fixed
- & → &amp;
-   →  
Characters still needing work
- < → &lt; | Needs Resolution: The "working draft" (i.e. NOT the one used in this procedure) attempts to fix file A-29760.json by changing ...(diameter at breast height <6 in) with suppressed growth...
- The less-than symbol < should be written as &lt; in order to be transformed by the XSLT processor.
- The shell script does not do this without also changing other valid HTML tags (e.g. <br>), thus rendering the rest of the JSON document invalid.

Resolution: The shell script provided to add <data></data> tags to the beginning and end of each document also contains a second statement, which has resolved 3a. With more time, issues 3b and 3c can also be resolved.

The following error message is rendered:

Issue 3a: "&"

Filenames:

Description: "The entity name must immediately follow the '&' in the entity reference."

Task Complete? Yes

The shell script responsible for adding <data></data> to the beginning and end of each file also contains a sed statement that corrects this issue. The statement below corrects the invalid ampersand:

  sed -i 's/\&[^amp;|^apos;|^quot;|^lt;|^gt;]/\&amp;/gi' "
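For comparison, here is an alternative sketch of the same correction. It is an illustration under stated assumptions, not the script shipped with this procedure. A bracket expression in a regular expression matches a single character, so a character class can only inspect the one character that follows the ampersand. The two-pass approach below instead escapes every ampersand and then restores those that already began one of the five predefined XML entities. The function name escape_bare_ampersands is hypothetical, and numeric character references (for example &#160;) are not handled.

```shell
# Hypothetical filter: reads text on stdin, writes corrected text to stdout.
# Pass 1 turns every "&" into "&amp;".
# Pass 2 undoes pass 1 where the "&" already started a predefined entity,
# e.g. "&amp;lt;" (produced by pass 1 from "&lt;") becomes "&lt;" again.
escape_bare_ampersands() {
  sed -E \
    -e 's/&/\&amp;/g' \
    -e 's/&amp;(amp|apos|quot|lt|gt);/\&\1;/g'
}

# Example use (the file names are assumptions):
# escape_bare_ampersands < A-3077.json > A-3077.fixed.json
```

Because it writes to standard output rather than editing in place, the original file is left untouched until the result has been checked.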

Issue 3b: "<" usage of less-than

Filename: A-29760.json

Task Complete? No. Possible to fix shell script. Description: "The content of elements must consist of well-formed character data or markup."
- The file contains the "<" symbol within the abstract. This is treated as an invalid character and thus produces an error.

abstract: ...(diameter at breast height <6 in) with suppressed growth...

Once the problem is corrected manually, the file produces valid MODS metadata and a valid JSON archival replica. This issue can be resolved if the shell script is improved to preprocess this bad character prior to transformation.
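One possible direction for such preprocessing is sketched below. It is a heuristic only and has not been run against the Treesearch files: escape_stray_less_than is a hypothetical function, and it assumes that a "<" followed by a letter, "/", "!" or "?" starts real markup, while any other "<" (such as the one in "<6 in") is literal text. A literal "<" followed directly by a letter would be missed.

```shell
# Hypothetical filter: reads text on stdin, writes corrected text to stdout.
# Rewrites "<" as "&lt;" when the next character cannot start a tag,
# or when "<" is the last character on the line.
escape_stray_less_than() {
  sed -E 's#<([^A-Za-z/!?]|$)#\&lt;\1#g'
}

# Example use (the file names are assumptions):
# escape_stray_less_than < A-29760.json > A-29760.fixed.json
```

Tags such as <data> and other HTML tags are left unchanged by this rule because they begin with a letter or a slash.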

Issue 3c: "<br>"

  →

