
MIK, the Move to Islandora Kit.


Overview

The Move to Islandora Kit (MIK) converts source content files and accompanying metadata into ingest packages used by existing Islandora batch ingest modules: Islandora Batch, Islandora Newspaper Batch, Islandora Book Batch, and Islandora Compound Batch. In other words, it doesn't import objects into Islandora; it prepares content for importing into Islandora:

[Diagram: MIK overview]

MIK is designed to be extensible. The base classes that convert the source metadata to XML files for importing into Islandora, and that convert the source content files into the required directory structure for importing, can easily be subclassed. MIK also uses plugins (known as "manipulators") and a set of "hook" scripts that provide functionality which can be turned on or off for specific jobs.

MIK was originally developed by staff at Simon Fraser University Library in support of their migration from CONTENTdm to Islandora, but its longer-term purpose is as a general toolkit for preparing content for importing into Islandora. So MIK should really stand for "Move [content into] Islandora Kit."

Documentation

We are continuing to improve our documentation, which is on the MIK wiki. Please let us know if you have any questions or suggestions, or if you would like to assist.

Troubleshooting and support

If you have a question, please open an issue.

Islandora content that has been prepared using MIK

Installation

Instructions are available on the wiki.

Usage

The typical workflow is to 1) configure your toolchain (defined below) by creating an .ini file, 2) check your configuration options, and then 3) run MIK to perform the conversion of your source content into Islandora ingest packages. When MIK finishes running, you can import your content into Islandora using Islandora Batch, Islandora Newspaper Batch, Islandora Book Batch, or Islandora Compound Batch.

1. Configure your toolchain

In a nutshell, this means creating an .ini file for MIK. Details for available toolchains are provided on the wiki.
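
As a rough illustration only (the section and key names below are partly hypothetical; the wiki documents the real options for each toolchain), a minimal .ini file might look something like this:

[FILE_GETTER]
; Placeholder path; points at the directory holding your source content files.
input_directories[] = /path/to/source/files

[METADATA_PARSER]
; Hypothetical section name; mapping_csv_path points at your metadata mappings file.
mapping_csv_path = /path/to/mappings.csv

[MANIPULATORS]
; One or more metadatamanipulators classes, as described later on this page.
metadatamanipulators[] = FilterModsTopic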

2. Check your configuration

To check your configuration options, run MIK and include the --checkconfig (or -cc) option with a value 'all':

./mik --config foo.ini --checkconfig all

You can also check specific types of configuration values as described in this Cookbook entry.

Note: if you are using null mappings for metadata manipulators, --checkconfig will return errors. Use --ignore_null_mappings to avoid these.

3. Convert your source content into Islandora ingest packages

Once you have checked your configuration options, you can run MIK to perform the data conversion:

./mik --config foo.ini

On Windows, you'll need to run:

php mik --config foo.ini

The --config option is required, but you can also add a --limit option if you only want to create a specific number of import packages. This option is useful for testing. For example:

./mik --config foo.ini --limit 10

The above will generate 10 valid packages if possible. Check the problem_records log for any that were skipped due to errors.

Once MIK starts running, it will display its progress:

./mik --config foo.ini
Creating 10 Islandora ingest packages. Please be patient.
===================================================>                          56%

and when finished will tell you where your ingest packages have been saved and where your log file is.

4. Load your content into Islandora

And you're done. In practice, you probably want to do some quality assurance on the Islandora ingest packages before you import them (and MIK provides some helper scripts to do that). If you're not happy with what MIK produced, you can always modify your configuration settings or your metadata mappings file and run MIK again.

Current status

We aim for a 1.0 release of MIK in fall 2017. Please note that the only differences between version 0.9 and 1.0 will be the addition of more features, automated tests, and code cleanup. Version 0.9 is already being used in production.

So far, we have "toolchains" (complete sets of MIK fetchers, metadata parsers, file getters, etc.) for creating Islandora import packages from the following:

  • CONTENTdm
    • single-file objects (images, audio, etc.)
    • multi-file PDFs
    • books
    • newspapers
    • non-book and non-newspaper compound objects
  • CSV
    • metadata and content files from a local filesystem for single-file objects (images, audio, etc.)
    • metadata and content files from a local filesystem for compound objects
    • metadata and content files from a local filesystem for books
    • metadata and content files from a local filesystem for newspaper issues
  • OAI-PMH
    • metadata and one PDF per article from an Open Journal Systems journal
    • metadata and one file per resource described in each OAI-PMH record if the record includes the URL to the file

Contributing

We welcome community development partners. Some features that would be really great to see include:

  • a graphical user interface on top of MIK
  • tools for creating mappings files (in addition to the Metadata Mappings Helper)
  • toolchains to migrate from DSpace and other repository platforms to Islandora (the OAI-PMH toolchain may already cover DSpace - testers welcome)
  • a toolchain to generate Samvera import packages (yes, it's called Move to Islandora Kit but it's flexible enough to create other types of ingest packages and we'd love to collaborate with some Samvera friends)
    • we have a sample CsvToJson toolchain that demonstrates that it's possible to write out packages that differ from those Islandora uses

MIK is designed to be extensible. If you have an idea for a useful manipulator or post-write hook script, please let us know.

CONTRIBUTING.md provides guidelines on how you can contribute to MIK. Our Information for Developers wiki page contains some information on coding standards, class structure, etc.

Maintainers/Sponsors

Contributors


mik's Issues

fetchers/Csv must implement queryTotalRec and getItemInfo

Class mik\fetchers\Csv contains 2 abstract methods and must therefore be declared abstract or implement the remaining methods (mik\fetchers\Fetcher::queryTotalRec, mik\fetchers\Fetcher::getItemInfo).

Please note that phpunit will fail until this issue is addressed.

Add more logging, and come up with a more readable log format

We have added Monolog logging in a few places, but there are a lot more where we should be logging the outcomes of file creation, etc. Currently, the log format is the Monolog default, e.g.:

[2015-07-01 17:36:31] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/1.xml"} []
[2015-07-01 17:36:32] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/2.xml"} []
[2015-07-01 17:36:32] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/3.xml"} []
[2015-07-01 17:36:33] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/4.xml"} []
[2015-07-01 17:36:34] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/5.xml"} []

We should consider creating a more readable log format as per https://github.com/Seldaek/monolog/blob/master/doc/01-usage.md.
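
For illustration, a more compact format could be configured with Monolog's LineFormatter along these lines (a sketch only; the channel name, log path, and format string are placeholders, not MIK's current configuration):

// Sketch: give Monolog a shorter, more readable line format.
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\LineFormatter;

// Placeholder format string and date format.
$formatter = new LineFormatter("[%datetime%] %level_name%: %message% %context%\n", "Y-m-d H:i:s");

// Placeholder log path and channel name.
$handler = new StreamHandler('/tmp/mik.log', Logger::INFO);
$handler->setFormatter($formatter);

$log = new Logger('filemanipulators');
$log->pushHandler($handler);
$log->info('MODS file validates', array('file' => '/tmp/mik_csv_output/1.xml'));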

Using the configuration file to pass parameters to metadatamanipulators

It may be useful to be able to pass parameters to metadata manipulators via the configuration file. For example, the breakTopicMetadaOnCharacter method of the FilterModsTopic metadatamanipulator class splits topic metadata on a delimiter which is now a parameter (see #26). One way we might consider doing this in the MANIPULATORS section of the configuration file is by following the named metadata manipulator with a pipe-delimited list of parameters:

[MANIPULATORS]
; One or more metadatamanipulators classes
metadatamanipulators[] = FilterModsTopic|param1|param2|...

Thoughts and use cases are appreciated.
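
A minimal sketch of how the piped values might be split out when the manipulators are instantiated (the variable names are illustrative, not MIK's actual code):

// Sketch: split "FilterModsTopic|param1|param2" into a class name and its parameters.
$entry = 'FilterModsTopic|param1|param2';
$parts = explode('|', $entry);
$class = array_shift($parts);  // "FilterModsTopic"
$params = $parts;              // array("param1", "param2")
// A hypothetical manipulator constructor could then receive the parameters:
// $manipulator = new $class($params);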

Fail more gracefully if unable to access directories from config input_directories

Fail gracefully, with a message that will make more sense to general users, should MIK be unable to access the directories listed in the FILE_GETTER input_directories[] configuration setting. This may happen if a network drive containing the input directories is unavailable, or if the user makes a mistake when providing the input directory path. Currently, MIK issues a fatal error that may cause unnecessary confusion among some users.

More generally, identify other places where mik should fail gracefully, creating new issues to help track these.

Need a way to filter specific "content types" in fetchers

Since our filegetters and writers are very specific about the types of files (and directory structures) they deal with, we need a way to handle sources that may contain objects of multiple "content types". For example, in CONTENTdm, a single collection can contain still images, movies, newspapers, books, and general compound objects. The Islandora batch loaders can only ingest objects of a single content type (although there are some exceptions to this), so we can't mix and match content of different Islandora content types in the ingest packages that are produced by an MIK run.

One mechanism we might consider is a "fetchermanipulator" class that will inspect each incoming source object and ignore it if it does not pass specific tests. Or, we could build similar checks into filegetter classes, so that if an incoming object does not pass some tests, it is just skipped. This latter solution might be simpler.

To use CONTENTdm as an example source: for compound objects, CONTENTdm provides an element in each object's .cpd file. For books, the value is 'Monographs', and for general compound objects, it's 'Document'. I'll try to come up with a comprehensive list.

Create exception class

We need to be able to catch errors at any point in the class hierarchy and have them bubble up to the main loop in mik, so that if an exception is caught, mik can move on to the next record in the loop after logging the error.

Ideally, we'd have a custom MIK Exception class that was extended by a fetcher exception class, a metadata parser exception class, a writer exception class, etc. I'm not sure how that would work, however.

@MarcusBarnes would you mind starting this?
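
One possible shape for the hierarchy, sketched under the assumption that exceptions bubble up to the main loop (the class names are illustrative):

// Sketch: a base MIK exception with per-component subclasses.
class MikException extends \Exception {}
class FetcherException extends MikException {}
class MetadataParserException extends MikException {}
class WriterException extends MikException {}

// In the main loop, catching the base class would let MIK log the error and move on:
// try {
//     $writer->writePackages($metadata, $child_pointers, $record_id);
// } catch (MikException $e) {
//     $log->error($e->getMessage(), array('record' => $record_id));
//     continue;
// }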

Remove .php extension from mik.php and add shebang

Let's rename mik.php to mik, and add an 'env' shebang as in this example:

#!/usr/bin/env php
<?php
print "Hi from test\n";
?>

That way if mik is executable we can run:

mik --config=foo.ini

If it's not executable, php mik --config=foo.ini will still work.

In field mappings, only add shared parent wrapper elements once

If we have two mappings that define MODS elements that share a common parent wrapper element, we should only add the parent element to the MODS XML once. For example, in the following mapping file the 'Medium' and 'Work Measurements' source fields map to MODS <form> and <note> elements, respectively:

Calendar name,<titleInfo><title>%value%</title></titleInfo>,
School name,"<name type=""corporate""><namePart>%value%</namePart></name>",
Medium,<physicalDescription><form>%value%</form></physicalDescription>,
Work Measurements,<physicalDescription><note>%value%</note></physicalDescription>,
Publisher,<originInfo><publisher>%value%</publisher></originInfo>,
Year,<originInfo><dateIssued>%value%</dateIssued></originInfo>,
Format type,<genre>%value%</genre>,
President,"<note type=""president"">%value%</note>",
Board members,"<note type=""board members"">%value%</note>",
Administrators,"<note type=""administrators"">%value%</note>",
Instructors,"<note type=""instructors"">%value%</note>",
"Staff(technicians,support staff)","<note type=""staff"">%value%</note>",
Degree/Diplomas/Programs,"<note type=""degree/diplomas/programs"">%value%</note>",
Majors/Concentration,"<note type=""majors/concentration"">%value%</note>",
Honorary Degree Recipients,"<note type=""honorary degree recipients"">%value%</note>",
Scholarships/Awards Recipients,"<note type=""scholarship/award recipients"">%value%</note>",
Notes,<note>%value%</note>,

These two MODS elements share the parent <physicalDescription>. Currently, the XML produced looks like this:

  <physicalDescription>
    <form>Paper</form>
  </physicalDescription>
  <physicalDescription>
    <note>16 x 24.4</note>
  </physicalDescription>

but we probably want:

  <physicalDescription>
    <form>Paper</form>
    <note>16 x 24.4</note>
  </physicalDescription>

Move XML utility functions from child classes into Mods.php

getChildNodesFromModsXMLString(), determineRepeatedWrapperChildElements(), consolidateWrapperElements(), oneParentWrapperElement(), and possibly applyMetadatamanipulators() are duplicated within CdmToMods.php and the new CsvToMods.php metadata parser. Is there any reason these functions can't live within the parent Mods.php class?

Handle multiple input directories in FILE_GETTER configuration

In the case that the files that will be used for the OBJ streams for a particular collection are spread across multiple directories, provide a way to handle multiple input directories in the FILE_GETTER configuration. (In particular, see the CdmNewspaper->getIssueMasterFiles method.)
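
Since input_directories is already an array-style setting, the configuration could presumably accept several entries, along these lines (the paths are placeholders):

[FILE_GETTER]
; Placeholder paths; each entry names one directory containing source files.
input_directories[] = /mnt/collection/batch1
input_directories[] = /mnt/collection/batch2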

Change CdmSingleFile file getter and/or writer to get the thumbnail from CONTENTdm

The Islandora Batch module will load thumbnails for objects (tested with JPEG2000 objects) if they have a .jpg extension. For example, an object with files 3456.jp2, 3456.xml, and 3456.jpg will be loaded as expected, with the datastreams JP2, MODS, and TN respectively.

Note that if a file with an extension appropriate to the OBJ datastream (.tif in the large image SP example) is missing, Islandora will complain that the OBJ is missing. Prevent this from happening by enabling "Defer derivative generation during ingest" at admin/islandora/configure before running the batch ingest job.

The task in this issue is to have the CdmSingleFile filegetter and/or CdmSingleFile writer get the CONTENTdm object's thumbnail and add it to the output directory.

CdmPhpDocuments.php is consuming excessive memory

file_put_contents() at https://github.com/MarcusBarnes/mik/blob/master/src/writers/CdmPhpDocuments.php#L56 is throwing an "Allowed memory size of xxx bytes exhausted" error even if the PHP CLI memory_limit setting is 512M.

The source PDF file's contents are retrieved at https://github.com/MarcusBarnes/mik/blob/master/src/filegetters/CdmPhpDocuments.php#L67. One way to avoid keeping the entire source PDF's contents in memory as a string would be to write the file to disk in this function, then have the writer class simply move it to the destination currently created by file_put_contents(). This would increase disk activity, but that's probably easier to deal with than memory allocation.

If this solution is acceptable, I can take a stab at this.
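
A rough sketch of the streaming approach (the URL and paths are placeholders, and this is not the current filegetter code):

// Sketch: stream the remote PDF straight to a temporary file instead of
// holding its entire contents in a PHP string.
$source = fopen('http://example.com/path/to/source.pdf', 'rb'); // placeholder URL
$dest = fopen('/tmp/mik_temp/source.pdf', 'wb');                 // placeholder path
stream_copy_to_stream($source, $dest);
fclose($source);
fclose($dest);
// The writer could then rename() or copy() the temporary file into the output directory.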

If the mappings file contains a row for a field that is not in the source collection, a non-fatal error is triggered.

If the mappings file contains a row for a field that is not in the source collection, a series of errors is triggered (for example, in CdmToMods). (The mappings are human-made, and a row for a field may not match due to spelling differences, absences, or other reasons.)

Suggested fix: add a check (in CdmToMods) to ensure that the $CONTENTdmField key is present in the $CONTENTdmFieldValuesArray array, and if it is not, log an error.
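
A minimal sketch of the suggested check (the variable names follow the issue description, but the surrounding logic is illustrative):

// Sketch: skip mappings rows whose source field is not present in the record.
if (!array_key_exists($CONTENTdmField, $CONTENTdmFieldValuesArray)) {
    // Log and skip this mapping row instead of triggering a PHP notice,
    // e.g. $log->warning('Mapped field not found in record', array('field' => $CONTENTdmField));
} else {
    $value = $CONTENTdmFieldValuesArray[$CONTENTdmField];
    // ... build the MODS snippet from $value as usual ...
}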

Call to $writer->writePackages() in mik should have a $record_id parameter

Passing a $record_id parameter to $writer->writePackages($metadata, $child_pointers); will give us convenient access to a unique ID to use for single-file content model ingest packages. The only backward compatibility issue will be at writers/CdmNewspapers.php line 57, but that is easily fixed by adding the new parameter to the end of writePackages().

To summarize, the call to $writer->writePackages() in mik should be:

$writer->writePackages($metadata, $child_pointers, $record_id);

Generalize $record_id in mik

$record_id in the main loop in mik is specific to CONTENTdm records, which are identified by a "pointer". Before we do any additional work on non-CONTENTdm fetchers (like the CSV fetcher), we should remove CONTENTdm-specific naming and logic from mik, and establish a pattern for other record types.

Warning issued when no metadatamanipulators set in configuration file.

The CdmToMods class currently has a built-in assumption that at least one metadata manipulator is set in the configuration file, but this need not be the case. As a result, an undefined index notice is issued when no metadata manipulators are set in the configuration file.

Add a file manipulator that validates the MODS file

File manipulators can validate files generated by MIK. We should provide a file manipulator that runs each MODS.xml file through xmllint or, more portably, through PHP's built-in DOM validation, using a local copy of the MODS schema file. We'd need to figure out how to report invalid MODS files or other errors, a good task for logging.

I can take a stab at this.
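
A sketch of the DOM-based approach, assuming a local copy of the MODS schema (the file paths are placeholders, and error reporting would hook into MIK's logging):

// Sketch: validate a generated MODS file against a local copy of the MODS schema.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->load('/tmp/mik_csv_output/1.xml');               // placeholder output file
if ($dom->schemaValidate('/path/to/local/mods.xsd')) { // placeholder schema path
    // e.g. $log->info('MODS file validates', array('file' => '/tmp/mik_csv_output/1.xml'));
} else {
    foreach (libxml_get_errors() as $error) {
        // e.g. $log->error(trim($error->message), array('file' => '/tmp/mik_csv_output/1.xml'));
    }
    libxml_clear_errors();
}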

General code cleanup

We should remove the echoPhrase() and testMethod() functions from various classes.

CLI parameter --limit no longer handled properly

I've just pulled down the latest commits to master (a57cb58), and the CLI --limit parameter appears to no longer function as expected. I used --limit=4 when testing against a newspaper collection, but MIK continued outputting more than 4 newspaper issues.

PHPUnit tests failing

Running phpunit --bootstrap vendor/autoload.php tests creates fatal errors.

Please update the code and related tests so that phpunit runs successfully. Additionally, identify potential unit tests to add in separate issues.

Use Csv parsing library in CdmToMods class

Use the League\Csv parsing library in the getMappingsArray method of the metadataparsers/mods/CdmToMods.php class. The League\Csv parsing library is already in use in the CSV fetcher. Initial testing by @mjordan indicates that the library handles CSV files exported from various spreadsheet programs more robustly than the current code in the getMappingsArray method.
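
For illustration, reading the mappings file with League\Csv might look roughly like this (a sketch only; the real getMappingsArray code will differ):

// Sketch: read the metadata mappings CSV with League\Csv instead of hand-rolled parsing.
use League\Csv\Reader;

$reader = Reader::createFromPath('/path/to/mappings.csv', 'r'); // placeholder path
$mappings = array();
foreach ($reader as $row) {
    // e.g. $row[0] is the source field label, $row[1] the mapping snippet.
    $mappings[] = $row;
}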

Document simplified mappings file usage

Document the simplified mappings file structure from #7 in the project wiki and/or the developer README. Give examples of usage for CONTENTdm and other data sources as appropriate.

Make HTTP requests more robust against network connectivity and latency issues

Currently, MIK uses several methods that rely on the file_get_contents function. This makes MIK particularly susceptible to network connectivity issues (say, if the network cuts out momentarily) and latency issues for larger collections. Below are some suggestions for approaches that may make MIK more robust against these network issues when reading remote files:

  • Use Guzzle - a PHP HTTP client and framework for consuming RESTful web services (see the sketch after this list)
  • Use cURL for its enhanced error reporting, and add appropriate logic and logging to make error handling more robust.
  • Investigate how Drupal's drupal_http_request function was implemented for ideas on how to create more robust file request methods within MIK.
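
For example, a Guzzle-based request might look roughly like this (a sketch only; the URL and file paths are placeholders, and retry/backoff logic is left out):

// Sketch: fetch a remote file with Guzzle instead of file_get_contents.
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client(array('timeout' => 60)); // fail instead of hanging on latency
try {
    $client->request('GET', 'http://example.com/source/file.jp2', array(
        'sink' => '/tmp/mik_temp/file.jp2', // stream the response body to disk
    ));
} catch (RequestException $e) {
    // e.g. $log->error('Could not retrieve file', array('error' => $e->getMessage()));
}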

Other suggestions are welcome. Please comment below.

Make fetchers/Fetcher.php an abstract class

Refactor fetchers/Fetcher.php as an abstract class to be extended by particular instances (such as fetchers/Cdm.php or fetchers/Csv.php). In particular, be sure to include versions of the getItemInfo() and queryTotalRec() methods to force extending classes to define these methods.

PHP Abstract Class documentation: http://php.net/manual/en/language.oop5.abstract.php

After successfully completing this task, review other sections of MIK where a similar abstraction of the parent classes makes sense and create ToDo issues for the tasks.
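
A rough sketch of the shape this could take (the parameter name is illustrative; the real class has more responsibilities):

// Sketch: an abstract Fetcher that forces subclasses to implement the core methods.
abstract class Fetcher
{
    // Return the total number of records provided by the source.
    abstract public function queryTotalRec();

    // Return the metadata for a single record identified by $record_key.
    abstract public function getItemInfo($record_key);
}

// Concrete fetchers such as Cdm or Csv would then extend Fetcher and provide
// their own implementations of these two methods.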

Add support for multiple fetchermanipulators

The current implementation of fetcher manipulators only allows for one. Add support for multiple, like we have for metadata manipulators. We'd probably need to apply each manipulator in a specific order so that each one progressively reduces the fetched record set.

Provide sample CONTENTdm to MODS mapping file

Provide a sample file to populate the mapping_csv_path value in config files. Running mik without a mappings file results in this error:

PHP Warning:  fopen(): Filename cannot be empty in /home/mark/Documents/hacking/mik/src/metadataparsers/mods/CdmToMods.php on line 75
Unable to open file.mark@mark-ThinkPad-X230:~/Documents/hacking/mik$ vi /home/mark/Documents/hacking/mik/src/metadataparsers/mods/CdmToMods.php

Language of field not being populated

Using the format of the mappings file defined in #7, the 'Language of field' value is not being applied. For example, a mappings file with a row like this:

Publisher,eng,<originInfo><publisher>%value%</publisher></originInfo>,

should produce markup like this:

<originInfo>
    <publisher lang="eng">Vancouver School of Art</publisher>
</originInfo>

Currently, the markup being produced is like this:

<originInfo>
    <publisher>Vancouver School of Art</publisher>
</originInfo>

Simplify mappings file and related code

A mapping file will also be necessary for non-CONTENTdm sources.

Task: Simplify the mapping structure. For example, using these columns:

source field label | language of field | target element | note

The 'Language of field' and 'note' columns would be optional.

Thank you to @mjordan for the suggestions.

Add method to Fetcher abstract class to check well-formedness of snippets in mapping file

Since snippets in metadata mapping files are XML, manually creating them is error-prone. MIK should check each snippet for well-formedness before it proceeds with creating import packages, and if it detects an error, quit and tell the user to check the bad snippet(s). Perhaps a good place for this is in the Fetcher abstract class. I can start this feature.
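
A sketch of one way to test a single snippet (the surrounding loop over mappings rows and the error reporting are illustrative):

// Sketch: check that one mappings-file snippet is well-formed XML.
libxml_use_internal_errors(true);
$snippet = '<originInfo><publisher>%value%</publisher></originInfo>'; // one row's snippet
if (simplexml_load_string($snippet) === false) {
    foreach (libxml_get_errors() as $error) {
        // e.g. report the row number and trim($error->message) to the user, then quit.
    }
    libxml_clear_errors();
}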

Monolog logger creates warning message when date.timezone PHP ini setting not set.

A warning message is created by Monolog when the date.timezone PHP ini setting is not set. We can check for this at the top of the main MIK script and set a default date.timezone (I have some code in a local feature branch) - is this a good approach? Additionally and/or alternatively, where's the best place to document this? Is there a way to check for this setting using Composer?
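
For reference, the check mentioned above could look something like this near the top of the main MIK script (the fallback zone is an arbitrary example, not a decided default):

// Sketch: avoid Monolog's warning by making sure date.timezone is set.
if (!ini_get('date.timezone')) {
    date_default_timezone_set('America/Vancouver'); // arbitrary example fallback
}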
