
MIK, the Move to Islandora Kit.


Overview

The Move to Islandora Kit (MIK) converts source content files and accompanying metadata into ingest packages used by existing Islandora batch ingest modules: Islandora Batch, Islandora Newspaper Batch, Islandora Book Batch, and Islandora Compound Batch. In other words, it doesn't import objects into Islandora; it prepares content for importing into Islandora:

[Diagram: MIK overview]

MIK is designed to be extensible. The base classes that convert the source metadata to XML files for importing into Islandora, and that convert the source content files into the required directory structure for importing, can easily be subclassed. MIK also uses plugins (known as "manipulators") and a set of "hook" scripts that provide functionality which can be turned on or off for specific jobs.

MIK was originally developed by staff at Simon Fraser University Library in support of their migration from CONTENTdm to Islandora, but its longer-term purpose is as a general toolkit for preparing content for importing into Islandora. So MIK should really stand for "Move [content into] Islandora Kit."

Documentation

We are continuing to improve our documentation, which is on the MIK wiki. Please let us know if you have any questions or suggestions, or if you would like to assist.

Troubleshooting and support

If you have a question, please open an issue.

Islandora content that has been prepared using MIK

Installation

Instructions are available on the wiki.

Usage

The typical workflow is to 1) configure your toolchain (defined below) by creating an .ini file, 2) check your configuration options, and then 3) run MIK to perform the conversion of your source content into Islandora ingest packages. When MIK finishes running, you can import your content into Islandora using Islandora Batch, Islandora Newspaper Batch, Islandora Book Batch, or Islandora Compound Batch.

1. Configure your toolchain

In a nutshell, this means creating an .ini file for MIK. Details for available toolchains are provided on the wiki.
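
As a rough illustration only (the section and key names below are partly hypothetical; the wiki documents the real options for each toolchain), a minimal .ini file might look something like this:

[FILE_GETTER]
; Placeholder path; points at the directory holding your source content files.
input_directories[] = /path/to/source/files

[METADATA_PARSER]
; Hypothetical section name; mapping_csv_path points at your metadata mappings file.
mapping_csv_path = /path/to/mappings.csv

[MANIPULATORS]
; One or more metadatamanipulators classes, as described later on this page.
metadatamanipulators[] = FilterModsTopic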

2. Check your configuration

To check your configuration options, run MIK and include the --checkconfig (or -cc) option with a value 'all':

./mik --config foo.ini --checkconfig all

You can also check specific types of configuration values as described in this Cookbook entry.

Note: if you are using null mappings for metadata manipulators, --checkconfig will return errors. Use --ignore_null_mappings to avoid these.

3. Convert your source content into Islandora ingest packages

Once you have checked your configuration options, you can run MIK to perform the data conversion:

./mik --config foo.ini

On Windows, you'll need to run:

php mik --config foo.ini

The --config option is required, but you can also add a --limit option if you only want to create a specific number of import packages. This option is useful for testing. For example:

./mik --config foo.ini --limit 10

The above will generate 10 valid packages if possible. Check the problem_records log for any that were skipped due to errors.

Once MIK starts running, it will display its progress:

./mik --config foo.ini
Creating 10 Islandora ingest packages. Please be patient.
===================================================>                          56%

and when finished will tell you where your ingest packages have been saved and where your log file is.

4. Load your content into Islandora

And you're done. In practice, you probably want to do some quality assurance on the Islandora ingest packages before you import them (and MIK provides some helper scripts to do that). If you're not happy with what MIK produced, you can always modify your configuration settings or your metadata mappings file and run MIK again.

Current status

We aim for a 1.0 release of MIK in fall 2017. Please note that the only differences between version 0.9 and 1.0 will be the addition of more features, automated tests, and code cleanup. Version 0.9 is already being used in production.

So far, we have "toolchains" (complete sets of MIK fetchers, metadata parsers, file getters, etc.) for creating Islandora import packages from the following:

  • CONTENTdm
    • single-file objects (images, audio, etc.)
    • multi-file PDFs
    • books
    • newspapers
    • non-book and non-newspaper compound objects
  • CSV
    • metadata and content files from a local filesystem for single-file objects (images, audio, etc.)
    • metadata and content files from a local filesystem for compound objects
    • metadata and content files from a local filesystem for books
    • metadata and content files from a local filesystem for newspaper issues
  • OAI-PMH
    • metadata and one PDF per article from an Open Journal Systems journal
    • metadata and one file per resource described in each OAI-PMH record if the record includes the URL to the file

Contributing

We welcome community development partners. Some features that would be really great to see include:

  • a graphical user interface on top of MIK
  • tools for creating mappings files (in addition to the Metadata Mappings Helper)
  • toolchains to migrate from DSpace and other repository platforms to Islandora (the OAI-PMH toolchain may already cover DSpace - testers welcome)
  • a toolchain to generate Samvera import packages (yes, it's called Move to Islandora Kit but it's flexible enough to create other types of ingest packages and we'd love to collaborate with some Samvera friends)
    • we have a sample CsvToJson toolchain that demonstrates that it's possible to write out packages that differ from those Islandora uses

MIK is designed to be extensible. If you have an idea for a useful manipulator or post-write hook script, please let us know.

CONTRIBUTING.md provides guidelines on how you can contribute to MIK. Our Information for Developers wiki page contains some information on coding standards, class structure, etc.

Maintainers/Sponsors

Contributors


mik's Issues

fetchers/Csv must implement queryTotalRec and getItemInfo

Class mik\fetchers\Csv contains 2 abstract methods and must therefore be declared abstract or implement the remaining methods (mik\fetchers\Fetcher::queryTotalRec, mik\fetchers\Fetcher::getItemInfo).

Please note that phpunit will fail until this issue is addressed.

Add more logging, and come up with a more readable log format

We have added Monolog logging in a few places, but there are a lot more where we should be logging the outcomes of file creation, etc. Currently, the log format is the Monolog default, e.g.:

[2015-07-01 17:36:31] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/1.xml"} []
[2015-07-01 17:36:32] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/2.xml"} []
[2015-07-01 17:36:32] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/3.xml"} []
[2015-07-01 17:36:33] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/4.xml"} []
[2015-07-01 17:36:34] filemanipulators.INFO: MODS file validates {"file":"/tmp/mik_csv_output/5.xml"} []

We should consider creating a more readable log format as per https://github.com/Seldaek/monolog/blob/master/doc/01-usage.md.
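
For illustration, a more compact format could be configured with Monolog's LineFormatter along these lines (a sketch only; the channel name, log path, and format string are placeholders, not MIK's current configuration):

// Sketch: give Monolog a shorter, more readable line format.
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\LineFormatter;

// Placeholder format string and date format.
$formatter = new LineFormatter("[%datetime%] %level_name%: %message% %context%\n", "Y-m-d H:i:s");

// Placeholder log path and channel name.
$handler = new StreamHandler('/tmp/mik.log', Logger::INFO);
$handler->setFormatter($formatter);

$log = new Logger('filemanipulators');
$log->pushHandler($handler);
$log->info('MODS file validates', array('file' => '/tmp/mik_csv_output/1.xml'));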

Using the configuration file to pass parameters to metadatamanipulators

It may be useful to be able to pass parameters to metadata manipulators via the configuration file. For example, the breakTopicMetadaOnCharacter method of the FilterModsTopic metadatamanipulator class splits topic metadata on a delimiter which is now a parameter (see #26). One way we might consider doing this in the MANIPULATORS section of the configuration file is by following the named metadata manipulator with a pipe-delimited list of parameters:

[MANIPULATORS]
; One or more metadatamanipulators classes
metadatamanipulators[] = FilterModsTopic|param1|param2|...

Thoughts and use cases are appreciated.
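
A minimal sketch of how the piped values might be split out when the manipulators are instantiated (the variable names are illustrative, not MIK's actual code):

// Sketch: split "FilterModsTopic|param1|param2" into a class name and its parameters.
$entry = 'FilterModsTopic|param1|param2';
$parts = explode('|', $entry);
$class = array_shift($parts);  // "FilterModsTopic"
$params = $parts;              // array("param1", "param2")
// A hypothetical manipulator constructor could then receive the parameters:
// $manipulator = new $class($params);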

Fail more gracefully if unable to access directories from config input_directories

Fail gracefully, with a message that will make more sense to general users, should MIK be unable to access the directories listed in the FILE_GETTER input_directories[] configuration setting. This may happen if a network drive containing the input directories is unavailable, or if the user makes a mistake when providing the input directory path. Currently, MIK issues a fatal error that may cause unnecessary confusion among some users.

More generally, identify other places where mik should fail gracefully, creating new issues to help track these.

Need a way to filter specific "content types" in fetchers

Since our filegetters and writers are very specific about the types of files (and directory structures) they deal with, we need a way to handle sources that may contain objects of multiple "content types". For example, in CONTENTdm, a single collection can contain still images, movies, newspapers, books, and general compound objects. The Islandora batch loaders can only ingest objects of a single content type (although there are some exceptions to this), so we can't mix and match content of different Islandora content types in the ingest packages that are produced by an MIK run.

One mechanism we might consider is a "fetchermanipulator" class that will inspect each incoming source object and ignore it if it does not pass specific tests. Or, we could build similar checks into filegetter classes, so that if an incoming object does not pass some tests, it is just skipped. This latter solution might be simpler.

To use CONTENTdm as an example source: for compound objects, CONTENTdm provides an element in each object's .cpd file. For books, the value is 'Monographs', and for general compound objects, it's 'Document'. I'll try to come up with a comprehensive list.

Create exception class

We need to be able to catch errors at any point in the class hierarchy and have them bubble up to the main loop in mik, so that if an exception is caught, mik can move on to the next record in the loop after logging the error.

Ideally, we'd have a custom MIK Exception class that was extended by a fetcher exception class, a metadata parser exception class, a writer exception class, etc. I'm not sure how that would work, however.

@MarcusBarnes would you mind starting this?
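
One possible shape for the hierarchy, sketched under the assumption that exceptions bubble up to the main loop (the class names are illustrative):

// Sketch: a base MIK exception with per-component subclasses.
class MikException extends \Exception {}
class FetcherException extends MikException {}
class MetadataParserException extends MikException {}
class WriterException extends MikException {}

// In the main loop, catching the base class would let MIK log the error and move on:
// try {
//     $writer->writePackages($metadata, $child_pointers, $record_id);
// } catch (MikException $e) {
//     $log->error($e->getMessage(), array('record' => $record_id));
//     continue;
// }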

Remove .php extension from mik.php and add shebang

Let's rename mik.php to mik, and add an 'env' shebang as in this example:

#!/usr/bin/env php
<?php
print "Hi from test\n";
?>

That way if mik is executable we can run:

mik --config=foo.ini

If it's not executable, php mik --config=foo.ini will still work.

In field mappings, only add shared parent wrapper elements once

If we have two mappings that define MODS elements that share a common parent wrapper element, we should only add the parent element to the MODS XML once. For example, in the following mapping file the 'Medium' and 'Work Measurements' source fields map to MODS <form> and <note> elements, respectively:

Calendar name,<titleInfo><title>%value%</title></titleInfo>,
School name,"<name type=""corporate""><namePart>%value%</namePart></name>",
Medium,<physicalDescription><form>%value%</form></physicalDescription>,
Work Measurements,<physicalDescription><note>%value%</note></physicalDescription>,
Publisher,<originInfo><publisher>%value%</publisher></originInfo>,
Year,<originInfo><dateIssued>%value%</dateIssued></originInfo>,
Format type,<genre>%value%</genre>,
President,"<note type=""president"">%value%</note>",
Board members,"<note type=""board members"">%value%</note>",
Administrators,"<note type=""administrators"">%value%</note>",
Instructors,"<note type=""instructors"">%value%</note>",
"Staff(technicians,support staff)","<note type=""staff"">%value%</note>",
Degree/Diplomas/Programs,"<note type=""degree/diplomas/programs"">%value%</note>",
Majors/Concentration,"<note type=""majors/concentration"">%value%</note>",
Honorary Degree Recipients,"<note type=""honorary degree recipients"">%value%</note>",
Scholarships/Awards Recipients,"<note type=""scholarship/award recipients"">%value%</note>",
Notes,<note>%value%</note>,

These two MODS elements share the parent <physicalDescription>. Currently, the XML produced looks like this:

  <physicalDescription>
    <form>Paper</form>
  </physicalDescription>
  <physicalDescription>
    <note>16 x 24.4</note>
  </physicalDescription>

but we probably want:

  <physicalDescription>
    <form>Paper</form>
    <note>16 x 24.4</note>
  </physicalDescription>

Move XML utility functions from child classes into Mods.php

getChildNodesFromModsXMLString(), determineRepeatedWrapperChildElements(), consolidateWrapperElements(), oneParentWrapperElement(), and possibly applyMetadatamanipulators() are duplicated within CdmToMods.php and the new CsvToMods.php metadata parser. Is there any reason these functions can't live within the parent Mods.php class?

Handle multiple input directories in FILE_GETTER configuration

In the case that the files that will be used for the OBJ streams for a particular collection are spread across multiple directories, provide a way to handle multiple input directories in the FILE_GETTER configuration. (In particular, see the CdmNewspaper->getIssueMasterFiles method.)
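
Since input_directories is already an array-style setting, the configuration could presumably accept several entries, along these lines (the paths are placeholders):

[FILE_GETTER]
; Placeholder paths; each entry names one directory containing source files.
input_directories[] = /mnt/collection/batch1
input_directories[] = /mnt/collection/batch2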

Change CdmSingleFile file getter and/or writer to get the thumbnail from CONTENTdm

The Islandora Batch module will load thumbnails for objects (tested with JPEG2000 objects) if they have a .jpg extension. For example, an object with files 3456.jp2, 3456.xml, and 3456.jpg will be loaded as expected, with the datastreams JP2, MODS, and TN respectively.

Note that if a file with an extension appropriate to the OBJ datastream (.tif in the large image SP example) is missing, Islandora will complain that the OBJ is missing. Prevent this from happening by enabling "Defer derivative generation during ingest" at admin/islandora/configure before running the batch ingest job.

The task in this issue is to have the CdmSingleFile filegetter and/or CdmSingleFile writer get the CONTENTdm object's thumbnail and add it to the output directory.

CdmPhpDocuments.php is consuming excessive memory

file_put_contents() at https://github.com/MarcusBarnes/mik/blob/master/src/writers/CdmPhpDocuments.php#L56 is throwing an "Allowed memory size of xxx bytes exhausted" error even if the PHP CLI memory_limit setting is 512M.

The source PDF file's contents are retrieved at https://github.com/MarcusBarnes/mik/blob/master/src/filegetters/CdmPhpDocuments.php#L67. One way to avoid keeping the entire source PDF's contents in memory as a string would be to write the file to disk in this function, then have the writer class simply move it to the destination currently created by file_put_contents(). This would increase disk activity, but that's probably easier to deal with than memory allocation.

If this solution is acceptable, I can take a stab at this.
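
A rough sketch of the streaming approach (the URL and paths are placeholders, and this is not the current filegetter code):

// Sketch: stream the remote PDF straight to a temporary file instead of
// holding its entire contents in a PHP string.
$source = fopen('http://example.com/path/to/source.pdf', 'rb'); // placeholder URL
$dest = fopen('/tmp/mik_temp/source.pdf', 'wb');                 // placeholder path
stream_copy_to_stream($source, $dest);
fclose($source);
fclose($dest);
// The writer could then rename() or copy() the temporary file into the output directory.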

If the mappings file contains a row for a field that is not in the source collection, a non-fatal error is triggered.

If the mappings file contains a row for a field that is not in the source collection, a series of errors is triggered (for example, in CdmToMods). (The mappings are human-made, and a row for a field may not match due to spelling differences, absences, or other reasons.)

Suggested fix: add a check (in CdmToMods) to ensure that the $CONTENTdmField key is present in the $CONTENTdmFieldValuesArray array, and if it is not, log an error.
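
A minimal sketch of the suggested check (the variable names follow the issue description, but the surrounding logic is illustrative):

// Sketch: skip mappings rows whose source field is not present in the record.
if (!array_key_exists($CONTENTdmField, $CONTENTdmFieldValuesArray)) {
    // Log and skip this mapping row instead of triggering a PHP notice,
    // e.g. $log->warning('Mapped field not found in record', array('field' => $CONTENTdmField));
} else {
    $value = $CONTENTdmFieldValuesArray[$CONTENTdmField];
    // ... build the MODS snippet from $value as usual ...
}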

Call to $writer->writePackages() in mik should have a $record_id parameter

Passing a $record_id parameter to $writer->writePackages($metadata, $child_pointers); will give us convenient access to a unique ID to use for single-file content model ingest packages. The only backward compatibility issue will be at writers/CdmNewspapers.php line 57, but that is easily fixed by adding the new parameter to the end of writePackages().

To summarize, the call to $writer->writePackages() in mik should be:

$writer->writePackages($metadata, $child_pointers, $record_id);

Generalize $record_id in mik

$record_id in the main loop in mik is specific to CONTENTdm records, which are identified by a "pointer". Before we do any additional work on non-CONTENTdm fetchers (like the CSV fetcher), we should remove CONTENTdm-specific naming and logic from mik, and establish a pattern for other record types.

Warning issued when no metadatamanipulators set in configuration file.

The CdmToMods class currently has a built-in assumption that at least one metadata manipulator is set in the configuration file, but this need not be the case. As a result, an undefined index notice is issued when no metadata manipulators are set in the configuration file.

Add a file manipulator that validates the MODS file

File manipulators can validate files generated by MIK. We should provide a file manipulator that runs each MODS.xml file through xmllint or, more portably, through PHP's built-in DOM validation, using a local copy of the MODS schema file. We'd need to figure out how to report invalid MODS files or other errors, a good task for logging.

I can take a stab at this.
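
A sketch of the DOM-based approach, assuming a local copy of the MODS schema (the file paths are placeholders, and error reporting would hook into MIK's logging):

// Sketch: validate a generated MODS file against a local copy of the MODS schema.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->load('/tmp/mik_csv_output/1.xml');               // placeholder output file
if ($dom->schemaValidate('/path/to/local/mods.xsd')) { // placeholder schema path
    // e.g. $log->info('MODS file validates', array('file' => '/tmp/mik_csv_output/1.xml'));
} else {
    foreach (libxml_get_errors() as $error) {
        // e.g. $log->error(trim($error->message), array('file' => '/tmp/mik_csv_output/1.xml'));
    }
    libxml_clear_errors();
}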

General code cleanup

We should remove the echoPhrase() and testMethod() functions from various classes.

CLI parameter --limit no longer handled properly

I've just pulled down the latest commits to master (a57cb58), and the CLI --limit parameter appears to no longer function as expected. I used --limit=4 when testing against a newspaper collection, but MIK continued outputting more than 4 newspaper issues.

PHPUnit tests failing

Running phpunit --bootstrap vendor/autoload.php tests creates fatal errors.

Please update the code and related tests so that phpunit runs successfully. Additionally, identify potential unit tests to add in separate issues.

Use Csv parsing library in CdmToMods class

Use the League\Csv parsing library in the getMappingsArray method of the metadataparsers/mods/CdmToMods.php class. The League\Csv parsing library is already in use in the CSV fetcher. Initial testing by @mjordan indicates that the library handles CSV files exported from various spreadsheet programs more robustly than the current code in the getMappingsArray method.
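
For illustration, reading the mappings file with League\Csv might look roughly like this (a sketch only; the real getMappingsArray code will differ):

// Sketch: read the metadata mappings CSV with League\Csv instead of hand-rolled parsing.
use League\Csv\Reader;

$reader = Reader::createFromPath('/path/to/mappings.csv', 'r'); // placeholder path
$mappings = array();
foreach ($reader as $row) {
    // e.g. $row[0] is the source field label, $row[1] the mapping snippet.
    $mappings[] = $row;
}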

Document simplified mappings file usage

Document the simplified mappings file structure from #7 in the project wiki and/or the developer README. Give examples of usage for CONTENTdm and other data sources as appropriate.

Make HTTP requests more robust against network connectivity and latency issues

Currently, MIK uses several methods that rely on the file_get_contents function. This makes MIK particularly susceptible to network connectivity issues (say, if the network cuts out momentarily) and latency issues for larger collections. Below are some suggestions for approaches that may make MIK more robust against these network issues when reading remote files:

  • Use Guzzle - a PHP HTTP client and framework for consuming RESTful web services (see the sketch after this list)
  • Use cURL for its enhanced error reporting, and add appropriate logic and logging to make error handling more robust.
  • Investigate how Drupal's drupal_http_request function was implemented for ideas on how to create more robust file request methods within MIK.
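
For example, a Guzzle-based request might look roughly like this (a sketch only; the URL and file paths are placeholders, and retry/backoff logic is left out):

// Sketch: fetch a remote file with Guzzle instead of file_get_contents.
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client(array('timeout' => 60)); // fail instead of hanging on latency
try {
    $client->request('GET', 'http://example.com/source/file.jp2', array(
        'sink' => '/tmp/mik_temp/file.jp2', // stream the response body to disk
    ));
} catch (RequestException $e) {
    // e.g. $log->error('Could not retrieve file', array('error' => $e->getMessage()));
}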

Other suggestions are welcome. Please comment below.

Make fetchers/Fetcher.php an abstract class

Refactor fetchers/Fetcher.php as an abstract class to be extended by particular instances (such as fetchers/Cdm.php or fetchers/Csv.php). In particular, be sure to include versions of the getItemInfo() and queryTotalRec() methods to force extending classes to define these methods.

PHP Abstract Class documentation: http://php.net/manual/en/language.oop5.abstract.php

After successfully completing this task, review other sections of MIK where a similar abstraction of the parent classes makes sense and create ToDo issues for the tasks.
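
A rough sketch of the shape this could take (the parameter name is illustrative; the real class has more responsibilities):

// Sketch: an abstract Fetcher that forces subclasses to implement the core methods.
abstract class Fetcher
{
    // Return the total number of records provided by the source.
    abstract public function queryTotalRec();

    // Return the metadata for a single record identified by $record_key.
    abstract public function getItemInfo($record_key);
}

// Concrete fetchers such as Cdm or Csv would then extend Fetcher and provide
// their own implementations of these two methods.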

Add support for multiple fetchermanipulators

The current implementation of fetcher manipulators only allows for one. Add support for multiple, like we have for metadata manipulators. We'd probably need to apply each manipulator in a specific order so that each one progressively reduces the fetched record set.

Provide sample CONTENTdm to MODS mapping file

Provide a sample file to populate the mapping_csv_path value in config files. Running mik without a mappings file results in this error:

PHP Warning:  fopen(): Filename cannot be empty in /home/mark/Documents/hacking/mik/src/metadataparsers/mods/CdmToMods.php on line 75
Unable to open file.mark@mark-ThinkPad-X230:~/Documents/hacking/mik$ vi /home/mark/Documents/hacking/mik/src/metadataparsers/mods/CdmToMods.php

Language of field not being populated

Using the format of the mappings file defined in #7, the 'Language of field' value is not being applied. For example, a mappings file with a row like this:

Publisher,eng,<originInfo><publisher>%value%</publisher></originInfo>,

should produce markup like this:

<originInfo>
    <publisher lang="eng">Vancouver School of Art</publisher>
</originInfo>

Currently, the markup being produced is like this:

<originInfo>
    <publisher>Vancouver School of Art</publisher>
</originInfo>

Simplify mappings file and related code

A mapping file will also be necessary for non-CONTENTdm sources.

Task: Simplify the mapping structure. For example, using these columns:

source field label | language of field | target element | note

The 'Language of field' and 'note' columns would be optional.

Thank you to @mjordan for the suggestions.

Add method to Fetcher abstract class to check well-formedness of snippets in mapping file

Since snippets in metadata mapping files are XML, manually creating them is error-prone. MIK should check each snippet for well-formedness before it proceeds with creating import packages, and if it detects an error, quit and tell the user to check the bad snippet(s). Perhaps a good place for this is in the Fetcher abstract class. I can start this feature.
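
A sketch of one way to test a single snippet (the surrounding loop over mappings rows and the error reporting are illustrative):

// Sketch: check that one mappings-file snippet is well-formed XML.
libxml_use_internal_errors(true);
$snippet = '<originInfo><publisher>%value%</publisher></originInfo>'; // one row's snippet
if (simplexml_load_string($snippet) === false) {
    foreach (libxml_get_errors() as $error) {
        // e.g. report the row number and trim($error->message) to the user, then quit.
    }
    libxml_clear_errors();
}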

Monolog logger creates warning message when date.timezone PHP ini setting not set.

A warning message is created by Monolog when the date.timezone PHP ini setting is not set. We can check for this at the top of the main MIK script and set a default date.timezone (I have some code in a local feature branch) - is this a good approach? Additionally and/or alternatively, where's the best place to document this? Is there a way to check for this setting using Composer?
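
For reference, the check mentioned above could look something like this near the top of the main MIK script (the fallback zone is an arbitrary example, not a decided default):

// Sketch: avoid Monolog's warning by making sure date.timezone is set.
if (!ini_get('date.timezone')) {
    date_default_timezone_set('America/Vancouver'); // arbitrary example fallback
}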
