
ami's Introduction

ami

Archipelago Multi Importer. A module of mass ingest made for the masses

ami's People

Contributors

alliomeria, diegopino, giancarlobi, patdunlavey


ami's Issues

AMI Presets

What?

Option to preserve and reuse Data Transformation and other mappings for an AMI Set

How?

  • On Step 5 of AMI Setup, have a checkbox to 'Save these configurations as an AMI Preset' (@DiegoPino option to name?)
  • On Step 3 of AMI Setup, have an option to select from a list of Presets
  • Inline documentation notes informing users to 'Check these configurations, as your new AMI Set's Source Data may differ significantly from the selected Preset'

What else / future potential issue?

  • On the AMI Sets overview Page, in the Operations menu, option to 'Start a new AMI Set with these settings'

AMI Workflow Enhancements for 0.5.0

What is needed?

Options that enable users to:

  • Define an AMI set label when initially creating a set
    - instead of having to Edit this after the set is created
  • Define ownership of digital objects and collections created from an AMI set
    - autocomplete search against existing user IDs?
    - defaults as checkbox for option such as "default ownership same as AMI set creator"?
  • Determine the separator/delimiter used for certain multiple-value elements
    - the main AMI default separator/delimiter is set to semicolon (;) for Files
    - I7 Solr Importer plugin uses |@| for multiple-value elements
  • For I7 Solr Importer, option to deselect fetched columns/keys/elements
    - potentially as an additional step after initial configuration choices
  • #69
  • #70

Related separate issue discussions/issues:

  • 🌟 #8
  • See all of the Processed ADOs associated with an AMI Set in a list/tab
  • When working in the 'Edit Reconciled LoD' tab of your AMI Set, have an option to Filter by Checked/Unchecked (from Issue #69)
  • In the Processed LoD CSV, include a Notes column to include information about the AMI Set origination and date (from Issue #69)
  • Alphabetic sorting of CSV Header Columns on LoD Reconcile form during configuration #148

@DiegoPino @aksm @karomabiles anything else that should be part of this issue?

Make Solr Batch size aware of the children

What?

I tried to make the Solr import as optimal memory-wise as possible, but I failed with a real and important use case. We go and get data from Solr in batches; the default is 50 rows. But if those 50 rows are Books and each has hundreds of pages, we still get everything in a single call. That is a very, very large array and can fill memory.

I will have to do some heavy refactoring there and return the batch plus the actual top offset and child offset whenever I pass a certain limit, to avoid running out of memory in that use case. This is a design bug.
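
A rough sketch of the idea (illustrative only, not AMI's actual code): fetch parent rows in batches but stop accumulating once a row budget is spent, and hand back both the parent offset and the child offset so the next call can resume mid-parent. fetch_batch() and $fetchChildren are hypothetical names.

  // Hypothetical sketch: cap the number of rows returned per call and report
  // where to resume, so a batch of 50 Books with hundreds of pages each never
  // materializes as one giant array.
  function fetch_batch(callable $fetchChildren, array $parents, int $parent_offset, int $child_offset, int $max_rows = 500): array {
    $rows = [];
    for ($i = $parent_offset; $i < count($parents); $i++) {
      $children = $fetchChildren($parents[$i]['pid']);
      // Resume inside the parent we stopped at last time.
      $start = ($i === $parent_offset) ? $child_offset : 0;
      for ($j = $start; $j < count($children); $j++) {
        $rows[] = $children[$j];
        if (count($rows) >= $max_rows) {
          // Budget spent: return the rows plus the offsets needed to continue.
          return ['rows' => $rows, 'parent_offset' => $i, 'child_offset' => $j + 1];
        }
      }
      $child_offset = 0;
    }
    return ['rows' => $rows, 'parent_offset' => count($parents), 'child_offset' => 0];
  }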

@alliomeria @aksm

Drupal 9 shenanigans. ModerationInfo may call a method on NULL while we gather Moderation States

What?

Yes: wrong code in the D9 content_moderation module, combined with an ADO Content Model that is not under Moderation (has publish/unpublish only).

This method here https://git.drupalcode.org/project/drupal/-/blob/9.2.x/core/modules/content_moderation/src/ModerationInformation.php#L216

/**
   * {@inheritdoc}
   */
  public function getOriginalState(ContentEntityInterface $entity) {
    $state = NULL;
    $workflow_type = $this->getWorkflowForEntity($entity)->getTypePlugin();
    if (!$entity->isNew() && !$this->isFirstTimeModeration($entity)) {
      /** @var \Drupal\Core\Entity\ContentEntityInterface $original_entity */
      $original_entity = $this->entityTypeManager->getStorage($entity->getEntityTypeId())->loadRevision($entity->getLoadedRevisionId());
      if (!$entity->isDefaultTranslation() && $original_entity->hasTranslation($entity->language()->getId())) {
        $original_entity = $original_entity->getTranslation($entity->language()->getId());
      }
      if ($workflow_type->hasState($original_entity->moderation_state->value)) {
        $state = $workflow_type->getState($original_entity->moderation_state->value);
      }
    }
    return $state ?: $workflow_type->getInitialState($entity);
  }

found in /web/core/modules/content_moderation/src/ModerationInformation.php (Drupal 9)

It immediately calls chained accessors like $workflow_type = $this->getWorkflowForEntity($entity)->getTypePlugin() without acknowledging/testing that $this->getWorkflowForEntity($entity) may be NULL because the Entity is not under Moderation.

The solution for us is to call the parent method before we access this, and hopefully also report this upstream as a bug:
if ($moderationInformation->canModerateEntitiesOfEntityType($entity->getEntityType()) && $moderationInformation->getWorkflowForEntity($entity))
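
For context, a minimal sketch of how that guard could wrap the call (assuming $moderationInformation is the content_moderation.moderation_information service and $entity is the ADO node; illustrative, not the exact AMI patch):

  // Only touch moderation state when the bundle is actually under a workflow.
  if ($moderationInformation->canModerateEntitiesOfEntityType($entity->getEntityType())
    && $moderationInformation->getWorkflowForEntity($entity)) {
    // Safe here: getWorkflowForEntity() is not NULL.
    $original_state = $moderationInformation->getOriginalState($entity);
  }
  else {
    // Unmoderated bundle: fall back to plain published/unpublished handling.
    $original_state = NULL;
  }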

Parsing XML to JSON on AMI Ingest

The webform allows one to upload an XML file that gets nicely parsed out into the SBF JSON using the webform field type: Import Metadata from a File

The need is to have this same parsing/processing done to XML files when they are ingested via AMI.


pasting below the slack convo from Diego on this:

Diego:
“I see all of the values from my xml very nicely parsed into JSON in the SBF.” makes me happy because XML to JSON is tricky and i had to make quite some acrobatics to generate a decent sized/parseable JSON from XML. But for your use case, no webform element level processing is done via AMI (as we speak). AMI is not even really aware of what webform you may/want to use (you could have many). Reason is because it is a bit tricky because a lot of what Webform does requires JS/Human interaction and AMI can not access JS at all (not client level, server level). Remember your webform module in islandora 7? A lot of mapping and XML forms in Islandora could not even process data/just read/write what you would put there. I think we can find some type of “plugin” level processing for AMI, where some (may need a list?) webform equivalents can be mapped to certain fields to make that happen. It would imply: a new AMI set mapper, some plugins that take input/generate output (and making them 1:1 with webform may be a challenge) and then use the output of the plugin in the Queue Worker to enrich the JSON.

Derek Merleaux :
👍 yep I’ll do that now - it sounds like the XML to JSON processing that is being accessed by the webform is not currently accessible by AMI? That’s why a plugin is needed?

Diego Pino
Yes. Webforms do a lot on their own realm that is outside of the Node Ingest workflow. AMI can not access that (now) because webforms require browser/user interaction we can not fake (easily)
e.g in the past all the “file characterization” was done by us on webform, but that did not work for drush or AMI ingests so i moved it to an event subscriber. That could also be an option. E.g “always process attached XMLs into JSON”
and that would be “general” not AMI specific. Remember you can also ingest objects via DRUSH or even the JSON-API directly
But then some people may see XML as a preservation format that does not need to be JSON-i-fied! (edited)
so… issue is, we have again too many choices. Another example: CSV to JSON. I should not “process” every CSV into JSON. Maybe I want to process some
This may also be solved outside of AMI
via SBRunners (edited)
Where we have more control/decision making options
And SBRunners apply to every object and can/be/forced/to regenerate
without reingesting
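
For reference, a generic XML-to-JSON sketch (this is NOT the webform element's actual processing, just the general shape of what such a plugin or event subscriber would do; the file name is hypothetical, and attributes/namespaces need extra acrobatics, as Diego notes):

  // Parse the XML and round-trip it through json_encode/json_decode to get a
  // plain associative array that could then be merged into the SBF JSON.
  $xml = simplexml_load_string(file_get_contents('metadata.xml'), 'SimpleXMLElement', LIBXML_NOCDATA);
  if ($xml !== FALSE) {
    $as_array = json_decode(json_encode($xml), TRUE);
    // $as_array can now be merged into the ADO's JSON under a key of choice.
  }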

Enable ZIP source file and also full directories processing

What is needed?

I still need to add the extract-file-from-ZIP function. I know, you know, I did this already in IMI, but in this case I want to be super performance/space conscientious and do not want temp files lying around. So the extraction needs to happen during the actual Ingest and not during the preprocess, which also means some flags/checks in case it fails upfront; alternatively I can check whether the file is present during preprocessing and flag the row as "not ingestable" if it is not. A rough sketch of the extraction is below.
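
A minimal sketch of that approach (paths and entry names are illustrative, not AMI's real API): check that the entry exists during preprocess without extracting, and only pull the single file out at actual ingest time.

  $zip = new \ZipArchive();
  if ($zip->open('/path/to/source.zip') === TRUE) {
    $name = 'images/page_001.tif';
    if ($zip->locateName($name) === FALSE) {
      // Preprocess: flag this row as "not ingestable", the file is missing.
    }
    else {
      // Ingest: extract just this entry, no full unzip and no lingering temp tree.
      file_put_contents('/tmp/page_001.tif', $zip->getFromName($name));
    }
    $zip->close();
  }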

Many questions and needs
@alliomeria will tag you when published so you can test. Thanks

Round Trip Export/Import

User Story
As a metadata specialist, I need to export a set of records that I can then edit in a spreadsheet and then import the same sheet to update the records.

url decode remote file names before saving and remove prefix during temp storage

What?

Normally the actual file name saved in the DB and the SBF JSON is not really that relevant, but in particular when dealing with dependent files like an OBJ (3D), its MTL, and the referenced image files, the exact naming convention is really important when loading them via JS.

This implies:
1. Change the temp folder structure for remote files: instead of appending the MD5 of the URL to the original file name, make the MD5 a folder.
2. URL-decode the filename (which will give us the real one, with spaces) and use that (see the sketch after this list). Once SBF takes over, the actual storage location/filename will be normalized anyway, but by then the File Entity will contain the right filename and will restore it even if we delete the as:filetype structure in the JSON.
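
A small sketch of point 2 (illustrative values only): keep the MD5 of the URL as a folder and URL-decode the basename, so the File Entity keeps the real name, spaces included.

  $url = 'https://example.org/files/model%20part%201.mtl';
  $folder = md5($url);
  $filename = urldecode(basename(parse_url($url, PHP_URL_PATH)));
  // e.g. temporary://ami/<md5 of the URL>/model part 1.mtl
  $destination = 'temporary://ami/' . $folder . '/' . $filename;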

Pull request coming; the code is already changed and works well, but I'm a bit tired. This is the last-minute bug fix before release and comes from testing the ready-to-share deployment machine.

@patdunlavey @giancarlobi @alliomeria @aksm

Remote files with extensions are being ignored

What?

Bug. This is on me. I programmed this to deal with both extensions and missing extensions, with a full extension-to-MIME mapping and back, but forgot that if the extension is present and the MIME type is correct I can use the already-downloaded file directly. Gosh.

This was detected by @alliomeria

I double-checked that after the fix, endpoints without extensions (e.g. coming from an Islandora) are also dealt with correctly.

Update ingest with file mappings adds new file usages (and does not replace or remove the old file usages)

Steps used to create this problem:

  1. Created an AMI create set with s3:// file mappings and ingested. All looks good.
  2. Took the processed csv file from above (which has the node_uuid column) and used it, unaltered, as the basis for a new AMI update set, with all settings, including file mappings, identical to the create set.
  3. After processing the update set, we end up with two copies of each dr:fid in the json, but each with a unique dr:uuid. Processing the AMI set again results in another copy of each of the file usages.

Additional observations:

  1. If I remove the file field mappings in the AMI update set configuration, it does not create duplicates when processed.
  2. If I change the file paths in the csv provided to the AMI update set, new file usages for the new paths are added to the previously existing ones. The old ones are not removed.

Missing Messenger() trait on Spreadsheet importer and ZIP based Spreadsheet format issues

Silly mistake, bad code. Fixing!

  • Also, many ZIP-based Spreadsheet formats require that the file provided to the PHPSpreadsheet methods is an absolute path. Stream wrappers won't work. This implied a lot of refactoring, since we also have to assume the admin of the repo could have changed the upload location for anything to S3, and that implies downloading the file to temp and cleaning up. I added a shutdown function to clean up afterwards (a minimal sketch follows below), but on PHP-FPM there is no output during that process so I can't "really" debug whether it's working. I know it's cleaning (I can check the folder afterwards), but just in case someone wonders: Xdebug, etc. won't even trigger inside that function.
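
A minimal sketch of the cleanup approach (assuming $local_copy is the temporary copy downloaded for PHPSpreadsheet; keep the shutdown callback dead simple since, as noted above, nothing inside it is debuggable under PHP-FPM):

  $local_copy = tempnam(sys_get_temp_dir(), 'ami');
  register_shutdown_function(function () use ($local_copy) {
    // Remove the temp copy once the request ends, even on fatal errors.
    if (file_exists($local_copy)) {
      @unlink($local_copy);
    }
  });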

free dates with edtf "[<dates>]" syntax (one or more dates in square brackets) are saved as json arrays rather than original strings

One of EDTF's features is "Set representation", which permits listing several dates enclosed in square brackets or curly braces. date_free values that use set representation are being json_decoded by AMI and saved into the date_free json value as an object or an array, rather than as a string.

I tried a number of tricks in the twig template (e.g. using escape('js')) to get AMI to treat them as strings, however nothing I tried worked - it either left the free date saved as an array/object, or threw an error: "We could not generate JSON via Metadata Display with..."

Digging around in AmiUtilityService.php:1954 I see that AMI is sending json_decoded data to the twig template, which I guess makes sense.

I tried this in my template:

"date_free": {% if data.my_field is iterable %}"{{ data.myfield|json_encode|raw }}"{% else %}{{ data.myfield|json_encode|raw }}{% endif %}

Note that the only difference is that the first iterable version wraps the json encoded value in quotes, while the second relies on json_encode doing the quotes.

I'm not actually sure if this worked as intended - not sure if is iterable is the right test here. At any rate, that's as far as I got, and it didn't work.

I'm not sure if this is an actual bug, or if I'm just not doing my twig right. Thoughts @DiegoPino ?

Permission denied on Directory Prepare on an OS X docker deployment

What?

This is a strange one, but it could really be related to the fact that the filesystem in an OS X machine is shared, and OS X could be touching files during its normal "I'm here, doing stuff" processes.

This was detected by @alliomeria after an AMI Set was deleted (not the first time a set was deleted, either) and it basically blocked any further ingest after that, because we fetch remote data as CSV files for local processing just before the actual setup/form/AMI settings can happen.

The fix is easy and means basically going from

  if (!$this->fileSystem->prepareDirectory(
      $path,
      FileSystemInterface::CREATE_DIRECTORY
    )) {
      $this->messenger()->addError(
        $this->t('Unable to create directory for CSV file. Verify permissions please')
      );
      return;
    }

TO

  if (!$this->fileSystem->prepareDirectory(
      $path,
      FileSystemInterface::CREATE_DIRECTORY | FileSystemInterface::MODIFY_PERMISSIONS
    )) {
      $this->messenger()->addError(
        $this->t('Unable to create directory for CSV file. Verify permissions please')
      );
      return;
    }

But I may want to apply that to many other places

AMI Solr Importer Plugin Suffix Cleanup

What is needed?

AMI already performs some clean up 🪄 of various commonly-encountered duplicative/redundant Solr fields with particular suffixes:

const SOLR_FIELD_SUFFIX = ['_ms', '_mdt', '_s', '_dt'];

Additional common suffixes that would also be helpful to include in this clean up (see the sketch at the end of this issue):

  • _t
  • _mt
  • _mlt

@DiegoPino, @aksm, and all Solr Importer fans, any other suffixes to consider for this process?
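
A sketch of what the clean up could look like with the proposed additions (illustrative only, not the plugin's actual algorithm): keep one field per base name and drop suffixed duplicates.

  $suffixes = ['_ms', '_mdt', '_s', '_dt', '_t', '_mt', '_mlt'];
  $pattern = '/(' . implode('|', array_map('preg_quote', $suffixes)) . ')$/';
  $fields = ['mods_titleInfo_title_ms', 'mods_titleInfo_title', 'PID'];
  $kept = [];
  foreach ($fields as $field) {
    $base = preg_replace($pattern, '', $field);
    // Keep only the first field seen for each base name; the rest are redundant.
    if (!isset($kept[$base])) {
      $kept[$base] = $field;
    }
  }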

Exporting data with hierarchies to csv/sheet

This is a specific requirement for the project I'm working on, and I have seen demand for this in other institutions. This also ties directly to the round-trip export/import functionality (I'll link issue when it's made)

User Story
As a repo mgr, I need to export all the data for a specific set of records into a spreadsheet format.

I need the spreadsheet to have one record per line except where the record has more than one item/part in which case the subset of data for all nested items of a record should appear in the rows immediately below that record.

Example:
An AV record, footage from ArchipelagoCon, has four nested "items" which are side A and side B of two VHS tapes (yes, I know, but this is an example, ok?). Within the record's metadata is a set of 20 fields specifically describing that side of that tape (digital asset).

When I export this record to a spreadsheet, there should be one row for the record (that includes the data for the first item) and three more rows with data from the remaining 3 items.


adding new Google API account - developer key field might need some validation

When creating a new account, and being the sort of user that does this kind of thing, I didn't have any idea what to put for "developer key" (Google refused to provide me with a value by this name). So rather than leaving it blank, I improvised and stuck my email in there, which yielded the following error message. I went back, removed it, and resaved with no problems - I think the account would have worked fine after that initial error, but I'm reporting it just in case this is important.

Warning: assert(): Cannot load the "google_api_client" entity with NULL ID. failed in Drupal\Core\Entity\EntityStorageBase->load() (line 249 of core/lib/Drupal/Core/Entity/EntityStorageBase.php).
Warning: array_flip(): Can only flip STRING and INTEGER values! in Drupal\Core\Entity\EntityStorageBase->loadMultiple() (line 266 of core/lib/Drupal/Core/Entity/EntityStorageBase.php).

Add strict validation for settings passed to Ami Set Processor and expose Content Moderation

What is this?

Currently we are not validating thoroughly enough the configuration passed by the AMI set to the Processor. If a user replaces a CSV, the Setup-level validation is not enough anymore and important Header Columns and other settings may be missing from the new Source data. This also includes references to Rows that are not there or no longer there (if the CSV has a different row count).

Also, by default all ADOs are ingested as "Drafts". Simply setting published/unpublished is not enough, since each Bundle (digital object, collection, or custom ones) could be under a different Workflow/Moderation Transition Scheme. This means we also need to check, for each selected "Bundle", the available Moderation Options and display them as fields in the Process Form. Some (custom) bundles could be unmoderated and would then only have a "published/unpublished" status.

This was reported by @carlj11 today. I already started working on solutions, but the Moderation part will still take me a few more hours tonight.

@dmer @alliomeria you will both benefit from this too.

File ordering/sequencing options

What is needed?

Provide additional file ordering/sequencing options for source files in AMI.

The options below resulted from recent discussions on the #archipelago-metadata Slack channel:

  1. Files ingested via a URL source will always sort by the order listed within the cell. (help text on the AMI configuration step will let users know this is the case?)

  2. If users select Direct or Template during Step 3:

    • All files will be sorted by natural order by filename.
  3. If users select the “Custom (Expert Mode)” during Step 3:

    • For each type of ADO found in a spreadsheet/import source, users will have an option to specify that file sorting respects the order given in the cell. Selecting this will add a flag of “user provided order for referenced filenames” (or similar phrasing)
    • If users do not select the option to specify by order given in the cell, then files will be sorted by natural order by filename

Devnotes 🍓:

  • the “respect ingest/cell order” option will add a flag; this flag will persist on manual edits; if the flag is absent, files will be ordered by filename (and perhaps a message noting the order follows the sequence in the cell). A sketch of the default filename sort follows below.
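
A sketch of the default "natural order by filename" sort mentioned above (illustrative; the flag wording is still open):

  $filenames = ['page_10.tif', 'page_2.tif', 'page_1.tif'];
  // strnatcasecmp gives page_2 < page_10, which a plain string sort would not.
  usort($filenames, 'strnatcasecmp');
  // $filenames is now ['page_1.tif', 'page_2.tif', 'page_10.tif'].
  // If the "respect ingest/cell order" flag is present, skip this sort and keep
  // the order given in the cell as-is.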

@DiegoPino & @giancarlobi, does this match your understanding of the options discussed? If not, please let me know and I will correct these issue notes. Thank you both!

Wrong joining of multi valued fields from Islandora-Solr

What?

My bad. I just checked the code and the order of operations has a bug. Basically, if after concatenating values I find another "to be joined" field with a single value, I end up resetting the already-concatenated ones. The fix is simple and this is an important one.

@alliomeria this needs to be fixed because it also affects your migration workflows. On it!

Remove links to delete processed ADOs for update AMI sets

If you cannot delete processed ADOs because the AMI set is configured as an update set, you should not be presented with links to do so.

This was prompted when we found that one of our contrib modules was whitescreening because it was trying to do a form alter on the delete-processed-ADOs confirmation page when there was no form to display. That is solved by having \Drupal\ami\Form\amiSetEntityDeleteProcessedForm::buildForm return a proper form, rather than just a bit of a render array.

But the main issue is that we should not see the links to delete processed ADOs in the first place when the AMI set is "update". These links appear in the operations column of /amiset/list, as well as in the menu tabs when viewing an individual AMI set.

The operations links are controlled here: \Drupal\ami\Entity\Controller\amiSetEntityListBuilder::getOperations
The menu tabs are access-controlled here: \Drupal\ami\Entity\Controller\amiSetEntityAccessControlHandler::checkAccess

I will have a proposed PR to address these things shortly.

Missing 'type' column, or missing values in the 'type' column, causes inscrutable errors when creating an AMI set

When an ingest spreadsheet is preprocessed during AMI set creation, if there is no "type" column, or if there are empty values in the "type" column, we see the error message "1 error has been found: Select the data transformation approach for ". The message is supposed to include the type, but since none exists, it looks like this. Further configuration of the AMI set is not possible.

The behavior should be that it provides an error message saying that a "type" column with values is required. I'd also suggest that the "type" column itself be configurable, where you are presented with a list of all column names, and instructed to select one for the object type.

AMI Solr importer things that need resolution

Some things we have discovered 😭

  • The CSV cleanup algorithm (bad, bad) has offsets if Solr returns values in a different order, or has some fields present and some absent compared to the previous 100 items fetched. It's a complex situation of performance/memory/cleverness (which I lacked)
  • Make all Column Headers lowercase when converted to Twig Context. E.g. if the header is mods_name_personal_contributor_namePart, the way of accessing that data inside Twig needs to be data.mods_name_personal_contributor_namepart, and PID will be data.pid

We need

  • Inline documentation on ADO Mappings for child objects to make sure people know they are "suggested", but there is no certainty that they will actually be found when harvesting the Children.
  • That the sum of Top Object CMODELs and Children CMODEL mappings == all the CMODELs present in that Islandora site
  • Inline documentation in Format Strawberryfield to explain that if data is coming from a CSV, the headers will always be treated as lowercase.
  • Generate a Twig template that is a close match to a CSV coming from Islandora. This template needs to have {# explanations #}
  • Provide an example CSV from an Islandora that matches that Twig template so people do not complain.
  • For LoD Reconciliation
    • if no LoD endpoint is selected for a source, normalize the warning/error message (one instead of many)
    • Attach independent CSV for better table display (top alignment + wrap text)

AMI set, Process immediately v/s enqueue, Questions

What? These are questions for @alliomeria @giancarlobi @pcambra and @patdunlavey

Each AMI set has a "process" tab/form. When you press process right now the only action that is happening is that all Future ADOs are enqueued, independently of the Set ID, into a global AMI Ingest Queue. This queue can then be processed via the UI if the module queue_ui is enabled or can be take over by Hydroponics Service and end ingested eventually

Since I left a checkbox there for "Enqueue but do not process Batch", I'm now working on the opposite action, the actual process-immediately (meaning you uncheck it and press Process), but I have a few questions here:

I can trigger a direct Batch process. That is simple enough and I'm testing that now, hurrah for me. Now, these are the questions/facts:

  1. Since all future ADOs currently go to the same queue, I would need to do something funny (which I do not like) to only process the ones from the current set. I have to go item by item in the queue, claiming items and checking if they belong to the current Set ID; if so, processing can be done, if not, I need to release the items and keep claiming. This adds some overhead (and some calculation math on my part), but also has the benefit that I can actually process only certain Sets even if the Repo is being used by multiple users and has a lot of pending Sets. Claiming items (even if I'm going to release them) may, on concurrent processing, mess things up, because of the ingest order we try to impose (collections and parents first, then children).
  2. On the "process now" action (checkbox unselected), instead of pushing to the general Ingest QUEUE I can create "SET ID named" queues that are only valid for a particular set, and manually batch-process them. I was thinking this may be a good idea if I can actually delete the queue completely once the ingest is done. Since "deleteQueue" actually deletes the table, it would not leave any garbage around, and in case of a processing interruption I could even allow the user to keep processing that one at a later time... (not sure I like that... but it may be needed). This discrete queue would also allow a full separation. But the items in that queue (for immediate processing) would never ever go to the main global one, except if processing is run again without the checkbox enabled. This last option (you run once with immediate, then again without) may mean that the same ADOs end up enqueued twice. AMI is smart enough not to "double" ingest, but I'm pretty sure it cannot be smart enough not to "double" update or "double" patch! So, next question...
  3. Should, independently of the choice, only one enqueueing be possible per set, and only once all ADOs are processed, claimed and removed would a new enqueueing be possible? This sounds like a great idea, but it is tricky. Because how do I know that, e.g., all items in the Global Queue for a given set are already processed, without doing exactly what I want to avoid in 1?
    Easy enough if a temp/discrete queue per set is used, but not in the global one.
  4. The undo queue, for patch and update. Everyone likes undos. This may be a killer feature, but again, tricky to implement. Should I enable a discrete/temp queue for undoing patching and updating? Or should we rely on the NODE revisions to simply keep track of those and revert to the "pre" update revision? If so, an undo queue can only exist per Set at a time and only while we are processing. But there are so many ways we could end up with a strange undo queue. Like a partial undo? Or not fully undoing if processing was interrupted?

Side note: in any case I would like to preserve the global ingest queue, since it allows "all the enqueued things" to be ingested when the server is idle, in sequential FIFO order. Still, I'm starting to feel that creating and deleting temp queues for immediate ingesting is a good thing...

Do any of these make sense? Should I remove some of the too-many options and try to avoid confusing the user? And if so, what do you think is the best way? I mean, should users not be exposed to these details at all? Thanks

Add a Report Tab to each AMI Set

What?

Right now, given the fact that everything runs in the background/queue, we are passing errors that happen during an AMI set ingest to the watchdog. That is a bad place for end users to see/understand what failed, how it failed, and what they may need to do to fix missing data, not-found files, etc. in their Source Data/AMI Set.

Need

  • Generate parseable reports that can be seen directly on each AMI Set
  • Generate a Block on each User Landing page that we can use to show them Failed results, etc.

I need to investigate what is best here. Watchdog is still good for us, but errors need to generate their own entries in each SET, and we may want that to be "Views" processable or at least UI/UX facing for good.

@alliomeria will just tag you because this will eventually open some larger discussions about proper ways of reporting background processing and queue outputs to users (in general), especially in our complex environment

Hitting Excel limits with large JSON in a cell

What?

Discovered yesterday. Excel has a design limit of 32767 chars per cell. CSV has no limit. But because we use PHPSpreadsheet as a generalized way of reading anything when using the Spreadsheet AMI Import Plugin, and somehow someone decided it was a good idea to put a truncation method on read (why would you do that? I mean, worst case put it on write-back!), large JSONs (not your usual JSONs, we are speaking about hundreds of objects in an imported EAD finding aid) get cut after import and the JSON is no longer valid. Of course nothing "fails", it just skips that data.

Solution: Identify the file first; if it's a CSV then use the native PHP method (we already have a method in our AMI Service for that) to fetch the data via ::getData(). A sketch of that decision is below.
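
A sketch of that decision (illustrative, not the actual ::getData() implementation): route real CSVs to the native reader so cells are never truncated, and only hand true spreadsheet formats to PHPSpreadsheet.

  $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
  if ($extension === 'csv') {
    $rows = [];
    if (($handle = fopen($path, 'r')) !== FALSE) {
      while (($row = fgetcsv($handle)) !== FALSE) {
        // No 32767-character cell limit here.
        $rows[] = $row;
      }
      fclose($handle);
    }
  }
  else {
    // XLSX/ODS etc.: Excel's own cell limit applies anyway.
    $spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load($path);
    $rows = $spreadsheet->getActiveSheet()->toArray();
  }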

Google sheet import failing due to autoload failure

We're suddenly having this problem with Google Sheet import. After filling in the Google Sheet ID and cell range and clicking "Next", nothing happens. Watchdog reports...

Error: Class 'Google_Service_Adsense_Resource_Accounts' not found in Google_Service_Adsense->__construct() (line 67 of /var/www/html/vendor/google/apiclient-services/src/Google/Service/AdSense.php)
#0 [internal function]: Google_Service_Adsense->__construct(Object(Google\Client))
#1 /var/www/html/web/modules/contrib/google_api_client/google_api_client.module(135): ReflectionClass->newInstance(Object(Google\Client))
#2 /var/www/html/web/modules/contrib/google_api_client/google_api_client.module(60): _google_api_client_read_scope_info()
#3 /var/www/html/web/modules/contrib/google_api_client/src/Entity/GoogleApiClient.php(179): google_api_client_google_services_scopes(Array)
#4 /var/www/html/web/modules/contrib/ami/src/Plugin/ImporterAdapter/GoogleSheetImporter.php(176): Drupal\google_api_client\Entity\GoogleApiClient->getScopes()
#5 /var/www/html/web/modules/contrib/ami/src/Form/AmiMultiStepIngest.php(393): Drupal\ami\Plugin\ImporterAdapter\GoogleSheetImporter->getData(Array, 0, -1)
#6 [internal function]: Drupal\ami\Form\AmiMultiStepIngest->submitForm(Array, Object(Drupal\Core\Form\FormState))

When I remove the vendor folder and run composer install, composer reports that it is skipping autoload generation for a bunch of Google_Service_Adsense classes due to non-compliance with psr-0 autoloading standards.

AMI Update: Create new revisions

Current behavior is that when AMI updates an existing object, a new revision is not created. This makes using AMI update a very risky proposition.

The feature request here is to either always create new revisions if revisioning is enabled, or provide a setting in the AMI set to enable or disable saving revisions.

Extra credit (another feature request): provide an action to roll back the most recent update process.
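
A sketch of the requested behavior (assumptions: $node is the existing ADO being updated and $ami_set_label is available; this is not current AMI code):

  if ($node->getEntityType()->isRevisionable()) {
    // Create a revision on every AMI update so the change can be rolled back
    // from the node's revision history.
    $node->setNewRevision(TRUE);
    $node->setRevisionLogMessage('Updated via AMI set: ' . $ami_set_label);
    $node->setRevisionCreationTime(\Drupal::time()->getRequestTime());
  }
  $node->save();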

Super extra check passed Rows and Row indexes before sending to the actual QUEUE

What?

I'm still figuring out all the places where things can go wrong when data is missing or some sanity check of the provided rows makes the inter-row hierarchy fail, but a few spots were detected already, and @alliomeria dumped all the logs for me into a TXT file so I can debug better (FYI: drush watchdog-show --tail --full --count=50 --type=php). The gist here is that I need to mark all "required" data and indexed structures/properties of the $data Object that is passed around before enqueuing info. There are a lot of failure points, and even if we cannot control user input via a tabular form, we can at least make sure the "needed" data is there and that all our dependencies are met.

More on this later

Find and Replace UI

What?

A placeholder so we can talk/discuss (and probably get frustrated) about how to make a UI understandable for a power user while doing complex or simple Batch operations.

  • Complexities:
    - How do people select what (and from where) they want to replace
    - How do people select the search pattern + replacement pattern
    - Same for Remove/Add operations
    - How do they deal with multiple nested elements like Subjects, or complex data like Geographic info coming from Nominatim
    - JSONPATCH (for extra power users) can do that already but is in NO WAY friendly
    - TEXT search and replace can be daunting if you do not know your metadata
    - New: Webform elements

Improve AMI Set Metadata Display Preview and provide also LoD during this process

What?

Small improvement towards better AMI Set Preview

  • Uses and shows the reconciled LoD matching the current row, the same data available during an actual "Processing"

  • Adds a new Twig Extension Function (DANGER MR. ROBINSON), NOT to be used in realtime rendering (please, people), that allows a direct call to an LoD endpoint passing a single Label and an endpoint setup, and returns NULL or an array with label/uri keys.

  • Small tests with PHP 8.0, and workarounds for the very buggy offset/get CSS row function (core) that always needs a -2 etc. depending on the PHP version


AMI update: merge sbf json rather than replace

There is a lovely thing we see when editing an object via webform: existing json data that is not overwritten by the webform is preserved. The previous json and the new json are recursively merged.

It would be nice to have this behavior (optional, or always) when processing an update AMI set.

Without this feature, an update AMI set must provide all data, rather than just the columns that are required (node_uuid, etc) and the columns being edited. Haven't checked, but it might even be necessary currently to provide the automatically generated as: elements.
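
A sketch of the merge-instead-of-replace idea using Drupal's own deep-merge helper (the field name is hypothetical and this is one possible strategy, not AMI's current code):

  use Drupal\Component\Utility\NestedArray;

  $existing = json_decode($node->get('field_descriptive_metadata')->value, TRUE);
  $incoming = $ami_row_json;  // Decoded JSON built from the AMI set row.
  // Later values win; keys absent from the AMI row are preserved.
  $merged = NestedArray::mergeDeep($existing, $incoming);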

Fix/Research better VBO integration with Solr Search Results and faceted Views

What?

VBO is great but it is quite buggy. It has some strange ways of managing the list of Content Items to be processed, and so far our experience is that it does not respect (especially when using the "Select All" button) the actual results the user is seeing on screen.

Specifically, for our needs this view here: http://localhost:8001/search-and-replace (replace with your domain)
needs to allow users to facet/filter, then select "All" and get actions applied to the actual UI-facing set, which sadly does not work right now.
The only thing that works is when there are exposed Filters (not facets) and the View is not Ajax-driven.

There are at least 2 major issues (also from a UI standpoint)

  • ViewsBulkOperationsActionProcessor is the one in charge of getting the Views arguments or manual selections and building a list of Content Items to be processed. The way VBO works is that once the user selects an action it "recreates" the View Results (so it does not depend on the initial results build, e.g. via faceting) and also uses the /argument1/argument2 etc. in the URL to feed Views Filters (which might not even be available if, e.g., the View is Ajax-driven). Even if you select all, the "UI" text will say "No items selected"
  • Facets generate GET arguments like http://localhost:8001/search-and-replace?f%5B0%5D=descriptive_metadata_object_types%3ADigitalDocument that are used by Search API when building the Views Result (at query time, by altering the View), but these are ignored once the Action is chosen because the form that drives the submit makes them disappear (it does not preserve GET)

What we need is

  • Figure out what is wrong and what VBO can/cannot do. Also see whether VBO has managed to fix some of these issues at all in their releases
  • Maybe (I feel that is the best idea, really) create our own version of \Drupal\views_bulk_operations\Plugin\views\field\ViewsBulkOperationsBulkForm, which is a Views Field Plugin but also the Form that drives basically all the submit/steps/etc. logic: one that takes Facets into account, is smarter, and also gives users better feedback on what is happening. I see that as the only possible point of override we have. We could also, of course, add a secondary submit handler to that existing form and enrich data by putting it into some key store values (VBO uses the temp store extensively to move data between form steps), but we might end up having to override other classes anyway

Do we need some reproducible tests?

@alliomeria @patdunlavey @aksm @karomabiles

Solr Importer to CSV HTML encodes double quotes instead of escaping them

What?

We get this:

The Art Exemplar, "English Typography and Book-Work," columns 117-118, blank verso

We should get this:

The Art Exemplar, "English Typography and Book-Work," columns 117-118, blank verso

and this in source CSV should be

"The Art Exemplar, "English Typography and Book-Work," columns 117-118, blank verso"

Same with & and other HTML entity-able chars

Where?

here:

https://github.com/esmero/ami/blob/ISSUE-14/src/Plugin/ImporterAdapter/SolrImporter.php#L1115

\Drupal::service('ami.utility')->csv_append()

Maybe go for
htmlspecialchars($value, ENT_COMPAT, 'UTF-8', FALSE);
addslashes($value);
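
An alternative sketch (not necessarily the final fix): decode the HTML entities coming from Solr and let fputcsv() do the CSV quoting, which doubles internal quotes instead of entity-encoding them.

  $value = 'The Art Exemplar, &quot;English Typography and Book-Work,&quot; columns 117-118, blank verso';
  $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  $handle = fopen('php://temp', 'r+');
  fputcsv($handle, [$decoded]);
  rewind($handle);
  // "The Art Exemplar, ""English Typography and Book-Work,"" columns 117-118, blank verso"
  echo stream_get_contents($handle);
  fclose($handle);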

google_api_client error

When visiting /admin/config/services/google_api_client (Google Accounts administration), several warnings are displayed, which stem from the first:

Warning: scandir(/var/www/html/vendor/google/apiclient-services/src/Google/Service/): failed to open dir: No such file or directory in _google_api_client_read_scope_info() (line 113 of modules/contrib/google_api_client/google_api_client.module).

And if I try to edit or add a google api client account, a wsod appears, with the following error message:

The website encountered an unexpected error. Please try again later.
TypeError: Argument 1 passed to Drupal\Core\Form\OptGroup::flattenOptions() must be of the type array, bool given, called in /var/www/html/web/core/lib/Drupal/Core/Field/Plugin/Field/FieldWidget/OptionsWidgetBase.php on line 187 in Drupal\Core\Form\OptGroup::flattenOptions() (line 23 of core/lib/Drupal/Core/Form/OptGroup.php).

I confirmed that the /var/www/html/vendor/google/apiclient-services/src/Google directory does not exist. The src directory does, and contains a number of other directories. Removing the vendor directory and re-running composer install does not restore the missing src/Google/Service directory.

I noticed that the most recent version of google_api_client is 4.0.0, and tried switching to that in archipelago-deployment's composer.json, however, the ami module's composer.json also requires this module, and also specifies its version as ^3.0. So I tried changing the version numbers to ^4.0 in composer.json for both ami and archipelago-deployment (what I actually had to do was much more complicated than that, but that's the gist) - and quickly ran into autoload.php and composer hell. After trying a few things like uninstalling and reinstalling, clearing cache, etc., I backed away slowly.

However, I did find that when I manually changed the hard-coded path in google_api_client.module to the path that's used in the 8.4.x version of that module, the errors went away and everything seemed to work. Here's that change.

LoD Reconciliation Reuse / Iterative Workflow

What is needed?

Ability to reuse LoD Reconciliation data from other AMI Sets.

  • in order to build upon, iterate over the metadata work completed previously
  • help ensure consistency of LoD mappings in your repository
  • ♻️ reduce the collective energy footprint of metadata work lifecycles

How might this work?

  • On the 'Reconcile LoD' tab of your AMI Set, have an option to "Bring LoD from other AMI Set"
    • Dropdown (or autocomplete?) to select from existing AMI Sets
  • Once your preferred Set is selected, option to "Bring as Checked"
    • If selected, will carry over the "Checked" connotation marked in the existing AMI Set, so you can reference if terms have already been reviewed previously
  • Inline notes should indicate that if using LoD from an existing Set, only terms new to the current AMI Set will pass through reconciliation

For wider AMI workflow considerations

  • When working in the 'Edit Reconciled LoD' tab of your AMI Set, have an option to Filter by Checked/Unchecked
  • In the Processed LoD CSV, include a Notes column to include information about the AMI Set origination and date

@DiegoPino @karomabiles Does this cover what we all discussed for the potential reusable/iterative LoD workflow? Anything missing / not quite right?

"Webform based find and replace Metadata for Archipelago Digital Objects content item" bulk operation does nothing

This feature appears to work great right through the configuration, confirmation and execution steps, but in the end, no changes are made to the selected objects.

From @DiegoPino on slack:

I think i have an idea to get around the issue
basically using a Contextual bar (like we do for the twig preview) with the form there (so outside of the form that VBO provides) that sets a temp value into a tempstore
and is picked then by the actual vbo form on submit later
issue there is that the VBO config form is already wrapped around another one i have no control over
so i can not make it “super dynamic” as i would like to

Linked Data reconciliation processing (updated February 2022)

What is needed?

Linked Data reconciliation processing enhancements 🤓

  • 🌟 Specified during AMI ingest configuration mappings
    • i.e., use values from this column for LoD reconciliation
  • 🌟 Option to specify which LoD source to use for reconciliation
    • Library of Congress, Wikidata, Getty, etc.
  • 🌟 Option to specify multiple (different) source column-->LoD source pairings

*Moved to separate Issue #69, March 2022:
Ability to save preferred LoD reconciliation selections (mapped terms) to a Main / General Use LoD template

  • i.e., for every Subject of "Dogs" encountered in this repository (or collection?), use this preferred Wikidata LoD selection
  • ♻️ help reduce the temporal footprint of metadata work lifecycles

Support group entities as ADO parents

The group module provides the ability to associate content with groups, and the ability to constrain create, update, view, and delete access to group digital object content based on the user's membership in a group. In short, we want to use groups like archipelago collections.

The relationship between groups and their content is mediated by the group module through group membership entities - there is no "awareness" in the content itself of its membership in the group.

For manual ADO creation, I think this will "just work": the user (group member) creates an ADO as group content, just as they would for conventional nodes.

However, with AMI, we need to be able to create this relationship as ADOs are created, and modify that relationship as ADOs are updated.

Ingesting large media files from AWS S3 fails

We're testing ingesting objects via AMI that have multiple large media (video, audio) files attached to them which are stored on AWS S3. These frequently fail. When they do...

  • If processing via immediate batch, the batch stops and displays an ajax error message indicating a gateway timeout (see attached screenshot, 2021-10-13).
  • If processing via batch queue and the batch is manually run, we see a "Sorry we did all right but failed created the ADO with..." message (see attached screenshot, 2021-10-13).

We found that having the files in the local file system seems to work without problems. We're thinking that something is timing out, possibly while fetching the files from AWS S3. These are large-ish files. In one of our test examples, it's six video files totaling over 3GB.

We find no "smoking gun" in either the drupal logs or php logs.

I recognize that this may not be an AMI issue per se, but that's where we're encountering the problem.

Fix loop on re-enqueuing when handling separate processes for files

What?

This is on me, bad nesting of an IF (gosh), but it also exposed some timeout/race conditions. Basically, if you have a single Object with hundreds of files attached in multiple source fields, the system will try to push the Object to the end of the queue, but not after all files are processed - it does so after each file, ending in an eternal, never-ending loop.

  • This also exposed the fact that files that are already "permanent" are then not processed further by the SBF File Persister, leaving files that never get moved to S3 (or your configured choice for final storage). Why did this never trigger before? Race conditions: the Files were ingested at the same time as the Objects, never leaving room for them to be "actually" permanent in the DB. This will lead to new code in SBF to make sure that any file whose source is temporary:// ends up being moved to final storage independently of its temporary or permanent state.

  • Finally, improvements to File Download and Name handling were added, and the download timeout was made (for now) 75% of the total allowed run time instead of the fixed (you cannot change it!) 30 seconds that Guzzle uses for cURL (see the sketch after this list).

  • Since we are here: if someone deletes a value (leaves it empty) for a previously reconciled value, the system does not persist the change because we check for empties. That was by design (bad design) and will now be corrected.
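
A sketch of the "75% of the allowed run time" download timeout mentioned above (assuming a plain Guzzle client; $remote_url and $local_destination are placeholders):

  $max = (int) ini_get('max_execution_time');
  // Fall back to something sane for CLI/drush, where max_execution_time is 0.
  $timeout = $max > 0 ? (int) floor($max * 0.75) : 120;
  $client = new \GuzzleHttp\Client();
  $response = $client->get($remote_url, [
    'timeout' => $timeout,          // Total transfer timeout in seconds.
    'connect_timeout' => 10,        // Separate cap on connection setup.
    'sink' => $local_destination,   // Stream straight to disk, not into memory.
  ]);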

More work will be done on this (in a different issue) until it's "PERFECT"

Add a FCREPO/Solr import plugin to help with Islandora 7 migrations

What is this?

We got a lot of requests for this, so it's about time. This issue is to explain the design and how I plan on doing this; happy to get feedback, feature requests, questions, etc.

How?

The idea here is to add another AMI Source Plugin that can deal with data coming directly from Solr (but not limited to it, just as a start).
The plugin will integrate with the current setup form the same way Google Sheets and CSV work right now, and will provide the following options:

  1. Server URL to your core
  2. A predefined Islandora Profile with an advanced override
    2.1. Collection PID/Top Object PID (e.g book) to import from (one at the time)
    2.2. Filter CMODELS
    2.3. Binary Datastream(s) per CMODEL to fetch files from
    2.3.1 (NEW). HOCR and other derived data streams could be marked as SBFlavors and go into Solr directly. We can start with HOCR first and then think how that fits others. I wonder if we could pass the responsibility (a 'does it apply' method) to each Strawberry Runners Plugin or have one special plugin that automatically takes ingested files of certain criteria and does the work without AMI level settings. Thanks @noahwsmith
    2.4. Automatic build remote URL for datastream fetching
    2.5. Offset and Number of Objects to fetch (this will be per Top object, because why would you want to limit the number of pages per Book?)
  3. Advanced Override will include:
    3.1. Select Membership relations (default profile provides the most common fields already)
    3.2. Do a shallow import (default profile is deep import)
    3.3. Custom filter (as a comma-separated list of fields/values)
    3.4. What fields to return (default profile is fgs_, datastream data based on selected DSIDs, mods_)
    3.5. Make Compounds/books a single ADO or multiple ADOs. Default is (guess what) a single ADO.
    3.6 Use PID (if UUID based) as UUID for new ADO.
    3.7 Use PID (if UUID based or numeric) as a UUID seed for a UUIDv5 for the new ADO (hashes the PID so every time we ingest the same set we get the same ADO UUID - cool, right?). See the sketch after this list.
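
A sketch of the deterministic UUIDv5 derivation in 3.7 (illustrative only; the namespace UUID and function name are placeholders, and the real implementation would live in the plugin):

  // Derive a stable UUIDv5 from an Islandora PID (RFC 4122 name-based, SHA-1).
  function uuid_v5_from_pid(string $namespace_uuid, string $pid): string {
    $ns = hex2bin(str_replace('-', '', $namespace_uuid));
    $hash = substr(sha1($ns . $pid, TRUE), 0, 16);
    $hash[6] = chr((ord($hash[6]) & 0x0F) | 0x50);  // Version 5.
    $hash[8] = chr((ord($hash[8]) & 0x3F) | 0x80);  // RFC 4122 variant.
    $hex = bin2hex($hash);
    return sprintf('%s-%s-%s-%s-%s',
      substr($hex, 0, 8), substr($hex, 8, 4), substr($hex, 12, 4),
      substr($hex, 16, 4), substr($hex, 20, 12));
  }
  // The same PID always yields the same ADO UUID, so re-ingesting a set is idempotent.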

Any other feature you feel is missing here? Concerns?

Note: Mapping/etc will be the same as in any other AMI setup

The output CSV will already contain "documents, images, label, UUID and parent relations computed for you" columns processed by the plugin

Future work: make file URIs/URLs/Paths computable via a template. This gives @giancarlobi full control to override the paths for each file and use already-existing local paths.

@dmer @patdunlavey @giancarlobi @alliomeria please add your suggestion/comments here. Thanks

Save AMI set zip file in permanent instead of temporary storage

What?

Zip files for AMI sets were previously saved in permanent storage but are currently saved in temporary, e.g. /system/temporary?file=ami/files.zip instead of /sites/default/files/2021/11/files.zip.

Per @DiegoPino probably the following needs to be added somewhere:

$file->setPermanent();
$file->save();
// Add to file usage calculation.
\Drupal::service('file.usage')->add($file, 'my_module_name', 'file', $file->id());
