
strawberry_runners's Issues

Discard multiple same processor enqueued items after one was persisted

What?

If you save the same ADO multiple times in a row, we have no way of knowing/recording that a given processor was already enqueued, so we keep enqueuing. On processing, when e.g. 3 items run, we generate and store the derivatives 3 times.

The change is not big, since we already have a marker inside the source file, but that one only gets updated once one of the queue items is persisted. On pre-persistence we need to check again whether a previous item already filled it. If so, we discard the processed file, unlink it, and do nothing.

This does not prevent multiple enqueues, but it ensures that only one gets the proper place in the source ADO.
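
A minimal sketch of that pre-persistence check, assuming a hypothetical flavourExists() helper that reads the marker a previously persisted queue item wrote into the source ADO ($file_path is also illustrative):

    // Inside the queue worker, just before persisting the generated derivative.
    // flavourExists() is a hypothetical helper checking the marker already
    // written into the source ADO by a sibling queue item.
    if ($this->flavourExists($entity, $data->plugin_config_entity_id, $data->asstructure_uniqueid)) {
      // A sibling queue item already persisted this output: discard ours.
      if ($file_path && file_exists($file_path)) {
        unlink($file_path);
      }
      return;
    }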

Remove debug statement

These things really do not need an issue, but since we are more people now (and good people) I will roll with one here.

A dpm() call was left over from my many tests.

Ensure that ADO cache tags are cleared when indexing OCR content

Search is disabled when an ADO has no OCR content, so users are not misled. However, when/if OCR is added afterwards, the search box doesn't appear until the entity's cache is cleared.

This is what needs to happen in AbstractPostProcessorQueueWorker::processItem:

    if ($entity) {
      Cache::invalidateTags($entity->getCacheTagsToInvalidate());
    }

Comment from @DiegoPino:
Maybe we can do it even later, during the actual Solr index, because at that level the Solr document may not be findable yet (because of the track/index-immediately settings).

Relates to esmero/format_strawberryfield#118

Mainloop and enqueue logic

MainLoop logic
(see

$loop->addPeriodicTimer($queuecheckPeriod, function () use ($loop, &$cycleBefore_timeout, $queue, $idleCycle_timeout, $max_childProcess, $childQueue_init, $childQueue_started, $childQueue_done, $childQueue_error, $childQueue_output) {
):

  • The loop is executed every $queuecheckPeriod (e.g. 3 s)
  • At each execution the loop updates 'strawberryfield_mainLoop_keepalive' = \Drupal::time()->getCurrentTime()
  • If the queue stays empty for $idleCycle_timeout (e.g. 5) consecutive cycles, the loop stops
  • If the queue is not empty, the counter $cycleBefore_timeout is reset to $idleCycle_timeout

Enqueue logic
(see

pushItemOnQueue($node_id, $jsondata, $flavour);
):

  1. Push the item to process onto the queue, then check whether the mainloop is running or needs to be started.

  2. Does the mainloop need a wakeup?

  • $submitTime = \Drupal::time()->getCurrentTime(); (time when the item was enqueued)

  • Set $NxqueuecheckPeriod = 2 * $queuecheckPeriod (safe wait time)

  • Wait until (A) the mainloop has executed once after submit ($submitTime - $lastRunTime < 0)
    OR
    (B) $NxqueuecheckPeriod has elapsed since submit time

  • If (B), the mainloop has to be started; if (A), the mainloop is already running and there is nothing to do
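
A sketch of that wakeup decision, assuming the keepalive timestamp lives in Drupal's State API under 'strawberryfield_mainLoop_keepalive' as described above (a blocking loop for clarity; the real implementation is timer-driven, and startMainLoop() is a hypothetical starter):

    $submitTime = \Drupal::time()->getCurrentTime();
    $NxqueuecheckPeriod = 2 * $queuecheckPeriod; // Safe wait: two loop periods.
    while (TRUE) {
      $lastRunTime = \Drupal::state()->get('strawberryfield_mainLoop_keepalive', 0);
      if ($submitTime - $lastRunTime < 0) {
        // (A) The mainloop ran after we submitted: it is alive, nothing to do.
        break;
      }
      if (\Drupal::time()->getCurrentTime() - $submitTime > $NxqueuecheckPeriod) {
        // (B) No keepalive update within the safe wait time: start the mainloop.
        startMainLoop();
        break;
      }
      sleep(1);
    }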

Fatal error when using NLP

When I have NLP processing enabled, I'm seeing this error:

bash-5.1# drush queue:run strawberryrunners_process_background --items-limit=1
 [error]  TypeError: http_build_query(): Argument #1 ($data) must be of type array, null given in http_build_query() (line 67 of /var/www/html/web/modules/contrib/strawberry_runners/src/Web64/Nlp/NlpClient.php) #0 /var/www/html/web/modules/contrib/strawberry_runners/src/Web64/Nlp/NlpClient.php(67): http_build_query(NULL)
#1 /var/www/html/web/modules/contrib/strawberry_runners/src/Web64/Nlp/NlpClient.php(52): Drupal\strawberry_runners\Web64\Nlp\NlpClient->post_call('/status', NULL, 2)
#2 /var/www/html/web/modules/contrib/strawberry_runners/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php(362): Drupal\strawberry_runners\Web64\Nlp\NlpClient->get_call('/status', NULL)
#3 /var/www/html/web/modules/contrib/strawberry_runners/src/Plugin/QueueWorker/AbstractPostProcessorQueueWorker.php(612): Drupal\strawberry_runners\Plugin\StrawberryRunnersPostProcessor\OcrPostProcessor->run(Object(stdClass), 1)

The problem seems to begin here, where the $data argument that is eventually fed into http_build_query is set to NULL.

This could be a PHP version thing: before PHP 8, http_build_query() would have raised a warning here rather than a TypeError?

Is this an appropriate fix?:

          'content' => is_array($params) && !empty($params) ? http_build_query($params) : "",

OCR timeout can cause infinite loop?

We have found that if an OCR process fails, it throws new RequeueException('I am not done yet. Will re-enqueu myself'); and continues to retry and fail, creating an infinite loop, as seen in this sample watchdog output:
strawberry_runner_ocr_looping.csv

In our case, I believe the issue is that the command is timing out, and it may be solved by increasing the OCR processor's timeout setting.

Might the general solution, in part at least, be to switch to using DelayedRequeueException?
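
A minimal sketch of that switch, assuming the worker can tell it is not done; DelayedRequeueException (Drupal core 9.1+) keeps the item's lease for the given delay before it becomes available again, so a failing OCR run cannot spin in a tight loop:

    use Drupal\Core\Queue\DelayedRequeueException;

    // Inside AbstractPostProcessorQueueWorker::processItem(): instead of
    // throwing RequeueException (immediate retry), delay the retry.
    if (!$done) {
      // The item keeps its lease for 120 seconds before being retried.
      throw new DelayedRequeueException(120, 'I am not done yet. Will re-enqueue myself later.');
    }

Note that the delay is only honored by queue backends implementing DelayableQueueInterface (core's database queue does); others fall back to the default lease time.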

Allow System Binary Processor to use URLs instead of requiring local file availability

What?

Some system binaries (exiftool, ffmpeg) can stream from a remote source without having to generate a local temporary file. Knowing which ones can/should do this is left to the user to explore, but to allow it to happen (e.g. from S3) we need to generate a signed URL as the argument (using %url instead of %file).

This can be a checkbox that allows such processing, since e.g. Droid will never be able to read from a remote stream.

I think this can be done via SF3S or using the AWS S3 client directly, but I need to research it more deeply.

@giancarlobi @aksm @alliomeria

Post processor Plugin for Archivematica/AIP

What?

Generate AIPs from ADOs. Do the AtoM/METS thing, bundle, add assets, push to the SWORD API (or any other newer option these people have).

There is legacy code we can look at here:

https://github.com/Islandora-Labs/archidora/blob/7.x/includes/archivematica.inc

We could also build a base API-deposit class for these types of plugins so that in the future any other final/transformed destination can be supported.

@giancarlobi if you have any other deposit/preservation needs let me know. This module is basically your child!

Blank page with empty hOCR hangs last version of ocrhighlight plugin

@DiegoPino Probably when a blank page of a PDF with empty hOCR is fed to a Solr doc using the ocrhighlight plugin (latest release > 0.5.0), Solr returns an error. As you told me, we cannot present an empty field for miniocr because the XML validation fails.
I can run some checks in the next days, adding here

foreach ($page->xpath('.//ns:span[@class="ocr_line"]') as $line) {

a flag to catch empty hOCR and add at least one word; probably a single space might be sufficient.
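
A minimal sketch of such a flag, assuming the converter writes miniOCR through an \XMLWriter instance ($miniocr is a name used here for illustration):

    $page_has_words = FALSE;
    foreach ($page->xpath('.//ns:span[@class="ocr_line"]') as $line) {
      // ... existing per-line word extraction ...
      $page_has_words = TRUE;
    }
    if (!$page_has_words) {
      // Blank page: emit a single placeholder word (one space) so the
      // resulting miniOCR still validates and ocrhighlight does not choke.
      $miniocr->startElement('w');
      $miniocr->text(' ');
      $miniocr->endElement();
    }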

VBO action to re-trigger SBRs

What?

Sometimes you need to clear queues. Or you have a set of objects whose failed OCR needs reprocessing (ran out of space, random service errors, who knows).

We need a VBO action that does exactly this:

public function onEntitySave(StrawberryfieldCrudEvent $event) {
  /* @var $plugin_config_entities \Drupal\strawberry_runners\Entity\strawberryRunnerPostprocessorEntity[] */
  $plugin_config_entities = $this->entityTypeManager->getListBuilder('strawberry_runners_postprocessor')
    ->load();
  $active_plugins = [];
  foreach ($plugin_config_entities as $plugin_config_entity) {
    // Only get first level (no Parents) and Active ones.
    if ($plugin_config_entity->isActive() && $plugin_config_entity->getParent() == '') {
      $entity_id = $plugin_config_entity->id();
      $configuration_options = $plugin_config_entity->getPluginconfig();
      $configuration_options['configEntity'] = $entity_id;
      /* @var \Drupal\strawberry_runners\Plugin\StrawberryRunnersPostProcessorPluginInterface $plugin_instance */
      $plugin_instance = $this->strawberryRunnerProcessorPluginManager->createInstance(
        $plugin_config_entity->getPluginid(),
        $configuration_options
      );
      $plugin_definition = $plugin_instance->getPluginDefinition();
      // We don't use the key here to preserve the original weight-given order.
      // Classify by input type.
      $active_plugins[$plugin_definition['input_type']][$entity_id] = $plugin_instance->getConfiguration();
    }
  }
  // We will fetch all files and then see if each file can be processed by
  // one or more plugins.
  // The slower option would be to traverse every file per processor.
  $entity = $event->getEntity();
  $sbf_fields = $event->getFields();
  // First pass: for files, all the as:structures we want, keyed by content type.
  /* check your config
  "source_type" => "asstructure"
  "ado_type" => "Document"
  "jsonkey" => array:6 [▼
    "as:document" => "as:document"
    "as:image" => 0
    "as:audio" => 0
    "as:video" => 0
    "as:text" => 0
    "as:application" => 0
  ]
  "mime_type" => "application/pdf"
  "path" => "/usr/bin/pdftotext"
  "arguments" => "%file"
  "output_type" => "json"
  "output_destination" => array:3 [▼
    "plugin" => "plugin"
    "subkey" => 0
    "ownkey" => 0
  ]
  "timeout" => "10"
  "weight" => "0"
  "configEntity" => "test"
  */
  if (isset($active_plugins['entity:file'])) {
    foreach ($active_plugins['entity:file'] as $activePluginId => $config) {
      if ($config['source_type'] == 'asstructure') {
        $askeys = array_filter($config['jsonkey']);
        foreach ($askeys as $key => $value) {
          $askeymap[$key][$activePluginId] = $config;
        }
      }
    }
  }
  foreach ($sbf_fields as $field_name) {
    /* @var $field \Drupal\Core\Field\FieldItemInterface */
    $field = $entity->get($field_name);
    if (!$field->isEmpty()) {
      $entity = $field->getEntity();
      $entity_type_id = $entity->getEntityTypeId();
      /** @var $field \Drupal\Core\Field\FieldItemList */
      foreach ($field->getIterator() as $delta => $itemfield) {
        // Note: we are not touching the metadata here.
        /** @var $itemfield \Drupal\strawberryfield\Plugin\Field\FieldType\StrawberryFieldItem */
        $flatvalues = (array) $itemfield->provideFlatten();
        // Run first on entity:files.
        $sbf_type = [];
        if (isset($flatvalues['type'])) {
          $sbf_type = (array) $flatvalues['type'];
        }
        foreach ($askeymap as $jsonkey => $activePlugins) {
          if (isset($flatvalues[$jsonkey])) {
            foreach ($flatvalues[$jsonkey] as $uniqueid => $asstructure) {
              if (isset($asstructure['dr:fid']) && is_numeric($asstructure['dr:fid'])) {
                foreach ($activePlugins as $activePluginId => $config) {
                  // Never ever run a processor over its own creation.
                  if ($asstructure["dr:for"] == 'flv:' . $activePluginId) {
                    continue;
                  }
                  $valid_mimes = [];
                  // @TODO also split $config['ado_type'] so we can check.
                  $valid_ado_type = [];
                  $valid_ado_type = explode(',', $config['ado_type']);
                  $valid_ado_type = array_map('trim', $valid_ado_type);
                  if (empty($config['ado_type']) || count(array_intersect($valid_ado_type, $sbf_type)) > 0) {
                    $valid_mimes = explode(',', $config['mime_type']);
                    $valid_mimes = array_filter(array_map('trim', $valid_mimes));
                    if (empty($asstructure['flv:' . $activePluginId]) &&
                      (empty($valid_mimes) || (isset($asstructure["dr:mimetype"]) && in_array($asstructure["dr:mimetype"], $valid_mimes)))
                    ) {
                      $data = new \stdClass();
                      $data->fid = $asstructure['dr:fid'];
                      $data->nid = $entity->id();
                      $data->asstructure_uniqueid = $uniqueid;
                      $data->asstructure_key = $jsonkey;
                      $data->nuuid = $entity->uuid();
                      $data->field_name = $field_name;
                      $data->field_delta = $delta;
                      // Get the configured Language from descriptive metadata.
                      if (isset($config['language_key']) && !empty($config['language_key']) && isset($flatvalues[$config['language_key']])) {
                        $data->lang = is_array($flatvalues[$config['language_key']]) ? array_values($flatvalues[$config['language_key']]) : [$flatvalues[$config['language_key']]];
                      }
                      else {
                        $data->lang = $config['language_default'] ?? NULL;
                      }
                      // Check if there is a key that forces processing.
                      $force = isset($flatvalues["ap:tasks"]["ap:forcepost"]) ? (bool) $flatvalues["ap:tasks"]["ap:forcepost"] : FALSE;
                      // We are also passing the full file metadata.
                      // This gives us an advantage so we can reuse
                      // Sequence IDs, PDF pages, etc. and act on them.
                      // @TODO We may also want Kill switches in the
                      // main metadata to act on this,
                      // e.g. flv:processor[$activePluginId] = FALSE?
                      // Also: do we want to act on metadata and mark
                      // files as already sent for processing by a certain
                      // $activePluginId? That would allow us to skip
                      // reprocessing more easily.
                      $data->metadata = $asstructure;
                      // @TODO how to force?
                      // Can be a state key, value key, or a JSON-passed property.
                      // The issue with a JSON-passed property is that we can no
                      // longer modify it here (the entity is saved), so we should
                      // really have a non-metadata method for this. Or we can have
                      // a preSave subscriber that reads the prop, sets the state
                      // and then removes it before saving.
                      $data->force = $force;
                      $data->plugin_config_entity_id = $activePluginId;
                      // See https://github.com/esmero/strawberry_runners/issues/10
                      // Since the destination queue can be a modal thing, and what
                      // really defines it is the type of worker we want (but all of
                      // them eventually feed the ::run() method), we want to make
                      // this a full blown service.
                      \Drupal::queue('strawberryrunners_process_index', TRUE)
                        ->createItem($data);
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  $current_class = get_called_class();
  $event->setProcessedBy($current_class, TRUE);
  if ($this->account->hasPermission('display strawberry messages')) {
    $this->messenger->addStatus($this->t('Post processor was invoked'));
  }
}

Probably the way to go here is:

  • A new EventSubscriber, with the VBO action emitting its own event (not insert/save). We don't want to reuse the existing events because this should not always imply a change in the ADO (eventually it might, but not always). A sketch of such an action follows below.
  • A Force option so that even ADOs that already have post-processing present will regenerate it.

A small summary of which ones will/won't run, shown as a result at the end of the VBO action.
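
A sketch of the action shell, assuming the Views Bulk Operations base class; RetriggerEvent is a hypothetical event class whose subscriber would reuse the enqueue logic from onEntitySave() shown above, honoring the Force option:

    <?php

    namespace Drupal\strawberry_runners\Plugin\Action;

    use Drupal\Core\Session\AccountInterface;
    // Hypothetical event class, e.g. \Drupal\strawberry_runners\Event\RetriggerEvent.
    use Drupal\strawberry_runners\Event\RetriggerEvent;
    use Drupal\views_bulk_operations\Action\ViewsBulkOperationsActionBase;

    /**
     * Hypothetical VBO action: re-trigger SBR post processing for an ADO.
     *
     * @Action(
     *   id = "strawberry_runners_retrigger",
     *   label = @Translation("Re-trigger Strawberry Runners post processing"),
     *   type = "node"
     * )
     */
    class RetriggerPostProcessing extends ViewsBulkOperationsActionBase {

      /**
       * {@inheritdoc}
       */
      public function execute($entity = NULL) {
        // Dedicated event, not insert/save: re-processing should not imply
        // a change in the ADO.
        $event = new RetriggerEvent($entity, $this->configuration['force'] ?? FALSE);
        \Drupal::service('event_dispatcher')->dispatch($event, RetriggerEvent::NAME);
        return $this->t('Post processing re-queued');
      }

      /**
       * {@inheritdoc}
       */
      public function access($object, AccountInterface $account = NULL, $return_as_object = FALSE) {
        return $object->access('update', $account, $return_as_object);
      }

    }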

If a file referenced in a queue does not exist anymore, the lease should be removed

What?

@giancarlobi I just noticed that if someone adds an object and a PDF is added to the queue, and then goes and removes the file and swaps it for a new one, the queue item gets stuck with an error and keeps re-adding itself.

We may want to make a missing file trigger queue-item removal and lease release.

e.g. this is stuck in the queue right now:

{#881 ▼
  +"fid": 76
  +"nid": "21"
  +"asstructure_uniqueid": "urn:uuid:b70b71bf-21f1-49d9-8988-d7b29746ed92"
  +"asstructure_key": "as:document"
  +"nuuid": "5d6b756b-a246-4105-9fe2-8b706a6384c4"
  +"field_name": "field_descriptive_metadata"
  +"field_delta": 0
  +"metadata": array:14 [▼
    "url" => "s3://525/application-nyhs-v01-n01-quarterly-bulletin-april-1917-1-b70b71bf-21f1-49d9-8988-d7b29746ed92.pdf"
    "name" => "NYHS v01 n01, Quarterly Bulletin, April 1917_1.pdf"
    "tags" => []
    "type" => "Document"
    "dr:fid" => 76
    "dr:for" => "documents"
    "dr:uuid" => "b70b71bf-21f1-49d9-8988-d7b29746ed92"
    "checksum" => "525e1209fe0afcae44d8d0a26ea4962a"
    "flv:exif" => array:18 [▶]
    "sequence" => 2
    "flv:pronom" => array:4 [▶]
    "dr:mimetype" => "application/pdf"
    "flv:pdfinfo" => array:28 [▶]
    "crypHashFunc" => "md5"
  ]
  +"force": false
  +"plugin_config_entity_id": "pager"
}

This s3://525/application-nyhs-v01-n01-quarterly-bulletin-april-1917-1-b70b71bf-21f1-49d9-8988-d7b29746ed92.pdf file was removed and no longer exists.

Kitchen Door for strawberry_runners_postprocessor plugins

Another improvement here:

  • We should also have an API/smaller "kitchen door" way of providing data for a configured @StrawberryRunnersPostProcessor. The idea is that e.g. AMI can provide ready-made OCR instead of having to go through the whole process. This requires some good planning, but I was thinking maybe this:

  • The kitchen door will simply deposit, for a given source file (the UUID) and a given ADO (the ID), a file candidate in a certain location.

  • Every strawberry_runners_postprocessor instance will check, before going into full process mode, whether such a candidate exists.

    • If there, consume it (but don't delete it; mark it to be composted by AMI really, since we might need it again) and/or transform it.
    • If not, do what it normally does: generate.

Also: every Strawberry Runner output will also have a persisted file representation. This will allow you to build any simple "text area" element to replace OCR from S3/disk and trigger a Solr reindex.

Add plaintext and Total Sequence Count to Search API indexable OCR processor

What is this?

Matching issue for esmero/strawberryfield#168 and https://github.com/esmero/strawberryfield/issue/165

This will make the HOCR processor pass 2 new elements back to the Abstract Processor, allowing pure plain text and an expected total count of documents to be indexed in Solr. The first is needed for nice search excerpts; the second allows "harvesting when ready" and saving back into a Frictionless Data package at the ADO level for long-term persistence of generated HOCR (expensive stuff to generate every time).

Move ADO Tools into strawberryfield

What?

As part of esmero/strawberryfield#237, ADO Tools need to be removed from this module. The following pieces need to be migrated over:

  • The corresponding section from strawberry_runners.links.task.yml.
  • The corresponding section from strawberry_runners.routing.yml.
  • The UpdateCodeMirrorCommand.php class.
  • The StrawberryRunnersToolsForm.php class.

Manage flavours one by one per ADO

@DiegoPino I'm thinking that it would be much simpler if we manage flavours one by one per node; that is, when we push an item onto the queue, we load node + flavour, so if we have to manage more than one flavour (e.g. exif and hocr) for the same node, we'll load node_id+exif then node_id+hocr.
In addition, this helps us set flavour order, lets us mark a single flavour as run OK per node, and allows one flavour to run OK even if another flavour runs with errors.
What do you think about this?

Processor Parent (config) might get corrupted during drag and drop

What?

Just saw a processor become its own parent after being disabled/dragged/re-enabled/dragged. Clearly a bug. It can't be noticed by just looking at that UI list (with hierarchies); the issue only becomes apparent in the config YAML file (parent: itsownid) and/or when trying to "trigger Post processing" via an action, because the post processor, even if active, will be missing from the list. BUG!

Post processor Plugin for Zip Files

What is needed?

Post processor that creates a zip file for objects comprised of many images.

  • to provide quick retrieval of source image files for an ADO (such as all the pages of an image-based Book object)
  • zip file attached to the original ADO with the option to download
  • along the lines of the existing Warc to Wacz post-processor

@DiegoPino & @aksm does this cover the issue? any other details to consider?

We need a str_replace_first function

As frictionlessdata/datapackage no longer requires illuminate/support as a sub-dependency, we need a function to replace the str_replace_first currently used in this module.
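
A minimal sketch of a drop-in replacement for the helper previously provided by illuminate/support (Str::replaceFirst):

    /**
     * Replaces only the first occurrence of $search in $subject.
     */
    function str_replace_first(string $search, string $replace, string $subject): string {
      if ($search === '') {
        return $subject;
      }
      $position = strpos($subject, $search);
      if ($position === FALSE) {
        return $subject;
      }
      return substr_replace($subject, $replace, $position, strlen($search));
    }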

Make a Frictionless Data package Post processor

What?

Data, singular and plural, is great, but it is even better if it can be discovered. How do you make that happen? It happens that if you push it into Solr, it can be discovered!

What type of data? Frictionless Data Packages with a manifest and resources. From CSV to JSON indexes, this will serve tabular, research, government, publication and web-archiving data, but also our own OCR, which will happily spend its life in a single package rather than in files scattered around.

Allow other queues to be fed

What is this?

This came from a friendly talk I had today. For larger deployments the internal Drupal queues may not be scalable enough; or, said differently, we may have macro queues feeding the Drupal-worker-driven ones. The need is simple to accomplish: I will add a queue service that can be extended via plugins to feed other queues too/instead, and then we can figure out whether anything from AWS/RabbitMQ to even a RocksDB-based queue can be added as a recipient for pending processes as external services.

Pure Text extraction from HOCR is HTML entity encoded

What?

When we produce the pure OCR text extraction (from HOCR/PDFALTO) we keep the HTML entity encoding. This hurts Views display since, internally, Twig cannot decode the entities and will double-encode.

I think (just a theory) this can be fixed here:

$page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';

Basically, we don't want this:

(screenshot omitted)
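
A minimal sketch of the decode step, assuming it is safe to apply right after the strip_tags() call quoted above:

    $page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
      PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
    // Decode HTML entities (e.g. &amp;, &#39;) so Twig does not double-encode.
    $page_text = html_entity_decode($page_text, ENT_QUOTES | ENT_HTML5, 'UTF-8');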

The question (if fixing this) is how we remediate/tap into fixing this for existing OCRs. One way would be, on reindex, to detect whether the already-cached plain text has HTML entities, decode them, and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661 — but it could also be a hook_update()?

@aksm @alliomeria @karomabiles what do you think?

How to annotate flavours in SBF JSON

@DiegoPino after our talk, I figured out something like this to annotate flavours in SBF JSON:

  • we need fine granularity for each key ("as:image" or any other future key) that flavours apply to
  • we have to use the flavour name in the key to make it discoverable
  • we use this status-code logic:
    (0) ready: runner executed and SBF JSON updated
    (1) new: runner to execute
    (2) update: runner to execute again because the SBF JSON was updated
    (3) remove: runner must remove the entry related to this flavour

So, to mark an image as requiring the "exif" flavour, we write this:

    "as:image": {
        "urn:uuid:35915592-83c8-40b6-b097-29c51c134cc7": {
            "url": "private:\/\/53d\/image-giovane-uomo-del-ballerino-che-indossa-un-salto-russo-piega-del-costume-28730977_3.jpg",
            "name": "giovane-uomo-del-ballerino-che-indossa-un-salto-russo-piega-del-costume-28730977_3.jpg",
            "tags": [],
            "type": "Image",
            "dr:fid": 62,
            "dr:for": "images",
            "dr:uuid": "35915592-83c8-40b6-b097-29c51c134cc7",
            "flv:exif": {
               "status": 1
            },

And after the runner updates it, the SBF JSON looks like this:

    "as:image": {
        "urn:uuid:35915592-83c8-40b6-b097-29c51c134cc7": {
            "url": "private:\/\/53d\/image-giovane-uomo-del-ballerino-che-indossa-un-salto-russo-piega-del-costume-28730977_3.jpg",
            "name": "giovane-uomo-del-ballerino-che-indossa-un-salto-russo-piega-del-costume-28730977_3.jpg",
            "tags": [],
            "type": "Image",
            "dr:fid": 62,
            "dr:for": "images",
            "dr:uuid": "35915592-83c8-40b6-b097-29c51c134cc7",
            "flv:exif": {
              "status": 0,
              "data": {
                "XP Title": "http:\/\/www.dreamstime.com\/royalty-free-stock-photography-young-dancer-man-wearing-folk-russian-costume-jumping-image28730977",
                "Copyright": "(c) Stepanov | Dreamstime.com (Photographer) - [None] (Editor)",
                "Color Space": "Internal error (unknown value 65535)",
                "Exif Version": "Exif Version 2.1",
                "X-Resolution": "72",
                "Y-Resolution": "72",
                "FlashPixVersion": "FlashPix Version 1.0",
                "Resolution Unit": "Inch",
                "YCbCr Positioning": "Centered"
              }
            },

Push Solr Indexed Flavors into a Frictionless data package

Why?

Our Strawberryfield data source is totally virtual. During a processing chain we use local-storage key-values to allow Search API to fetch the recently ingested data. But for a longer/complete reindex we want that data in a more stable place, especially for long-running/expensive operations like HOCR.

The logic we want: after a processor's output has been tracked, we push the data into a (new or existing) Frictionless Data package file managed by us. The idea is that if the file exists and the content for a certain Flavor ID is inside, we update it; if not, we create the package and add the content.

Then the Flavor data source can always try to fetch from the less expensive key/value store first or, if not found, see whether the node itself has one of the packages corresponding to the same FLV ID.

Flavors indexed into Solr have this ID pattern (Flavor ID):

"ss_search_api_id":"strawberryfield_flavor_datasource/2017:1:en:1d9ae1cd-b3d0-477c-8061-313bb1bc9273:ocr",

Which means:
  • strawberryfield_flavor_datasource => the data source
  • 2017 => the Node ID
  • 1 => the sequence (remember this is one node to many files to many sequences)
  • en => the language code
  • 1d9ae1cd-b3d0-477c-8061-313bb1bc9273 => the UUID of the file that was processed
  • ocr => the plugin type that generated this
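
For illustration, the ID decomposes like this (a sketch, not the actual datasource code):

    // "strawberryfield_flavor_datasource/2017:1:en:1d9ae1cd-...:ocr"
    [$datasource_id, $raw_id] = explode('/', $item_id, 2);
    [$nid, $sequence, $langcode, $file_uuid, $plugin_type] = explode(':', $raw_id);
    // $nid = "2017", $sequence = "1", $langcode = "en",
    // $file_uuid = "1d9ae1cd-b3d0-477c-8061-313bb1bc9273", $plugin_type = "ocr".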

Depending on how well I can deal with issue esmero/strawberryfield#115, we may want many Frictionless Data Packages or a single one.

The operation would be (buggy pseudocode):

  • Post processor (flavor) output is tracked into the index // already do this
  • Post processor checks whether the node (source) already has a datapackage for that FLV (e.g. ocr)
  • If yes, it checks the manifest.json of the ZIP; if the same Flavor ID is already there, it replaces it
  • If not, it creates and initializes the datapackage, adds the first post-processor output, and attaches it to the node.
  • This happens for every sequence, etc.

On reindexing/indexing/update from Search API:

  • We get a Flavor ID. // already do this
  • We check that the pattern makes sense and validate the data // already do this
  • We check whether the ID is in the key/value store // already do this
  • If yes -> great, add the data again // already do this
  • If no -> check whether the node has the datapackage and it contains the Flavor ID; if so, fetch the data, rebuild the needed data structure for Search API (because it's more than just the HOCR) and pass that back to Search API
  • If neither, it means the flavor no longer exists (the processing was deleted or the original files are gone) and the Solr document is removed.

@giancarlobi ideas/thoughts?

PDFALTO non fatal errors breaking OCR

What?

This is a multi-issue issue. We found a PDF that, when processed through PDFALTO, did generate correct OCR but also threw thousands of PDF standard syntax errors. Because the output of PDFALTO goes directly to the console (terminal), the resulting XML could not be processed. But here is where the larger issue happened: when Hydroponics was set to 0 (meaning run until finished), the failure triggered eternal re-enqueueing (I'm pretty sure I coded a 3x max-retries limit) and got stuck for days trying over and over.

  • The quick fix was to add -q as a PDFALTO argument to the OCR processor via the form, but honestly this should be a standard argument
  • I need to revisit what happens when we throw an exception in a Processor and how the main queue item worker deals with it
  • There needs to be a log entry when this happens, and we need a circuit breaker. We can't end up retrying forever (ever)

(screenshot omitted)

For reference, the command run manually threw this type of syntax error (a PDF-standard non-compliance issue):

Syntax Error (1675049): Incorrect number of arguments in 'sc' command
Syntax Error (1675178): Incorrect number of arguments in 'sc' command
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Description><MeasurementUnit>pixel</MeasurementUnit>

HOCR per image page(s)

What is needed?

Post processor option to generate HOCR per image page(s), to complement the existing post processor that runs OCR/HOCR against PDF files.

Per @DiegoPino's notes and discussions:

  • Needs to incorporate correct page sequencing
  • Add for each Page (no collapsed data) an extra location of HOCR URL
  • Allow the OCR postprocessor to feed from a URL

This will be great to have @DiegoPino! 🤓

Improve messages generated by SBR enqueued items

What?

When SBR runners and/or Hydroponics (so the strawberryfield module) are running, the messages we generate are utterly generic and give the admin user no info about what is happening.

e.g hydroponics will say:
--- processing one item for strawberryrunners_process_background

And internally (really for performance reasons) there is little info coming out of each processor.

Let's make this better.
@alliomeria @karomabiles @aksm let's talk about this when possible.

Make a first EXIF only Post processor Plugin

What is this?

This is all related to #4 and a call we had with @giancarlobi today (March 17th, 2020).

The tasks:

Summing up: this task by itself is just making a particular, slim, limited version of the generic Binary Processor. Nothing more.

How does this fit in the global Chain?

I will explain how this needs to be done at the end, when we have all the pieces.

  • Any CRUD action on an ADO will also trigger a JSON Event (as we do with the pre-save ones)
  • This module contains one specific event type extending JSON Event that, when triggered, has/should have the logic to:
    • Fetch all post-processor plugins (EXIF, HOCR, etc.) that match the input (some metadata or files). EXIF will be one of those
    • Generate all local/temp files needed for the plugin instances
    • Push each atomic task into a queue. Each queue item will contain the value (JSON?) or just the ADO UUID, the plugin ID that needs to be triggered and (maybe?) the needed file UUIDs
    • Once everything is pushed, the event will also trigger the mainLoop that will start the background processing or, if it is already running, simply end its task

Make Processor Plugins hierarchical

What is this?

See #5 and #4 and #6

The idea is that plugins, which are driven by config entities as defined here https://github.com/esmero/strawberry_runners/pull/5/files#diff-6cb3b61e72b132f4e76eaf33127a920e, are not only sorted by weight but also can/should be hierarchical. Why? Because we would like to allow, by logic, post-processor plugins to work on other post-processors' outputs. E.g. one post processor extracts files from a PDF, then another uses those files to produce HOCR.

How to accomplish this?

  1. Our configuration entity needs more logic. A first step would be to add two properties:
  • Parent
  • Depth

Which would allow us to use something similar to this form in the entity list builder https://api.drupal.org/api/drupal/core%21modules%21system%21tests%21modules%21tabledrag_test%21src%21Form%21TableDragTestForm.php/class/TableDragTestForm/8.7.x
to allow people to move/drag plugin instances into a hierarchy.

Parent can be NULL (top post processor) or another post processor config entity UUID/ID.
Depth can be used to quickly find siblings, etc.

  2. Our event subscriber that gets all the JSON events (or the event itself) then needs better logic to build a tree of execution. Meaning: if we push data into a QUEUE, ITEM B cannot be processed until ITEM A has been. That is quite complex, and we can discuss how to deal with it. Options are (thinking out loud):
    A. ITEM A is actually the one that, during its own processing, adds a new queue item for ITEM B. This means each TOP post processor (parent/sibling) is responsible for generating its output but also for triggering the next processing step (please ask if this makes no sense)
    OR
    B. We have many QUEUES, one per depth... and each queue, once processed, triggers a new event that sets a flag allowing the NEXT depth to be processed. This can lead to unnecessary processing.

ADO Tools Permissions & Display

What is needed?

  • Specific Permissions option for accessing ADO Tools

  • Hiding/not-showing of ADO Tools for Content types not associated with a Strawberryfield (such as Article or Basic Page)

Fix Binary detection

What?

For the new TEXT processor I used a very naive binary-detection approach (mb_detect) which, funnily enough, does not behave the same in PHP 7+, 8.0 and 8.1.

I'm changing this to a preg_match using //u for detection. This will work!
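
A minimal sketch of the //u idiom, assuming "binary" here means "not valid UTF-8":

    /**
     * Checks whether a string is valid UTF-8 text.
     *
     * preg_match() with the 'u' modifier returns FALSE on byte sequences
     * that are not valid UTF-8, which makes for a cheap binary check.
     */
    function is_probably_text(string $bytes): bool {
      return preg_match('//u', $bytes) === 1;
    }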

Add a NO POST PROCESSING json key (exception) to skip on a one-by-one level a certain post processor(s)

What?

e.g. you have a book, all handwritten; the SBR rules say "run the pager and the OCR for all pages that have a tiff", and these are tiffs.

We allow this:

"ap:tasks": {
   "ap:nopost": [
      "pager"
   ]
}

And that will skip the pager for that ADO.
Simple/cool.

Remember, kids, we also have:

"ap:tasks": {
   "ap:forcepost": true
}

That will force reprocessing even if it was processed before (only if the rules match, of course). But ap:nopost wins: no reprocessing can be forced if we are skipping.
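
A sketch of the precedence check, modeled on the existing ap:forcepost handling in the enqueue code shown earlier; ap:nopost is read the same way and wins:

    // Inside the enqueue loop, per active plugin.
    $force = isset($flatvalues["ap:tasks"]["ap:forcepost"]) ? (bool) $flatvalues["ap:tasks"]["ap:forcepost"] : FALSE;
    $nopost = (array) ($flatvalues["ap:tasks"]["ap:nopost"] ?? []);
    if (in_array($activePluginId, $nopost, TRUE)) {
      // ap:nopost wins: skip this processor even if forcepost is set.
      continue;
    }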

Thanks

RC2 Bug in OCR Processor. Lines may have more than just 5 elements. Breaks Tesseract

What is this?

An OCR bug was introduced recently, and I'm not getting HOCR working on RC2 for PDFs.

The assumption of 5 title elements per line is not accurate for Tesseract-generated HOCR (version 4.1.1+).

A line can look like this:

 <span class='ocr_line' id='line_1_16' title="bbox 330 1210 706 1234; baseline 0.008 -2; x_size 27.361343; x_descenders 5.3613443; x_ascenders 6">
<span class='ocrx_word' id='word_1_32' title='bbox 330 1212 344 1232; x_wconf 71'>3</span>

with the title attribute carrying 7 or more elements for a single word.

The fix is simple; I just wish I had caught it sooner.
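
A sketch of a more robust parse, reading the title attribute by its named properties instead of assuming a fixed element count ($word stands for the SimpleXML element of an ocrx_word span):

    // title="bbox 330 1212 344 1232; x_wconf 71" — properties are ';'-separated
    // and each starts with its name, so never assume exactly 5 tokens.
    $properties = array_map('trim', explode(';', (string) $word['title']));
    $bbox = [];
    foreach ($properties as $property) {
      if (strpos($property, 'bbox') === 0) {
        // "bbox x0 y0 x1 y1" -> the four coordinates.
        $bbox = array_slice(explode(' ', $property), 1, 4);
        break;
      }
    }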

@giancarlobi I will fix this ASAP and also push an update to the RC2 Docker. This breaks PDF-without-DJU HOCR.

Ingested in-place files on S3 are queued for composting

We noticed that we're seeing messages in our logs like this:

"Attempt to compost an unsafe File with path s3://media/cna_001693/cna_001693_metadata.xml was made. Please manually delete it or lower/configure your security settings and code overrides defining what a Safe Path is."

Fortunately, CompostQueueWorker::processItem() is pretty paranoid about what is safe to delete (thank you Diego!!).

I'm trying to remember how file persistence works in this case where the file has been uploaded to its final destination in S3, and we just provide the path to it in our AMI spreadsheet. Is it just when the source path for the file in the AMI csv is inside the "Relative Path for Persisting Files" that it does not copy the file over and simply saves the path to the file?

My hunch is that AbstractPostProcessorQueueWorker::processItem(), where it sets $needs_localfile_cleanup, needs to be aware of whether the file is temporary or not. One way to do that might be to compare the original file location with the saved file location, but I haven't taken the time to figure out where we would find that inside this method.

I'd be very happy to take a whack at a solution for this @DiegoPino if you like and can give me a bit of direction as to the best way to handle this.

Build EZID integration

See https://ezid.cdlib.org/doc/apidoc.html

I'm not sure yet if this should be an enqueued task, but it seems like a good idea to maybe make this a post processor and not directly dependent on an entity insert/pre-save...

It could work like the WARC-to-WACZ processor: under certain rules we request the DOI or ARK ID. If one is already present, we skip. Etc.

Indexing OCR fails with multibyte characters

While testing for this issue, I found that OCR from some pages was failing to index in solr. As best I can tell/guess, this happens when some OCR with multi-byte characters is serialized and stored in the key_value table, and then fails to unserialize because the character counts in the serialized data don't match the byte counts in that data - or something like that.

It seems weird that something would serialize in a way that can't be unserialized!
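
For illustration, a sketch of how a byte/character mismatch breaks unserialize() (serialize() records the byte length of strings):

    <?php
    // serialize() records the BYTE length of strings.
    $good = serialize('naïve');            // 's:6:"naïve";' — "naïve" is 6 bytes in UTF-8.
    var_dump(unserialize($good));          // string(6) "naïve"

    // If the stored bytes are later re-encoded (for example by a database
    // charset conversion), the declared length no longer matches the actual
    // byte count and unserialize() fails.
    $latin1 = mb_convert_encoding('naïve', 'ISO-8859-1', 'UTF-8'); // now 5 bytes
    $corrupted = 's:6:"' . $latin1 . '";';
    var_dump(@unserialize($corrupted));    // bool(false)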

Anyway, I'm attaching a couple files. One is a pdf that resulted in the OCR with multibyte characters. The second is the text that I see from the first page in the keystore.

CT_news_1871-01.pdf
western-ct-news-ocr-page-1.txt

tesseract OCR only takes pdf files as input

In OcrPostProcessor, where it builds the command to run tesseract, the command always emerges in the form:

{{ ghostscript command that takes the file and tries to generate a png from it }} && {{ tesseract command that uses the png as input }}

I.e. it only works with pdf files as input! Any raster file (tiff, jpeg, etc) results in no OCR being generated.

It should first check the input file to see if tesseract can run on it directly, and if not, then test if ghostscript can convert it into a file that tesseract can run on.
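
A sketch of that check, under the assumption that Tesseract (via Leptonica) reads TIFF/JPEG/PNG/BMP directly so the conversion step is only needed for PDFs; $mime_type, $source_file and the command fragments are illustrative names:

    // Formats Tesseract (via Leptonica) can read without conversion.
    $tesseract_native = ['image/tiff', 'image/jpeg', 'image/png', 'image/bmp'];
    if (in_array($mime_type, $tesseract_native, TRUE)) {
      // Run tesseract directly on the source file.
      $command = $tesseract . ' ' . escapeshellarg($source_file) . ' stdout hocr';
    }
    else {
      // Fall back to the existing pipeline: ghostscript renders a PNG first.
      $command = $gs_command . ' && ' . $tesseract_command_on_png;
    }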

@DiegoPino I'll take this on, it seems pretty easy -- that is unless I have misdiagnosed this problem!

Sequence incorrect for OCR on multiple-image object

When using the method described here to create a CreativeWorkSeries that is displayed using the book viewer with in-book searchable OCR'd text, I'm finding that if I attach more than one image file to one of the page objects, the book viewer (correctly) displays the extra image file as an extra page; the extra image file is OCR'd, saved, and indexed in Solr; and the content of that page is searchable in the book viewer. However, in-book searching does not find it on the correct page, and the highlight displays on the incorrect page. The smoking gun: when looking in the key_value table, the name of the OCR record always sets the sequence number to 1.

To answer your question @DiegoPino, the json for the page objects in question do have sequence_id values, and the file attachments do have correct sequence values (1, 2, 3 etc.).

I tried re-creating this page object with two image files, rather than first having one image file, saving, and then adding a second image file. No difference - the two images are saved in the key_value table with sequence number of 1.

I spent a while (without xdebug working on my local, alas) trying to isolate where the problem may be happening. The $sequence_number is correct in line 311 of the OcrPostProcessor. However, I noticed here in the AbstractPostProcessorQueueWorker that it sets the $sequence_key to 1 if $data->siblings is not greater than one. And I noticed in the strawberryrunners_process_background queue entries that, while the sequence value is correct, siblings is always set to 1. So that would force the sequence number to always be one.

So maybe this could be fixed by finding where the siblings value is being incorrectly set for multiple-file objects?

Pandoc: Any document into any other document (transmute)

What is this?

We will eventually need to unify Word/DOC/HTML/any document that is uploaded into a single output.

For other things there is Mastercard, but for this there is https://pandoc.org/MANUAL.html!

It so happens that Pandoc is quite powerful: we could add it to the Docker container and create a specific processor for it too. But I need to build a test set/matrix to see how the different outputs we could eventually render in realtime in Archipelago are represented.

@mitchellkeaney (Does Traci have a GitHub handle? In case I want to include her in the discussion)
@giancarlobi have you ever had the need to upload a DOCX/PPTX or any other format and have the content indexed in Solr, e.g.? Any other use cases?

Make sure stalled Tesseract processes are killed

What?

During background processing of very large images for OCR, Tesseract stalls/fails, but the processes linger and are not killed properly.

Is the timeout being calculated properly? Is something else failing in the kill process?

Explore Plugin based system for Flavor processors

What is this?

I waited too long for this, so it's time to make a first branch to complement the good work of @giancarlobi.
The idea here is that each post processor is derived from a plugin. Plugins can/should be able to be connected to each other and have different inputs and outputs. They have settings and can sustain complex work on their own.

Depending on the number of assets, the complexity of the task, and where/how something is running, they can be enqueued and run sequentially like @giancarlobi's demo scripts or could eventually run via a UI-facing batch (need to figure out the exact conditions for this).

The branch associated with this issue will mostly be an experiment for now and should serve as a starting ground for generalizing the needed functionality. The goal is to have HOCR as a full implementation in the next 14 days. Beta3 will have this working and enabled.

Similar to "add a file" we need a "just add JSON" logic

What?

We have right now processors that feed:

  • files that get attached to the ADO
  • sequences that connect to another plugin
  • another plugin directly (feeding files/anything straight to it)
  • the index (Search API)

All/any of this can happen at the same time depending on the processor. Even though we have some other destinations set up, we have no logic yet for those. We want one that acts the same as the file logic but just adds a JSON (e.g. entity extraction via natural language processing).

We really just need to make this method:

public function updateNode(ContentEntityInterface $entity, stdClass $data, stdClass $io) {

more generic.
