esmero / strawberryfield
A Field of strawberries
License: GNU Lesser General Public License v3.0
Right now we make a good initial attempt at keeping track of provenance (not sure if that is the right word) of data inside a strawberryfield by adding this:
"as:generator": {
"type": "Update",
"actor": {
"url": "http:\/\/localhost:8001\/form\/descriptive-metadata",
"name": "descriptive_metadata",
"type": "Service"
},
"endTime": "2019-04-15T15:30:33-04:00",
"summary": "Generator",
"@context": "https:\/\/www.w3.org\/ns\/activitystreams"
},
But we are depending on Drupal's node data (uid) to keep track of who (human? machine?) did the actual action of updating. We need to make sure we allow a new structure that fits https://www.w3.org/TR/activitystreams-vocabulary/#actor-types to keep track of the user and bring significant data into that structure, so the info becomes a bit more independent of the current Drupal instance (e.g. don't use uid:1, which makes little sense outside of this D8 context).
What data about a user, API or machine interaction is good enough to make the JSON self-sustainable?
Tagging here @kllhwang since she came up with this very, very needed use case. Thanks!
Late night idea i came up with! So, imagine you have a 1TB file you need to attach to an ADO. Imagine that 1TB file is a WARC, or a huge movie, who knows, all your childhood love letters in 600dpi TIFFs and PDFs zipped. Who cares. Well, the owner cares. OK, the fact is those are large files. You for sure don't want to upload that via the UI. But you can use the multipart upload system of S3, drag and drop it directly into min.io or even FTP it. Let's say you uploaded it. Now how in the world will you attach it to your to-be-born ADO when filling out a webform? Drupal only allows you to upload files. And selecting files from Drupal (when you have a million...), e.g. via an Autocomplete, is not such a good idea (but could be done...). But there is this idea...
Idea is: when you drop such a large file into S3, S3 tells Drupal, hey friend, there is a new file, and Drupal generates a one-time voucher for you. The voucher is of course digital, a tiny unique hash or number. We can set up different folders per user or role in S3 and, depending on where things get dropped, Drupal gets notified via a webhook and notifies the user (via a Drupal message and an alert/view in their user profile). Now when creating the new ADO, the user, instead of uploading a file or pointing to a hard to remember URI (maybe the user does not even know the URI, right?), just adds that voucher to a special field (they can select from their available ones or paste one someone sent them). And all the rest, attaching, classifying, JSON-ifying etc., happens magically. The voucher expires and now the file becomes a fully Drupal driven entity.
I know i make this sound incredible when it's just such simple stuff. But guess what. Nobody has this! And it's just super useful. Admins can upload large files at night and then pass vouchers to Metadata people. Vouchers can be resolved before being used so the user knows what the thing is.
Saturday and late. But i had this idea a few days ago and needed to bring it somewhere. Hope we get this done, sounds like a good use case!
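The voucher life cycle described above (issue on upload notification, resolve for preview, redeem once) could be sketched roughly like this. This is only an in-memory illustration: the class name, method names and the S3 URI are hypothetical and none of this exists in strawberryfield.

```php
<?php
// Hypothetical sketch: one-time vouchers for files dropped directly into S3.
// Nothing here is an existing strawberryfield or Drupal API.
final class VoucherRegistry {
  /** @var array<string, array{uri: string, owner: string}> */
  private array $vouchers = [];

  // Called when the storage webhook reports a new object.
  public function issue(string $fileUri, string $owner): string {
    $code = bin2hex(random_bytes(8));
    $this->vouchers[$code] = ['uri' => $fileUri, 'owner' => $owner];
    return $code;
  }

  // Lets a user preview what a voucher points to without consuming it.
  public function resolve(string $code): ?string {
    return $this->vouchers[$code]['uri'] ?? null;
  }

  // Consumes the voucher: returns the file URI once, then the code expires.
  public function redeem(string $code): ?string {
    $uri = $this->resolve($code);
    unset($this->vouchers[$code]);
    return $uri;
  }
}

$registry = new VoucherRegistry();
$code = $registry->issue('s3://uploads/metadata-team/big.warc', 'admin');
```

A real implementation would need persistent storage and an access check on the owner or role; both are left out here on purpose.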
@giancarlobi @marlo-longley maybe this makes sense to you
As we move forward and our SBF JSON becomes more rich, we should start thinking and coding new types of JSON Key Provider Plugin implementations. This is related to #33 but goes beyond.
The ones i want and need:
The implementation is quite simple:
ContentEntityBase provides already a method:
\Drupal\Core\Entity\ContentEntityBase::referencedEntities
Which goes field by field checking for EntityReference Properties
/**
 * {@inheritdoc}
 */
public function referencedEntities() {
  $referenced_entities = [];
  // Gather a list of referenced entities.
  foreach ($this->getFields() as $field_items) {
    foreach ($field_items as $field_item) {
      // Loop over all properties of a field item.
      foreach ($field_item->getProperties(TRUE) as $property) {
        if ($property instanceof EntityReference && $entity = $property->getValue()) {
          $referenced_entities[] = $entity;
        }
      }
    }
  }
  return $referenced_entities;
}
So what we need is a JSON key provider that exposes a set (one or more) of JSON property values (node ids, for example) as \EntityReference class properties.
We have at least two ways of providing the JSON keys as arguments:
First, automatic, by using the new "ap:entitymapping": [] key we preprocess (or should because webform maintainer dismissed my pull request for that...gosh)
Or by simply allowing people to type the keys (hopefully in this case a full JSON Path?) that contain entity references. Examples are ismemberof, scene, etc.
With that Solr will allow us to co-index those referenced entities values, like their labels, etc.
Here is how i envision that:
The only keys that should be exposed are the leaves of a branch.
So if we have :
What we want to expose is as:document.*.checksum for example, which is really just the value of what is inside .checksum in that hierarchy. That seems also straightforward to do, logic would be
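One possible way to get at leaf values under a wildcard path is a small recursive walk. The sketch below is an assumption, not the plugin's logic: the function name is hypothetical, the sample JSON is made up, and it only collapses list indexes into `*` (keyed structures would need their own rule).

```php
<?php
// Hypothetical helper: collect leaf values keyed by a wildcard path.
// List indexes collapse to "*"; string keys are kept as-is.
function leafPaths(array $data, string $prefix = ''): array {
  $leaves = [];
  foreach ($data as $key => $value) {
    $segment = is_int($key) ? '*' : (string) $key;
    $path = $prefix === '' ? $segment : $prefix . '.' . $segment;
    if (is_array($value)) {
      // Branch: recurse and merge values that share the same wildcard path.
      foreach (leafPaths($value, $path) as $leafPath => $values) {
        $leaves[$leafPath] = array_merge($leaves[$leafPath] ?? [], $values);
      }
    }
    else {
      // Leaf: a scalar (or null) value.
      $leaves[$path][] = $value;
    }
  }
  return $leaves;
}

// Made-up sample data, only for illustration.
$json = '{"as:document": [{"checksum": "abc", "name": "a.pdf"}, {"checksum": "def", "name": "b.pdf"}]}';
$leaves = leafPaths(json_decode($json, true));
```

Here `$leaves['as:document.*.checksum']` holds every checksum found at that level of the hierarchy.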
3.- I want an aggregator KeyName provider, one that takes a few different keys from all over the JSON and unites them in a single property. The UI for that could be a little bit more cumbersome and, thinking out loud, it could even work on Properties we are already exposing via the other KeyName Providers? Or do you think we should keep this one at the same level? Same level means fewer dependencies... that is good. Post-processing means a different level, which means the keys can be selected instead of typed by the user.
The need for this is: get all referenced external URLs around the JSON and put them inside a single Solr field named URIS.
Logic here is simple
This plugin takes a bunch of keys, accumulates the values from all of them and then exposes all under a single, different Key name.
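As a rough sketch of that accumulation step, assuming for simplicity that the source keys are top-level JSON keys (the function and key names are hypothetical):

```php
<?php
// Hypothetical aggregator: gather the values found under several top-level
// JSON keys and expose them under a single, different property name.
function aggregateKeys(array $data, array $sourceKeys, string $targetKey): array {
  $values = [];
  foreach ($sourceKeys as $key) {
    if (!array_key_exists($key, $data)) {
      continue;
    }
    // A key may hold a single value or a list of values.
    foreach ((array) $data[$key] as $value) {
      if (is_scalar($value)) {
        $values[] = $value;
      }
    }
  }
  // De-duplicate while keeping the order of first appearance.
  return [$targetKey => array_values(array_unique($values))];
}

// Made-up sample data, only for illustration.
$data = [
  'subject_uri' => 'http://example.org/a',
  'creator_uris' => ['http://example.org/b', 'http://example.org/a'],
  'label' => 'Not a URI source',
];
$exposed = aggregateKeys($data, ['subject_uri', 'creator_uris', 'missing'], 'uris');
```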
Do we want to name, prefix, fields coming from a given KeyName provider differently so people can deduce who exposed them?
Ideas?
Sometimes, specially me. But this is not about me speaking, it's about SBF and the event generators being quite verbose on all operations that are happening.
We want to have a permission that enables/disables that, so users that do not need to know anything about File Persistence, etc., are not overwhelmed by our excessive (but many times needed) verbosity.
This requires
1.- A new permission so we are going to suggest any extenders/implementers of custom Event Subscribers(if any) use this
if ($this->account->hasPermission('display strawberry messages'))
before deciding to make the user happy with any new messages.
All EventSubscribers we implement will have that check
When configuring a JSON Key provider, the Config Entity list correctly shows the active/inactive status. But when editing, it always shows as active. The fix is minimal.
Missing schema for strawberry_textarea widget.
This was the only missing schema I found directly related to the strawberryfield repo, but we can leave this issue open until all the schema errors are resolved.
Related to esmero/webform_strawberryfield#12 (but here, since i want this to be part of src/StrawberryfieldFilePersisterService.php).
One of the most common use cases is people uploading images that already have some type of naming convention related to a given order/sequence.
This is a quick first step that requires more code to be fully compliant with esmero/webform_strawberryfield#12 (comment), basically this won't respect a manual reorder, since it will always apply and re-apply this based on given file names. Still, it solves the most common and missing need. Give images a default order based on file name uploaded.
Some ideas on how to move forward:
1.- Only do this if there is no previous order. Which adds the following problem: what if someone adds a new File during an edit and there is already an order? Do we simply add the new image to the end? What if the order that is there was not manually given but already automatic? How do we make that differentiation?
2.- Have a key that allows overrides. Add an extra key that defines what order was applied, like 'sequence_type': 'natural' or 'manual'?
If manual is present, don't touch. Still, the question persists on what to do when adding a new one. Simply add the highest order +1? And let the user then manually decide what to do? Could be a way.
3.- There are sure other ordering issues i have not considered.
To be honest, i don't totally like the idea of another key inside the as:image etc. structures. So maybe we can also do the following! In case of manual order, why don't we add a fully new structure which acts as a ToC for that list of files? That would allow any arbitrary order, but still would allow us to have, always, a natural ordered sequence like the one we generate automatically.
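To make option 2 concrete, here is a minimal sketch of a default natural ordering that backs off when a manual order is flagged. The `sequence_type` key is the idea floated above, not an existing key, and the function name and file entries are hypothetical.

```php
<?php
// Hypothetical sketch: assign a default sequence from file names using
// natural ordering, but leave everything alone if any entry is flagged
// as manually ordered.
function applyNaturalSequence(array $files): array {
  foreach ($files as $file) {
    if (($file['sequence_type'] ?? 'natural') === 'manual') {
      return $files;
    }
  }
  // Natural, case-insensitive comparison: "page2" sorts before "page10".
  uasort($files, fn(array $a, array $b): int => strnatcasecmp($a['name'], $b['name']));
  $sequence = 1;
  foreach ($files as $id => $file) {
    $files[$id]['sequence'] = $sequence++;
    $files[$id]['sequence_type'] = 'natural';
  }
  return $files;
}

// Made-up entries, only for illustration.
$ordered = applyNaturalSequence([
  'a' => ['name' => 'page10.tif'],
  'b' => ['name' => 'page2.tif'],
  'c' => ['name' => 'Page1.tif'],
]);
```

This does not answer the open question of where a newly added file goes once a manual order exists; it only shows the "don't touch if manual" branch.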
@giancarlobi @mitchellkeaney @marlo-longley i would love your opinion, hopefully any of you can in the few days think about use cases, edge cases and UIs to deal with this. Thanks
This is also related to esmero/webform_strawberryfield#12
With VBO (Views Batch Operations) this feature is quite simple. This module needs to provide a few action plugins, starting with a base one which allows strings to be replaced by other strings in the JSON.
As simple as that, with an exception: that the end result needs to be a valid JSON before and i would love to start using more powerful options since we are JSON fans.
JSON Patching and also JSON Diffs. Why? Because the order of things in JSON can not be ensured when dealing with properties, but also because a JSON Patch allows greater complexity. My main concern is the interface. Like i would love to have a Webform similar (if not the webform itself) to apply a change to a certain field and through that built the JSON Patch.
Also we have help for that. https://github.com/swaggest/json-diff
To make a better use of long/lat for Map integration purposes in Views, allow one of our SBF Key providers to deliver a geolocation type of property.
This code should help us
https://git.drupalcode.org/project/geolocation/commit/2ceb713
If you ever had time to read through my Roadmap feature list you will probably have noticed that i have planned ACL/access-control-list integration on our Digital Objects in Archipelago. ACL is fundamental for our IR needs too. And it goes way beyond just nodes, to files as well (yeah, let's not speak about media here, since we use file entities).
What Drupal provides with its users and permissions per role is closer to something named RBAC/Role based access control, but we need more: fine grained rules that are simpler to apply UI/UX-wise, but also inheritance (and being able to control how inheritance works and when it applies) that does not imply a batch operation to copy permissions from parent to child as e.g. Islandora 7 does. That is expensive and not in our spirit.
I read this good post from @rosiel today during my coffee and i feel that speaks a little bit about how complex permission system and UI/UX in a hierarchical structure/Drupal can be and what expectations from users differ. I wish i had that type of feedback here! So will humbly borrow that post to get started
Thanks to our own 'ahead' planning we have some things in place already to help with this. We lack UI of course because everyone hates to write Forms, but we will get there. I will enumerate my idea:
Every Archipelago Digital Object (means any Node bearing a Strawberryfield) will/can have an ACL.
In any case, the body, the ACLs are written and defined in JSON. Means the same logic we use for loading/reading/formatting/parsing and making accessible (and yes, we could even use them in Solr as we do with metadata now)
We will use something named Entity Access Grants (actually Node Access Grants). The cool thing is that they are way more powerful than hookable permissions or fixed/in-code permissions. The latter only trigger on/per Route (meaning you won't get the effect you want on, e.g., a View, only on the canonical URL of a Node; for context, in our case at /do/uuid). We use relationships and autocompletes in many places (like collection membership or any other node-to-node relationship you can add/edit via some of our webform elements), and we need those decisions of access and visibility to apply there too. So this is how grants will work
Finally, make an UI/UX to build an ACL.
I also want some global configs here. Like which properties (ismemberof, partOf, relatedTo) are considered for ACL inheritance. This also allows us to have ACL-only properties, like 'inheritsACL'. Not bad? But i also don't know if i would set that as default; it feels like reusing semantic ones could make more sense.
Caching here is quite important, since a deep nested evaluation will be expensive, and some taxonomy based systems (which are a differently labeled RBAC implementations) already make accessing a Node super slow and also, extremely deployment specific.
Also related to this post (crickets, i know, but i got some shy private messages... gosh) https://groups.google.com/forum/#!topic/archipelago-commons/MQHUxU3_9wA
@giancarlobi ping! @rosiel hopefully my mention here does not add stress to your work. My intent is the opposite: i feel there could be some ideas here you could find useful for your community project work, but also i highly respect and appreciate your comments and feedback. If you feel this is not your thing and you don't want any mentions i can edit the post and remove those. In any case thanks!
Side note: This work goes into Beta3. I already started with the Node access grants and a quite simplistic demo ACL (for now fixed during testing) until we agree where the ACLs will be saved and what cool operations we want to allow.
Additional resources: This is a good example of how S3 uses ACLs. Since we do S3 everywhere and Min.io uses them too, we can also reuse learning curves. Please look at the JSON examples
"Resource":["arn:aws:s3:::examplebucket/*"],
Just a silly error. I was using self:: to reference the $priority each Subscriber class derived from /src/EventSubscriber/StrawberryfieldEventPresaveSubscriber.php was given. self on static properties does not late bind, which means i was always using the original priority defined in the abstract class instead of the one of the derived class. This really did not affect anything here, but once i started getting more picky about the order and deriving in other modules i found this problem. Sad!
Solution: use static::
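The difference is easy to reproduce in isolation. The class names and priority values below are made up for the demonstration; only the self:: versus static:: behavior is the point.

```php
<?php
// Illustrative classes; not the real strawberryfield subscribers.
abstract class BaseSubscriber {
  protected static $priority = 0;

  public static function priorityViaSelf(): int {
    // self:: always resolves against the class where the method is defined.
    return self::$priority;
  }

  public static function priorityViaStatic(): int {
    // static:: uses late static binding and resolves against the called class.
    return static::$priority;
  }
}

class DerivedSubscriber extends BaseSubscriber {
  // Redeclared static property: the derived class gets its own value.
  protected static $priority = -700;
}

$viaSelf = DerivedSubscriber::priorityViaSelf();
$viaStatic = DerivedSubscriber::priorityViaStatic();
```

With self:: the derived class silently reports the base priority; with static:: it reports its own.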
See https://github.com/esmero/strawberryfield/search?q=json_encode+OR+json_decode&type=
We do a lot of JSON decoding and encoding because, well, Strawberryfield is JSON. So, what happens if you depend on this, you decide a depth setting of 10 is correct (for memory/performance, and because at my age you feel you are smart) and then you code a thing that IMPORTS multi-hierarchical EAD V3 into strawberryfield? You break the logic. Yes! And you cry.
So. There are a few ways of going about this:
Remember kids that JSON field can hold the shy number of 2 Gbytes of RAW JSON.
This requires:
A) An Advanced settings Form
B) An Alert that should happen either on Pre Save or on a Webform
C) a Failsafe, means a lot of other initialized arguments
In the meantime, for this edge use case i'm adding an extra 40 levels of depth, which makes 50, until we get this other larger work done (larger in the sense of more than a few minutes).
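The failure mode is simple to demonstrate with plain json_decode(): when the document nests deeper than the depth argument, the call returns NULL and json_last_error() reports JSON_ERROR_DEPTH. The nesting of 12 levels below is an arbitrary value picked for the demonstration.

```php
<?php
// Build a JSON document nested 12 levels deep (arbitrary demo value).
$nested = 'leaf';
for ($i = 0; $i < 12; $i++) {
  $nested = ['child' => $nested];
}
$json = json_encode($nested);

// Decoding with a depth of 10 fails silently: NULL plus an error code.
$tooShallow = json_decode($json, true, 10);
$shallowError = json_last_error();

// Decoding with a depth of 50 succeeds.
$deepEnough = json_decode($json, true, 50);
$deepError = json_last_error();
```

Since the failure is silent unless the error code is checked (or JSON_THROW_ON_ERROR is passed), an alert on pre save seems like a reasonable safety net.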
We want to allow users to hide the Node Title widget/Field during ingest/edit completely and use, as source for the required node->title a value coming from SBF.
To accomplish so, and following our own way of coding we want to generate an Event Pre Save Subscriber (extending one of our base classes) that in the absence of such Node property sets its value from the SBF metadata. Drupal, strangely enough allows a Title field to be hidden in a Form Mode, but can not handle a Node CRUD operation if the value is not there (White screen with a Constraint error).
This subscriber is quite basic. To be stronger, it will require additional logic coming from the webform_strawberryfield module that, in case the title is hidden but set (e.g. on edit), unsets the value so this Subscriber can override it. But that is another ISSUE:
Side note: Drupal's Content Entity Labels/Titles are limited to 128 Characters, quite not suited for metadata, so we will apply a truncating function. Still, the SBF, longer and richer Title will be available and can be used for Solr Indexing and display if wanted.
Everybody knows and loves breadcrumbs. But hey. In the semantic, linked data world they are like looking at the whole reality through a keyhole. The simple use case everyone knows is Collection/member. But that is not all. E.g., an ADO can be part of two collections, or can be connected to other objects via isrelatedto, partof, sameas, etc. All those things are relevant when thinking about a breadcrumb that is useful. People should be able to check which ones they want in the breadcrumb.
So what i want?
Breadcrumbs that read from a list of JSON keys (you define which keys, we could even tag our JMESPATHs that are referencing NODES as such), then fast traverse (i'm quite good at traversing graphs) and accumulate not a single hierarchy but a list of parents. And so on. Direction of the relationship is important. We can do this via direct entityQueries (slow but consistent) or we can use Solr to drive this, fast but requires setting each predicate that people want as Fields, so more config.
UI is tricky. How to draw this tree (it's not a cyclic graph) in a way that still makes sense to people? Maybe show only one path by default and expand, via JS, to more (if there are more) on hover over any node in this tree? Maybe color coded or prefixed with a Unicode character? Having this would be lovely.
Just a way of helping people and myself to ingest/patch NODES faster via the JSONAPI.
I will add a drush command that allows an arbitrary path/wildcard/filename to be passed as attachment, a JSON file to be passed as payload, and credentials. The drush command will run all the needed JSONAPI calls, double encode, fetch responses and so on, so that ingesting new objects via the API is simpler and less convoluted.
Well yes, it is a good piece of complex software. But it is not perfect. Hold my beer here:
The story is this, when building/perfecting your Solr Search Driven Views you may come into realization that you want to use "Wildcards" for a filter
Here is one example:
i want to exclude a certain Field. Means only return Nodes where the field is not present. It's not about the value, it's about if it's there or not. Solr does not keep a 1:1 field count for every Document.
Under that scenario let's say we want this:
fq[0]=-mimetype:[* TO *]
or the newer fq[0]=-mimetype:*
if single value/Solr7/8
So what does Drupal View Conditional Filters do?
fq[0]=-mimetype:["*" TO "*"]
or the newer fq[0]=-mimetype:"*"
if single value/Solr7/8
Look closer (while holding my beer): do you see the double quotes? Yes. It double quotes everything. But in Solr * is a totally different thing than "*", and so our Repo/existing use case is gone.
This is a debug
So.. how do we go on fixing this?
*, or a [ something TO *], or a string value followed by a *, and after removing the * we do not need to escape anything (meaning the user was smart enough to escape), then we assume the * really means * (as it is supposed to; who searches for * as a string anyway? Who keeps a repo of stars and asterisks that need to be string matched...!!!) OR *)
@giancarlobi @alliomeria any other user lurking that has an opinion?
Also. great to have repositories with lots of data already running! (even if a few can be shared) because this could have not been noticed if not!
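One way to read the rule discussed above is: leave bare wildcards, ranges and simple prefix searches unquoted, and quote everything else as a literal. The sketch below is only that reading. The function name is hypothetical, the set of characters considered "safe" is an assumption, and this is not how Search API Solr builds its filter queries.

```php
<?php
// Hypothetical sketch: decide whether a Views filter value should be passed
// to Solr as-is (wildcard/range) or quoted as a literal phrase.
function solrFilterValue(string $value): string {
  $value = trim($value);
  // A lone * means "any value".
  if ($value === '*') {
    return $value;
  }
  // Range such as [* TO *] or [2019 TO *].
  if (preg_match('/^\[\s*\S+\s+TO\s+\S+\s*\]$/', $value) === 1) {
    return $value;
  }
  // Prefix search: "something*" where the stem needs no escaping
  // (assumed safe set: letters, digits, underscore, dot, hyphen).
  if (substr($value, -1) === '*') {
    $stem = substr($value, 0, -1);
    if ($stem !== '' && preg_match('/^[A-Za-z0-9_.\-]+$/', $stem) === 1) {
      return $value;
    }
  }
  // Everything else is treated as a literal and quoted.
  return '"' . addcslashes($value, '"\\') . '"';
}

$filter = '-mimetype:' . solrFilterValue('[* TO *]');
```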
We will need this to allow other parts of Archipelago to read JSONPaths provided by our Vocabulary builder as input for QA/Find and Replace and Properties exposure via JSON Key Name Providers
It should hopefully allow to also setup only "leaf" elements of the whole vocabulary hierarchy if needed
Related to #55
Seems like the title setter has some edge conditions: when the element in the webform that sets the title is hidden but the main one is not, since on a new Element we are forcing the setting of a new title, we end up with our default one, which is generic. Bad.
Fix is simple, just add some checks. I'm also moving from $entity->getTitle() to $entity->label(). Just in case. Will make the branch and test around a little bit.
@Favenzio @bryjbrown i started serious research on this last week. Since the Solr version proof of concept of this is working and the new JSON flattener options are quite appealing (and can also be tested using this GIST), the next natural step, and the one you are expecting, is to mix traditional Views exposed fields and Strawberryfield internal properties coming from JSON.
So, here is the way (TUT) and it is as Drupal as it gets. So no fear!
https://www.lullabot.com/articles/building-views-query-plugins-for-drupal-8-part-2
It is very simple code, but requires some testing and debugging and also some SQL magic, means basically i need you/me to explore this with real data
The ideas are
I will probably take over this after DLF2018 but if you are up to experimenting a bit, please feel free to share your ideas and thoughts.
This is totally not urgent since the Solr Views implementation is working fine and in 70% of the cases it will be faster, but this approach has the benefit of the hierarchies. Means we could simply add an argument to the field that is in the shape of a property path like [@graph.*.name] and then join with others inside the same JSON strawberryfield value.
After my incursion on deploying a test IR with a lot of data, files, different media and IR needs of course this week (went well, so nice, learned a lot) i decided its time to bring some extra logic into our JSON Key providers
FYI: if you don't know what a JSON Key provider is that is Ok, its a plugin system i wrote that allows to dynamically expose internal data, keys and values from our SBF JSON to Drupal in a native to Drupal way. Which allows Drupal to index into Solr or expose to any other code like Tokens, all our deep, complex and evolving and changing JSON richness. And we have a few cool strategies, from simply "take this json KEY and put the value visible under this property" to query the JSON using JMESPATH and join many values from different places. OK, enough background (also ping to @aliomeria here, new in the block, time to subscribe to this repo)
Things i want
1.- Parser/logic processor. Basically one that allows data to be extracted via logic. and returned as an arbitrary key. Why?
Let's say i have LoD People in my metadata. A lot of them. Some have different roles: some are students, others are Faculty, others are from a different place/institution. I want to have different facets so people can search/filter by Professors, or students only. With an extra processor (Twig template again, but stricter and shorter; i can even limit the size of the template) i can make some decisions, and even do things like "Oh no, no student mentioned in the works, let's add an extra value that says 'No student was involved nor harmed'" to the facet. Data that was never there; we just expose it to the discovery. The archipelago dream come true. This code is actually simple.
2.- A chameleon processor. Which allows me to take one REAL Drupal field class (let's say it's the GEO one) and shove data from our JSON into it programmatically and, wait for it, also shove the "complex data" type into the code programmatically. This allows us to make Drupal think we have data coming from one of those fields and makes community contributed code work with our chameleons. This is actually simpler than you think, since instead of making a JSON Key processor, i can create a Copyfield processor at the entity level. An issue i see sometimes in Drupal 8/9 is that most of the code people write is totally not aware of computed fields. I had to fix a few quite popular modules because everything is made only for the most common use case. Bad, bad coding.
See also #6 for my 3. Entity casting/reference Fields. We use open semantic here, we want every memberof, ispartof, etc, if they have either an ID or an UUID to be casted as Drupal entities. That way we can create deeper hierarchies and index the full paths into Solr.
@giancarlobi hope you around and all is well. Any ideas on this?
All our ADOs are using yoursite.edu/do/uuid as canonical path to access instead of the non sense /node/1 thing. Right now that is being done via Aliases. We need and we want UUIDs , right now it seems easier to have path alias programmed to do that. We do it and it works. But! Path alias does not expand, existing aliased paths for subpaths.
So we need a final solution (never final, i know) that deals with any Content Entity under the /do idea, because things like /do/uuid/metadata/iiifmanifest do not resolve automatically. Since our metadata exposed endpoints are dynamic (good!) we want to access them also via the uuid path.
I opened this originally inside here esmero/format_strawberryfield#25
but now i'm clear that the right way of doing this is to create a Resolver Class/event subscriber that, based on certain conditions, upcasts UUIDs and converts them into Loaded Entities. Something in the tone of https://git.drupalcode.org/project/jsonapi/blob/8.x-2.x/src/ParamConverter/EntityUuidConverter.php but simpler, just for us.
Working on that. Will take me more days but on the right track.
When going through the process of creating a content type with a strawberry field you get an error when saving the field settings "The website encountered an unexpected error. Please try again later." The Apache error log shows the following:
AH01071: Got error 'PHP message: Uncaught PHP Exception Symfony\Component\DependencyInjection\Exception\ServiceNotFoundException: "You have requested a non-existent service "serializer"." at /var/www/drupalvm/drupal/web/core/lib/Drupal/Component/DependencyInjection/Container.php line 151\n', referer: http://drupalvm.test/admin/structure/types/manage/digital_object/fields/node.digital_object.field_strawberry_field_test
This method here is in charge of getting one SBFlavor Datasource ID and actually loading it and returning it so that the Solr Index Tracker can push it.
e.g. given this ID:
"strawberryfield_flavor_datasource/2006:1:en:ac7fc929-a45e-4abc-9ff4-a6c35ec16c2f:ocr"
The first part is the datasource, strawberryfield_flavor_datasource. The second part, after the '/', is the info:
- 2006: the originating NODE ID
- 1: the sequence; can be a Page Number, order Number, etc. Logic defines how that is interpreted by a consumer.
- en: the language
- ac7fc929-a45e-4abc-9ff4-a6c35ec16c2f: the uuid of the originating File entity
- ocr: the processor label that created this index entry
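As an illustration of the ID layout only (not the actual loading method), a defensive parser could look like the sketch below. The function name and the returned array keys are hypothetical; the point is to return NULL on anything malformed instead of assuming every piece is present.

```php
<?php
// Hypothetical parser for the item ID layout described above:
// datasource/nodeid:sequence:language:fileuuid:processor
function parseFlavorItemId(string $itemId): ?array {
  $parts = explode('/', $itemId, 2);
  if (count($parts) !== 2) {
    return null;
  }
  [$datasource, $rawId] = $parts;
  $pieces = explode(':', $rawId);
  if (count($pieces) !== 5) {
    return null;
  }
  [$nodeId, $sequence, $language, $fileUuid, $processor] = $pieces;
  // Node ID and sequence are expected to be non-negative integers.
  if (!ctype_digit($nodeId) || !ctype_digit($sequence)) {
    return null;
  }
  return [
    'datasource' => $datasource,
    'node_id' => (int) $nodeId,
    'sequence' => (int) $sequence,
    'language' => $language,
    'file_uuid' => $fileUuid,
    'processor' => $processor,
  ];
}

$parsed = parseFlavorItemId('strawberryfield_flavor_datasource/2006:1:en:ac7fc929-a45e-4abc-9ff4-a6c35ec16c2f:ocr');
```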
For a given set of pushed Strawberry Flavor Datasource Items I want to index
Fix is simple but I would love to see also more work on this custom Datasource and make it more failsafe. We may need to push into the Solr index the Processor Plugin ID, the Processor Label and also a differentiated body (means HOCR will push the miniOCR into the special OCR field and to make sure all matches, that field should be named same as the Processor plugin id (in this case )
* @StrawberryRunnersPostProcessor(
* id = "ocr",
or the Binary, or the Warc one, etc.
Fix coming because all my goodness, I can not be so silly and code like this!
@giancarlobi @alliomeria please forgive me!
What is the current setup/configuration for OCFL?
What are the plans for OCFL?
Actually it's quite robust, but while upgrading an old archipelago (0.9 ALPHA) to Beta2 i found some things i did not like (and i used to like or felt i could push into the future, but the future is here!) and i feel could be better:
Right now the as: structure generator is being triggered by the webform handler. That kinda defeats the purpose of all the other Event Driven Subscribers we have:
So what do we need?
We already generate this structure in the JSON
"ap:entitymapping": {
"entity:file": [
"images",
"documents",
"audios",
"videos",
"models"
]
}
Means we know which keys will contain file ids.
Pseudo logic i imagine is:
Get all ids from every JSON KEY element in "entity:file"
Get all ids from every JSON KEY element before saving (which will be in $entity->original)
Call something quite similar to this
https://github.com/esmero/webform_strawberryfield/blob/8.x-1.0-beta2/src/Plugin/WebformHandler/strawberryFieldharvester.php#L432
Generate the as:structure
Rest is the same:
By getting also the ids before saving, we can remove usage (just -1 one) of files no longer used by this particular ADO at the same time. That code does not exist yet.
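The before/after comparison in the pseudo logic above boils down to two set differences. A minimal sketch on plain decoded JSON arrays follows; the function names are hypothetical and the real code would work on $entity and $entity->original and then call Drupal's file usage service, which is not shown here.

```php
<?php
// Hypothetical helper: collect file ids from every key listed under
// ap:entitymapping -> entity:file.
function collectFileIds(array $json): array {
  $ids = [];
  foreach ($json['ap:entitymapping']['entity:file'] ?? [] as $key) {
    foreach ((array) ($json[$key] ?? []) as $id) {
      if (is_numeric($id)) {
        $ids[] = (int) $id;
      }
    }
  }
  return array_values(array_unique($ids));
}

// Compare the JSON before and after saving to find usage changes.
function diffFileUsage(array $original, array $current): array {
  $before = collectFileIds($original);
  $after = collectFileIds($current);
  return [
    // Files that need a new usage record.
    'added' => array_values(array_diff($after, $before)),
    // Files whose usage by this ADO can be decremented by one.
    'removed' => array_values(array_diff($before, $after)),
  ];
}

// Made-up sample data, only for illustration.
$mapping = ['entity:file' => ['images', 'documents']];
$original = ['ap:entitymapping' => $mapping, 'images' => [1, 2], 'documents' => [3]];
$current = ['ap:entitymapping' => $mapping, 'images' => [2, 4], 'documents' => []];
$usage = diffFileUsage($original, $current);
```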
as:filetype structure generation better and file persister way more failsafe
Some things i don't like:
The filer persister service https://github.com/esmero/strawberryfield/blob/8.x-1.0-beta2/src/StrawberryfieldFilePersisterService.php#L155 here generates a new destination URI, normalized by us, for files that have no as structure yet. This has some benefits but can also be weird.
as:image etc from the JSON. Also good and fast for new Files and Objects.
url json key of each file inside an as:something structure.
url key back the real path, but we keep ours. MEANS: we end up with a wrong url in the JSON, one that does not match where the file is actually saved. Bad. IF we remove the check for temporary, then the file would actually be moved to the new URL (expensive but consistent), but then we would still have to deal with the fact that it could be in use by another ADO, and in that case moving would be a bad idea.
All this has a lot of iterations, etc. It's not a terrible thing, i mean 2000 files? foreach 2000 times. You won't have 1000 users ingesting at the same time, so performance is still OK. But i would like to save some CPU cycles and group expensive operations together. I also would like to make some of the entity queries happen on sets smaller than that. All that optimization would/needs to happen in the file persister service https://github.com/esmero/strawberryfield/blob/8.x-1.0-beta2/src/StrawberryfieldFilePersisterService.php
That is where we fetch all files, load them, classify them and route them into their as:structures, we read from what is already there to avoid md5 on files, etc.
Also, we could move logic that is for too many objects into a last pass on entity save (could be a batch process) and the threshold could be variable based on file size.
This code is complex (i have to admit i felt some WOW + pity for myself while reading it).
@giancarlobi @marlo-longley pinging you since this is going to be good. Will require a lot of saving, messing up and writing some unit test too.
Related to #37
Currently we deposit only one version of the Object. And we deposit RAW JSON and a full entity serialization.
Let's deposit every revision (and add a setting where people can decide if they want to delete these files when removing ADOs or not). It's just a matter of using \Drupal\strawberryfield\EventSubscriber\StrawberryfieldEventInsertSubscriberDepositDO in a class that extends \Drupal\strawberryfield\EventSubscriber\StrawberryfieldEventNewrevisionSubscriber and adding the Revision id to the file name. As simple as that.
But most important, use the JSON:API extras approach to call the JSON:API serializer directly. See
https://git.drupalcode.org/project/jsonapi_extras/blob/8.x-3.x/src/EntityToJsonApi.php
Basically its just a wrapper, but a handy one that returns a RAW string. We can use that module (and make SBF depend on JSON:API and this one too, because, well, we like it and see that D9/10 will go that route fully also)
This is a super easy task, in case someone wants it.
Right now we can configure ADO by Type to map onto View Modes in Drupal.
See: /admin/config/archipelago/viewmode_mapping in an Archipelago instance.
The Solr field for Type is currently hardcoded.
See here.
Tasks:
ViewModeMappingSettingsForm to draw from this config, rather than a hardcoded value.
I need to think about naming for this. It's kind of clunky. I started using "Type to Solr Field" for the name of the form, which probably isn't clear.
I installed Strawberry Field Module in a brand new clean Drupal installation. It enabled without a problem but, when I tried to create a content type with a Strawberry field, an error was thrown:
"There was a problem creating field test_strawberry_field: Exception thrown while performing a schema update. SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'JSON NOT NULL, PRIMARY KEY (entity_id, deleted, delta, langcode), INDE' at line 8:
CREATE TABLE {node__field_test_strawberry_field} (
bundle VARCHAR(128) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'The field instance bundle to which this row belongs, used when deleting a field instance',
deleted TINYINT NOT NULL DEFAULT 0 COMMENT 'A boolean indicating whether this data item has been deleted',
entity_id INT unsigned NOT NULL COMMENT 'The entity id this data is attached to',
revision_id INT unsigned NOT NULL COMMENT 'The entity revision id this data is attached to',
langcode VARCHAR(32) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'The language code for this data item.',
delta INT unsigned NOT NULL COMMENT 'The sequence number for this data item, used for multi-value fields',
field_test_strawberry_field_value JSON NOT NULL,
PRIMARY KEY (entity_id, deleted, delta, langcode),
INDEX bundle (bundle),
INDEX revision_id (revision_id)
) ENGINE = InnoDB DEFAULT CHARACTER SET utf8mb4 COMMENT 'Data storage for node field field_test_strawberry_field.'; Array ( ) "
We dispatch all kinds of cool events so other modules, or this one, can react, modify and do things based on what happened, but we have been waiting for stability, and for me to come back from vacation (the wait is over!), to actually start removing Drupal tracking of referenced files (files referenced inside a Strawberry field JSON). For those who don't know what that is: we track each file's usage inside a JSON via Drupal 8's file tracking capabilities. That way the same file can be used in many places (try the "clone" option for any digital object and you will see!) but also nobody can delete it as long as it is in use (safe, safe). Now we want to remove tracking when a Digital Object gets purged. Once a file is not tracked anymore, the next cron run will get rid of it.
Simple as always. We extend one of our abstract classes that knows how to react to that delete event, we pass the deleted entity in the event, we check if there are files tracked for it, we remove them, then we do additional cleanup in case someone else is falsely tracking it too (Drupal's file managing is quite basic and prone to failure so we have to be double strict and double safe) and if there are false positives, we also remove those.
Pull coming.
Right now, if a user role doesn't have permissions for the default form mode, but does have access to another form mode, the Edit tab is not displayed when viewing a node.
For example, in a recent project, a "Contributor" role has access to the "Contributor" form mode, but not default. They can only see the "Delete" tab, but not "Edit" above a node.
These tabs above the node are called local tasks in Drupal. See here:
https://www.drupal.org/docs/8/api/menu-api/providing-module-defined-local-tasks
Will create the following YML file: strawberryfield.links.task.yml
and add code targeting nodes.
Another option is to use strawberryfield_menu_local_tasks_alter.
Automatic Vocabulary generation is (in my opinion) the coolest (++factor) feature we have and is becoming almost 2 years old already. But, as cool as it is, we have not given it too much re-use across the stack.
Today, while playing with EAD V3 import (XML to JSON) via that new Widget I wrote, I found myself producing this vocabulary:
Which, OK, makes sense, but strictly speaking it is not "our vocabulary" but a particular one from a particular ingest, and we could have quite a lot of different schemas. This also applies to EXIF.
So the question is: do we add a form/setting so certain KEYS become excluded from the vocabulary and (hear me out here) also from the JSON KEY flattener? The one that would use too much memory to be useful if this goes too deep? I could exclude all the flv: prefixed vocabs, since EXIF tags are not THAT useful really in a vocab.
I know @giancarlobi understands how this works. I wonder if @alliomeria knows this/has seen this vocab builder in the Archipelagos that are accessible to its users and has an opinion?
Ideas? Opinion? Questions?
If you ever had time to read through my Roadmap feature list you will probably have noticed that I have planned ACL (access control list) integration on our Digital Objects in Archipelago. ACL is fundamental for our IR needs too. And it goes way beyond just nodes, covering files as well (yeah, let's not speak about media here, since we use file entities).
What Drupal provides with its users and per-role permissions is closer to something named RBAC (role based access control), but we need more: fine-grained rules that are simpler to apply UI/UX wise, but also inheritance (and the ability to control how inheritance works and when it applies) that does not imply a batch operation to copy permissions from parent to child as e.g. Islandora 7 does. That is expensive and not in our spirit.
I read this good post from @rosiel today during my coffee and I feel it speaks a little bit about how complex a permission system and its UI/UX can be in a hierarchical structure/Drupal, and about how user expectations differ. I wish I had that type of feedback here! So I will humbly borrow that post to get started.
Thanks to our own 'ahead' planning we have some things in place already to help with this. We lack UI of course, because everyone hates to write Forms, but we will get there. I will enumerate my ideas.
Every Archipelago Digital Object (means any Node bearing a Strawberryfield) will/can have an ACL.
In any case, the body of the ACLs is written and defined in JSON. That means the same logic we use for loading/reading/formatting/parsing metadata applies to making them accessible (and yes, we could even use them in Solr as we do with metadata now).
We will use something named Entity Access Grants (actually Node Access Grants). The cool thing is that they are way more powerful than hookable permissions or fixed/in-code permissions. The latter only trigger on/per Route (meaning you won't get the effect you want on, e.g., a View, only on the canonical URL of a Node, for context in our case at /do/uuid). We use relationships and autocompletes in many places (like collection membership or any other node-to-node relationship you can add/edit via some of our webform elements), and we need those access and visibility decisions to apply there too. So this is how grants will work
Finally, make an UI/UX to build an ACL.
I also want some global configs here, like which properties (ismemberof, partOf, relatedTo) are considered for ACL inheritance. This also allows us to have properties specific to ACL only, like 'inheritsACL'. Not bad? But I also don't know if I would set that as default; it feels like reusing semantic ones could make more sense.
Caching here is quite important, since a deeply nested evaluation will be expensive, and some taxonomy based systems (which are differently labeled RBAC implementations) already make accessing a Node super slow and also extremely deployment specific.
Also related to this post (crickets, I know, but I got some shy private messages... gosh) https://groups.google.com/forum/#!topic/archipelago-commons/MQHUxU3_9wA
@giancarlobi ping! @rosiel hopefully my mention here does not add stress to your work. My intent is the opposite: I feel there could be some ideas here you could find useful for your community project work, but also I highly respect and appreciate your comments and feedback. If you feel this is not your thing and you don't want any mentions, I can edit the post and remove those. In any case, thanks!
Side note: This work goes into Beta3. I already started with the Node access grants and a quite simplistic demo ACL (for now fixed during testing) until we agree where the ACLs will be saved and what cool operations we want to allow.
Additional resources: This is a good example of how S3 uses ACLs. Since we do S3 everywhere and Min.io uses them too, we can also reuse learning curves. Please look at the JSON examples
"Resource":["arn:aws:s3:::examplebucket/*"],
When dealing with super large JSON documents, memory can be an issue, especially when decoding into an array.
We should do some deep testing to find out our sweetspot with SBF data and external sources and see how we can use
https://github.com/salsify/jsonstreamingparser
to get around some limitations on smaller/low end servers.
There is also this package https://github.com/halaxa/json-machine worth looking at, it claims never running out of memory!
See #63 and esmero/format_strawberryfield#43
Move Formatter methods for ordering Arrays using a 'sequence' key to our JSON helper class.
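A minimal sketch of what such a shared helper could look like, assuming each item is an associative array that may carry an integer 'sequence' key. The function name is hypothetical and not taken from the module; items without the key are pushed to the end and ties keep their input order.

```php
<?php
/**
 * Hypothetical helper: orders a list of arrays by their 'sequence' key.
 * Items without the key sort last; ties keep their original order.
 */
function sortBySequenceKey(array $items, string $key = 'sequence'): array {
  // Remember each original position so the result is stable everywhere.
  $indexed = [];
  $position = 0;
  foreach ($items as $id => $item) {
    $indexed[] = [$id, $item, $position++];
  }
  usort($indexed, function (array $a, array $b) use ($key) {
    $seqA = isset($a[1][$key]) ? (int) $a[1][$key] : PHP_INT_MAX;
    $seqB = isset($b[1][$key]) ? (int) $b[1][$key] : PHP_INT_MAX;
    // Compare by sequence first, then by original position.
    return [$seqA, $a[2]] <=> [$seqB, $b[2]];
  });
  // Rebuild the array preserving the original keys in the new order.
  $sorted = [];
  foreach ($indexed as [$id, $item]) {
    $sorted[$id] = $item;
  }
  return $sorted;
}
```

Keeping the original keys means a formatter can still look items up by file id after sorting.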
What the issue says. JSON (this is metadata diego boy) needs to be properly escaped before curl can use it as --data. It was working like 98% of the time except for that edge case... and here we are. Delaying release a Monday at 22:42. Get a life!
@giancarlobi this is the last piece of this and I will assume beta3 is done. All works, ingest of archipelago-recycables via DRUSH works and I have a TON of configs to share for deployment. On it.
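For illustration, one way to escape a JSON payload before handing it to curl, assuming the command is assembled in PHP on a Unix-like shell. The function name and the idea of building the command as a string are assumptions made for this sketch, not the actual fix that was merged.

```php
<?php
/**
 * Builds a curl command line that posts JSON metadata as --data.
 * Illustrative only: the function name is not part of the module.
 */
function buildCurlJsonCommand(string $url, array $metadata): string {
  // Default json_encode() flags escape non-ASCII as \uXXXX, so the payload
  // stays plain ASCII and escapeshellarg() cannot strip any characters.
  $json = json_encode($metadata);
  if ($json === false) {
    throw new InvalidArgumentException('Metadata could not be encoded: ' . json_last_error_msg());
  }
  // escapeshellarg() quotes the whole payload so quotes, spaces and dollar
  // signs inside the metadata are not interpreted by the shell.
  return sprintf(
    'curl -X POST -H %s --data %s %s',
    escapeshellarg('Content-Type: application/json'),
    escapeshellarg($json),
    escapeshellarg($url)
  );
}
```

The edge case that usually bites is a single quote inside a metadata value, which ends a hand-written '...' argument early; escapeshellarg() takes care of that.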
When outputting a strawberry field's raw JSON content via a REST view, or when using a Metadata Display based on Twig for the same type of field, Drupal double encodes and normalizes the content. This is OK if the expected output is the whole, double encoded value (for sharing?) but limits us in building new and exciting apps, like OAI-PMH, or even a simple IIIF manifest using views.
Drupal serializers are only used to simple text values or typed data item lists, but strawberry field, even though it can expose that type of data through its multiple properties, also has a single ->value element containing a value that is already in JSON format. To be true to the D8 serialization workflow, we need to add a new one that allows a passthrough and some mangling for our already-JSON value. We also need a way of totally bypassing serialization if we want to allow field formatters to do that for us, which gives us huge flexibility, like building full nested responses in any shape we want instead of depending on D8's perception of what data should look like.
1.- Handle strawberry fields normalizing as a new service attached to our class.
See https://www.drupal.org/docs/8/api/serialization-api/changing-the-way-serializer-handles-entities
2.- Allow a JSON passthrough serializer/normalizer and probably a new views display plugin extending RestExport, able to deliver our rendered (via formatter) field without any interference. The idea here is to allow our field formatters to decide on the desired format and the exposed HTTP Content-Type header.
By doing so we can truly extend D8's data exposing capabilities without coding.
Import existing XML metadata (EAD, MODS, etc) into a native JSON format for strawberryfield. This can be handy when dealing with external sources of migrations where we want to maintain existing data/schemas but cast into a more general JSON format to allow our webform system (https://github.com/esmero/webform_strawberryfield) to handle further editing/creation.
Given a simple XML like
<?xml version='1.0' standalone='yes'?>
<archdesc localtype="inventory" level="subgrp">
<did>
<head>Overview of the Records</head>
<repository label="Repository:">
<corpname>
<part>Minnesota Historical Society</part>
</corpname>
</repository>
<origination label="Creator:">
<corpname>
<part>Minnesota. Game and Fish Department</part>
</corpname>
</origination>
<unittitle label="Title:">Game laws violation records,</unittitle>
<unitdate label="Dates:">1908-1928</unitdate>
<abstract label="Abstract:">Records of prosecutions for and seizures of property resulting from violation of the state's hunting and fishing laws.</abstract>
<physdesc label="Quantity:">2.25 cu. ft. (7 v. and 1 folder in 3 boxes)</physdesc>
<physloc label="Location:">See Detailed Description section for box location</physloc>
</did>
</archdesc>
A PHP snippet of code like
$xml = simplexml_load_string($ead);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
Would easily deal with XML to JSON and, if needed, to Array casting.
But:
For XML elements that have both @attributes and text values, the JSON serializer discards the attributes entirely, ending in an array like
[unittitle] => Game laws violation records,
[unitdate] => 1908-1928
Deal with JSON serialization the same way JSON-LD does, using the @value key for the actual text value and a custom @attribute key, or even a @type key with a mapping @context that helps bring non-semantic elements coming from an XML schema into a local context.
This implies:
1.- Build a decorator class for the JSON Serialization
2.- Subclass Simple XML Element Class
3.- Build a Composer aware PHP Library we can include in Strawberryfield
This is a great way of dealing with XML and integrating our own code. This would allow us to also accommodate files already processed by other systems (migrate) or even be fed by external APIs and then cast via Twig to visualizations, index in our Solr, etc.
/**
* Class JsonLDSimpleXMLElementDecorator
*
* Implement JsonSerializable for SimpleXMLElement as a Decorator with JSON-LD syntax
*/
class JsonLDSimpleXMLElementDecorator implements JsonSerializable
{
const DEF_DEPTH = 512;
private $options = ['@attributes' => TRUE, '@value' => TRUE, 'depth' => self::DEF_DEPTH];
/**
* @var SimpleXMLElement
*/
private $subject;
public function __construct(SimpleXMLElement $element, $useAttributes = TRUE, $useValue = TRUE, $depth = self::DEF_DEPTH) {
$this->subject = $element;
if (!is_null($useAttributes)) {
$this->useAttributes($useAttributes);
}
if (!is_null($useValue)) {
$this->useValue($useValue);
}
if (!is_null($depth)) {
$this->setDepth($depth);
}
}
public function useAttributes($bool) {
$this->options['@attributes'] = (bool)$bool;
}
public function useValue($bool) {
$this->options['@value'] = (bool)$bool;
}
public function setDepth($depth) {
$this->options['depth'] = (int)max(0, $depth);
}
/**
* Specify data which should be serialized to JSON
*
* @return mixed data which can be serialized by json_encode.
*/
#[\ReturnTypeWillChange]
public function jsonSerialize() {
$subject = $this->subject;
$array = array();
// json encode attributes if any.
if ($this->options['@attributes']) {
if ($attributes = $subject->attributes()) {
$array['@attributes'] = array_map('strval', iterator_to_array($attributes));
}
}
// traverse into children if applicable
$children = $subject;
$this->options = (array)$this->options;
$depth = $this->options['depth'] - 1;
if ($depth <= 0) {
$children = [];
}
// json encode child elements if any. group on duplicate names as an array.
foreach ($children as $name => $element) {
/* @var SimpleXMLElement $element */
$decorator = new self($element);
$decorator->options = ['depth' => $depth] + $this->options;
if (isset($array[$name])) {
if (!is_array($array[$name])) {
$array[$name] = [$array[$name]];
}
$array[$name][] = $decorator;
} else {
$array[$name] = $decorator;
}
}
// json encode non-whitespace element simplexml text values.
$text = trim($subject);
if (strlen($text)) {
if ($array) {
$this->options['@value'] && $array['@value'] = $text;
} else {
$array = $text;
}
}
// return empty elements as NULL (self-closing or empty tags)
if (!$array) {
$array = NULL;
}
return $array;
}
}
Use would be
$xml = new SimpleXMLElement($ead);
$xml = new JsonLDSimpleXMLElementDecorator($xml, TRUE, TRUE, 3);
echo json_encode($xml, JSON_PRETTY_PRINT), "\n";
This code is adapted (only a few lines changed, really) from https://hakre.wordpress.com/2013/07/10/simplexml-and-json-encode-in-php-part-iii-and-end/ and it's pretty cool!
This will require that form elements allow reading and writing the @attribute element, which can be generalized by the use of the custom JSON properties each Webform element can/could have.
At our NYC summit we came up with a set of requirements for Strawberry Field:
At some point we will need to formalize these requirements and create documentation for module usage.
This is a continuation of #86 and #87 which was merged.
Right now we are just getting general PDFinfo (single first page), which means in our metadata we only keep the number of pages (good) and, if at all, a single page dimension. Not cool for rare books or complex displays in general, and too simplistic, to be honest, when generating an IIIF Manifest that we want to work in Mirador and the Book reader, since our (also simplistic) implementation of https://github.com/mozilla/pdf.js is a bit slow on super large PDFs.
@tomadams re: your email today
Solution. Simple. Get more Metadata. How?
Run PDF Info twice:
1.- get the pages as we do now
2.- then use the -f and -l arguments to get all the dimensions for all pages. Store that into an array and add to the JSON. 1000 pages, 1000 entries? May need to think about that but seems feasible, but could also go directly into SOLR same way we expect Text extraction, HOCR and entity extraction would happen per page (one Solr doc per page).
Use that data in the manifest and also rewrite our manifests. The one we have in play.archipelago.nyc is passing the IIIF V2 tests correctly, we need the same for IIIF V3.
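As a rough sketch of step 2, the per-page sizes could be pulled out of the pdfinfo text output with a small parser. This assumes pdfinfo prints one line per page shaped like "Page    1 size: 612 x 792 pts (letter)" when called with -f and -l; the function name is hypothetical and the sample string is made up.

```php
<?php
/**
 * Extracts per-page dimensions from the text printed by
 * pdfinfo -f 1 -l <last page> file.pdf.
 * Assumes lines shaped like "Page    1 size: 612 x 792 pts (letter)".
 */
function parsePdfinfoPageSizes(string $output): array {
  $pages = [];
  foreach (preg_split('/\R/', $output) as $line) {
    if (preg_match('/^Page\s+(\d+)\s+size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts/', $line, $m)) {
      // Key by page number so the array can go straight into the JSON.
      $pages[(int) $m[1]] = ['width' => (float) $m[2], 'height' => (float) $m[3]];
    }
  }
  return $pages;
}

// Made-up sample resembling a two page run.
$sample = "Pages:          2\nPage    1 size: 612 x 792 pts (letter)\nPage    2 size: 595.28 x 841.89 pts (A4)\n";
echo json_encode(parsePdfinfoPageSizes($sample)), "\n";
```

For a 1000 page PDF this yields 1000 small entries, which is the size question raised above: it may be fine in the JSON, or it may belong in per-page Solr documents instead.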
Ok, still confused about this. @alliomeria may know better. Will put two Examples here: first one clean EXIF
https://play.archipelago.nyc/do/f4a4c6ee-4ce9-4b4c-8704-e8057bad0a7d
{
"flv:exif": {
"ISO": 100,
"Flash": "No Flash",
"Model": "RICOH THETA S",
"Aperture": 2,
"FileSize": "2.8 MB",
"MIMEType": "image\/jpeg",
"ImageSize": "5376x2688",
"Sharpness": "Normal",
"ColorSpace": "sRGB",
"ImageWidth": 5376,
"XMPToolkit": "RICOH THETA for iOS 2.14.0",
"FocalLength": "1.3 mm",
"ImageHeight": 2688,
"GPSVersionID": "2.3.0.0",
"MeteringMode": "Multi-segment",
"ShutterSpeed": "1\/6400",
"WhiteBalance": "Auto",
"ProjectionType": "equirectangular",
"GPSImgDirection": 270,
"PoseRollDegrees": 0,
"DateTimeOriginal": "2020:07:02 17:25:15",
"PosePitchDegrees": 0,
"UsePanoramaViewer": true,
"GPSImgDirectionRef": "True North",
"PoseHeadingDegrees": 0,
"FullPanoWidthPixels": 5376,
"CroppedAreaTopPixels": 0,
"ExposureCompensation": 0,
"FullPanoHeightPixels": 2688,
"CroppedAreaLeftPixels": 0,
"CroppedAreaImageWidthPixels": 5376,
"CroppedAreaImageHeightPixels": 2688
}
}
Unclean (see duplication because of changes history in the second PDF)
http://ec2-184-73-148-144.compute-1.amazonaws.com/do/018744ea-1d99-4d71-bd93-6cd402a82d74
{
"flv:exif": {
"Title": "Basic RGB",
"Format": "application\/pdf",
"NPages": 1,
"FileSize": "1934 kB",
"FontFace": [
"Regular",
"Regular"
],
"FontName": [
"BebasNeue-Regular",
"MyriadPro-Regular"
],
"FontType": [
"Open Type",
"Open Type"
],
"MIMEType": "application\/pdf",
"Producer": "Adobe PDF library 10.01",
"PageCount": 1,
"CreateDate": "2019:11:01 17:16:50-04:00",
"DocumentID": "xmp.did:e88490b4-4350-2243-9e6a-e0e8a9092ec9",
"FontFamily": [
"Bebas Neue",
"Myriad Pro"
],
"InstanceID": "uuid:fac62424-48d5-4b85-84fd-49beb49d517c",
"Linearized": "No",
"ModifyDate": "2019:11:01 21:58:38-04:00",
"PDFVersion": 1.5,
"PlateNames": [
"Cyan",
"Magenta",
"Yellow",
"Black"
],
"XMPToolkit": "Adobe XMP Core 5.6-c145 79.163499, 2018\/08\/13-16:40:22 ",
"CreatorTool": "Adobe Illustrator CC 23.1 (Windows)",
"FontVersion": [
"Version 2.000;PS 002.000;hotconv 1.0.88;makeotf.lib2.5.64775",
"Version 2.106;PS 2.000;hotconv 1.0.70;makeotf.lib2.5.58329"
],
"HistoryWhen": [
"2019:10:30 12:21:32-04:00",
"2019:11:01 17:16:51-04:00"
],
"FontFileName": [
13407,
"MyriadPro-Regular.otf"
],
"MaxPageSizeH": 28,
"MaxPageSizeW": 42,
"MetadataDate": "2019:11:01 21:58:38-04:00",
"FontComposite": [
false,
false
],
"HistoryAction": [
"saved",
"saved"
],
"CreatorVersion": 23,
"HistoryChanged": [
"\/",
"\/"
],
"RenditionClass": "proof:pdf",
"StartupProfile": "Basic RGB",
"ThumbnailWidth": 256,
"MaxPageSizeUnit": "Inches",
"SwatchGroupName": [
"Default Swatch Group",
"Cold",
"Grays"
],
"SwatchGroupType": [
0,
0,
0
],
"ThumbnailFormat": "JPEG",
"ThumbnailHeight": 172,
"ContainerVersion": 11,
"ManifestLinkForm": [
"EmbedByReference",
"EmbedByReference"
],
"HistoryInstanceID": [
"xmp.iid:615189d1-95dc-e64c-b838-2a31d901c875",
"xmp.iid:e88490b4-4350-2243-9e6a-e0e8a9092ec9"
],
"SwatchColorantRed": [
255,
0,
255,
255,
0,
0,
0,
255,
192,
236,
240,
246,
250,
251,
216,
139,
57,
0,
0,
34,
0,
41,
0,
46,
27,
102,
146,
157,
211,
236,
198,
152,
115,
83,
197,
165,
139,
117,
96,
66,
101,
130,
185,
0,
26,
51,
77,
102,
128,
152,
178,
203,
229,
241
],
"EmbeddedImageWidth": 381,
"OriginalDocumentID": "uuid:9E3E5C9A8C81DB118734DB58FDDE4BA7",
"SwatchColorantBlue": [
255,
0,
0,
0,
0,
255,
255,
255,
45,
36,
36,
30,
59,
33,
33,
63,
74,
69,
55,
115,
156,
225,
187,
145,
100,
144,
142,
93,
90,
121,
152,
117,
87,
65,
109,
82,
57,
36,
19,
11,
207,
196,
200,
0,
26,
51,
77,
102,
128,
152,
178,
203,
229,
241
],
"SwatchColorantMode": [
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB",
"RGB"
],
"SwatchColorantType": [
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS",
"PROCESS"
],
"EmbeddedImageFilter": "FlateDecode",
"EmbeddedImageHeight": 602,
"HasVisibleOverprint": false,
"IngredientsFilePath": [
"C:\\Users\\clee4\\Downloads\\mora_2018_figures\\Photos to Share\\CAD Images\\20191101-DSC00067.jpg",
"C:\\Users\\clee4\\Downloads\\mora_2018_figures\\Photos to Share\\Poster and Paper Figures\\coral-reef-drawing-10.png"
],
"SwatchColorantGreen": [
255,
0,
0,
255,
255,
255,
0,
0,
39,
28,
90,
146,
175,
237,
223,
197,
180,
145,
104,
180,
168,
170,
113,
49,
20,
45,
39,
0,
20,
30,
177,
133,
99,
71,
155,
124,
98,
76,
56,
33,
199,
138,
154,
0,
26,
51,
77,
102,
128,
152,
178,
203,
229,
241
],
"HistorySoftwareAgent": [
"Adobe Illustrator CC 22.1 (Windows)",
"Adobe Illustrator CC 23.1 (Windows)"
],
"DerivedFromDocumentID": "xmp.did:f6f8d79a-3268-9b47-b8ca-5ae8fc53d04a",
"DerivedFromInstanceID": "xmp.iid:f6f8d79a-3268-9b47-b8ca-5ae8fc53d04a",
"IngredientsDocumentID": [
"xmp.did:24918461-5358-463f-8f02-8a25bbc0f753",
"adobe:docid:photoshop:80092d49-6ba7-f649-b041-a6d0af913b7f"
],
"IngredientsInstanceID": [
"xmp.iid:24918461-5358-463f-8f02-8a25bbc0f753",
"xmp.iid:c964d0b5-30e9-854d-9422-d12453198b63"
],
"HasVisibleTransparency": true,
"EmbeddedImageColorSpace": [
"DeviceRGB",
"Indexed",
"DeviceRGB",
1,
"DeviceRGB"
],
"SwatchColorantSwatchName": [
"White",
"Black",
"RGB Red",
"RGB Yellow",
"RGB Green",
"RGB Cyan",
"RGB Blue",
"RGB Magenta",
"R=193 G=39 B=45",
"R=237 G=28 B=36",
"R=241 G=90 B=36",
"R=247 G=147 B=30",
"R=251 G=176 B=59",
"R=252 G=238 B=33",
"R=217 G=224 B=33",
"R=140 G=198 B=63",
"R=57 G=181 B=74",
"R=0 G=146 B=69",
"R=0 G=104 B=55",
"R=34 G=181 B=115",
"R=0 G=169 B=157",
"R=41 G=171 B=226",
"R=0 G=113 B=188",
"R=46 G=49 B=146",
"R=27 G=20 B=100",
"R=102 G=45 B=145",
"R=147 G=39 B=143",
"R=158 G=0 B=93",
"R=212 G=20 B=90",
"R=237 G=30 B=121",
"R=199 G=178 B=153",
"R=153 G=134 B=117",
"R=115 G=99 B=87",
"R=83 G=71 B=65",
"R=198 G=156 B=109",
"R=166 G=124 B=82",
"R=140 G=98 B=57",
"R=117 G=76 B=36",
"R=96 G=56 B=19",
"R=66 G=33 B=11",
"C=56 M=0 Y=20 K=0",
"C=51 M=43 Y=0 K=0",
"C=26 M=41 Y=0 K=0",
"R=0 G=0 B=0",
"R=26 G=26 B=26",
"R=51 G=51 B=51",
"R=77 G=77 B=77",
"R=102 G=102 B=102",
"R=128 G=128 B=128",
"R=153 G=153 B=153",
"R=179 G=179 B=179",
"R=204 G=204 B=204",
"R=230 G=230 B=230",
"R=242 G=242 B=242"
],
"DerivedFromRenditionClass": "proof:pdf",
"ManifestReferenceFilePath": [
"C:\\Users\\clee4\\Downloads\\mora_2018_figures\\Photos to Share\\CAD Images\\20191101-DSC00067.jpg",
"C:\\Users\\clee4\\Downloads\\mora_2018_figures\\Photos to Share\\Poster and Paper Figures\\coral-reef-drawing-10.png"
],
"ManifestReferenceDocumentID": [
"xmp.did:24918461-5358-463f-8f02-8a25bbc0f753",
"adobe:docid:photoshop:80092d49-6ba7-f649-b041-a6d0af913b7f"
],
"ManifestReferenceInstanceID": [
"xmp.iid:24918461-5358-463f-8f02-8a25bbc0f753",
"xmp.iid:c964d0b5-30e9-854d-9422-d12453198b63"
],
"DerivedFromOriginalDocumentID": "uuid:9E3E5C9A8C81DB118734DB58FDDE4BA7"
}
}
The question is: do we de-dup? Do we simply strip a list of offenders from EXIF? I mean, I love the idea of indexing the color swatches in Solr, but it's a lot, like really too much?
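To make the two options concrete, here is a small sketch that does both: it drops keys matching a list of "offender" prefixes and, optionally, collapses arrays whose values are all identical. The prefix list and the function name are assumptions for illustration, not an agreed setting.

```php
<?php
/**
 * Drops EXIF keys matching an exclusion list and optionally collapses
 * arrays whose values are all identical (e.g. 54 times "RGB").
 * The default prefixes are illustrative assumptions only.
 */
function slimExif(array $exif, array $excludedPrefixes = ['SwatchColorant', 'SwatchGroup'], bool $collapseRepeats = true): array {
  $clean = [];
  foreach ($exif as $key => $value) {
    foreach ($excludedPrefixes as $prefix) {
      if (strpos((string) $key, $prefix) === 0) {
        continue 2; // Skip this EXIF key entirely.
      }
    }
    if ($collapseRepeats && is_array($value) && count($value) > 1) {
      $first = reset($value);
      $allSame = true;
      foreach ($value as $item) {
        if ($item !== $first) {
          $allSame = false;
          break;
        }
      }
      if ($allSame) {
        // Every entry is the same, keep a single copy.
        $value = $first;
      }
    }
    $clean[$key] = $value;
  }
  return $clean;
}
```

Note that collapsing is lossy for parallel arrays such as HistoryAction and HistoryWhen, where position matters, which is one reason to keep it behind a flag.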
See esmero/archipelago-deployment#54
Basically fix the last deprecation notices and bump compatibility version to 8.x | 9.x
Something we discussed today with @giancarlobi (it affects that very specific Self Deposit use case but, because of that, can be extrapolated to more generic needs): we need a better way of making sure that Archipelago/Strawberry field always has access to files in the place it wants/needs them, so IIIF/Security/Access and Order work (not global order, no worries, not that type of order here). We do a pretty good job but there are always edge cases, and even a year or more ago we were too flexible and had files moving around and being renamed all over the place.
For those who do not know how our file persisting strategy works (same since the start of the project, just getting smarter every day!), there are a few Event Subscribers/Data describing logics that happen in a certain order (@alliomeria this is for you also, so we can make a tiny .MD file in the docs explaining this)
as:somefiletype JSON structure into the main ADO SBF JSON with info about the file, checksums, size, Drupal fids, uuid, etc. This is a heavy function, part of the StrawberryfieldFilePersisterService. It does a lot, and I tried to optimize its logic, but we may do more in the future to handle too many files/too big files (FYI: the solution is simple, add to a queue and process later).
StrawberryfieldEventPresaveSubscriberAsFileStructureGenerator runs and checks if 2.1 was already processed. This is needed since the user could have triggered an ingest via drush/JSONAPI/Webhooks etc. If all is well (this is a less expensive check) we continue.
StrawberryfieldEventPresaveSubscriberFilePersister runs, checking all TEMPORARY files described in as:somefiletype and actually copying them to the right "desired" location.
StrawberryfieldEventInsertFileUsageUpdater also marking the file as "being" used by a Strawberry driven Node (different Event).
NOTE: Interesting to know (also for @alliomeria) is that anytime we remove directly/raw from the JSON a full as:somefiletype structure of a sub element from an as:structure, we force Archipelago to do all of this again, and we can regenerate technical metadata. We have used this when updating EXIF binaries or even when something went wrong (while testing, this stuff is safe, no worries). It works well and I will eventually add a BIG red button that does that if you do not like JSON editing.
Many other events trigger other things. But the key to understand this is:
1.- Archipelago (the wise) was always acting on "temporary" files here. Temporary means $file->isTemporary() returns true, which means Drupal would eventually get rid of them if we do not act. They are tracked, they have a Drupal ID and UUID, but they are not meant to survive a Cron run. Assigning a clean name and desired destination works based on that, and copying them to that place also expects the file to be temporary. So why?
The logic was to not over process (file operations are expensive) but also to not step on the toes of other modules and "ways" by moving files that some other entities/nodes/ADOs may already be referencing. Archipelago allows files to be reused many times. All was fine (almost) until we added Self Deposits!!
In Self Deposit situations we may allow Anonymous users to upload a file and metadata (works great, by the way). In those cases the file, when the submission ends, gets taken and made permanent by the webform module, and usage is added to a webform submission. That immediately is a kill switch for all our logic and we leave the file in peace. Well, nice of us, but not good for IIIF or our need to keep the house clean (not my own for sure..).
What to do? Is the logic not made for this case? It was just too respectful, and sometimes you (or your code) need to step up and demand what is right: a place where we want a file to be.
Fix was not complex (already did it, now testing) but involves:
Ok. I think the explanation is actually more than the CODE but is needed. Will make a pull later tonight.
What is next? (another pull)
@giancarlobi suggested/needs also: unmanaged files (Fedora 3 files). That will bring a new mapping in "ap:entitymapping", a new Webform upload and a new type of as:somefiletype
sub structure that allows us to "mention/reference" these file without ever taking control of them, but still having enough data to act/do things with them.
@alliomeria Deposit directly to backend storage, Dropbox-like, which is really #76 (secretly hidden because I got crickets with the idea that time and it is still a great one!). Imagine you have a 2TB large VIDEO of your wedding (a memory worth preserving?); there are better ways of uploading files than via the web browser. I promise it's real. You upload your file via one of those (Multipart S3, FTP, who knows) together with a manifest that comes with the Big Binary. Or a ZIP file with a Manifest (frictionless data package). The manifest contains a few cool keys (including a secret TOKEN!). Code comes with a webform element (paste your Token, we connect everything) and an "ap:entitymapping" subset. Also a manifest/token generator Form in your User account so you can do the work AUTOMATICALLY! (one time use) and just attach. I guess we can even add that as an API later, e.g. for just the TOKEN. Hope someone else is reading this too since it's a lot! And it sounds so great.
The default Drupal 8 theme hinting for nodes is kinda silly. It uses page-- or page--node plus some number, like page--node--12.html.twig. We really think it's cool that we are actually using NODES as ADOs and that we only need one Content Type (Digital Object), but we also want to allow site builders and theming folks to target pages via templates grouped by JSON key type (book, etc) and also specifically for an ADO (which bears a SBF).
Solution: find a simple way of adding a theme hint for SBF bearing Content Types (easy, code is there) but also for specific JSON types. That way we can give people close to design and aesthetics the chance to shine. Beta3 task
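A plain PHP sketch of how such hints could be derived from the SBF JSON, assuming the type lives under a top-level 'type' key. The 'page__node__ado' naming is invented for this example; in a module, strings like these would typically be appended from a theme suggestions alter hook, where Drupal maps the double underscores to the -- used in template file names.

```php
<?php
/**
 * Builds theme suggestion names for an ADO from its SBF JSON.
 * Hypothetical naming scheme: page__node__ado[__<type>|__<uuid>].
 */
function buildAdoThemeSuggestions(string $sbfJson, string $uuid, string $typeKey = 'type'): array {
  // Generic hint for any node bearing a Strawberryfield.
  $suggestions = ['page__node__ado'];
  $data = json_decode($sbfJson, true);
  if (is_array($data) && isset($data[$typeKey]) && is_string($data[$typeKey])) {
    // Normalize "Book", "Photograph Album", etc. to a safe machine name.
    $type = trim(preg_replace('/[^a-z0-9]+/', '_', strtolower($data[$typeKey])), '_');
    if ($type !== '') {
      $suggestions[] = 'page__node__ado__' . $type;
    }
  }
  // Most specific hint last: one particular ADO by UUID.
  $suggestions[] = 'page__node__ado__' . str_replace('-', '_', $uuid);
  return $suggestions;
}
```

Ordering from generic to specific follows the usual convention that later suggestions take precedence.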
See esmero/archipelago-deployment#17
Sometimes Google or any other WebSem JSON-LD provider can forget to pay the bill and external services (to us..) can go out. We download now schema.org or basically any JSON-LD context used to feed initially our SBF properties directly from the web. In a perfect scenario that is OK. In reality, as demostrated today Dec 3rd 2019 via a Quote exceded message on Schema.org, that is not so true.
We currently cache downloaded contexts, but what if there is nothing to download? One good option, when the cache is not yet present, is to allow a file fallback (yes, a file!): first check locally for a JSON-LD file that matches (one that was provided or generated by us). If there is none, try the remote option. And when the remote option happens, we still download to a file and keep it local. Good, good.
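The lookup order described here (local file first, remote second, then keep what was downloaded) could look roughly like this. It is a sketch in Python, not the module's PHP code; `load_context`, the injected `fetch` callable and the SHA-256 based file naming are all hypothetical choices made for the example.

```python
import hashlib
import json
from pathlib import Path


def load_context(url, cache_dir, fetch):
    """Return a JSON-LD context, preferring a local file over the network.

    `fetch` is any callable that takes a URL and returns the decoded JSON
    document; it is injected so the lookup order can be exercised offline.
    """
    cache_dir = Path(cache_dir)
    # Hypothetical naming scheme: one file per URL, keyed by its SHA-256.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".jsonld"
    local = cache_dir / name

    if local.is_file():
        # 1. A local copy exists: use it and never touch the network.
        return json.loads(local.read_text(encoding="utf-8"))

    # 2. No local copy: try the remote provider (this may raise).
    context = fetch(url)

    # 3. Keep what we downloaded so the next call survives an outage.
    cache_dir.mkdir(parents=True, exist_ok=True)
    local.write_text(json.dumps(context), encoding="utf-8")
    return context
```

With this order, a "Quota exceeded" day only hurts for contexts that were never downloaded and were not shipped as files up front.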
Well, you may need to refresh your remote schema, and since the file is already there we will keep using it. The simple solution is documentation: teach people how to remove the file. A better (or other) solution would be to expose that in the Plugin Config (which requires some serious recoding because of how Config Entity saving happens in the Plugin form, and not in each implementation). Another is auto-expiring based on last modification, say every 3 months or so. Still, in 3 months we could hit a blackout again... We can discuss this further.
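For the auto-expiring option, the staleness check itself is tiny. A sketch in Python under stated assumptions: `needs_refresh` and the 90-day constant are hypothetical, and "3 months" is approximated as 90 days.

```python
import os
import time

MAX_AGE_SECONDS = 90 * 24 * 60 * 60  # roughly 3 months


def needs_refresh(path, now=None, max_age=MAX_AGE_SECONDS):
    """True when the file is missing or was last modified more than max_age ago."""
    now = time.time() if now is None else now
    try:
        modified = os.path.getmtime(path)
    except OSError:
        # No readable file means there is nothing to reuse.
        return True
    return (now - modified) > max_age
```

One way to soften the "blackout again in 3 months" worry would be to treat a `True` result only as "try the remote", and to keep serving the old file if that download fails.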
Basically a quick hack (don't do this at home, kids) for File Description extraction when dealing with ADOs (Archipelago Digital Objects) that have small binaries or not many binaries, and that don't (yet) have the luxury of strawberry_runners doing all its cool ReactPHP async background processing for them.
This adds a WIP method to the filePersisterService that will process files (moving them to tmp), run exiftool and (UK) PRONOM extraction (FIDO), and push all that insane amount of data back into the JSON.
A check that says: hey! You are adding like 3 files, OK, I can process those in sync/realtime. But hey, those are 2000 pages of poetry and thoughts, no way I'm doing that; install/set up strawberry_runners for that. I will only do it for the first 3(?) that are less than 100 MB, or so.
Move checksumming, which runs in the main method, into this ::getBaseFileMetadata() (oh.. you have not seen the pull yet.. well, it's linked down below!)
Add a Config Form for FIDO and EXIFTOOL. In the future I also want to use https://github.com/richardlehane/siegfried
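The "only the first few small files get processed in sync" check from the list above could be sketched like this. It is Python for illustration only; `split_for_processing` and both limits are hypothetical, with the values taken from the "3 files" and "100 MB, or so" guesses in the text.

```python
MAX_SYNC_FILES = 3
MAX_SYNC_BYTES = 100 * 1024 * 1024  # 100 MB


def split_for_processing(files, max_files=MAX_SYNC_FILES, max_bytes=MAX_SYNC_BYTES):
    """Split (name, size_in_bytes) pairs into (sync, deferred) lists.

    Only the first `max_files` files smaller than `max_bytes` are handled
    synchronously; everything else is left for background processing.
    """
    sync, deferred = [], []
    for name, size in files:
        if len(sync) < max_files and size < max_bytes:
            sync.append(name)
        else:
            deferred.append(name)
    return sync, deferred
```

Anything that lands in the deferred list is exactly the case where the user would be told to set up strawberry_runners.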
Use Finding Aids (EAD) in an Archipelago environment with Strawberryfield support. Focus on EAD3
https://github.com/saa-ead-roundtable/ead3-toolkit
@type (Finding Aid?)
Finding aids in EAD are ancient technology, but there is no semantically aware replacement right now. That means they work, and archivists and archival systems rely heavily on them. Finding a solution could help bind archival needs with repository/access needs in a single, simple-to-use solution. Should we also focus on/research binding directly to and from ArchivesSpace?
Strawberryfield flattens the whole JSON structure on common keys so that individual, deeply nested values can be exposed under a common key. This is used for Drupal's interaction with Solr, among many other use cases. It works as expected in most cases but, weirdly enough (my bad), it over-nests items if a single key contains multiple values in a non-associative array.
The issue happens exactly here: https://github.com/esmero/strawberryfield/blob/master/src/Tools/StrawberryfieldJsonHelper.php#L136
function arrayToFlatCommonkeys(array $array, &$flat = array(), $jsonld = TRUE)
Is a recursive function that accumulates values it finds as it traverses the array hierarchy.
Well, I'm really bad at managing recursion in my brain (I noticed it a bit late, like 41 years too late), so I have to debug this a little deeper. The gist is basically making sure that the extra nesting does not happen under that circumstance. Since I cannot know up front if the array I will get has more levels, and I also don't want to look down the tree in a recursive function (but then, by being recursive, I should not even care.. gosh, recursion?), I need to assign values by either merging them with existing ones or simply assigning them if the "key" I'm accumulating was found for the first time.
@giancarlobi this is what you just saw in your Solr. Give me like 30-60 minutes and a few cups of coffee. I know I have done this correctly many times in the past; I just need to debug, debug and build some test cases.
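To make the "merge if the key exists, assign if it is new" idea concrete, here is a small sketch in Python. It is not the PHP code of arrayToFlatCommonkeys and does not reproduce its JSON-LD handling; `flatten_common_keys` and `_accumulate` are hypothetical names, and the sketch assumes that only scalar values (directly under a key, or inside a list directly under a key) are collected.

```python
def flatten_common_keys(node, flat=None):
    """Collect every scalar found under the same key, at any depth, into one flat list."""
    if flat is None:
        flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            _accumulate(flat, key, value)
            # Keep walking down: nested dicts and lists may hold more keys.
            flatten_common_keys(value, flat)
    elif isinstance(node, list):
        for item in node:
            flatten_common_keys(item, flat)
    return flat


def _accumulate(flat, key, value):
    """Merge into an existing key, or assign when the key is seen for the first time."""
    # A list of values is added item by item, so it never ends up nested
    # one level too deep inside the accumulated list.
    values = value if isinstance(value, list) else [value]
    scalars = [v for v in values if not isinstance(v, (dict, list))]
    if not scalars:
        return
    if key in flat:
        flat[key].extend(scalars)
    else:
        flat[key] = list(scalars)
```

The point of the sketch is the `extend` on a list value: a key holding `["poetry", "letters"]` contributes two entries to the flat list instead of one nested array.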
To be honest, I'm not sure if this needs to be fixed, just reviewed, or if simply other modules are not good. But while testing (with success) PDF to text extraction by creating an S3-aware PDF to Text plugin, I found that Search API was calling some methods that are not present in our special source, because it's not a content entity, just a data type. Then I also noticed that we provide no Display Mode, which means that even if we store HOCR or any other flavor, we cannot show it.
@giancarlobi you worked on this, so maybe you have more ideas/questions. Does any of this make sense? I will share my findings later today.
In the 3.0 draft version of IIIF manifests, 'width' is (finally) not a requirement. That is the whole reason for having, e.g., an info.json: the client can take the sizes/proportions from there. See https://iiif.io/api/presentation/3.0/#width
But because of CORS, some tiny bugs, or implementation details we cannot manage right now, client viewers without that info tend to fail badly. In the past (our past is short) we used to hit the IIIF server's info.json to extract that and pass it directly to any client that needed it, like in the thumbnail processor.
Still, it would be nice to have, inside our File Persister service, a simple call to exif/PRONOM to fetch that data up front. I'm not necessarily thinking about the whole exif data (like the 148 fields I got the other day from a simple image), just the basics. There are other use cases where having that directly in the JSON can be useful.
So what is needed?
Not much (I mean, it's never easy peasy). We really need to decide if this will be part of the strawberry_runners module, or if we can simply deal with these tiny/quick processing needs directly inside this module with some settings; runners can then expand on/reuse those settings. That would require me to move the base plugin logic I have staged in runners to SBF (not a big deal really) and implement 3x fixed plugins, one for each binary that is modal, executed against uploaded files. Those would run while persisting/updating an image on storage, the same as we do right now in the persister service. It is probably better to move this into its own service that runs after persisting, since the persister is already quite heavy on logic. (Or not... better an event subscriber, but triggered inside the persister, since I would love to process this while the files are still local. We can always fall back to download to temp, process, etc., but think about the larger ones!) See https://github.com/esmero/strawberryfield/blob/8.x-1.0-beta2/src/StrawberryfieldFilePersisterService.php
Our SB Flavor Search API Data Source plugin inherits the basic configs of every other content Data Source, but now, in 8.8.1, we need to be quite explicit about those configs being defined in a schema.
When we set the title/label of a node via the event subscriber, we notify the user every time something happens, even when the title being set is exactly the same as the one that was there before.
We don't like that, and it's confusing.
I looked at the code a few times, thinking about whether we should totally avoid assigning a title if the current one is already in place and equal, but then decided to leave that check out. Since we have to calculate it anyway, it's an extra CPU cycle (or cycles) to check/validate before setting. I decided to only check whether there was a change after the fact, not before. Pretty sure I will change my mind in the future and will come up with even more optimal code. For now this is OK.
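The "always set, only notify if something changed" decision can be sketched in a few lines. This is Python for illustration, not the event subscriber's PHP; `apply_title`, the dict standing in for a node and the `notify` callable are hypothetical.

```python
def apply_title(node, computed_title, notify):
    """Set the computed title and notify only when the stored value really changed."""
    previous = node.get("title")
    # Always assign; the title had to be computed anyway.
    node["title"] = computed_title
    # Compare after the fact, and stay quiet when nothing changed.
    changed = previous != computed_title
    if changed:
        notify(f"Title updated to: {computed_title}")
    return changed
```

Saving the same ADO twice would then produce one message, not two.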