volume-manifest-builder's Introduction

Vagrant scripts for BUDA platform instantiation

The base platform is built using Vagrant and VirtualBox:

  1. Install Vagrant and VirtualBox.
  2. Download or git clone this repository.
  3. cd into the unzipped directory or the cloned repository.
  4. Install the VirtualBox guest additions plugin with vagrant plugin install vagrant-vbguest.
  5. Run vagrant up to start a local instance.

Or for an AWS EC2 instance:

  1. Install the vbguest plugin: vagrant plugin install vagrant-vbguest
  2. Run vagrant up, or rename Vagrantfile.aws to Vagrantfile and run vagrant up --provider=aws

This will take a while, because it installs all the dependencies of the BUDA platform.

Once the initial install has completed, the command vagrant ssh will connect to the instance, where development, customization of the environment, and so on can be performed as on any headless server.

The jena-fuseki server will be listening on:

http://localhost:13180/fuseki

The lds-pdi application is accessible at:

http://localhost:13280/

(see https://github.com/buda-base/lds-pdi/blob/master/README.md for details about using these REST services)

The command vagrant halt will shut the instance down. After halting (or suspending) the instance, a further vagrant up will simply boot the instance without further downloads, and vagrant destroy will completely remove the instance.

If running an AWS instance, after provisioning, access the instance via ssh -p 15345, delete Port 22 from /etc/ssh/sshd_config, and run sudo systemctl restart sshd. This further secures the instance against attacks on port 22.

volume-manifest-builder's People

Contributors

drupchen, eroux, jimk-bdrc, tbrc-timb


volume-manifest-builder's Issues

Support eXistDB

If this tool is to be run as part of the Sync to Amazon process, it needs to get lists of images from eXist, because BUDA doesn't have them yet. Use the eXist work-igs query to get the list of image groups. Use S3 to get the object names in each group.

fetching only the metadata

This might be a bit tricky, and we need a backup plan in case it doesn't work, but the technique described here looks promising: download only the few kilobytes of data needed to get the image dimensions. That would speed up the script significantly.
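
As an illustration of reading dimensions from a small prefix of a file, here is a minimal sketch for JPEG only, using just the standard library. It is not necessarily the technique referred to above. The function name jpeg_dimensions is hypothetical, TIFF is not handled, and a caller would still need the backup plan of fetching the whole object when this returns None.

```python
import struct
from typing import Optional, Tuple

# Start-of-frame markers, the segments that carry the frame dimensions.
_SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
                0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_dimensions(prefix: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) parsed from the leading bytes of a JPEG.

    Returns None when the data is not a JPEG or the prefix ends before
    a start-of-frame segment, so the caller can fall back to a full read.
    """
    if prefix[:2] != b"\xff\xd8":          # every JPEG starts with SOI
        return None
    pos = 2
    while pos + 4 <= len(prefix):
        if prefix[pos] != 0xFF:            # lost marker alignment
            return None
        marker = prefix[pos + 1]
        if marker == 0xFF:                 # padding byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", prefix[pos + 2:pos + 4])
        if marker in _SOF_MARKERS:
            if pos + 9 > len(prefix):      # segment cut off by the prefix
                return None
            # Layout: marker(2) length(2) precision(1) height(2) width(2)
            height, width = struct.unpack(">HH", prefix[pos + 5:pos + 9])
            return width, height
        pos += 2 + length                  # skip this segment
    return None
```

With S3, the prefix could plausibly be obtained with a ranged GetObject request (boto3's get_object accepts a Range argument such as "bytes=0-65535"); the right prefix size is an assumption that would need testing against real images.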

Update AWS dependencies

The current installer package seems to overwrite boto, botocore, and boto3 with older versions, or at least interferes with the awscli pip package.
Remove the version numbers and test.

Sort image file ordering

v_m_b gets its list of files from (in order):

  • a Bill of Materials that bdrcSync.sh builds
  • Python's os.listdir()
  • AWS S3 ListObjects call

None of these are guaranteed to be ordered. IIIFPRES requires an ordered list (ao-608).
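
Whatever the source of the file list, the names can be sorted before the manifest is written. The sketch below assumes a "natural" ordering (digit runs compared as numbers) is what is wanted; plain sorted() gives the same result when the numeric parts are zero-padded. The names natural_key and ordered_images are hypothetical.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that 'img2' sorts before 'img10'."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so parts at the same position are always of the same type.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def ordered_images(names):
    """Return the image file names in a stable, deterministic order."""
    return sorted(names, key=natural_key)
```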

Handle image groups which are not present

In W8LS66822, get_volume_infos_from_S3 queries BUDA and gets back:

"item","list","grId"
"bdr:I8LS66822","","I8LS68086"
"bdr:I8LS66822","","I8LS68087"
"bdr:I8LS66822","","I8LS68088"
"bdr:I8LS66822","I8LS680890001.tif:2|I8LS680890003.jpg:85","I8LS68089"
"bdr:I8LS66822","I8LS680900001.tif:2|I8LS680900003.jpg:81","I8LS68090"
"bdr:I8LS66822","","I8LS68091"
"bdr:I8LS66822","","I8LS68092"
...

and chokes when a group has no images. It should ignore such groups.
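
A minimal sketch of the intended behavior, assuming the response is CSV text with the three columns shown above. The function name split_groups is hypothetical, and the content of the "list" column is passed through uninterpreted.

```python
import csv
import io


def split_groups(csv_text):
    """Separate image groups that list images from those that list none.

    Returns (rows_with_images, ids_of_empty_groups) so that empty groups
    can be logged and skipped instead of raising.
    """
    with_images, empty_ids = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["list"].strip():
            with_images.append(row)
        else:
            empty_ids.append(row["grId"])
    return with_images, empty_ids
```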

Place a format descriptor in the output

As part of Archive ops issue 607 @eroux makes a case for an image type descriptor going into the output of volume_manifest_builder.
The fix for this issue will map image types into a controlled dictionary that IIIFPRES uses to determine the image type.

At @eroux's request, the implementation was updated to emit the image format discovered from the image only when that format does not match the file name extension. The mapping of image formats to file name extensions is coded as:

Image Format    File suffixes (case insensitive)
JPEG            jpg, jpeg
TIFF            tiff, tif
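
The rule above can be sketched as follows. The names SUFFIXES_BY_FORMAT and format_to_emit are hypothetical, and emitting any format that is absent from the table is an assumption, not something the issue states.

```python
import os

# Mapping taken from the table above; suffixes are compared case-insensitively.
SUFFIXES_BY_FORMAT = {
    "JPEG": {"jpg", "jpeg"},
    "TIFF": {"tiff", "tif"},
}


def format_to_emit(file_name, detected_format):
    """Return the format to write to the output, or None when the suffix already implies it."""
    suffix = os.path.splitext(file_name)[1].lstrip(".").lower()
    detected = detected_format.upper()
    if suffix in SUFFIXES_BY_FORMAT.get(detected, set()):
        return None          # the extension and the content agree
    return detected          # mismatch, or a format not in the table
```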

Make volume manifest tool a standing service

v-m-t has the infrastructure to be run as a service, but it is usually run from a shell script that runs once and then shuts the machine down. Make it poll s3://manifest.bdrc.org/processing/todo for work. Use the existing service/ infrastructure (without the shutdown).
See also #12
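
The polling behavior could look roughly like the loop below. This is a sketch only: serve, list_todo, and process are hypothetical names, with list_todo standing in for whatever lists the pending items under the todo prefix. It does not use the existing service/ code.

```python
import time


def serve(list_todo, process, poll_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Repeatedly fetch pending work items and process them.

    max_polls=None runs forever, as a standing service would;
    a finite value makes the loop testable.
    """
    polls = 0
    handled = 0
    while max_polls is None or polls < max_polls:
        for item in list_todo():      # e.g. work lists found under the todo prefix
            process(item)
            handled += 1
        polls += 1
        sleep(poll_seconds)           # wait before polling again; no shutdown
    return handled
```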

parallel treatment

We should be able to run several jobs in parallel; the current script is too slow.
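
One possible shape for this, assuming the work is dominated by I/O (fetching images), is a thread pool over work RIDs. The names build_all and build_one are hypothetical; for CPU-bound work a process pool would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_all(work_rids, build_one, max_workers=4):
    """Run build_one(rid) for every work RID on a pool of worker threads.

    Returns (results, failures), both keyed by RID, so that one failing
    work does not stop the others.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build_one, rid): rid for rid in work_rids}
        for future in as_completed(futures):
            rid = futures[future]
            try:
                results[rid] = future.result()
            except Exception as exc:   # record the failure and keep going
                failures[rid] = exc
    return results, failures
```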

Support optional output folder

To fix archive-ops #506 we need to have the tool be able to read from and write to independent directories.

A use case for this is in the above issue. As a digital archivist, I want to generate a manifest for the archives as they are (/mnt/ArchiveX/nn/Work/etc.) but write the resulting manifest to a different tree (mydump/ArchiveX/nn/Work/images/), so that we can do a sync without slinging around a lot of images.
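
The path mapping in that use case can be sketched as below. The function name manifest_dir and its parameters are hypothetical, and the "images" leaf folder is taken from the example path above.

```python
from pathlib import PurePosixPath


def manifest_dir(work_dir, source_root, dest_root, leaf="images"):
    """Mirror a work's location under source_root into dest_root.

    Images are read from work_dir; only the manifest is written
    to the returned directory.
    """
    relative = PurePosixPath(work_dir).relative_to(source_root)
    return PurePosixPath(dest_root) / relative / leaf
```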

Volume manifest tool is a little too quick to emit an SNS when it is invoked from the command line incorrectly.

ValueError: Usage: manifestforwork sourceFile where sourceFile contains a list of work RIDs

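
One way to avoid this is to validate the command line before entering the code path that sends notifications. In the sketch below, run, do_work, and notify are hypothetical names; notify stands in for whatever publishes to SNS.

```python
import sys

USAGE = "Usage: manifestforwork sourceFile where sourceFile contains a list of work RIDs"


def run(argv, do_work, notify):
    """Return a process exit code; only real processing failures are notified."""
    if len(argv) != 2:
        # A usage mistake is the caller's problem: report locally, send nothing.
        print(USAGE, file=sys.stderr)
        return 2
    try:
        do_work(argv[1])
    except Exception as exc:
        notify(f"manifestforwork failed: {exc}")
        return 1
    return 0
```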

Fallback when BUDA is not ready

There are cases where a sync has occurred and BUDA has not yet been synced from eXist. This is a side effect of trying to run volume-manifest-tool when BUDA hasn't completed its image list: the BUDA API which v-m-t uses shows no images in each image group.

Therefore, the fallback is to use each image group and query S3 for the images in that image group, exactly as getVolumeInfosExist.expandGroups and expandImages do.
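
The fallback logic can be sketched independently of the real BUDA and S3 clients. Here images_for_group, buda_images, and s3_list_images are hypothetical names; the existing expandGroups and expandImages code is not reproduced.

```python
def images_for_group(group_id, buda_images, s3_list_images):
    """Prefer BUDA's image list; fall back to listing S3 when BUDA has none.

    buda_images: mapping of image group ID to the list BUDA returned.
    s3_list_images: callable that lists the image names for a group on S3.
    Returns (image_names, source) so the log can say which path was used.
    """
    images = buda_images.get(group_id) or []
    if images:
        return images, "buda"
    return s3_list_images(group_id), "s3"
```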

throw error on dimension of 0

The dimensions.json of image group I5449 of W22703 (see here) has some images that are computed as having a width and a height of 0px, which is not true (they seem to work fine on the tbrc.org website). @TBRC-JimK can you take a look?
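
A minimal sketch of the requested check, to be applied to each image before its entry is written; the function name check_dimensions is hypothetical.

```python
def check_dimensions(file_name, width, height):
    """Raise instead of silently recording a zero or negative dimension."""
    if width <= 0 or height <= 0:
        raise ValueError(f"{file_name}: implausible dimensions {width}x{height}")
    return width, height
```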

Add exception case for image group IDs that only contain 4 digits

For example, consider image group ID W28882-4784

According to Elie:

if an image group ID (such as I4784) is in the database, the
code that constructs its S3 path has one condition:

  • if the image group is I + 4 digits, it only uses the 4 digits, not the I
  • else it uses the full image group ID

So in that case it looks for W28882-4784

v_m_b will need an exceptional case made for any "I + 4 digits" image group IDs so that it searches for the appropriate image group folder on disk.
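
The rule quoted above translates directly into a small helper. The function name image_group_folder is hypothetical; the "work RID, hyphen, suffix" folder layout follows the W28882-4784 example.

```python
import re


def image_group_folder(work_rid, image_group_id):
    """Build the folder name for an image group.

    'I' followed by exactly four digits drops the 'I';
    every other ID is used in full.
    """
    match = re.fullmatch(r"I(\d{4})", image_group_id)
    suffix = match.group(1) if match else image_group_id
    return f"{work_rid}-{suffix}"
```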

handle error

The following error occurred on W21808-0117:

The object does not exist.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2315, in open
    fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./manifestforwork.py", line 222, in <module>
    main()
  File "./manifestforwork.py", line 25, in main
    manifestForList(sys.argv[1])
  File "./manifestforwork.py", line 39, in manifestForList
    manifestForWork(client, bucket, workRID)
  File "./manifestforwork.py", line 48, in manifestForWork
    manifestForVolume(client, bucket, workRID, vi)
  File "./manifestforwork.py", line 58, in manifestForVolume
    manifest = generateManifest(bucket, s3folderPrefix, vi.imageList)
  File "./manifestforwork.py", line 180, in generateManifest
    width, heigth = dimensionsFromBlobImage(blob)
  File "./manifestforwork.py", line 194, in dimensionsFromBlobImage
    im = Image.open(blob)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2317, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'NoneType' object has no attribute 'read'

This is probably an image that is listed in the image list but is not present on S3. This kind of case should be handled properly and logged to stderr.
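
A sketch of the guard, assuming the fetch step yields None for a missing object, as the traceback suggests. The names dimensions_or_none and measure are hypothetical; measure stands in for the existing code that opens the blob and reads its size.

```python
import logging
import sys

log = logging.getLogger("v_m_b_sketch")
log.addHandler(logging.StreamHandler(sys.stderr))  # errors go to stderr


def dimensions_or_none(image_name, blob, measure):
    """Return measure(blob), or None (after logging) when the object is missing."""
    if blob is None:
        log.error("%s is in the image list but was not found on S3; skipping",
                  image_name)
        return None
    return measure(blob)
```

The caller then has to skip entries for which None is returned rather than unpacking the result directly.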

NPL works fail during sync

Fails during dimensions manifest generation with this error:

    building W1NPL109 dimensions manifest for BUDA
ERROR - Could not find image groups for /Users/tbrc/staging/sync2archive/W1NPL109

2021-02-17_21_22_4651.local_v_m_b.log

These works don't have tbrc.org database records as the tech team only configured NPL works for BUDA.

Change v_m_b log file name

@TBRC-Travis requests buildManifestForWork-2022-05-27_09.23.07.txt rather than 2022-05-27_09.23.07-local_v_m_b.log

That is, use a camel-cased name, with the name placed first.

@jimk-bdrc notes

  • this can be run in parallel, so add process id.
  • we may want to aggregate runs on different hosts, so add machine name.
  • Can this be parameterized, so someone can set an environment variable to specify the format, something like {processName}-YYYY-mm-DD-HH_MM_SS-[pid]-host.ext where ext can be anything you want? (logs should remain .log, to distinguish them from text, docs, and lists, but allowing an override is still a good idea.)
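
The naming scheme from these notes could be sketched as follows. The environment variable names VMB_LOG_NAME_FORMAT and VMB_LOG_EXT are hypothetical, as is the function name log_file_name; the default template follows the {processName}-YYYY-mm-DD-HH_MM_SS-[pid]-host.ext pattern suggested above.

```python
import os
import socket
from datetime import datetime

DEFAULT_TEMPLATE = "{process}-{timestamp:%Y-%m-%d-%H_%M_%S}-{pid}-{host}.{ext}"


def log_file_name(process, now=None, env=None):
    """Build a log file name; the template and extension can be overridden via env."""
    env = os.environ if env is None else env
    template = env.get("VMB_LOG_NAME_FORMAT", DEFAULT_TEMPLATE)  # hypothetical variable
    return template.format(
        process=process,
        timestamp=now or datetime.now(),
        pid=os.getpid(),              # distinguishes parallel runs
        host=socket.gethostname(),    # distinguishes runs on different machines
        ext=env.get("VMB_LOG_EXT", "log"),  # hypothetical variable
    )
```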

Volume Manifest Builder to use local file system

v-m-b gets its images for processing from S3. We would like to use the local file system for several reasons:

  • we can generate manifest.jsons during the sync process, instead of in a post-processing step
  • we eliminate the cost of calling S3's GetObject for each image
  • it should be faster

Additional error log output during manifest creation

Encountered an error when running manifestForWork on W28882 where the image group folder (W28882-4784) didn't match the image group record in the database (W28882-I4784). The only error generated in the log was:

local_v_m_b-ERROR: No manifest created for W28882-I4784

It would be great if, in addition to this output, v_m_b could also specify that the image group folder could not be found, or that there's a mismatch (in this case between W28882-4784 and W28882-I4784).

buildManifestForWork still has existDB dependencies

existDB was offline this morning and broke buildManifestForWork:

ERROR - Could not find image groups for /mnt/AO-staging-Incoming/FPL/W1FPL6263
Some builds failed. See log file /home/service/tmp/syncWorkFiles/buildManifestForWorkLog/2022-05-27_10_57_1881.local_v_m_b.log
CRITICAL - Exception: Some builds failed. See log file /home/service/tmp/syncWorkFiles/buildManifestForWorkLog/2022-05-27_10_57_1881.local_v_m_b.log

traceback:
  File "/home/service/.local/bin/manifestforwork", line 8, in <module>
    sys.exit(manifestShell())
  File "/home/service/.local/lib/python3.7/site-packages/v_m_b/manifestBuilder.py", line 83, in manifestShell
    raise Exception(error_string)

We need to remove the dependencies on existDB and fully migrate the code over to BUDA, as existDB will be going offline next month.

Trigger warning notification for volumes with no images

Currently v-m-b generates an error message when no manifest is created for an empty volume. This should trigger a warning notification. Ideally, if multiple volumes in a work have no images, these should be aggregated into a single notification.
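
The aggregation could be sketched as below: collect the empty volumes for a work, then build one message. The name empty_volume_warning and the message wording are hypothetical, and how the warning is actually sent is left out.

```python
def empty_volume_warning(work_rid, images_by_volume):
    """Return one warning covering every empty volume of a work, or None if there are none."""
    empty = sorted(volume for volume, images in images_by_volume.items()
                   if not images)
    if not empty:
        return None
    return f"{work_rid}: no images in {len(empty)} volume(s): {', '.join(empty)}"
```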
