buda-base / buda-iiif-server
The BUDA image server based on the hymir iiif-server
License: MIT License
Some grayscale images come out buggy when the server applies a transformation. For instance
http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/0/default.jpg
is fine while
http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/90/default.jpg
is almost entirely white. Note that in the first case the original .jpg is served directly, so it doesn't go through the image-saving pipeline with turbojpeg. There may therefore be a bug in the way turbojpeg is handled. We should verify that, and if the bug is indeed in the turbojpeg plugin, report it here
The ArchiveController doesn't handle the FairUse accessType. Resources with this access type should only expose the first and last 20 pages (or images, for zip)
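A minimal sketch of the intended page gating (a hypothetical helper, not the actual ArchiveController code; assumes 1-based page indices):

```java
public class FairUsePolicy {
    // For FairUse resources, only the first 20 and last 20 pages are served.
    // pageIndex is 1-based.
    public static boolean isAccessible(int pageIndex, int totalPages) {
        return pageIndex <= 20 || pageIndex > totalPages - 20;
    }
}
```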
if we want to support inclusion of protected images through simple <img>
tags, we need to support passing the token in an alternate way, since in this case it cannot be passed in the header (see this question and this one). We could probably use cookies for this case. We don't have to support that now, but we might in the future, so it's important to be cautious about the assumptions we make, including in the bdrc-auth-lib library (which should support this kind of case)
In some cases (full-volume downloads), it may be relevant to use the ebook if it's present on S3. Not all of them are generated, but when they are, they seem to follow this URL convention:
s3://archive.tbrc.org/Works/{md5}/{workid}/eBooks/{workid}-{volumenumber}.pdf
with the volume number padded to 3 digits. Example:
s3://archive.tbrc.org/Works/60/W22084/eBooks/W22084-001.pdf
it's not really any kind of priority, but I thought I'd report this to raise awareness of their existence. The main difference from a normal PDF is that they contain bookmarks, a table of contents, a cover, and a copyright notice. The ones on S3 are quite old, so it may be a bad idea to use them; maybe the new ones would make more sense... (but they're not on S3 yet?).
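For illustration, the observed key convention can be built like this (a sketch; `hashDir` stands for the two-character directory from the example, an assumption about its meaning):

```java
public class EbookKey {
    // Builds the S3 key following the observed convention, with the volume
    // number zero-padded to 3 digits.
    public static String key(String hashDir, String workId, int volumeNumber) {
        return String.format("Works/%s/%s/eBooks/%s-%03d.pdf",
                hashDir, workId, workId, volumeNumber);
    }
}
```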
when opening http://library.bdrc.io/show/bdr:W12827, the png images won't load
see http://iiif.bdrc.io/bdr:V12827_I2061::020610003.tiff/full/max/0/default.png , http://iiif.bdrc.io/bdr:V12827_I2065::020650001.tiff/full/max/0/default.png , http://iiif.bdrc.io/bdr:V12827_I2068::020680001.tiff/full/max/0/default.png etc.
actually these seem to be tiff images renamed as png
When requesting a non-existing image like
http://iiif.bdrc.io/image/v2/bdr:V23703_I1521::152106isfjgoeirf.jpg/full/full/0/default.jpg
the server returns a 500 error instead of a 404. The spec is quite clear that it should be a 404.
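A minimal sketch of the intended mapping (not the actual server code): resolve the identifier first, and return 404 when it is unknown instead of letting the failure surface as a generic 500.

```java
public class IiifStatus {
    // Maps an image-lookup outcome to an HTTP status code. Per the IIIF
    // Image API, a request for an unknown identifier should get a 404.
    public static int statusFor(boolean identifierExists, boolean internalError) {
        if (!identifierExists) return 404;
        if (internalError) return 500;
        return 200;
    }
}
```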
I tried to download the first volume of W22084, the first time it took a very long time, but the second time it gave me a link to
http://iiif.bdrc.io/download/file/pdf/bdr:V22084_I0886:1-624
which is a 500 error
see pdf from http://library.bdrc.io/show/bdr:V22084_I0978
seems to be the same in any volume of http://library.bdrc.io/show/bdr:W22084 or http://library.bdrc.io/show/bdr:W12827
For bitonal images, webp seems like it could be a better option than png (faster encoding and similar size), but it doesn't seem to be handled by our server:
http://iiif.bdrc.io/bdr:V1PD127393_I1PD127464::I1PD1274640004.jpg/full/max/0/default.webp
returns a 415
The content-type header returned when accessing
http://iiif.bdrc.io/bdr:V23703_I1421::14210082.tif/info.json
is html, when it should be application/ld+json (see the spec)
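For reference, the IIIF Image API serves info.json as application/json by default and as application/ld+json when the client asks for it via the Accept header; a sketch of that negotiation (hypothetical helper, not existing server code):

```java
public class InfoJsonContentType {
    // Returns the Content-Type for info.json responses: application/ld+json
    // only when the client explicitly asks for it, application/json otherwise.
    public static String contentType(String acceptHeader) {
        if (acceptHeader != null && acceptHeader.contains("application/ld+json")) {
            return "application/ld+json";
        }
        return "application/json";
    }
}
```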
in this case:
http://iiif.bdrc.io/bdr:V1KG1418_I1KG1521::I1KG15210004.tif/full/,600/0/default.png
we have
adm:access bdr:AccessRestrictedInChina ;
adm:license bdr:LicenseCopyrighted ;
which should make the first 20 / last 20 images work (including the one linked above). But we have a 401 instead.
in some of the first volumes of bdr:W4CZ5369,
the full-size image and the resized image seem to have different red levels,
resulting in unsatisfying OpenSeadragon viewer behavior when zooming in/out, as the red level changes
below is a mix of both images, left is full size and right is 1250px wide:
We have basically two options:
The following URL gives an erroneous output:
https://iiif.bdrc.io/bdr:V22334_I3867::38670133.tif/full/,2000/0/default.png
while this one is ok:
https://iiif.bdrc.io/bdr:V22334_I3867::38670133.tif/full/max/0/default.png
With a view toward permanent identifiers for images, it seems reasonable that, if images are entities, an http request should return some RDF in some serialization... This RDF would be different from the info.json
(which is of type ImageService3
in iiif v3.0), but more similar to the information used in the Presentation API, which in JSON-LD is (in the example):
{
  "id": "https://example.org/iiif/book1/res/page1.jpg",
  "type": "Image",
  "label": {"en": ["Page 1"], "es": ["Página 1"]},
  "format": "image/jpeg",
  "service": [
    {
      "id": "https://example.org/images/book1-page1",
      "type": "ImageService3",
      "profile": "level2"
    }
  ],
  "height": 2000,
  "width": 1500
}
iiif server pdfs of Fair_use works always return 41 pages regardless of the user profile. In other words, an admin user has no way to access the entire work.
There should be a way to manage pdf/zip link expiration (typical use case: a user starts a pdf generation, leaves, and comes back a few hours later)
In order to serve the error images, we should be able to serve an image in
src/main/resources/static/abc.def
when requesting the id
http://iiif.bdrc.io/static::abc.def/
I've changed a few things in Geolocation in presentation (see this commit); I think a similar set of changes could be applied here, no urgency though
I'm a bit surprised, as I can't seem to find the code that handles the caching of S3 images... are the images cached at all in the normal context (i.e., not when building an archive)? If not, doing so should make a significant difference, especially since the viewer requests the same image several times in different sizes (thumbnail, big thumbnail and full)
Write a custom implementation of enrichInfo method (info.json building)
I have a perspective in mind, I'm not sure if it's reasonable or doable (or in contradiction with the iiif spec), but basically it's the following: having permanent identifiers for images. This implies that an image is not dependent on the iiif version, and we need an iiif-server-independent way to refer to them, and thus we need to completely control the URLs of the images in the image server. Hence the request. It's quite obvious for annotations: when someone annotates an image provided in iiif v2, we don't want their annotation to stop working when we move to iiif v3.
When on the download page (ex: here) and clicking on the link, the file saved by my browsers (both Firefox and Chrome) has a double .pdf extension (it's named bdr_V1PD96945_I1PD96947FAIR_USE_1-662.pdf.pdf); there is certainly a way to fix that... Also, I'm not sure it has much impact on most users' experience, but the <a> could have a type="application/pdf" attribute
mozjpeg delivers jpgs that are quite a bit smaller than regular jpgs (it optimizes encoding, especially for high contrast). It should be a drop-in replacement for libjpeg-turbo (the lib has the same name), so it's theoretically just a matter of installing mozjpeg instead of libjpeg-turbo... There is a script to do that under Debian that we could use in buda-base:
https://gist.github.com/Kelfitas/f3fb99984698ccd79414c6a29e9f4edd
but I don't think we need to do that now; we're not reencoding jpegs a lot (we're mostly serving them as-is), so we won't see many benefits...
Some metadata should be added to the exported PDF, such as:
the reference to iText should be removed
The readme should be updated with the new instructions to run the server locally
In some cases tiffs are grayscale but should be B&W; I think the server could do the conversion automatically. Example:
http://iiif.bdrc.io/bdr:V00JW501203_I1CZ2552::I1CZ25520075.tif/info.json
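A sketch of how the conversion could be done with plain java.awt (an assumption about the approach, not existing server code): drawing a grayscale BufferedImage into a TYPE_BYTE_BINARY image reduces it to 1-bit during the draw.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class Binarize {
    // Converts a grayscale image to bitonal (1-bit) by redrawing it into a
    // TYPE_BYTE_BINARY image; java.awt thresholds/dithers during the draw.
    public static BufferedImage toBitonal(BufferedImage gray) {
        BufferedImage out = new BufferedImage(
                gray.getWidth(), gray.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = out.createGraphics();
        g.drawImage(gray, 0, 0, null);
        g.dispose();
        return out;
    }
}
```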
There should be a system to configure logback so that when executing the jar we can override properties for the path to the log files, the log level, etc.
For all the images that don't require authentication to be accessed (including the first and last 20 pages of a FairUse work), the cache control mechanism should advertise the image as public
(see rfc7234). I think images restricted in China could have that too...
Context: on S3, the tif corresponding to this image is below 30KB, but the output jpg on the iiif server is 461KB.
On the current website (using JAI), the corresponding image (here) is a png of about 30KB.
The png version is only 53KB, much more reasonable but still significantly larger than on the current website.
It seems hymir just uses the basic javax.imageio functions (see here) as provided by twelvemonkeys. The parameters available in JPEGImageWriteParam look very limited. There doesn't seem to be a much better option in Java though.
This is an important issue for various reasons:
Here are a few ideas to start dealing with the issue:
Once the auth is done, it should be indicated in the images which users can't access. It seems it's mostly a matter of adding the auth service to the info.json, see this example
it's very easy to get the number of TbrcIntroPages from the data (it's in volumeInfo). This gives a number X (often 0 or 2), and we should skip the first X pages in the PDF export.
In the same vein as #6, but less importantly, maybe it would be nice to change the identifier format a little so that it can be prefixed. For instance
<http://iiif.bdrc.io/image/v2/bdr:V00KG0545_I1KG20698::I1KG206980007.tif>
could be
@prefix bdi: <http://iiif.bdrc.io/image/v2/> .
bdi:bdr:V00KG0545_I1KG20698::I1KG206980007.tif
but although having ':'s in the local names seems to be allowed by the spec, I'm afraid this could confuse both poorly written libraries and humans, so maybe we should replace ':' with something else in the identifiers... maybe ','?
The experience feedback I have on the iiif system is that it's very slow, and I tend to agree with that. Some aspects are not really due to the server, but I think some definitely are. It would be good to log some timings at the debug log level in order to get information about bottlenecks. Basically, each big operation should produce a log entry with some size and timing information. A non-exhaustive list of operations:
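A minimal timing-wrapper sketch (hypothetical; a real version would write to the debug logger rather than to a StringBuilder):

```java
public class Timed {
    public interface Op<T> { T run(); }

    // Runs an operation and appends a debug-style line like
    // "S3 fetch took 87 ms" to the supplied log buffer.
    public static <T> T timed(String label, StringBuilder log, Op<T> op) {
        long start = System.nanoTime();
        T result = op.run();
        long ms = (System.nanoTime() - start) / 1_000_000;
        log.append(label).append(" took ").append(ms).append(" ms");
        return result;
    }
}
```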
We need to set up a password outside of the application.yml config file
re: buda-base/buda-base#7, we should be able to configure the aws credentials in a file in /etc/buda/iiifserv/
(it could be in a properties file)
There must be a Hymir version issue as this option should be available as of version 3.5.2
It would be helpful if hymir added some exif metadata to the served images (when the format allows it). Something rather simple like the download source would be good enough. Does hymir allow that? If not, we should open an issue about it.
the following URL gives a 404:
http://iiif.bdrc.io/bdr:V1NLM7_I1NLM7_001::I1NLM7_0010003.jpg/full/max/0/default.jpg
but the s3 key does exist:
s3://archive.tbrc.org/Works/ba/W1NLM7/images/W1NLM7-I1NLM7_001/I1NLM7_0010003.jpg
I just got another 500 error, on https://iiif-dev.bdrc.io/bdr:I4CZ75259::I4CZ752590001.tif/info.json , see:
this should be very high priority
in some Volumes of http://library.bdrc.io/show/bdr:I1PD96684
eg http://library.bdrc.io/show/bdr:V1PD96684_I1PD106654, http://library.bdrc.io/show/bdr:V1PD96684_I1PD106655,
http://library.bdrc.io/show/bdr:V1PD96684_I1PD106657
but not http://library.bdrc.io/show/bdr:V1PD96684_I1PD106656,
the manifest is fetched successfully, but there is a 500 error when fetching the first image: http://iiif.bdrc.io/image/v2/bdr:V1PD96684_I1PD106657::I1PD1066530003.jpg/full/full/0/default.jpg
cannot access the volume listing page for bdr:W22084 (link)
can access the volume listing page for bdr:W12827 (link), but then not the individual volume pdf download pages, e.g. Volume 9 (link)
can access an individual volume pdf download page, e.g. Volume 10 (link), only after the corresponding zip volume has been requested (link)
The vagrant provisioning of buda-base cannot build the iiifserv package because it cannot find the webp-imageio dependency (which should be installed... I don't really understand why it cannot be found). There seems to be a maven package here:
https://mvnrepository.com/artifact/org.sejda.webp-imageio/webp-imageio-sejda
maybe we could use it? That would make things simpler...
clicking on generated download links leads to a 500 error:
http://iiif.bdrc.io/download/pdf/wi:bdr:W22704::bdr:I22704
http://iiif.bdrc.io/download/zip/wi:bdr:W22704::bdr:I22704
for example:
http://iiif.bdrc.io/download/zip/v:bdr:V22704_I3252::1-562
http://iiif.bdrc.io/download/pdf/v:bdr:V22704_I3265::1-864
I don't know if it's relevant in all hymir use cases, but it definitely is in ours: we want to instruct web browsers to cache images and info.json files. Currently there are no http cache instructions; there should be at least a configurable max-age. In our case it can be very large, as the images more or less never change.
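A sketch of what the header value could look like (hypothetical helper; "public" for images that need no auth, with a configurable max-age):

```java
public class CacheHeaders {
    // Builds a Cache-Control value per RFC 7234: "public" for freely
    // accessible images, "private" otherwise, plus a configurable max-age.
    public static String cacheControl(boolean isPublic, long maxAgeSeconds) {
        return (isPublic ? "public" : "private") + ", max-age=" + maxAgeSeconds;
    }
}
```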