Comments (3)
When run with higher verbosity (-v 1 --output-type pdfa), ocrmypdf logs the following:
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata. _metadata.py:62
The following metadata fields were not copied: {'{http://purl.org/dc/elements/1.1/}created', '{http://purl.org/dc/elements/1.1/}contributor', '{http://ns.adobe.com/xap/1.0/}MetadataDate', '{http://purl.org/dc/elements/1.1/}subject'} _metadata.py:67
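The field names in that log are written in "Clark notation" ({namespace-uri}localname), which is how Python's XML libraries print qualified names. A minimal sketch for turning them back into the prefixed form used elsewhere in this thread follows; the helper name clark_to_prefixed and the prefix table are illustrative assumptions, not part of ocrmypdf or pikepdf.

```python
import re

# Namespace URIs seen in this thread, mapped to their customary prefixes.
# This table is an illustrative assumption; extend it as needed.
NS_PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://ns.adobe.com/xap/1.0/": "xmp",
    "http://ns.adobe.com/pdf/1.3/": "pdf",
}

# Matches "{uri}local" and captures both parts.
CLARK = re.compile(r"^\{(?P<uri>[^}]+)\}(?P<local>.+)$")


def clark_to_prefixed(name: str) -> str:
    """Convert '{uri}local' to 'prefix:local'; return the input unchanged if unknown."""
    match = CLARK.match(name)
    if match is None:
        return name
    prefix = NS_PREFIXES.get(match.group("uri"))
    if prefix is None:
        return name
    return f"{prefix}:{match.group('local')}"


if __name__ == "__main__":
    for field in sorted({
        "{http://purl.org/dc/elements/1.1/}contributor",
        "{http://ns.adobe.com/xap/1.0/}MetadataDate",
    }):
        print(clark_to_prefixed(field))
```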
It looks like you used pikepdf to set dc:contributor, and set it to a "singleton" text string. pikepdf does not block you from setting metadata to a type that is not consistent with the XML schema, unfortunately. exiftool displays it even though it's the wrong type as a best-effort fallback I suppose.
Depending on what you're doing, you could also use libexempi3 (python-xmp-toolkit), which is a more comprehensive implementation of the XMP spec but also very difficult to use in my experience. (When it comes down to it, XMP is a ridiculously overengineered spec, so there's only so much one can wrangle its complexity into a clean interface.) There are some complex XMP data structures that pikepdf cannot generate.
dc:contributor's type is rdf:Bag - that is, an unordered list/set - since there are potentially multiple contributors to a work and no priority is assumed. If you assign it a set, pikepdf generates an rdf:Bag and the resulting metadata is correct.
In [5]: with p.open_metadata() as m:
   ...:     del m['dc:contributor']
   ...:     m['dc:contributor'] = {'Contributor One', 'Contributor Two'}
   ...:
In [6]: p.save('issuepdf/1220.fixed.pdf')
It's actually Ghostscript that silently strips out incorrect metadata when it is run. Then OCRmyPDF reports that some metadata was missing.
Using the procedure above, you can determine the appropriate types for the other metadata fields of interest and fix them.
Since OCRmyPDF warns about removal of metadata, there's nothing to fix in its codebase. I could see adding an enhancement to pikepdf to warn about assigning wrong types for the most important metadata fields (Dublin Core, mainly).
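As a rough illustration of what such a warning could look like, here is a minimal standalone sketch. It is not pikepdf code: the table DC_CONTAINERS and the functions container_for_value and check_assignment are hypothetical names. The container types are given as I recall them from the XMP specification and should be checked against it. The set-to-rdf:Bag mapping is shown in this thread; the list-to-rdf:Seq mapping is an assumption about pikepdf's behavior.

```python
from typing import Optional

# Expected XMP container per Dublin Core property.
# Assumption: recalled from the XMP specification, Part 1; verify before relying on it.
DC_CONTAINERS = {
    "dc:contributor": "Bag",   # unordered array
    "dc:subject": "Bag",
    "dc:publisher": "Bag",
    "dc:language": "Bag",
    "dc:creator": "Seq",       # ordered array
    "dc:date": "Seq",
    "dc:title": "Alt",         # language alternatives
    "dc:description": "Alt",
    "dc:rights": "Alt",
}


def container_for_value(value: object) -> str:
    """Container a Python value is assumed to map to when assigned via pikepdf."""
    if isinstance(value, (set, frozenset)):
        return "Bag"
    if isinstance(value, list):
        return "Seq"
    if isinstance(value, str):
        return "text"
    raise TypeError(f"unsupported metadata value type: {type(value).__name__}")


def check_assignment(key: str, value: object) -> Optional[str]:
    """Return a warning string if value looks like the wrong type for key, else None."""
    expected = DC_CONTAINERS.get(key)
    if expected is None:
        return None  # unknown property: nothing to check
    actual = container_for_value(value)
    # A plain string for a language-alternative field is accepted here because
    # pikepdf wraps it in rdf:Alt (as seen for dc:title in this thread).
    if expected == "Alt" and actual == "text":
        return None
    if actual != expected:
        return f"{key} expects rdf:{expected}, but the value would be written as {actual}"
    return None


if __name__ == "__main__":
    print(check_assignment("dc:contributor", "Single Name"))
    print(check_assignment("dc:contributor", {"Contributor One", "Contributor Two"}))
```

Under these assumptions, assigning a bare string to dc:contributor or dc:subject produces a warning, while assigning a set does not.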
from ocrmypdf.
Thanks a lot for the detailed answer! And yeah, I agree, XMP seems to be one of these overengineered XML specs. 😄
Just tried it again with this pikepdf snippet to create the metadata:
import sys
from datetime import datetime

import pikepdf
from pikepdf.models.metadata import encode_pdf_date

d = encode_pdf_date(datetime(year=2023, month=12, day=25))
pdf = pikepdf.open(sys.argv[1])
with pdf.open_metadata() as meta:
    meta['dc:contributor'] = {"Test Contributor"}
    meta['dc:title'] = "Title"
    meta['dc:created'] = d
pdf.save(sys.argv[2])
The metadata generated by pikepdf looks ok:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="pikepdf">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""><dc:contributor xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Bag><rdf:li>Test Contributor</rdf:li></rdf:Bag></dc:contributor></rdf:Description><rdf:Description rdf:about=""><dc:title xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Alt><rdf:li xml:lang="x-default">Title</rdf:li></rdf:Alt></dc:title></rdf:Description><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="" dc:created="D:20231225000000"/><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:MetadataDate="2024-01-18T06:31:13.412391+00:00"/><rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="" pdf:Producer="pikepdf 8.10.1"/></rdf:RDF>
</x:xmpmeta>
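To check a packet like this without reading the XML by eye, a small standard-library sketch can report which container each property uses. The function name describe_properties and the shortened SAMPLE packet are illustrative assumptions; the sample keeps only two of the properties from the output above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"


def describe_properties(xmp: str) -> dict:
    """Map each property (Clark name) to 'Bag', 'Seq', 'Alt', or 'text'."""
    root = ET.fromstring(xmp)
    result = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Properties written as XML attributes are simple text values.
        for attr in desc.attrib:
            if not attr.startswith(f"{{{RDF}}}"):
                result[attr] = "text"
        # Properties written as child elements may wrap an RDF container.
        for prop in desc:
            kind = "text"
            for container in ("Bag", "Seq", "Alt"):
                if prop.find(f"{{{RDF}}}{container}") is not None:
                    kind = container
            result[prop.tag] = kind
    return result


# Shortened copy of the packet above (two properties only).
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""><dc:contributor xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Bag><rdf:li>Test Contributor</rdf:li></rdf:Bag></dc:contributor></rdf:Description>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="" dc:created="D:20231225000000"/>
</rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    for name, kind in describe_properties(SAMPLE).items():
        print(name, kind)
```

On this sample, dc:contributor is reported as a Bag, while dc:created is a plain text attribute holding a PDF-style date string.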
Still, dc:contributor and dc:created get dropped:
The following metadata fields were not copied: _metadata.py:67
{'{http://purl.org/dc/elements/1.1/}contributor',
'{http://purl.org/dc/elements/1.1/}created', '{http://ns.adobe.com/xap/1.0/}MetadataDate'}
I'd also rather use dc:subject instead of dc:title, but it also gets dropped. 😦