Comments (2)
Automating content ingestion is something that I've been working on quite a bit, which you can see by browsing the various open issues and the code.
The Login link on the pages is to allow people to authenticate themselves and then be allowed to add new documents to the database. I made a video that explains the current process for adding documents.
While this is fine for onesy-twosy style adding of documents, it's not ideal for adding tens, hundreds or thousands of documents. The "URL Wizard" page analyzes the document URL and tries to fill out as many fields as possible based on the URL. Due to web server timeouts, it's not feasible to fetch the URL and analyze its content for additional information that might be embedded in the PDF.
If the following conventions are followed in the URLs of documents, then manx has logic in the URL Wizard to determine the most important metadata from the URL:
- part number of the document
- title of the document
- publication date of the document
The convention is as follows:
<partno>_<title>_<date>.pdf
Where:
- <partno> is a sequence of digits, dashes and letters, but no underscores or spaces.
- <title> is the title of the document with _ or spaces used to separate words in the title.
- <date> is a 2 or 4 digit year, a month-year combination such as Dec1989, Dec89 or December89, or a full date in the form MM-DD-YYYY.
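The convention above can be sketched as a small parser. This is a minimal illustration in Python, not manx's actual URL Wizard code; the function name `parse_document_url`, the two regular expressions, and the example URL are all hypothetical.

```python
import re
from posixpath import basename
from urllib.parse import unquote, urlparse

# Assumed pattern for <date>: MM-DD-YYYY, a month name followed by a 2 or
# 4 digit year (Dec1989, Dec89, December89), or a bare 2 or 4 digit year.
DATE_RE = re.compile(
    r"^(?:\d{2}-\d{2}-\d{4}|[A-Za-z]{3,9}(?:\d{4}|\d{2})|\d{4}|\d{2})$"
)
# <partno>: digits, dashes and letters only (no underscores or spaces).
PARTNO_RE = re.compile(r"^[0-9A-Za-z-]+$")


def parse_document_url(url):
    """Split a URL of the form <partno>_<title>_<date>.pdf into its fields.

    Returns a dict, or None when the file name does not fit the convention.
    """
    name = unquote(basename(urlparse(url).path))
    if not name.lower().endswith(".pdf"):
        return None
    # Underscores or spaces separate the fields and the words of the title.
    parts = re.split(r"[_ ]+", name[:-4])
    if len(parts) < 3:
        return None
    partno, words, date = parts[0], parts[1:-1], parts[-1]
    if not PARTNO_RE.match(partno) or not DATE_RE.match(date):
        return None
    return {"partno": partno, "title": " ".join(words), "date": date}
```

For a made-up URL such as `http://example.com/dec/AA-1234B-TC_Users_Guide_Dec1989.pdf`, this returns part number `AA-1234B-TC`, title `Users Guide` and date `Dec1989`. A file name that does not fit the convention returns `None`, which is where a human would have to fill in the fields by hand.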
Some people think URLs with _s separating words are "ugly", but without clear word separators you can't distinguish between DEC link and DEClink and other such weird capitalization that computer companies have always used in document titles. Since manx can't enforce the structure of URLs, if underscores are not present it makes its best guess at identifying the start of words (such as a change in case) and presents that guess in the URL Wizard to the user for editing before adding the document to the database.
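To show why this can only ever be a guess, here is a minimal sketch of a case-change heuristic in Python. It is an assumption about how such a guess could work, not the logic manx actually uses, and `guess_words` is a hypothetical name.

```python
import re


def guess_words(run):
    """Guess word boundaries in text that has no underscores or spaces."""
    # Break between a lowercase letter or digit and a following capital,
    # e.g. "UsersGuide" -> "Users Guide".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", run)
    # Break a run of capitals before a capitalized word,
    # e.g. "PDPHandbook" -> "PDP Handbook".
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)
    return spaced.split()
```

With these two rules, `guess_words("VT100UsersGuide")` gives `['VT100', 'Users', 'Guide']`, but `guess_words("DEClink")` gives `['DE', 'Clink']`. No rule based on case alone can tell that apart from a correctly split name, which is why the result has to be shown to a person for editing.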
In general, spaces and ampersands in URLs may still cause things to mess up due to the need to escape them for proper rendering. I believe I have fixed all those cases in the existing code, but it's best to just avoid putting spaces and & in URLs in general, even if they are valid filenames on your hosting machine.
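For illustration, this is roughly what the escaping looks like when done with Python's standard `urllib.parse.quote`. The helper name `escape_path` and the sample file name are hypothetical, and this is not the code manx uses.

```python
from urllib.parse import quote


def escape_path(path):
    """Percent-encode a relative document path for use inside a URL.

    "/" is kept as the directory separator; spaces, "&" and "#" are
    replaced by their percent-encoded forms.
    """
    return quote(path, safe="/")


print(escape_path("dec/R&D notes #1.pdf"))
# prints: dec/R%26D%20notes%20%231.pdf
```

Every place that emits such a path into a link or a feed has to apply the same escaping exactly once, which is why it is simpler to keep these characters out of file names in the first place.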
For sites like bitsavers and the Vintage Technology Documentation Archive that publish an IndexByDate.txt file, there is support for fetching this file periodically by cron job and using it to prepopulate a staging area in the database with documents mentioned in the index but unknown to manx. Any site with such a file is assumed to publish documents organized by company into subdirectories, e.g. the directory dec for Digital Equipment Corporation. The actual directory names aren't assumed, but the association is built between directories and companies as documents are added, so that subsequent documents in the same directory are assumed to be for the same company.
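The staging step can be sketched roughly as follows. This is a Python illustration under loudly stated assumptions, not manx's code: it assumes each index line has the form `<date> <time> <relative path>` (the real IndexByDate.txt layout may differ), it uses the top-level directory as the company key, and the names `stage_unknown`, `known_paths` and `dir_to_company` are hypothetical.

```python
def stage_unknown(index_lines, known_paths, dir_to_company):
    """Return staging rows for indexed documents that are not yet known.

    ASSUMPTION: each index line is "<date> <time> <relative path>".
    ASSUMPTION: the first path component identifies the company directory.
    """
    staged = []
    for line in index_lines:
        fields = line.strip().split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        path = fields[2]
        if path in known_paths:
            continue  # already in the database, nothing to stage
        top_dir = path.split("/", 1)[0]
        # The company is None until a document from this directory has
        # been added and the association has been recorded.
        staged.append({"path": path, "company": dir_to_company.get(top_dir)})
    return staged


def record_added(path, company, dir_to_company):
    """Remember the directory-to-company association when a document is added."""
    dir_to_company[path.split("/", 1)[0]] = company
```

Once a single document under `dec/` has been added for Digital Equipment Corporation, `record_added` makes every later staged entry from that directory default to the same company.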
(The RSS feeds are generated from these IndexByDate.txt files by a perl script that I wrote. The admins of the hosting sites for bitsavers and VTDA run it locally via cron job to generate the RSS XML files.)
For logged in users, manx can present the entries from the IndexByDate for browsing and adding of individual documents. What is needed next is to improve the productivity for users to bulk add documents from the IndexByDate entries. This is on the current development roadmap. I tried some experiments with automatically adding documents, but there were too many errors for my taste in just assuming that the URLs contained all the correct data. So my current thinking is to present a table of metadata for documents based on the IndexByDate file and allow someone to preview that table of data, edit as necessary, and then submit. This would be a bulk submit of new entries to the database, maybe 10 at a time or all the files in a subdirectory or something like that. This would be more productive than doing those documents one at a time, but still provide for editorial oversight to catch mistakes.
The IndexByDate logic is generic and the same for the two sites that currently support this. If your documents are going into bitsavers, then nothing needs to be done to support them. (Again, it's best if you follow the above URL convention in order to get the most accurate extraction of part number, title and date for a document.) If you are going to be hosting your documents on your own machine, then support can be added for your site if you follow the same conventions. Adding custom logic for your site beyond that is possible (it's just software, after all), but not likely to be done by me at any time soon, if ever. It's worth investing in supporting the conventions of bitsavers due to the size of the archive.
Naturally, since this is an open source project, if you want to add custom support on your own and send a pull request, I'm happy to accept such contributions.
Well, that's enough for this incredibly long comment on a github issue :)
from manx.
Oh yeah, # in filenames is also an annoyance that has to be coded around.