As best that I can determine, Manx currently supports two mechanisms for adding conten

Add support for accepting regular content updates from non-Bitsavers sites

pbirkel commented on July 28, 2024

Add support for accepting regular content updates from non-Bitsavers sites

Comments (2)

LegalizeAdulthood commented on July 28, 2024

Automating content ingestion is something that I've been working on quite a bit, which you can see by browsing the various open issues and the code.

The Login link on each page lets people authenticate themselves so that they can add new documents to the database. I made a video that explains the current process for adding documents.

While this is fine for onesy-twosy style adding of documents, it's not ideal for adding tens, hundreds or thousands of documents. The "URL Wizard" page analyzes the document URL and tries to fill out as many fields as possible based on the URL alone. Due to web server timeouts, it's not feasible to fetch the URL and analyze its content for additional information that might be embedded in the PDF.

If the following conventions are followed in the URLs of documents, then manx has logic in the URL Wizard to determine the most important metadata from the URL:

part number of the document
title of the document
publication date of the document

The convention is as follows:

<partno>_<title>_<date>.pdf

Where:

<partno> is a sequence of digits, dashes and letters, but no underscores or spaces.
<title> is the title of the document with _ or spaces used to separate words in the title
<date> is a 2 or 4 digit year, or a month-year combination such as Dec1989, Dec89, December89, or a full date in the form MM-DD-YYYY
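To make the convention concrete, here is a rough Python sketch of how a filename following it could be split into part number, title and date. This is not manx's actual URL Wizard code; the names `DocumentGuess` and `parse_document_url` are hypothetical, and the rule that a part number must contain at least one digit is an assumption added here so that a plain title word is not mistaken for one.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit


class DocumentGuess(NamedTuple):
    """Hypothetical container for the metadata guessed from a URL."""
    part_number: Optional[str]
    title: str
    date: Optional[str]


# Date token: MM-DD-YYYY, month name or abbreviation plus a 2- or 4-digit
# year (Dec1989, Dec89, December89), or a bare 2- or 4-digit year.
_DATE_RE = re.compile(
    r"^(?:\d{2}-\d{2}-\d{4}"
    r"|(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:\d{4}|\d{2})"
    r"|\d{4}|\d{2})$",
    re.IGNORECASE,
)

# Part number: digits, dashes and letters only. Requiring at least one
# digit is an assumption of this sketch, not part of the stated convention.
_PARTNO_RE = re.compile(r"^(?=.*\d)[0-9A-Za-z-]+$")


def parse_document_url(url: str) -> DocumentGuess:
    """Guess <partno>_<title>_<date>.pdf fields from the last path segment."""
    path = unquote(urlsplit(url).path)
    filename = path.rsplit("/", 1)[-1]
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)
    # Words may be separated by underscores or spaces.
    tokens = [t for t in re.split(r"[_ ]+", stem) if t]

    part_number = None
    date = None
    # Always leave at least one token for the title.
    if len(tokens) > 1 and _DATE_RE.match(tokens[-1]):
        date = tokens.pop()
    if len(tokens) > 1 and _PARTNO_RE.match(tokens[0]):
        part_number = tokens.pop(0)
    return DocumentGuess(part_number, " ".join(tokens), date)


guess = parse_document_url(
    "http://example.com/dec/EK-VT100-UG-003_VT100_Users_Guide_Jun81.pdf"
)
print(guess)
# DocumentGuess(part_number='EK-VT100-UG-003', title='VT100 Users Guide', date='Jun81')
```

Even with the convention, some names stay ambiguous under this sketch: a title whose first word contains a digit (for example a model number) would be taken as the part number when no real part number is present, which is one reason a human edit step is still useful.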

Some people think URLs with _s separating words are "ugly", but without clear word separators you can't distinguish between DEC link and DEClink and other such weird capitalization that computer companies have always used in document titles. Since manx can't enforce the structure of URLs, if underscores are not present it makes its best guess at identifying the start of words (such as at a change in case) and presents that guess in the URL Wizard for the user to edit before adding the document to the database.
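The case-change heuristic can be illustrated with a short Python sketch. The function name `guess_words` and the exact splitting rules are assumptions made for illustration, not the logic manx actually uses.

```python
import re
from typing import List

# Break before a capital that follows a lowercase letter or digit, and
# before the last capital of an uppercase run that is followed by lowercase.
_WORD_BREAK = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def guess_words(run_together: str) -> List[str]:
    """Best-guess split of a name that has no underscores or spaces."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return _WORD_BREAK.split(run_together)


print(guess_words("VT100UsersGuide"))  # ['VT100', 'Users', 'Guide']
print(guess_words("DEClink"))          # ['DE', 'Clink']
```

The second result shows why the guess has to be editable: nothing in the characters themselves says whether DEClink is one product name or "DEC link", and this particular rule gets it wrong either way.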

In general, spaces and ampersands in URLs may still cause things to mess up due to the need to escape them for proper rendering. I believe I have fixed all those cases in the existing code, but it's best to just avoid putting spaces and & in URLs in general, even if they are valid filenames on your hosting machine.
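As an illustration of the escaping involved, the following Python sketch percent-encodes a document path for use in a URL and HTML-escapes the result for rendering in a page. The helper names `document_url` and `document_link` and the example host are hypothetical; only `urllib.parse.quote` and `html.escape` are standard library calls.

```python
from html import escape
from urllib.parse import quote


def document_url(base_url: str, relative_path: str) -> str:
    """Percent-encode a file path so spaces, & and # survive in a URL."""
    # quote() keeps "/" by default, so directory separators are preserved.
    return base_url.rstrip("/") + "/" + quote(relative_path)


def document_link(base_url: str, relative_path: str, title: str) -> str:
    """Render an HTML anchor, escaping both the href and the link text."""
    url = document_url(base_url, relative_path)
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(title))


print(document_url("http://example.com/pdf/", "dec/Tape & Disk #2.pdf"))
# http://example.com/pdf/dec/Tape%20%26%20Disk%20%232.pdf
print(document_link("http://example.com/pdf/", "dec/Tape & Disk #2.pdf", "Tape & Disk #2"))
# <a href="http://example.com/pdf/dec/Tape%20%26%20Disk%20%232.pdf">Tape &amp; Disk #2</a>
```

Two different escapes are in play here: percent-encoding for the URL itself and entity escaping for the HTML around it. Missing either one is the kind of bug that filenames with spaces and & tend to expose.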

For sites like bitsavers and the Vintage Technology Documentation Archive that publish an IndexByDate.txt file, there is support for fetching this file periodically by cron job and using it to prepopulate a staging area in the database with documents mentioned in the index but unknown to manx. Any site with such a file is assumed to publish documents organized by company into subdirectories, e.g. the directory dec for Digital Equipment Corporation. The actual directory names aren't assumed, but the association is built between directories and companies as documents are added, so that subsequent documents in the same directory are assumed to be for the same company.
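The staging step described above might look roughly like the Python sketch below. It assumes, without having confirmed it against the real files, that each IndexByDate.txt line has the form "date time path"; the names `StagedDocument` and `stage_unknown_documents` are hypothetical, and the in-memory set and dict stand in for database tables.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Set


class StagedDocument(NamedTuple):
    """Hypothetical staging-area row for a document not yet in the database."""
    path: str
    directory: str
    company: Optional[str]  # None until someone associates the directory


def stage_unknown_documents(
    index_lines: Iterable[str],
    known_paths: Set[str],
    directory_to_company: Dict[str, str],
) -> List[StagedDocument]:
    """Collect index entries that are not yet known, guessing the company."""
    staged = []
    for line in index_lines:
        # Assumed line layout: "<date> <time> <path>"; the path may itself
        # contain spaces, so split at most twice.
        fields = line.strip().split(" ", 2)
        if len(fields) != 3:
            continue
        path = fields[2]
        if path in known_paths:
            continue
        directory = path.split("/", 1)[0]
        staged.append(
            StagedDocument(path, directory, directory_to_company.get(directory))
        )
    return staged


lines = [
    "2024-07-01 10:00:00 dec/vt100/EK-VT100-UG-003_VT100_Users_Guide_Jun81.pdf",
    "2024-07-02 11:00:00 newco/Some_Manual_1985.pdf",
    "2024-07-03 12:00:00 dec/already_known.pdf",
]
for row in stage_unknown_documents(
    lines,
    known_paths={"dec/already_known.pdf"},
    directory_to_company={"dec": "Digital Equipment Corporation"},
):
    print(row.directory, "->", row.company)
```

In this sketch the directory-to-company map starts out partly empty and is expected to grow as documents are added, so that later entries in the same directory pick up the company automatically.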

(The RSS feeds are generated from these IndexByDate.txt files by a perl script that I wrote. The admins of the hosting sites for bitsavers and VTDA run it locally via cron job to generate the RSS XML files.)

For logged-in users, manx can present the entries from the IndexByDate for browsing and adding of individual documents. What is needed next is to make it more productive for users to bulk-add documents from the IndexByDate entries. This is on the current development roadmap. I tried some experiments with automatically adding documents, but simply assuming that the URLs contained all the correct data produced too many errors for my taste. So my current thinking is to present a table of metadata for documents based on the IndexByDate file and allow someone to preview that table of data, edit as necessary, and then submit. This would be a bulk submit of new entries to the database, maybe 10 at a time or all the files in a subdirectory or something like that. This would be more productive than doing those documents one at a time, but still provide for editorial oversight to catch mistakes.
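One way the batching idea could work is sketched below in Python: staged paths are grouped by subdirectory and then cut into review-sized chunks. This feature is described only as a plan, so everything here is an assumption, including the function name `review_batches` and the default batch size of 10.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def review_batches(paths: Iterable[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield lists of paths from one subdirectory, at most batch_size each."""
    by_directory: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_directory[posixpath.dirname(path)].append(path)
    for directory in sorted(by_directory):
        entries = by_directory[directory]
        for start in range(0, len(entries), batch_size):
            # Each batch would back one preview table for a human to edit.
            yield entries[start:start + batch_size]


staged = ["dec/vt100/doc{}.pdf".format(i) for i in range(12)] + ["ibm/a.pdf"]
print([len(batch) for batch in review_batches(staged)])  # [10, 2, 1]
```

Keeping each batch within a single subdirectory means the rows in one preview table usually share a company, which should make mistakes in the guessed metadata easier to spot.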

The IndexByDate logic is generic and the same for the two sites that currently support this. If your documents are going into bitsavers, then nothing needs to be done to support them. (Again, it's best if you follow the above URL convention in order to get the most accurate extraction of part number, title and date for a document.) If you are going to be hosting your documents on your own machine, then support can be added for your site if you follow the same conventions. Adding custom logic for your site beyond that is possible (it's just software, after all), but not likely to be done by me at any time soon, if ever. It's worth investing in supporting the conventions of bitsavers due to the size of the archive.

Naturally, since this is an open source project, if you want to add custom support on your own and send a pull request, I'm happy to accept such contributions.

Well, that's enough for this incredibly long comment on a GitHub issue :).

LegalizeAdulthood commented on July 28, 2024

Oh yeah, # in filenames is also an annoyance that has to be coded around.
