LegalizeAdulthood commented on July 28, 2024

Automating content ingestion is something that I've been working on quite a bit, which you can see by browsing the various open issues and the code.

The Login link on the pages is to allow people to authenticate themselves and then be allowed to add new documents to the database. I made a video that explains the current process for adding documents.

While this is fine for onesy-twosy style adding of documents, it's not ideal for adding tens, hundreds or thousands of documents. The "URL Wizard" page analyzes the document URL and tries to fill out as many fields as possible based on the URL. Due to web server timeouts, it's not feasible to fetch the URL and analyze its content for additional information that might be embedded in the PDF.

If the following conventions are followed in the URLs of documents, then manx has logic in the URL Wizard to determine the most important metadata from the URL:

  • part number of the document
  • title of the document
  • publication date of the document

The convention is as follows:

<partno>_<title>_<date>.pdf

Where:

  • <partno> is a sequence of digits, dashes and letters, but no underscores or spaces.
  • <title> is the title of the document with _ or spaces used to separate words in the title
  • <date> is a 2 or 4 digit year, or a month-year combination such as Dec1989, Dec89, December89, or a full date in the form MM-DD-YYYY
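
As a rough illustration of the convention above, here is a minimal Python sketch of how such a filename could be split into its three fields. It is not taken from manx; the names `DocMeta`, `DATE_RE` and `parse_filename` are hypothetical, the example filename is made up, and the rule that a part number must contain at least one digit is an assumption added so that a leading title word is not mistaken for a part number.

```python
import re
from typing import NamedTuple, Optional


class DocMeta(NamedTuple):
    part_number: Optional[str]
    title: str
    date: Optional[str]


# Date forms named in the convention: MM-DD-YYYY, a month-year such as
# Dec1989 / Dec89 / December89, or a bare 4 or 2 digit year.
DATE_RE = re.compile(
    r"^(?:\d{2}-\d{2}-\d{4}|[A-Za-z]{3,9}(?:\d{4}|\d{2})|\d{4}|\d{2})$"
)
PARTNO_RE = re.compile(r"^[0-9A-Za-z-]+$")


def parse_filename(filename: str) -> DocMeta:
    """Split '<partno>_<title>_<date>.pdf' into its fields (best effort)."""
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)
    parts = [p for p in re.split(r"[_ ]+", stem) if p]

    # The date, if present, is the last field.
    date = None
    if len(parts) > 1 and DATE_RE.match(parts[-1]):
        date = parts.pop()

    # The part number, if present, is the first field.
    # Assumption: it contains at least one digit.
    part_number = None
    if (len(parts) > 1 and PARTNO_RE.match(parts[0])
            and any(c.isdigit() for c in parts[0])):
        part_number = parts.pop(0)

    return DocMeta(part_number, " ".join(parts), date)


meta = parse_filename("AA-1234B-TC_Widget_Users_Guide_Dec1989.pdf")
print(meta.part_number, "|", meta.title, "|", meta.date)
```

A filename that omits the part number or the date still yields a title, and the missing fields come back as `None` for a human to fill in.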

Some people think URLs with _s separating words are "ugly", but without clear word separators you can't distinguish between DEC link and DEClink and other such weird capitalization that computer companies have always used in document titles. Since manx can't enforce the structure of URLs, if underscores are not present it makes its best guess at identifying the start of words (such as a change in case) and presents that guess in the URL Wizard for the user to edit before adding the document to the database.
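
To make the "change in case" heuristic concrete, here is a small Python sketch. It is only an assumption about what such a guess could look like, not manx's actual logic, and `guess_words` is a hypothetical name. It also shows why explicit separators are better: the heuristic has no way to tell DEClink from DEC link.

```python
import re


def guess_words(title: str) -> str:
    """Best-effort split of a title into words."""
    # Explicit separators win: underscores or spaces mark the word breaks.
    if "_" in title or " " in title:
        return " ".join(p for p in re.split(r"[_ ]+", title) if p)
    # Otherwise guess: start a new word wherever a lowercase letter
    # or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", title)


print(guess_words("UsersGuide"))  # case change gives a usable guess
print(guess_words("DEC_link"))    # underscore makes the intent explicit
print(guess_words("DEClink"))     # no lower-to-upper change, left as-is
```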

In general, spaces and ampersands in URLs may still cause things to mess up due to the need to escape them for proper rendering. I believe I have fixed all those cases in the existing code, but it's best to just avoid putting spaces and & in URLs in general, even if they are valid filenames on your hosting machine.
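
For anyone wondering what the escaping involves, here is a minimal Python sketch using only the standard library. The helper name `safe_link` and the example URL are hypothetical and this is not manx's code. The point is that there are two separate layers: the path must be percent-encoded for the URL, and the result must be HTML-escaped when it is rendered on a page.

```python
from html import escape
from urllib.parse import quote


def safe_link(base_url: str, path: str, text: str) -> str:
    """Build an HTML anchor for a document whose path may contain
    spaces, '&' or '#'."""
    # Layer 1: percent-encode the path ('/' is left alone by default).
    href = base_url.rstrip("/") + "/" + quote(path)
    # Layer 2: HTML-escape both the attribute value and the link text.
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(text))


print(safe_link("http://example.com/pdf",
                "dec/Users Guide & Notes.pdf",
                "Users Guide & Notes"))
```

Forgetting either layer produces a broken link or broken markup, which is why avoiding these characters in the first place is the simpler option.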

For sites like bitsavers and the Vintage Technology Documentation Archive that publish an IndexByDate.txt file, there is support for fetching this file periodically by cron job and using it to prepopulate a staging area in the database with documents mentioned in the index but unknown to manx. Any site with such a file is assumed to publish documents organized by company into subdirectories, e.g. the directory dec for Digital Equipment Corporation. The actual directory names aren't assumed, but the association is built between directories and companies as documents are added, so that subsequent documents in the same directory are assumed to be for the same company.
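
The directory-to-company association can be pictured with a short Python sketch. This is an illustration under assumptions, not the real implementation: the class name `CompanyDirectoryMap` is hypothetical, it keys on the first path component only, and it keeps the mapping in memory rather than in the database.

```python
from typing import Dict, Optional


class CompanyDirectoryMap:
    """Remembers which company each top-level directory belongs to."""

    def __init__(self) -> None:
        self._company_by_dir: Dict[str, str] = {}

    @staticmethod
    def _top_dir(path: str) -> str:
        # Assumption: the company directory is the first path component.
        return path.lstrip("/").split("/", 1)[0]

    def learn(self, path: str, company: str) -> None:
        """Record the association when a document is added."""
        self._company_by_dir[self._top_dir(path)] = company

    def guess(self, path: str) -> Optional[str]:
        """Suggest a company for a new document, or None if unknown."""
        return self._company_by_dir.get(self._top_dir(path))


dirs = CompanyDirectoryMap()
dirs.learn("dec/pdp11/handbook.pdf", "Digital Equipment Corporation")
print(dirs.guess("dec/vax/manual.pdf"))  # same directory, same company
print(dirs.guess("other/manual.pdf"))    # directory not seen yet
```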

(The RSS feeds are generated from these IndexByDate.txt files by a Perl script that I wrote. The admins of the hosting sites for bitsavers and VTDA run it locally via cron job to generate the RSS XML files.)

For logged in users, manx can present the entries from the IndexByDate for browsing and adding of individual documents. What is needed next is to improve the productivity for users to bulk add documents from the IndexByDate entries. This is on the current development roadmap. I tried some experiments with automatically adding documents, but there were too many errors for my taste in just assuming that the URLs contained all the correct data. So my current thinking is to present a table of metadata for documents based on the IndexByDate file and allow someone to preview that table of data, edit as necessary, and then submit. This would be a bulk submit of new entries to the database, maybe 10 at a time or all the files in a subdirectory or something like that. This would be more productive than doing those documents one at a time, but still provide for editorial oversight to catch mistakes.

The IndexByDate logic is generic and the same for the two sites that currently support this. If your documents are going into bitsavers, then nothing needs to be done to support them. (Again, it's best if you follow the above URL convention in order to get the most accurate extraction of part number, title and date for a document.) If you are going to be hosting your documents on your own machine, then support can be added for your site if you follow the same conventions. Adding custom logic for your site beyond that is possible (it's just software, after all), but not likely to be done by me at any time soon, if ever. It's worth investing in supporting the conventions of bitsavers due to the size of the archive.

Naturally, since this is an open source project, if you want to add custom support on your own and send a pull request, I'm happy to accept such contributions.

Well, that's enough for this incredibly long comment on a GitHub issue :).

LegalizeAdulthood commented on July 28, 2024

Oh yeah, # in filenames is also an annoyance that has to be coded around.
