feed2html

Generate static HTML sites from feed formats such as OAI-PMH and RSS, stored using specifications such as OCFL and BagIt. Part of a suite of static HTML repository tools.

Introduction

This project was inspired by Professor Hussein Suleman (University of Cape Town), who gave a rousing closing keynote at Open Repositories 2023 about what really makes an open access repository accessible, and who called for less complexity in digital repository/library development.

Rather than re-invent a whole repository platform, this tool is one step in that direction: using existing protocols that have served us well for years (OAI-PMH and RSS), we can harvest existing repositories on the web, no matter how complex they happen to be, and produce our own simple, file-based repositories using standards and simple document formats like HTML.

OR2024 Static Site Workshop notes

If you came here from the OR2024 closing presentation, you might want to check out the 'mets' spider instead of the oai_dc one. It harvests an OAI-PMH feed in METS format with MODS metadata, downloads linked files, and writes Markdown documents with front matter instead of using XSLT to try to produce all the HTML itself. This lets us feed static site generators, which will generally do a better job of producing the HTML.
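As a rough illustration of the "Markdown with front matter" output described above, here is a minimal sketch using only the standard library. It is not the 'mets' spider's actual code: the function name `to_markdown`, the field names, and the sample record are all hypothetical. It relies on the fact that JSON strings and arrays are also valid YAML flow values, so `json.dumps` can handle the quoting.

```python
import json


def to_markdown(metadata: dict, body: str) -> str:
    """Render a metadata dict and body text as Markdown with YAML front matter."""
    lines = ["---"]
    for key, value in metadata.items():
        # JSON scalars and arrays are valid YAML flow values, so json.dumps
        # gives safe quoting (colons, quotes) without a YAML dependency.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    lines.append("")
    lines.append(body.strip())
    return "\n".join(lines) + "\n"


# Hypothetical record, standing in for fields harvested from a feed.
record = {
    "title": "An example thesis: part one",
    "creators": ["Doe, Jane"],
    "date": "2023-06-15",
}
page = to_markdown(record, "Abstract text goes here.")
print(page)
```

A static site generator that reads YAML front matter should then be able to pick up `title`, `creators` and `date` as page variables, assuming the keys are plain strings.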

See the slides and pitches from this presentation at [TODO: Put link here once slides read-only]

The rest of the instructions here should still apply, but you should also read through the initial settings, such as the base file paths for storage, not just the OCFL paths.

Quick start

Right now, the tool is only tested on DSpace OAI-PMH feeds using the oai_dc (simple Dublin Core elements) metadata format.

  1. Make sure you have Python 3 installed. I also recommend pyenv for virtualenv and version management.
  2. Clone this repository with git clone https://github.com/kshepherd/feed2html.git
  3. Optional: Create or activate a virtualenv with the standard tools or pyenv
  4. Install requirements with pip install -r requirements.txt
  5. Identify the start URL for your DSpace ListRecords OAI verb, e.g. https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc
  6. Give the output/oaidc2html.xsl stylesheet a quick check to make sure it is transforming the fields you're interested in
  7. Set up a base directory for your OCFL repository and css files and note the full path
  8. Copy or symlink output/css to this base directory
  9. Begin a crawl! Let's go with that example URL and a base dir of /tmp/site
scrapy crawl oaipmh_dc_xml \
   -a url="https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc" \
   -a website_title="Test" \
   -a website_subtitle="open access research" \
   -a path_to_assets="/tmp/site" \
   -a path_to_ocfl="/tmp/site/repository" \
   -L INFO

To test just the first page of the OAI results, uncomment CLOSESPIDER_ITEMCOUNT in feed2html/spiders/oaipmh_dc_xml.py

Customising

  1. Take a look at the parse_record method in feed2html/spiders/oaipmh_dc_xml.py to see how the simple item objects are constructed. This can be extended the same way as any other Scrapy XML feed spider.
  2. If the spider does not properly follow resumption tokens (to get the next page), run the crawl in debug mode with -L DEBUG and compare the expected XML with the token extraction in parse_node
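When debugging either step, it can help to reproduce the parsing outside Scrapy. The sketch below is not the project's parse_record or parse_node code. It is a standalone approximation using the standard library, with a made-up sample response and a hypothetical `parse_page` helper. It pulls a few Dublin Core fields from each record and builds the follow-up request from the resumption token. Note that in OAI-PMH the resumptionToken argument is exclusive, so the follow-up URL carries no metadataPrefix.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH, oai_dc and Dublin Core element namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response, trimmed to the elements this sketch reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>oai_dc////100</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_page(xml_text: str, base_url: str):
    """Return (records, next_url); next_url is None on the last page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            # Dublin Core elements are repeatable, so keep every value.
            "title": [(e.text or "").strip()
                      for e in record.iterfind(".//dc:title", NS)],
            "creator": [(e.text or "").strip()
                        for e in record.iterfind(".//dc:creator", NS)],
        })
    # A missing or empty resumptionToken means there are no more pages.
    token = root.findtext(
        ".//oai:resumptionToken", default="", namespaces=NS).strip()
    next_url = None
    if token:
        query = urlencode({"verb": "ListRecords", "resumptionToken": token})
        next_url = f"{base_url}?{query}"
    return records, next_url


records, next_url = parse_page(SAMPLE, "https://example.org/oai/request")
print(records)
print(next_url)
```

If the token you see in the -L DEBUG output differs from what a script like this extracts from the same XML, the namespace handling or the XPath is the likely culprit.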

Tools and methodologies

Python 3 is a popular, accessible language and is widely used by researchers, librarians and other open access practitioners.

Scrapy is a well-supported, extensible Python framework that scrapes web resources and processes the results through pipelines. It allows a lot of customisation while leaving the low-level HTTP and document parsing work to an existing framework, which has its own open source community and can easily be extended for more advanced solutions.

The Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.
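To make the layout more concrete, here is a simplified sketch of what a single-version OCFL 1.0 object looks like on disk, written with the standard library only. It is not this project's pipeline code and has not been run through an OCFL validator. The function name, object identifier, and file contents are hypothetical, and it skips the storage root's own declaration file and any storage layout extension.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_minimal_ocfl_object(object_root: Path, object_id: str, files: dict) -> dict:
    """Write a one-version object; files maps logical path -> bytes."""
    content_dir = object_root / "v1" / "content"
    content_dir.mkdir(parents=True)
    manifest, state = {}, {}
    for logical_path, data in files.items():
        target = content_dir / logical_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        digest = hashlib.sha512(data).hexdigest()
        # manifest: digest -> paths relative to the object root
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        # state: digest -> logical paths within this version
        state.setdefault(digest, []).append(logical_path)
    inventory = {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(timespec="seconds"),
                "message": "Initial harvest",
                "state": state,
            }
        },
    }
    inventory_bytes = json.dumps(inventory, indent=2).encode("utf-8")
    sidecar = f"{hashlib.sha512(inventory_bytes).hexdigest()} inventory.json\n"
    # Object conformance declaration (a "Namaste" file).
    (object_root / "0=ocfl_object_1.0").write_text("ocfl_object_1.0\n")
    # The inventory and its digest sidecar go in the object root,
    # with a copy in the version directory.
    for directory in (object_root, object_root / "v1"):
        (directory / "inventory.json").write_bytes(inventory_bytes)
        (directory / "inventory.json.sha512").write_text(sidecar)
    return inventory


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "repository" / "item-1"
    inv = write_minimal_ocfl_object(
        root, "oai:example.org:123/1", {"index.html": b"<h1>Example</h1>"})
    created = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    print("\n".join(created))
```

Because everything is plain files plus a JSON inventory, the object can be inspected, copied or served without any repository software, which is the property this tool is after.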

Extensible Stylesheet Language Transformations (XSLT) is an XML-based language used, in conjunction with specialized processing software, for the transformation of XML documents. It has long been popular in the library sector.

Goals

  • Turn feeds (OAI, RSS/Atom, ActivityPub) into complete static websites
    • "Put DSpace on a CD-ROM"
  • Start with the most basic requirements: OAI-PMH with Dublin Core elements and terms, and RSS 2.0 for blogs and podcasts (RDF, JSON-LD, and other formats and protocols come later)

TODO

  1. Spider
    1. Build OCFL layer
    2. Read XML feed with resumption tokens
  2. Pipelines:
    1. Initialize OCFL repository on disk
    2. Create BagIt fs structure (simpler alternative to OCFL)
    3. Search OA services (Unpaywall etc.) for OA links
    4. Transform to HTML with XSLT
    5. Create OCFL object and version and add to repository
    6. Send to search index (Solr, ES, ZincSearch?)
  3. Documentation
    1. Complete pydoc coverage
    2. Installation and usage instructions
    3. Complete this README with thorough explanation of the spider and pipelines, and advanced usage instructions
  4. Release
    1. Create requirements.txt and INSTALL.md
    2. Create LICENSE.txt for BSD 3-Clause license.
    3. Release to PyPI (or figure out the best way to package and distribute releases) once the project is beyond prototype

Notes

See NOTES.md for informal notes, ideas, links, references.


feed2html's Issues

View generated site?

I successfully ran a crawl over a 150 item subset of our DSpace repository. This resulted in the OCFL objects being created in the designated output directory. However, I am not seeing any pages or an index.html. Is a full run over the entire repository required to generate site pages?
