kshepherd / feed2html

A scrapy project to generate browseable, searchable static HTML sites from XML feeds

License: BSD 3-Clause "New" or "Revised" License

Python 54.36% CSS 3.66% XSLT 41.97%

feed2html

Generate static HTML sites, stored using specifications like OCFL and BagIt, from feed formats such as OAI-PMH and RSS. Part of a suite of static HTML repository tools.

Introduction

This project was inspired by Professor Hussein Suleman (University of Cape Town), who gave a rousing closing keynote at Open Repositories 2023 about what really makes an open access repository accessible, and a call to reduce complexity in digital repository/library development.

Rather than re-invent a whole repository platform, this tool is one step in that direction: using existing protocols that have served us well for years (OAI-PMH and RSS), we can harvest existing repositories on the web, no matter how complex they happen to be, and produce our own simple, file-based repositories using standards and simple document formats like HTML.

OR2024 Static Site Workshop notes

If you came here from the OR2024 closing presentation, you might want to check out the 'mets' spider instead of the oai_dc one. It harvests an OAI-PMH feed in METS format with MODS metadata, downloads linked files, and writes markdown documents with frontmatter instead of using XSLT to try to produce all the HTML itself. This lets us feed static site generators, which generally do a better job of this.
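As a rough illustration of the "markdown with frontmatter" output described above, here is a minimal sketch using only the standard library. The function names and the metadata keys (title, creators, date) are assumptions for illustration and are not taken from the 'mets' spider itself; it also assumes the metadata keys are simple words that need no quoting.

```python
import json
from pathlib import Path


def render_markdown(metadata: dict, body: str) -> str:
    """Return a markdown document with a YAML frontmatter block."""
    lines = ["---"]
    for key, value in metadata.items():
        # JSON strings and arrays are also valid YAML flow syntax, so
        # json.dumps gives safe quoting without a YAML dependency.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.rstrip() + "\n"


def write_markdown(path: Path, metadata: dict, body: str) -> None:
    """Write the rendered document, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_markdown(metadata, body), encoding="utf-8")


# Hypothetical record, standing in for one harvested item.
doc = render_markdown(
    {"title": "An example thesis", "creators": ["Doe, Jane"], "date": "2023"},
    "Abstract text goes here.",
)
print(doc)
```

A static site generator can then read the frontmatter for titles and listings and render the body itself.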

See the slides and pitches from this presentation at [TODO: Put link here once slides read-only]

The rest of the instructions here should still apply, but you should also read through some of the initial settings, such as base file paths for storage, not just the OCFL paths.

Quick start

Right now, the tool is only tested on DSpace OAI-PMH feeds using the oai_dc (simple Dublin Core elements) metadata format.

1. Make sure you have Python 3 installed. I also recommend pyenv for virtualenv and version management.
2. Clone this repository with git clone https://github.com/kshepherd/feed2html.git
3. Optional: Create or activate a virtualenv with the standard tools or pyenv
4. Install requirements with pip install -r requirements.txt
5. Identify the start URL for your DSpace ListRecords OAI verb, e.g. https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc
6. Give the output/oaidc2html.xsl stylesheet a quick check to make sure it is transforming the fields you're interested in
7. Set up a base directory for your OCFL repository and CSS files, and note the full path
8. Copy or symlink output/css to this base directory
9. Begin a crawl! Let's go with that example URL and a base dir of /tmp/site

scrapy crawl oaipmh_dc_xml \
   -a url="https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc" \
   -a website_title="Test" \
   -a website_subtitle="open access research" \
   -a path_to_assets="/tmp/site" \
   -a path_to_ocfl="/tmp/site/repository" \
   -L INFO

To test just the first page of the OAI results, uncomment CLOSESPIDER_ITEMCOUNT in feed2html/spiders/oaipmh_dc_xml.py
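If you are unsure how the start URL in the steps above is put together, the following sketch builds a ListRecords request from a base OAI-PMH endpoint. The helper name list_records_url is hypothetical and not part of feed2html; the verb, metadataPrefix and set arguments are standard OAI-PMH request parameters.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one OAI set (e.g. a collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


start_url = list_records_url("https://openaccess.myinstitution.edu/oai/request")
print(start_url)
```

The resulting string is what gets passed to the crawl as the url argument.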

Customising

Take a look at the parse_record method in feed2html/spiders/oaipmh_dc_xml.py to see how the simple item objects are constructed. This can be extended the same way as any other scrapy XML feed spider.
If the spider does not properly follow resumption tokens (to get the next page), run the crawl in debug mode with -L DEBUG and compare the XML actually returned by the feed with what the token extraction in parse_node expects.
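When debugging either of the points above, it can help to reproduce the parsing outside scrapy. The sketch below uses only the standard library to turn oai_dc records into simple dicts and to build the next-page URL from a resumption token. It is not the code in parse_record or parse_node: the function names, the sample XML and the example.org endpoint are assumptions for illustration, while the namespaces are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, heavily trimmed ListRecords response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>oai_dc////100</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_records(root):
    """Return one dict per record; every DC element becomes a list of values."""
    items = []
    for record in root.findall(".//oai:record", NS):
        item = {
            "oai_identifier": record.findtext(
                "oai:header/oai:identifier", namespaces=NS
            )
        }
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is not None:  # deleted records carry no metadata
            for element in dc:
                name = element.tag.split("}", 1)[-1]  # strip the namespace
                item.setdefault(name, []).append((element.text or "").strip())
        items.append(item)
    return items


def next_page_url(root, base_url):
    """Return the URL of the next page, or None when the list is complete."""
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    token = token.strip()
    if not token:
        return None
    # resumptionToken is an exclusive argument: metadataPrefix is not repeated.
    query = urlencode({"verb": "ListRecords", "resumptionToken": token})
    return f"{base_url}?{query}"


root = ET.fromstring(SAMPLE)
items = parse_records(root)
next_url = next_page_url(root, "https://example.org/oai/request")
print(items, next_url)
```

If next_page_url returns None on a response that clearly has more pages, check whether the feed puts the token in an unexpected place or namespace.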

Tools and methodologies

Python 3 is a popular, accessible language and is widely used by researchers, librarians and other open access practitioners.

Scrapy is a well-supported, extensible Python module which can scrape web resources and process the results through pipelines. This allows a lot of customisation while leaving the low-level HTTP and document parsing work to an existing framework, which has its own open source community and can easily be extended for more advanced solutions.

The Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.
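To make the "structured, transparent, and predictable" layout more concrete, here is a simplified sketch of a single-version OCFL-style object written with the standard library. It is not how feed2html's pipelines create objects: the function name, object id and file names are hypothetical, the file and field names follow a reading of the OCFL 1.1 specification, and the result should be checked with an OCFL validator before being relied on.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_ocfl_object(object_root: Path, object_id: str, files: dict) -> dict:
    """Write files as version v1 of an OCFL-style object; return the inventory."""
    content_dir = object_root / "v1" / "content"
    manifest, state = {}, {}
    for logical_path, data in files.items():
        target = content_dir / logical_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        digest = hashlib.sha512(data).hexdigest()
        # manifest: digest -> paths on disk; state: digest -> logical paths
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        state.setdefault(digest, []).append(logical_path)

    inventory = {
        "id": object_id,
        "type": "https://ocfl.io/1.1/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(),
                "state": state,
            }
        },
    }
    raw = json.dumps(inventory, indent=2).encode("utf-8")
    sidecar = f"{hashlib.sha512(raw).hexdigest()} inventory.json\n"

    # Conformance declaration file marking this directory as an OCFL object.
    (object_root / "0=ocfl_object_1.1").write_text("ocfl_object_1.1\n")
    # The inventory and its digest sidecar live in the root and in v1/.
    for folder in (object_root, object_root / "v1"):
        (folder / "inventory.json").write_bytes(raw)
        (folder / "inventory.json.sha512").write_text(sidecar)
    return inventory


# Hypothetical usage: one harvested record rendered to a single HTML file.
demo_root = Path(tempfile.mkdtemp()) / "object-1"
inventory = write_ocfl_object(
    demo_root, "oai:example.org:123/1", {"index.html": b"<html></html>"}
)
print(sorted(p.name for p in demo_root.iterdir()))
```

Because everything is plain files and JSON, the object can be inspected, copied or served without any repository software.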

Extensible Stylesheet Language Transformations (XSLT) is an XML-based language used, in conjunction with specialized processing software, for the transformation of XML documents. It has long been popular in the library sector.

Goals

Turn feeds (OAI, RSS/Atom, ActivityPub) into complete static websites
- "Put DSpace on a CD-ROM"
Start with most basic requirements
- OAI-PMH, Dublin Core elements and terms
- RSS 2.0 for blogs, podcasts
- (RDF, JSON-LD, other formats and protocols come later)

TODO

Spider
1. Build OCFL layer
2. Read XML feed with resumption tokens
Pipelines:
1. Initialize OCFL repository on disk
2. Create BagIt fs structure (simpler alternative to OCFL)
3. Search OA services (unpaywall etc) for OA links
4. Transform to HTML with XSLT
5. Create OCFL object and version and add to repository
6. Send to search index (solr, ES, zincsearch?)
Documentation
1. Complete pydoc coverage
2. Installation and usage instructions
3. Complete this README with thorough explanation of the spider and pipelines, and advanced usage instructions
Release
1. Create requirements.txt and INSTALL.md
2. Create LICENSE.txt for BSD 3-Clause license.
3. Release to PyPI (or figure out the best way to package and distribute releases) once the project is beyond prototype

Notes

See NOTES.md for informal notes, ideas, links, references.

feed2html's Issues

View generated site?

I successfully ran a crawl over a 150 item subset of our DSpace repository. This resulted in the OCFL objects being created in the designated output directory. However, I am not seeing any pages or an index.html. Is a full run over the entire repository required to generate site pages?
