princeton-cdh / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application

Home Page: http://prosody.princeton.edu

License: Apache License 2.0

Python 78.68% HTML 7.16% CSS 0.79% JavaScript 5.20% TypeScript 2.06% SCSS 6.11%

python django hathitrust digital-humanities solr

ppa-django's Introduction

ppa-django

Django web application for Princeton Prosody Archive version 3.x.

Code and architecture documentation for the current release available at https://princeton-cdh.github.io/ppa-django/.

This repo uses git-flow conventions; main contains the most recent release, and work in progress will be on the develop branch. Pull requests should be made against develop.

Python 3.11 / Django 5.0 / Node 18.12 / Postgresql 15 / Solr 8

Development instructions

Initial setup and installation:

recommended: create and activate a python 3.11 virtual environment, perhaps with virtualenv or venv

Use pip to install required python dependencies:

pip install -r requirements.txt
pip install -r dev-requirements.txt

Copy sample local settings and configure for your environment:

cp ppa/settings/local_settings.py.sample ppa/settings/local_settings.py

Create a database, configure in local settings in the DATABASES dictionary, change SECRET_KEY, and run migrations:
```
python manage.py migrate
```
Create a new Solr configset from the files in solr_conf :
```
cp -r solr_conf /path/to/solr/server/solr/configsets/ppa
chown solr:solr -R /path/to/solr/server/solr/configsets/ppa
```
and configure SOLR_CONNECTIONS in local settings with your preferred core/collection name and the configset name you created.
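
The settings mentioned above might look roughly like the following sketch. All values are placeholders, and the key names inside `SOLR_CONNECTIONS` (`URL`, `COLLECTION`, `CONFIGSET`) are an assumption based on parasolr-style configuration; check them against `local_settings.py.sample` before relying on them.

```python
# local_settings.py (sketch; every value here is a placeholder)

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "ppa",
        "USER": "ppa",
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

# Replace with a long random value and keep it out of version control.
SECRET_KEY = "replace-with-a-long-random-string"

# Key names are an assumption, not confirmed by this README.
SOLR_CONNECTIONS = {
    "default": {
        "URL": "http://localhost:8983/solr/",
        "COLLECTION": "ppa",  # your preferred core/collection name
        "CONFIGSET": "ppa",  # the configset directory created above
    }
}
```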

See developer notes for setup instructions for using docker with solr:8.4 image.
Bulk import (provisional): requires a local copy of HathiTrust data as pairtree provided by rsync. Configure the path in local_settings.py and then run:
```
python manage.py hathi_import
```

Then index the imported content into Solr:

python manage.py index -i work
python manage.py index_pages

Frontend development setup:

This project uses the Fomantic UI library in addition to custom styles and javascript. You need to compile static assets before running the server.

To build all styles and js for production, including fomantic UI:
```
npm install
npm run build
```

Alternatively, you can rebuild just the custom files or fomantic independently. This is useful if you make small changes and need to recompile once:

npm run build:qa # just the custom files, with sourcemaps
npm run build:prod # just the custom files, no sourcemaps
npm run build:semantic # just fomantic UI

Finally, you can run a development server with hot reload if you'll be changing either set of assets frequently. These two processes are separate as well:

npm run dev # serve just the custom files from memory, with hot reload
npm run dev:semantic # serve just fomantic UI files and recompile on changes

Tests

Python unit tests are written with pytest but use Django fixture loading and convenience testing methods when that makes things easier. To run them, first install development requirements:

pip install -r dev-requirements.txt

To run all python unit tests, use: pytest

Some deprecation warnings for dependencies have been suppressed in pytest.ini; to see warnings, run with pytest -Wd.

Make sure you configure a test solr connection and set up an empty Solr core using the same instructions as for the development core.

Some python unit tests access rendered views, and therefore expect static files to be compiled; see "Frontend development setup" above for how to do this.

In a CI context, we use a fake webpack loader backend that ignores missing assets.

Javascript unit tests are written with Jasmine and run using Karma. To run them, you can use an npm command:

npm test

Automated accessibility testing is also possible using pa11y and pa11y-ci. To run accessibility tests, start the server with python manage.py runserver and then use npm:

npm run pa11y

The accessibility tests are configured to read options from the .pa11yci.json file and look for a sitemap at localhost:8000/sitemap.xml to use to crawl the site. Additional URLs to test can be added to the urls property of the .pa11yci.json file.

Setup pre-commit hooks

If you plan to contribute to this repository, please run the following command:

pre-commit install

This will add a pre-commit hook to automatically style and clean python code with black and ruff.

Because these styling conventions were instituted after multiple releases of development on this project, git blame may not reflect the true author of a given line. In order to see a more accurate git blame execute the following command:

git blame <FILE> --ignore-revs-file .git-blame-ignore-revs

Or configure your git to always ignore styling revision commits:

git config blame.ignoreRevsFile .git-blame-ignore-revs

Documentation

Documentation is generated using sphinx. To generate it, first install development requirements:

pip install -r dev-requirements.txt

Then build documentation using the customized make file in the sphinx-docs directory:

cd sphinx-docs
make html

To check documentation coverage, run:

make html -b coverage

This will create a file at _build/coverage/python.txt listing any python classes or methods that are not documented. Note that sphinx can only report on code coverage for files that are included in the documentation. If a new python file is created but not included in the sphinx documentation, it will be omitted from the report.

Documentation will be built and published with GitHub Pages by a GitHub Actions workflow triggered on push to main.

The same GitHub Actions workflow will build documentation and check documentation coverage on pull requests.

License

This project is licensed under the Apache 2.0 License.

©2019-2024 Trustees of Princeton University. Permission granted via Princeton Docket #20-3624 for distribution online under a standard Open Source license. Ownership rights transferred to Rebecca Koeser provided software is distributed online via open source.


ppa-django's Issues

As an admin, I want a way to search and select digitized items for bulk addition to a collection so that I can efficiently organize large groups of items.

Notes for testing

Select a few digitized works in the admin interface and then select (lower left) from the dropdown to add them to a collection. This will take you to an intermediate page where you can select collections and then finalize the add.
On successful add you should be returned to the digitized works listing with a success message and the collections should be set.
Try doing this by selecting all items on the page (check box in the top row) and then using the select all 5007 option in lower left and add everything to a collection. This will take several seconds, but should work.
Try searching the digitized works for something that returns more than 100 items (e.g. "elocution" returns 259) and add all items from the search to a collection.

Notes for development

Implement as a custom django admin action - see https://docs.djangoproject.com/en/2.0/ref/contrib/admin/actions/
Should provide an intermediate page for user to select the collection - https://docs.djangoproject.com/en/2.0/ref/contrib/admin/actions/#actions-that-provide-intermediate-pages
The action should be specific to/available for DigitizedWorks only

As a user, I want to add a work to my Zotero library from the individual item page so that I can save it for research without having to go back to the list of results.

Notes for testing

You need to have Zotero installed and running, and use a browser with the Zotero plugin installed. On the list archive page, you should see a folder icon (indicating multiple items available). Click on the folder and after a delay (while it loads the metadata) you should see a dialog box that will let you select all or some of the items on the page for harvest into your Zotero library.

As an admin, I want to add and edit collection descriptions so that I can help site users understand the collection and find related materials.

Notes for testing

There should be a description on collections
You can edit it, and also add bolding and italics to fonts
These will display on the 'View Collection' page with formatting intact.

Notes for development

Needs to support basic formatting
Suggest tinymce with fairly minimal set of html fields/styles
make sure description displays in list view with formatting

As a user, I want to filter search results by publication year or range of years so that I focus on works from a particular time period.

Notes for testing

placeholder text should display max/min values based on what's in the database
should be able to specify just a min (everything after date), just a max (everything before), or both
use same date in both to get items from a single year
entering a min greater than max should give an error message
entering a date outside the max/min in the db should give an error message
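
The validation rules listed above can be sketched as a small function. This is an illustration only: the function name, parameter names, and error messages are hypothetical, and the real form validation may be structured differently.

```python
from typing import Optional, Tuple


def validate_year_range(
    start: Optional[int],
    end: Optional[int],
    db_min: int,
    db_max: int,
) -> Tuple[int, int]:
    """Return the effective (start, end) range or raise ValueError."""
    # Each supplied bound must fall within the years present in the database.
    for label, value in (("start", start), ("end", end)):
        if value is not None and not db_min <= value <= db_max:
            raise ValueError(f"{label} year {value} is outside {db_min}-{db_max}")
    # When both bounds are given, the minimum may not exceed the maximum.
    if start is not None and end is not None and start > end:
        raise ValueError("start year must not be greater than end year")
    # A missing bound falls back to the corresponding database limit.
    return (db_min if start is None else start, db_max if end is None else end)
```

Passing the same year for both bounds returns a single-year range, matching the testing note above.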

Styles: editorial post

dev notes

testing notes

designs
desktop / tablet / mobile

As an admin, I would like to be able to see the Hathi Catalog IDs for a volume so that I can see how individual volumes are grouped together within the HathiTrust.

Mockup: homepage

testing notes

should be mostly responsive
card styles are ugly but they'll come in the next wave since they're component styles

As an admin, I want to see when an item was added to the archive and when it was last modified so that I can see which materials were added and changed and when.

modify import script to only update db/solr records when they have changed

As an admin, I want a link from the digitized work list view to HathiTrust so that I can check the contents as I curate the archive.

Notes for testing

View the listing of all digitized works. The second column, labeled Source id as before, is now clickable and links to the HathiTrust url in a new tab/window depending on your browser's defaults.
View the change form for a particular digitized work. It should also have a similar uneditable link at the top of the form with the same functionality. (The source_url field has been suppressed since it is no longer needed and isn't intended to be editable from the admin).
All other fields appear as they did before (both editable and uneditable)

Notes for development

add a property to generate a link to Hathi using source_url and display the source_id
link should open in a new window (target="_blank")
adjust list field to make title the edit link, display source id second
display on the edit page as well (clickable version)

As a user, I want to browse the list of collections so I can find out more about important groupings of items in the archive.

Notes for testing

Add a collection or two (if you haven't done so already)
Go to 'View Collections' on the main site.
They should be there!

Notes for development

Public facing collection list view. Propose /collections/ for url.
Display collection name, description; number of items?

As a user, I want item titles to ignore definite articles and punctuation when sorting, so that I can find the most relevant content first.

notes for testing

When you sort on title in the public search powered by Solr, it should now sort on the same sort titles that you can view and edit in the admin interface (see #100 )

dev notes

need to update form to sort items on the new sort title field
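
As a rough sketch of how a sort title could be derived, assuming only the English definite article and leading punctuation are skipped (the function name and the exact rules are hypothetical, not taken from the project code):

```python
import re

# Assumption: only a leading English "the" is ignored.
LEADING_ARTICLE = re.compile(r"^the\s+", re.IGNORECASE)
# Any run of non-alphanumeric characters at the start of the title.
LEADING_NON_ALNUM = re.compile(r"^[\W_]+")


def sort_title(title: str) -> str:
    """Build a sort key: drop leading punctuation, then a leading article."""
    key = LEADING_NON_ALNUM.sub("", title.strip())
    key = LEADING_ARTICLE.sub("", key)
    return key.lower()
```

Sorting with `sorted(titles, key=sort_title)` then orders works by the first significant word rather than by "The" or an opening quotation mark.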

As a user, I want to filter search results by collection so that I can include or exclude groups of materials based on my interests.

Suggestions for Testing

add a collection
add a digitized work (on its edit screen) to that collection
the collection should appear on the main search form and allow you to use a checkbox to filter by collection

Notes for Development

add a solr collection field to use as a facet (consider text collection field and string collection_exact copy field)
add logic to reindex digitized works when they are updated (i.e. when associated with a collection)
modify the search form to provide a collection facet select field

Bulk add-to collections tool is clearing items that were previously added to collections individually

There must be a "replace" rather than "add" logic bug in the code
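
The difference between the two behaviours can be illustrated with plain Python sets. This is only a model of the suspected bug, not the project's code; with a Django many-to-many field the analogous calls would be `.set(...)` (replace) versus `.add(...)` (merge). The collection names in the usage note are made up.

```python
def bulk_replace(existing: set, selected: set) -> set:
    """Suspected buggy behaviour: the selection overwrites prior memberships."""
    return set(selected)


def bulk_add(existing: set, selected: set) -> set:
    """Intended behaviour: the selection is merged into prior memberships."""
    return existing | selected
```

For a work already in a collection named "Linguistic", bulk-adding "Literary" should leave it in both; the replace variant would leave it in "Literary" only.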

As a user, I want to see numbered results so I can keep track of results as I’m scrolling and paging through.

Notes for testing

search results should be in a numbered list instead of a bulleted list
numbering should continue on subsequent pages of results (rather than starting from 1 again)
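
Continuing the numbering across pages comes down to computing the first result number for each page. A minimal sketch, with a hypothetical helper name (Django's paginator exposes a comparable value as `Page.start_index()`, which could feed the `start` attribute of an ordered list):

```python
def result_numbers(page_number: int, per_page: int, count_on_page: int) -> range:
    """Return the 1-based result numbers shown on a given page."""
    start = (page_number - 1) * per_page + 1
    return range(start, start + count_on_page)
```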

As a user viewing an individual item from a keyword search, I want to see page image thumbnails and text snippets that match my search terms so I can see how many and what kind of pages match my search terms.

Notes for testing

when you do a keyword search on the main archive page, you should see a "view all N pages" link for any work that has more than two matching pages
link should take you to the detail view for that work, with a paginated list of matching pages; pages should display thumbnail image and highlighted text snippets
you should see your query text in a search box which you can refine and edit and resubmit

Notes for development

pass querystring from listview context
add a search form to detail view
update the view to search for pages in the book when there's a keyword
use existing pagination logic
refactor thumbnail and snippets as components

As a user, I want to change how my results are sorted so I can browse the results in multiple ways.

Notes for testing

Should be able to sort by:

relevance, but only if there is a keyword search
chronology (reversible)
alpha by title (reversible)
alpha by author (reversible)
Also check:
Sort should persist as you page through results
default sort is title a-z

Note: we currently don't have a way to make relevance the default for keyword searches without overriding user's sort selection when there is a keyword search, so consider that out of scope for this feature (I'm thinking about a way to do it and we'll try to add it; feel free to create a new issue to document this).

As an admin, I want to suppress items from the site so that I can pull content that should not be included or was wrongly added as I am going through and assigning collections to archive volumes.

Notes for testing

digitized works now have a status field; default is public, you can set manually to suppressed
should see an indicator on the list view if something is public or suppressed
should be able to filter the list view on status
status should be included in CSV export
when you set a record to suppressed the data should be deleted from the hathitrust pairtree data so we don't actually import and index it again (not sure how you can test this; you could ask us to run the hathi import script on the source id?)
if you try to switch a suppressed record back to public, you should get a validation error because it's not yet supported

Notes for development

We don't want to actually delete the record from the database; we'll want to keep a stub at least, to indicate the record was removed and track the history.

add a status field; options public/suppressed, default to public
make editable in admin
display status in the admin list view so removed items are obvious; also configure as a filter.
Include status field in CSV export
when status is changed to suppressed, delete rsync data so it won't be re-added/indexed on a full import
don't allow un-suppressing items (validation? pre-save hook?)
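
The "no un-suppressing" rule could be expressed as a small check like the sketch below. The constant values, function name, and message are hypothetical; in a Django model this logic might live in `clean()` and raise a `ValidationError` instead of `ValueError`.

```python
PUBLIC = "public"
SUPPRESSED = "suppressed"


def validate_status_change(old_status: str, new_status: str) -> str:
    """Return the new status, rejecting suppressed -> public transitions."""
    if new_status not in (PUBLIC, SUPPRESSED):
        raise ValueError(f"unknown status: {new_status}")
    if old_status == SUPPRESSED and new_status == PUBLIC:
        raise ValueError("Un-suppressing a work is not yet supported")
    return new_status
```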

out of scope

We may eventually want a bulk removal option, but consider that out of scope for now.
Supporting "un-suppress" logic is out of scope for now.

Error page styles and content

dev notes

404 page
500 page

testing notes

designs
404 page

I didn't make the "error 404/500" text all caps because it seemed arbitrary (and loud)...happy to change it though.

To test the 404, just go to a random nonexistent url, like: https://test-ppa.cdh.princeton.edu/foo/
To test the 500, go to https://test-ppa.cdh.princeton.edu/500/ to cause an error.

As an admin, I want to see a list of all digitized materials in the archive so that I can view and manage the contents.

Notes for testing

Hathi content loaded via bulk import is visible in the admin interface under Archive -> Digitized Works. Basic metadata and page count is displayed in list view, which can be sorted and filtered by keyword.

As a user, I want to add all or selected works from the search results list to my Zotero library, so that I can efficiently save them for later research or citation.

Notes for testing

Similar to #36 - need Zotero installed and running. You should see an icon for harvesting a single item. Check that when you harvest it you get the metadata for the book and not the webpage.

As an admin, I want an easy way to give project team members archive management and content editing permissions so that I don’t have to keep track of all the individual required permissions.

Notes for testing

See the notes on #2 - I guess they really belong more properly on this story (the two are related). I've set the app to automatically create the two groups for you to simplify things.

Snippets: base template, header and footer

testing notes

zeplin links

nav L / M / M expanded / S / S expanded
footer L / M / S

notes

main navigation should have the correct "pitbar" behavior borrowed from cdh-web (it hides itself when you scroll down quickly and reappears when you scroll up quickly)
main navigation links should all work, except for editorial which isn't in this milestone
main navigation should be responsive
if the hamburger menu is clicked on mobile, it should make the main navigation not do the "pitbar" (it should always stay until the menu is dismissed)
footer should have the correct items with more or less correct alignment (e.g. the version number is floated right on large screens but stacked on mobile)

dev notes

add all the basic meta information to the base.html template
add blocks for css and js, including compress tags
add header and footer snippets that are rendered on every page

Mockup: single volume search

testing notes

zeplin links

results within work: L / M / S

notes

does searching for a term within the work function as you expect it to?
do the search controls (e.g. pagination) function the same way they do on the archive search?
is it responsive?

As an admin, I want to create and update collections so that I can group digitized works into subcollections for site users.

Notes for development

Needs to support associating a digitized work with more than one collection (many to many field)
Collections should have names and descriptions; may not need more than that

As a user, I should not see suppressed items in search results or item display so that my results are not cluttered by items not meant to be part of the archive.

Notes for testing

suppressed items should not be included in the public archive search
suppressed items should not be included in collection counts
suppressed items should not be included in sitemap.xml (maybe check by looking at the xml and then suppressing the first item listed?)
detail page for a suppressed item should return a 410 Gone page and should not show any item details

Notes for development

when a record is suppressed by an admin, delete from solr index (on save, when status has changed)
when reindexing, do not index suppressed/removed records (maybe index data returns nothing for suppressed items?)
detail view should return a 410 Gone status for items that have been suppressed
check behavior for changing a collection name that includes removed items
collection counts should ignore suppressed items
xml sitemaps should exclude suppressed items
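
The rules above share one idea: everything public-facing filters on status, and the detail view answers 410 for suppressed works. A framework-free sketch, where works are modelled as plain dictionaries and all names are hypothetical:

```python
from http import HTTPStatus
from typing import Iterable, List, Mapping

PUBLIC = "public"
SUPPRESSED = "suppressed"


def detail_status(work: Mapping) -> HTTPStatus:
    """Return 410 Gone for suppressed works, 200 OK otherwise."""
    return HTTPStatus.GONE if work["status"] == SUPPRESSED else HTTPStatus.OK


def public_works(works: Iterable[Mapping]) -> List[Mapping]:
    """Filter that indexing, collection counts, and the sitemap could share."""
    return [work for work in works if work["status"] == PUBLIC]
```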

As a user, I want to see a simple timeline visualization of works by publication year so that I can get a sense of how the materials are distributed by time.

testing notes

design links

search results: L / M / S

notes

is it responsive?
test with a small publication date range (e.g. 1700-1707)
test with a large range
test with only supplying one end of the range and leaving the other blank
does it seem to match up with your results?

As an admin, I want to see the history of all edits to a digitized work, including import and updates via script, so that I can track the full history of contributions and changes to the record.

Notes for testing

view history for any record using the "history" button on the top right of the individual digitized work edit page
import script now creates a log entry when a record is created via import script
import script now creates a log entry when records are updated via import script; message should indicate if update was forced by person running the script or triggered by a hathi last modified date

update import script to create admin log entries on create and update
make author and pub date optional in admin edit
unit tests

Updating solr schema to index pub_date as numeric does not update the field

After running python manage.py solr_schema, when running hathi_import, the import errors out as follows:

    "msg":"Exception writing document id uiuo.ark:/13960/t5q85628k to the index; possible analysis error: cannot change DocValues type from SORTED to NUMERIC for field \"pub_date\"",
    "code":400}}

As a user viewing keyword search results, I want to see a few text snippets from the full text of a work so that I can get an idea how my search terms are used in the work.

Notes for testing

you should not see page images or text highlighting if you don't enter search terms
you should see one or two page images with text highlighting when you enter search terms; most relevant pages are displayed first; for now, matching terms will be italicized (styles will be added later based on Xinyi's design)
you should see a link "view all N pages" only if there are more than two matches; link goes to the item detail view for now (actual functionality will be implemented and tested under #32)

Styles: editorial list

dev notes

I've removed the borders between top and bottom of cards, as we discussed at the meeting.

testing notes

designs
desktop / tablet / mobile

reindex script

Need to be able to reindex content in Solr independently of importing content into the database.

needs an option for reindexing just books or just pages
needs multithreading to make the full reindex faster
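
One possible shape for the multithreaded part, sketched with the standard library only. The `index_batch` callable stands in for whatever actually sends a batch of documents to Solr and is purely hypothetical, as are the default batch size and worker count.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def reindex(
    items: Iterable,
    index_batch: Callable[[List], int],
    batch_size: int = 100,
    workers: int = 4,
) -> int:
    """Index items in parallel batches; return the total number indexed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each batch is handed to a worker thread; results are summed in order.
        return sum(pool.map(index_batch, chunked(items, batch_size)))
```

Threads suit this task if the work is dominated by waiting on Solr over the network; restricting the run to books or pages would just mean passing a different `items` iterable.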

Mockup: collections list page

testing notes

zeplin links

collections list: L / M / S

notes

cards are a component style that will come in the next iteration, so they won't look much like the spec for now.
is it responsive?
do the card links work?
is there anything that looks out of place?

As an admin, I want a "Collection" column viewable on the "Digitized works" page so that I can easily see what collection(s) an item belongs to.

Setup js pipeline with compressor

At a minimum I'd like to be able to use ES6 features in the source, so that will require a transpiler in addition to minification.

It might be overkill, but TypeScript could be nice too.

It looks like this plugin actually handles both the scss and ES6 js together, which could be a nice drop-in solution going forward.

As an admin, I want to edit user and group permissions so I can manage project team member access within the system.

Notes for testing

I've customized the user list display to show more information that I hope will make it easier to manage accounts
There are now two groups preloaded for you: archive manager and content editor (please let me know if you want different names). An archive manager can edit digitized works; content editor can use the content management functionality (create site pages).
To test the group permissions and get comfortable with managing user accounts, I recommend the following:
- Login to the admin site with your own account (should have superuser permissions)
- Create two new test users. They should have staff permissions (allows login to the admin site) and be active but should not have superuser (allows everything). Assign one group to each user.
- In a different browser or in an incognito window in the same browser, login as your test user and check what that user sees and is able to do in the admin site.
FYI: I have also set the app to create a script user account so that the import script can create log entries for when records are created and updated. This account should not be removed. If you have ideas for a better name please let me know!

As an admin, I want to add individual digitized items to one or more collections so that I can manage which items are included in which collections.

Notes for testing

Add a Collection using the admin.
There should be a new field on the admin edit for DigitizedWork labeled collections.
From the DigitizedWork, you should be able to add it to a collection.

Notes for development

Make collection field editable in the individual record view

As a user, I want different styles for the main title and subtitle on search results so that I can visually distinguish titles.

testing notes

works with subtitles should display both title and subtitle
main titles should be bold; subtitles should be italic
you can see this on the main archive search page in this design
you can also see it on the search within work page in this design

As a project team member, I want to login with my Princeton CAS account so that I can use existing credentials and not have to keep track of a separate username and password.

Notes for testing

Test site admin link will be provided. Use the orange "login with Princeton CAS" button.
Test users have already been initialized as admins for the test site.

add mezzanine now to avoid dealing with grappelli/admin js issues later

load partial html for page of search results via ajax

If we do, template needs to be broken into components so that the view can specify the full page template or just the reloaded portion if the request is made via AJAX.
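
A minimal sketch of the template choice described above, assuming the front end marks AJAX requests with the conventional `X-Requested-With: XMLHttpRequest` header. Both template paths are hypothetical, and a plain dictionary is used here in place of Django's case-insensitive `request.headers`.

```python
from typing import Mapping

# Hypothetical template names used only for illustration.
FULL_TEMPLATE = "archive/digitizedwork_list.html"
PARTIAL_TEMPLATE = "archive/snippets/results_list.html"


def choose_template(headers: Mapping[str, str]) -> str:
    """Pick the partial template when the request was made via AJAX."""
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return PARTIAL_TEMPLATE
    return FULL_TEMPLATE
```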

Also note from the Zotero documentation on exposing metadata

Websites for which metadata changes without a page reload should fire a ZoteroItemUpdated event to tell Zotero to re-detect metadata on the page. This is supported in Zotero 3.0 and later.

var ev = document.createEvent('HTMLEvents');
ev.initEvent('ZoteroItemUpdated', true, true);
document.dispatchEvent(ev);

Setup scss pipeline with compressor

store page-specific .scss files in app /static directories
include them in the template's css block
compress the css block on every page
add dev info to README

Snippet: search result pagination controls

testing notes

zeplin links

components All
search page L

notes

check behavior with small and large sets of search results
check behavior when you are on a page at the beginning, middle, and end of search results

As a user, I want to search and browse digitized volumes by keyword so that I can see what materials are in the archive.

Notes for testing

Simple keyword search on anything in basic metadata and full text
Basic display (no styles, limited formatting) provided for testing search functionality. Does not yet have a custom default sort or handle pagination.
Searches across book metadata and page contents, grouping all pages and book together; book should always be retrieved for display even if the only matches are in pages of the book.
Currently supports keyword, exact phrase (use quotes) and also boolean searching - default behavior is OR, but you can specify AND. I'm not sure if we want to preserve this behavior or not, but probably worth testing to get an idea how it works.
Updates 12/14:
- results are sorted by title by default; sorted by relevance when a search term is entered
- now has basic pagination displayed at the bottom of the page; for now, showing 50 items per page
- now displaying the first handful of pages for each volume
- when there is a keyword search, I'm also displaying relevance per page
- I'm currently displaying relevance for the group, but it's always 1.0, so there's something off there but I'm not sure what yet
- I looked up the near search syntax, if you want to try that: "grammar children"~4, where 4 is the maximum number of words apart
- please also test searches with bad syntax to see how it's handled; I've tried e.g. "incomplete quote or incomplete boolean such as thing OR; you may be able to think of others.

default sort
pagination
display list of matching page numbers for testing
unit tests

Mockup: archive search page

testing notes

this is just an aggregator issue for the below issues - it will be marked complete when they are completed

dev notes

#63 (search form)
#59 (search result)
#64 (sort controls)

As a user browsing the list of collections, I want to see brief summary statistics so I can decide which collections of materials I want to browse.

Notes for testing

On the collection list page, you should see a count and date range for each collection. Feel free to add new test collections on the test site to play with this - you can add lots of items or a few. Hopefully the Solr indexing will keep up, but let us know if you have problems!

Mockup: digitized work detail page

testing notes

zeplin views

individual work: L / M / S

notes

is the correct set of metadata for a work displayed in the table?
is it responsive?

As an admin, I want a bulk import of HathiTrust materials so that previously identified and downloaded data can be added to the system.

Notes for testing

Imports basic metadata into the Django database for admin management and imports metadata and page full text into Solr for search.
Update 12/7 - import now includes publisher and publication place from the MARC record.
Note that currently the import does not strip any trailing punctuation from the author, publisher, or publication places.

As an admin, I want to generate a CSV report of materials on the site so that I can do analysis with other tools such as OpenRefine to analyze collection assignment.

Notes for testing

Should see a download CSV button on the digitized works list page
Make sure it includes all the information you expect
Please test after adding some items to collections; test with at least one item in more than one collection

Notes for Development

view to export all works and all metadata fields from the database
Customize the Admin list view to add a link to download the data
~~Restricted to admins; possibly create a new permission for this?~~
Unit tests
include collections
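
The export could be sketched with the standard `csv` module as below. The column names and the choice to join multiple collections with a semicolon are assumptions for illustration; works are modelled as plain dictionaries rather than database records.

```python
import csv
import io
from typing import Iterable, Mapping

# Hypothetical column names; the real export fields may differ.
FIELDS = ["source_id", "title", "author", "pub_date", "collections"]


def export_csv(works: Iterable[Mapping]) -> str:
    """Serialize works to CSV, joining multiple collections with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for work in works:
        # Missing metadata fields are written as empty cells.
        row = {name: work.get(name, "") for name in FIELDS if name != "collections"}
        row["collections"] = "; ".join(sorted(work.get("collections", [])))
        writer.writerow(row)
    return buffer.getvalue()
```

Keeping all collections in one cell means a work in more than one collection still occupies a single row, which is convenient when loading the file into a tool such as OpenRefine.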

Snippet: single search result

testing notes

zeplin links

list with results: L / M / S

notes

is it responsive?
is the important metadata there (e.g. publisher)?
do long titles or weird metadata distort it significantly?

As an admin, I want the CSV report of materials on the site to include items' Hathi catalog ID so that I can identify duplicates and multi-volume works.

dev notes

see discussion on #13 for context on what catalog IDs are and why they are needed.

As a user, I want to see basic details for individual items in the archive so that I can see the record details and get to the HathiTrust version.

Notes for testing

accessible from search results or from admin interface (use the 'view on site' button)
display is provisional and unstyled