libraryofcongress / chronam

This software project is no longer being actively developed at the Library of Congress. Consider using the Open-ONI (https://github.com/open-oni) fork of the chronam software. Project mailing list: http://listserv.loc.gov/archives/chronam-users.html.


chronam's Introduction

chronam

chronam is the Django application that the Library of Congress uses to make its Chronicling America website. The Chronicling America website makes millions of pages of historic American newspapers that have been digitized by the National Digital Newspaper Program (NDNP) browsable and searchable on the Web. A little bit of background is needed to understand why this software is being made available.

NDNP is a partnership between the Library of Congress, the National Endowment for the Humanities (NEH), and cultural heritage organizations (awardees) across the United States that have applied for grants to help digitize newspapers in their states. Awardees digitize newspaper microfilm according to a set of specifications and then ship the data back to the Library of Congress, where it is loaded into Chronicling America.

Awardee institutions are able to use this data however they want, including creating their own websites that highlight their newspaper content in the local context of their own collections. The idea of making chronam available here on GitHub is to provide a technical option for these awardees, or for other interested parties, to publish their own websites of NDNP newspaper content. chronam provides a core set of functionality for loading, modeling and indexing NDNP data, while allowing you to customize the look and feel of the website to suit the needs of your organization.

The NDNP data is in the Public Domain and is itself available on the Web for anyone to use. The hope is that the chronam software can be useful for others who want to work with and/or publish the content.

Install

System level dependencies can be installed by following the operating system specific instructions in the Ubuntu and Red Hat guides.

After you have installed the system level dependencies you will need to install some application-specific dependencies and configure the application.

Install dependent services

MySQL

You will need a MySQL database. If this is a new server, you will need to start MySQL and assign it a root password:

sudo service mysqld start
/usr/bin/mysqladmin -u root password 'pick_one' # pick a real password

You will probably want to change the password 'pick_one' in the example below to something else:

echo "DROP DATABASE IF EXISTS chronam; CREATE DATABASE chronam CHARACTER SET utf8mb4; CREATE USER 'chronam'@'localhost' IDENTIFIED BY 'pick_one'; GRANT ALL ON chronam.* to 'chronam'@'localhost'; GRANT ALL ON test_chronam.* TO 'chronam'@'localhost';" | mysql -u root -p

Solr

The Ubuntu and Red Hat guides have instructions for installing and starting Solr manually. For development, you may prefer to use Docker:

cd solr
docker build -t chronam-solr:latest .
docker run -p8983:8983 chronam-solr:latest
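To verify that Solr is answering before moving on, a quick check (a sketch; the ping path assumes the default single-core layout used by the install guides, so adjust it if your core name differs):

import urllib2

# Expect HTTP 200 if Solr is up and the core is healthy.
resp = urllib2.urlopen('http://localhost:8983/solr/admin/ping')
print(resp.getcode())  # 200 means Solr is reachable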

Install the application

First you will need to set up the local Python environment and install some Python dependencies:

cd /opt/chronam/
virtualenv -p /usr/bin/python2.7 ENV
source /opt/chronam/ENV/bin/activate
cp conf/chronam.pth ENV/lib/python2.7/site-packages/chronam.pth
pip install -r requirements.pip

Next you need to create some directories for data:

mkdir /srv/chronam/batches
mkdir /srv/chronam/cache
mkdir /srv/chronam/bib

You will need to create a Django settings file which uses the default settings and sets custom values specific to your site:

  1. Create a settings.py file in the chronam directory which imports the default values from the provided template for possible customization:

     echo 'from chronam.settings_template import *' > /opt/chronam/settings.py
    
  2. Ensure that the DJANGO_SETTINGS_MODULE environment variable is set to chronam.settings before you run a Django management command. This can be set as a user-wide default in your ~/.profile, but the recommended way is simply to make it part of the virtualenv activation process:

     echo 'export DJANGO_SETTINGS_MODULE=chronam.settings' >> /opt/chronam/ENV/bin/activate
    
  3. Add your database credentials to the settings.py file, following the standard Django settings documentation (the values below match the database created earlier; substitute your real password):

     DATABASES = {
         'default': {
             'ENGINE': 'django.db.backends.mysql',
             'NAME': 'chronam',
             'USER': 'chronam',
             'HOST': 'localhost',
             'PASSWORD': 'pick_one',
         }
     }
    

You should never edit the settings_template.py file, since it may change in the next release, but you may wish to periodically review the list of changes to that file in case you need to update your local settings.
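Putting the pieces together, a complete minimal settings.py might look like the sketch below. DATABASES is standard Django; the SOLR setting name is an assumption based on common chronam deployments, so verify it against settings_template.py:

# /opt/chronam/settings.py -- minimal sketch
from chronam.settings_template import *

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'chronam',
        'USER': 'chronam',
        'HOST': 'localhost',
        'PASSWORD': 'pick_one',
    }
}

# Assumed setting name -- check settings_template.py for the exact name
# your version uses.
SOLR = 'http://localhost:8983/solr'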

Next you will need to initialize the database schema and load some initial data:

django-admin.py migrate
django-admin.py loaddata initial_data languages
django-admin.py chronam_sync --skip-essays

And finally you will need to collect static files (stylesheets, images) so that they can be served by Apache in production settings:

django-admin.py collectstatic --noinput

Load Data

As mentioned above, the NDNP data that awardees create and ship to the Library of Congress is in the public domain and is made available on the Web as batches. Each batch contains newspaper issues for one or more newspaper titles. To use chronam you will need to have some of this batch data to load. If you are an awardee you probably have this data on hand already, but if not you can use a tool like wget to bulk download the batches. For example:

cd /srv/chronam/batches/
wget --recursive --no-parent --no-host-directories --cut-dirs 2 --reject index.html* https://chroniclingamerica.loc.gov/data/batches/uuml_thys_ver01/

In order to load data you will need to run the load_batch management command, passing it the full path to the batch directory. So, assuming you have downloaded the uuml_thys_ver01 batch as above, you will run:

django-admin.py load_batch /srv/chronam/batches/uuml_thys_ver01

If this is a new server, you may need to start the web server:

sudo service httpd start

After this completes you should be able to view the batch in the batches report via the Web:

http://www.example.org/batches/

Caching

After loading data, you will need to clear the cache. If you are using a reverse proxy (like Varnish) you will need to clear that as well, along with any CDN in front of it. Below is a list of URLs that should be cleared based on what content you are loading.

All pages that contain an LCCN are tagged with that LCCN in the cache headers. This allows purging by a specific LCCN tag when there is an update to a batch.
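For example, if your Varnish VCL exposes a purge hook keyed on that tag, a script can invalidate everything for one LCCN in a single request. The PURGE method, port, and X-Purge-Lccn header below are assumptions about your proxy configuration, not chronam behavior:

import httplib

# Ask the proxy to drop every cached object tagged with this LCCN.
conn = httplib.HTTPConnection('localhost', 6081)
conn.request('PURGE', '/', headers={'X-Purge-Lccn': 'sn88085187'})
resp = conn.getresponse()
print('%s %s' % (resp.status, resp.reason))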

List of URLs to purge when loading new batch

  • All URLs tagged with lccn=<LCCN>
  • All URLs matching these patterns (see the sketch after this list):
    chroniclingamerica.loc.gov/tabs
    chroniclingamerica.loc.gov/sitemap*
    chroniclingamerica.loc.gov/frontpages*
    chroniclingamerica.loc.gov/titles*
    chroniclingamerica.loc.gov/states*
    chroniclingamerica.loc.gov/counties*
    chroniclingamerica.loc.gov/states_counties*
    chroniclingamerica.loc.gov/cities*
    chroniclingamerica.loc.gov/batches/summary*
    chroniclingamerica.loc.gov/reels*
    chroniclingamerica.loc.gov/reel*
    chroniclingamerica.loc.gov/essays*
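A sketch of automating the list above, assuming your proxy maps an HTTP BAN method onto a ban against req.url (the method name and port are configuration-dependent assumptions):

import httplib

# One request per URL prefix; the trailing * in the list above is covered
# if the proxy's ban expression matches everything beneath the prefix.
PREFIXES = [
    '/tabs', '/sitemap', '/frontpages', '/titles', '/states', '/counties',
    '/states_counties', '/cities', '/batches/summary', '/reels', '/reel',
    '/essays',
]

conn = httplib.HTTPConnection('localhost', 6081)
for prefix in PREFIXES:
    conn.request('BAN', prefix, headers={'Host': 'chroniclingamerica.loc.gov'})
    resp = conn.getresponse()
    resp.read()  # drain the response so the connection can be reused
    print('%s %s' % (prefix, resp.status))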
    

List of URLs to purge when loading new Awardees

  • All URLs matching chroniclingamerica.loc.gov/awardees*

List of URLs to purge when loading new basic data

  • All URLs matching chroniclingamerica.loc.gov/institutions*

List of URLs to purge when loading code

  • All URLs matching these patterns:
    chroniclingamerica.loc.gov/ocr
    chroniclingamerica.loc.gov/about
    chroniclingamerica.loc.gov/about/api
    chroniclingamerica.loc.gov/help
    

Run Unit Tests

For the unit tests to work you will need:

  • to have the uuml_thys_ver01 batch available; you can use the wget command in the previous section to get it
  • a local Solr instance running
  • a local MySQL database
  • access to the Essay Editor feed

After that you should be able to:

cd /opt/chronam/
django-admin.py test chronam.core.tests --settings=chronam.settings_test

License

This software is in the Public Domain.

chronam's People

Contributors

acdha, cclauss, dbrunton, dcmcand, dependabot[bot], dillonpeterson, edsu, eikeon, jechols, johnscancella, keshavmagge, kzwa, luisb, morskyjezek, myusuf, nwy, rioh, tongwang

chronam's Issues

in advanced search, only the last state is "sticky"

Example URL with multiple search parameters:

http://chroniclingamerica.loc.gov/search/pages/results/?state=Illinois&state=Indiana&state=Minnesota&state=Ohio&dateFilterType=yearRange&date1=1890&date2=1922&language=&ortext=&andtext=&phrasetext=William+Foote&proxtext=&proxdistance=5&rows=20&searchType=advanced

In particular, note the multiple "state=" options in the URL.

Now click on the "advanced search" tab. Years are properly prefilled, as is "phrase". However with the "Select State(s)" box, only Ohio is selected.
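The symptom is consistent with the form being re-populated via request.GET.get('state'), which returns only the last value of a repeated parameter. A sketch of the distinction (view and variable names are illustrative, not chronam's actual code):

# Django QueryDict keeps every value for a repeated parameter, but .get()
# returns only the last one -- re-populate multi-selects with .getlist().
def advanced_search(request):
    states = request.GET.getlist('state')  # ['Illinois', 'Indiana', 'Minnesota', 'Ohio']
    state = request.GET.get('state')       # 'Ohio' -- matches the bug above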

newspaper page view: hide icon text from mobile devices

The newspaper page view looks great on mobile devices like the iPhone, but I think it would look even better if we could have some of the text around the menu icons disappear (responsive design) on devices that don't have a lot of screen real estate. That way the menu bar would appear on one line instead of broken up across two. I think we would just need to wrap Page, Issue, Text, PDF, JP2 and Clip Image in spans of the right class.

Reexamine tile size

The default tile size is 512. We're down at 256 right now. We should at least go back to 512, and possibly higher.

batch load instructions

The install documents apparently don't reflect how batch loading currently works. According to Nathan Yarasavage, the wget command pulls the data down into a batches subdirectory of /opt/chronam/data/batches. Also, a full path is now required for batch loading, and I think the instructions may still use just the batch name.

parallelize ocr dump process

The last time it ran, the process that generates the OCR bulk downloads took ~2 weeks. This is partly because it works one batch at a time. It ought to be fairly easy to parallelize the dump process with celery.

This is an intentional duplicate of Trac ticket #1254.
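A minimal sketch of the celery approach (the task and helper names are illustrative, not chronam's actual dump code):

from celery import shared_task

@shared_task
def dump_ocr_for_batch(batch_name):
    # Hypothetical helper that writes the OCR bulk download for one batch.
    build_ocr_dump(batch_name)

def dump_all(batch_names):
    # Queue one task per batch; workers then run them in parallel.
    for name in batch_names:
        dump_ocr_for_batch.delay(name)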

installation problem w/ lucene-memory.jar

As reported by Chris Ehrman who was installing on Ubuntu:

Command:
sudo ln -s /usr/share/java/lucene-memory.jar /usr/share/solr/WEB-INF/lib/lucene-memory.jar

Error message:
ln: failed to create symbolic link `/usr/share/solr/WEB-INF/lib/lucene-memory.jar': No such file or directory

Directory - using 338$a when 245$h not present

Summary: use the 338$a instead of the 245$h if the 245$h is not present

Full Description:
A change in cataloging practice has led to this request.

In RDA practice, encoding "material format" in the 245$h ("medium") is deprecated in favor of encoding it in the 338$a (carrier type).

In records created or updated since the RDA changeover, you will see "online resource" appear in the 338$a, with nothing in the 245$h.

In addition to displaying the 245 $h (in search lists, title display, etc.) when present, please use the contents of 338$a, when it is available, in the same way.

(Former internal trac ticket 1381)
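A sketch of that selection logic with pymarc (a hypothetical helper; chronam's actual extractor may structure this differently):

def material_format(record):
    # Prefer 245$h ("medium") when present; otherwise fall back to the
    # first 338$a (RDA carrier type), per the request above.
    f245 = record['245']
    if f245 is not None and f245['h']:
        return f245['h']
    for f338 in record.get_fields('338'):
        if f338['a']:
            return f338['a']
    return None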

Language field initial data and schema.xml requirements need to be updated for a clean install

When loading a batch using the most recent software from Git on Ubuntu, I kept getting an error about "language matching query does not exist" (see the paste below). When I manually added a row for "eng" to the MySQL language table, I then got a Solr error saying a required field wasn't found. So I think this is because language is a required field in Solr's schema.xml AND because there is no default "eng"/English row in the MySQL language table on a fresh install. I got it working by adding an "eng" row in MySQL and by making the language value optional in schema.xml. I don't know if this will cause problems down the road, but I thought I would mention that it is an issue.

Hope that helps

batch_loader.py, starting at line 428:

for lang, text in lang_text.iteritems():
    try:
        language = models.Language.objects.get(Q(code=lang) | Q(lingvoj__iendswith=lang))
    except models.Language.DoesNotExist:
        # default to english as per requirement
        language = models.Language.objects.get(code='eng')
    ocr.language_texts.create(language=language, text=text)
page.ocr = ocr

Here's the output from my batch load before I made the Solr field optional and added the "eng" row to MySQL:

(ENV)ubuntu@ip-10-119-97-242:/opt/chronam/data$ django-admin.py load_batch /opt/chronam/data/batches/batch_vi_affirmed_ver01
INFO:root:loading batch at /opt/chronam/data/batches/batch_vi_affirmed_ver01
INFO:chronam.core.batch_loader:loading batch: batch_vi_affirmed_ver01
INFO:rdflib:version: 3.4.0
INFO:chronam.core.views.image:NativeImage backend '%s' not available.
INFO:chronam.core.views.image:NativeImage backend '%s' not available.
INFO:chronam.core.views.image:Using NativeImage backend 'graphicsmagick'
INFO:chronam.core.batch_loader:Assigned page sequence: 1
INFO:chronam.core.batch_loader:Saving page. issue date: 1886-07-17 00:00:00, page sequence: 1
ERROR:chronam.core.batch_loader:unable to load batch: Language matching query does not exist.
ERROR:chronam.core.batch_loader:Language matching query does not exist.
Traceback (most recent call last):
  File "/opt/chronam/core/batch_loader.py", line 166, in load_batch
    issue = self._load_issue(mets_url)
  File "/opt/chronam/core/batch_loader.py", line 283, in _load_issue
    page = self._load_page(doc, page_div, issue)
  File "/opt/chronam/core/batch_loader.py", line 405, in _load_page
    self.process_ocr(page)
  File "/opt/chronam/core/batch_loader.py", line 433, in process_ocr
    language = models.Language.objects.get(code='eng')
  File "/opt/chronam/ENV/local/lib/python2.7/site-packages/django/db/models/manager.py", line 131, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/opt/chronam/ENV/local/lib/python2.7/site-packages/django/db/models/query.py", line 366, in get
    % self.model._meta.object_name)
DoesNotExist: Language matching query does not exist.
WARNING:root:no OcrDump to delete for batch_vi_affirmed_ver01 (Library of Virginia; Richmond, VA)
ERROR:chronam.core.management.commands.load_batch:unable to load batch: Language matching query does not exist.
Traceback (most recent call last):
  File "/opt/chronam/core/management/commands/load_batch.py", line 39, in handle
    batch = loader.load_batch(batch_name)
  File "/opt/chronam/core/batch_loader.py", line 195, in load_batch
    raise BatchLoaderException(msg)
BatchLoaderException: unable to load batch: Language matching query does not exist.
Error: unable to load batch. check the load_batch log for clues
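Until the initial data is fixed, a hedged workaround is to create the missing row yourself from django-admin.py shell (the name field is an assumption about the Language model; check chronam.core.models for the real fields):

from chronam.core import models

# Create the English fallback row that process_ocr expects.
models.Language.objects.get_or_create(code='eng', defaults={'name': 'English'})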

Directory - holdings display does not include all locations and formats

Update holdings types & display to handle changes in holdings records.
Example: this page should display like the following with the new holdings: http://chroniclingamerica.loc.gov/lccn/sn88085187/holdings/

"HTML display should look as follows: (using different example to show multiple formats on one institution.)"

<hr>
HOLDING: Tacoma Pub Libr, Tacoma, WA

View more titles from this institution

Available as: Microfilm Service Copy

Dates:

    <1894:8:19>
    <1909:1:1-1918:6:29>
    <1923:9:12, 10:1-1934:3:31>

Last updated: 08/1990

<hr>
HOLDING: Univ of Washington Libr, Seattle, WA

View more titles from this institution

Available as: Microfilm Service Copy

Dates:

    <1903:12:21-1904:9:21>
    <1907:5:18>
    <1918:4:13-1923:11:13>
    <1929:12:2-1930:8:30>
    <1930:10:1-1933:11:23>
    <1934:2:1-1943:4:12>

Last updated: 07/1989

Available as: Microfilm Master

Dates:

    <1903:12:21-1904:9:21>
    <1907:5:18>
    <1918:4:13-1923:11:13>
    <1929:12:2-1930:8:30>
    <1930:10:1-1933:11:23>
    <1934:2:1-1943:4:12>

Last updated: 07/1989
<hr>

Search highlights in OCR

ChronAm currently supports hit highlights on the item view, but not on the OCR page. There has been some discussion of adding highlights to the OCR page as well, such that when someone visits the page with words highlighted and clicks on the OCR page, this functionality persists there.

5 second timeout on SeaDragon

Up the timeout to something higher than 5 seconds, TBD. There doesn't seem to be much upside to having a timeout at all.

Disallow invalid date ranges

If someone picks a start year that is 1860, use a little reflection to either remove or gray out the end years before 1860.
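The graying-out itself belongs in the UI, but a complementary server-side check could reject inverted ranges outright. A sketch with a Django form (chronam's actual search form class may differ):

from django import forms

class DateRangeForm(forms.Form):
    date1 = forms.IntegerField(required=False)  # start year
    date2 = forms.IntegerField(required=False)  # end year

    def clean(self):
        cleaned = super(DateRangeForm, self).clean()
        d1, d2 = cleaned.get('date1'), cleaned.get('date2')
        if d1 and d2 and d2 < d1:
            raise forms.ValidationError('End year cannot precede start year.')
        return cleaned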

Open Source JP2 viewer

Need to provide some baseline functionality for JP2 rendering using an open source tool.

clip image link broken

The "clip image" link on the page view appears to be broken, maybe since the change to the tiling URLs?

advanced search (page)

The advanced search functionality has not yet been added to the generic skin in core.

Fix duplicate requests made during title pull

generate_requests, which creates the initial queue of requests, has already executed each request by the time the list of requests is compiled.

When the requests are passed to grab_content, it sends the request again.
Fix it.

cts functionality and minicts dependency

Users of Chronicling America outside of the Library of Congress have no need for Content Transfer Services (CTS) related functionality. They also won't be able to install the minicts module since it is currently in LC's private git repository. At the very least minicts should be removed from the requirements.txt, and perhaps the tasks and management commands related to CTS should be moved from chronam.core into chronam.loc.

All Digitized Newspapers - skip leading articles in title alphabetization

skip "the", "le," "la", "das", etc.

fwiw, this is already done in the Directory where titles are alphabetized regardless of their leading article.


More info:
This also goes for ignoring capitalization.

Currently, all the capitalized titles are alphabetized above lower-cased titles. Please ignore caps.

(Since title capitalization is inconsistent. If capitalization can be made consistent, let's make another ticket.)

Titles are mostly lower-cased per MARC cataloging rules, rather than standard proper-noun capitalization. But I think maybe someone ran a "capitalize" pass on the directory titles at some point? Curt might know.


Formerly internal trac ticket 1369
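A sketch of a sort key that skips leading articles and ignores case, as requested above (the article list is illustrative and would need to cover every language in the directory):

LEADING_ARTICLES = ('the ', 'a ', 'an ', 'le ', 'la ', 'el ', 'das ', 'der ', 'die ')

def title_sort_key(title):
    # Lower-case the title and drop a single leading article, if any.
    key = title.strip().lower()
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key

print(sorted(['The Sun', 'Der Tag', 'la opinion', 'Alaska Citizen'],
             key=title_sort_key))
# -> ['Alaska Citizen', 'la opinion', 'The Sun', 'Der Tag']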

Add 852z (notes) to extractor & front-end display

Refs internal trac ticket (more details there): #1363

The record should look something like this on ChronAm:

SUMMARY HOLDING: Mercer Cnty Libr, Lawrenceville, NJ

Available as: Original
Retains current 2 months [Microfilm=1964- 0,4]

Last updated: 05/1990

chronam_sync dependent on access to ndnp-essays.rdc.lctl.gov

The chronam_sync management command will not work outside of the Library of Congress since it expects to be able to grab an Atom feed from ndnp-essays.rdc.lctl.gov, which is not publicly available. We have (at least) a few options for fixing this:

  • change the management command to not poll for essays by default
  • periodically make the essays part of the chronam repository (they aren't big) for loading locally
  • have chroniclingamerica.loc.gov make a feed of the essays available as well
  • make the essay editor publicly available at ndnp-essays.rdc.lctl.gov, or another awardee institution

Thoughts?

Fix page count on pages where it is missing

I am not sure why, but the page count is missing from some pages in the newspaper directory.

When I refer to page count, I am referring to this statement:
"Pages currently available: 6,025,474"


It seems to me that state- and language-specific newspaper list filters have it, but anything filtered by ethnicity does not.
