libraryofcongress / chronam

This software project is no longer being actively developed at the Library of Congress. Consider using the Open-ONI (https://github.com/open-oni) fork of the chronam software. Project mailing list: http://listserv.loc.gov/archives/chronam-users.html.


chronam's Introduction

chronam

chronam is the Django application that the Library of Congress uses to make its Chronicling America website. The Chronicling America website makes millions of pages of historic American newspapers that have been digitized by the National Digital Newspaper Program (NDNP) browsable and searchable on the Web. A little bit of background is needed to understand why this software is being made available.

NDNP is a partnership between the Library of Congress, the National Endowment for the Humanities (NEH), and cultural heritage organizations (awardees) across the United States that have applied for grants to help digitize newspapers in their states. Awardees digitize newspaper microfilm according to a set of specifications and then ship the data back to the Library of Congress, where it is loaded into Chronicling America.

Awardee institutions are able to use this data however they want, including creating their own websites that highlight their newspaper content in the local context of their own collections. The idea of making chronam available here on GitHub is to provide a technical option for these awardees, or for other interested parties, to publish their own websites of NDNP newspaper content. chronam provides a core set of functionality for loading, modeling and indexing NDNP data, while allowing you to customize the look and feel of the website to suit the needs of your organization.

The NDNP data is in the Public Domain and is itself available on the Web for anyone to use. The hope is that the chronam software can be useful for others who want to work with and/or publish the content.

Install

System level dependencies can be installed by following the operating system specific instructions in the Ubuntu and Red Hat guides.

After you have installed the system level dependencies you will need to install some application-specific dependencies and configure the application.

Install dependent services

MySQL

You will need a MySQL database. If this is a new server, you will need to start MySQL and assign it a root password:

sudo service mysqld start
/usr/bin/mysqladmin -u root password 'pick_one' # pick a real password

You will probably want to change the password 'pick_one' in the example below to something else:

echo "DROP DATABASE IF EXISTS chronam; CREATE DATABASE chronam CHARACTER SET utf8mb4; CREATE USER 'chronam'@'localhost' IDENTIFIED BY 'pick_one'; GRANT ALL ON chronam.* to 'chronam'@'localhost'; GRANT ALL ON test_chronam.* TO 'chronam'@'localhost';" | mysql -u root -p

Solr

The Ubuntu and Red Hat guides have instructions for installing and starting Solr manually. For development, you may prefer to use Docker:

cd solr
docker build -t chronam-solr:latest .
docker run -p8983:8983 chronam-solr:latest
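To verify that Solr is answering before moving on, a quick check (a sketch; the ping path assumes the default single-core layout used by the install guides, so adjust it if your core name differs):

import urllib2

# Expect HTTP 200 if Solr is up and the core is healthy.
resp = urllib2.urlopen('http://localhost:8983/solr/admin/ping')
print(resp.getcode())  # 200 means Solr is reachable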

Install the application

First you will need to set up the local Python environment and install some Python dependencies:

cd /opt/chronam/
virtualenv -p /usr/bin/python2.7 ENV
source /opt/chronam/ENV/bin/activate
cp conf/chronam.pth ENV/lib/python2.7/site-packages/chronam.pth
pip install -r requirements.pip

Next you need to create some directories for data:

mkdir /srv/chronam/batches
mkdir /srv/chronam/cache
mkdir /srv/chronam/bib

You will need to create a Django settings file which uses the default settings and sets custom values specific to your site:

  1. Create a settings.py file in the chronam directory which imports the default values from the provided template for possible customization:

     echo 'from chronam.settings_template import *' > /opt/chronam/settings.py
    
  2. Ensure that the DJANGO_SETTINGS_MODULE environment variable is set to chronam.settings before you run a Django management command. This can be set as a user-wide default in your ~/.profile, but the recommended way is simply to make it part of the virtualenv activation process:

     echo 'export DJANGO_SETTINGS_MODULE=chronam.settings' >> /opt/chronam/ENV/bin/activate
    
  3. Add your database credentials to the settings.py file, following the standard Django settings documentation (the values below match the database created earlier; substitute your real password):

     DATABASES = {
         'default': {
             'ENGINE': 'django.db.backends.mysql',
             'NAME': 'chronam',
             'USER': 'chronam',
             'HOST': 'localhost',
             'PASSWORD': 'pick_one',
         }
     }
    

You should never edit the settings_template.py file, since it may change in the next release, but you may wish to periodically review the list of changes to that file in case you need to update your local settings.
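Putting the pieces together, a complete minimal settings.py might look like the sketch below. DATABASES is standard Django; the SOLR setting name is an assumption based on common chronam deployments, so verify it against settings_template.py:

# /opt/chronam/settings.py -- minimal sketch
from chronam.settings_template import *

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'chronam',
        'USER': 'chronam',
        'HOST': 'localhost',
        'PASSWORD': 'pick_one',
    }
}

# Assumed setting name -- check settings_template.py for the exact name
# your version uses.
SOLR = 'http://localhost:8983/solr'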

Next you will need to initialize the database schema and load some initial data:

django-admin.py migrate
django-admin.py loaddata initial_data languages
django-admin.py chronam_sync --skip-essays

And finally you will need to collect static files (stylesheets, images) so that they can be served by Apache in production settings:

django-admin.py collectstatic --noinput

Load Data

As mentioned above, the NDNP data that awardees create and ship to the Library of Congress is in the public domain and is made available on the Web as batches. Each batch contains newspaper issues for one or more newspaper titles. To use chronam you will need to have some of this batch data to load. If you are an awardee you probably have this data on hand already, but if not you can use a tool like wget to bulk download the batches. For example:

cd /srv/chronam/batches/
wget --recursive --no-parent --no-host-directories --cut-dirs 2 --reject index.html* https://chroniclingamerica.loc.gov/data/batches/uuml_thys_ver01/

In order to load data you will need to run the load_batch management command, passing it the full path to the batch directory. So, assuming you have downloaded the uuml_thys_ver01 batch as above, you will run:

django-admin.py load_batch /srv/chronam/batches/uuml_thys_ver01

If this is a new server, you may need to start the web server:

sudo service httpd start

After this completes you should be able to view the batch in the batches report via the Web:

http://www.example.org/batches/

Caching

After loading data, you will need to clear the cache. If you are using a reverse proxy (like Varnish) you will need to clear that as well, along with any CDN in front of it. Below is a list of URLs that should be cleared based on what content you are loading.

All pages that contain an LCCN are tagged with that LCCN in the cache headers. This allows purging by a specific LCCN tag when there is an update to a batch.
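For example, if your Varnish VCL exposes a purge hook keyed on that tag, a script can invalidate everything for one LCCN in a single request. The PURGE method, port, and X-Purge-Lccn header below are assumptions about your proxy configuration, not chronam behavior:

import httplib

# Ask the proxy to drop every cached object tagged with this LCCN.
conn = httplib.HTTPConnection('localhost', 6081)
conn.request('PURGE', '/', headers={'X-Purge-Lccn': 'sn88085187'})
resp = conn.getresponse()
print('%s %s' % (resp.status, resp.reason))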

List of URLs to purge when loading new batch

  • All URLs tagged with lccn=<LCCN>
  • All URLs matching these patterns (see the sketch after this list):
    chroniclingamerica.loc.gov/tabs
    chroniclingamerica.loc.gov/sitemap*
    chroniclingamerica.loc.gov/frontpages*
    chroniclingamerica.loc.gov/titles*
    chroniclingamerica.loc.gov/states*
    chroniclingamerica.loc.gov/counties*
    chroniclingamerica.loc.gov/states_counties*
    chroniclingamerica.loc.gov/cities*
    chroniclingamerica.loc.gov/batches/summary*
    chroniclingamerica.loc.gov/reels*
    chroniclingamerica.loc.gov/reel*
    chroniclingamerica.loc.gov/essays*
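A sketch of automating the list above, assuming your proxy maps an HTTP BAN method onto a ban against req.url (the method name and port are configuration-dependent assumptions):

import httplib

# One request per URL prefix; the trailing * in the list above is covered
# if the proxy's ban expression matches everything beneath the prefix.
PREFIXES = [
    '/tabs', '/sitemap', '/frontpages', '/titles', '/states', '/counties',
    '/states_counties', '/cities', '/batches/summary', '/reels', '/reel',
    '/essays',
]

conn = httplib.HTTPConnection('localhost', 6081)
for prefix in PREFIXES:
    conn.request('BAN', prefix, headers={'Host': 'chroniclingamerica.loc.gov'})
    resp = conn.getresponse()
    resp.read()  # drain the response so the connection can be reused
    print('%s %s' % (prefix, resp.status))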
    

List of URLs to purge when loading new Awardees

  • All URLs matching chroniclingamerica.loc.gov/awardees*

List of URLs to purge when loading new basic data

  • All URLs matching chroniclingamerica.loc.gov/institutions*

List of URLs to purge when loading code

  • All URLs matching these patterns:
    chroniclingamerica.loc.gov/ocr
    chroniclingamerica.loc.gov/about
    chroniclingamerica.loc.gov/about/api
    chroniclingamerica.loc.gov/help
    

Run Unit Tests

For the unit tests to work you will need:

  • to have the uuml_thys_ver01 batch available; you can use the wget command in the previous section to get it
  • a local Solr instance running
  • a local MySQL database
  • access to the Essay Editor feed

After that you should be able to:

cd /opt/chronam/
django-admin.py test chronam.core.tests --settings=chronam.settings_test

License

This software is in the Public Domain.

chronam's People

Contributors

acdha, cclauss, dbrunton, dcmcand, dependabot[bot], dillonpeterson, edsu, eikeon, jechols, johnscancella, keshavmagge, kzwa, luisb, morskyjezek, myusuf, nwy, rioh, tongwang

chronam's Issues

in advanced search, only the last state is "sticky"

Example URL with multiple search parameters:

http://chroniclingamerica.loc.gov/search/pages/results/?state=Illinois&state=Indiana&state=Minnesota&state=Ohio&dateFilterType=yearRange&date1=1890&date2=1922&language=&ortext=&andtext=&phrasetext=William+Foote&proxtext=&proxdistance=5&rows=20&searchType=advanced

In particular, note the multiple "state=" options in the URL.

Now click on the "advanced search" tab. Years are properly prefilled, as is "phrase". However with the "Select State(s)" box, only Ohio is selected.
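The symptom is consistent with the form being re-populated via request.GET.get('state'), which returns only the last value of a repeated parameter. A sketch of the distinction (view and variable names are illustrative, not chronam's actual code):

# Django QueryDict keeps every value for a repeated parameter, but .get()
# returns only the last one -- re-populate multi-selects with .getlist().
def advanced_search(request):
    states = request.GET.getlist('state')  # ['Illinois', 'Indiana', 'Minnesota', 'Ohio']
    state = request.GET.get('state')       # 'Ohio' -- matches the bug above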

newspaper page view: hide icon text from mobile devices

The newspaper page view looks great on mobile devices like the iPhone, but I think it would look even better if we could have some of the text around the menu icons disappear (responsive design) on devices that don't have a lot of screen real estate. That way the menu bar would appear on one line instead of broken up across two. I think we would just need to wrap Page, Issue, Text, PDF, JP2 and Clip Image in spans of the right class.

Reexamine tile size

The default tile size is 512. We're down at 256 right now. We should at least go back to 512, and possibly higher.

batch load instructions

The install documents apparently don't reflect how batch loading currently works. According to Nathan Yarasavage, the wget command pulls the data down into a batches subdirectory of /opt/chronam/data/batches. Also, a full path is now required for batch loading, and I think the instructions may still use just the batch name.

parallelize ocr dump process

The last time it ran, the process that generates the OCR bulk downloads took ~2 weeks. This is partly because it works one batch at a time. It ought to be fairly easy to parallelize the dump process with celery.

This is an intentional duplicate of Trac ticket #1254.
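A minimal sketch of the celery approach (the task and helper names are illustrative, not chronam's actual dump code):

from celery import shared_task

@shared_task
def dump_ocr_for_batch(batch_name):
    # Hypothetical helper that writes the OCR bulk download for one batch.
    build_ocr_dump(batch_name)

def dump_all(batch_names):
    # Queue one task per batch; workers then run them in parallel.
    for name in batch_names:
        dump_ocr_for_batch.delay(name)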

installation problem w/ lucene-memory.jar

As reported by Chris Ehrman who was installing on Ubuntu:

Command:
sudo ln -s /usr/share/java/lucene-memory.jar /usr/share/solr/WEB-INF/lib/lucene-memory.jar

Error message:
ln: failed to create symbolic link `/usr/share/solr/WEB-INF/lib/lucene-memory.jar': No such file or directory

Directory - using 338$a when 245$h not present

Summary: use the 338$a instead of the 245$h if the 245$h is not present

Full Description:
A change in cataloging practice has led to this request.

In RDA practice, encoding "material format" in the 245$h ("medium") is deprecated in favor of encoding it in the 338$a (carrier type).

In records created or updated since the RDA changeover, you will see "online resource" appear in the 338$a, with nothing in the 245$h.

In addition to displaying the 245 $h (in search lists, title display, etc.) when present, please use the contents of 338$a, when it is available, in the same way.

(Former internal trac ticket 1381)
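A sketch of that selection logic with pymarc (a hypothetical helper; chronam's actual extractor may structure this differently):

def material_format(record):
    # Prefer 245$h ("medium") when present; otherwise fall back to the
    # first 338$a (RDA carrier type), per the request above.
    f245 = record['245']
    if f245 is not None and f245['h']:
        return f245['h']
    for f338 in record.get_fields('338'):
        if f338['a']:
            return f338['a']
    return None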

Language field initial data and schema.xml requirements need to be updated for a clean install

When loading a batch using the most recent software from Git on Ubuntu, I kept getting an error about "language matching query does not exist" (see the paste below). When I manually added a row for "eng" to the MySQL language table, I then got a Solr error saying a required field wasn't found. So I think this is because language is a required field in Solr's schema.xml AND because there is no default "eng"/English row in the MySQL language table on a fresh install. I got it working by adding an "eng" row in MySQL and by making the language value optional in schema.xml. I don't know if this will cause problems down the road, but I thought I would mention that it is an issue.

Hope that helps

batch_loader.py, starting at line 428:

for lang, text in lang_text.iteritems():
    try:
        language = models.Language.objects.get(Q(code=lang) | Q(lingvoj__iendswith=lang))
    except models.Language.DoesNotExist:
        # default to english as per requirement
        language = models.Language.objects.get(code='eng')
    ocr.language_texts.create(language=language, text=text)
page.ocr = ocr

Here's the output from my batch load before I made the Solr field optional and added the "eng" row to MySQL:

(ENV)ubuntu@ip-10-119-97-242:/opt/chronam/data$ django-admin.py load_batch /opt/chronam/data/batches/batch_vi_affirmed_ver01
INFO:root:loading batch at /opt/chronam/data/batches/batch_vi_affirmed_ver01
INFO:chronam.core.batch_loader:loading batch: batch_vi_affirmed_ver01
INFO:rdflib:version: 3.4.0
INFO:chronam.core.views.image:NativeImage backend '%s' not available.
INFO:chronam.core.views.image:NativeImage backend '%s' not available.
INFO:chronam.core.views.image:Using NativeImage backend 'graphicsmagick'
INFO:chronam.core.batch_loader:Assigned page sequence: 1
INFO:chronam.core.batch_loader:Saving page. issue date: 1886-07-17 00:00:00, page sequence: 1
ERROR:chronam.core.batch_loader:unable to load batch: Language matching query does not exist.
ERROR:chronam.core.batch_loader:Language matching query does not exist.
Traceback (most recent call last):
  File "/opt/chronam/core/batch_loader.py", line 166, in load_batch
    issue = self._load_issue(mets_url)
  File "/opt/chronam/core/batch_loader.py", line 283, in _load_issue
    page = self._load_page(doc, page_div, issue)
  File "/opt/chronam/core/batch_loader.py", line 405, in _load_page
    self.process_ocr(page)
  File "/opt/chronam/core/batch_loader.py", line 433, in process_ocr
    language = models.Language.objects.get(code='eng')
  File "/opt/chronam/ENV/local/lib/python2.7/site-packages/django/db/models/manager.py", line 131, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/opt/chronam/ENV/local/lib/python2.7/site-packages/django/db/models/query.py", line 366, in get
    % self.model._meta.object_name)
DoesNotExist: Language matching query does not exist.
WARNING:root:no OcrDump to delete for batch_vi_affirmed_ver01 (Library of Virginia; Richmond, VA)
ERROR:chronam.core.management.commands.load_batch:unable to load batch: Language matching query does not exist.
Traceback (most recent call last):
  File "/opt/chronam/core/management/commands/load_batch.py", line 39, in handle
    batch = loader.load_batch(batch_name)
  File "/opt/chronam/core/batch_loader.py", line 195, in load_batch
    raise BatchLoaderException(msg)
BatchLoaderException: unable to load batch: Language matching query does not exist.
Error: unable to load batch. check the load_batch log for clues
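Until the initial data is fixed, a hedged workaround is to create the missing row yourself from django-admin.py shell (the name field is an assumption about the Language model; check chronam.core.models for the real fields):

from chronam.core import models

# Create the English fallback row that process_ocr expects.
models.Language.objects.get_or_create(code='eng', defaults={'name': 'English'})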

Directory - holdings display does not include all locations and formats

Update holdings types & display to handle changes in holdings records.
Example: this page should display like the following with the new holdings: http://chroniclingamerica.loc.gov/lccn/sn88085187/holdings/

"HTML display should look as follows: (using different example to show multiple formats on one institution.)"

<hr>
HOLDING: Tacoma Pub Libr, Tacoma, WA

View more titles from this institution

Available as: Microfilm Service Copy

Dates:

    <1894:8:19>
    <1909:1:1-1918:6:29>
    <1923:9:12, 10:1-1934:3:31>

Last updated: 08/1990

<hr>
HOLDING: Univ of Washington Libr, Seattle, WA

View more titles from this institution

Available as: Microfilm Service Copy

Dates:

    <1903:12:21-1904:9:21>
    <1907:5:18>
    <1918:4:13-1923:11:13>
    <1929:12:2-1930:8:30>
    <1930:10:1-1933:11:23>
    <1934:2:1-1943:4:12>

Last updated: 07/1989

Available as: Microfilm Master

Dates:

    <1903:12:21-1904:9:21>
    <1907:5:18>
    <1918:4:13-1923:11:13>
    <1929:12:2-1930:8:30>
    <1930:10:1-1933:11:23>
    <1934:2:1-1943:4:12>

Last updated: 07/1989
<hr>

Search highlights in OCR

ChronAm currently supports hit highlights on the item view, but not on the OCR page. There has been some discussion of adding highlights to the OCR page as well, such that when someone visits the page with words highlighted and clicks on the OCR page, this functionality persists there.

5 second timeout on SeaDragon

Up the timeout to something higher than 5 seconds, TBD. There doesn't seem to be much upside to having a timeout at all.

Disallow invalid date ranges

If someone picks a start year that is 1860, use a little reflection to either remove or gray out the end years before 1860.
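The graying-out itself belongs in the UI, but a complementary server-side check could reject inverted ranges outright. A sketch with a Django form (chronam's actual search form class may differ):

from django import forms

class DateRangeForm(forms.Form):
    date1 = forms.IntegerField(required=False)  # start year
    date2 = forms.IntegerField(required=False)  # end year

    def clean(self):
        cleaned = super(DateRangeForm, self).clean()
        d1, d2 = cleaned.get('date1'), cleaned.get('date2')
        if d1 and d2 and d2 < d1:
            raise forms.ValidationError('End year cannot precede start year.')
        return cleaned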

Open Source JP2 viewer

Need to provide some baseline functionality for JP2 rendering using an open source tool.

clip image link broken

The "clip image" link on the page view appears to be broken, maybe since the change to the tiling URLs?

advanced search (page)

The advanced search functionality has not yet been added to the generic skin in core.

Fix duplicate requests made during title pull

generate_requests, which creates the initial queue of requests, has already executed each request by the time the list of requests is compiled.

When the requests are passed to grab_content, it sends the request again.
Fix it.

cts functionality and minicts dependency

Users of Chronicling America outside of the Library of Congress have no need for Content Transfer Services (CTS) related functionality. They also won't be able to install the minicts module since it is currently in LC's private git repository. At the very least minicts should be removed from the requirements.txt, and perhaps the tasks and management commands related to CTS should be moved from chronam.core into chronam.loc.

All Digitized Newspapers - skip leading articles in title alphabetization

skip "the", "le," "la", "das", etc.

fwiw, this is already done in the Directory where titles are alphabetized regardless of their leading article.


More info:
This also goes for ignoring capitalization.

Currently, all the capitalized titles are alphabetized above lower-cased titles. Please ignore caps.

(Since title capitalization is inconsistent. If capitalization can be made consistent, let's make another ticket.)

Titles are mostly lower-cased per MARC cataloging rules, rather than standard proper-noun capitalization. But I think maybe someone ran a "capitalize" pass on the directory titles at some point? Curt might know.


Formerly internal trac ticket 1369
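A sketch of a sort key that skips leading articles and ignores case, as requested above (the article list is illustrative and would need to cover every language in the directory):

LEADING_ARTICLES = ('the ', 'a ', 'an ', 'le ', 'la ', 'el ', 'das ', 'der ', 'die ')

def title_sort_key(title):
    # Lower-case the title and drop a single leading article, if any.
    key = title.strip().lower()
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key

print(sorted(['The Sun', 'Der Tag', 'la opinion', 'Alaska Citizen'],
             key=title_sort_key))
# -> ['Alaska Citizen', 'la opinion', 'The Sun', 'Der Tag']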

Add 852z (notes) to extractor & front-end display

Refs internal trac ticket (more details there): #1363

The record should look something like this on ChronAm:

SUMMARY HOLDING: Mercer Cnty Libr, Lawrenceville, NJ

Available as: Original
Retains current 2 months [Microfilm=1964- 0,4]

Last updated: 05/1990

chronam_sync dependent on access to ndnp-essays.rdc.lctl.gov

The chronam_sync management command will not work outside of the Library of Congress since it expects to be able to grab an Atom feed from ndnp-essays.rdc.lctl.gov, which is not publicly available. We have (at least) a few options for fixing this:

  • change the management command to not poll for essays by default
  • periodically make the essays part of the chronam repository (they aren't big) for loading locally
  • have chroniclingamerica.loc.gov make a feed of the essays available as well
  • make the essay editor publicly available at ndnp-essays.rdc.lctl.gov, or another awardee institution

Thoughts?

Fix page count on pages where it is missing

I am not sure why, but the page count is missing from some pages in the newspaper directory.

When I refer to page count, I am referring to this statement:
"Pages currently available: 6,025,474"


It seems to me that state- and language-specific newspaper list filters have it, but anything filtered by ethnicity does not.
