
lmullen / cchc


America's Public Bible for Computing Cultural Heritage in the Cloud

License: Creative Commons Zero v1.0 Universal

Go 87.93% Makefile 0.97% Dockerfile 4.51% PLpgSQL 0.13% R 6.47%

cchc's Introduction

🚀 Greetings

I'm Lincoln Mullen. I am a historian of American religion and the nineteenth-century United States, often using computational methods for texts and maps. I am history faculty at George Mason University, and Director of Computational History at the Roy Rosenzweig Center for History and New Media (@CHNM on GitHub).

My work on GitHub mostly involves my research projects. These are usually data analytical projects, but I have also developed some software packages.

cchc's People

Contributors

lmullen

Forkers

goldfarb

cchc's Issues

Try to reconnect to channel if connection is lost

cchc-crawler  | time="2021-08-07T00:59:21Z" level=error msg="Error putting item in queue for metadata processing" error="Failed to publish item metadata message: Exception (504) Reason: \"channel/connection is not open\"" item_id="http://www.loc.gov/item/sn83030272/1865-11-01/ed-1/"
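
One way to approach this, sketched below under stated assumptions: wrap the publisher so that a "not open" error makes it drop the dead channel, dial a new one, and retry. The `publisher` interface, `errClosed`, and `flakyChannel` are hypothetical stand-ins for illustration, not the queue library's actual API, and a real version would also wait between dial attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// publisher is a hypothetical stand-in for an open queue channel.
type publisher interface {
	Publish(body string) error
}

// errClosed stands in for the "channel/connection is not open" error.
var errClosed = errors.New("channel/connection is not open")

// reconnectingPublisher re-dials and retries when the channel has closed.
type reconnectingPublisher struct {
	dial     func() (publisher, error) // opens a fresh channel
	current  publisher                 // nil until the first dial
	maxTries int                       // assumed to be at least 1
}

func (r *reconnectingPublisher) Publish(body string) error {
	err := errors.New("no publish attempted")
	for try := 0; try < r.maxTries; try++ {
		if r.current == nil {
			ch, dialErr := r.dial()
			if dialErr != nil {
				err = dialErr
				continue // a real version would wait before re-dialing
			}
			r.current = ch
		}
		if err = r.current.Publish(body); err == nil {
			return nil
		}
		if !errors.Is(err, errClosed) {
			return err // not a connection problem, so do not retry
		}
		r.current = nil // drop the dead channel and dial again
	}
	return fmt.Errorf("publish failed after %d tries: %w", r.maxTries, err)
}

// flakyChannel is a test double: it fails like a closed channel unless healthy.
type flakyChannel struct{ healthy bool }

func (f *flakyChannel) Publish(string) error {
	if f.healthy {
		return nil
	}
	return errClosed
}

func main() {
	dials := 0
	p := &reconnectingPublisher{maxTries: 3, dial: func() (publisher, error) {
		dials++
		return &flakyChannel{healthy: dials > 1}, nil // first channel is dead
	}}
	err := p.Publish("item metadata")
	fmt.Println("error:", err, "dials:", dials)
}
```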

Slow down if rate limited

Sometimes the API gives an HTTP 429 Too Many Requests error even with the rate limiters in place. On receiving such an error, it would probably be best to pause the crawler as a whole for a while rather than just the one request. Otherwise, the IP address might get temporarily locked out.
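
A rough sketch of what a crawler-wide delay could look like: a single gate shared by all workers, tripped by any 429 response and checked before every request. The `pauseGate` type and its method names are made up for this illustration, and the pause length is an assumption rather than anything the API documents.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pauseGate is shared by every fetching goroutine in the crawler.
type pauseGate struct {
	mu    sync.Mutex
	until time.Time // zero value means "not paused"
}

// Trip pauses all callers of Wait for d, never shortening an existing pause.
func (g *pauseGate) Trip(d time.Duration) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if t := time.Now().Add(d); t.After(g.until) {
		g.until = t
	}
}

// Remaining reports how much of the current pause is left (0 if none).
func (g *pauseGate) Remaining() time.Duration {
	g.mu.Lock()
	defer g.mu.Unlock()
	if r := time.Until(g.until); r > 0 {
		return r
	}
	return 0
}

// Wait blocks until any active pause has elapsed.
func (g *pauseGate) Wait() {
	for r := g.Remaining(); r > 0; r = g.Remaining() {
		time.Sleep(r)
	}
}

// onStatus trips the gate when the API answers 429 and reports whether it did.
func (g *pauseGate) onStatus(code int, pause time.Duration) bool {
	if code != http.StatusTooManyRequests {
		return false
	}
	g.Trip(pause)
	return true
}

func main() {
	gate := &pauseGate{}
	gate.onStatus(http.StatusTooManyRequests, 200*time.Millisecond)
	start := time.Now()
	gate.Wait() // every worker would call this before each request
	fmt.Println("resumed after", time.Since(start).Round(10*time.Millisecond))
}
```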

Fix panic on empty collections

This probably happens because the API returns an empty list of collections, or fails to return them at all. The best thing to do would be to add a runtime check on the length of the slice before indexing into it, and to log the problem with the API response if that's not already being captured.

Sample log message:

cchc-crawler  | time="2021-08-30T08:05:38Z" level=info msg="Starting a crawl of all collections"
cchc-crawler  | time="2021-08-30T08:05:38Z" level=debug msg="Fetching all digital collections" url="https://www.loc.gov/collections/?at%21=aka%2Cbreadcrumbs%2Cbrowse%2Ccategories%2Ccontent%2Ccontent_is_post%2Cexpert_resources%2Cfacet_trail%2Cfacet_views%2Cfacets%2Cfeatured_items%2Cform_facets%2Clegacy-url%2Cnext%2Cnext_sibling%2Coptions%2Coriginal_formats%2Cpages%2Cpartof%2Cprevious%2Cprevious_sibling%2Cresearch-centers%2Cshards%2Csite_type%2Csubjects%2Ctimeline_1852_1880%2Ctimeline_1881_1900%2Ctimeline_1901_1925%2Ctimestamp%2Ctopics%2Cviews&c=1000&fa=subject_topic%3Aamerican+history&fo=json"
cchc-crawler  | panic: runtime error: index out of range [0] with length 0
cchc-crawler  |
cchc-crawler  | goroutine 18 [running]:
cchc-crawler  | main.Collection.Save(0x0, 0x0, 0x0, 0x17fe1, 0x0, 0x0, 0x0, 0x1, 0x2719c40, 0xed8bd183b, ...)
cchc-crawler  | 	/app/collections.go:80 +0x559
cchc-crawler  | main.StartFetchingCollections(0xc0001164e0)
cchc-crawler  | 	/app/entry-all-collections.go:28 +0x98
cchc-crawler  | created by main.main
cchc-crawler  | 	/app/main.go:56 +0x125
cchc-crawler  | time="2021-08-30T08:06:39Z" level=info msg="Starting the LOC.gov API crawler"
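
The runtime check suggested above might look something like this. The trace only shows that `Collection.Save` indexed element 0 of an empty slice, so the `apiCollection` type and its `Topics` field are placeholders invented for this sketch, not the crawler's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// apiCollection is a hypothetical, trimmed-down view of one collection.
type apiCollection struct {
	Title  string
	Topics []string
}

var errEmptyResponse = errors.New("API returned no collections")

// checkCollections lets the caller log and skip instead of panicking later.
func checkCollections(cs []apiCollection) error {
	if len(cs) == 0 {
		return errEmptyResponse
	}
	return nil
}

// firstTopic guards the index that would otherwise panic on an empty slice.
func firstTopic(c apiCollection) (string, bool) {
	if len(c.Topics) == 0 {
		return "", false
	}
	return c.Topics[0], true
}

func main() {
	var fromAPI []apiCollection // simulate an empty API response
	if err := checkCollections(fromAPI); err != nil {
		fmt.Println("skipping crawl:", err)
	}
	if _, ok := firstTopic(apiCollection{Title: "untitled"}); !ok {
		fmt.Println("collection has no topics; saving without one")
	}
}
```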

Crawler interrupts

The crawler handles SIGINT and the like well enough. But it could be better about closing its connections to resources on shutdown, and the signals should be trapped explicitly, just as good practice.

The RRCHNM Data API is an example of how to handle this.
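
For reference, a minimal sketch of trapping the signals with the standard library's `signal.NotifyContext` and then closing resources in reverse order. This is not taken from the RRCHNM Data API; the `closeAll` helper and the two placeholder closers are assumptions for illustration, and the short timer stands in for the crawl itself.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// closeAll runs cleanup functions in reverse order of registration and
// collects any errors instead of stopping at the first one.
func closeAll(closers []func() error) []error {
	var errs []error
	for i := len(closers) - 1; i >= 0; i-- {
		if err := closers[i](); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	// ctx is cancelled when the process receives SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Placeholder resources; a real crawler would close its pool and queue here.
	closers := []func() error{
		func() error { fmt.Println("closing database pool"); return nil },
		func() error { fmt.Println("closing message queue"); return nil },
	}

	select {
	case <-ctx.Done():
		fmt.Println("interrupted, shutting down")
	case <-time.After(100 * time.Millisecond): // stands in for the crawl finishing
		fmt.Println("work finished")
	}

	for _, err := range closeAll(closers) {
		fmt.Fprintln(os.Stderr, "cleanup error:", err)
	}
}
```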

Nil pointer problem with saving languages

cchc-language-detector-2  | panic: runtime error: invalid memory address or nil pointer dereference
cchc-language-detector-2  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7a8b6b]
cchc-language-detector-2  |
cchc-language-detector-2  | goroutine 2738 [running]:
cchc-language-detector-2  | github.com/lmullen/cchc/common/results.(*Repo).SaveLanguages(0xc00000e1f0, 0x6856968, 0xc081a43b60, 0xb24667dedf490d90, 0xd9a48a9bc8032496, 0xc0117f1920, 0x23, 0xc000800db0, 0x0, 0x0)
cchc-language-detector-2  | 	/cchc/common/results/repository_pgx.go:47 +0xab
cchc-language-detector-2  | main.processDocument(0xc0b2a96d00, 0x0, 0x0)
cchc-language-detector-2  | 	/cchc/language-detector/process-document.go:59 +0x4a3
cchc-language-detector-2  | main.processJobs(0x68568f8, 0xc000215300, 0xc000024510)
cchc-language-detector-2  | 	/cchc/language-detector/process-jobs.go:40 +0x47b
cchc-language-detector-2  | created by main.main
cchc-language-detector-2  | 	/cchc/language-detector/main.go:53 +0x30a
cchc-language-detector-2  | panic: runtime error: invalid memory address or nil pointer dereference
cchc-language-detector-2  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7a8b6b]
cchc-language-detector-2  |
cchc-language-detector-2  | goroutine 311 [running]:
cchc-language-detector-2  | github.com/lmullen/cchc/common/results.(*Repo).SaveLanguages(0xc000386018, 0x6856968, 0xc0b8b0bc80, 0xa84a30d203c74c0b, 0x2dd41ab07451f781, 0xc00093b5f0, 0x23, 0xc0014a3110, 0x0, 0x0)
cchc-language-detector-2  | 	/cchc/common/results/repository_pgx.go:47 +0xab
cchc-language-detector-2  | main.processDocument(0xc059645180, 0x0, 0x0)
cchc-language-detector-2  | 	/cchc/language-detector/process-document.go:59 +0x4a3
cchc-language-detector-2  | main.processJobs(0x68568f8, 0xc00027d300, 0xc000398010)
cchc-language-detector-2  | 	/cchc/language-detector/process-jobs.go:40 +0x47b
cchc-language-detector-2  | created by main.main
cchc-language-detector-2  | 	/cchc/language-detector/main.go:53 +0x30a

Retry on API errors

API errors are not uncommon, but they are hard to reproduce. HTTP 500 Internal Server Error and HTTP 503 Service Unavailable seem to be the most common.

The crawler should gracefully retry those requests when it can rather than logging them and skipping them. That's especially important because in crawling the pages, if one page fails the crawler currently won't try subsequent pages. That would mean that a big collection might be entirely lost.

Probably there should be better logging of those errors too so I can understand why they are happening.
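
A sketch of the retry idea, assuming only 500 and 503 are retried and that the delay doubles between attempts. The function names, the one-second base delay, and the 30-second cap are illustrative choices, not values from the crawler. Each failed attempt is printed so the errors are at least visible.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryable reports whether a status code is one of the transient errors
// named in this issue.
func retryable(code int) bool {
	return code == http.StatusInternalServerError ||
		code == http.StatusServiceUnavailable
}

// backoff returns the delay before retry number attempt (0-based),
// doubling from base and capped at limit.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// getWithRetry fetches url, retrying transport errors and retryable statuses.
// tries is assumed to be at least 1. The caller closes the returned body.
func getWithRetry(client *http.Client, url string, tries int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < tries; attempt++ {
		resp, err := client.Get(url)
		if err == nil && !retryable(resp.StatusCode) {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("HTTP %s", resp.Status)
		} else {
			lastErr = err
		}
		fmt.Printf("attempt %d for %s failed: %v\n", attempt+1, url, lastErr)
		if attempt < tries-1 {
			time.Sleep(backoff(attempt, time.Second, 30*time.Second))
		}
	}
	return nil, fmt.Errorf("giving up on %s after %d tries: %w", url, tries, lastErr)
}

func main() {
	// Print the delay schedule without touching the network.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoff(attempt, time.Second, 30*time.Second))
	}
}
```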

Release strategy

  • Have versioned releases
  • That are built and pushed locally (to avoid Git LFS problems)
  • Which can be controlled by an environment variable

Automate running migrations

This should be in the common/db package, and it should run the migrations only if necessary. It should take a database advisory lock first to make sure no other package is trying to migrate. But this will probably have a "stop the world" effect, hence the particular need to check first.
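
The check, lock, re-check sequence could be sketched as follows. `pg_advisory_lock` and `pg_advisory_unlock` are real PostgreSQL functions, but everything else here is assumed for illustration: the `session` interface (standing in for one dedicated connection, since session-level advisory locks belong to the connection that took them), the lock key, and the `recordingSession` double used in place of a database.

```go
package main

import "fmt"

// session is a hypothetical abstraction over ONE dedicated database connection.
type session interface {
	Exec(query string, args ...interface{}) error
}

// migrationLockKey is an arbitrary key chosen for this sketch.
const migrationLockKey int64 = 20210807

// migrateIfNeeded checks cheaply first, then takes the advisory lock,
// checks again, and only then runs the migrations.
func migrateIfNeeded(s session, pending func() (bool, error), migrate func() error) (ran bool, err error) {
	need, err := pending()
	if err != nil || !need {
		return false, err // nothing to do, so never "stop the world"
	}
	if err = s.Exec("SELECT pg_advisory_lock($1)", migrationLockKey); err != nil {
		return false, fmt.Errorf("taking migration lock: %w", err)
	}
	defer func() {
		uerr := s.Exec("SELECT pg_advisory_unlock($1)", migrationLockKey)
		if uerr != nil && err == nil {
			err = uerr
		}
	}()
	// Another process may have migrated while we waited for the lock.
	if need, err = pending(); err != nil || !need {
		return false, err
	}
	return true, migrate()
}

// recordingSession is a test double that records queries instead of running them.
type recordingSession struct{ queries []string }

func (r *recordingSession) Exec(query string, args ...interface{}) error {
	r.queries = append(r.queries, query)
	return nil
}

func main() {
	s := &recordingSession{}
	ran, err := migrateIfNeeded(s,
		func() (bool, error) { return true, nil },
		func() error { fmt.Println("running migrations"); return nil })
	fmt.Println("ran:", ran, "err:", err, "queries:", s.queries)
}
```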

Handle items which have more than one page of full text

Here is an example item (from Chronicling America) that has full text, but where the item (the newspaper issue) has multiple full-text files, one for each page.

Perhaps split full text into a separate table?

ID: http://www.loc.gov/item/2012218613/1884-08-13/ed-1/

{
  "item": {
    "id": "http://www.loc.gov/item/2012218613/1884-08-13/ed-1/",
    "aka": [
      "http://www.loc.gov/item/2012218613/1884-08-13/ed-1/",
      "http://www.loc.gov/resource/2012218613/1884-08-13/ed-1/"
    ],
    "url": "https://www.loc.gov/item/2012218613/1884-08-13/ed-1/",
    "date": "1884-08-13",
    "item": {
      "date": "1881",
      "batch": [
        "scu_henryjohnson_ver01"
      ],
      "dates": [
        "1884-08-13"
      ],
      "genre": [
        "Newspapers"
      ],
      "notes": [
        "Weekly",
        "Began in 1881. Ceased in 1900?",
        "Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.",
        "Description based on: Oct. 5, 1881.",
        "Latest issue consulted: Vol. 54, no. 26 (Dec. 19, 1900).",
        "News and herald (Winnsboro, S.C. : 1901) 2333-1763 (DLC)  2012218612 (OCoLC)783781677"
      ],
      "title": "The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
      "format": [
        "newspaper"
      ],
      "medium": "4 pages",
      "language": [
        "eng"
      ],
      "location": [
        "Fairfield County (S.C.)",
        "South Carolina--Fairfield County"
      ],
      "raw_lccn": "  2012218613",
      "subjects": [
        "Fairfield County (S.C.)--Newspapers",
        "South Carolina--Fairfield County",
        "United States--South Carolina--Fairfield--Winnsboro"
      ],
      "call_number": [
        "Newspaper"
      ],
      "date_issued": "1884-08-13",
      "other_title": [
        "News and herald"
      ],
      "reel_numbers": [
        "00237288403"
      ],
      "other_formats": [
        "https://tile.loc.gov/storage-services/service/ndnp//scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/1884081301_1.xml",
        "http://lccn.loc.gov/2012218613/marcxml"
      ],
      "digitized_label": "Present",
      "newspaper_title": [
        "The Fairfield news and herald."
      ],
      "contributor_names": [
        "University of South Carolina"
      ],
      "created_published": [
        "Winnsboro, S.C., August 13, 1884"
      ],
      "place_of_publication": "Winnsboro, S.C.",
      "library_of_congress_control_number": "2012218613"
    },
    "site": [
      "chroniclingamerica"
    ],
    "type": [
      "newspaper"
    ],
    "batch": [
      "scu_henryjohnson_ver01"
    ],
    "dates": [
      {
        "1884": "https://www.loc.gov/search/?dates=1884/1884&fo=json"
      }
    ],
    "genre": [
      "Newspapers"
    ],
    "group": [
      "ndnp/scu",
      "university-of-south-carolina-columbia-sc-awardee"
    ],
    "index": 1,
    "notes": [
      "Weekly",
      "Began in 1881. Ceased in 1900?",
      "Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.",
      "Description based on: Oct. 5, 1881.",
      "Latest issue consulted: Vol. 54, no. 26 (Dec. 19, 1900).",
      "News and herald (Winnsboro, S.C. : 1901) 2333-1763 (DLC)  2012218612 (OCoLC)783781677"
    ],
    "score": 8.842241,
    "title": "The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
    "format": [
      {
        "newspaper": "https://www.loc.gov/search/?fa=original_format:newspaper&fo=json"
      }
    ],
    "medium": "4 pages",
    "number": [
      "1",
      "2012218613",
      "00237288403"
    ],
    "partof": [
      {
        "url": "https://www.loc.gov/search/?fa=partof:the+fairfield+news+and+herald+%28winnsboro,+s.c.%29+1881-1900&fo=json",
        "count": 711,
        "title": "the fairfield news and herald (winnsboro, s.c.) 1881-1900"
      },
      {
        "url": "https://www.loc.gov/collections/chronicling-america/?fo=json",
        "count": 266086,
        "title": "chronicling america"
      },
      {
        "url": "https://www.loc.gov/search/?fa=partof:serial+and+government+publications+division&fo=json",
        "count": 275151,
        "title": "serial and government publications division"
      }
    ],
    "rights": [
      "<p>The Library of Congress believes that the newspapers in Chronicling America are in the public domain or have no known copyright restrictions.  Newspapers published in the United States more than 95 years ago are in the public domain in their entirety. Any newspapers in Chronicling America that were published less than 95 years ago are also believed to be in the public domain, but may contain some copyrighted third party materials. Researchers using newspapers published less than 95 years ago should be alert for modern content (for example, registered and renewed for copyright and published with notice) that may be copyrighted.  Responsibility for making an independent legal assessment of an item and securing any necessary permissions ultimately rests with persons desiring to use the item.</p>\n<p>The NEH awardee responsible for producing each digital object is presented in the Chronicling America page display, below the page image  – e.g. Image produced by the Library of Congress. For more information on current NDNP awardees, see <a href=\"https://www.loc.gov/ndnp/listawardees.html\">https://www.loc.gov/ndnp/listawardees.html</a>.</p>\n<p>For more information on Library of Congress policies and disclaimers regarding rights and reproductions, see <a href=\"https://www.loc.gov/homepage/legal.html\">https://www.loc.gov/homepage/legal.html</a></p>"
    ],
    "subject": [
      "newspapers",
      "fairfield county",
      "fairfield county (s.c.)",
      "fairfield",
      "united states",
      "winnsboro",
      "south carolina"
    ],
    "language": [
      "english"
    ],
    "location": [
      "united states",
      "south carolina",
      "fairfield county",
      "winnsboro",
      "fairfield"
    ],
    "raw_lccn": "  2012218613",
    "shelf_id": "2012218613, 1884-08-13, Edition 1",
    "subjects": [
      {
        "fairfield": "https://www.loc.gov/search/?fa=subject:fairfield&fo=json"
      },
      {
        "fairfield county": "https://www.loc.gov/search/?fa=subject:fairfield+county&fo=json"
      },
      {
        "fairfield county (s.c.)": "https://www.loc.gov/search/?fa=subject:fairfield+county+%28s.c.%29&fo=json"
      },
      {
        "newspapers": "https://www.loc.gov/search/?fa=subject:newspapers&fo=json"
      },
      {
        "south carolina": "https://www.loc.gov/search/?fa=subject:south+carolina&fo=json"
      },
      {
        "united states": "https://www.loc.gov/search/?fa=subject:united+states&fo=json"
      },
      {
        "winnsboro": "https://www.loc.gov/search/?fa=subject:winnsboro&fo=json"
      }
    ],
    "_version_": 1664850171677114400,
    "campaigns": [],
    "digitized": true,
    "image_url": [
      "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:3.125/0/default.jpg#h=342&w=246",
      "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:6.25/0/default.jpg#h=684&w=492"
    ],
    "item_type": "issue",
    "languages": [
      {
        "english": "https://www.loc.gov/search/?fa=language:english&fo=json"
      }
    ],
    "locations": [
      {
        "fairfield": "https://www.loc.gov/search/?fa=location:fairfield&fo=json"
      },
      {
        "fairfield county": "https://www.loc.gov/search/?fa=location:fairfield+county&fo=json"
      },
      {
        "south carolina": "https://www.loc.gov/search/?fa=location:south+carolina&fo=json"
      },
      {
        "united states": "https://www.loc.gov/search/?fa=location:united+states&fo=json"
      },
      {
        "winnsboro": "https://www.loc.gov/search/?fa=location:winnsboro&fo=json"
      }
    ],
    "mime_type": [
      "image/jp2",
      "application/pdf",
      "text/xml",
      "image/jpeg"
    ],
    "resources": [
      {
        "url": "https://www.loc.gov/resource/2012218613/1884-08-13/ed-1/",
        "files": 4,
        "image": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:3.125/0/default.jpg",
        "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service"
      }
    ],
    "timestamp": "2020-04-24T10:40:49.058Z",
    "call_number": [
      "Newspaper"
    ],
    "date_issued": "1884-08-13",
    "description": [
      "Winnsboro, S.C."
    ],
    "hassegments": true,
    "number_lccn": [
      "2012218613"
    ],
    "number_reel": [
      "00237288403"
    ],
    "other_title": [
      "News and herald"
    ],
    "contributors": [
      {
        "university of south carolina": "https://www.loc.gov/search/?fa=contributor:university+of+south+carolina&fo=json"
      }
    ],
    "extract_urls": [
      "file:/service/ndnp//scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/1884081301_1.xml#ndnp/scu"
    ],
    "partof_title": [
      "the fairfield news and herald (winnsboro, s.c.) 1881-1900"
    ],
    "reel_numbers": [
      "00237288403"
    ],
    "location_city": [
      "winnsboro"
    ],
    "online_format": [
      "image",
      "pdf",
      "online text"
    ],
    "other_formats": [
      {
        "link": "//tile.loc.gov/storage-services/service/ndnp//scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/1884081301_1.xml",
        "label": "MODSXML Record"
      },
      {
        "link": "//lccn.loc.gov/2012218613/marcxml",
        "label": "MARCXML Record"
      }
    ],
    "location_state": [
      "south carolina"
    ],
    "locations_city": [
      {
        "winnsboro": "https://www.loc.gov/search/?fa=location_city:winnsboro&fo=json"
      }
    ],
    "number_edition": [
      "1"
    ],
    "digitized_label": "Present",
    "display_offsite": true,
    "location_county": [
      "fairfield"
    ],
    "locations_state": [
      {
        "south carolina": "https://www.loc.gov/search/?fa=location_state:south+carolina&fo=json"
      }
    ],
    "newspaper_title": [
      "The Fairfield news and herald."
    ],
    "original_format": [
      "newspaper"
    ],
    "partof_division": [
      "serial and government publications division"
    ],
    "location_country": [
      "united states"
    ],
    "locations_county": [
      {
        "fairfield": "https://www.loc.gov/search/?fa=location_county:fairfield&fo=json"
      }
    ],
    "numeric_shelf_id": 188408131,
    "subject_headings": [
      "Fairfield County (S.C.)--Newspapers",
      "South Carolina--Fairfield County",
      "United States--South Carolina--Fairfield--Winnsboro"
    ],
    "access_restricted": false,
    "contributor_names": [
      "University of South Carolina"
    ],
    "created_published": [
      "Winnsboro, S.C., August 13, 1884"
    ],
    "extract_timestamp": "2020-04-23T19:41:22.635Z",
    "locations_country": [
      {
        "united states": "https://www.loc.gov/search/?fa=location_country:united+states&fo=json"
      }
    ],
    "partof_collection": [
      "chronicling america"
    ],
    "composite_location": [
      "0/united states/",
      "1/united states/south carolina/",
      "2/united states/south carolina/fairfield/",
      "3/united states/south carolina/fairfield/winnsboro/"
    ],
    "dates_of_publication": "1881-1900",
    "place_of_publication": "Winnsboro, S.C.",
    "publication_frequency": [
      "weekly"
    ],
    "library_of_congress_control_number": "2012218613"
  },
  "options": {
    "id": "2012218613/1884-08-13/ed-1/",
    "all": null,
    "clip": null,
    "host": "www.loc.gov",
    "ical": false,
    "iiif": false,
    "item": null,
    "keys": null,
    "port": "443",
    "count": null,
    "dates": null,
    "embed": [],
    "field": null,
    "index": null,
    "items": null,
    "style": null,
    "embed!": [],
    "format": "json",
    "method": "GET",
    "onsite": false,
    "region": "",
    "scheme": "https",
    "sortBy": null,
    "target": null,
    "latlong": null,
    "referer": null,
    "site_id": null,
    "callback": null,
    "distance": null,
    "duration": 0.43028783798217773,
    "language": null,
    "operator": null,
    "resource": "",
    "searchIn": null,
    "segments": null,
    "template": "item/base",
    "attribute": null,
    "delimiter": null,
    "is_portal": null,
    "newSearch": null,
    "path_info": "/item/2012218613/1884-08-13/ed-1/",
    "proxypath": null,
    "site_type": null,
    "solrQuery": "",
    "sortOrder": null,
    "startPage": null,
    "suggested": null,
    "timestamp": 1628367944.321504,
    "attribute!": "more_like_this,related_items,cite_this",
    "cache_tags": [
      "ndnp/scu",
      "university-of-south-carolina-columbia-sc-awardee",
      "gmd.mar",
      "catalog",
      "general-maps",
      "main-catalog",
      "wpalh",
      "federal-writers-project"
    ],
    "digital_id": null,
    "release_id": 123456789,
    "api_version": "1",
    "app_context": null,
    "facetLimits": "",
    "facetPrefix": null,
    "facet_count": null,
    "facet_style": null,
    "request_url": "https://www.loc.gov/item/2012218613/1884-08-13/ed-1/?at%21=more_like_this%2Crelated_items%2Ccite_this&fo=json",
    "searchTerms": "",
    "unionFacets": "",
    "access_group": [
      ""
    ],
    "excludeTerms": null,
    "new_clip_url": false,
    "query_string": "at%21=more_like_this%2Crelated_items%2Ccite_this&fo=json",
    "attribute_map": null,
    "clip_rotation": null,
    "default_count": 25,
    "display_level": null,
    "inputEncoding": "UTF-8",
    "content_filter": null,
    "downloadOption": null,
    "outputEncoding": "UTF-8",
    "redirect_proxy": false,
    "request_params": {
      "fo": [
        "json"
      ],
      "at!": [
        "more_like_this,related_items,cite_this"
      ]
    },
    "access_group_raw": "",
    "clip_image_width": null,
    "redirect_to_item": null,
    "page_has_campaign": false,
    "resource_sequence": null,
    "webcast_permalink": null,
    "application_version": "$Revision$",
    "content_replacement": ""
  },
  "locations": null,
  "resources": [
    {
      "url": "https://www.loc.gov/resource/2012218613/1884-08-13/ed-1/",
      "files": [
        [
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0129.jp2",
            "use": "",
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/info.json",
            "size": 10797093,
            "width": 7886,
            "height": 10952,
            "mimetype": "image/jp2"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0129.pdf",
            "use": "",
            "mimetype": "application/pdf"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0129.xml",
            "use": "",
            "mimetype": "text/xml"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:3.125/0/default.jpg",
            "width": 246,
            "height": 342,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:6.25/0/default.jpg",
            "width": 492,
            "height": 684,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/info.json",
            "size": 10797093,
            "title": "Image 1 of The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
            "width": 7886,
            "height": 10952,
            "language": [
              "English"
            ],
            "mimetype": "application/json",
            "reel_number": "00237288403",
            "section_label": null
          },
          {
            "use": "text",
            "mimetype": "text/plain",
            "fulltext_service": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0129.xml&format=alto_xml&full_text=1",
            "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0129.xml&format=alto_xml"
          }
        ],
        [
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0130.jp2",
            "use": "",
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0130/info.json",
            "size": 10819646,
            "width": 7853,
            "height": 11021,
            "mimetype": "image/jp2"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0130.pdf",
            "use": "",
            "mimetype": "application/pdf"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0130.xml",
            "use": "",
            "mimetype": "text/xml"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0130/full/pct:3.125/0/default.jpg",
            "width": 245,
            "height": 344,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0130/full/pct:6.25/0/default.jpg",
            "width": 490,
            "height": 688,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0130/info.json",
            "size": 10819646,
            "title": "Image 2 of The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
            "width": 7853,
            "height": 11021,
            "language": [
              "English"
            ],
            "mimetype": "application/json",
            "reel_number": "00237288403",
            "section_label": null
          },
          {
            "use": "text",
            "mimetype": "text/plain",
            "fulltext_service": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0130.xml&format=alto_xml&full_text=1",
            "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0130.xml&format=alto_xml"
          }
        ],
        [
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0131.jp2",
            "use": "",
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0131/info.json",
            "size": 10925889,
            "width": 7931,
            "height": 11020,
            "mimetype": "image/jp2"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0131.pdf",
            "use": "",
            "mimetype": "application/pdf"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0131.xml",
            "use": "",
            "mimetype": "text/xml"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0131/full/pct:3.125/0/default.jpg",
            "width": 247,
            "height": 344,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0131/full/pct:6.25/0/default.jpg",
            "width": 495,
            "height": 688,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0131/info.json",
            "size": 10925889,
            "title": "Image 3 of The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
            "width": 7931,
            "height": 11020,
            "language": [
              "English"
            ],
            "mimetype": "application/json",
            "reel_number": "00237288403",
            "section_label": null
          },
          {
            "use": "text",
            "mimetype": "text/plain",
            "fulltext_service": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0131.xml&format=alto_xml&full_text=1",
            "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0131.xml&format=alto_xml"
          }
        ],
        [
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0132.jp2",
            "use": "",
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0132/info.json",
            "size": 10832223,
            "width": 7870,
            "height": 11010,
            "mimetype": "image/jp2"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0132.pdf",
            "use": "",
            "mimetype": "application/pdf"
          },
          {
            "url": "https://tile.loc.gov/storage-services/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0132.xml",
            "use": "",
            "mimetype": "text/xml"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0132/full/pct:3.125/0/default.jpg",
            "width": 245,
            "height": 344,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "url": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0132/full/pct:6.25/0/default.jpg",
            "width": 491,
            "height": 688,
            "levels": 1,
            "mimetype": "image/jpeg"
          },
          {
            "info": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0132/info.json",
            "size": 10832223,
            "title": "Image 4 of The Fairfield news and herald (Winnsboro, S.C.), August 13, 1884",
            "width": 7870,
            "height": 11010,
            "language": [
              "English"
            ],
            "mimetype": "application/json",
            "reel_number": "00237288403",
            "section_label": null
          },
          {
            "use": "text",
            "mimetype": "text/plain",
            "fulltext_service": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0132.xml&format=alto_xml&full_text=1",
            "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service?segment=/service/ndnp/scu/batch_scu_henryjohnson_ver01/data/2012218613/00237288403/1884081301/0132.xml&format=alto_xml"
          }
        ]
      ],
      "image": "https://tile.loc.gov/image-services/iiif/service:ndnp:scu:batch_scu_henryjohnson_ver01:data:2012218613:00237288403:1884081301:0129/full/pct:3.125/0/default.jpg",
      "word_coordinates": "https://tile.loc.gov/text-services/word-coordinates-service"
    }
  ],
  "timestamp": 1628367944752,
  "title_url": "https://www.loc.gov/item/2012218613",
  "next_issue": "https://www.loc.gov/item/2012218613/1884-08-20/ed-1/?fo=json",
  "calendar_url": "https://www.loc.gov/item/2012218613/?st=calendar",
  "previous_issue": "https://www.loc.gov/item/2012218613/1884-08-06/ed-1/?fo=json",
  "articles_and_essays": null
}

Better checks for items that have already been fetched

When items are located in the digital collections, they are added to the queue for fetching. A check happens at that point to make sure they haven't already been fetched. However, it is possible (even likely) that an item belonging to two collections will be added to the queue twice, because there can be a long gap between adding an item to the queue and processing it. So when a message is pulled off the queue, the crawler should check a second time whether the item has already been fetched.

This item and others like it are part of a couple of collections.

Sample log:

cchc-crawler  | time="2021-08-13T14:07:47Z" level=error msg="Error saving item to database" error="Error saving item http://www.loc.gov/item/amss.as200190/ to database: ERROR: duplicate key value violates unique constraint \"resources_pkey\" (SQLSTATE 23505)" id="http://www.loc.gov/item/amss.as200190/"

LOC.gov pagination limits make it impossible to get all of big collections

The pagination limits prevent paging past 100,000 items, so it is impossible to retrieve all of Chronicling America.

A sample log entry from the crawler:

cchc-crawler  | time="2021-08-13T03:47:34Z" level=warning msg="HTTP error when fetching from API" http_code=400 http_error="400 Bad Request" url="https://www.loc.gov/collections/chronicling-america/?at%21=aka%2Cbreadcrumbs%2Cbrowse%2Ccategories%2Ccontent%2Ccontent_is_post%2Cexpert_resources%2Cfacet_trail%2Cfacet_views%2Cfacets%2Cfeatured_items%2Cform_facets%2Clegacy-url%2Cnext%2Cnext_sibling%2Coptions%2Coriginal_formats%2Cpages%2Cpartof%2Cprevious%2Cprevious_sibling%2Cresearch-centers%2Cshards%2Csite_type%2Csubjects%2Ctimeline_1852_1880%2Ctimeline_1881_1900%2Ctimeline_1901_1925%2Ctimestamp%2Ctopics%2Cviews&c=1000&fa=online-format%3Aonline+text&fo=json&sp=101&st=list"

Going to that URL in the pagination does in fact return a 400 error.
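The arithmetic behind the failing request can be sketched as follows. The 100,000 figure comes from this issue; the exact boundary behavior (that the last full page under the limit is still served) is an assumption, and the function names are hypothetical. With `c=1000` items per page, page 100 ends at item 100,000, so `sp=101` is the first page out of reach, which matches the log entry.

```go
package main

// maxReachableItems is the pagination ceiling reported in this issue.
const maxReachableItems = 100000

// lastReachablePage returns the highest 1-indexed page number whose items
// all fall within the ceiling, for a given page size (the API's c parameter).
func lastReachablePage(perPage int) int {
	if perPage <= 0 {
		return 0
	}
	return maxReachableItems / perPage
}

// pageReachable reports whether a page (the API's sp parameter) is expected
// to be served, under the assumption stated above.
func pageReachable(page, perPage int) bool {
	return page >= 1 && page <= lastReachablePage(perPage)
}
```

A crawler could use such a check to stop paging cleanly and log that the collection was truncated, instead of recording an HTTP 400 as an unexplained failure.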

Probably need to ask if there is a way around this.

Cf. #18.
