Abolitionist Speeches

Using Naive Bayes classification and text analysis to prioritise which newspaper articles to examine when searching for speeches made by Black Abolitionist speakers in the UK during the 1800s.

Introduction

Hannah-Rose Murray's project focused on finding new and previously undiscovered or forgotten Black Abolitionist speeches from 1830 to 1895 in the British Library’s digitised nineteenth-century newspaper collection through text and data mining techniques. The project attempted to illuminate and celebrate some of these performances through two events at the Library, which included re-enactments using actors, as well as a walking tour through London highlighting places where performances took place. Finally, an interactive map tried to visualize how Black abolitionist performances reached nearly every corner of Britain.

Researcher: Hannah-Rose Murray

Hannah-Rose Murray was a second-year PhD student in the Department of American and Canadian Studies, University of Nottingham, when she entered the competition. Her research examines the legacy of formerly enslaved African Americans on British society and the different ways they fought British racism.

Notes

OCR text from over 2 million digitised 19th Century newspaper pages was gathered and recompiled from XML originally generated by OCR software (ABBYY FineReader), then uploaded to a secure virtual server to make a queryable text dataset for this and future projects.
A human-curated ‘ground truth’ dataset was created, identifying 460 Black Abolitionist ‘speeches’, 105 articles containing similar newspaper prose and 339 speeches not about abolitionism. This became a ‘training set’ used to help find Black Abolitionist performances in the digitised newspaper collection algorithmically.
A Jupyter notebook (http://jupyter.org/) instance was set up on a virtual server so that successive versions of a classifier could be developed (where necessary in combination with the ‘ground truth’ data) and run against the OCR text dataset in an attempt to retrieve ‘Black Abolitionist performance’ subsets.
Several algorithmically generated datasets believed to identify ‘Black Abolitionist Performances’ were created from the digitised newspaper corpus. We are in discussion with Gale Cengage to provide links to these articles via their new interface called ‘Primary sources’, to make it easier to check whether they actually contain Black Abolitionist performances.
Re-OCR-ed (using ABBYY FineReader 11 - https://www.abbyy.com/en-gb/finereader/) and OCR-corrected (using Overproof - https://overproof.projectcomputing.com/) data were generated from a representative sample of 970 pages from the 19th Century digitised newspaper corpus. The sample included pages containing Black Abolitionist speeches as well as pages with poor, good and excellent quality OCR.
On initial inspection, the re-OCR-ed and OCR-corrected sample datasets showed considerable improvement in OCR text quality compared to the original OCR text generated in 2011 for the dataset. This could be due in part to improvements in OCR software and correction algorithms/methods since 2011.
We believe it would be useful to carry out a small study to measure the percentage improvement in discoverability of text information resulting from the re-OCR-ed and OCR-corrected sample datasets in comparison to the original OCR text. These findings could provide important evidence for the British Library when examining the economic viability of re-OCRing or OCR-correcting our legacy digitised and OCR-ed text corpora. PhD student Amelia Joulain from the Spatial Humanities group at Lancaster has recently been asked to carry out the analysis, and previously gave a presentation at the British Library about this subject on 4-May-2016 (https://goo.gl/s6zYu4).
Two events were organised and videoed through Labs: Black Abolitionists in 19th Century Britain (6-October-2016, https://goo.gl/XeQazq) and Black Abolitionist Walking Tour and Re-enactment (26-November-2016, https://goo.gl/22RS1).
A video interview with Hannah-Rose Murray summarising her work, and videos of the Black Abolitionists events on 6-October-2016 and 26-November-2016, are available through the Labs YouTube channel (http://goo.gl/3cOSBm).

Method

A portion of the project was spent creating more palatable and accessible versions of the (bespoke) XML that contained the OCR-derived text, and measuring the character errors present. The collection was processed quite some time ago, and the state of the art in OCR has progressed since then. The text was flawed but usable, although some newspapers fared noticeably worse than others.
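One common way to put a number on character errors is the character error rate (CER): the Levenshtein distance between the OCR output and a hand-corrected transcription, divided by the length of the transcription. The sketch below is illustrative only. The function names and the sample strings are assumptions made for this example, and it is not necessarily how the errors were measured in this project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance to the reference, normalised by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# Hypothetical OCR output compared with a hand-corrected line.
print(character_error_rate("aboiition of slavcry", "abolition of slavery"))  # 0.1
```

Comparing this figure per newspaper title is one way to see which titles fared worse than others.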

Hannah's gold set of citations to reports and speeches was a crucial component, as were subsequent lists of Chartist speeches and of prose that was categorically not about Black Abolitionists speaking in the UK. A random selection of 60% of the citations was used for training purposes.
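A 60/40 split of this kind can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the fixed seed and the placeholder citation identifiers are assumptions, and the original split may have been drawn differently (for example per category).

```python
import random


def split_citations(citations, train_fraction=0.6, seed=0):
    """Shuffle a copy of the citations and cut it into training and held-out parts."""
    shuffled = list(citations)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical citation identifiers standing in for the gold set.
citations = [f"citation-{n}" for n in range(10)]
training, held_out = split_citations(citations)
print(len(training), len(held_out))  # 6 4
```

Keeping the held-out 40% untouched until the end is what allows it to be used later as a test of the trained profiles.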

A number of known keywords specific to the cause were used to find other relevant words in the training set texts. For the keywords and their common bigrams, lists of words within an edit distance of 1 or 2 (Levenshtein distance) were gathered. The hope is that the OCR software makes similar errors on similar words, so that this list of extra words catches misrecognised forms of the keywords. It would be much cleaner for this step to be done with a different set of data to avoid overfitting, and this is something a future project could look at.
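To illustrate the idea, the sketch below generates every string within one or two single-character edits of a keyword and keeps those that actually occur in a vocabulary taken from the texts. It is an assumption-laden example rather than the repository's implementation: `edits1`, `near_variants` and the sample vocabulary are hypothetical names and data.

```python
from __future__ import annotations


def edits1(word: str, alphabet: str) -> set[str]:
    """Strings reachable from word by one deletion, substitution or insertion.
    Substituting a character with itself means word itself can be included."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + ch + right[1:] for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return deletes | replaces | inserts


def near_variants(keyword: str, vocabulary: set[str], max_distance: int = 2) -> set[str]:
    """Vocabulary words within max_distance edits of keyword, excluding keyword."""
    # Only characters seen in the keyword or the vocabulary can appear on a
    # shortest edit path between them, so this alphabet is sufficient.
    alphabet = "".join(sorted(set(keyword).union(*vocabulary)))
    reachable = {keyword}
    frontier = {keyword}
    for _ in range(max_distance):
        frontier = {v for w in frontier for v in edits1(w, alphabet)} - reachable
        reachable |= frontier
    return (reachable & vocabulary) - {keyword}


# Hypothetical vocabulary containing OCR-damaged forms of a keyword.
vocabulary = {"slavery", "slavcry", "siavcry", "london"}
print(sorted(near_variants("slavery", vocabulary)))  # ['siavcry', 'slavcry']
```

Note that a distance of 2 is generous for short words and can also pull in unrelated real words, which is one reason the resulting lists are best reviewed before use.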

A Naive Bayes classifier was trained on the training set, using features generated from basic textual analysis alongside features generated from the keyword lists (https://github.com/BL-Labs/abolitionistspeeches/blob/master/feature.py). A number of feature profiles were created, each specifying which algorithmic features to suppress or express, to explore their relevance (https://github.com/BL-Labs/abolitionistspeeches/blob/master/training_profiles.py).
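The general shape of this approach can be sketched in a few lines of plain Python. The example below is a multinomial Naive Bayes over binary presence features with add-one (Laplace) smoothing, where a profile is a dictionary of switches that turns feature groups on or off. Everything in it is an assumption for illustration: the keyword list, the feature names, the profile keys and the tiny training set are hypothetical and are not taken from feature.py or training_profiles.py.

```python
import math
from collections import Counter, defaultdict

KEYWORDS = {"slavery", "abolition", "fugitive"}  # hypothetical keyword list


def extract_features(text, profile):
    """Return the set of features present in text, honouring the profile switches."""
    tokens = text.lower().split()
    features = set()
    if profile.get("tokens", True):
        features.update("tok:" + t for t in tokens)
    if profile.get("keywords", True):
        features.update("kw:" + t for t in tokens if t in KEYWORDS)
    return features


class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (1.0 is add-one smoothing)

    def fit(self, feature_sets, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for features, label in zip(feature_sets, labels):
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denominator = (sum(self.feature_counts[label].values())
                           + self.alpha * len(self.vocabulary))
            for feature in features & self.vocabulary:  # skip unseen features
                score += math.log((self.feature_counts[label][feature] + self.alpha)
                                  / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


profile = {"tokens": True, "keywords": True}  # one hypothetical feature profile
training = [
    ("the fugitive slave spoke on abolition", "speech"),
    ("lecture on american slavery by a fugitive", "speech"),
    ("market prices for corn and wheat", "other"),
    ("the chartist meeting on suffrage", "other"),
]
model = NaiveBayes().fit(
    [extract_features(text, profile) for text, _ in training],
    [label for _, label in training],
)
print(model.predict(extract_features("a lecture on slavery and abolition", profile)))  # speech
```

Switching a group off in the profile (for example `{"keywords": False}`) and retraining is a simple way to compare how much each group of features contributes.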

One particular failure of note was including bibliographic metadata as a feature. The gold lists naturally favoured key points in time and certain newspapers, as they were based on famous speakers travelling the UK and being reported on by regional newspapers. This made the classifier treat certain combinations of year and newspaper as highly relevant, and this feature outweighed many of the more subtle features, leading to a lot of false positive results!

A set of 4 related profiles was seemingly the most accurate, and these were tested on the rest of the citation data with a reasonable degree of success. The results of a scan of all newspaper articles in a large date range are included (https://github.com/BL-Labs/abolitionistspeeches/blob/master/deepscan_1845_to_60.csv).
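When testing a profile against held-out citations, precision and recall are a convenient pair of measures: precision drops when there are many false positives, and recall drops when known speeches are missed. The sketch below shows the calculation only. The function name, the label strings and the sample predictions are assumptions, and no figures from the project are reproduced here.

```python
def precision_recall(predicted, actual, positive="speech"):
    """Precision and recall for the positive label over paired predictions."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    false_pos = sum(p == positive and a != positive for p, a in pairs)
    false_neg = sum(p != positive and a == positive for p, a in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical classifier output compared with hand-checked labels.
predicted = ["speech", "speech", "other", "other", "speech"]
actual = ["speech", "other", "speech", "other", "speech"]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```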
