
This README functions as a write-up for our implementation of a Bayes Author Classifier.

Bayes Author Classifier

Introduction

A key goal for this project was that it serve as coding practice for us, as both members of this group have more of a background in linguistics than in programming. The project has us working with a collection of text files from the Project Gutenberg ebook library. This turned out to be a more involved task than originally conceived, as the collection includes a wide variety of text types and formats.

The core loop of the "author.py" component requires that it:

  • Generate a list of texts attributed to an author
  • Reduce the list by language (English), number of authors (solo works only), type (removing text collections), and format (plain text files)
  • Download the files from Project Gutenberg
  • Strip away whitespace, header, and footer data
  • Normalize the text (remove case and non-alphanumerics)
  • Save the text files both individually and as a single text
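
For illustration, the stripping and normalization steps might look something like the sketch below; the function name and regular expressions are ours, not necessarily the exact code in author.py:

```python
import re

def normalize(raw_text):
    """Strip Project Gutenberg boilerplate, then lowercase and
    remove non-alphanumerics. A sketch, not the exact author.py code."""
    # Keep only the body between the standard *** START/END *** markers,
    # falling back to the whole text if a marker is missing.
    start = re.search(r"\*\*\*\s*START OF.*?\*\*\*", raw_text)
    end = re.search(r"\*\*\*\s*END OF.*?\*\*\*", raw_text)
    body = raw_text[start.end():end.start()] if start and end else raw_text

    # Lowercase and replace runs of non-alphanumerics with single spaces.
    return re.sub(r"[^a-z0-9]+", " ", body.lower()).strip()
```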

In addition to the core features, the module accepts the name of a file listing author names (one per line) as download targets, as well as a flag for passing a single author name as a string.

This saved text data is then used as input for the second module, "make_model.py," which implements the actual guts of the Naive Bayes classifier. The output files from "author.py" are used to generate the model framework, and each author's texts are integrated into a term matrix whose columns represent the document classes (authors). The texts in each author's directory are walked through, building a list of all word-like tokens found. When a new token is encountered, a one-filled row is created for it, implementing add-1 smoothing. Counts are incremented as tokens recur, and once all tokens for an author are tallied, the process repeats with the next author. After every author's vocabulary counts are finalized, the probability of each token is calculated per author.
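
A rough sketch of that counting scheme, using in-memory dictionaries rather than the CSV output described under Results (the names are ours, not the exact make_model.py code):

```python
from collections import defaultdict

def build_model(author_tokens):
    """author_tokens maps each author to the list of word tokens
    drawn from all of their texts. A sketch of the counting scheme
    described above, not the exact make_model.py code."""
    authors = list(author_tokens)
    # Every (token, author) cell starts at 1 -- the "one-filled row"
    # that implements add-1 smoothing for unseen tokens.
    counts = defaultdict(lambda: dict.fromkeys(authors, 1))
    for author, tokens in author_tokens.items():
        for tok in tokens:
            counts[tok][author] += 1

    # P(token | author) = cell count / author's total count.
    totals = {a: sum(row[a] for row in counts.values()) for a in authors}
    return {tok: {a: row[a] / totals[a] for a in authors}
            for tok, row in counts.items()}
```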

Unfortunately, we did not manage to get to the test component of our project. Implementing it would require, for each document class (author), summing the log of the class's prior probability with the log probability, under that class, of each token in the test document. Once this is done for every document class, the one with the highest total is chosen as the category label for the test document.
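
In code, that scoring step might look roughly like the following sketch, reusing the probabilities from the sketch above; the priors argument is an assumption, since we never got as far as computing class priors:

```python
import math

def classify(test_tokens, model, priors):
    """Pick the author whose log prior plus summed token
    log-probabilities is highest. A sketch of the planned test step."""
    best_author, best_score = None, float("-inf")
    for author, prior in priors.items():
        score = math.log(prior)
        for tok in test_tokens:
            if tok in model:  # tokens absent from the vocabulary are skipped
                score += math.log(model[tok][author])
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```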

Data

For this project, we used the previously mentioned Project Gutenberg library for our data. We specifically selected authors with a minimum of 25 English-language releases listed on Project Gutenberg: William Shakespeare, Mark Twain, Arthur Conan Doyle, Jane Austen, Harriet Beecher Stowe, H.G. Wells, Charles Dickens, Jonathan Swift, Edith Wharton, and Edgar Allan Poe.

A few comments regarding bias in the data should be made. All of the selected authors are white English speakers, most are male, and all come from American or English backgrounds. The selection stems largely from pre-existing bias: several other authors were considered but unfortunately lacked large enough corpora to use. Future iterations of this project might work with a wider selection of authors and include non-English texts, as there's no inherent technical reason to exclude works in other languages. Ultimately, the choice of authors and language came down to our personal familiarity with them and the availability of their texts via Project Gutenberg.

Results

With the data we have available, the results are currently represented as a rather large set of CSVs, with some 295,908 tokens tracked for each document class. The probability of each token given an author runs along the rows, with one column per author, ordered left to right as in authors.csv: Edgar Allan Poe, Mark Twain, Edith Wharton, William Shakespeare, Jane Austen, Charles Dickens, H. G. (Herbert George) Wells, Arthur Conan Doyle, Jonathan Swift, and Harriet Beecher Stowe. The token terms themselves are listed in terms.csv, with each row matching the corresponding row in freqs.csv.
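
For anyone wanting to poke at the output, a lookup along those lines might look like the sketch below; it assumes the file shapes just described (authors.csv holding the column order on one row), and the query token is hypothetical:

```python
import csv

# Assuming authors.csv holds the column order, terms.csv one token per
# row, and freqs.csv the matching rows of per-author probabilities.
with open("authors.csv", newline="") as f:
    authors = next(csv.reader(f))
with open("terms.csv", newline="") as f:
    terms = [row[0] for row in csv.reader(f)]
with open("freqs.csv", newline="") as f:
    freqs = list(csv.reader(f))

row = terms.index("whale")  # hypothetical query token
print(dict(zip(authors, map(float, freqs[row]))))
```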

Since we don't have test results, there isn't a very straightforward way to evaluate the data, but we did want to comment on one aspect of the output. Because the table is generated with the authors as the (invisible) column heads and the data is built piecewise by author, any token first seen after an author's texts are completed retains only its default add-1 smoothing count for that author, with the resulting probability reflecting the smoothing alone. As a result, skimming the CSV from top to bottom shows very clearly where each author's texts started and stopped contributing new tokens to the total.



Known Issues

Texts not normalized for training

  • It doesn't separate the texts out into training, validation, and test data. I'll want input on recommendations for this, but it should be straightforward: read the size of an author's text and split it into roughly suitable chunks based on document size (see the sketch below).
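
A minimal sketch of that chunking idea, splitting one author's concatenated text by character count (the 80/10/10 ratios are our placeholder, not a decision):

```python
def split_text(text, train=0.8, validation=0.1):
    """Split one author's concatenated text into train/validation/test
    chunks by character count. The 80/10/10 ratios are placeholders."""
    a = int(len(text) * train)
    b = int(len(text) * (train + validation))
    return text[:a], text[a:b], text[b:]
```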

download_doc() being called twice

The code's rather sloppy: I have it calling the download_doc() function twice, once to get the standalone file and once to get the concatenated file. This is why it lists the Downloading prompt twice. It should be easy to fix, but a tiny bit time-consuming, as I'll have to refactor some code as well.
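
One possible refactor, sketched below: call the downloader once and write the result to both destinations. The save_doc wrapper and the download_doc signature shown are assumptions, not the current code:

```python
def save_doc(doc_id, standalone_path, combined_path):
    """Download a document once, then write it to both destinations."""
    text = download_doc(doc_id)  # the existing downloader, called once
    with open(standalone_path, "w", encoding="utf-8") as out:
        out.write(text)
    with open(combined_path, "a", encoding="utf-8") as out:
        out.write(text)
```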

Accept File Input on Command Line

Right now, it takes the name of the author from user input. I'll want to give it the option of taking a file from the command line with the author names and necessary options.
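
A minimal argparse sketch of that interface; the flag names are hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(description="Download texts by author.")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("-a", "--author", help="a single author name as a string")
group.add_argument("-f", "--file",
                   help="file listing one author name (plus options) per line")
args = parser.parse_args()

if args.file:
    with open(args.file) as fh:
        authors = [line.strip() for line in fh if line.strip()]
else:
    authors = [args.author]
```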

Doc seems to be an audio file info text

I could be wrong, but at least one of the documents that it's downloading looks to be associated with an audio file. I'll look into that more to see what's going on.

Downloads multi-authored works

Since we're trying to capture writing style, multi-author works need to be excluded. If the author metadata frozenset contains two or more authors, the work should be removed.
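
The check itself should be a one-liner, sketched below; get_authors stands in for whatever metadata query the code actually uses:

```python
def filter_solo_works(text_ids, get_authors):
    """Keep only texts whose author metadata contains exactly one name.

    get_authors(text_id) is assumed to return a frozenset of author
    names, matching the frozenset mentioned above."""
    return [tid for tid in text_ids if len(get_authors(tid)) == 1]
```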

Data not blinded

The data isn't blinded yet, though I'm not totally sure we need to do this, given that it's easy enough to just randomize the download order of the authors.

Stalls when downloading a specific document from Mark Twain (3199)

  • I'm running into a glitch where the whole thing stalls when downloading a specific document from Mark Twain (3199). I haven't done any real troubleshooting yet, as I've been able to work around it by hitting CTRL+C in my console, which sends an extra error code to drop out of the stall. Perhaps the file is just too big, but I'll need to look into it more (one possible mitigation is sketched below).
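
If the stall turns out to be a hung network read, one possible mitigation is a per-request timeout; this sketch assumes the downloads go through the requests library:

```python
import requests

def fetch_text(url, timeout=30):
    """Fetch a document, raising an error instead of stalling forever."""
    response = requests.get(url, timeout=timeout)  # seconds to connect/read
    response.raise_for_status()
    return response.text
```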

Duplicate texts/wrong languages downloaded

Right now, it downloads everything an author ever wrote, including duplicate texts, texts in other languages, and collections. It does seem to exclude multi-author works automatically, which is convenient, but I haven't verified that thoroughly yet.
