jonathanreeve / corpus-db

57 stars, 7 watchers, 8 forks, 26.67 MB

A textual corpus database for the digital humanities.

Home Page: http://corpus-db.org

License: GNU General Public License v3.0

Languages: Haskell 7.72%, Shell 0.17%, Jupyter Notebook 92.12%
Topics: corpus-linguistics, digital-humanities, literature, literary-studies, literary-criticism, literary-analysis, text-analysis, natural-language-processing

corpus-db's People

Contributors

evantilley, gdevanla, gitter-badger, jonathanreeve, mkwassmason


corpus-db's Issues

Make project website

It'd be great to do this in Scotty, so that it'll be the same codebase and language as the API.

Make local test DB and add to GitHub repo

This will facilitate collaboration and building the system for those who don't need or want the full 16GB database.

I'm guessing the process will be:

  • make a new DB, attach it using the ATTACH command
  • copy over the first 20 or so texts to the new DB
  • add the DB to GitHub
  • add a new environment (default) that will use the relative path to this test DB
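
A rough sketch of the first two steps in Python, using the sqlite3 module (the database filename is an assumption; the texts table is called text, per the FTS5 issue below):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')               # full database; filename assumed
    conn.execute("ATTACH DATABASE 'test.db' AS test")  # creates and attaches the test DB
    # Copy the first 20 or so texts into the new database.
    conn.execute("CREATE TABLE test.text AS SELECT * FROM text LIMIT 20")
    conn.commit()
    conn.close()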

Full text wrapped in square brackets

When getting a full text, like in this example:

http://corpus-db.org/api/id/108/fulltext

the result is wrapped in square brackets. That means that when the response is parsed as JSON, it's a one-item list containing the dictionary, rather than just the dictionary. It would be better without the brackets, so the object wouldn't have to be unwrapped.
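
For example, in Python the extra list level currently has to be unwrapped by hand:

    import requests

    data = requests.get('http://corpus-db.org/api/id/108/fulltext').json()
    book = data[0]  # currently a one-item list that must be unwrapped
    # Without the enclosing brackets, this could simply be:
    # book = data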

More semantics

Thinking about semantics again. In this example:

http://corpus-db.org/api/id/807/fulltext

I think this has some issues. As a user, if I see the endpoint for getting some specific data attached to a resource, I should a) also know how to get the resource metadata and b) see how to get the metadata for all resources of that type.

If the endpoint was something like:

http://corpus-db.org/api/v1/books/id/10?fulltext=true

or

http://corpus-db.org/api/v1/books?fulltext=true&id=101

you could infer that

    http://corpus-db.org/api/v1/books

gives you all the metadata for the books resource and

http://corpus-db.org/api/v1/books?id=101

gives you metadata, probably with an excerpt, without the full text. Also, if you wind up adding more resources, you've painted yourself into a corner with the approach that only has the id. Imagine if you want to do this:

http://corpus-db.org/api/v1/author?name="Margaret+Atwood"
http://corpus-db.org/api/v1/author?id=391

those don't really jibe with the format you've established for books, making books a special case rather than a template from which you can infer how the rest of the system works. I think consistency here makes the whole API a lot more usable and intuitive.
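
To illustrate, under the proposed scheme a client could move between related requests predictably (these URLs are the hypothetical ones from this issue, not the current API):

    import requests

    base = 'http://corpus-db.org/api/v1'
    requests.get(base + '/books')                      # metadata for all books
    requests.get(base + '/books', params={'id': 101})  # metadata for one book
    requests.get(base + '/books',
                 params={'id': 101, 'fulltext': 'true'})  # one book's full text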

New API endpoint for Wikipedia categories

This would allow downloading books with the Wikipedia category "Novels set in London," for instance:

corpus-db.org/api/category/Novels set in London

And its full-text counterpart:

corpus-db.org/api/category/Novels set in London/fulltext
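
Note that the spaces in the category name would have to be percent-encoded in practice; a sketch of how a client might call the proposed endpoint:

    import requests
    from urllib.parse import quote

    # Hypothetical endpoint from this issue; spaces must be percent-encoded.
    category = quote('Novels set in London')  # -> 'Novels%20set%20in%20London'
    r = requests.get('http://corpus-db.org/api/category/' + category)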

Max results query parameter

It would be nice to have a query parameter that caps the number of results returned, to speed up examples in the notebook.
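
Something like this, where the limit parameter name is hypothetical (no such parameter exists yet):

    import requests

    # Cap the number of results so notebook examples run quickly.
    r = requests.get('http://corpus-db.org/api/subjects', params={'limit': 10})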

Clean database

A lot of fields in the database are just string representations of Python objects (lists, dictionaries, etc.). It would help to put this into a more structured format, if possible.
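
In the meantime, those stringified fields can at least be parsed safely on the client with ast.literal_eval (the example value is hypothetical):

    import ast

    # A field that is currently a string representation of a Python list:
    raw = "['Fiction', 'Detective and mystery stories']"
    subjects = ast.literal_eval(raw)  # -> ['Fiction', 'Detective and mystery stories']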

Speed up full-text search response

Full-text searches take a really long time, even though the sqlite interface returns results iteratively. They could probably be sped up by treating the results of a database query more like a stream. Haskell might already be doing some kind of streaming, so it would be good to investigate this a little.
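
For comparison, this is what row-at-a-time streaming looks like through Python's sqlite3 interface; iterating the cursor yields rows as they are found rather than materializing the whole result set (the filename and column names are assumptions):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')  # filename assumed
    cursor = conn.execute("SELECT id, text FROM text WHERE text LIKE ?",
                          ('%whale%',))
    for row in cursor:  # rows arrive one at a time, not all at once
        print(row[0])   # e.g. handle each matching id as it streams in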

Migrate to Docker

It'd be nice to have the whole setup procedure (stack build, etc.) containerized in Docker. This would save time when migrating to a new server. I'm not quite sure how data volumes would work, though. Low priority for the moment, since this is a dev-ops issue.
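
A minimal sketch of what the Dockerfile might look like (the base image, paths, and executable name are all assumptions, and the large database would presumably be mounted as a volume rather than baked into the image):

    FROM haskell:latest
    WORKDIR /app
    COPY . .
    RUN stack build
    # Mount the 16GB SQLite database from the host instead of copying it in.
    VOLUME /data
    CMD ["stack", "exec", "corpus-db"]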

Write out API spec

This would, in the beginning, just be a list of URL patterns and descriptions, like:

  • api.corpus-db.org/author/Dickens -- Should give you the metadata of Dickens novels
  • api.corpus-db.org/fulltext/author/Dickens -- Should give the full text of all Dickens novels

This could eventually become more formal API documentation.

Find a way to automatically create API docs from code comments

It seems like the usual way of writing Haskell documentation is with Haddock, which generates documentation automatically from code comments.

This isn't necessary, but it would be nice, since that way we'd have everything in one place (without having to update the docs page every time there's an API change).

Create full text search (FTS5) table

https://www.sqlite.org/fts5.html

Steps are:

  1. Create a new FTS5 virtual table modeled on the text table
  2. Copy all data from text to the new table
  3. ???
  4. Profit!

I have no idea how big this will make the database, though, since I had to kill the process on my laptop after the database more than doubled in size (>18G). I think I'll need to temporarily buy a new DigitalOcean volume and hook it up to the server in order to test this.
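
One way to keep the size down might be an external-content FTS5 table, which indexes the existing text table rather than storing a second copy of every document (a sketch; the filename and column name are assumptions):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')  # filename assumed
    # External-content FTS5 table: the index references the existing `text`
    # table instead of duplicating its contents.
    conn.execute("CREATE VIRTUAL TABLE text_fts "
                 "USING fts5(text, content='text', content_rowid='rowid')")
    # Build the index from the rows already in `text`.
    conn.execute("INSERT INTO text_fts(text_fts) VALUES('rebuild')")
    conn.commit()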

Rewrite database layer using Persistent

The nice thing about using Haskell is type safety, and HDBC isn't as type-safe as a more ORM-like database layer such as Persistent. Persistent would also let us migrate the database effortlessly, if need be.

Add more example analyses

More comparisons between single-author corpora would be interesting, as well as more metadata analyses. Example text analyses in languages other than Python might be interesting, too: how about some R analyses? Haskell, even. Jupyter supports all of these languages now, so they could all be in Jupyter notebooks and thus displayable on GitHub.

Need a title search

It would be great to have a way to search by title, if that is all the info we have (i.e., we don't know the Gutenberg ID, author, etc.). Thanks!

Compile statistics about data field completeness

Wikipedia data only exists for about 1-2K of the ~45K books in PG, if I recall correctly. To figure out where there is room for improvement, it would first help to know the completeness of all the data fields. Then we can identify patterns in the books that have very little metadata.
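
A quick way to compute that with pandas (the filename and metadata table name are assumptions):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect('pg-text.db')                # filename assumed
    df = pd.read_sql_query('SELECT * FROM meta', conn)  # table name assumed
    # Fraction of non-null values per field, least complete first.
    print(df.notnull().mean().sort_values())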

Make a robots.txt

The server logs show a lot of requests for robots.txt. We should make one that tells bots where the content pages are (and to stay away from the API).
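
A minimal robots.txt along those lines, assuming the API lives under /api/:

    User-agent: *
    Disallow: /api/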

IDs formatted as floats

The IDs returned when getting subject metadata all have .0 at the end. It seems like they should be integers, not floats.
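
Until that's fixed in the database, clients can normalize the IDs themselves (the value here is just an example):

    # Coerce a float-formatted ID back to an integer string.
    raw_id = "2701.0"  # as currently returned by the subject endpoints
    book_id = str(int(float(raw_id)))  # -> "2701"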

Consider other database formats

It would be worth considering other database formats for this project, since SQLite only supports certain data types. It's definitely nice to have everything in one file, though.

Semantic endpoints (singular/plural)

According to what I've seen/read, typically the endpoints stay plural, like

baseurl + "/api/subjects/detectivefiction"

In terms of meaning, the category above, subjects, is inclusive of all the subjects. The next delineation, detectivefiction, is the singular.

If the API was like:

baseurl + '/api?subject="detectivefiction"&output=json'

then that makes sense, because subject isn't the category above; it's the key for which detectivefiction is the value.

Also, it's confusing if "subjects" lists all the subjects but that pattern isn't continued when getting a specific subject.
