jonathanreeve / corpus-db

57 stars, 7 watchers, 8 forks, 26.67 MB

A textual corpus database for the digital humanities.

Home Page: http://corpus-db.org

License: GNU General Public License v3.0

Languages: Haskell 7.72%, Shell 0.17%, Jupyter Notebook 92.12%
Topics: corpus-linguistics, digital-humanities, literature, literary-studies, literary-criticism, literary-analysis, text-analysis, natural-language-processing

corpus-db's People

Contributors

evantilley, gdevanla, gitter-badger, jonathanreeve, mkwassmason


corpus-db's Issues

Make project website

It'd be great to do this in Scotty, so that it'll be the same codebase and language as the API.

Make local test DB and add to GitHub repo

This will facilitate collaboration and building the system for those who don't need or want the full 16GB database.

I'm guessing the process will be:

  • make a new DB, attach it using the ATTACH command
  • copy over the first 20 or so texts to the new DB
  • add the DB to GitHub
  • add a new environment (default) that will use the relative path to this test DB
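
A rough sketch of the first two steps in Python, using the sqlite3 module (the database filename is an assumption; the texts table is called text, per the FTS5 issue below):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')               # full database; filename assumed
    conn.execute("ATTACH DATABASE 'test.db' AS test")  # creates and attaches the test DB
    # Copy the first 20 or so texts into the new database.
    conn.execute("CREATE TABLE test.text AS SELECT * FROM text LIMIT 20")
    conn.commit()
    conn.close()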

Full text wrapped in square brackets

When getting a full text, like in this example:

http://corpus-db.org/api/id/108/fulltext

the result is wrapped in square brackets. That means that when the response is parsed as JSON, it's a one-item list containing the dictionary, rather than just the dictionary. It would be better without the brackets, so the object wouldn't have to be unwrapped.
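
For example, in Python the extra list level currently has to be unwrapped by hand:

    import requests

    data = requests.get('http://corpus-db.org/api/id/108/fulltext').json()
    book = data[0]  # currently a one-item list that must be unwrapped
    # Without the enclosing brackets, this could simply be:
    # book = data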

More semantics

Thinking about semantics again. In this example:

http://corpus-db.org/api/id/807/fulltext

I think this has some issues. As a user, if I see the endpoint for getting some specific data attached to a resource, I should a) also know how to get the resource metadata and b) see how to get the metadata for all resources of that type.

If the endpoint was something like:

http://corpus-db.org/api/v1/books/id/10?fulltext=true

or

http://corpus-db.org/api/v1/books?fulltext=true&id=101

you could infer that

    http://corpus-db.org/api/v1/books

gives you all the metadata for the books resource and

http://corpus-db.org/api/v1/books?id=101

gives you metadata, probably with an excerpt, without the full text. Also, if you wind up adding more resources, you've painted yourself into a corner with the approach that only has the id. Imagine if you want to do this:

http://corpus-db.org/api/v1/author?name="Margaret+Atwood"
http://corpus-db.org/api/v1/author?id=391

those don't really jibe with the format you've established for books, making books a special case rather than a template from which you can infer how the rest of the system works. I think consistency here makes the whole API a lot more usable and intuitive.
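
To illustrate, under the proposed scheme a client could move between related requests predictably (these URLs are the hypothetical ones from this issue, not the current API):

    import requests

    base = 'http://corpus-db.org/api/v1'
    requests.get(base + '/books')                      # metadata for all books
    requests.get(base + '/books', params={'id': 101})  # metadata for one book
    requests.get(base + '/books',
                 params={'id': 101, 'fulltext': 'true'})  # one book's full text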

New API endpoint for Wikipedia categories

This would allow downloading books with the Wikipedia category "Novels set in London," for instance:

corpus-db.org/api/category/Novels set in London

And its full-text counterpart:

corpus-db.org/api/category/Novels set in London/fulltext
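
Note that the spaces in the category name would have to be percent-encoded in practice; a sketch of how a client might call the proposed endpoint:

    import requests
    from urllib.parse import quote

    # Hypothetical endpoint from this issue; spaces must be percent-encoded.
    category = quote('Novels set in London')  # -> 'Novels%20set%20in%20London'
    r = requests.get('http://corpus-db.org/api/category/' + category)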

Max results query parameter

It would be nice to have a query parameter that caps the number of results returned, to speed up examples in the notebook.
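
Something like this, where the limit parameter name is hypothetical (no such parameter exists yet):

    import requests

    # Cap the number of results so notebook examples run quickly.
    r = requests.get('http://corpus-db.org/api/subjects', params={'limit': 10})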

Clean database

A lot of fields in the database are just string representations of Python objects (lists, dictionaries, etc.). It would help to put this into a more structured format, if possible.
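
In the meantime, those stringified fields can at least be parsed safely on the client with ast.literal_eval (the example value is hypothetical):

    import ast

    # A field that is currently a string representation of a Python list:
    raw = "['Fiction', 'Detective and mystery stories']"
    subjects = ast.literal_eval(raw)  # -> ['Fiction', 'Detective and mystery stories']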

Speed up full-text search response

Full-text searches take a really long time, even though the sqlite interface returns results iteratively. They could probably be sped up by treating the results of a database query more like a stream. Haskell might already be doing some kind of streaming, so it would be good to investigate this a little.
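
For comparison, this is what row-at-a-time streaming looks like through Python's sqlite3 interface; iterating the cursor yields rows as they are found rather than materializing the whole result set (the filename and column names are assumptions):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')  # filename assumed
    cursor = conn.execute("SELECT id, text FROM text WHERE text LIKE ?",
                          ('%whale%',))
    for row in cursor:  # rows arrive one at a time, not all at once
        print(row[0])   # e.g. handle each matching id as it streams in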

Migrate to Docker

It'd be nice to have the whole setup procedure (stack build, etc.) containerized in Docker. This would save time when migrating to a new server. I'm not quite sure how data volumes would work, though. Low priority for the moment, since this is a dev-ops issue.
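
A minimal sketch of what the Dockerfile might look like (the base image, paths, and executable name are all assumptions, and the large database would presumably be mounted as a volume rather than baked into the image):

    FROM haskell:latest
    WORKDIR /app
    COPY . .
    RUN stack build
    # Mount the 16GB SQLite database from the host instead of copying it in.
    VOLUME /data
    CMD ["stack", "exec", "corpus-db"]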

Write out API spec

This would, in the beginning, just be a list of URL patterns and descriptions, like:

  • api.corpus-db.org/author/Dickens -- Should give you the metadata of Dickens novels
  • api.corpus-db.org/fulltext/author/Dickens -- Should give the full text of all Dickens novels

This could eventually become more formal API documentation.

Find a way to automatically create API docs from code comments

It seems like the usual way of writing Haskell documentation is with Haddock, which generates documentation automatically from code comments.

This isn't necessary, but it would be nice, since that way we'd have everything in one place (without having to update the docs page every time there's an API change).

Create full text search (FTS5) table

https://www.sqlite.org/fts5.html

Steps are:

  1. Create a new FTS5 virtual table modeled on the text table
  2. Copy all data from text to the new table
  3. ???
  4. Profit!

I have no idea how big this will make the database, though, since I had to kill the process on my laptop after the database more than doubled in size (>18G). I think I'll need to temporarily buy a new DigitalOcean volume and hook it up to the server in order to test this.
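
One way to keep the size down might be an external-content FTS5 table, which indexes the existing text table rather than storing a second copy of every document (a sketch; the filename and column name are assumptions):

    import sqlite3

    conn = sqlite3.connect('pg-text.db')  # filename assumed
    # External-content FTS5 table: the index references the existing `text`
    # table instead of duplicating its contents.
    conn.execute("CREATE VIRTUAL TABLE text_fts "
                 "USING fts5(text, content='text', content_rowid='rowid')")
    # Build the index from the rows already in `text`.
    conn.execute("INSERT INTO text_fts(text_fts) VALUES('rebuild')")
    conn.commit()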

Rewrite database layer using Persistent

The nice thing about using Haskell is type safety, and HDBC isn't as type-safe as a more ORM-like database layer such as Persistent. Persistent would also let us migrate the database effortlessly, if need be.

Add more example analyses

More comparisons between single-author corpora would be interesting, as well as more metadata analyses. Example text analyses in languages other than Python might be interesting, too: how about some R analyses? Haskell, even. Jupyter supports all of these languages now, so they could all be in Jupyter notebooks and thus displayable on GitHub.

Need a title search

It would be great to have a way to search by title, if that is all the info we have (i.e., we don't know the Gutenberg ID, author, etc.). Thanks!

Compile statistics about data field completeness

Wikipedia data only exists for about 1-2K of the ~45K books in PG, if I recall correctly. To figure out where there is room for improvement, it would first help to know the completeness of all the data fields. Then we can identify patterns in the books that have very little metadata.
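
A quick way to compute that with pandas (the filename and metadata table name are assumptions):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect('pg-text.db')                # filename assumed
    df = pd.read_sql_query('SELECT * FROM meta', conn)  # table name assumed
    # Fraction of non-null values per field, least complete first.
    print(df.notnull().mean().sort_values())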

Make a robots.txt

The server logs show a lot of requests for robots.txt. We should make one that tells bots where the content pages are (and to stay away from the API).
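
A minimal robots.txt along those lines, assuming the API lives under /api/:

    User-agent: *
    Disallow: /api/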

IDs formatted as floats

The IDs returned when getting subject metadata all have .0 at the end. It seems like they should be integers, not floats.
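
Until that's fixed in the database, clients can normalize the IDs themselves (the value here is just an example):

    # Coerce a float-formatted ID back to an integer string.
    raw_id = "2701.0"  # as currently returned by the subject endpoints
    book_id = str(int(float(raw_id)))  # -> "2701"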

Consider other database formats

It would be worth considering other database formats for this project, since SQLite only supports certain data types. It's definitely nice to have everything in one file, though.

Semantic endpoints (singular/plural)

According to what I've seen/read, typically the endpoints stay plural, like

baseurl + "/api/subjects/detectivefiction"

In terms of meaning, the category above, subjects, is inclusive of all the subjects. The next delineation, detectivefiction, is the singular.

If the API was like:

baseurl + '/api?subject="detectivefiction"&output=json'

then that makes sense, because subject isn't the category above; it's the key for which detectivefiction is the value.

Also, it's confusing if "subjects" lists all the subjects but that pattern isn't continued when getting a specific subject.
