
Comments (16)

fractaledmind commented on September 18, 2024

Having read some more, this may be more cleanly done using Composition rather than Inheritance. As I'm a fan of explicit over implicit behavior, this seems a bit easier to read and understand. But, functionally, both methods will produce the same result, and I feel confident I could write both. So, do you have a stylistic preference?


fractaledmind commented on September 18, 2024

Wait, what if we reverse this? What if each corpus-specific class is a sub-class of Corpus, but Corpus requires (at least) 4 parameters:

  • name
  • retrieval type
  • language
  • content type

Then, in the Corpus class, it branches to the various options depending on the parameters given. So, instead of using separate classes, we use sub-methods in the Corpus class. For example, the retrieve() method, under this schema, would look like this:

def retrieve(self):
    # Dispatch on the retrieval_type passed in at initialization
    if self.retrieval_type == 'local':
        self._copy()  # copy from a local install
    elif self.retrieval_type == 'remote':
        self._download()  # fetch from a remote repository

The corpus-specific class passes the retrieval_type param on initialization and only calls retrieve(); we define the various code paths on the backend.
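
For instance, a corpus-specific class would then be little more than a call to the parent's initializer. A rough sketch (the parameter values here are just illustrative):

class TLGCorpus(Corpus):
    def __init__(self):
        # Only the parameters are corpus-specific; all logic lives in Corpus
        Corpus.__init__(self, name='tlg', retrieval_type='local',
                        language='greek', content_type='binary')

# A user only ever calls the generic method:
TLGCorpus().retrieve()  # dispatches to self._copy(), since retrieval_type is 'local'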

Thinking through this now leads me to believe this is the cleanest and best option.


kylepjohnson commented on September 18, 2024

I'll confess to being a newbie in writing both Inheritance and Composition classes. Please excuse this rudimentary source, but it may help me think this through:

Inheritance is for "is-a" relationships. Is a child a parent? Not necessarily. Composition is for "has-a" relationships. A child has a parent (and a parent has a child). You would use inheritance if you had a person class, then a child is a person, so child would inherit from person

In the example of the TLG, by what you've sketched out, we would have an "inheritance" like so: LocalCorpus --> GreekCorpus --> BinaryCorpus --> TLGCorpus. Is a GreekCorpus a LocalCorpus? Not necessarily. Is a BinaryCorpus a GreekCorpus? Not necessarily (doesn't seem to fit the analogy). And is the TLGCorpus a LocalCorpus or GreekCorpus or BinaryCorpus? Well ... sorta ... but not really.

On the other hand, TLGCorpus necessarily has a LocalCorpus and a GreekCorpus. Excluding BinaryCorpus (which seems to me at first blush to be an outlier), I think we are looking at a "has-a" relationship. If I am right about BinaryCorpus being a little bit different from the others, could we denote this as an argument to, eg, TLGCorpus? Besides, are not encoding (say some random old DOS format) and transliteration of any given corpus (say ITRANS to Unicode Devanagari) different things?
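
If it helps to visualize the "has-a" shape, a composition version might look roughly like this (untested, and the class names are only illustrative):

class LocalCorpus(object):
    """Knows how to copy files from a local install."""

class GreekCorpus(object):
    """Knows Greek-specific encodings and cleanup."""

class TLGCorpus(object):
    def __init__(self, content_type='binary'):
        self.location = LocalCorpus()     # TLGCorpus *has a* LocalCorpus
        self.language = GreekCorpus()     # TLGCorpus *has a* GreekCorpus
        self.content_type = content_type  # the BinaryCorpus outlier as an argument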

All this said, Composition sounds easier and indeed more explicit (which I too prefer). What do you think? If I've gone off the rails on this one don't hesitate to tell me!


kylepjohnson commented on September 18, 2024

Before posting this last comment I hadn't seen your latest.

Wait, what if we reverse this? What if each corpus-specific class is a sub-class of Corpus, but Corpus requires (at least) 4 parameters:

  • name
  • retrieval type
  • language
  • content type

I cannot answer authoritatively whether this is technically better, but it is immediately intuitive to me. Perhaps a fifth argument could be encoding (utf-8, Latin-1, etc).

So yes I think this is a great solution.


fractaledmind commented on September 18, 2024

Perfect. Glad to be in agreement. I've only ever played with single inheritance, so I'm not much more knowledgeable than anyone on multiple inheritance. But parameter-driven code paths in a single class make a lot of sense to me, and I've written a lot of Python in that paradigm. It will require me to completely rework the code I've already written, but it shouldn't be too hard, since it's mostly just moving and renaming things. Hopefully I can push some basics tomorrow.

Since you know the corpora better than I do, if you think of any parameters that distinguish any two corpora, please let me know here. If we can isolate what unites and what separates any two corpora from one another, we ought to have a pretty good foundation for a schema. This schema will in turn make all further development way easier, so it ought to be worth the conceptual work.

Also, the encoding param makes sense. Will add.


kylepjohnson commented on September 18, 2024

Sounds like a plan. I will work up a list of corpora of which I am aware and their attributes.

And it's neither here nor there exactly, but something to think about is indices for each corpus. Do you imagine these in the corpora/ dir or accompanying the downloaded corpora themselves?


fractaledmind commented on September 18, 2024

For the corpus attributes, here's how I'm structuring them now:

'tlg': {
    'language': 'greek',
    'retrieval_type': 'local',
    'content_type': 'binary',
    'encoding': 'latin-1'
}

What I'm doing now for indices is to create JSON files and grant access to that data via properties. You can see this well in the TLG class. You can access TLG().authors, which will either read the data from the JSON file, or generate the JSON file and return the data (see the sketch below). Are there indices for other corpora besides TLG and PHI?
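
Stripped down, the pattern looks something like this (the path and the _generate_authors() helper here are placeholders, not the actual TLG code):

import json
import os

class TLG(object):
    def __init__(self):
        self._authors_path = os.path.expanduser('~/cltk_data/compiled/tlg/authors.json')

    @property
    def authors(self):
        # Read the cached JSON index if it already exists ...
        if os.path.isfile(self._authors_path):
            with open(self._authors_path) as f:
                return json.load(f)
        # ... otherwise generate it, save it for next time, and return it
        data = self._generate_authors()
        with open(self._authors_path, 'w') as f:
            json.dump(data, f)
        return data

    def _generate_authors(self):
        return {}  # placeholder: parse the corpus files here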


kylepjohnson commented on September 18, 2024
  1. Corpus attributes
    I've taken a first shot at a corpora attributes file: https://github.com/kylepjohnson/cltk/blob/corpora/corpora/common/corpora_attributes.json

It may not be everything you need to know to parse all of the corpora, but it should be enough to handle most of them.

We can talk further (maybe in particular tickets) about how to parse particular files. For example, there are numerous ways that one might want to parse the treebank data, to generate files of different types (eg, lists, tuples, etc) with different data (eg, inflected form, lemma, POS, syntax tag).

  2. Indices
    The TLG().authors style access is perfect. To compile indices ourselves or not? Here's my train of thought: if we do it once, why make others do it on the fly over and over? Besides, an index won't take much space. If so, then the question is where to put it: in the core, or in the root of each repository in ~/cltk_data/? For remote corpora, I am leaning towards the latter, in order to encourage others to compose their own corpora for the CLTK without needing to have their index in the core. However, for local corpora (currently tlg, phi5, phi7) we either need to compose indices on the fly (as I have been experimenting with at ~/cltk_data/compiled/tlg/index_author_works.txt & index_file_author.txt & index_meta.txt; see also index_* in phi5 and index_file_author.txt in phi7) or include the indices in the core. I kinda prefer the latter, for the same reason of composing these only once if need be.

All of this leads me to think about something you've been thinking about for a few days, namely an API for accessing languages/corpora/authors/texts/chapters/sections. In the purest conception, this would be language- and corpus-agnostic, though that may be too high a bar to set. I'm sure we could do it, though. The first question I would ask is 'What kinds of corpora are there?'

  • literary texts (with authors or author-like attributes, eg "Pseudo-Apollodorus")
  • documentary texts (epigraphs and papyri without authors, but usu. contained within regional collections, eg "Attica")
  • linguistic training sets including POS, syntactic relationships, sentence tokenization sets, word tokenization (important for eg Chinese), parallel texts (for machine translation)
  • the actual 'trainer', which will almost surely be stored as a .pickle. Note that these can and maybe should be distributed separately from the training sets. Should these pickles be included in the core? I lean towards yes.
  • many others that aren't coming to mind at the moment … lots of potentials like bag of words, synsets, etc … though I do not currently plan on having these for some months (POS tagging my big goal)

Anyways, let me know about whether the .json is good enough for what you need or if you'd like me to specify further. And if you need more info about the corpora and their contents, ask away. I'm surely leaving out some important detail.


fractaledmind commented on September 18, 2024

On (1), this looks like a great start. It will be impossible to say if it is sufficient until I get closer to finalizing the code, but right now it looks really good. The only thing that I will change is language to languages, since it is a list. Then I will make even corpora with only one language have a one-item list. This will standardize the data and make the parsing code easier.
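
So the earlier tlg example would become (same data, just standardized):

'tlg': {
    'languages': ['greek'],
    'retrieval_type': 'local',
    'content_type': 'binary',
    'encoding': 'latin-1'
}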

On (2), my current approach was to generate the indices on the fly the first time (so not distributed in the core), but then save and read them every time after that. To me, that seems optimal. First, it is dynamic in that it indexes what data it actually has. Second, I rewrote your code for the TLG and PHI indices and radically increased the speed. On my machine, I can create the authors index in 2 seconds. So, there is no real speed loss. Then, you have an actual file for use every other time and in other ways. I confess that this is a data model that I love and use in a lot of scripts I write (a self-creating property that saves to file).

On (3), you are right. A good API is key. It will really help adoption if the API is sensible and simple, and it will greatly help development as well. Of course, a great API is hard as hell. I've already changed the Corpus API three times. We will need to be reflective and self-critical, but other than that I don't have any specific thoughts yet. One thing I do think, though, is that writing an API before writing code can sometimes help to create a clean, simple API. So, I think it would be smart to try to work it out now, and then write code to implement it.


kylepjohnson commented on September 18, 2024

re 1) We're set for now on that. Just speak up when you need my input.
re 2) You're right, no harm in parsing on the fly. Added benefit that the parsing logic is available for all to see and improve upon.
re 3) I admire how seriously you're taking this. I do indeed want the CLTK's interface to have maximum adoption. You are more than capable of taking the lead on this, so please run with it. If we both do some sketching, I suspect that we'll arrive at something similar. Another factor to keep in mind is genre attributes and other types of tags (date, location, etc), such as those in the yet-to-be-parsed meta files in the tlg and phi5. If we're thinking about an all-inclusive API from scratch, these will definitely need to be accounted for.


kylepjohnson commented on September 18, 2024

I've been thinking about corpora and indices. I see your point in actually scanning what's available. This works for well-organized corpora. But what about those that are not?

For example, look at how the Perseus Greek data comes: https://github.com/kylepjohnson/corpus_perseus_greek/tree/master/perseus_greek. And I can show you far messier examples.

We can write parsers for every messy corpus, and this might be necessary in some cases, though there are some other options:

(a) We can document a default or desired CLTK corpus structure.
(b) We can document a default or desired CLTK index structure.

(a) has the benefit of being easy to parse. (b) has the benefit that any ugly unstructured or badly structured corpus can be easily included in our system. As an example of what I'm thinking, something similar to the JSON file you've done for Loeb: https://github.com/smargh/Classical-Studies-Resources/blob/master/loeb_volumes.json

I think we can do both, though at the moment I am inclined toward (b). This could be very clean and nice, with one JSON parser for any corpus containing author-work type relationships. I would be happy to help write these indices for the author-text corpora which I have assembled so far. Just a thought ...


fractaledmind commented on September 18, 2024

That makes sense. Most of my thoughts on this have been about the corpora already included; I hadn't really thought about adding new corpora. That perspective does change things. The main point is that we need structured data for each corpus to make the CLTK as usable as possible.

I also think putting certain types of work on contributors and not on maintainers is healthy. If you have a corpus to add, you should have an index that we can use, instead of having us figure out your corpus for you. This reminds me of the other scheme we are setting up. What we need are clear schemes for the data required to "register" a corpus with the CLTK. We don't need to create each one ourselves, just the ones for the current corpora. But a clear scheme makes it obvious what is required to add a corpus to the set. So, I totally agree. You make a great point I hadn't considered. I think writing the indices for the main corpora now should help in generating the schematic template.

You can see the code I've already written to parse the TLG data into an index. Using tlgu has currently broken the works parsing, so I haven't gotten the final step down. Once we figure out how to do that, though, we can see what type of JSON structure is best.

I say go for it and see what scheme presents itself, then we can start checking it against other corpora. Then we do as you suggested earlier and package the indices with the CLTK. This would have the added benefit of allowing users to see what they will get if they retrieve and compile a corpus. So, maybe create a new directory in cltk_data for indices?


fractaledmind commented on September 18, 2024

Actually, an index is exactly what we need for smart compiling. I just realized this when trying to write compiling code for the PHI5. Since it has multiple languages, there's no one-size-fits-all approach. But, if there were an index that gave the:

  • file name
  • author
  • contained works
  • language
  • file encoding
  • font encoding
  • markup

for each file, that would make compiling so much saner. So, I think we need to index our current corpora.
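
For concreteness, an entry in such an index might look like this (a hypothetical record; every value below is invented for illustration):

{
    "LAT0474.TXT": {
        "author": "Cicero",
        "works": ["In Catilinam", "Pro Archia"],
        "language": "latin",
        "file_encoding": "latin-1",
        "font_encoding": "beta_code",
        "markup": "phi"
    }
}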


kylepjohnson commented on September 18, 2024

I also think putting certain types of work on contributors and not on maintainers is healthy.

Great. I'll make up some examples for the different corpora we can expect.

The question remains what to do about making and persisting indices for local corpora. I'm good with however you want to do that, parsing them on the fly or keeping indices in the core.

What we need are clear schemes for the data required to "register"

How about checking for the presence of something called (say) index.json in the root dir of a corpus? (These would already be somewhere in ~/cltk_data/.) The file would need to meet our markup requirements, of course, but its presence would be enough to trigger loading.
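
In code, the check could be as simple as this sketch (the function name and path layout are mine, just to illustrate):

import json
import os

def load_corpus_index(corpus_root):
    # A corpus is "registered" iff index.json sits in its root dir
    index_path = os.path.join(corpus_root, 'index.json')
    if not os.path.isfile(index_path):
        return None  # no index: corpus isn't registered, so skip it
    with open(index_path) as f:
        return json.load(f)  # presence alone triggers loading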

Using tlgu has currently broken the works parsing

I'm not sure what you mean here, but you can see some crude code I use to parse the works in my old common.py. As you've probably noticed, work titles begin (IIRC) between {1 and 1} (quick sketch below). The benefit of offering a pre-compiled index would be that TLG users would not need to use the tlgu program. Those looking for special markup, though, have that option. This gets me back to my interest in "plain text" files, which I do understand not all are as keen on.
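
If that {1 ... 1} convention holds, pulling the titles out is a short regex (an untested sketch):

import re

def work_titles(text):
    # Work titles sit between '{1' and '1}' in the markup (IIRC)
    return re.findall(r'\{1(.*?)1\}', text, re.DOTALL)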


kylepjohnson commented on September 18, 2024

I'm assigning this to myself as a reminder for when I am writing indices for corpora.


kylepjohnson commented on September 18, 2024

I have split this ticket into two, #56 for writing indices and #57 for the json file of corpus attributes.

