trilogy's People

Contributors

j-hagedorn, sdaranyi, skontopo

Forkers

kirtana jakobytes

trilogy's Issues

Some tale texts are not in complete form found on Ashliman folktales website

Here is an example of the differences between the 'text' part of the dataset and the version on the Ashliman folktext website, the Ant and the Grasshopper:

  • an ant formerly a man - aesop (l’estrange, 1692) just has the moral bit at the end, not the actual story

  • the ant and the grasshopper - aesop (croxall, 1775) misses out the first para, same with the bewick 1818 version (same title), the James 1848, Jacobs 1894, and Bierce.

On the other hand, Jones (1912), and the other Bierce (the grasshopper and the ant) are complete.

Identify textual markers to concatenate motifs and represent them as tale variants

@salmonix identified the following issue in the consistency of variant depictions in the original documentation. For instance, in tale 1692 from atu_df.csv: "A fool joins a band of robbers. They send him into a house to steal while the rest of them wait outside. He bungles the job in one of several ways [J2136]: He takes the robbers instructions literally. They tell him to bring something substantial (i.e. valuable), and he brings something heavy (e.g. a mortar) [J2461.1.7]. They tell him to bring something shiny (i.e. gold), and he brings a mirror [J2461.1.7.1]. The fool awakens the household. He wants to take more than he can carry, so he wakes the owner and asks him for help [J2136.5.6]. The fool finds a musical instrument and plays it loudly [J2136.5.7]. He decides to cook something to eat. Hearing the owner sigh in his sleep, the fool thinks he must be hungry, so he puts hot food in his mouth (hand) [J2136.5.5]."

Because the ATU original documents these in separate brackets, this is in our data as: "1692", 1, "J2136", "J2461.1.7", "J2461.1.7.1", "J2136.5.6","J2136.5.7","J2136.5.5". The problem is that this tale should look like: "1692",1,"J2461.1.7",["J2136.5.6","J2136.5.7","J2136.5.5"]. To clean this appropriately, we would need to identify textual markers that let us combine bracketed motifs and represent them as variants, rather than as one prolonged sequence.
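As a starting point for finding such markers, here is a minimal sketch in base R that pulls the bracketed motifs out of an excerpt of the type 1692 description and records the character that follows each bracket. Treating a bracket followed by a colon as "introduces alternatives" is only an assumption for illustration, not a rule confirmed against the ATU; the variable names are hypothetical.

```r
# Excerpt of the type 1692 description quoted above.
desc <- paste(
  "He bungles the job in one of several ways [J2136]:",
  "He takes the robbers instructions literally.",
  "They tell him to bring something substantial (i.e. valuable),",
  "and he brings something heavy (e.g. a mortar) [J2461.1.7].",
  "They tell him to bring something shiny (i.e. gold),",
  "and he brings a mirror [J2461.1.7.1]."
)

# Locate every bracketed group and strip the brackets.
m <- gregexpr("\\[[^]]+\\]", desc)[[1]]
brackets <- regmatches(desc, list(m))[[1]]
motifs <- gsub("\\[|\\]", "", brackets)

# Character immediately after each closing bracket.
starts <- as.integer(m)
ends <- starts + attr(m, "match.length")
next_char <- substring(desc, ends, ends)

# Assumed marker: a colon after the bracket opens a list of alternatives.
markers <- data.frame(
  motif = motifs,
  opens_alternatives = next_char == ":"
)
print(markers)
```

Other candidate markers (numbered variants such as "(3)", or phrases like "in one of several ways") would need the same kind of inspection before any of them is used to restructure the data.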

Manually add tales from pages generating errors

See p. 11, section 2.2.2 “In addition, the pages for the following tale types were unable to be scraped, due to errors generated in the R session: The Fisherman and His Wife, A Fool Does Not Count the Animal He Is Riding, Singing Thieves, Stages of Life, Straw, Bean, and Coal, The Wandering Jew”.

Tag root and terminal (leaf) node in ATU datasets

@sdaranyi asked that we: "List all terminals and roots. Group both lists into topics based on TMI labels. As L161 and L162 are frequently terminals, list all types for individual inspection where the motif string continues after them (or after any other terminal)". Planning to resolve this by doing two things:

  1. In atu_seq dataset, creating a new column identifying the first and last motif within each tale_variant. This will allow for easy filtering of these by people using the .csv file.
  2. In atu_graph.graphml file, adding two T/F fields to the node data: is_root and is_leaf

@sdaranyi and @salmonix, please note that this approach will identify the first and last items present in the data. It will be an entirely different effort to identify what "should be" a terminal node, if the ATU does not identify these in the correct sequence.
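Step 1 above could be sketched roughly as follows. The column names (tale_variant, motif_order, is_root, is_leaf) are assumptions about the atu_seq layout, and the toy rows reuse two of the 1341A variants discussed elsewhere in this tracker rather than the real dataset.

```r
library(dplyr)

# Toy stand-in for atu_seq; column names are assumptions.
atu_seq_toy <- data.frame(
  atu_id = "1341A",
  tale_variant = c(2, 2, 2, 3, 3, 3),
  motif_order = c(1, 2, 3, 1, 2, 3),
  motif = c("J2356", "J2136", "J581", "J2356", "J581", "J2136")
)

# Flag the first and last motif present in each tale variant.
flagged <- atu_seq_toy %>%
  group_by(atu_id, tale_variant) %>%
  mutate(
    is_root = motif_order == min(motif_order),
    is_leaf = motif_order == max(motif_order)
  ) %>%
  ungroup()

print(flagged)
```

As noted above, this only marks the first and last items present in the data; it says nothing about which motif "should be" terminal.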

Flag tales within sub-pages with multiple tale-types

Q: Under 'Boccaccio', 'Cat and Mouse', we find multiple tales with separate type numbers. For all other pages, we assign the same type to all tales in the page. This would require separate logic to assign separate types to each tale in a page. How valuable is this to our effort?

A (from @sdaranyi ): _From the Boccaccio label, I would leave out the Decameron Web. It is too concise to have a single type number. The rest have type numbers, so for "The boy who has never seen a woman", the subsumed three stories could be perhaps tagged as 1678_A, 1678_B and 1678_C. Does this violate your coding ideas? (For some reason, below the Boccaccio heading, "The boy..." is a standalone title as well, with the indication "Tales of types 1678 and 1459". I have no idea how 1459 entered the game. ) Likewise, "The enchanted pear tree" with its 8 variants could be labelled 1423_A to H. "The Three-Ring Parable", type 972, returns a 404 error at my end, and could be left out. The Griselda item sounds flat as a punctured tire yet has a number, so it should stay.

The "Cat and Mouse" category looks interesting, because in it, keywords (ie cat and mouse) clash with tale typology. I'd suggest to keep the existing type labels. If we run into evidence that keyword-level semantics is strong enough to blur type-level (topical) semantics, we can return to the problem._

Resolve non-valid ATU IDs in AFT dataset

Pasting your e-mail to track its resolution, @sdaranyi
"Hi Josh, I started to work with the AFT to exemplify on an existing corpus how things would look if people started using an existing Python workflow in Orange...:

  • I am using the aft.csv (21-03-25 in my folder -- pretty old, if the current version has mended the problem to be described below, please ignore this);
  • It has atu-Id values well beyond 2400, the upper limit of tale types in the ATU. With not even the list of Discontinued types containing identifiers between 3000-7080, I simply have no idea where Ashliman may have gotten these types from, but chances are that he used something else, not the ATU;
  • The column type_identifier contains strings not found in the ATU. No matches whatsoever, so again the question is if Ashliman used an older typology or his own."

Resolve identified data cleaning issues with TMI dataset

@sdaranyi noted that "Looking at today's pulled latest TMI, there's something wrong with the structure. Column 1 has mostly motif numbers only with labels in Column 3, but in a number of rows both the motif number and the label are present. Worse, based on this table you said there are 46222 individual motifs, whereas it has 54906 lines. Which one is the correct number? Chances are that this could change the proportion of motifs used in the ATU."

@sdaranyi , can you please provide examples where both the motif number and the label are present in the id field? That would help me to find the issue.

@sdaranyi , my tmi dataset still shows 46,222 unique values in the id field, and 46,230 rows in the dataset, with the following items being duplicated (and needing to be resolved):

[Image: list of duplicated id values]
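A minimal sketch of how the unique-versus-row counts and the duplicated ids can be listed with dplyr. The tmi_toy table is a made-up stand-in (the duplicate in it is illustrative, not one of the actual duplicated items); only the id column name comes from the discussion above.

```r
library(dplyr)

# Toy stand-in for the tmi table; the duplicated row is illustrative only.
tmi_toy <- data.frame(id = c("J581", "J2136", "J2136", "L161"))

# Ids that occur on more than one row.
dupes <- tmi_toy %>%
  count(id, name = "n_rows") %>%
  filter(n_rows > 1)

n_unique <- n_distinct(tmi_toy$id)  # number of unique ids
n_total <- nrow(tmi_toy)            # number of rows

print(dupes)
```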

@sdaranyi and @salmonix , please identify any additional data cleaning issues you may find related to the tmi dataset in the comments below, and I will try to resolve those quickly, if they are minor, rather than opening new issues to track them.

Licensing documentation

To do:

  • Remove copyright field
  • Add documentation that it is assumed that items are in the public domain, released under Creative Commons.
  • Add citation in BibTeX for citing use of dataset and related publications

Include tales with duplicate titles within the same tale type

Separate and include tales with duplicate titles in the same tale type. These include:
Ant and Grasshopper, Bearskin, Beauty and the Beast, The Bremen Town Musicians, The Runaway Pancake, The End of the World, Devils Bridge Tales, Dividing Souls in the Graveyard, Fairy Cup Legends, Foolish Wishes, The Fox and the Cat, The Grateful Animals and the Ungrateful Man, Hanging Game, The Himphamp, Jack and the Beanstalk, The Language of Animals, The Lion in the Water, Llewellyn and His Dog Gellert, Town Mouse and Country Mouse

Modify visuals for summary stats

Per reviewers:

  • Figure 2 would be better as a table
  • Figure 1 (Distribution of tale lengths) may be misleading (78-1133-1-SP): since most tales have fewer than 1,000 words and only a few have more, a scale running past 10,000 words gives a poor picture of what is in your data. Choose a better visualization to make your case stronger.
  • Table of key statistics: The paper is missing a summary table of the key statistics of the corpus. This should list things like: # tales, # types, # unique provenances, min, avg, and max of various measures such as # tokens/tale, #tales/type, etc.

Transforming/sequencing ATU data

To do:

  • Convert separate variants within a single atu_id to numbered columns.
  • Each atu_id may have different variants; even if these contain the same motif IDs, we want to keep them separate
  • Extract the motif_id sequence from each story variant of each atu_id
  • Extract the "combinations" for each atu_id to allow analysis of co-occurrences

Add biclustering example use case

Per request from @sdaranyi : "2-way clustering (biclustering, block clustering, co-clustering) on the AFT for first results. I know it exists in R because I used it a few years ago but it would be killing to find the relevant knowhow in my files right now. Its output is a heatmap, or it should be combined with one, the point being that we could immediately smuggle in the ATU thereby, without much ado. It would be nice to create a visual to reproduce the ATU contents, separating animal tales, magical tales etc into separate boxes so that eg these categories label the rows and the respective ATU numbers flagging texts label the columns (or vice versa, with more texts than categories). I'm not sure if the figure will make much sense for us, but for others not being familiar with this technique, it will. Plus herefrom we could move ahead toward text analysis vs motif string based block clustering in pursuit of motifs etc. So to say, an exemplification of the possibilities for future users with the willingness but no idea of what next."

Add lookup table for Proppian functions

@sdaranyi , you mentioned that you'd like to: "In a separate list, start collecting motifs that correspond to functions. Eg L161 and L162 would be F31 Wedding, maybe part of it F30 Punishment, whereas the root could be screened for F0 Initial situation, F1 Absentation (leaving from home)."

In order to do this, we would want to be able to join our motif table to a table of Proppian functions. Would we want any information about the functions in this table other than what's noted here (i.e. a table with propp_id, definition, function)?
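For discussion, a minimal sketch of what such a join might look like. The lookup table here is hypothetical: only the L161/L162 to F31 (Wedding) pairing comes from the quote above, and the column is called function_name because `function` is a reserved word in R and cannot be used as a bare column name.

```r
library(dplyr)

# Hypothetical lookup table; only the L161/L162 -> F31 Wedding pairing
# is taken from the discussion above.
propp_lookup <- data.frame(
  motif = c("L161", "L162"),
  propp_id = c("F31", "F31"),
  function_name = c("Wedding", "Wedding")
)

# A few motif ids standing in for the motif table.
motifs <- data.frame(motif = c("J2356", "L161", "L162"))

# Motifs without a Proppian function keep NA in the joined columns.
joined <- left_join(motifs, propp_lookup, by = "motif")
print(joined)
```

A motif that maps to more than one function (e.g. partly F30 Punishment) would need one lookup row per motif-function pair, which a left join handles by duplicating the motif row.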

Clean motif IDs

Clean patterns from atu_seq$motif found using:

library(dplyr)

# Motif IDs used in atu_seq that have no match in the tmi id field
un <-
  atu_seq %>%
  select(motif) %>%
  distinct() %>%
  anti_join(tmi, by = c("motif" = "id"))

Clean patterns from aft$atu_id found using:

# ATU IDs used in aft that do not occur in atu_seq
un <-
  aft %>%
  select(atu_id) %>%
  distinct() %>%
  anti_join(atu_seq %>% select(atu_id) %>% distinct(), by = "atu_id")

Find a way to merge the ATU with the AT

The AT, which Uther updated, had useful extra information that the ATU ignores or condenses, most prominently on the motif strings of type variants. Integrating both sources could help us disambiguate overly long motif chains, which often run beyond the usual terminal. See e.g. suspicious cases where L161 is non-terminal, although it is quite often terminal.

Address issues with sequencing logic

Need to address these issues, here (before expanding):

  • Reference to motif at beginning of sequence which is not explicitly named (e.g. "A1750ff."). Current plan: Subset the tmi dataset for all values in the section ending "ff.", then resolve this with a between() join to that portion of the tmi dataset.
  • Reference to range of motifs in sequence not explicitly named (e.g. "F611.1.11–F611.1.15"). Current plan: resolve this with a between() join to the tmi dataset.
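A rough sketch of the range case, assuming both ends of the range share a prefix and differ only in their final numeric component, and that the separator has already been normalised to a plain hyphen. The tmi_toy ids are illustrative placeholders that have not been checked against the TMI, and the helper names are hypothetical. The "ff." case is not covered here.

```r
library(dplyr)

# Toy stand-in for tmi; ids are placeholders, not verified TMI entries.
tmi_toy <- data.frame(
  id = c("F611.1.10", "F611.1.11", "F611.1.13", "F611.1.15", "F611.1.16")
)

# Assumes the range separator has been normalised to "-".
range_ref <- "F611.1.11-F611.1.15"
bounds <- strsplit(range_ref, "-", fixed = TRUE)[[1]]

# Split an id into its prefix and its final numeric component.
prefix_of <- function(x) sub("\\.[0-9]+$", "", x)
last_of <- function(x) as.integer(sub("^.*\\.", "", x))

# Keep ids with the same prefix whose last component falls in the range.
expanded <- tmi_toy %>%
  filter(
    prefix_of(id) == prefix_of(bounds[1]),
    between(last_of(id), last_of(bounds[1]), last_of(bounds[2]))
  )

print(expanded)
```

Comparing the final component numerically matters because a plain string comparison would sort "F611.1.2" after "F611.1.11".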

Various small housekeeping fixes

A running list of tiny tweaks, filters and cosmetic changes:

  • Instances where tale_title differs, so the longer text isn't selected (e.g. "buttermilk jack","King Bluebeard")

Clean atu_combos

Dataframe contains multiple notes on possible combinations that require regex and unnesting

Cleaning atu text

To do:

  • Chapters to str_to_title()
  • Pad atu_id to support joins
  • Remove period at end of tale_name
  • Split previous tale_name portion into new field
  • Remove � from text
  • Remove prefix (e.g. "Remarks:") from text field
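Several of these steps can be sketched with stringr. The sample strings are made up for illustration, and padding the numeric part of atu_id to four digits is an assumption (based on URL fragments such as type0303), not a confirmed convention of the dataset.

```r
library(stringr)

# Chapters to title case.
chapter_clean <- str_to_title("ANIMAL TALES")

# Remove the period at the end of tale_name.
name_clean <- str_remove("The Fox and the Cat.", "\\.$")

# Pad only the numeric part of atu_id, keeping letter or "*" suffixes.
atu_id <- c("57", "303", "20C", "1341A")
num <- str_extract(atu_id, "^[0-9]+")
suffix <- str_remove(atu_id, "^[0-9]+")
padded <- paste0(str_pad(num, width = 4, pad = "0"), suffix)

# Remove a prefix such as "Remarks:" and any replacement characters.
text_clean <- str_remove("Remarks: example text", "^Remarks:\\s*")
text_clean <- str_remove_all(text_clean, "\uFFFD")

print(padded)
```

Padding the numeric part separately keeps suffixed IDs such as 20C aligned with their unsuffixed neighbours when sorting or joining.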

Various missing tales due to unknown reasons

These include:

  • Crop Division Between Man and Ogre: Only 'The Troll Outwitted' is missing... why?
  • Llewellyn and His Dog Gellert: Only 'The Dog and the Snake and the Child', why?

Include tales with links whose pages don't appear to be scraped

These include:

  • Tales missing with a manually assigned id: Ali Baba, Animal Brides and Animal Bridegrooms, Big Peter and Little Peter (norway133) , Friday, Frog Kings, The Hand of Glory, Hog Bridegrooms, The Husband Who Was to Mind the House, Mastermaid (norway120), Midwife for the Elves

  • Tales which appear to have a type in the URL but are missing: The Blood Brothers (type0303), The Three-Ring Parable, The Fox and the Crow (type0057)

Missing texts from Ashliman corpus

Notes from review by @sdaranyi (thank you!):

  • Air Castles: The page lists 20 tales, your list has 17;
  • Ali Baba is missing;
  • Animal Brides: Ashliman's 11th link lists 6 other respective tales, to be added to 402;
  • Animal Brides and Animal Bridegrooms are missing;
  • Ant and Grasshopper: there are 8 in the table and 12 in Ashliman;
  • Bearskin: 7 in the csv, 8 on the page (the general case seems to be that duplicate titles by different authors cause the difference);
  • Beauty and the Beast: 14 in the table, 20 on the page;
  • The Bell of Justice: please remove The Bell of Atri (another copy). Written by Longfellow, the question is, why do we keep Tolstoy, Bierce, Mark Twain, and indeed Aesop, and I have no good answer;
  • Sunken Bells: 14 in the table, 15 in the list;
  • Big Peter and Little Peter is missing;
  • The Birthmarks of the Princess: 12 in the table, 13 in the list;
  • The Blood Brothers is missing;
  • Bluebeard: 6 in the table, 7 in the list;
  • Under Boccaccio, we find 4 tales with type numbers. Please add Griselda with the type number, it was earlier deleted because of the lack of it;
  • The Enchanted Pear-Tree: please remove the footnote;
  • The Three-Ring Parable is missing;
  • Breaking Wind is missing;
  • The Bremen Town Musicians: 12 in the table, 13 in the list;
  • Bride Tests: please modify 1541 according to the five variants, numbered. The fifth one is segmented in 3 types, see footnote. Maybe we could split it in three as well?
  • The Brothers Who Were Turned Into Birds: 10 in the table, 12 in the list;
  • Cat and Mouse is missing;
  • Type 2015: Nanny Who Wouldn't Go Home to Supper missing;
  • Type 2025: The Runaway Pancake: 10 in the table, 13 in the list;
  • Type 20C is still indexed as 2033; 9 in the table, 10 in the list;
  • Type 2035: The House That Jack Built is missing;
  • Child Custody: 4 in the table, 5 in the list;
  • Cinderella is missing;
  • The Father Who Wanted To Marry...: 26 in the table (???), 24 in the list.
  • The Crane, the Crab and the Fish is missing;
  • Crop Division Between Man and Ogre: 11 in table, 13 in list;
  • Dancing in Thorns is missing;
  • Death's Messengers is missing;
  • How the Devil Married Three Sisters: 10 in the table, 11 in the list;
  • Deceiving the Devil with a Rope of Sand is missing;
  • Deceiving the Devil by Breaking Wind is missing;
  • Devil's Bridge: 9 in table, 15 in list;
  • Dividing Souls in the Graveyard: 5 in table, 6 in list;
  • Dream Bread: 6 in table, 7 in list;
  • Please add "East of the Sun and West of the Moon" as it turns out to be an animal brides tale, Type 402;
  • End of the World: 9 in table, 10 in list;
  • Fairies Hope for Christian Salvation: 8 in table, 11 in list;
  • A Fairy Captured: 5 in the table, 7 in the list, the last one is a link to another 9 tales of the same type;
  • Fairy Cup Legends: 15 in table, 18 in list;
  • Fairy Gifts: not sure what footnotes 1-3 refer to;
  • The Father Who Wanted to Marry His Daughter is listed twice in the table;
  • The Flying Dutchman: 3 in the table, 5 in the list (the 6th is useless links). The correct type is 777* ;
  • The Foolish Friend: 15 in the table, 16 in the list;
  • Foolish Wishes: 7 in the table, 9 in the list;
  • The Fox and the Cat: 11 in table, 13 in list;
  • The Fox and the Crow is missing;
  • The Fox, the Wolf and the Horse: 3 in the table, 4 in the list;
  • Frau Holle is missing;
  • Friday is missing;
  • Frog Kings is missing;
  • The Two Frogs is missing;
  • Haunted by the Ghost...: 5 in table, 7 in list;
  • The Girl Without Hands is missing;
  • Grateful Animals...: 6 in table, 7 in list;
  • Grateful Dead: 2 (!!!) in table, 19 (!!!) in list;
  • Hand From the Grave: 7 in table, 12 in list;
  • The Hand of Glory_958E* is missing;
  • Hanging Game: 3 in table, 4 in list;
  • Hansel and Gretel: 10 in table, 8 in list (????);
  • The Himphamp: 11 in table, 13 in list;
  • Hog Bridegrooms is missing;
  • The Husband Who Was to Mind the House is missing;
  • Jack and the Beanstalk: 2 in table, 8 in list;
  • The Language of Animals: 13 in table, 14 in list;
  • The Lion in the Water: 10 in table, 12 in list;
  • The Sick Lion: 3 in table, 4 in list;
  • Llewellyn and His Dog: 6 in table, 8 in list;
  • Please remove the Bell of Atri;
  • Not sure if we should keep Longfellow's poem "The Bell of Atri" -- what do you think?
  • The Man, the Boy and the Donkey: 5 in table, 7 in list;
  • Mastermaid is missing;
  • We should consider including Melusina, apparently related to The Mermaid Wife 4080;
  • Midwife for the Elves is missing;
  • Monkey Bridegrooms is missing (could be related to Hog Bridegrooms and Animal Wives and Bridegrooms);
  • The Moon in the Well: 2 in table, 4 in list;
  • Every Mother Thinks... : 6 in table, 7 in list;
  • Town Mouse and Country Mouse: 2 in table, 5 in list.

Add new tale types for scraping

Add pages with identified tale types to the list for scraping

  • Add Melusina, coded as The Mermaid Wife 4080
  • Add 'Monkey Bridegrooms' (monkey), coded as 0441
  • Add "East of the Sun and West of the Moon" coded as Type 0402

Remove tale variants from atu_seq when a motif is repeated in sequence

This is based on an issue identified by @salmonix. In the example identified, the sequences of the tale variants are as follows for tale 1341A:

  1. "J2356","J2136","J2136"
  2. "J2356","J2136","J581"
  3. "J2356","J581","J2136"
  4. "J2356","J581","J581"

The text runs as follows: "...The thieves kill him, too [J581, J2136]. (3) Two foolish slaves are recaptured because of their talkativeness [J581, J2136]..." The motifs identified are: J581 (Wisdom and Folly, Foolishness Of Noise-Making When Enemies Overhear) and J2136.1 (Wisdom and Folly).

@salmonix suggests that, in our cleared data perhaps we should only retain variants 2 and 3 above, and remove 1 and 4, where the same motif is repeated in a row.

Adding this as an issue for discussion: @sdaranyi and @salmonix, are we certain that we want to remove all tale variants where a motif is repeated 2 or more times in a row?
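To make the proposal concrete, here is a minimal sketch that flags variants in which the same motif appears twice in a row and drops them. It uses the four 1341A variants listed above in long format; the column names tale_variant and motif_order are assumptions about the atu_seq layout.

```r
library(dplyr)

# The four variants of tale 1341A listed above, one row per motif.
variants <- data.frame(
  tale_variant = rep(1:4, each = 3),
  motif_order = rep(1:3, times = 4),
  motif = c("J2356", "J2136", "J2136",
            "J2356", "J2136", "J581",
            "J2356", "J581", "J2136",
            "J2356", "J581", "J581")
)

# A variant has a repeat if any motif equals the one directly before it.
repeats <- variants %>%
  arrange(tale_variant, motif_order) %>%
  group_by(tale_variant) %>%
  summarise(has_repeat = any(motif == dplyr::lag(motif), na.rm = TRUE))

# Drop the variants flagged above.
kept <- variants %>%
  anti_join(filter(repeats, has_repeat), by = "tale_variant")

print(repeats)
```

Keeping the has_repeat flag as a column, rather than deleting rows outright, would let the decision stay reversible while the question above is still open.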
