
Review frus1964-68v23 (Congo) about frus-tei HOT 8 CLOSED

joewiz commented on July 28, 2024
Review frus1964-68v23 (Congo)

from frus-tei.

Comments (8)

joewiz commented on July 28, 2024

The Schematron passed just fine, but the RelaxNG schema produced 159 problems initially. I fixed these using a combination of approaches:

  • The schema flagged hi/@rend="smallcaps" and hi/@rend="roman". Added these values to our ODD file, frus.odd.
  • The schema flagged <opener> and <salute>; these will be nice to have to translate into flush-left paragraphs as an alternative to p/@rend="flushleft" and a good complement to closer/signed. Added these elements to our ODD.
  • Added <gap> to our ODD, constraining the attributes to just @quantity and @unit; will need to notify DSCS to use @quantity instead of @extent.
  • Deleted instances of orgName, affiliation not allowed in our ODD; while nice from a semantic perspective, they don't add particular analytical value in the mode applied by DSCS.
  • Changed ref/@ana in d290 to ref/@target

Also spotted these problems in the course of the schema review:

  • Line breaks <lb/> needed on pgII between lines, "DEPARTMENT OF STATE Office of the Historian...". Added the missing line break elements.

joewiz commented on July 28, 2024

From previous SVN commits:

  • Added missing <lb> line break elements in multi-line signatures. Found these instances with this XPath in oXygen: //closer[.//affiliation and not(.//lb)]. This leverages vendor's use of <affiliation> elements for the 2nd and subsequent lines following the signature.
  • Also, worked on Published & Unpublished Sources headings in the source note.
  • Added missing @type attributes to subject and participant lists (vendor seems to have been thrown by lists whose headings were variants of the usual entries, i.e., PRESENT, PRECIS, RE, CRYPTONYM, etc.) TODO add to our guidelines.
  • Fixed missing space in d83fn4: "Congo Crisis,Document 71"
  • In scanning cross references to other volumes, found generally good tagging, but (TODO) we should standardize our style guide for linked cross references. There's a lot of room for interpretation about how much of, and which portions of a cross reference to tag, and when to take enumerated volume, document, or footnote numbers.
  • Another issue is paragraphs that were tightly spaced (vertically) in the PDF but are tagged simply as paragraphs, indistinguishable from other paragraphs. We often use this tight spacing to set off lists, quotes, etc. from the normal flow of paragraphs. We should decide if tight spacing needs to be tagged or not, and whether to continue with the current practice (of list/item, sans @type). (TODO)
  • Added space missing at start of numbered paragraphs (#19-27) in d579, e.g.: <p>27.The US enjoys...
  • Similarly, a space was missing between document number and heading of d580: <head>580.Memorandum From...
  • Based on these two examples, I searched with this regular expression: \d\.[A-Z] (i.e., one digit followed by a period and a capital letter) and found other instances of this in d342, d495, d496, d501, d504-6, d509-10, d512-5, d517-8, d520-2, d524-5, d528-9, d531, d533, d535, d539, d541, d544, d548, d550, d560, d565, d568-70. Besides conjoined document numbers/headings, this phenomenon was manifest in cable numbers, e.g.: <p>2402.Ref... The document heading cases could be a candidate for a schematron error. The paragraph-level instances could be a warning, since they're not strictly forbidden?
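The \d\.[A-Z] search described above could be scripted roughly as follows. This is a minimal sketch, not the check that was actually run: the function name find_fused_numbers and the error/warning split are assumptions modeled on the Schematron idea floated above, and the code scans raw text line by line instead of parsing the XML.

```python
import re

# A digit, a period, and a capital letter with no space between them,
# e.g. "580.Memorandum" or "27.The".
FUSED = re.compile(r"\d\.[A-Z]")
# Assumed severity rule: a fused number right after <head> is an error,
# anything else (numbered paragraphs, cable numbers) is only a warning.
HEAD_START = re.compile(r"<head[^>]*>\s*\d+\.[A-Z]")


def find_fused_numbers(xml_text):
    """Return (line_number, severity, line) for each line with a fused number."""
    hits = []
    for number, line in enumerate(xml_text.splitlines(), start=1):
        if FUSED.search(line):
            severity = "error" if HEAD_START.search(line) else "warning"
            hits.append((number, severity, line.strip()))
    return hits


sample = """<head>580.Memorandum From the Department</head>
<p>27.The US enjoys good relations</p>
<p>28. This one is fine</p>"""

for hit in find_fused_numbers(sample):
    print(hit)
```

The sample strings are made up for illustration. A real check would probably live in the Schematron itself; this only shows how the pattern separates heading cases from paragraph cases.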

joewiz commented on July 28, 2024

Initial notes on the random sample:

  • The PDF has 921 pages. 5% = 46 pages.
  • Setting aside the front matter, which I already looked at closely, the body has 887 pages. 5% = 44 pages. Pages 1-44 would cover documents 1-32. Going by documents, 5% of 582 documents would be 29 documents.
  • Best to take a random 30 documents. How about documents 1-5 of each 100 documents? (Other reviews could take other approaches - best that we vary our approaches.)
  • d1: the <pb> for page 2 broke in the middle of a word. Our guidelines have always been not to break a word, but to place the pb after the final word of a page.
  • d3: dang, I should've replaced (in signatures) with .
  • d3: "Conakat" not tagged as a term (CONAKAT is in the terms list)
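The sample-size arithmetic and the "first five of each hundred" idea above can be sketched as follows. The function names are hypothetical. Starting the later blocks at d100, d200, and so on (rather than d101, d201) is an assumption taken from the ranges reviewed later in this thread.

```python
def five_percent(total):
    """Round 5% of a count to the nearest whole number."""
    return round(total * 0.05)


def sample_ids(last_doc, per_block=5, block=100):
    """List document ids: d1-d5, then the first five of each later hundred."""
    ids = []
    for start in range(0, last_doc + 1, block):
        first = max(start, 1)  # there is no d0, so the first block starts at d1
        for n in range(first, first + per_block):
            if n <= last_doc:
                ids.append(f"d{n}")
    return ids


print(five_percent(921))     # pages in the whole PDF
print(five_percent(887))     # pages in the body
print(five_percent(582))     # documents in the volume
print(len(sample_ids(582)))  # size of the block-based sample
```

With 582 documents this yields six blocks of five, which lines up with the "random 30 documents" target.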

joewiz commented on July 28, 2024

#d100-#d104

  • no issues

#d200-#d204

  • Noticed in this range of documents that here and throughout Stan and Leop were not tagged with the <gloss> element. Added it in this range, but should be added elsewhere.
  • Silently corrected typo in #d208: assasinate > assassinate
  • #d203 for tight spacing text in 4A-E, changed <p> to <list>-nested <item> elements sans @type. (TODO: clarify guidelines on this, esp. wrt. @type.)
  • #d204 tagged ChiCom as <gloss>. Also caught 5 instances elsewhere with regex search for \schicoms?\s (whitespace + chicom + optional s + whitespace). Wondering why this (and Stan and Leop) were missed - perhaps because of case variation? If so, perhaps this was a prudent, intentional omission. And this could point to something we should be on the lookout for.
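For reference, the case-insensitive search for ChiCom mentioned above might look like this in a script. It is only a sketch: the function name and the sample sentence are invented, and the search in the review was presumably run in an editor, not in Python.

```python
import re

# Whitespace, "chicom", an optional "s", whitespace.
# IGNORECASE lets the same pattern catch ChiCom, CHICOM, chicoms, etc.
CHICOM = re.compile(r"\schicoms?\s", re.IGNORECASE)


def find_term(text):
    """Return each whitespace-delimited hit, with the surrounding spaces removed."""
    return [match.group().strip() for match in CHICOM.finditer(text)]


sample = "Talks with the ChiComs stalled. A CHICOM delegation left; chicom, however, is missed."
print(find_term(sample))
```

Because the pattern requires whitespace on both sides, it skips instances that touch punctuation or a tag. That keeps text already wrapped directly in <gloss> out of the results, but it can also hide untagged cases such as "chicom," above.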

Also

  • fixed all instances of <pb> breaking in the middle of words, moving the <pb> to the end of the word: Find: ([^\s]+)(<pb[^>]+?>)([^\s]+) Replace with: $1$3 $2
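The find/replace pair above translates to Python's re module roughly as follows. This is a sketch of the same pattern, not the tool that was actually used; the function name and the sample attribute on <pb> are made up for illustration.

```python
import re

# Same three groups as the find pattern above:
# (start of the word)(the <pb> tag)(rest of the word).
SPLIT_WORD_PB = re.compile(r"([^\s]+)(<pb[^>]+?>)([^\s]+)")


def move_pb_after_word(xml_text):
    """Rejoin words split by <pb> and put the <pb> after the word."""
    # The replacement "$1$3 $2" is written r"\1\3 \2" in Python.
    return SPLIT_WORD_PB.sub(r"\1\3 \2", xml_text)


before = 'the Depart<pb n="2"/>ment of State'
print(move_pb_after_word(before))
```

One caveat worth checking by hand: the third group takes everything up to the next whitespace, so if the split word is immediately followed by a closing tag (for example ment</p>), the <pb> ends up after that tag as well.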

joewiz commented on July 28, 2024

#d300-#d304

  • #d300: Noticed that Leo was tagged - this matched case of entry in terms list. But noticed that there is a "Leo G. Cyr" in the persons list. A possibility for mistagging, especially in cryptic telegrams? Similarly, many names are tagged, even if only the last name is present. I recall our guidance was to tag people only if the full name or title + last name was present. The concern about tagging instances where only the last name is present is that there could be ambiguity and thus mistagging.
  • #d301: Noticed smooshed spacing in item C, between <hi> and <gloss>. TODO: add check for sibling elements like these, which results in a space being inserted if serialization parameter indent=no. Similarly, sibling <gloss> elements (e.g., #d402 "AmbLeo")
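A check for directly adjacent sibling elements, as proposed in the TODO above, could start from something like this. It is a hedged sketch using only the standard library: the set of element names to watch and the function names are assumptions, and the sample markup is invented.

```python
import xml.etree.ElementTree as ET

INLINE = {"hi", "gloss"}  # assumed set of inline elements worth checking


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def adjacent_inline_pairs(root):
    """Return (left name, left text, right name, right text) for inline
    siblings that have no text or whitespace between them."""
    pairs = []
    for parent in root.iter():
        children = list(parent)
        for left, right in zip(children, children[1:]):
            names_match = local(left.tag) in INLINE and local(right.tag) in INLINE
            if names_match and not left.tail:  # empty tail: nothing between the tags
                pairs.append((local(left.tag), "".join(left.itertext()),
                              local(right.tag), "".join(right.itertext())))
    return pairs


sample = ('<p xmlns="http://www.tei-c.org/ns/1.0">Ref <gloss>Amb</gloss><gloss>Leo</gloss>'
          ' and <hi rend="italic">C.</hi> <gloss>Stan</gloss></p>')
print(adjacent_inline_pairs(ET.fromstring(sample)))
```

Here only the "AmbLeo" pair is reported; the <hi> and <gloss> later in the sample are separated by a space and so are left alone. Whether each reported pair is a real error would still need a human decision.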

joewiz commented on July 28, 2024

#d400-#d404

  • #d402: odd double accent mark on the "e" in "Chargé" in the PDF was luckily not preserved in the XML!
  • #d402: noticed extraneous @corresp on the <signed> element. Deleted all 235 instances of this in the volume. Tell DSCS to omit this in the future.

#d500-#d504

  • #d501: since the decision options follow the signature (and TEI doesn't allow paragraph content to appear following a <closer> element), DSCS followed our previous practice and tagged the signature with a <p rend="right">. But we now encase material following the closer, like this decision option block, in a <postscript> element, which is allowed following a <closer>. (TODO: document use of <postscript>, as well as frus:attachment if we don't just use <postscript> in its stead - perhaps better to use a core TEI element rather than creating a new element, but only if we're not abusing the tag.) Applied this closer/postscript change to #d86, #d226, #d246 (I moved the interesting right-aligned phrase right above the signature from its own paragraph into the signed element... I'm thinking closers should make bold explicit instead of implicit; and should @rend="roman" reset both italic and bold or just italic?). There are still about 20 cases of this, which can be found with //p[@rend='right'] - should be addressed when we flesh out the guidelines on this. Many good cases of this here that can be used as illustrations for the guidelines.
  • #d501: the 3 options in <p> elements at the end @rend="flushleft" to ensure they're rendered flushleft.
  • #d503: telegraph number (?) - the thing to the left of the dateline - needs @rend="flushleft"
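The //p[@rend='right'] lookup mentioned above can also be run outside oXygen. The sketch below uses Python's standard ElementTree, which supports this simple attribute predicate; the function name and the sample markup are invented for illustration, and the TEI namespace is spelled out because ElementTree needs it in the path.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def right_aligned_paragraphs(root):
    """Find every TEI <p rend="right"> below root (//p[@rend='right'])."""
    return root.findall(f".//{TEI}p[@rend='right']")


sample = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div><p>Text</p><p rend="right">Signature line</p></div>
  <div><p rend="flushleft">Option 1</p></div>
</body>"""

for p in right_aligned_paragraphs(ET.fromstring(sample)):
    print(p.text)
```

Each hit still needs review by hand, since not every right-aligned paragraph is a signature that belongs in a closer.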

In summary:

  • No significant issues in the volume to hold up release, but many areas where DSCS can improve for next time, illustrating where our guidelines could be tighter.

joewiz commented on July 28, 2024

Spotted a few things during the ebook review:

  • #d142 has a table - do we need borders? no, but there is a "total" line that is missing. TODO figure out how to encode/render these total/subtotal lines.
  • #d569 fn2 is empty - indeed, the footnote is missing in the PDF too. resolved: delete the empty footnote.

joewiz commented on July 28, 2024

This issue was moved to HistoryAtState/frus#10
