drongbulobsang / derge-kangyur

This project forked from esukhia/derge-kangyur-old

Ongoing proofreading of the 2013 THL-SOAS input on behalf of Barom Theksum Choling

Digital Derge Kangyur

Welcome to the working repository of the ongoing 2014-2018 Esukhia-Barom proofreading project!

The Digital Derge Kangyur you'll find on our repository is based on the UVA-SOAS 2013 eKangyur and is currently undergoing many changes -- use at your own risk!

The 2013 UVA-SOAS eKangyur

The UVA-SOAS 2013 eKangyur was created by diff-proofreading the previous UVA input against BDRC's OCRed etexts, ACIP's etexts, and Adarsha's early etexts, in a 2013 project overseen by UVA and funded by SOAS and KF (for 84000). This version is currently published on UVA, Adarsha, BDRC, and as part of SOAS's ACTIB corpus.

It was intended as an exact representation of the Derge Kangyur edition held by the Library of Congress (available on BDRC). As an exact representation, it preserved spelling mistakes, carving mistakes, archaic spellings, and mistakes caused by woodblock damage.

The Esukhia-Barom revision

The current digital version is an attempt at using linguistics and informatics to improve and normalize the digital Kangyur while preserving the spelling of the Derge woodblocks.

For more information on the workflow please refer to:

Image Sources

Each time an issue is found, our team checks the LOC scans and sometimes falls back on the edition printed by the 16th Karmapa in case of missing pages or unreadable passages. The Karmapa edition isn't used as a main source because it was retouched with marker pens before printing in Delhi.

[Images: LOC scan; Karmapa edition]

Other interesting differences appear on:

  • vol. 83, page 190a (end of the 6th line)
  • vol. 75, page 119a (third syllable)
  • vol. 33, page 246b (end of first line)

Format

The texts contain the following structural markup at the beginning of lines:

  • [1b] is a page and folio marker
  • [1b.1] is a page and folio marker followed by a line number

We follow the page numbers indicated in the original, which means that the page numbers sometimes go back to 1a (e.g. vol. 31 after p. 256). Page numbers that appear twice in a row are indicated with an x, for example in volume 102: [355xa].
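
The line markers described above can be read with a small regular expression. The following is a minimal sketch, not the project's own tooling: the names `MARKER`, `Marker` and `parse_marker` are hypothetical, and the pattern assumes that markers always look like [1b], [1b.1] or [355xa].

```python
import re
from typing import NamedTuple, Optional

# Assumed marker shape: [<folio><optional x><a|b>] with an optional .<line>
MARKER = re.compile(r"^\[(\d+)(x?)([ab])(?:\.(\d+))?\]")


class Marker(NamedTuple):
    folio: int            # folio number as written in the original
    repeated: bool        # True when the number appears twice in a row (x)
    side: str             # "a" or "b"
    line: Optional[int]   # line number, or None for a page-only marker


def parse_marker(text_line: str) -> Optional[Marker]:
    """Return the marker at the start of a line, or None if there is none."""
    m = MARKER.match(text_line)
    if m is None:
        return None
    folio, x, side, line = m.groups()
    return Marker(int(folio), x == "x", side, int(line) if line else None)
```

Because page numbers can restart at 1a within a volume, a folio number alone should not be treated as a unique key; pairing it with the position in the file is safer.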

They also contain a few error suggestions, noted as in the examples below. This is far from an exhaustive list of the issues found in the original; the staff was actually discouraged from adding these.

  • (X,Y) is (potential error, correction suggestion), example: མཁའ་ལ་(མི་,མེ་)ཏོག་དམར་པོ་

  • [X] signals obvious errors or highly suspicious spellings (ex: མཎྜལ་ཐིག་[ལ་]ལྔ་པ་ལ།), or un-transcribable characters

  • # signals an unreadable graphical unit

  • {TX} signals the beginning of the text with Tohoku catalog number X. We use the following conventions:

    • when a text is missing from the Tohoku catalog, we indicate it with the preceding number followed by a, ex: T7, T7a, T8
    • when a text has subindexes, we separate them with a dash, ex: T841-1, T841-2, etc. The sources of the subindexes are 84000, Adarsha, and The Nyingma Edition of the sDe dGe bKa' 'Gyur and bsTan 'Gyur: Research Catalogue and Bibliography.
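
As an illustration of the annotations listed above, here is a minimal sketch that resolves (X,Y) suggestions and collects Tohoku markers. The function names are hypothetical, and the patterns assume that suggestions are never nested and that Tohoku identifiers follow only the forms shown above (T7, T7a, T841-1).

```python
import re

# (X,Y): potential error X, suggested correction Y (assumed non-nested)
SUGGESTION = re.compile(r"\(([^(),]*),([^(),]*)\)")
# {TX}: digits, an optional letter suffix, an optional -subindex
TOHOKU = re.compile(r"\{T(\d+[a-z]?(?:-\d+)?)\}")


def resolve_suggestions(text: str, use_correction: bool = True) -> str:
    """Replace each (X,Y) with Y (the correction) or X (the original)."""
    group = 2 if use_correction else 1
    return SUGGESTION.sub(lambda m: m.group(group), text)


def tohoku_ids(text: str) -> list:
    """Return the Tohoku identifiers found in text, in order of appearance."""
    return TOHOKU.findall(text)
```

Calling `resolve_suggestions(text, use_correction=False)` keeps the reading of the woodblocks, while the default applies the suggested corrections.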

Lines that end with a shad sometimes have a trailing space character, so that concatenating the content of all lines yields correct text.
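
A minimal sketch of that concatenation, assuming each line starts with a marker of the form [1b] or [1b.1] (the function name `join_lines` is hypothetical). Only the line break is stripped, so a trailing space before it is preserved.

```python
import re

# Assumed shape of the marker at the beginning of each line
LINE_MARKER = re.compile(r"^\[\d+x?[ab](?:\.\d+)?\]")


def join_lines(lines) -> str:
    """Drop leading markers and line breaks, then concatenate the content."""
    parts = []
    for line in lines:
        content = line.rstrip("\r\n")          # keep trailing spaces
        parts.append(LINE_MARKER.sub("", content))
    return "".join(parts)
```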

Encoding

Unicode

The files are UTF8 with no BOM, in NFD. The following representations are used:

  • \u0F68\u0F7C\u0F7E (ཨོཾ) is used instead of \u0F00 (ༀ)
  • \u0F62\u0FB1 (རྱ) is used instead of \u0F6A\u0FB1 (ཪྱ)
  • \u0F62\u0F99 (རྙ) is used instead of \u0F6A\u0F99 (ཪྙ)
  • \u0F62\u0FB3 (རླ) is used instead of \u0F6A\u0FB3 (ཪླ)
  • \u0F6A\u0FBB (ཪྻ) is used for the most common form instead of \u0F62\u0FBB (རྻ)
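
The conventions above can be checked mechanically. The sketch below is an assumption about how such a check might look, not the repository's validation script; `DISCOURAGED` and `check_encoding` are hypothetical names. It flags every occurrence of \u0F62\u0FBB, which may be stricter than the "most common form" wording intends.

```python
import unicodedata

# Discouraged sequence -> representation listed above
DISCOURAGED = {
    "\u0F00": "\u0F68\u0F7C\u0F7E",
    "\u0F6A\u0FB1": "\u0F62\u0FB1",
    "\u0F6A\u0F99": "\u0F62\u0F99",
    "\u0F6A\u0FB3": "\u0F62\u0FB3",
    "\u0F62\u0FBB": "\u0F6A\u0FBB",
}


def check_encoding(text: str) -> list:
    """Return a list of human-readable problems found in decoded text."""
    problems = []
    if text.startswith("\ufeff"):
        problems.append("BOM")
    if unicodedata.normalize("NFD", text) != text:
        problems.append("not NFD")
    for seq in DISCOURAGED:
        if seq in text:
            codes = "".join(f"\\u{ord(c):04X}" for c in seq)
            problems.append("discouraged sequence " + codes)
    return problems
```

The text is expected to have been read with `encoding="utf-8"` rather than `"utf-8-sig"`, since the latter silently drops a BOM and would hide that problem.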

Punctuation

We apply the following normalization without keeping the original in parentheses:

  • ༄༅། ། at the beginning of pages are removed (, ); they should be straightforward to reinsert
  • ༄༅། ། are also removed at beginning of volumes when the beginning of a volume is in the middle of a text
  • are replaced by

We keep the original punctuation in parentheses (see above) but normalize the following:

  • ༄༅། ། are added at beginning of texts when they're missing
  • ག། །། instead of ག།། །།, or with any character conforming [གཀཤ][ོེིུ]? instead of ག
  • a tshek is inserted between characters conforming ང[ོེིུ]? and

Volume numbers

Each physical volume is one file. We follow the volume order of the Parphud edition; in the LoC edition, the main difference is that vol. 102 (of Parphud) comes before vol. 100 (of Parphud).
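
To make the reordering concrete, here is a minimal sketch under two assumptions that the text above does not confirm: that the digitized volumes are numbered 1 to 102, and that in the LoC order vol. 102 sits immediately before vol. 100 with no other change. The names `PARPHUD_ORDER` and `loc_order` are hypothetical.

```python
# Assumption: volumes 1..102 are present (the catalog is not included)
PARPHUD_ORDER = list(range(1, 103))


def loc_order(volumes=PARPHUD_ORDER) -> list:
    """Return the volumes with 102 moved immediately before 100."""
    vols = list(volumes)
    if 100 in vols and 102 in vols:
        vols.remove(102)
        vols.insert(vols.index(100), 102)
    return vols
```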

Page numbering issues

  • vol. 41, page 33 is duplicated
  • vol. 48, page 211 was skipped (both #210 and #211 are written on 210a as ང་ ཉིས་བརྒྱ་ བཅུ་ བཅུ་གཅིག་)
  • vol. 77, page 21b, 22a are blank (#22 is written on 22b)
  • vol. 77, page 150b, 151a are blank (#151 is written on 151b)
  • vol. 77, page 212b, 213a are blank (#213 is written on 213b)
  • vol. 86, page 93 is doubled (marked as གོ་གསུམ་གོང་མ་ on 93a/93b and གོ་གསུམ་འོག་མ་ on 93xa/93xb)
  • vol. 86, page 261 was skipped (#260 marked as ཉིས་བརྒྱ་དྲུག་ཅུ on 260a and #261 as ཉིས་བརྒྱ་ རེ་གཅིག་ རྒྱུད་འབུམ་ on 260b)
  • vol. 90, page 63 was skipped
  • vol. 93, page 205 was skipped (#204 marked as ཉིས་བརྒྱ་བཞི་ རྒྱུད་འབུམ་ on 204a and #205 as ཉིས་བརྒྱ་ལྔ་ རྒྱུད་འབུམ་ on 204b)
  • vol. 100, page 57 was skipped (#56 marked as ང་དྲུག་ གཟུངས་བསྡུས་ on 56a and #57 as ང་བདུན་ གཟུངས་བསྡུས་ on 56b)

Completion status

The catalog, volume 103, wasn't digitized as part of this project since it isn't Buddha's words and probably won't be translated by 84000. Esukhia is hoping to prepare it towards the end of 2018.

TEI Export

You can find a script in the scripts/ directory to validate the files and export into a TEI format that can be ingested by BDRC. Other exports should be straightforward taking this script as a template. Note that it exports the volumes in the LoC order.

Feedback

The files are on GitHub in the hope that they'll improve; don't hesitate to signal errors with a pull request!

How to cite

Use the following statement or the bibtex file.

 ཆོས་ཀྱི་འབྱུང་གནས། [1721–31], བཀའ་འགྱུར་སྡེ་དགེ་པར་མ།, Etexts from UVA, BDRC OCR, ACIP, and Adarsha combined and further proofread by Esukhia, 2012-2018, https://github.com/Esukhia/derge-kangyur

License

This work is a mechanical reproduction of a Public Domain work, and as such is also in the Public Domain.

Contributors

eroux, ngawangtrinley, tadhondup, rekongrabten