lcnetdev / scriptshifter

License: Creative Commons Zero v1.0 Universal

Dockerfile 0.25% Shell 0.45% Python 20.32% HTML 1.87% AutoIt 73.50% JavaScript 1.90% TypeScript 1.71%

scriptshifter's Issues

Additional Cyrillic scripts for Armenian & Georgian

Suggested additions: Cyrillic scripts like Abkhaz, Tatar, etc.

Korean: FKR050

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1169-L1171:

   If StringRegExp($NClipB," 제[0-9]") Then
	  $NClipB = StringReplace($NClipB," 제"," 제 ")
   EndIf

Is the intent of this code to insert a space after every instance of 제 that is followed by a digit? If so, it seems to me that it may not yield the expected results: as long as there is at least one occurrence of 제 followed by a digit, a space will be added after every space-preceded 제 in the text, regardless of what follows it.

In Python I'm proposing a regex substitution:

data = re.sub(r" 제([0-9])", r" 제 \1", data)

@hyoungl let me know if I am indeed interpreting this correctly.
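The difference between the two behaviours can be sketched as follows. This is only an illustration: `autoit_style` mimics the AutoIt snippet as read above, `regex_style` is the proposed substitution, both function names and the sample string are made up, and the sketch assumes the leading space is meant to be kept, as in the AutoIt replacement.

```python
import re


def autoit_style(text: str) -> str:
    """Mimic the AutoIt logic: if any ' 제<digit>' exists, pad EVERY ' 제'."""
    if re.search(r" 제[0-9]", text):
        text = text.replace(" 제", " 제 ")
    return text


def regex_style(text: str) -> str:
    """Pad only the occurrences of ' 제' that are followed by a digit."""
    return re.sub(r" 제([0-9])", r" 제 \1", text)


sample = "책 제1권 과 제목"  # made-up string: one ' 제<digit>' and one plain ' 제'
print(autoit_style(sample))
print(regex_style(sample))
```

With this sample, the first function also splits 제목, while the second leaves it alone.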

Korean: R2L flag

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1849:

   If $ConvertR2L = "On" AND $IsL_initial > 0 Then

$ConvertR2L is defined in L17 and looks like a user settings flag.

@hyoungl does ScriptShifter need to distinguish between the two settings (it would be a major framework change)?

Korean: leftover non-name test failures

@hyoungl Attached here is a list of the last failing Korean tests from the test strings you provided. I thought it would be more practical for you to review the log as a whole and comment on the individual issues, as many of them seem related to personal names.
korean_tests.log

Thanks.

Chinese (Hanzi)

Generally following ALA/LC table, but some rarely used Chinese characters are not Romanized

An extra space is always shown between a romanized character and punctuation. Most punctuation is romanized correctly (except that white corner brackets 『』 are not romanized)

All "ü" (the only diacritic in ALA/LC Chinese Romanization Table) are showing incorrectly as "u"

Diacritics on Two Georgian Characters Need Correction

ფ (U+10E4) should be pʻ [the diacritic is (U+02BB)], it is currently p̌
ქ (U+10E5) should be kʻ [the diacritic is (U+02BB)], it is currently ǩ

Korean: space before or after parentheses

In #42 I noticed a few failures like this one:

- Chungso kiŏp ŭi chŏllyakchŏk sŏngkwa kwalli(BSC) ironp'yŏn
?        ^^
+ Chungsogiŏp ŭi chŏllyakchŏk sŏngkwa kwalli (BSC) ironp'yŏn
?        ^                                  +
Original: 중소 기업 의 전략적 성과 관리(BSC) 이론편

(In order: test result, expected result, original string; ignore the mismatch in the first part for now)

I imagine that whether the original was a typo or a legit string for Korean language, the Romanized string should have consistent spacing for parentheses and other punctuation.

@hyoungl Is there a generic cataloging practice for this? I already added some normalization rules for other punctuation signs (;, :, etc.), but it would be good to have a general normalization table for all punctuation signs of this kind on all Romanized strings if guidelines exist in that direction.
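If the answer turns out to be "always put one space before an opening parenthesis", the normalization could look like the sketch below. The rule itself is an assumption pending the cataloging guidance asked for above, and `space_before_open_paren` is a hypothetical helper, not existing ScriptShifter code.

```python
import re


def space_before_open_paren(text: str) -> str:
    """Insert a space before '(' when it directly follows a non-space character."""
    # The lookbehind skips '(' at the start of the string, after whitespace,
    # or directly after another '('.
    return re.sub(r"(?<=[^\s(])\(", " (", text)


print(space_before_open_paren("sŏngkwa kwalli(BSC) ironp'yŏn"))
```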

Thai

Add Thai support.

Unavailable languages reporting being available

Some languages, like Greek (classical), are reported by the /languages/ endpoint but don't have a table in the directory, and return an error message if used.
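One possible guard is to report only the languages whose table file actually exists. The sketch below assumes that each language key maps to a `<key>.yml` file in a single directory (as the `tables/data/korean.yml` path seen elsewhere in these issues suggests); `available_languages` and its parameters are hypothetical names, not the current API.

```python
from pathlib import Path
from typing import Iterable, List


def available_languages(index_keys: Iterable[str], table_dir: str) -> List[str]:
    """Return only the keys that have a matching '<key>.yml' file in table_dir."""
    base = Path(table_dir)
    return [key for key in index_keys if (base / f"{key}.yml").is_file()]
```

The endpoint could then list `available_languages(...)` instead of every key in the index, or log the difference between the two as a configuration warning.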

Korean: `saranamgi` vs. `saranamki`

당신 이 살아남기 위해 알아야 할 사장 의 비밀 is romanized as Tangsin i saranamgi wihae araya hal sajang ŭi pimil but the test string expects Tangsin i saranamki wihae araya hal sajang ŭi pimil.

There is a conversion from ["f16~i0#","m~g"] in FKR103.

Is this correct? Or am I missing a step where g should become k?

Korean: names longer than 7 characters

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L458-L466:

   If StringLen($TargetKor) > 7 OR StringLen($TargetKor) = 1 OR StringInStr($TargetKorOrig," ",0,1)>3 Then
	  If $ForeignNameConversion = "Yes" Then
		 ClipPut($TargetKorOrig)
		 KorCorpNameRomOCLC()
	  Else
		 TrayTip("Error!",@LF & $TargetKorOrig & @LF & "may not be a Korean name",10)
		 ClipPut("Error!")
	  EndIf

This seems to throw an error on source names that are longer than 7 characters, or only 1 character long, or whose first space comes after the third character; unless $ForeignNameConversion is set to "Yes", in which case KorCorpNameRomOCLC() is run instead.

Two questions for @hyoungl:

Shall I assume that we always want foreign name conversion in ScriptShifter, or shall the user have the choice? (If this can remain fixed, it would make things much easier.)
KorCorpNameRomOCLC() never throws an error. Does that mean that names longer than 7 characters are always acceptable if foreign name conversion is on?
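For reference, the condition as I read it could be expressed in Python roughly as below. This is a sketch of my interpretation only: the function and parameter names are hypothetical, and the relationship between `$TargetKor` and `$TargetKorOrig` is not modelled (both are simply passed in).

```python
def looks_unlike_korean_name(target_kor: str, target_kor_orig: str) -> bool:
    """Mirror the AutoIt test: too long, a single character, or a late first space."""
    # StringInStr returns a 1-based position, or 0 when there is no match;
    # str.find returns a 0-based index or -1, so adding 1 gives the same scale.
    first_space = target_kor_orig.find(" ") + 1
    return len(target_kor) > 7 or len(target_kor) == 1 or first_space > 3


print(looks_unlike_korean_name("김정일", "김 정일"))
print(looks_unlike_korean_name("남궁민수", "남궁민 수"))
```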

Korean: translate FKR111

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/FKR_index.csv#L112 the Korean description of FKR111 is not translated. @hyoungl can you please translate it?

Korean: spacing before punctuation

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1150 various punctuation signs surrounded by spaces are replaced by a code, e.g. ' : ' is replaced by ' SB16KQ ' (FKR050).

Then, in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1257 this is being reverted after romanization, e.g. ' SB16KQ ' becomes ' : ' again (FKR066).

This for me results in a string such as Hŏsang kwa silsang : Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ instead of Hŏsang kwa silsang: Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ (note the extra space before :).

I didn't find a place in your code where space is removed before those punctuation signs. Can you point it out?

Or can I simply replace ' SB16KQ ' with ': ' (and similar punctuation cases that don't have a leading space in Roman) in FKR066?
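An alternative to editing each FKR066 replacement would be a single clean-up pass after romanization, sketched below. Which signs should lose their leading space is an assumption (here only `:`, `;` and `,`), and `tighten_punctuation` is a hypothetical helper.

```python
import re


def tighten_punctuation(text: str) -> str:
    """Remove whitespace that precedes ':', ';' or ','."""
    return re.sub(r"\s+([:;,])", r"\1", text)


print(tighten_punctuation("Hŏsang kwa silsang : Han'guk chŏngch'i"))
```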

Feedback mechanism

Add a mechanism for users to provide feedback with the goal of improving transliterations.

This functionality should be available both as an API endpoint and a form add-on to the HTML UI. The latter should appear after a transliteration attempt and should present the user with a partly pre-filled form.

The function should accept the following inputs:

Language
Direction (S2R or R2S)
Original string
Expected result
Options applied
Comments (optional)
User email (optional - for feedback)
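The inputs listed above could be modelled as a small data class shared by the API endpoint and the form handler. This is only a sketch: the class and field names are hypothetical, and the sole validation shown is the direction check.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Feedback:
    lang: str
    direction: str                    # "S2R" or "R2S"
    original: str
    expected: str
    options: dict = field(default_factory=dict)
    comments: Optional[str] = None    # optional
    email: Optional[str] = None       # optional, for follow-up

    def __post_init__(self) -> None:
        if self.direction not in ("S2R", "R2S"):
            raise ValueError(f"Unknown direction: {self.direction!r}")


fb = Feedback(lang="korean", direction="S2R", original="식민", expected="singmin")
print(asdict(fb))
```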

Burmese

Add support for Burmese: https://www.loc.gov/catdir/cpso/romanization/burmese.pdf

Ukrainian

It doesn’t appear that the apostrophe (which functions, I believe, like the hard sign in Russian) is differentiated from the soft sign in ScriptShifter transliteration

Azerbaijani (Cyrillic) Kazakh (Cyrillic) Kyrgyz (Cyrillic) Tajik (Cyrillic) Tatar (Cyrillic) Uzbek (Cyrillic)

I have tested this language repeatedly and also obtained screenshots of the books to make sure that nothing was missed. I really like ScriptShifter and was hoping that it would work well for Cyrillic Azeri, since Latin Azeri is not part of it. But, as I mentioned earlier, it unfortunately doesn't work for certain letters. I will try to insert the examples here.

This is the title of the book (LCCN#00655035) in the original Azeri Cyrillic (I have a screenshot of the cover and title page):

Азəрбајҹан Совет əдəбијјатынын поетика мəсəлəлəри : ǂb елми əсəрлəрин тематик мəҹмyеси

This is how ScriptShifter romanized it:

Azərbai̐jan Sovet ədəbиi̐i̐atыnыn poetиka məsələlərи : ǂb elmи əsərlərиn tematиk məjmyesи

This is how it should have been romanized according to the ALA-LC romanization table:

Azărbai̐jan Sovet ădăbii̐i̐atynyn poetika măsălălări : ǂb elmi ăsărlărin tematik măjmu̇esi

Below I highlight which letters were romanized correctly and which failed, giving each letter's table entry as vernacular followed by romanization.

ә → ă: While in some other instances this letter was romanized fine, in this particular case it stayed in the vernacular during romanization.

ј → i̐ and ҹ → j: These two letters were romanized correctly, both in the first word of the title and elsewhere.

и → i: The basic и was not romanized.

ы → y: This letter was not romanized either.

ү → u̇: The last incorrect letter, contained in the last word "мəҹмyеси", should have been romanized as u̇.

I want to repeat that I have tested several titles. This is a second example, original first and ScriptShifter output second:

Азәрбајҹан XXI әсрин астанасында" Республика елми-практики конфрансынын : материаллары

Azărbai̐jan XXI ăsrиn astanasыnda" Respublиka elmи-praktиkи konfransыnыn : materиallarы

The set of letters here is more limited, but the inaccuracies in the romanization are almost the same. What surprised me here was that ә in the first word "Азәрбајҹан" was romanized correctly, which was not the case in the previous example. But the problem with the letters и and ы stayed the same.
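Because ә was romanized in one title but not in the other, it may be worth confirming that both inputs use the same code points. Several Cyrillic letters have Latin look-alikes (the schwa, for instance, exists as U+04D9 in Cyrillic and as U+0259 in Latin), and a table keyed on the Cyrillic code point would presumably leave the Latin one untouched. Whether that applies to the titles above is something to check, not a diagnosis. A small inspection helper (hypothetical name `describe`) could look like this:

```python
import unicodedata
from typing import List, Tuple


def describe(text: str) -> List[Tuple[str, str]]:
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Escapes are used so that the two look-alike letters cannot be confused.
for label, ch in (("Latin schwa", "\u0259"), ("Cyrillic schwa", "\u04d9")):
    print(label, describe(ch))
```

Pasting a failing word into `describe()` shows at a glance whether any letter comes from an unexpected block.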

Korean: unexpected vocalization

도란 도란 들려주는 말 이야기 transliterates into Toran toran tŭllyŏjunŭn Mal iyagi but the test string expects Toran toran tŭllyŏ chunŭn mal iyagi.

I see a vocalization rule in FKR103 that inserts the j but nothing else changes the string from tŭllyŏjunŭn.

Korean: glottalization

@hyoungl I am not clear on what happens with the glottalization process in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1192-L1205 .

There is a substitution of the ^ character (added earlier) with the string GLOTTAL. Then, a few lines after, GLOTTAL is replaced with the empty string. I can't see any change happening in KorRom, which separates it from the Korean characters and returns it unchanged. In practice, the ^ character is simply removed with no further action.

In the case of 결단력, I get to 결 GLOTTAL 단+력 which eventually becomes Kyŏli.

Am I missing something? Is this how it's supposed to work?

Korean: incorrect romanization of `식민`

근대 와 식민 의 서곡 is romanized to Kŭndae wa sikmin ŭi sŏgok, but Kŭndae wa singmin ŭi sŏgok is expected.

Specifically, 식민 becomes sikmin instead of singmin.

Debugging the script at the 식민 word, the code point conversion yields i9#m20#f1~i6#m20#f4E.

FKR073÷100 don't change anything in this string.

This translates into sikmin as per FKR109.

No other modification is made after that point.

I verified that the mappings I transcribed from K-Romanizer are identical to the original.

@hyoungl Can you tell what is going wrong?

Korean: `isNonKor`

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1741-L1758 :

   For $i = 0 To Ubound($Rule1)-1
	  If StringLeft(StringStripWS($TargetKorOrig,8),1)=$Rule1[$i] Then
		 $IsNonKor=$IsNonKor+1
	  EndIf
   Next

etc.

And further down https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1773:

If $Len > 1 and $IsNonKor=0 and $IsParticle=0 Then

This is the only place where $isNonKor is checked, and it's only checked if it equals 0. @hyoungl can I break the loops in LL1741-1750 as soon as $isNonKor > 0 to avoid redundant processing?
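In the Python port the counting loops could collapse into a short-circuiting check, along the lines of the sketch below. It assumes that only "zero vs. non-zero" matters, as described above; `starts_with_any` and the sample rule list are hypothetical.

```python
from typing import Iterable


def starts_with_any(text: str, first_chars: Iterable[str]) -> bool:
    """True if the first non-whitespace character of text is in first_chars."""
    # "".join(text.split()) removes all whitespace, like StringStripWS(..., 8).
    stripped = "".join(text.split())
    first = stripped[:1]
    # any() stops at the first match, which is the "break" asked about above.
    return any(first == candidate for candidate in first_chars)


print(starts_with_any(" a b", ["x", "a"]))
```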

Korean doesn't seem to work

Submitting the test string to the /trans/ endpoint "曉城　趙　明基　博士　追慕　佛教　史學　論文集" I get an error:

scriptshifter     | INFO:scriptshifter.trans:Transliteration is from korean to Latin.
scriptshifter     | ERROR:scriptshifter.rest_api:Exception on /trans/korean [POST]
scriptshifter     | Traceback (most recent call last):
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
scriptshifter     |     response = self.full_dispatch_request()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1822, in full_dispatch_request
scriptshifter     |     rv = self.handle_user_exception(e)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1820, in full_dispatch_request
scriptshifter     |     rv = self.dispatch_request()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1796, in dispatch_request
scriptshifter     |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
scriptshifter     |   File "/usr/local/scriptshifter/src/./scriptshifter/rest_api.py", line 81, in transliterate_req
scriptshifter     |     out = transliterate(in_txt, lang, r2s, capitalize)
scriptshifter     |   File "/usr/local/scriptshifter/src/./scriptshifter/trans.py", line 59, in transliterate
scriptshifter     |     cfg = load_table(lang)
scriptshifter     |   File "/usr/local/scriptshifter/src/./scriptshifter/tables/__init__.py", line 117, in load_table
scriptshifter     |     tdata = load(fh, Loader=Loader)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/__init__.py", line 81, in load
scriptshifter     |     return loader.get_single_data()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 49, in get_single_data
scriptshifter     |     node = self.get_single_node()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 36, in get_single_node
scriptshifter     |     document = self.compose_document()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 55, in compose_document
scriptshifter     |     node = self.compose_node(None, None)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter     |     node = self.compose_mapping_node(anchor)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter     |     item_value = self.compose_node(node, item_key)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter     |     node = self.compose_mapping_node(anchor)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter     |     item_value = self.compose_node(node, item_key)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 82, in compose_node
scriptshifter     |     node = self.compose_sequence_node(anchor)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 111, in compose_sequence_node
scriptshifter     |     node.value.append(self.compose_node(node, index))
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter     |     node = self.compose_mapping_node(anchor)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter     |     item_value = self.compose_node(node, item_key)
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 64, in compose_node
scriptshifter     |     if self.check_event(AliasEvent):
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/parser.py", line 98, in check_event
scriptshifter     |     self.current_event = self.state()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/parser.py", line 449, in parse_block_mapping_value
scriptshifter     |     if not self.check_token(KeyToken, ValueToken, BlockEndToken):
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 116, in check_token
scriptshifter     |     self.fetch_more_tokens()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 251, in fetch_more_tokens
scriptshifter     |     return self.fetch_double()
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 655, in fetch_double
scriptshifter     |     self.fetch_flow_scalar(style='"')
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 666, in fetch_flow_scalar
scriptshifter     |     self.tokens.append(self.scan_flow_scalar(style))
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 1152, in scan_flow_scalar
scriptshifter     |     chunks.extend(self.scan_flow_scalar_non_spaces(double, start_mark))
scriptshifter     |   File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 1213, in scan_flow_scalar_non_spaces
scriptshifter     |     raise ScannerError("while scanning a double-quoted scalar", start_mark,
scriptshifter     | yaml.scanner.ScannerError: while scanning a double-quoted scalar
scriptshifter     |   in "/usr/local/scriptshifter/src/scriptshifter/tables/data/korean.yml", line 17104, column 29
scriptshifter     | expected escape sequence of 4 hexadecimal numbers, but found 'n'
scriptshifter     |   in "/usr/local/scriptshifter/src/scriptshifter/tables/data/korean.yml", line 17104, column 43
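The final message indicates a `\u` escape inside a double-quoted YAML scalar that is not followed by four hexadecimal digits. To locate other occurrences without loading the whole file through the YAML parser, a rough line scanner such as the one below might help. It is a heuristic sketch with a hypothetical function name: it does not know whether a match sits inside a double-quoted scalar, so it can report false positives in single-quoted or plain scalars.

```python
import re
from typing import Iterable, List

# A backslash-u that is not itself escaped (i.e. preceded by an even number
# of backslashes) and is not followed by exactly-four-or-more hex digits.
BAD_U_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*\\u(?![0-9A-Fa-f]{4})")


def find_bad_u_escapes(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of the lines containing a malformed \\u escape."""
    return [n for n, line in enumerate(lines, start=1) if BAD_U_ESCAPE.search(line)]
```

Usage would be something like `find_bad_u_escapes(open(path, encoding="utf-8"))`, then checking the reported lines by hand.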

Chinese: personal name recognition

Even though recognizing Chinese personal names is a very complex task based on context which may require machine learning, some simple rules can help catch at least some common cases, e.g. prefixes such as "Professor", "Teacher", "Mr.", "Ms.", etc.

Breaking API change

@thisismattmiller @kefo During the work in the korean branch it became necessary to change the output of the /trans API endpoint. Currently the output is a plain text string of the result; but K-Romanizer also displays important warnings on the Windows UI that need to be reproduced in the SS webapp, so I changed the output to a JSON object with the following structure:

{
  "output": <string>,
  "warnings": <list of strings>
}

I imagine that this would break Marva, probably by dumping the JSON object as plain text in the wrong place. Can we coordinate the deployment of the Korean module with updates to Marva?
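During the transition, a client could accept both response shapes. The sketch below is a suggestion only, with a hypothetical function name; it treats anything that is not a JSON object with an "output" key as the old plain-text response.

```python
import json
from typing import List, Tuple


def parse_trans_response(body: str) -> Tuple[str, List[str]]:
    """Return (output, warnings) for either the old or the new response format."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return body, []              # old format: plain text
    if isinstance(data, dict) and "output" in data:
        return data["output"], list(data.get("warnings") or [])
    return body, []                  # valid JSON, but not the new structure


print(parse_trans_response('{"output": "Han\'guk", "warnings": []}'))
print(parse_trans_response("Han'guk"))
```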

The work-in-progress branch is named korean. I'm getting close to deploying at least the part that handles strings without personal names.

Also, I noticed that @thisismattmiller added some very useful JS in the index.html page that fills a results div with the output: https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/templates/index.html#L69-L115

Would you mind making a small improvement? It would be about adding another div with an ID of "warnings" or similar (milligram, the CSS framework I used, should have some classes to highlight it), hidden by default, that gets displayed and filled with the warnings (maybe one <p> or <li> per list item), and changing the content of the results div to only the value of "output". Warnings would be empty most of the time, so the warnings div should stay hidden when they are empty.

Adlam doesn't appear in pull down menu in editor

I double-checked that Pulaar (Adlam) is present in the index.yml file, but in the Marva Editor it doesn't appear as an option in the Romanization manual select pulldown menu. Is there another connection that has to be made?

Korean: space check in names

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L469-L470:

	  If StringInStr($TargetKorOrig," ",0,3)>0 Then
		 TrayTip("Error!",@LF & $TargetKorOrig & @LF & "may not be a Korean name (too many spaces)",10)

This condition seems to never be evaluated because it tests for 3 or more spaces, but on L459 it's tested for 2 or more spaces. @hyoungl is that correct? Can this condition be removed?

Korean: popular names missing

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1244 several popular name conversions are made. Some names used in test strings seem to be missing.

The failing tests are:

김 정일 공포 를 쏘아 올리다 -> Kim chŏngil kongp'o rŭl ssoa ollida (expected: Kim Chŏng-il kongp'o rŭl ssoa ollida)
김 창모 의 대한 민국 선물 옵션 교과서 -> Kim ch'angmo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ (expected: Kim Ch'ang-mo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ)
다석 류 영모 - 우리 말 과 우리 글 로 철학한 큰 사상가 -> Tasŏk ryu yŏngmo - uri Mal kwa uri kŭl ro ch'ŏrhakhan k'ŭn sasangga (expected: Tasŏk Yu Yŏng-mo - uri mal kwa uri kŭl ro ch'ŏrhakhan k'ŭn sasangga - I am not sure if this is a personal name)

Do these names need to be added to the list of special names?

Diacritic Not Displaying Correctly in Georgian

The character ფ (both Pʻ and pʻ) becomes p2BB in transliteration.
The character ქ (both Kʻ and kʻ) becomes k2BB in transliteration.

Korean: FKR103 seems to run when it's not supposed to

일본 중심적 transliterates into chungsimjŏk but chungsimchŏk is expected.

The code sequence is i12#m13#f21~i9#m20#f16~i12#m4#f1E

There is a substitution for ["f16~i12#","m~j"] in FKR103.

This seems like the same problem as #36.

Regarding FKR103, I can see that in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1673 a series of variables are checked, but as far as I can see, they are all set to 0 only once and never changed. Is FKR103 supposed to run on some conditions?

Korean: R to L option

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1849 a ConvertR2L option is checked.

As for other options I encountered, I'm hoping that ScriptShifter could offer only one way of handling romanized strings. Which cases is this option relevant for, and can it be set to a fixed value?

Korean: OCLC Breve

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L964-L968:

  If WinActive("Voyager Cataloging") Then
	  $NoOCLCBreve = "On"
   Else
	  $NoOCLCBreve = "Off"
   EndIf

@hyoungl How would this condition occur in ScriptShifter? Would it be always true, always false, or depend on some external factor?

Korean: spaces around hyphen

한국 에서의 다문화 주의 - 현실 과 쟁점 is romanized as Han'guk esŏŭi tamunhwa chuŭi - hyŏnsil kwa chaengchŏm but Han'guk esŏŭi tamunhwa chuŭi-hyŏnsil kwa chaengchŏm is expected.

Are there supposed to be no spaces between the hyphen and the two words, even though there are in the original script?

Non-slavic Cyrillic tables

Complete Cyrillic tables for non-Slavic languages.

@RandyBarry can you add a list of the languages involved here?

Bulgarian

The Bulgarian letters "Ъ" and "ь" are not romanized, which makes the text unreadable. Amazingly, the following letters were romanized perfectly: Ѣ ѫ ѧ ю я

Bulgarian Er-malak (small yer) with ALA-LC sign " ʹ " (soft sign) missing after ScriptShifter romanization

Romanized Ŭ for Bulgarian Ъ at the beginning and in the middle of words does not appear when using ScriptShifter. The same is true for the romanized ʺ (hard sign) for Bulgarian Ъ at the end of words.

Bulgarian "Ъ, ъ", known as Er-golyam (large yer), phonetic transcription “ă”, ALA-LC Romanization “ŭ” or ʺ - not present after ScriptShifter romanization

Romanized Ŭ for Bulgarian Ъ at the beginning and in the middle of words does not appear when using ScriptShifter. The same is true for the romanized ʺ (hard sign) for Bulgarian Ъ at the end of words.

Serbian-Macedonian test line

The test string at https://github.com/lcnetdev/scriptshifter/blob/main/tests/data/script_samples/cyrillic.csv#L13 is marked as serbian_macedonian; however, the two languages have been split up and serbian_macedonian no longer exists, so that sentence must be tested with one of the two new tables. @RandyBarry which one is a better fit?

Korean: FKR171-179

In the original K-Romanizer, the common logic of FKR171 to FKR179 is to replace a character in $Hangul with a Roman X, e.g.

   If StringInStr($Hangul,"不")>0 THEN
	  StringReplace($Hangul,"不","X")

I don't see this X being replaced anywhere else in Functions_KoreanHancha.au3 or in Functions_KoreanRomanizer.au3. Is this intentional?

Korean: Name test failures

Current list of failing names (from Y. Lee's list):
korean_names_tests.log

Khmer

Add support for Khmer: https://www.loc.gov/catdir/cpso/romanization/khmer.pdf

Add Armenian Ligature Letters and a Diacritic

Add five Armenian lowercase-only ligature letters and a diacritic:

ﬓ (U+FB13) should become mn
ﬔ (U+FB14) should become me
ﬕ (U+FB15) should become mi
ﬖ (U+FB16) should become vn
ﬗ (U+FB17) should become mkh
եվ (U+0565) (U+057E) should become eʹv [the diacritic is (U+02B9)]
Եվ (U+0535) (U+057E) should become Eʹv [the diacritic is (U+02B9)]
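Expressed as a mapping, the requested additions look like the sketch below. It only illustrates the list above; the dictionary name and the `apply_additions` helper are hypothetical and not the table format ScriptShifter uses. Code points are written as escapes to keep the ligatures unambiguous.

```python
ARMENIAN_ADDITIONS = {
    "\uFB13": "mn",                   # ARMENIAN SMALL LIGATURE MEN NOW
    "\uFB14": "me",                   # ARMENIAN SMALL LIGATURE MEN ECH
    "\uFB15": "mi",                   # ARMENIAN SMALL LIGATURE MEN INI
    "\uFB16": "vn",                   # ARMENIAN SMALL LIGATURE VEW NOW
    "\uFB17": "mkh",                  # ARMENIAN SMALL LIGATURE MEN XEH
    "\u0565\u057E": "e\u02B9v",       # lowercase ech + vew
    "\u0535\u057E": "E\u02B9v",       # capital ech + lowercase vew
}


def apply_additions(text: str) -> str:
    """Replace the listed sequences, longest source strings first."""
    for src in sorted(ARMENIAN_ADDITIONS, key=len, reverse=True):
        text = text.replace(src, ARMENIAN_ADDITIONS[src])
    return text
```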

Korean: duplicates in conversion map

The (Hancha?) conversion map at https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L809 has some duplicates:

['肖','소'] followed by ['肖','초']
['葉','섭'] followed by ['葉','엽']

Also, on https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1182

[' 백골단 ',' 백^골^단 '] followed by [' 백골단 ',' 백골^단 ']

@hyoungl how is the substitution working here? It seems to me that the second duplicates will never be used.
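To list all such cases rather than spotting them by eye, the maps could be scanned for repeated source strings once they are transcribed as Python pairs. A minimal sketch, with a hypothetical helper name and the two Hancha examples above as input:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_keys(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return every source string that occurs more than once, with all its targets."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for source, target in pairs:
        seen[source].append(target)
    return {src: targets for src, targets in seen.items() if len(targets) > 1}


sample = [("肖", "소"), ("肖", "초"), ("葉", "섭"), ("葉", "엽"), ("不", "불")]
print(find_duplicate_keys(sample))
```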

Chinese: implement numerals parsing

Implement a parser for Chinese numerals similar to the Parallelogram logic: https://github.com/pulibrary/parallelogram/blob/8be9d46ca6b9b5f85e255f9823f82c5b5b0ada27/cloudapp/src/app/pinyin.service.ts#L160

Refer to email exchange with Tom Ventimiglia for details.
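As a starting point for discussion, a very small numeral parser might look like the sketch below. It is not the Parallelogram logic and does not reflect the email exchange: it is a generic illustration that handles digit strings (e.g. 二〇二三) and simple multiplicative forms up to 千 only, with hypothetical names throughout.

```python
DIGITS = {"〇": 0, "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def parse_chinese_numeral(text: str) -> int:
    """Convert a simple Chinese numeral to an int; raise ValueError otherwise."""
    if not text:
        raise ValueError("empty numeral")
    # Positional style: every character is a digit, e.g. a year.
    if all(ch in DIGITS for ch in text):
        return int("".join(str(DIGITS[ch]) for ch in text))
    total = 0
    pending = 0  # digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit such as the 十 in 十五 counts as 1 x unit.
            total += (pending or 1) * UNITS[ch]
            pending = 0
        else:
            raise ValueError(f"not a supported numeral character: {ch!r}")
    return total + pending
```

Larger units (万, 亿), ordinals and mixed Arabic digits would need additional rules.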

Korean: clarify FKR172÷179

@hyoungl I need some help clarifying what's happening in https://github.com/lcnetdev/scriptshifter/blob/main/legacy/Functions_KoreanHancha.au3#L241-L256 (repeated for FKR172 to 179 with different characters - comments are mine):

   ;FKR172
   If StringInStr($Hangul,"列")>0 THEN
	  StringReplace($Hangul,"列","X")
	  $R_Initial_Count = @Extended
          ; ^^ This gets the number of occurrences of "列"
	  For $i=1 to $R_Initial_Count
          ; ^^ loops for as many occurrences are found
		 $R_Initial_Str = StringMid($Hangul,StringInStr($Hangul,"列",0,1)-1,1)
                 ; ^^ This looks like it is extracting one character before the current "列", right?
		 ClipPut($R_Initial_Str)
		 IdentifyCoda()
                 ; ^^ IdentifyCoda returns (code point of the character before "列" minus 44032) modulo 28
		 $CodaValue = ClipGet()
		 If $CodaValue = "0" OR $CodaValue = "4" OR StringToBinary($R_Initial_Str,4)<100 Then
                 ; ^^ What does StringToBinary do? I looked at the AutoIt doc and examples online but I'm not sure how its result can be compared with an integer (100). Would 100 be the code point value?
			$Hangul=StringReplace($Hangul,"列","열",1)
		 Else
			$Hangul=StringReplace($Hangul,"列","렬",1)
		 EndIf
                 ; These replacements ensure that the next loop looks for the next occurrence of "列"
	  Next
   EndIf

Thanks.
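For comparison, this is how I would sketch the same logic in Python, based on my reading above. The coda arithmetic follows the standard Unicode layout of Hangul syllables (first syllable at 44032, 28 final-consonant slots per block). The handling of a missing or non-Hangul preceding character is an assumption standing in for the StringToBinary test I don't understand, and the function names are hypothetical.

```python
from typing import Optional

HANGUL_BASE = 0xAC00   # 44032, first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable


def hangul_coda(ch: str) -> Optional[int]:
    """Return the final-consonant index (0 = none, 4 = ㄴ), or None if not Hangul."""
    cp = ord(ch)
    if not HANGUL_BASE <= cp <= HANGUL_LAST:
        return None
    return (cp - HANGUL_BASE) % 28


def resolve_ryeol(text: str) -> str:
    """Replace each 列 with 열 or 렬 depending on the preceding character."""
    out = []
    for i, ch in enumerate(text):
        if ch != "列":
            out.append(ch)
            continue
        coda = hangul_coda(text[i - 1]) if i > 0 else None
        # 열 after an open syllable (0) or a final ㄴ (4); ASSUMPTION: also
        # when there is no preceding Hangul syllable at all (None).
        out.append("열" if coda in (0, 4, None) else "렬")
    return "".join(out)


print(resolve_ryeol("나列"), resolve_ryeol("행列"))
```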

Korean: in-sentence capitalization

After capitalizing all first words of the test strings, many tests are failing because of a capitalization mismatch for non-first words.

A sample, with actual result first and expected result after:

- Minpŏp kwa Pŏphak ŭi chungyo Munje
?            ^                 ^
+ Minpŏp kwa pŏphak ŭi chungyo munje
?            ^                 ^
Original: 民法　과　法學　의　重要　問題
----------------------------------------------------------------------
- Kŭndae kyemonggi Munhak kwa tokcha ŭi palgyŏn
?                  ^
+ Kŭndae kyemonggi munhak kwa tokcha ŭi palgyŏn
?                  ^
Original: 근대 계몽기 문학 과 독자 의 발견
----------------------------------------------------------------------
- Kŭllobŏl sidae ŭi Kyŏngyŏnghak wŏllon
?                   ^
+ Kŭllobŏl sidae ŭi kyŏngyŏnghak wŏllon
?                   ^
Original: 글로벌 시대 의 경영학 원론
----------------------------------------------------------------------
- Kŭmganggyŏng toam Sŏnsa ŭi kŭmganggyŏng haesŏlsŏ
?                   ^
+ Kŭmganggyŏng toam sŏnsa ŭi kŭmganggyŏng haesŏlsŏ
?                   ^
Original: 금강경 도암 선사 의 금강경 해설서
----------------------------------------------------------------------
- Kŭmyung kaebang ŭi Kyŏngjejŏk hyogwa wa Kwaje
?                    ^                    ^
+ Kŭmyung kaebang ŭi kyŏngjejŏk hyogwa wa kwaje
?                    ^                    ^
Original: 금융 개방 의 경제적 효과 와 과제
----------------------------------------------------------------------
- Kŭmyung Hoesa ŭi p'asaeng sangp'um unyong e ttarŭn risŭk'ŭ kwalli
?         ^
+ Kŭmyung hoesa ŭi p'asaeng sangp'um unyong e ttarŭn risŭk'ŭ kwalli
?         ^
Original: 금융 회사 의 파생 상품 운용 에 따른 리스크 관리
----------------------------------------------------------------------
- Kidokkyo wa Sahoehak ŭi chŏpchŏm
?             ^
+ Kidokkyo wa sahoehak ŭi chŏpchŏm
?             ^
Original: 기독교 와 사회학 의 접점
----------------------------------------------------------------------
- Kidokkyo chŏnsŭng ŭi Sosŏljŏk Hyŏngsanghwa wa chakka Ŭisik
?                      ^        ^                      ^
+ Kidokkyo chŏnsŭng ŭi sosŏljŏk hyŏngsanghwa wa chakka ŭisik
?                      ^        ^                      ^
Original: 기독교 전승 의 소설적 형상화 와 작가 의식
----------------------------------------------------------------------
- Kiŏp kyŏngyŏng Kukche kyŏngyŏng pumun
?                ^
+ Kiŏp kyŏngyŏng kukche kyŏngyŏng pumun
?                ^
Original: 기업 경영 국제 경영 부문
----------------------------------------------------------------------
- Kiŏp kyŏngjaengnyŏk Kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng
?                     ^
+ Kiŏp kyŏngjaengnyŏk kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng
?                     ^
Original: 기업 경쟁력 강화 를 위한 내부 고발 과 윤리 경영
----------------------------------------------------------------------
- Kiŏp ŭi Nosa Munje hyŏnjang i tap ida
?         ^    ^
+ Kiŏp ŭi nosa munje hyŏnjang i tap ida
?         ^    ^
Original: 기업 의 노사 문제 현장 이 답 이다
----------------------------------------------------------------------
- Kiŏp i araya hal 100-kaji Sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
?                           ^
+ Kiŏp i araya hal 100-kaji sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
?                           ^
Original: 기업 이 알아야 할 100가지 소프트웨어 저작권 상담 사례
----------------------------------------------------------------------
- Kiŏppŏp kaesŏl Che 12-p'an
?                ^
+ Kiŏppŏp kaesŏl che 12-p'an
?                ^
Original: 기업법 개설 제 12판
----------------------------------------------------------------------
- Kich'o yŏn'gu t'uja ŭi Kyŏngjejŏk p'agŭp hyogwa punsŏk
?                        ^
+ Kich'o yŏn'gu t'uja ŭi kyŏngjejŏk p'agŭp hyogwa punsŏk
?                        ^
Original: 기초 연구 투자 의 경제적 파급 효과 분석
----------------------------------------------------------------------
- Kihu Pyŏnhwa wa chŏnyŏmpyŏng chilbyŏng pudam
?      ^
+ Kihu pyŏnhwa wa chŏnyŏmpyŏng chilbyŏng pudam
?      ^
Original: 기후 변화 와 전염병 질병 부담
----------------------------------------------------------------------
- Kihu Pyŏnhwa e taeŭng han chisok kanŭng han kukt'o kwalli chŏllyak
?      ^
+ Kihu pyŏnhwa e taeŭng han chisok kanŭng han kukt'o kwalli chŏllyak
?      ^
Original: 기후 변화 에 대응 한 지속 가능 한 국토 관리 전략
----------------------------------------------------------------------
- Kihu Pyŏnhwa e ttarŭn nongŏp pumun yŏnghyang punsŏk
?      ^
+ Kihu pyŏnhwa e ttarŭn nongŏp pumun yŏnghyang punsŏk
?      ^
Original: 기후 변화 에 따른 농업 부문 영향 분석
----------------------------------------------------------------------
- Kim chŏngil kongp'o rŭl ssoa ollida
?     ^
+ Kim Chŏng-il kongp'o rŭl ssoa ollida
?     ^    +
Original: 김 정일 공포 를 쏘아 올리다
----------------------------------------------------------------------
- Kim ch'angmo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ
?     ^
+ Kim Ch'ang-mo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ
?     ^     +
Original: 김 창모 의 대한 민국 선물 옵션 교과서
----------------------------------------------------------------------
- Kkŭnnaji anŭn Yŏksa ap esŏ
?               ^
+ Kkŭnnaji anŭn yŏksa ap esŏ
?               ^
Original: 끝나지 않은 역사 앞 에서
----------------------------------------------------------------------
- Na rŭl titko sesang ŭl hyanghae Ttwiŏ ollara
?                                 ^
+ Na rŭl titko sesang ŭl hyanghae ttwiŏ ollara
?                                 ^
Original: 나 를 딛고 세상 을 향해 뛰어 올라라
----------------------------------------------------------------------
- Nam-Pukhan kan pogŏn ŭiryu kyoryu hyŏmnyŏk ŭi hyoyulChŏk Suhaeng Ch'egye kuch'uk pangan yŏn'gu
?                                                     ^    ^       ^
+ Nam-Pukhan kan pogŏn ŭiryu kyoryu hyŏmnyŏk ŭi hyoyulchŏk suhaeng ch'egye kuch'uk pangan yŏn'gu
?                                                     ^    ^       ^
Original: 남북한 간 보건 의류 교류 협력 의 효율적 수행 체계 구축 방안 연구
----------------------------------------------------------------------
- Namsŏngbok K'ŭllaesik p'aet'ŏn
?            ^
+ Namsŏngbok k'ŭllaesik p'aet'ŏn
?            ^
Original: 남성복 클래식 패턴
----------------------------------------------------------------------
- Namwŏn Kosa wŏnjŏn pip'yŏng
?        ^
+ Namwŏn kosa wŏnjŏn pip'yŏng
?        ^
Original: 남원 고사 원전 비평
----------------------------------------------------------------------
- Nae ka sara on Han'guk hyŏndae Munhaksa
?                                ^
+ Nae ka sara on Han'guk hyŏndae munhaksa
?                                ^
Original: 내 가 살아 온 한국 현대 문학사
----------------------------------------------------------------------
- Nae maŭm sok ŭi Han'guk Munhak
?                         ^
+ Nae maŭm sok ŭi Han'guk munhak
?                         ^
Original: 내 마음 속 의 한국 문학
----------------------------------------------------------------------
- Noin changgi yoyang pojang Ch'egye ŭi hyŏnhwang kwa kaesŏn pangan
?                            ^
+ Noin changgi yoyang pojang ch'egye ŭi hyŏnhwang kwa kaesŏn pangan
?                            ^
Original: 노인 장기 요양 보장 체계 의 현황 과 개선 방안
----------------------------------------------------------------------

Clarification on Asian Cyrillic mappings

The various ALA-LC romanisation tables for Cyrillic-script languages have complex encoding issues, and it seems that the mappings in the file below diverge from the intent of the romanisation tables.

In the Asian Cyrillic mapping at

scriptshifter/scriptshifter/tables/data/asian_cyrillic.yml

Lines 334 to 336 in 4fca3e8

    
           # CONVERION OF "I/i" LIGATED TO "E/e", SOME WITH MACRON (0304) AND OGONEK (0328) 
        
           "\u0464": "I\uFE20E\uFE21\u0304" 
        
           "\u0468": "I\uFE20E\uFE21\u0328"

Are these the correct mappings?

Take “Ѩ” (U+0468), which the file maps to I\uFE20E\uFE21\u0328. This is an unnormalised form: \uFE21 and \u0328 have different canonical combining classes, so normalising to NFD reorders the diacritics canonically, yielding I\uFE20E\u0328\uFE21.

“Ѥ” (U+0464) is mapped in the file to I\uFE20E\uFE21\u0304. \uFE21 and \u0304 belong to the same combining class, so the diacritics interact typographically and the two orders are not canonically equivalent. Order matters, and the two sequences constitute different graphemes or perceivable characters.

So visually (with a properly designed font for these characters)

I\uFE20E\uFE21\u0304 would render as the letter I with a left half ligature tie directly above the I, followed by an E with a right half ligature tie directly above the E, and a macron above the right ligature tie centred on the right ligature tie and the E.

While I\uFE20E\u0304\uFE21 would render as the letter I with a left half ligature tie directly above the I, followed by an E macron with a right half ligature tie directly above the E macron.

The sequences are not normalised, so I am wondering what the intended sequence is?
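
The reordering behaviour described above can be checked with Python's standard `unicodedata` module. This is only a sketch for confirming the combining classes and the NFD results; the variable names are mine, and it says nothing about which sequence the tables intend.

```python
import unicodedata


def show(seq: str) -> str:
    """Return each code point with its canonical combining class."""
    return " ".join(f"U+{ord(c):04X}({unicodedata.combining(c)})" for c in seq)


ogonek = "I\uFE20E\uFE21\u0328"  # current mapping for U+0468
macron = "I\uFE20E\uFE21\u0304"  # current mapping for U+0464

# U+0328 (class 202) is sorted ahead of U+FE21 (class 230) under NFD.
ogonek_nfd = unicodedata.normalize("NFD", ogonek)

# U+0304 and U+FE21 share class 230, so NFD keeps whatever order it is given.
macron_nfd = unicodedata.normalize("NFD", macron)
macron_swapped_nfd = unicodedata.normalize("NFD", "I\uFE20E\u0304\uFE21")

print(show(ogonek_nfd))
print(show(macron_nfd))
print(macron_nfd == macron_swapped_nfd)
```

The last comparison prints `False`: normalisation does not unify the two macron orders, so the table has to pick one deliberately.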

This gets much more complicated with

scriptshifter/scriptshifter/tables/data/asian_cyrillic.yml

Line 412 in 4fca3e8

"\u04B4": "T\uFE20S\uFE21\u0307"

This sequence should render as the letter T with the left half ligature tie positioned directly above the T, followed by S with a right half ligature tie positioned above it, and a dot-above positioned above the right half ligature tie, i.e. centred above the S.

It is important to note that U+0307 belongs to S + U+FE21, and not to T + U+FE20 + S + U+FE21. I assume what you are trying to map here is the sequence equivalent to T + ◌͡ + CGJ + ◌̇ + S, i.e. U+0054 U+0361 U+034F U+0307 U+0053.

The current mapping using ligature ties is not equivalent to the double diacritic sequence U+0054 U+0361 U+0053 U+0307 or U+0054 U+0361 U+034F U+0307 U+0053.
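
The role of CGJ (U+034F) in the double-diacritic sequence can be illustrated the same way. A minimal sketch using only the standard `unicodedata` module; the variable names are illustrative. U+0361 has combining class 234 and U+0307 has class 230, so without the CGJ (class 0) NFD would move the dot in front of the double tie.

```python
import unicodedata

half_forms = "T\uFE20S\uFE21\u0307"  # current mapping for U+04B4
with_cgj = "T\u0361\u034F\u0307S"    # T + double tie + CGJ + dot above + S
without_cgj = "T\u0361\u0307S"       # same sequence with the CGJ dropped

# U+FE21 and U+0307 are both class 230, so the half-form order is stable.
half_forms_nfd = unicodedata.normalize("NFD", half_forms)

# CGJ has class 0 and therefore blocks canonical reordering across it.
with_cgj_nfd = unicodedata.normalize("NFD", with_cgj)

# Without CGJ, U+0307 (230) is sorted ahead of U+0361 (234).
without_cgj_nfd = unicodedata.normalize("NFD", without_cgj)

for label, value in [
    ("half forms", half_forms_nfd),
    ("with CGJ", with_cgj_nfd),
    ("without CGJ", without_cgj_nfd),
]:
    print(label, [f"U+{ord(c):04X}" for c in value])
```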

If we map half ligature tie forms to double spanning diacritic equivalents, we get the following:

Half forms	Double forms
T + U+FE20 + S + U+0307 + U+FE21	T + U+0361 + S + U+0307
T + U+FE20 + S + U+FE21 + U+0307	Undefined
Undefined	T + U+0361 + U+034F + U+0307 + S

compare with:

Half forms	Double forms
I + U+FE20 + E + U+FE21 + U+0328	I + U+0361 + E + U+0328
I + U+FE20 + E + U+0328 + U+FE21	I + U+0361 + E + U+0328

Clarification of intent and usage would be useful.

I do understand that bibliographic data can be dirty, and this is reflected in the Latin to Cyrillic mappings, but I am curious about the Cyrillic to Latin mappings mentioned above.

Belarusian

I see и here not being Romanized. During the meeting I said that Belarusian и/И gets Romanized when typed (as I remembered it), which turns out to be incorrect. It does not Romanize.

Ѣ however, does not Romanize as Cammeron had reported.

Example 1

Example 2

Korean: breve normalization

In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L295C1-L301C1 :

	  If $OCLC="No" Then
		 Local $MARC8[4][2] = [["ŏ","ŏ"],["ŭ","ŭ"],["Ŏ","Ŏ"],["Ŭ","Ŭ"]]
		 For $i = 0 To Ubound($MARC8, 1) - 1
			$KorNameRom = StringRegExpReplace($KorNameRom, "\Q" & $MARC8[$i][0] & "\E",$MARC8[$i][1])
		 Next
	  EndIf

This is replacing the vowel + combining breve pair with a single vowel with breve. I remember discussing this as a general normalization step, and I verified this is running in my code.

However, the expected results in the test strings use the two-character form (vowel + combining breve), e.g. https://github.com/lcnetdev/scriptshifter/blob/korean/tests/data/sample_strings.csv#L791:

허상 과 실상 : 한국 정치 의 성숙 을 갈망 하며,Hŏsang kwa silsang: Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ

@hyoungl Were the test strings written with $OCLC="yes" in mind? If so, shall I replace all the vowel + combining breve pairs in the test strings with their one-character versions?
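
For reference, one way to make the tests independent of which breve form is used is to normalise both sides before comparing. This is a sketch only: `same_romanization` is a hypothetical helper, not something in the ScriptShifter code base, and it relies on NFC composing o/u + U+0306 into the precomposed letters.

```python
import unicodedata


def same_romanization(actual: str, expected: str) -> bool:
    """Compare two strings after composing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", actual) == unicodedata.normalize(
        "NFC", expected
    )


decomposed = "Ho\u0306sang"   # o + U+0306 COMBINING BREVE
precomposed = "H\u014Fsang"   # U+014F LATIN SMALL LETTER O WITH BREVE

print(decomposed == precomposed)                   # raw comparison
print(same_romanization(decomposed, precomposed))  # comparison after NFC
```

The raw comparison is `False` and the normalised one is `True`, so with this approach the test data would not need to be rewritten either way.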

Armenian & Georgian

Didn't transliterate: the punctuation mark ՝ ("Armenian comma").

Armenian "dz"
The Armenian character ձ does not romanize, i.e., the text Արձակ էջեր becomes Arձak ējer instead of Ardzak ējer

Armenian "ev" and "p"

Armenian յ

There's a potential issue with the Armenian character յ. The character is transliterated differently depending on whether it is in modern or classical Armenian. In modern Armenian, it is transliterated as y. According to the transliteration table, the character is transliterated as ḥ "only when the letter is in initial position of a word or of a stem in a compound, in Classical orthography." Scriptshifter only transliterates յ as y.

The "Armenian comma" or "book" should transliterate to a comma, but it doesn't transliterate at all. See the punctuation mark after "nvirvum em" in the attached screenshot.

Georgian ფ and ქ

The character ფ should be romanized with an ayn and not a hacek (pʻ not p̌). The character ქ should be romanized with an ayn and not a hacek (kʻ not ǩ).

Russian quotation marks

Seems to work. (However, we usually use “regular” American quotation marks in place of vernacular ones when transcribing in Romanization.)

Macedonian

The Macedonian LC Romanization table was updated a year or two ago, and now two characters, Ѕ/ѕ and Џ/џ, use combining ligatures; the latter is now distinct from Serbian. (The Transliterator application does not reflect this update, so we have to keep an eye out for these characters and update them ourselves afterwards.) ScriptShifter is transliterating as would be appropriate for Serbian, but not for Macedonian.

Korean: Capitalization of test strings

In some of the test strings, the romanization starts with a capital letter. In others it doesn't.

If I run the tests without capitalizing the first word, most tests fail because the first word wouldn't be normally capitalized. If I capitalize the first word of all strings, the ones that don't start with a capital letter will fail.

I am not sure if the tests are failing with capitalization off because of an error in my code that is not capitalizing a word that should always be so, or because some test strings have been capitalized on the first word.

Examples of test strings not starting with a capital:

기술 경영 의 이해 -> kisul kyŏngyŏng ŭi ihae
기업 경영 국제 경영 부문 -> kiŏp kyŏngyŏng kukche kyŏngyŏng pumun
기업 경쟁력 강화 를 위한 내부 고발 과 윤리 경영 -> kiŏp kyŏngjaengnyŏk kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng

Strings starting with a capital:

기업 이 알아야 할 100가지 소프트웨어 저작권 상담 사례 -> Kiŏp i araya hal 100-kaji sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
기업법 개설 제 12판 -> Kiŏppŏp kaesŏl che 12-p'an
기초 연구 투자 의 경제적 파급 효과 분석 -> Kich'o yŏn'gu t'uja ŭi kyŏngjejŏk p'agŭp hyogwa punsŏk

@hyoungl is there a fixed rule here?
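
Until the rule is settled, a triage step could separate failures that differ only in the case of the first letter from real mismatches. A minimal sketch; `classify` is a hypothetical helper and not part of the existing test suite.

```python
def classify(actual: str, expected: str) -> str:
    """Return 'match', 'initial-case-only', or 'mismatch' (hypothetical helper)."""
    if actual == expected:
        return "match"
    # Same text apart from the case of the very first character.
    if actual[:1].lower() == expected[:1].lower() and actual[1:] == expected[1:]:
        return "initial-case-only"
    return "mismatch"


print(classify("Kisul ihae", "kisul ihae"))
print(classify("Kihu Pyonhwa", "Kihu pyonhwa"))
```

The first call reports `initial-case-only` and the second `mismatch`, so capitalization errors inside the string still surface as genuine failures.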
