lcnetdev / scriptshifter
License: Creative Commons Zero v1.0 Universal
Suggested additions: Cyrillic scripts like Abkhaz, Tatar, etc.
If StringRegExp($NClipB," 제[0-9]") Then
$NClipB = StringReplace($NClipB," 제"," 제 ")
EndIf
Is the intent of this code to replace every instance of 제 followed by a digit with 제 followed by a space? In that case it seems to me that this may not yield the expected results, because as long as there is at least one occurrence of 제 followed by a digit, a space will be added after every 제 that follows a space in the text, regardless of what comes next.
In Python I'm proposing a regex substitution:
data = re.sub(" 제([0-9])", " 제 \\1", data)
@hyoungl let me know if I am indeed interpreting this correctly.
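To make the difference concrete, here is a minimal Python sketch comparing the two behaviours. The function names and the sample string are made up for illustration; the targeted variant keeps the leading space in the replacement, as the AutoIt replacement string does.

```python
import re


def add_space_blanket(text):
    """Mirror the AutoIt snippet: one match anywhere triggers a global replace."""
    if re.search(" 제[0-9]", text):
        text = text.replace(" 제", " 제 ")
    return text


def add_space_targeted(text):
    """Only touch occurrences of ' 제' that are directly followed by a digit."""
    return re.sub(r" 제([0-9])", r" 제 \1", text)


# Hypothetical sample: the second ' 제' is not followed by a digit.
sample = "기업법 개설 제12판 제목"
blanket = add_space_blanket(sample)
targeted = add_space_targeted(sample)
```

With this sample the blanket version also splits 제목, while the targeted version leaves it alone.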
If $ConvertR2L = "On" AND $IsL_initial > 0 Then
$ConvertR2L is defined in L17 and looks like a user settings flag.
@hyoungl does ScriptShifter need to distinguish between the two settings (it would be a major framework change)?
@hyoungl Attached here is a list of the last failing Korean tests from the test strings you provided. I thought it would be more practical for you to review the log as a whole and comment on the individual issues, as many of them seem related to personal names.
korean_tests.log
Thanks.
Generally following ALA/LC table, but some rarely used Chinese characters are not Romanized
An extra space is always shown between a romanized character and punctuation. Most punctuation is romanized correctly (except that the white corner brackets 『 』 are not romanized)
All "ü" (the only diacritic in ALA/LC Chinese Romanization Table) are showing incorrectly as "u"
ფ (U+10E4) should be pʻ [the diacritic is (U+02BB)], it is currently p̌
ქ (U+10E5) should be kʻ [the diacritic is (U+02BB)], it is currently ǩ
In #42 I noticed a few failures like this one:
- Chungso kiŏp ŭi chŏllyakchŏk sŏngkwa kwalli(BSC) ironp'yŏn
? ^^
+ Chungsogiŏp ŭi chŏllyakchŏk sŏngkwa kwalli (BSC) ironp'yŏn
? ^ +
Original: 중소 기업 의 전략적 성과 관리(BSC) 이론편
(In order: test result, expected result, original string; ignore the mismatch in the first part for now)
I imagine that whether the original was a typo or a legit string for Korean language, the Romanized string should have consistent spacing for parentheses and other punctuation.
@hyoungl Is there a generic cataloging practice for this? I already added some normalization rules for other punctuation signs (;, :, etc.), but it would be good to have a general normalization table for all punctuation signs of this kind on all Romanized strings if guidelines exist in that direction.
Add Thai support.
Some languages, like Greek (classical), are reported in the /languages/ endpoint but don't have a table in the directory, and return an error message if used.
당신 이 살아남기 위해 알아야 할 사장 의 비밀 is romanized as Tangsin i saranamgi wihae araya hal sajang ŭi pimil but the test string expects Tangsin i saranamki wihae araya hal sajang ŭi pimil.
There is a conversion from ["f16~i0#","m~g"] in FKR103.
Is this correct? Or am I missing a step where g should become k?
If StringLen($TargetKor) > 7 OR StringLen($TargetKor) = 1 OR StringInStr($TargetKorOrig," ",0,1)>3 Then
If $ForeignNameConversion = "Yes" Then
ClipPut($TargetKorOrig)
KorCorpNameRomOCLC()
Else
TrayTip("Error!",@LF & $TargetKorOrig & @LF & "may not be a Korean name",10)
ClipPut("Error!")
EndIf
This seems to throw an error on source names longer than 7 characters, or only 1 character long, or if there is a space after the third character; unless $ForeignNameConversion is set to "Yes", in which case KorCorpNameRomOCLC() is run.
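A minimal Python sketch of that condition as read above. The function names and return labels are hypothetical; only the three tests and the flag come from the AutoIt snippet. StringInStr returns a 1-based position (0 when nothing is found), so "> 3" means the first space sits at position 4 or later.

```python
def fails_name_check(name):
    """True when the AutoIt condition would reject the string as a Korean name."""
    first_space = name.find(" ") + 1  # 1-based like StringInStr; 0 if no space
    return len(name) > 7 or len(name) == 1 or first_space > 3


def route_name(name, foreign_name_conversion=False):
    """Return a label for the branch taken (labels are illustrative only)."""
    if not fails_name_check(name):
        return "korean_name"
    if foreign_name_conversion:
        return "corporate_name"  # stands in for KorCorpNameRomOCLC()
    return "error"


result_ok = route_name("김 정일")
result_err = route_name("독고영재 씨")
result_corp = route_name("독고영재 씨", foreign_name_conversion=True)
```

What happens when the condition is false is not shown in the quoted snippet, so the "korean_name" branch is an assumption.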
2 questions for @hyoungl:
KorCorpNameRomOCLC() never throws an error. Does that mean that names longer than 7 characters are always acceptable if foreign conversion is on?
In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/FKR_index.csv#L112 the Korean description of FKR111 is not translated. @hyoungl can you please translate it?
In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1150 various punctuation signs surrounded by spaces are replaced by a code, e.g. ' : ' is replaced by ' SB16KQ ' (FKR050).
Then, in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1257 this is being reverted after romanization, e.g. ' SB16KQ ' becomes ' : ' again (FKR066).
This for me results in a string such as Hŏsang kwa silsang : Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ instead of Hŏsang kwa silsang: Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ (note the extra space before the colon).
I didn't find a place in your code where space is removed before those punctuation signs. Can you point it out?
Or can I simply replace ' SB16KQ ' with ': ' (and similar punctuation cases that don't have a leading space in Roman) in FKR066?
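The round trip and the proposed change can be sketched in Python as follows. Only the ' : ' / ' SB16KQ ' pair comes from the text above; the table name, the function names and the drop_leading_space switch are illustrative assumptions, and the romanization step between the two calls is left out.

```python
# Punctuation (with surrounding spaces) -> placeholder code.
# Only this pair is taken from the issue; other pairs would be added here.
PLACEHOLDERS = {" : ": " SB16KQ "}


def protect_punctuation(text):
    """FKR050-like step: hide spaced punctuation behind a code."""
    for punct, code in PLACEHOLDERS.items():
        text = text.replace(punct, code)
    return text


def restore_punctuation(text, drop_leading_space=True):
    """FKR066-like step: bring punctuation back, optionally without the leading space."""
    for punct, code in PLACEHOLDERS.items():
        target = punct.lstrip() if drop_leading_space else punct
        text = text.replace(code, target)
    return text


protected = protect_punctuation("Hŏsang kwa silsang : Han'guk")
tight = restore_punctuation(protected)
loose = restore_punctuation(protected, drop_leading_space=False)
```

Here tight corresponds to the expected test string and loose to the current result.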
Add a mechanism for users to provide feedback with the goal of improving transliterations.
This functionality should be available both as an API endpoint and a form add-on to the HTML UI. The latter should appear after a transliteration attempt and should present the user with a partly pre-filled form.
The function should accept the following inputs:
Add support for Burmese: https://www.loc.gov/catdir/cpso/romanization/burmese.pdf
It doesn’t appear that the apostrophe (which functions, I believe, like the hard sign in Russian) is differentiated from the soft sign in ScriptShifter transliteration
I have tested this language over and over again and also managed to get screenshots of the books to make sure that nothing is missed. I really like ScriptShifter and was hoping that it would work well for Cyrillic Azeri, since Latin Azeri was not part of it. But, as I mentioned earlier, unfortunately it doesn't work for certain letters. I will try to insert the examples.
This is the title of the book (LCCN 00655035) in the original Azeri Cyrillic (I have a screenshot of the cover and title page):
Азəрбајҹан Совет əдəбијјатынын поетика мəсəлəлəри : ǂb елми əсəрлəрин тематик мəҹмyеси
This is how ScriptShifter romanized it:
Azərbai̐jan Sovet ədəbиi̐i̐atыnыn poetиka məsələlərи : ǂb elmи əsərlərиn tematиk məjmyesи
This is how it should have been romanized based on the ALA-LC romanization table:
Azărbai̐jan Sovet ădăbii̐i̐atynyn poetika măsălălări : ǂb elmi ăsărlărin tematik măjmu̇esi
I will highlight which letters were correctly romanized and which failed.
This is how this letter looks in the table (vernacular, then romanization): ә ă. While in some other instances this letter was romanized fine, in this particular case, as you can see, it stayed as vernacular during the romanization process.
These two letters, ј and ҹ, were romanized correctly, both in the first word of the title and elsewhere, based on the table: ј i̐, ҹ j.
The next incorrectly romanized letter is a basic и, which should be represented by i: и i.
The next incorrect letter is ы, which should have been romanized as y: ы y.
The last incorrect letter, contained in the last word "мəҹмyеси", is the letter "ү", which should have been romanized as u̇: ү u̇.
I want to repeat that I have tested several titles. This is a second example:
Азәрбајҹан XXI әсрин астанасында" Республика елми-практики конфрансынын : материаллары
Azărbai̐jan XXI ăsrиn astanasыnda" Respublиka elmи-praktиkи konfransыnыn : materиallarы
The set of letters here is more limited, but the inaccuracies in the romanization are almost the same.
What surprised me here was the fact that ә in the first word "Азәрбајҹан" was romanized correctly, which was not the case in the previous example. But the problem with the letters и and ы stayed the same.
도란 도란 들려주는 말 이야기 transliterates into Toran toran tŭllyŏjunŭn Mal iyagi but the test string expects Toran toran tŭllyŏ chunŭn mal iyagi.
I see a vocalization rule in FKR103 that inserts the j but nothing else changes the string from tŭllyŏjunŭn.
@hyoungl I am not clear on what happens with the glottalization process in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1192-L1205 .
There is a substitution of the ^ character (added earlier) with the string GLOTTAL. Then, a few lines after, GLOTTAL is replaced with the empty string. I can't see any change happening in KorRom, which separates it from the Korean characters and returns it unchanged. In practice, the ^ character is simply removed with no further action.
In the case of 결단력, I get to 결 GLOTTAL 단+력 which eventually becomes Kyŏli.
Am I missing something? Is this how it's supposed to work?
근대 와 식민 의 서곡 is romanized to Kŭndae wa sikmin ŭi sŏgok, but Kŭndae wa singmin ŭi sŏgok is expected.
Specifically, 식민 becomes sikmin instead of singmin.
Debugging the script at the 식민 word, the code point conversion yields i9#m20#f1~i6#m20#f4E.
FKR073 through FKR100 don't change anything in this string.
This translates into sikmin as per FKR109.
No other modification is made after that point.
I verified the mappings I transcribed from K-Romanizer are identical to the original.
@hyoungl Can you tell what is going wrong?
For $i = 0 To Ubound($Rule1)-1
If StringLeft(StringStripWS($TargetKorOrig,8),1)=$Rule1[$i] Then
$IsNonKor=$IsNonKor+1
EndIf
Next
etc.
And further down https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1773:
If $Len > 1 and $IsNonKor=0 and $IsParticle=0 Then
This is the only place where $IsNonKor is checked, and it's only checked if it equals 0. @hyoungl can I break the loops in LL1741-1750 as soon as $IsNonKor > 0 to avoid redundant processing?
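Since only "zero or non-zero" matters downstream, the counting loops could collapse into a short-circuiting boolean check. A minimal Python sketch, assuming each rule entry is a single character; the rule lists shown are placeholders, not the real $Rule1 contents.

```python
def is_non_korean(text, *rule_lists):
    """True as soon as the first non-whitespace character matches any rule entry."""
    # StringStripWS(..., 8) removes all whitespace; StringLeft(..., 1) takes the first char.
    first = "".join(text.split())[:1]
    # any() stops at the first match, which is the early break asked about above.
    return any(first == entry for rules in rule_lists for entry in rules)


# Placeholder rule lists for illustration only.
rule1 = ["A", "B"]
rule2 = ["1", "2"]

flag_a = is_non_korean("  Apple 김", rule1, rule2)
flag_b = is_non_korean("김 정일", rule1, rule2)
```

Note that the AutoIt = operator compares strings case-insensitively, which this sketch does not reproduce.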
Submitting the test string to the /trans/ endpoint "曉城 趙 明基 博士 追慕 佛教 史學 論文集" I get an error:
scriptshifter | INFO:scriptshifter.trans:Transliteration is from korean to Latin.
scriptshifter | ERROR:scriptshifter.rest_api:Exception on /trans/korean [POST]
scriptshifter | Traceback (most recent call last):
scriptshifter | File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
scriptshifter | response = self.full_dispatch_request()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1822, in full_dispatch_request
scriptshifter | rv = self.handle_user_exception(e)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1820, in full_dispatch_request
scriptshifter | rv = self.dispatch_request()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1796, in dispatch_request
scriptshifter | return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
scriptshifter | File "/usr/local/scriptshifter/src/./scriptshifter/rest_api.py", line 81, in transliterate_req
scriptshifter | out = transliterate(in_txt, lang, r2s, capitalize)
scriptshifter | File "/usr/local/scriptshifter/src/./scriptshifter/trans.py", line 59, in transliterate
scriptshifter | cfg = load_table(lang)
scriptshifter | File "/usr/local/scriptshifter/src/./scriptshifter/tables/__init__.py", line 117, in load_table
scriptshifter | tdata = load(fh, Loader=Loader)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/__init__.py", line 81, in load
scriptshifter | return loader.get_single_data()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 49, in get_single_data
scriptshifter | node = self.get_single_node()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 36, in get_single_node
scriptshifter | document = self.compose_document()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 55, in compose_document
scriptshifter | node = self.compose_node(None, None)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter | node = self.compose_mapping_node(anchor)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter | item_value = self.compose_node(node, item_key)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter | node = self.compose_mapping_node(anchor)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter | item_value = self.compose_node(node, item_key)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 82, in compose_node
scriptshifter | node = self.compose_sequence_node(anchor)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 111, in compose_sequence_node
scriptshifter | node.value.append(self.compose_node(node, index))
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 84, in compose_node
scriptshifter | node = self.compose_mapping_node(anchor)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 133, in compose_mapping_node
scriptshifter | item_value = self.compose_node(node, item_key)
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/composer.py", line 64, in compose_node
scriptshifter | if self.check_event(AliasEvent):
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/parser.py", line 98, in check_event
scriptshifter | self.current_event = self.state()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/parser.py", line 449, in parse_block_mapping_value
scriptshifter | if not self.check_token(KeyToken, ValueToken, BlockEndToken):
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 116, in check_token
scriptshifter | self.fetch_more_tokens()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 251, in fetch_more_tokens
scriptshifter | return self.fetch_double()
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 655, in fetch_double
scriptshifter | self.fetch_flow_scalar(style='"')
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 666, in fetch_flow_scalar
scriptshifter | self.tokens.append(self.scan_flow_scalar(style))
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 1152, in scan_flow_scalar
scriptshifter | chunks.extend(self.scan_flow_scalar_non_spaces(double, start_mark))
scriptshifter | File "/usr/local/lib/python3.9/site-packages/yaml/scanner.py", line 1213, in scan_flow_scalar_non_spaces
scriptshifter | raise ScannerError("while scanning a double-quoted scalar", start_mark,
scriptshifter | yaml.scanner.ScannerError: while scanning a double-quoted scalar
scriptshifter | in "/usr/local/scriptshifter/src/scriptshifter/tables/data/korean.yml", line 17104, column 29
scriptshifter | expected escape sequence of 4 hexadecimal numbers, but found 'n'
scriptshifter | in "/usr/local/scriptshifter/src/scriptshifter/tables/data/korean.yml", line 17104, column 43
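The scanner error says that a \u escape in a double-quoted scalar was not followed by four hexadecimal digits. Below is a stdlib-only Python sketch for locating such escapes in a table file before loading it. The function name and regex are illustrative, not part of ScriptShifter, and the scan does not check whether the escape really sits inside a double-quoted scalar, so false positives are possible.

```python
import re

# A backslash-u that is not itself escaped and is not followed by 4 hex digits.
BAD_U_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*\\u(?![0-9A-Fa-f]{4})")


def find_bad_unicode_escapes(text):
    """Return (line, column) pairs, both 1-based, pointing at the offending backslash."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BAD_U_ESCAPE.finditer(line):
            hits.append((lineno, match.end() - 1))
    return hits


# Line 1 is valid, line 2 is truncated, line 3 has an escaped backslash (valid).
sample = 'a: "\\u02BB"\nb: "\\u02n"\nc: "\\\\u02"'
hits = find_bad_unicode_escapes(sample)
```

To scan the real table, the file content would be read as text and passed to find_bad_unicode_escapes.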
Even though recognizing Chinese personal names is a very complex task based on context which may require machine learning, some simple rules can help catch at least some common cases, e.g. prefixes such as "Professor", "Teacher", "Mr.", "Ms.", etc.
@thisismattmiller @kefo During the work in the korean branch it became necessary to change the output of the /trans API endpoint. Currently the output is a plain text string of the result; but K-Romanizer also displays important warnings on the Windows UI that need to be reproduced in the SS webapp, so I changed the output to a JSON object with the following structure:
{
"output": <string>,
"warnings": <list of strings>
}
I imagine that this would break Marva, probably by dumping the JSON object as plain text in the wrong place. Can we coordinate the deployment of the Korean module with updates to Marva?
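During the transition, a client could tolerate both response shapes. A minimal Python sketch, assuming the body arrives as a string; the function name is hypothetical and only the "output" and "warnings" keys come from the structure above.

```python
import json


def parse_trans_response(body):
    """Return (output, warnings) for either a plain-text or a JSON response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Old behaviour: the body is the transliterated string itself.
        return body, []
    if isinstance(data, dict) and "output" in data:
        return data["output"], list(data.get("warnings") or [])
    # Valid JSON but not the new structure (e.g. a bare number): treat as text.
    return body, []


new_style = parse_trans_response('{"output": "Kim Chŏng-il", "warnings": ["check name"]}')
old_style = parse_trans_response("Kim Chŏng-il")
```

A plain-text result that happens to be valid JSON with an "output" key would be misread, so checking the response Content-Type header, if the API sets it, would be more robust.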
The work-in-progress branch is named korean. I'm getting close to deploying at least the part that handles strings without personal names.
Also, I noticed that @thisismattmiller added some very useful JS in the index.html page that fills a results div with the output: https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/templates/index.html#L69-L115
Would you mind making a small improvement? It would be about adding another div with an ID of "warnings" or similar (milligram, the CSS framework I used, should have some classes to highlight it), hidden by default, that gets displayed and filled with the warnings (maybe one <p> or <li> per list item), and changing the content of the results div to only the value of "output". Warnings would be empty most of the time, so the warnings div should be hidden if they are empty.
I double-checked that Pulaar (Adlam) is present in the index.yml file, but in the Marva Editor it doesn't appear as an option in the Romanization manual select pulldown menu. Is there another connection that has to be made?
If StringInStr($TargetKorOrig," ",0,3)>0 Then
TrayTip("Error!",@LF & $TargetKorOrig & @LF & "may not be a Korean name (too many spaces)",10)
This condition seems to never be evaluated because it tests for 3 or more spaces, but on L459 it's tested for 2 or more spaces. @hyoungl is that correct? Can this condition be removed?
In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1244 several popular name conversions are made. Some names used in test strings seem to be missing.
The failing tests are:
김 정일 공포 를 쏘아 올리다 -> Kim chŏngil kongp'o rŭl ssoa ollida (expected: Kim Chŏng-il kongp'o rŭl ssoa ollida)
김 창모 의 대한 민국 선물 옵션 교과서 -> Kim ch'angmo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ (expected: Kim Ch'ang-mo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ)
다석 류 영모 - 우리 말 과 우리 글 로 철학한 큰 사상가 -> Tasŏk ryu yŏngmo - uri Mal kwa uri kŭl ro ch'ŏrhakhan k'ŭn sasangga (expected: Tasŏk Yu Yŏng-mo - uri mal kwa uri kŭl ro ch'ŏrhakhan k'ŭn sasangga; I am not sure if this is a personal name)
Do these names need to be added to the list of special names?
The character ფ (both Pʻ and pʻ) becomes p2BB in transliteration.
The character ქ (both Kʻ and kʻ) becomes k2BB in transliteration.
일본 중심적 transliterates into chungsimjŏk but chungsimchŏk is expected.
The code sequence is i12#m13#f21~i9#m20#f16~i12#m4#f1E
There is a substitution for ["f16~i12#","m~j"] in FKR103.
This seems like the same problem as #36.
Regarding FKR103, I can see that in https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1673 a series of variables are checked, but as far as I can see, they are all set to 0 only once and never changed. Is FKR103 supposed to run on some conditions?
In https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L1849 a ConvertR2L option is checked.
As for other options I encountered, I'm hoping that ScriptShifter could offer only one way of handling romanized strings. Which cases is this option relevant for, and can it be set to a fixed value?
If WinActive("Voyager Cataloging") Then
$NoOCLCBreve = "On"
Else
$NoOCLCBreve = "Off"
EndIf
@hyoungl How would this condition occur in ScriptShifter? Would it be always true, always false, or depend on some external factor?
한국 에서의 다문화 주의 - 현실 과 쟁점 is romanized as Han'guk esŏŭi tamunhwa chuŭi - hyŏnsil kwa chaengchŏm but Han'guk esŏŭi tamunhwa chuŭi-hyŏnsil kwa chaengchŏm is expected.
Are there supposed to be no spaces between the hyphen and the two words, even though there are in the original script?
Complete Cyrillic tables for non-Slavic languages.
@RandyBarry can you add a list of the languages involved here?
Bulgarian letters "Ъ" and "ь" do not Romanize, which makes the text unreadable. Amazingly, the following letters were Romanized perfectly: Ѣ ѫ ѧ ю я
Bulgarian Er-malak (small yer) with ALA-LC sign " ʹ " (soft sign) is missing after ScriptShifter romanization
Romanized Ŭ for Bulgarian Ъ in the beginning and middle of words does not appear when using ScriptShifter. The same is true for the romanized ʺ (hard sign) for Bulgarian Ъ at the end of words.
Bulgarian "Ъ, ъ", known as Er-golyam (large yer), phonetic transcription “ă”, ALA-LC Romanization “ŭ” or ʺ - not present after ScriptShifter romanization
Test string at https://github.com/lcnetdev/scriptshifter/blob/main/tests/data/script_samples/cyrillic.csv#L13 is indicated as serbian_macedonian; however, the two languages have been split up and serbian_macedonian no longer exists, so that sentence must be tested with either one of those two scripts. @RandyBarry which script is that a better fit for?
In the original K-Romanizer, a common logic of FKR 171 to 179, is to replace a Hangul character with a Roman X, e.g.
If StringInStr($Hangul,"不")>0 THEN
StringReplace($Hangul,"不","X")
I don't see this X being replaced anywhere else in Functions_KoreanHancha.au3 nor in Functions_KoreanRomanizer.au3. Is this intentional?
Current list of failing names (from Y. Lee's list):
korean_names_tests.log
Add support for Khmer: https://www.loc.gov/catdir/cpso/romanization/khmer.pdf
Add five Armenian lowercase-only ligature letters and a diacritic:
ﬓ (U+FB13) should become mn
ﬔ (U+FB14) should become me
ﬕ (U+FB15) should become mi
ﬖ (U+FB16) should become vn
ﬗ (U+FB17) should become mkh
եվ (U+0565) (U+057E) should become eʹv [the diacritic is (U+02B9)]
Եվ (U+0535) (U+057E) should become Eʹv [the diacritic is (U+02B9)]
The (Hancha?) conversion map at https://github.com/lcnetdev/scriptshifter/blob/korean/scriptshifter/hooks/korean/Functions_KoreanRomanizer.au3#L809 has some duplicates:
['肖','소'] followed by ['肖','초']
['葉','섭'] followed by ['葉','엽']
[' 백골단 ',' 백^골^단 '] followed by [' 백골단 ',' 백골^단 ']
@hyoungl how is the substitution working here? It seems to me that the second duplicates will never be used.
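If the map is applied as a sequence of plain string replacements in list order (an assumption about how the AutoIt loop works), the first pair consumes every occurrence and the second can never match. A minimal Python sketch, with hypothetical function names, that shows the effect and lists the shadowed entries:

```python
def apply_pairs(text, pairs):
    """Apply (source, target) replacements one after another, in list order."""
    for src, dst in pairs:
        text = text.replace(src, dst)
    return text


def find_shadowed(pairs):
    """Return (source, first_target, ignored_target) for each repeated source."""
    seen = {}
    shadowed = []
    for src, dst in pairs:
        if src in seen:
            shadowed.append((src, seen[src], dst))
        else:
            seen[src] = dst
    return shadowed


pairs = [("肖", "소"), ("肖", "초"), ("葉", "섭"), ("葉", "엽")]
converted = apply_pairs("肖葉", pairs)
shadowed = find_shadowed(pairs)
```

Running find_shadowed over the full map would give a complete list of entries to review.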
Implement a parser for Chinese numerals similar to the Parallelogram logic: https://github.com/pulibrary/parallelogram/blob/8be9d46ca6b9b5f85e255f9823f82c5b5b0ada27/cloudapp/src/app/pinyin.service.ts#L160
Refer to email exchange with Tom Ventimiglia for details.
@hyoungl I need some help clarifying what's happening in https://github.com/lcnetdev/scriptshifter/blob/main/legacy/Functions_KoreanHancha.au3#L241-L256 (repeated for FKR172 to 179 with different characters - comments are mine):
;FKR172
If StringInStr($Hangul,"列")>0 THEN
StringReplace($Hangul,"列","X")
$R_Initial_Count = @Extended
; ^^ This gets the number of occurrences of "列"
For $i=1 to $R_Initial_Count
; ^^ loops for as many occurrences are found
$R_Initial_Str = StringMid($Hangul,StringInStr($Hangul,"列",0,1)-1,1)
; ^^ This looks like it is extracting one character before the current "列", right?
ClipPut($R_Initial_Str)
IdentifyCoda()
; ^^ IdentifyCoda returns (code point of the character before "列" minus 44032) modulo 28
$CodaValue = ClipGet()
If $CodaValue = "0" OR $CodaValue = "4" OR StringToBinary($R_Initial_Str,4)<100 Then
; ^^ What does StringToBinary do? I looked at the AutoIt doc and examples online but I'm not sure how that can be compared with an integer (100). Would 100 be the code point value?
$Hangul=StringReplace($Hangul,"列","열",1)
Else
$Hangul=StringReplace($Hangul,"列","렬",1)
EndIf
; These replacements ensure that the next loop looks for the next occurrence of "列"
Next
EndIf
Thanks.
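For reference, a minimal Python sketch of how the FKR172 loop reads, under stated assumptions. 44032 (U+AC00) is the first precomposed Hangul syllable, so the modulo-28 value is the index of the final consonant: 0 means no coda and 4 means ㄴ. The StringToBinary test is left out because its meaning is the open question; treating "no preceding Hangul syllable" like the open-syllable case is an assumption of this sketch, and the function names are hypothetical.

```python
HANGUL_FIRST = 0xAC00  # 44032
HANGUL_LAST = 0xD7A3


def coda_index(ch):
    """Return the coda index (0-27) of a precomposed Hangul syllable, else None."""
    if len(ch) == 1 and HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return (ord(ch) - HANGUL_FIRST) % 28
    return None


def resolve_initial_r(text, hanja="列", plain="열", with_r="렬"):
    """Replace each hanja by plain or with_r depending on the preceding syllable."""
    out = []
    for ch in text:
        if ch != hanja:
            out.append(ch)
            continue
        prev = out[-1] if out else ""
        # No coda (0), ㄴ coda (4), or no preceding Hangul syllable (assumed) -> plain.
        out.append(plain if coda_index(prev) in (None, 0, 4) else with_r)
    return "".join(out)


after_vowel = resolve_initial_r("나列")
after_n = resolve_initial_r("산列")
after_l = resolve_initial_r("결列")
```

The other blocks (FKR173 to 179) would presumably differ only in the three character arguments.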
After capitalizing all first words of the test strings, many tests are failing because of a capitalization mismatch for non-first words.
A sample, with actual result first and expected result after:
- Minpŏp kwa Pŏphak ŭi chungyo Munje
? ^ ^
+ Minpŏp kwa pŏphak ŭi chungyo munje
? ^ ^
Original: 民法 과 法學 의 重要 問題
----------------------------------------------------------------------
- Kŭndae kyemonggi Munhak kwa tokcha ŭi palgyŏn
? ^
+ Kŭndae kyemonggi munhak kwa tokcha ŭi palgyŏn
? ^
Original: 근대 계몽기 문학 과 독자 의 발견
----------------------------------------------------------------------
- Kŭllobŏl sidae ŭi Kyŏngyŏnghak wŏllon
? ^
+ Kŭllobŏl sidae ŭi kyŏngyŏnghak wŏllon
? ^
Original: 글로벌 시대 의 경영학 원론
----------------------------------------------------------------------
- Kŭmganggyŏng toam Sŏnsa ŭi kŭmganggyŏng haesŏlsŏ
? ^
+ Kŭmganggyŏng toam sŏnsa ŭi kŭmganggyŏng haesŏlsŏ
? ^
Original: 금강경 도암 선사 의 금강경 해설서
----------------------------------------------------------------------
- Kŭmyung kaebang ŭi Kyŏngjejŏk hyogwa wa Kwaje
? ^ ^
+ Kŭmyung kaebang ŭi kyŏngjejŏk hyogwa wa kwaje
? ^ ^
Original: 금융 개방 의 경제적 효과 와 과제
----------------------------------------------------------------------
- Kŭmyung Hoesa ŭi p'asaeng sangp'um unyong e ttarŭn risŭk'ŭ kwalli
? ^
+ Kŭmyung hoesa ŭi p'asaeng sangp'um unyong e ttarŭn risŭk'ŭ kwalli
? ^
Original: 금융 회사 의 파생 상품 운용 에 따른 리스크 관리
----------------------------------------------------------------------
- Kidokkyo wa Sahoehak ŭi chŏpchŏm
? ^
+ Kidokkyo wa sahoehak ŭi chŏpchŏm
? ^
Original: 기독교 와 사회학 의 접점
----------------------------------------------------------------------
- Kidokkyo chŏnsŭng ŭi Sosŏljŏk Hyŏngsanghwa wa chakka Ŭisik
? ^ ^ ^
+ Kidokkyo chŏnsŭng ŭi sosŏljŏk hyŏngsanghwa wa chakka ŭisik
? ^ ^ ^
Original: 기독교 전승 의 소설적 형상화 와 작가 의식
----------------------------------------------------------------------
- Kiŏp kyŏngyŏng Kukche kyŏngyŏng pumun
? ^
+ Kiŏp kyŏngyŏng kukche kyŏngyŏng pumun
? ^
Original: 기업 경영 국제 경영 부문
----------------------------------------------------------------------
- Kiŏp kyŏngjaengnyŏk Kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng
? ^
+ Kiŏp kyŏngjaengnyŏk kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng
? ^
Original: 기업 경쟁력 강화 를 위한 내부 고발 과 윤리 경영
----------------------------------------------------------------------
- Kiŏp ŭi Nosa Munje hyŏnjang i tap ida
? ^ ^
+ Kiŏp ŭi nosa munje hyŏnjang i tap ida
? ^ ^
Original: 기업 의 노사 문제 현장 이 답 이다
----------------------------------------------------------------------
- Kiŏp i araya hal 100-kaji Sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
? ^
+ Kiŏp i araya hal 100-kaji sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
? ^
Original: 기업 이 알아야 할 100가지 소프트웨어 저작권 상담 사례
----------------------------------------------------------------------
- Kiŏppŏp kaesŏl Che 12-p'an
? ^
+ Kiŏppŏp kaesŏl che 12-p'an
? ^
Original: 기업법 개설 제 12판
----------------------------------------------------------------------
- Kich'o yŏn'gu t'uja ŭi Kyŏngjejŏk p'agŭp hyogwa punsŏk
? ^
+ Kich'o yŏn'gu t'uja ŭi kyŏngjejŏk p'agŭp hyogwa punsŏk
? ^
Original: 기초 연구 투자 의 경제적 파급 효과 분석
----------------------------------------------------------------------
- Kihu Pyŏnhwa wa chŏnyŏmpyŏng chilbyŏng pudam
? ^
+ Kihu pyŏnhwa wa chŏnyŏmpyŏng chilbyŏng pudam
? ^
Original: 기후 변화 와 전염병 질병 부담
----------------------------------------------------------------------
- Kihu Pyŏnhwa e taeŭng han chisok kanŭng han kukt'o kwalli chŏllyak
? ^
+ Kihu pyŏnhwa e taeŭng han chisok kanŭng han kukt'o kwalli chŏllyak
? ^
Original: 기후 변화 에 대응 한 지속 가능 한 국토 관리 전략
----------------------------------------------------------------------
- Kihu Pyŏnhwa e ttarŭn nongŏp pumun yŏnghyang punsŏk
? ^
+ Kihu pyŏnhwa e ttarŭn nongŏp pumun yŏnghyang punsŏk
? ^
Original: 기후 변화 에 따른 농업 부문 영향 분석
----------------------------------------------------------------------
- Kim chŏngil kongp'o rŭl ssoa ollida
? ^
+ Kim Chŏng-il kongp'o rŭl ssoa ollida
? ^ +
Original: 김 정일 공포 를 쏘아 올리다
----------------------------------------------------------------------
- Kim ch'angmo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ
? ^
+ Kim Ch'ang-mo ŭi Taehan Min'guk sŏnmul opsyŏn kyogwasŏ
? ^ +
Original: 김 창모 의 대한 민국 선물 옵션 교과서
----------------------------------------------------------------------
- Kkŭnnaji anŭn Yŏksa ap esŏ
? ^
+ Kkŭnnaji anŭn yŏksa ap esŏ
? ^
Original: 끝나지 않은 역사 앞 에서
----------------------------------------------------------------------
- Na rŭl titko sesang ŭl hyanghae Ttwiŏ ollara
? ^
+ Na rŭl titko sesang ŭl hyanghae ttwiŏ ollara
? ^
Original: 나 를 딛고 세상 을 향해 뛰어 올라라
----------------------------------------------------------------------
- Nam-Pukhan kan pogŏn ŭiryu kyoryu hyŏmnyŏk ŭi hyoyulChŏk Suhaeng Ch'egye kuch'uk pangan yŏn'gu
? ^ ^ ^
+ Nam-Pukhan kan pogŏn ŭiryu kyoryu hyŏmnyŏk ŭi hyoyulchŏk suhaeng ch'egye kuch'uk pangan yŏn'gu
? ^ ^ ^
Original: 남북한 간 보건 의류 교류 협력 의 효율적 수행 체계 구축 방안 연구
----------------------------------------------------------------------
- Namsŏngbok K'ŭllaesik p'aet'ŏn
? ^
+ Namsŏngbok k'ŭllaesik p'aet'ŏn
? ^
Original: 남성복 클래식 패턴
----------------------------------------------------------------------
- Namwŏn Kosa wŏnjŏn pip'yŏng
? ^
+ Namwŏn kosa wŏnjŏn pip'yŏng
? ^
Original: 남원 고사 원전 비평
----------------------------------------------------------------------
- Nae ka sara on Han'guk hyŏndae Munhaksa
? ^
+ Nae ka sara on Han'guk hyŏndae munhaksa
? ^
Original: 내 가 살아 온 한국 현대 문학사
----------------------------------------------------------------------
- Nae maŭm sok ŭi Han'guk Munhak
? ^
+ Nae maŭm sok ŭi Han'guk munhak
? ^
Original: 내 마음 속 의 한국 문학
----------------------------------------------------------------------
- Noin changgi yoyang pojang Ch'egye ŭi hyŏnhwang kwa kaesŏn pangan
? ^
+ Noin changgi yoyang pojang ch'egye ŭi hyŏnhwang kwa kaesŏn pangan
? ^
Original: 노인 장기 요양 보장 체계 의 현황 과 개선 방안
----------------------------------------------------------------------
The various ALA-LC romanisation tables for Cyrillic script languages have complex encoding issues, and it seems that the mappings in the file below diverge from the intent of the romanisation tables.
In the Asian Cyrillic mapping at
scriptshifter/scriptshifter/tables/data/asian_cyrillic.yml
Lines 334 to 336 in 4fca3e8
Are these the correct mappings?
Let's take a look at “Ѩ” (U+0468), which in the file maps to I\uFE20E\uFE21\u0328; this is the unnormalised form. \uFE21 and \u0328 have different canonical combining classes, and when you normalise them to NFD, the diacritics will be canonically ordered, yielding I\uFE20E\u0328\uFE21.
“Ѥ” (U+0464) is mapped in the file to I\uFE20E\uFE21\u0304. \uFE21 and \u0304 belong to the same combining class, so the diacritics interact typographically and the two orders are not canonically equivalent. Order matters, and the two sequences constitute different graphemes or perceivable characters.
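Both points can be checked with Python's standard unicodedata module. A minimal sketch, where the variable names are only for illustration:

```python
import unicodedata

# Canonical combining classes of the marks involved.
ccc_right_half = unicodedata.combining("\uFE21")  # COMBINING LIGATURE RIGHT HALF
ccc_ogonek = unicodedata.combining("\u0328")      # COMBINING OGONEK
ccc_macron = unicodedata.combining("\u0304")      # COMBINING MACRON

# Different classes: NFD reorders the marks after E.
ogonek_mapped = "I\uFE20E\uFE21\u0328"
ogonek_nfd = unicodedata.normalize("NFD", ogonek_mapped)

# Same class: NFD keeps both orders as they are, so they stay distinct.
macron_a = "I\uFE20E\uFE21\u0304"
macron_b = "I\uFE20E\u0304\uFE21"
macron_a_nfd = unicodedata.normalize("NFD", macron_a)
macron_b_nfd = unicodedata.normalize("NFD", macron_b)
```

The ogonek (class 202) sorts before the ligature half (class 230), while the macron shares class 230 with it and is therefore never reordered.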
So visually (with a properly designed font for these characters) I\uFE20E\uFE21\u0304 would render as the letter I with a left half ligature tie directly above the I, followed by an E with a right half ligature tie directly above the E, and a macron above the right ligature tie, centred on the right ligature tie and the E.
I\uFE20E\u0304\uFE21, by contrast, would render as the letter I with a left half ligature tie directly above the I, followed by an E macron with a right half ligature tie directly above the E macron.
The sequences are not normalised, so I am wondering what the intended sequence is.
This gets much more complicated with
This sequence should render as the letter T with the left half ligature tie positioned directly above the T, followed by S with a right ligature tie positioned above it, and a dot-above positioned above the right half ligature tie, i.e. centred above the S.
It is important to note that U+0307 belongs to S + U+FE21, and not to T + U+FE20 + S + U+FE21. I assume what you are trying to map here is the sequence equivalent to T + ◌͡ + CGJ + ◌̇ + S, i.e. U+0054 U+0361 U+034F U+0307 U+0053
The current mapping using ligature ties is not equivalent to the double diacritic sequence U+0054 U+0361 U+0053 U+0307 or U+0054 U+0361 U+034F U+0307 U+0053.
If we map half ligature tie forms to double spanning diacritic equivalents, we get the following:
Half forms | Double forms |
---|---|
T + U+FE20 + S + U+0307 + U+FE21 | T + U+0361 + S + U+0307 |
T + U+FE20 + S + U+FE21 + U+0307 | Undefined |
Undefined | T + U+0361 + U+034F + U+0307 + S |
compare with:
Half forms | Double forms |
---|---|
I + U+FE20 + E + U+FE21 + U+0328 | I + U+0361 + E + U+0328 |
I + U+FE20 + E + U+0328 + U+FE21 | I + U+0361 + E + U+0328 |
Clarification on intent and usage would be useful.
I do understand that bibliographic data can be dirty, and this is reflected in the Latin to Cyrillic mappings, but I am curious about the Cyrillic to Latin mappings mentioned above.
If $OCLC="No" Then
Local $MARC8[4][2] = [["ŏ","ŏ"],["ŭ","ŭ"],["Ŏ","Ŏ"],["Ŭ","Ŭ"]]
For $i = 0 To Ubound($MARC8, 1) - 1
$KorNameRom = StringRegExpReplace($KorNameRom, "\Q" & $MARC8[$i][0] & "\E",$MARC8[$i][1])
Next
EndIf
This is replacing the vowel + combining breve pair with a single vowel with breve. I remember discussing this as a general normalization step, and I verified this is running in my code.
However, the test strings for the expected results have the combined version, e.g. https://github.com/lcnetdev/scriptshifter/blob/korean/tests/data/sample_strings.csv#L791:
허상 과 실상 : 한국 정치 의 성숙 을 갈망 하며,Hŏsang kwa silsang: Han'guk chŏngch'i ŭi sŏngsuk ŭl kalmang hamyŏ
@hyoungl Were the test strings written with $OCLC="yes"
in mind? If so, shall I replace all the combined breve letters with their one-character versions?
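One way to keep the tests independent of this choice is to normalise both the transliteration output and the expected string to the same Unicode form before comparing. The sketch below assumes NFC as the target form; `same_romanization` is a hypothetical helper, not an existing ScriptShifter function.

```python
import unicodedata

# "o" followed by COMBINING BREVE (U+0306) versus the single code point U+014F.
decomposed = "Ho\u0306sang"
precomposed = "H\u014Fsang"


def same_romanization(actual: str, expected: str) -> bool:
    """Compare two romanized strings after normalising both to NFC."""
    return (unicodedata.normalize("NFC", actual)
            == unicodedata.normalize("NFC", expected))


print(decomposed == precomposed)                   # raw comparison
print(same_romanization(decomposed, precomposed))  # normalised comparison
```

With this approach neither the test data nor the output needs to be rewritten, although it would also hide a genuine mismatch if one specific form is required downstream.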
Didn't transliterate: the punctuation mark ՝ ("Armenian comma").
Armenian "dz"
The Armenian character ձ does not romanize, i.e., the text Արձակ էջեր becomes Arձak ējer instead of Ardzak ējer
Armenian "ev" and "p"
Armenian յ
There's a potential issue with the Armenian character յ. The character is transliterated differently depending on whether it is in modern or classical Armenian. In modern Armenian, it is transliterated as y. According to the transliteration table, the character is transliterated as ḥ "only when the letter is in initial position of a word or of a stem in a compound, in Classical orthography." Scriptshifter only transliterates յ as y.
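To make the positional rule concrete, here is a rough sketch of how the word-initial case could be handled. Everything in it is an assumption for illustration: the `classical` flag and the `romanize_yi` function are hypothetical, the output code points (precomposed ḥ/Ḥ) are a guess at the intended encoding, and the "initial position of a stem in a compound" case is not covered because it cannot be detected from the text alone.

```python
import re

# Assumed output code points (precomposed h/H with dot below).
H_DOT_LOWER = "\u1E25"
H_DOT_UPPER = "\u1E24"


def romanize_yi(text: str, classical: bool = False) -> str:
    """Romanize only the letter յ/Յ; all other characters are left untouched."""
    if classical:
        # Word-initial: not preceded by another letter, digit or underscore.
        text = re.sub(r"(?<!\w)յ", H_DOT_LOWER, text)
        text = re.sub(r"(?<!\w)Յ", H_DOT_UPPER, text)
    # Every remaining occurrence follows the modern rule.
    return text.replace("յ", "y").replace("Յ", "Y")


print(romanize_yi("յա այ"))                  # modern orthography
print(romanize_yi("յա այ", classical=True))  # classical orthography
```

Whether the orthography should be a user option or inferred from the text is a separate design question.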
The "Armenian comma" or "book" should transliterate to a comma, but it doesn't transliterate at all. See the punctuation mark after "nvirvum em" in the attached screenshot.
Georgian ფ and ქ
The character ფ should be romanized with an ayn and not a hacek (pʻ not p̌). The character ქ should be romanized with an ayn and not a hacek (kʻ not ǩ).
Seems to work. (However, we usually use “regular” American quotation marks in place of vernacular ones when transcribing in Romanization.)
The Macedonian LC Romanization table was updated a year or two ago, and now two characters, Ѕ/ѕ and Џ/џ, use combining ligatures; the latter is now distinct from Serbian. (The Transliterator application does not reflect this update, so we have to keep an eye out for these characters and update them ourselves afterwards.) ScriptShifter is transliterating as would be appropriate for Serbian, but not for Macedonian.
In some of the test strings, the romanization starts with a capital letter. In others it doesn't.
If I run the tests without capitalizing the first word, most tests fail because the first word wouldn't be normally capitalized. If I capitalize the first word of all strings, the ones that don't start with a capital letter will fail.
I am not sure if the tests are failing with capitalization off because of an error in my code that is not capitalizing a word that should always be so, or because some test strings have been capitalized on the first word.
Examples of test strings not starting with a capital:
기술 경영 의 이해 -> kisul kyŏngyŏng ŭi ihae
기업 경영 국제 경영 부문 -> kiŏp kyŏngyŏng kukche kyŏngyŏng pumun
기업 경쟁력 강화 를 위한 내부 고발 과 윤리 경영 -> kiŏp kyŏngjaengnyŏk kanghwa rŭl wihan naebu kobal kwa yulli kyŏngyŏng
Strings starting with a capital:
기업 이 알아야 할 100가지 소프트웨어 저작권 상담 사례 -> Kiŏp i araya hal 100-kaji sop'ŭt'ŭweŏ chŏjakkwŏn sangdam sarye
기업법 개설 제 12판 -> Kiŏppŏp kaesŏl che 12-p'an
기초 연구 투자 의 경제적 파급 효과 분석 -> Kich'o yŏn'gu t'uja ŭi kyŏngjejŏk p'agŭp hyogwa punsŏk
@hyoungl is there a fixed rule here?