apertium-chv's Introduction

apertium-chv's People

Contributors

ftyers, hectoralos, ilnarselimcan, jonorthwash, mr-martian, qavan, sushain97, tinodidriksen, unhammer

apertium-chv's Issues

<opt><p3><sg>

<opt><p3><sg> has a specific behaviour. I have used the new archiphoneme {И} for it. The и ending overwrites the final а or е vowel of the stem, if there is one. For stems ending in у or и, an epenthetic sound must be inserted.

$ aq-morftest -ci v_кил_del.yaml | grep "<opt><p3><sg>"
[PASS] кил<v><iv><opt><p3><sg> => килин
$ aq-morftest -ci v_кай_del.yaml | grep "<opt><p3><sg>"
[PASS] кай<v><iv><opt><p3><sg> => кайин
[PASS] кайин => кай<v><iv><opt><p3><sg>
$ aq-morftest -ci v_кала_del.yaml | grep "<opt><p3><sg>"
[FAIL] кала<v><tv><opt><p3><sg> => missing results: калин
[FAIL] кала<v><tv><opt><p3><sg> => unexpected results: калаин
$ echo "кала<v><iv><opt><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кала<v><iv><opt><p3><sg>	кала<v><iv><opt><p3><sg>+?	inf
$ aq-morftest -ci v_ту_del.yaml | grep "<opt><p3><sg>"
[FAIL] ту<v><tv><opt><p3><sg> => missing results: тӑвин
[FAIL] ту<v><tv><opt><p3><sg> => unexpected results: тувин
$ echo "ту<v><tv><opt><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
ту<v><tv><opt><p3><sg>	ту{в}>{И}н	0,000000
$ aq-morftest -ci v_выля_del.yaml | grep "<opt><p3><sg>"
[FAIL] выля<v><tv><opt><p3><sg> => missing results: вылин
[FAIL] выля<v><tv><opt><p3><sg> => unexpected results: выляин
$ echo "выля<v><tv><opt><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
выля<v><tv><opt><p3><sg>	выля>{И}н	0,000000

Deal with -и- in participles

05:12 <spectie> $ echo килнине | hfst-lookup chv.automorf.hfst 
05:12 <spectie> килнине кил<v><iv><ger_past><px3sp><dat> 0,000000
05:13 <spectie> but not the one that Luutonen marks as come-PST.PTCP-I-DAT/ACC
05:13 <spectie> page 51
05:14 <firespeaker> well he says on p.57-58 that it can be thought of synchronically as px3
05:14 <firespeaker> I guess what you want is
05:14 <firespeaker> кил<v><iv><gpr_past><subst><dat>
05:15 <spectie> but then there are some cases where it should surface as -ĕ- but surfaces as -и-
05:16 <firespeaker> yeah, he says on p. 55 that they PARTIALLY overlap in form

On adjective nominalisation

There are three typical forms of adjective nominalisation:

  • The most used is the suffix -и (I don't know why it appears as и(н) in lexc, but that is a separate issue)
  • The -скер suffix is also known and implemented
  • A third way is simply adding possessive, plural and/or case suffixes to an adjective (e.g. ҫамрӑк (young) > ҫамрӑксем (young people))

This third way is used less than the other two, but it does occur.

The problem is that the current analysis does not distinguish between +и and +∅. We should differentiate between the two. For the first one I would use +и<subst%>, as is done for скер.

Any problem?

<abil>

<abil> causes many problems. Here are some relevant cases:

$ echo "кил<v><iv><abil><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кил<v><iv><abil><pres><p1><sg>	кил>{A}й>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_кил_del.yaml | grep "<abil><pres><p1><sg>"
[FAIL] кил<v><iv><abil><pres><p1><sg> => missing results: килеетӗп
[FAIL] кил<v><iv><abil><pres><p1><sg> => unexpected results: килейетӗп

$ echo "кала<v><tv><abil><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кала<v><tv><abil><pres><p1><sg>	кала>{A}й>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_кала_del.yaml | grep "<abil><pres><p1><sg>"
[FAIL] кала<v><tv><abil><pres><p1><sg> => missing results: калаятӑп
[FAIL] кала<v><tv><abil><pres><p1><sg> => unexpected results: калайатӑп, калааятӑп

$ echo "ҫи<v><tv><abil><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
ҫи<v><tv><abil><pres><p1><sg>	ҫи{й}>{A}й>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_ҫи_del.yaml | grep "<abil><pres><p1><sg>"
[FAIL] ҫи<v><tv><abil><pres><p1><sg> => missing results: ҫийеетӗп
[FAIL] ҫи<v><tv><abil><pres><p1><sg> => unexpected results: ҫиейетӗп

$ echo "ту<v><tv><abil><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
ту<v><tv><abil><pres><p1><sg>	ту{в}>{A}й>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_ту_del.yaml | grep "<abil><pres><p1><sg>"
[FAIL] ту<v><tv><abil><pres><p1><sg> => missing results: тӑвaятӑп
[FAIL] ту<v><tv><abil><pres><p1><sg> => unexpected results: тӑвайатӑп, тӑваятӑп

$ echo "выля<v><tv><abil><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
выля<v><tv><abil><pres><p1><sg>	выля>{A}й>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_выля_del.yaml | grep "<abil><pres><p1><sg>"
[FAIL] выля<v><tv><abil><pres><p1><sg> => missing results: выляятӑп
[FAIL] выля<v><tv><abil><pres><p1><sg> => unexpected results: выляаятӑп, выляайатӑп

Generating two forms of px2sg.dat

Currently apertium-chv generates two forms of some words with <px2sg><dat>, e.g.

^алӑк<n><px2sg><dat>$  ↬  алӑкуна/алӑкна

If both of these are correct, we need to choose which one to generate (my inclination would be to go with the former, since it will be distinct from px3sp.dat in many situations). If only one is correct, which is it?

@hectoralos

<n><pl>+сӑр<post> seems to sometimes fail

E.g.,

[FAIL] тетрадь<n><pl>+сӑр<post> => missing results: тетрадьсемсӗр
[FAIL] тетрадь<n><pl>+сӑр<post> => unexpected results: тетрадьсесӗр
[FAIL] июнь<n><pl>+сӑр<post> => missing results: июньсемсӗр
[FAIL] июнь<n><pl>+сӑр<post> => unexpected results: июньсесӗр
[FAIL] кукӑль<n><pl>+сӑр<post> => missing results: кукӑльсемсӗр
[FAIL] кукӑль<n><pl>+сӑр<post> => unexpected results: кукӑльсесӗр
[FAIL] тӗн<n><pl>+сӑр<post> => missing results: тӗнсемсӗр
[FAIL] тӗн<n><pl>+сӑр<post> => unexpected results: тӗнсесӗр
[FAIL] ял<n><pl>+сӑр<post> => missing results: ялсемсӗр
[FAIL] ял<n><pl>+сӑр<post> => unexpected results: ялсесӗр
[FAIL] ача<n><pl>+сӑр<post> => missing results: ачасемсӗр
[FAIL] ача<n><pl>+сӑр<post> => unexpected results: ачасесӗр
[FAIL] ҫыру<n><pl>+сӑр<post> => missing results: ҫырусемсӗр
[FAIL] ҫыру<n><pl>+сӑр<post> => unexpected results: ҫырусесӗр

All of them are missing the м.

Last vowel is not deleted in кала

In some verbal forms the last а of verb кала is not deleted:

$ echo "кала<v><iv><fut><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><fut><p1><sg>	кала>{Ă}п	0,000000

калаӑп is generated instead of калӑп

$ echo "кала<v><iv><fut><p2><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><fut><p2><sg>	кала>{Ă}н	0,000000

калан is generated instead of калӑн

$ echo "кала<v><iv><fut><p1><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><fut><p1><pl>	кала>{Ă}п{Ă}р	0,000000

калаӑпӑр is generated instead of калӑпӑр

$ echo "кала<v><iv><fut><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><fut><p2><pl>	кала>{Ă}р	0,000000

калаӑр is generated instead of калӑр

$ echo "кала<v><iv><cond><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><cond><p1><sg>	кала>{Ă}тт{Ă}м	0,000000

калаӑттӑм is generated instead of калӑттӑм

$ echo "кала<v><iv><cond><p2><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><cond><p2><sg>	кала>{Ă}тт{Ă}н	0,000000

калаӑттӑн is generated instead of калӑттӑн

$ echo "кала<v><iv><cond><p1><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><cond><p1><pl>	кала>{Ă}тт{Ă}м{Ă}р	0,000000

калаӑттӑмӑр is generated instead of калӑттӑмӑр

$ echo "кала<v><iv><cond><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><cond><p2><pl>	кала>{Ă}тт{Ă}р	0,000000

калаӑттӑр is generated instead of калӑттӑр

$ echo "кала<v><iv><imp><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><imp><p1><sg>	кала>{A}м	0,000000

калаам is generated instead of калам

$ echo "кала<v><iv><imp><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
кала<v><iv><imp><p2><pl>	кала>{Ă}р	0,000000

калаӑр is generated instead of калӑр

+(ч)чен<post>

The first ч should only appear in contexts where чен should be read as voiced. So, it has to be deleted after voiceless consonants. Examples:

$ echo "мулкач<n>+ччен<post>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
мулкач<n>+ччен<post>	мулкач>{ч}чен	0,000000

Generates мулкачччен instead of мулкаччен.

$ echo "автан<n>+ччен<post>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
автан<n>+ччен<post>	автан>{ч}чен	0,000000

Generates автанччен: correct.

$ echo "йытӑ<n>+ччен<post>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
йытӑ<n>+ччен<post>	йыт{ː}ӑ>{ч}чен	0,000000

Generates йытӑччен: Correct.

л{й}я = н{й}я

In words ending in н{й}я, й is replaced by ь before ӑ. This is correct. The same should happen for words ending in л{й}я. For instance:

$ echo "Коля<np><ant><m><ins>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
Коля<np><ant><m><ins>	Кол{й}я>п{A}	0,000000

Generates Колйӑпа instead of Кольӑпа.

Epenthetic й in verb ҫи

For the following forms of verb ҫи epenthetic й is not generated:

$ echo "ҫи<v><tv><fut><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p1><sg>	ҫи>{Ă}п	0,000000

ҫиӗп is generated instead of ҫийӗп

$ echo "ҫи<v><tv><fut><p2><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p2><sg>	ҫи>{Ă}н	0,000000

ҫин is generated instead of ҫийӗн

$ echo "ҫи<v><tv><fut><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p3><sg>	ҫи>{ӗ}	0,000000

ҫиӗ is generated instead of ҫийӗ

$ echo "ҫи<v><tv><fut><p1><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p1><pl>	ҫи>{Ă}п{Ă}р	0,000000

ҫиӗпӗр is generated instead of ҫийӗпӗр

$ echo "ҫи<v><tv><fut><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p2><pl>	ҫи>{Ă}р	0,000000

ҫиӗр is generated instead of ҫийӗр

$ echo "ҫи<v><tv><fut><p3><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><fut><p3><pl>	ҫи>{ӗ}ҫ	0,000000

ҫиӗҫ is generated instead of ҫийӗҫ

$ echo "ҫи<v><tv><cond><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p1><sg>	ҫи>{Ă}тт{Ă}м	0,000000

ҫиӗттӗм is generated instead of ҫийӗттӗм

$ echo "ҫи<v><tv><cond><p2><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p2><sg>	ҫи>{Ă}тт{Ă}н	0,000000

ҫиӗттӗн is generated instead of ҫийӗттӗн

$ echo "ҫи<v><tv><cond><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p3><sg>	ҫи>{ӗ}ҫҫ{ӗ}	0,000000

ҫиӗҫҫӗ is generated instead of ҫийӗҫҫӗ

$ echo "ҫи<v><tv><cond><p1><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p1><pl>	ҫи>{Ă}тт{Ă}м{Ă}р	0,000000

ҫиӗттӗмӗр is generated instead of ҫийӗттӗмӗр

$ echo "ҫи<v><tv><cond><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p2><pl>	ҫи>{Ă}тт{Ă}р	0,000000

ҫиӗттӗр is generated instead of ҫийӗттӗр

$ echo "ҫи<v><tv><cond><p3><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><cond><p3><pl>	ҫи>{ӗ}ҫҫ{ӗ}ҫ	0,000000

ҫиӗҫҫӗҫ is generated instead of ҫийӗҫҫӗҫ

$ echo "ҫи<v><tv><imp><p2><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2> /dev/null
ҫи<v><tv><imp><p2><pl>	ҫи>{Ă}р	0,000000

ҫиӗр is generated instead of ҫийӗр

different tags for <abe> and <term>?

It looks like <abe> has been changed to +сӑр<post> and <term> has been changed to <ter>. I'm curious when these changes were made and what the reasoning for the changes was.

Extra line when calling chv-morph

For now, we have a very "dirty" Chuvash corpus: we automatically converted PDF files to TXT and that's all. When it is run through chv-morph, some extra lines appear. This makes it difficult to match the text file with the output of the analysis line by line. Below is one case where an extra line appears. Any idea how to fix it (without fully cleaning the corpus)?

$ cat -n corpus/chv.crp.tantash.net.txt | head -n 519 | tail -n 2
   518	кӗтеттӗм. Кӗтсе илейменнипе пӗр эрне
   519	маларах ҫыхма пуҫларӑм. Кӗтнӗ кун
$ cat -n corpus/chv.crp.tantash.net.txt | head -n 519 | tail -n 2 | apertium -d . chv-morph
   ^518/518<num>$	^кӗтеттӗм/кӗт<v><tv><dur><p1><sg>$^./.<sent>$ ^Кӗтсе/кӗт<v><tv><gna_impf>$ ^илейменнипе/*илейменнипе$ ^пӗр/пӗрре<num><attr>/пӗр<v><iv><imp><p2><sg>/пӗр<v><tv><imp><p2><sg>$ ^эрне
   519/*эрне
   519$	^маларах/маларах<adj>/маларах<adv>$ ^ҫыхма/ҫых<v><iv><ger>/ҫых<v><iv><neg><imp><p2><sg>/ҫых<v><tv><ger>/ҫых<v><tv><neg><imp><p2><sg>$ ^пуҫларӑм/пуҫла<v><iv><past><p1><sg>/пуҫла<v><tv><past><p1><sg>$^./.<sent>$ ^Кӗтнӗ/кӗт<v><tv><gpr_past>/кӗт<v><tv><past><evid>$ ^кун/кун<adj>/кун<n><attr>/кун<n><nom>/кун<v><iv><imp><p2><sg>/кун<v><tv><imp><p2><sg>/ку<prn><dem><gen>$^./.<sent>$

Idea

Verbs like кай or выля and affixes like Ай give a lot of work not because they are "irregular" but because of the stupid alien Stalinist orthography of Chuvash, that uses я, ю et al. Maybe it would be easier to have in the lexc forms like выльа, and to work in twol with pre-1936 orthography, i.e. producing e.g. кайатӑп. A final step could change all йа and ьа to я. Wouldn't it be easier?
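To make the proposed final step concrete, here is a minimal sketch in Python. It assumes twol has already produced a pre-1936 style form and covers only the йа/ьа to я mapping mentioned above. The function name and the mapping table are hypothetical, and other letters such as ю would need their own entries.

```python
# Hypothetical post-processing table: only the two digraphs named in the
# proposal are covered; ю, е etc. would need additional entries.
DIGRAPHS = {
    "йа": "я",
    "ьа": "я",
}


def to_modern_orthography(form: str) -> str:
    """Rewrite pre-1936 style digraphs into the modern single letter."""
    for old, new in DIGRAPHS.items():
        form = form.replace(old, new)
    return form


print(to_modern_orthography("кайатӑп"))
print(to_modern_orthography("выльа"))
```

In the real pipeline this would more likely be another transducer composed after twol rather than a string replacement, so treat this only as an illustration of the mapping.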

Politeness suffix сĂм

I've tried to introduce this suffix, but I've done it wrong and I don't know how I should do it.

There are 3 politeness suffixes that can be used in imp.p2. The problem is with сĂм, because it is added directly after the stem, before the person suffix (which exists only in p2.pl). For instance, for the verb кала in p2.pl we can have

калӑр ("normal" form without any politeness suffix)

But with сӑм:

кала.сӑм.ӑр

Other options are simple, since suffixes/clitics ччӗ and ха are added after the person suffix:

калӑрччӗ
каласӑмӑрччӗ
калӑр-ха
калӑрччӗ-ха
каласӑмӑрччӗ-ха

New mode required

A new Chuvash grammar textbook is being prepared on the basis of a 3M+ word corpus and our morphological analysis. The author is asking for a composite output of the chv-morph and chv-segment modes, in which he could more easily search for specific surface forms of morphemes.

For instance, currently we have these two analyses for ачисен:

$ echo "ачисен" | apertium -d . chv-morph
^ачисен/ача<n><px3sp><pl><gen>$^./.<sent>$
$ echo "ачисен" | apertium -d . chv-segment
^ачисен/ач>и>се>н$^./.$

He is asking for something like this:

^ачисен/ача<n>и<px3sp>се<pl>н<gen>$

This request seems reasonable and could probably be useful for other people and languages.

Could this be done reasonably easily?
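As a rough illustration of the requested output, the two existing analyses could be interleaved outside the transducer. The sketch below is in Python and the function name merge_analysis is hypothetical. It assumes the first segment belongs to the lemma plus its first tag and that every later segment matches exactly one later tag. That holds for the ачисен example, but it would not hold for analyses with two part-of-speech tags (e.g. <v><iv>) or with zero morphs, which would need proper alignment.

```python
import re


def merge_analysis(morph: str, segment: str) -> str:
    """Interleave surface segments (chv-segment) with tags (chv-morph).

    Assumption: segment 0 pairs with the lemma and its first tag, and each
    following segment pairs with exactly one following tag.
    """
    lemma = morph.split("<", 1)[0]
    tags = re.findall(r"<[^>]+>", morph)
    segments = segment.split(">")
    if len(segments) != len(tags):
        raise ValueError("segments and tags do not align one-to-one")
    parts = [lemma + tags[0]]
    parts += [seg + tag for seg, tag in zip(segments[1:], tags[1:])]
    return "".join(parts)


print(merge_analysis("ача<n><px3sp><pl><gen>", "ач>и>се>н"))
```

The function raises an error instead of guessing when the counts differ, so unaligned forms can be listed and inspected separately.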

+(ч)чӗ

This seems to be the same problem as #26 but for two other suffixes.

The first ч should only appear in contexts where чӗ should be read as voiced. So, it has to be deleted after voiceless consonants. Example:

$ echo "йывӑҫ<n><nom>+ӗ<cop><ifi>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
йывӑҫ<n><nom>+ӗ<cop><ifi>	йывӑҫ>{ч}чӗ	0,000000

Generates йывӑҫччӗ instead of йывӑҫчӗ.

$ echo "калаҫ<v><tv><imp><p2><sg>+ччӗ<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
калаҫ<v><tv><imp><p2><sg>+ччӗ<mod>	калаҫ>{ч}чӗ	0,000000

Generates калаҫччӗ instead of калаҫчӗ

past p3

There are problems in p3.sg and p3.pl in the past tense.

According to Krueger (p. 144) ч appears in p3 of stems in /l n r/ (actually he says p3.sg but in the examples he gives we can see that it also happens in p3.pl). So:

$ echo "кай<v><iv><past><p3><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кай<v><iv><past><p3><sg>	кай>{T}>ӗ	0,000000

Currently кайчӗ is generated instead of кайрӗ

$ echo "кай<v><iv><past><p3><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кай<v><iv><past><p3><pl>	кай>{T}>ӗҫ	0,000000

Currently кайчӗҫ is generated instead of кайрӗҫ

(By the way, https://www.sapatlav.club is useful for conjugating verbs in Chuvash)

+ах<mod>

There are problems with +ах when it follows something ending in a vowel. Here are some cases:

$ echo "урам<n><px3sp><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
урам<n><px3sp><nom>+ах<mod>	урам>{и}{н}>{A}х	0,000000

Generates урамнех instead of урамех

$ echo "пулӑ<n><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
пулӑ<n><nom>+ах<mod>	пул{ː}ӑ>{A}х	0,000000

Generates пуллӑах instead of пулах (notice also the lack of gemination)

$ echo "пулӑ<n><dat>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
пулӑ<n><dat>+ах<mod>	пул{ː}ӑ>{N}{A}>{A}х	0,000000

Generates пуллаах instead of пуллах.

$ echo "кӗнеке<n><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кӗнеке<n><nom>+ах<mod>	кӗнеке>{A}х	0,000000

Generates кӗнекеех instead of кӗнекех

$ echo "лаша<n><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
лаша<n><nom>+ах<mod>	лаша>{A}х	0,000000

Generates лашаах instead of лашах

$ echo "кӗнеке<n><px3sp><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кӗнеке<n><px3sp><nom>+ах<mod>	кӗнеке>{и}{н}>{A}х	0,000000

Generates кӗнекинех instead of кӗнекиех

$ echo "информаци<n><px3sp><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
информаци<n><px3sp><nom>+ах<mod>	информаци{й}>{и}{н}>{A}х	0,000000

Generates информацинех instead of информациех

$ echo "правительство<n><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
правительство<n><nom>+ах<mod>	правительств{о}>{A}х	0,000000

Generates правительствоах instead of правительствах

$ echo "кофе<n><nom>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кофе<n><nom>+ах<mod>	коф{е}>{A}х	0,000000

Generates кофеех instead of кофех

$ echo "кил<v><iv><pres><p3><pl>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кил<v><iv><pres><p3><pl>+ах<mod>	кил>{A}ҫҫӗ>{A}х	0,000000

Generates килеҫҫӗех instead of килеҫҫех

$ echo "кил<v><iv><ger_nec>+ах<mod>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кил<v><iv><ger_nec>+ах<mod>	кил>м{A}лл{A}>{A}х	0,000000

Generates килмеллеех instead of килмеллех

Allative

The suffix АллА is called the "allative". The problem is that the vowels preceding it are not deleted and epenthetic consonants are not added:

  • пулӑ<n><all> = пуллӑалла instead of пуллалла
  • пулӑ<n><pl><all> = пулӑсеелле instead of пулӑсенелле
  • лаша<n><all> = лашаалла instead of лашалла
  • лаша<n><pl><all> = лашасеелле instead of лашасенелле
  • музей<n><all> = музейелле or музейалла instead of музеелле or музеялла

It could be added that, according to Chuvash grammarians, this is the dative case + (л)лА. This can help understand the rules about duplication of л in пуллалла and the н in пулӑсенелле.

Problems with выля

Looking at the differences in the analysis after the fix of #27, I have noticed two word forms that are not analysed now. I cannot understand why, because I cannot see the relationship with the fix, but it seems that it is a very strange side effect. Here are the two forms I noticed:

$ echo "выля<v><tv><pres><p3><pl>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
выля<v><tv><pres><p3><pl>	выл{й}я>{A}ҫҫӗ	0,000000

Now выляаҫҫӗ is generated instead of выляҫҫӗ.
Notice that the change should be, as I understand:
в ы л {й}:й я:0 > {A}:а ҫ ҫ ӗ

$ echo "выля<v><tv><gna_impf>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
выля<v><tv><gna_impf>	выл{й}я>с{A}	0,000000

Now выльӑса is generated instead of выляса

ту <past>

{в} surfaces as в ({в}:в) in the past tense, but this should not happen:

$ echo "ту<v><tv><past><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
ту<v><tv><past><p1><sg>	ту{в}>{T}>{Ă}м	0,000000
$ aq-morftest -ci v_ту_del.yaml | grep "<past>"
[FAIL] ту<v><tv><past><p1><sg> => missing results: турӑм
[FAIL] ту<v><tv><past><p1><sg> => unexpected results: туврӑм
[FAIL] ту<v><tv><past><p2><sg> => missing results: турӑн
[FAIL] ту<v><tv><past><p2><sg> => unexpected results: туврӑн
[FAIL] ту<v><tv><past><p3><sg> => missing results: турӗ
[FAIL] ту<v><tv><past><p3><sg> => unexpected results: туврӗ
[FAIL] ту<v><tv><past><p1><pl> => missing results: турӑмӑр
[FAIL] ту<v><tv><past><p1><pl> => unexpected results: туврӑмӑр
[FAIL] ту<v><tv><past><p2><pl> => missing results: турӑр
[FAIL] ту<v><tv><past><p2><pl> => unexpected results: туврӑр
[FAIL] ту<v><tv><past><p3><pl> => missing results: турӗҫ
[FAIL] ту<v><tv><past><p3><pl> => unexpected results: туврӗҫ

Allative for литература-like words

Sovietisms with these unstressed final vowels should form allatives like similar Chuvash words ending in ӑ or ӗ. So:

литература<n><all> = литературалла instead of литературӑналла (cf. пуллалла from пулӑ)
няня<n><all> = нянялла instead of няньӑналла
аллея<n><all> = аллеялла instead of аллейӑналла
министерство<n><all> = министерствалла instead of министерствӑналла
училище<n><all> = училищелле instead of училищӗнелле

-(ч)чен

The suffix/postposition -(ч)чен is currently always added with a double чч. Only a single ч should be written after a voiceless consonant:
автанччен, чӗрӗпчен, ҫуртчен, автобусчен, июльччен, пулӑччен, лашаччен, литератураччен.
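The rule can be sketched in a few lines of Python to check the examples above. The set of voiceless consonant letters is my assumption and should be replaced by whatever set the twol file defines. The function name add_chen is hypothetical.

```python
# Assumed set of voiceless consonant letters; the real set should come
# from the twol definitions.
VOICELESS = set("кпстфхцчшщҫ")


def add_chen(stem: str) -> str:
    """Attach -(ч)чен: single ч after a voiceless consonant, чч elsewhere."""
    last = stem[-1].lower()
    suffix = "чен" if last in VOICELESS else "ччен"
    return stem + suffix


for stem in ["автан", "чӗрӗп", "ҫурт", "автобус", "июль", "пулӑ", "лаша"]:
    print(stem, "->", add_chen(stem))
```

Stems ending in ь are treated here like any non-voiceless ending, which matches июльччен. Whether that is right when ь follows a voiceless consonant is not covered by the examples.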

plu and prep tenses

In 8b6e805 a number of new tenses were added, which don't seem to be standard. Some of the negative forms look regular, others not. We should discuss tag names and how to organise the lexica.

05:35 <@firespeaker> prep is {A}t+past / m(A)s+past
05:36 <@firespeaker> plu is s{A}t+past / m{A}s{A}t+past

I've done a bit of simplification in 8ecbe54 .

Installed modes are missing files

modes.xml includes some modes with install="yes", but the required
files aren't installed.

Some generic suggestions:

  • -lexc and -twol modes probably aren't useful to users

  • -spell modes should depend on --enable-ospell

  • .deps files are never installed, so any modes using them shouldn't be
    installed.

Messages for package app-dicts/apertium-chv-9999:

Failed to find '/usr/share/apertium/apertium-chv/.deps/chv.twol.hfst' in install image.
QA: missing files required for mode chv-twol.
Failed to find '/usr/share/apertium/apertium-chv/.deps/chv.lexc.hfst' in install image.
QA: missing files required for mode chv-lexc.

-(л)лĂ

I've been surprised to find this suffix as a postposition. I always considered it as a form of adjectivation. Of course, it can be considered as a postposition if this is more pan-Turkic :)

In any case, the suffix has two л after a vowel. So it is similar to (ч)чен, but easier, since there are no differences between consonants.

Examples:

пуртӑллӑ from пуртӑ.
мозаикӑллӑ from мозаика.
няньӑллӑ from няня.
илемлӗ from илем.
хутлӑ from хут.
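A minimal Python sketch of the two choices involved (лл vs л, and ӑ vs ӗ) for the native examples above. The back/front vowel sets and the function name add_lla are assumptions for illustration. The reduction of the final vowel in loanwords (мозаика > мозаикӑллӑ, няня > няньӑллӑ) is deliberately not handled.

```python
# Assumed vowel classes for harmony; я and ю are left out on purpose.
BACK = set("аӑоуы")
FRONT = set("еӗиӳэ")


def add_lla(stem: str) -> str:
    """Attach -(л)лĂ: лл after a vowel, л after a consonant, harmonised Ă."""
    vowels = [ch for ch in stem.lower() if ch in BACK or ch in FRONT]
    if not vowels:
        raise ValueError("no vowel to harmonise with")
    harmony = "ӑ" if vowels[-1] in BACK else "ӗ"
    last = stem[-1].lower()
    consonant = "лл" if (last in BACK or last in FRONT) else "л"
    return stem + consonant + harmony


for stem in ["пуртӑ", "илем", "хут"]:
    print(stem, "->", add_lla(stem))
```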

+и<subst>

I have been investigating the substantivisation suffix и. It was not very clear.

As you quote from Luutonen in #5, it is very close to the third person affix (and, I guess, may come from it), but there are differences from it.

  • There is no ӗ in any form. It is always и.
  • The duplication rules are different: as for px3sg, there is duplication for words that end in ӑ or ӗ, if preceded only by a consonant (ҫурӑ > ҫурри, but тутлӑ > тутли); but also adjectives finishing in consonant have duplication (пысӑк > пысӑкки, пӗчӗк > пӗчӗкки, аван > аванни).
  • The oblique forms are almost always the ones that come after px3sg, but not for genitive (лайӑх<adj>+и<subst><dat> : лайӑххин, cf. информаци<n><px3sp><gen> : информацийӗн)
  • This affix cannot be added to adjectives ending in и, у, or ӳ (I have created an A2-IUÜ paradigm for them).

So:

  • Although I continue to use {и} in the definition of this suffix in twol, it should be a different archiphoneme (not the one for px3sg).
  • I added {ː} to all adjectives ending in VC[ӑӗ] (as is done for nouns), and it seems to work.
  • Unfortunately we cannot add {ː} to all adjectives ending in a single consonant. I tried it, but then duplication also happens with other suffixes, for instance +ах. There is duplication only for +и (it seems).

Tests can be found in и_subst.yaml. The results below are from a test with лайӑх defined as лайӑх:лайӑх%{ː%} A2; (but with пысӑк, пӗчӗк and аван defined without {ː}). Notice how the test passes for лайӑх+и, for instance, but not for лайӑх<adj>+ах<mod>, while for аван (defined without {ː}) the situation is the opposite.

$ aq-morftest и_subst.yaml | more
--------------------------------------
Test 0: и <subst> (Lexical/Generation)
--------------------------------------
[PASS] лайӑх<adj>+и<subst><nom> => лайӑххи
[FAIL] пысӑк<adj>+и<subst><nom> => missing results: пысӑкки
[FAIL] пысӑк<adj>+и<subst><nom> => unexpected results: пысӑкӗ
[FAIL] пӗчӗк<adj>+и<subst><nom> => missing results: пӗчӗкки
[FAIL] пӗчӗк<adj>+и<subst><nom> => unexpected results: пӗчӗкӗ
[FAIL] аван<adj>+и<subst><nom> => missing results: аванни
[FAIL] аван<adj>+и<subst><nom> => unexpected results: аванӗ
[PASS] тутлӑ<adj>+и<subst><nom> => тутли
[PASS] ҫурӑ<adj>+и<subst><nom> => ҫурри
[PASS] хура<adj>+и<subst><nom> => хури
[PASS] хитре<adj>+и<subst><nom> => хитри
[FAIL] лайӑх<adj>+и<subst><dat> => missing results: лайӑххин
[FAIL] лайӑх<adj>+и<subst><dat> => unexpected results: лайӑххине
[PASS] лайӑх<adj>+и<subst><ins> => лайӑххипе
[PASS] лайӑх<adj>+и<subst><pl><nom> => лайӑххисем
[PASS] лайӑх<adj>+и<subst><pl><gen> => лайӑххисен
[FAIL] лайӑх<adj>+ах<mod> => missing results: лайӑхах
[FAIL] лайӑх<adj>+ах<mod> => unexpected results: лайӑххах
[PASS] лайӑх<adj><comp> => лайӑхрах
[PASS] аван<adj>+ах<mod> => аванах
[PASS] аван<adj><comp> => авантарах
[FAIL] аван<adj><comp> => unexpected results: аванрах
Test 0 - Passes: 11, Fails: 11, Total: 22

P.S.
I was at the Humanities Institute discussing it with the whole philology department. They do not agree on whether a possessive can come after this substantivisation suffix. If it can, it should be extremely rare, and I have not found such cases in our corpus, so I removed the possibility of adding possessives after this suffix (I had added it a few days ago; now it is as it was before).

Verbs ту and тӳ

Ту is defined as:
ту:ту%{в%} V-TV;!""

But {в} is not working properly in many cases.

ту<v><tv><pres><p1><sg> generates туатӑп instead of тӑватӑп
ту<v><tv><pres><p2><sg> generates туатӑн instead of тӑватӑн
ту<v><tv><pres><p3><sg> generates туать instead of тӑвать
ту<v><tv><pres><p1><pl> generates туатпӑр instead of тӑватпӑр
ту<v><tv><pres><p2><pl> generates туатӑр instead of тӑватӑр
(But p3.pl is correctly generated: тӑваҫҫӗ)

There is a similar problem for the future:
ту<v><tv><fut><p1><sg> generates тувӑп instead of тӑвӑп
ту<v><tv><fut><p3><sg> generates тувӗ instead of тӑвӗ
ту<v><tv><fut><p1><pl> generates тувӑпӑр instead of тӑвӑпӑр
ту<v><tv><fut><p2><pl> generates тувӑр instead of тӑвӑр
ту<v><tv><fut><p3><pl> generates тувӗҫ instead of тӑвӗҫ
(But p2.sg is correctly generated: тӑвӑн)

For some persons in imperative:
ту<v><tv><imp><p3><sg> generates тувтӑр instead of тутӑр
ту<v><tv><imp><p2><pl> generates тувӑр instead of тӑвӑр
ту<v><tv><imp><p3><pl> generates тувччӑр instead of туччӑр

More:

$ echo "^ту<v><tv><gpr_pres>$    " | hfst-proc -g chv.autogen.hfst
туакан  

Should be тӑвакан
(But ту<v><tv><gpr_fut> is correctly generated: тӑвас)

The same problems can be found for тӳ.

<num><ord><subst><px3sp>+ччен<post>

This is quite a frequent combination, but it fails:

$ echo "23<num><ord><subst><px3sp>+ччен<post>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
23<num><ord><subst><px3sp>+ччен<post>	23>-мӗш>{и}{н}>{ч}чен	0,000000

Generates 23-мӗшнччен instead of 23-мӗшӗччен

музеа/музея

One of the two possible dative forms of музей is музея (the other is музее), but музеа is recognised/generated instead. Vowel + а requires an epenthetic й: музея.

<iter>

There are problems when joining <iter> to the suffix of present and conditional tenses. For example:

$ echo "кил<v><iv><iter><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кил<v><iv><iter><pres><p1><sg>	кил>к{A}л{A}>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_кил_del.yaml | grep "<iter>"
[FAIL] кил<v><iv><iter><pres><p1><sg> => missing results: килкелетӗп
[FAIL] кил<v><iv><iter><pres><p1><sg> => unexpected results: килкелеетӗп
[FAIL] кил<v><iv><iter><pres><p2><sg> => missing results: килкелетӗн
[FAIL] кил<v><iv><iter><pres><p2><sg> => unexpected results: килкелеетӗн
[FAIL] кил<v><iv><iter><pres><p3><sg> => missing results: килкелет
[FAIL] кил<v><iv><iter><pres><p3><sg> => unexpected results: килкелеет
[FAIL] кил<v><iv><iter><pres><p1><pl> => missing results: килкелетпӗр
[FAIL] кил<v><iv><iter><pres><p1><pl> => unexpected results: килкелеетпӗр
[FAIL] кил<v><iv><iter><pres><p2><pl> => missing results: килкелетӗр
[FAIL] кил<v><iv><iter><pres><p2><pl> => unexpected results: килкелеетӗр
[FAIL] кил<v><iv><iter><pres><p3><pl> => missing results: килкелеҫҫӗ
[FAIL] кил<v><iv><iter><pres><p3><pl> => unexpected results: килкелееҫҫӗ

$ echo "кала<v><tv><iter><pres><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кала<v><tv><iter><pres><p1><sg>	кала>к{A}л{A}>{A}т>{Ă}п	0,000000
$ aq-morftest -ci v_кала_del.yaml | grep "<iter>"
[FAIL] кала<v><tv><iter><pres><p1><sg> => missing results: калакалатӑп
[FAIL] кала<v><tv><iter><pres><p1><sg> => unexpected results: калакалаатӑп
[FAIL] кала<v><tv><iter><pres><p2><sg> => missing results: калакалатӑн
[FAIL] кала<v><tv><iter><pres><p2><sg> => unexpected results: калакалаатӑн
[FAIL] кала<v><tv><iter><pres><p3><sg> => missing results: калакалать
[FAIL] кала<v><tv><iter><pres><p3><sg> => unexpected results: калакалаать
[FAIL] кала<v><tv><iter><pres><p1><pl> => missing results: калакалатпӑр
[FAIL] кала<v><tv><iter><pres><p1><pl> => unexpected results: калакалаатпӑр
[FAIL] кала<v><tv><iter><pres><p2><pl> => missing results: калакалатӑр
[FAIL] кала<v><tv><iter><pres><p2><pl> => unexpected results: калакалаатӑр
[FAIL] кала<v><tv><iter><pres><p3><pl> => missing results: калакалаҫҫӗ
[FAIL] кала<v><tv><iter><pres><p3><pl> => unexpected results: калакалааҫҫӗ

$ echo "кил<v><iv><iter><cond><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кил<v><iv><iter><cond><p1><sg>	кил>к{A}л{A}>{Ă}т>т{Ă}м	0,000000
$ aq-morftest -ci v_кил_del.yaml | grep "<iter><cond>"
[FAIL] кил<v><iv><iter><cond><p1><sg> => missing results: килкелӗттӗм
[FAIL] кил<v><iv><iter><cond><p1><sg> => unexpected results: килкелеӗттӗм
[FAIL] кил<v><iv><iter><cond><p2><sg> => missing results: килкелӗттӗн
[FAIL] кил<v><iv><iter><cond><p2><sg> => unexpected results: килкелеӗттӗн
[FAIL] кил<v><iv><iter><cond><p3><sg> => missing results: килкелӗччӗ
[FAIL] кил<v><iv><iter><cond><p3><sg> => unexpected results: килкелеӗччӗ
[FAIL] кил<v><iv><iter><cond><p1><pl> => missing results: килкелӗттӗмӗр
[FAIL] кил<v><iv><iter><cond><p1><pl> => unexpected results: килкелеӗттӗмӗр
[FAIL] кил<v><iv><iter><cond><p2><pl> => missing results: килкелӗттӗр
[FAIL] кил<v><iv><iter><cond><p2><pl> => unexpected results: килкелеӗттӗр
[FAIL] кил<v><iv><iter><cond><p3><pl> => missing results: килкелӗччӗҫ
[FAIL] кил<v><iv><iter><cond><p3><pl> => unexpected results: килкелеӗччӗҫ

$ echo "кала<v><tv><iter><cond><p1><sg>" | hfst-lookup .deps/chv.LR.lexc.hfst 2>/dev/null
кала<v><tv><iter><cond><p1><sg>	кала>к{A}л{A}>{Ă}т>т{Ă}м	0,000000
$ aq-morftest -ci v_кала_del.yaml | grep "<iter><cond>"
[FAIL] кала<v><tv><iter><cond><p1><sg> => missing results: калакалӑттӑм
[FAIL] кала<v><tv><iter><cond><p1><sg> => unexpected results: калакалаӑттӑм
[FAIL] кала<v><tv><iter><cond><p2><sg> => missing results: калакалӑттӑн
[FAIL] кала<v><tv><iter><cond><p2><sg> => unexpected results: калакалаӑттӑн
[FAIL] кала<v><tv><iter><cond><p3><sg> => missing results: калакалӗччӗ
[FAIL] кала<v><tv><iter><cond><p3><sg> => unexpected results: калакалаӗччӗ
[FAIL] кала<v><tv><iter><cond><p1><pl> => missing results: калакалӑттӑмӑр
[FAIL] кала<v><tv><iter><cond><p1><pl> => unexpected results: калакалаӑттӑмӑр
[FAIL] кала<v><tv><iter><cond><p2><pl> => missing results: калакалӑттӑр
[FAIL] кала<v><tv><iter><cond><p2><pl> => unexpected results: калакалаӑттӑр
[FAIL] кала<v><tv><iter><cond><p3><pl> => missing results: калакалӗччӗҫ
[FAIL] кала<v><tv><iter><cond><p3><pl> => unexpected results: калакалаӗччӗҫ

Make sure that .lexc file does not contain Latin ç ă etc.

fran@ipek:~/source/apertium/languages/apertium-chv$ bash dev/lint.sh 
0	missing multi	
5	mixed script	çи:çи N-INFORMACI; ! src: Chuvash wordlist ""	ӗç:ӗç V-TD; ! src: Chuvash wordlist ""	çи:çи V-TD; ! src: Chuvash wordlist ""	
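For reference, the "mixed script" check can be approximated in a few lines of Python. This is not the dev/lint.sh script itself, only a sketch of the idea: a word is flagged when it contains both Latin and Cyrillic letters according to their Unicode names. The function name mixed_script_words is hypothetical.

```python
import re
import unicodedata

# Runs of letters only (no digits, underscores or punctuation).
WORD = re.compile(r"[^\W\d_]+")


def mixed_script_words(line: str) -> list:
    """Return the words in a line that mix Latin and Cyrillic letters."""
    hits = []
    for word in WORD.findall(line):
        # The first token of the Unicode name is the script, e.g. "LATIN".
        scripts = {unicodedata.name(ch, "").split(" ")[0] for ch in word}
        if {"LATIN", "CYRILLIC"} <= scripts:
            hits.append(word)
    return hits


# \u00e7 is Latin c with cedilla, which looks like the Cyrillic letter.
print(mixed_script_words("\u00e7и:\u00e7и N-INFORMACI;"))
```

Purely Latin material such as continuation lexicon names and comments is not flagged, because the check only looks inside each individual word.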
