apertium / apertium-kaz
Apertium linguistic data for Kazakh
Home Page: https://apertium.github.io/apertium-kaz/
License: GNU General Public License v3.0
Currently all possible <neg><ifi><evid> forms have two possible generated forms. For example, кет<v><iv><neg><ifi><evid><p1><sg> outputs both кетпеппін and кеткен жоқ екенмін.
The forms кетпедім and кеткен жоқпын both analyse as кет<v><iv><neg><ifi><p1><sg>, but this analysis only generates the latter form.
We eventually need to find a (tag-based) way to distinguish between these forms. For the time being, we probably need to set one of the neg.ifi.evid forms to Dir/LR.
See this issue in apertium-grn for code/ideas.
The issue with the reorganisation of the lexicon in de4c77a is that different parts of speech are all lumped together.
Every single other Turkic transducer uses the lexicon names Nouns, Adjectives, Verbs, ProperNouns, etc. This is standardised for several reasons, one of which is that it gives us an easy way to count the number of stems of a particular type. E.g., note that the countstems script was broken by your changes.
@IlnarSelimcan, could you justify why you did this reorganisation? Also, in principle this sort of major restructuring should be done in consultation with and by consensus among everyone it affects—that is, everyone who's committed to this repo, or at least the apertium-turkic mailing list.
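To illustrate why standardised lexicon names matter for counting stems, here is a minimal sketch. It is not the actual countstems script: the function name `count_stems_by_lexicon`, the sample entries, and the assumption that every stem is a single `upper:lower Continuation ;` line under a `LEXICON` header are all hypothetical.

```python
import re
from collections import Counter

# "LEXICON Name" header lines.
LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")
# Assumed entry shape: "upper:lower Continuation ;" (a "!" comment may follow).
ENTRY_RE = re.compile(r"^\S+:\S+\s+\S+\s*;")


def count_stems_by_lexicon(lexc_text):
    """Count entry lines under each LEXICON header (hypothetical helper)."""
    counts = Counter()
    current = None
    for raw in lexc_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment-only lines
        header = LEXICON_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if current is not None and ENTRY_RE.match(line):
            counts[current] += 1
    return counts


# Illustrative sample only; not taken from apertium-kaz.kaz.lexc.
SAMPLE = """\
LEXICON Nouns
мектеп:мектеп N1 ; ! ""
қаржы:қаржы N1 ; ! ""

LEXICON Adjectives
астам:астам A1 ; ! ""
"""

if __name__ == "__main__":
    for name, n in sorted(count_stems_by_lexicon(SAMPLE).items()):
        print(name, n)
```

A count like this only works per part of speech if each part of speech lives in its own predictably named lexicon, which is the point made above.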
hfst-fst2strings -c 1 .deps/NUM.hfst | gzip -c > NUM.txt.gz
Killed
make: *** No rule to make target 'NUM-ROMAN.txt.gz', needed by 'all'. Stop.
rm .deps/NUM.hfst .deps/NUM.prefix.bin .deps/NUM.prefix.upper .deps/NUM.prefix.att .deps/NUM.prefix.hfst .deps/NUM.prefixes
Took about an hour and used up all 64GB of RAM on my lab machine plus the additional 64GB of swap before dying.
I'm wondering if it might be cyclical?
(1) 5 солай сол PRON prn PronType=Dem 8 advmod _ _
(2) 2 солай солай ADV adv _ 5 ccomp _ _
(3) 3 олай ол PRON prn PronType=Dem 5 advmod _ _
A relevant snippet from validate.py's output:
[Line 1935 Sent akorda-random.tagged.txt:164:2942 Node 5]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'
Therefore #17 will change UPOS to ADV, but keep XPOS = prn and also keep the PronType=Dem to keep puupankki apertium compatible. The kaz.udx file will be adjusted accordingly so that {ол|сол<prn>} get converted into ADV prn PronType=Dem.
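The intended change can be sketched on CoNLL-U lines directly. This is only an illustration under assumptions: the real adjustment goes through kaz.udx, whose format is not shown here, and the helper name `fix_advmod_upos` is hypothetical.

```python
def fix_advmod_upos(conllu_line):
    """Set UPOS to ADV for an advmod token tagged PRON/prn.

    XPOS and FEATS (e.g. PronType=Dem) are left untouched.
    Lines that are not 10-column token lines are returned unchanged.
    """
    cols = conllu_line.split("\t")
    if len(cols) != 10:
        return conllu_line
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    upos, xpos, deprel = cols[3], cols[4], cols[7]
    if deprel == "advmod" and upos == "PRON" and xpos == "prn":
        cols[3] = "ADV"
    return "\t".join(cols)


if __name__ == "__main__":
    before = "\t".join(
        ["5", "солай", "сол", "PRON", "prn", "PronType=Dem", "8", "advmod", "_", "_"]
    )
    print(fix_advmod_upos(before))
```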
This issue is here for people who won't see the validation errors I'm going through and who are likely to ask me later "why did you change this". (rant)
./configure: line 2392: syntax error near unexpected token `HFSTOSPELL,'
./configure: line 2392: `PKG_CHECK_MODULES(HFSTOSPELL, hfstospell >= 0.2)'
Similar to #10, Kazakh has the issue of two neg.ifi paradigms.
First-person singular (neg.ifi.p1.sg) looks like this:
The question is whether there is a difference in usage between these two forms, or whether they are identical. The answer will determine what needs to be done in the transducer about this issue.
Sorry, that seems to be my mistake ;-)
In generating files for testvoc lite in tests/morphotactics, N1.txt.gz contains things like мектепсұңдаршү:мектеп<n><nom>+е<cop><aor><p2><pl>+шы<emph><err_orth>. Is this intended?
Include c → с
Can I use your program as a stemmer and/or lemmatizer for the Kazakh language?
The kaz-morph mode uses lt-proc and kaz.automorf.bin.
However, it only returns this:
Error: Invalid dictionary (hint: the left side of an entry is empty)
This appears to be because of some ~empty paths, e.g.
0 2 @0@ <ltr> 0.000000
2 83630 @0@ @0@ 0.000000
2 848 @0@ @0@ 0.000000
848 0.000000
83630 0.000000
These empty paths appear to be due to the guesser being intersected with kaz@[email protected]. Relevant excerpts below:
apertium-kaz.kaz.lexc:
LEXICON LTR
%<ltr%>: # ;
LEXICON Guesser
<( а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н |
ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы |
ъ | э | ю | я )> LTR ;
apertium-kaz.Cyrl-Arab.twol:
ь:0
ы:ى
ъ:0
Ideally, I think we need to find some way to not intersect the guesser part of the transducer with Cyrl-Arab. Alternatively, we could tweak the lexical conversion to not allow paths that would just be 0 (though I'm not positive how to do that upon first contemplation).
(Thanks to @mr-martian for helping me figure out why lt-proc was failing.)
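As a rough way to spot the kind of empty path shown above, the following sketch scans an ATT dump for a final state reachable from the start state through transitions whose input (third-column) symbol is always `@0@`. The function name `has_empty_input_path` is hypothetical, and the sketch assumes that state 0 is the start state and that fields are whitespace-separated.

```python
from collections import defaultdict, deque

EPSILON = "@0@"

# The excerpt quoted in this issue.
ATT_EXCERPT = """\
0 2 @0@ <ltr> 0.000000
2 83630 @0@ @0@ 0.000000
2 848 @0@ @0@ 0.000000
848 0.000000
83630 0.000000
"""


def has_empty_input_path(att_text):
    """True if a final state is reachable from state "0" using only
    transitions whose input symbol is epsilon."""
    eps_edges = defaultdict(list)
    finals = set()
    for line in att_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Transition: source, target, input symbol, output symbol[, weight]
            src, dst, insym = fields[0], fields[1], fields[2]
            if insym == EPSILON:
                eps_edges[src].append(dst)
        elif fields:
            # Final state: state[, weight]
            finals.add(fields[0])
    seen = {"0"}
    queue = deque(["0"])
    while queue:
        state = queue.popleft()
        if state in finals:
            return True
        for nxt in eps_edges[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    print(has_empty_input_path(ATT_EXCERPT))
```

Running a check like this on the ATT output before compiling with lttoolbox might help confirm whether a given change to the guesser or the Cyrl-Arab rules removes the empty entries.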
@IlnarSelimcan @ftyers
The vocabulary of apertium-kaz.kaz.lexc requires checking for redundancy, consistency and miscategorizations. Here are some examples:
кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
жонылған:жонылған A1 ; ! ""
сүрілген:сүрілген A1 ; ! ""
Along with that, the reasons why these are considered mistakes and, more generally, the choices made should be documented in apertium-kaz/docs so that this kind of issue doesn't happen in the future.
At that point, since the coverage of apertium-kaz is relatively high, that documentation will probably be more useful for other (Turkic) languages than for Kazakh.
modes.xml includes a handful of modes with install="yes", but the required files aren't installed.
* Failed to find '/usr/share/apertium/apertium-kaz/.deps/kaz.twol.hfst' in install image.
* QA: missing files required for mode kaz-twol.
* Failed to find '/usr/share/apertium/apertium-kaz/.deps/kaz.lexc.hfst' in install image.
* QA: missing files required for mode kaz-lexc.
* Failed to find '/usr/share/apertium/apertium-kaz/kaz.zhfst' in install image.
* QA: missing files required for mode kaz-spell.
* Failed to find '/usr/share/apertium/apertium-kaz/.deps/acceptor.default.hfst' in install image.
* QA: missing files required for mode kaz-tokenise.
My guess is kaz-{twol,lexc} shouldn't be installed, kaz-spell should
be dependent on --enable-ospell. Not sure about kaz-tokenise.
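A check along the lines of the QA messages above could be sketched as follows. The helper name `missing_files_for_installed_modes`, the sample XML and its program names are hypothetical placeholders; only the mode names and file paths come from this report.

```python
import xml.etree.ElementTree as ET


def missing_files_for_installed_modes(modes_xml, installed):
    """Map each install="yes" mode to the files it references that are
    not in the `installed` set (hypothetical helper)."""
    root = ET.fromstring(modes_xml)
    missing = {}
    for mode in root.iter("mode"):
        if mode.get("install") != "yes":
            continue
        needed = [f.get("name") for f in mode.iter("file")]
        absent = [name for name in needed if name not in installed]
        if absent:
            missing[mode.get("name")] = absent
    return missing


# Illustrative modes.xml fragment; the kaz-twol program name is a placeholder.
SAMPLE_MODES = """\
<modes>
  <mode name="kaz-morph" install="yes">
    <pipeline>
      <program name="lt-proc -w">
        <file name="kaz.automorf.bin"/>
      </program>
    </pipeline>
  </mode>
  <mode name="kaz-twol" install="yes">
    <pipeline>
      <program name="placeholder-program">
        <file name=".deps/kaz.twol.hfst"/>
      </program>
    </pipeline>
  </mode>
</modes>
"""

if __name__ == "__main__":
    print(missing_files_for_installed_modes(SAMPLE_MODES, {"kaz.automorf.bin"}))
```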
General context: #17
Actually several related issues:
мың and миллиард are NUM num everywhere, while миллион is NUM num in some cases and NOUN n in others.
млрд. and трлн. are NOUN abbr everywhere, while млн. is tagged as NUM num in some cases and as NOUN abbr in others.
(a)
4 2 2 NUM num NumType=Card 5 compound _ _
5 миллиард миллиард NUM num NumType=Card 6 compound _ _
6 300 300 NUM num NumType=Card 7 compound _ _
7 миллион миллион NUM num NumType=Card 8 nummod _ _
8 теңгеден теңге NOUN n Case=Abl 10 nmod _ _
9 астам астам ADJ adj _ 10 amod _ _
10 қаржы қаржы NOUN n Case=Nom 11 obj _ _
vs (b)
3 4,3 4,3 NUM num NumType=Card 4 nummod _ _
4 мыңнан мың NUM num Case=Abl|NumType=Card,Ord 6 nmod _ _
5 астам астам ADJ adj _ 6 amod _ _
6 шақырымды шақырым NOUN n Case=Acc 7 obj _ _
Hereby I suggest:
мың, миллион, миллиард, триллион, млн., млрд. and трлн. as NUM num. For the latter three, apertium-kaz & co can be modified to output <abbr> as a secondary tag, i.e. млн\.? --> <num><abbr>. Since there are abbreviated nouns, abbreviated numerals etc., for known abbreviations I think it makes sense to make <abbr> a secondary tag, especially in the context of UD annotation:
[quote https://universaldependencies.org/u/pos/all.html#sym-symbol]
Strings that consists entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10. Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometer), Dr (Doctor) should be tagged nouns. Acronyms for proper names such as UN and NATO should be tagged as proper nouns.
[unquote]
but also generally speaking knowing the POS of the unabbreviated form is considered helpful for applications.
UPDATE: note that in UD there is the Abbr feature: https://universaldependencies.org/u/feat/Abbr.html
compounds (i.e. as done in 3a). In other words, a flat chain of compounds, with the rightmost element being the head receiving nummod or nmod or whatever.
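The suggestion above of treating <abbr> as a secondary tag on numerals (млн\.? --> <num><abbr>) can be sketched like this. It is only an illustration of the proposed tagging, not how apertium-kaz implements it; the function name `numeral_tags` is hypothetical.

```python
import re

# Abbreviated numerals, with an optional trailing dot (per the млн\.? pattern).
NUMERAL_ABBR_RE = re.compile(r"^(млн|млрд|трлн)\.?$")
# Unabbreviated numerals proposed as NUM num.
FULL_NUMERALS = {"мың", "миллион", "миллиард", "триллион"}


def numeral_tags(form):
    """Return the proposed tag string for a numeral form, or None."""
    if NUMERAL_ABBR_RE.match(form):
        return "<num><abbr>"
    if form in FULL_NUMERALS:
        return "<num>"
    return None


if __name__ == "__main__":
    for form in ["млн.", "млрд", "миллион", "теңге"]:
        print(form, numeral_tags(form))
```

With the primary tag carrying the part of speech of the full form, a downstream converter could map <num> to UPOS NUM and <abbr> to the UD Abbr feature.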