mideind / greynirengine

A fast, efficient natural language processing engine for Icelandic.

License: Other

Shell 0.28% Python 89.71% C++ 10.01%

earley icelandic natural-language-processing nlp parser parsing parsing-engine parsing-library python python-library python3

greynirengine's People

bjornkri sverrirab stefankjartansson haukurb vesteinn demux jokull alxmrphi rbadi76 sultur

greynirengine's Issues

Add a facility to lemmatize text for search indexing

Greynir makes it easy to lemmatize text. If the parser fails, I can fall back to the bintokenizer and get multiple lemmas covering all meanings. This makes for a great search index, even if there are some extra lemmas in it when the parser fails.

Perhaps Greynir should provide a function out of the box to do this, as it will be a common use case? I can share my code if anyone wants to see it.
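
The fallback described above could look roughly like this. It is a minimal sketch, not Greynir's API: `parse_lemmas` and `token_lemmas` are hypothetical callables standing in for the parser and the tokenizer lookup, and the parser one is assumed to return None when the parse fails.

```python
from typing import Callable, Iterable, List, Optional, Set


def index_lemmas(
    sentence: str,
    parse_lemmas: Callable[[str], Optional[List[str]]],
    token_lemmas: Callable[[str], Iterable[Set[str]]],
) -> Set[str]:
    """Return the set of lemmas to index for one sentence.

    Both callables are hypothetical stand-ins: parse_lemmas returns one
    lemma per token, or None when the parse fails; token_lemmas returns
    every candidate lemma for each token.
    """
    parsed = parse_lemmas(sentence)
    if parsed is not None:
        # The parse succeeded, so each token has one disambiguated lemma
        return set(parsed)
    # The parse failed: index all candidate lemmas for every token
    lemmas: Set[str] = set()
    for candidates in token_lemmas(sentence):
        lemmas.update(candidates)
    return lemmas
```

The extra lemmas only show up on the fallback path, so the index stays precise for sentences that parse.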

Common names interpreted as verbs

I ran the 100 most common first names in Iceland through greynir.parse. No female names are interpreted as verbs but there are a few male ones. See this gist for the code.

https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5

Einar → eina so
Árni → árna so
Helgi → helga so
Ragnar → ragna so
Óskar → óska so
Birgir → birgja so
Brynjar → brynja so
Rúnar → rúna so
Ómar → óma so
Reynir → reyna so
Garðar → garða so
Steinar → steina so

Document how to append to the grammar

If there is a way to augment the grammar file for specific project contexts, that should be documented.

Having BÍN meanings and terminals, make it easier to get strings for variants

I would loooove it if it were easier to reach other case and number variants when you have a token meaning or terminal instance. Something like token.get_singular and token.get_accusative.

AttributeError: 'BIN_Nonterminal' object has no attribute 'matches_category'

Using the example code from https://greynir.is/doc/quickstart.html and the text from ruv.is:

my_text = "Ákæran var þingfest í Héraðsdómi Reykjaness í dag en fréttastofu er ekki kunnugt um hvort maðurinn játaði eða neitaði sök þar sem þinghaldið í málinu er lokað."

Gives the exception in the title.

AttributeError: 'NoneType' object has no attribute 'first'

Using the example code from https://greynir.is/doc/quickstart.html and the text from ruv.is:

my_text = "Viðar Garðarsson , sem setti upp vefsíður fyrir Sigmund Davíð Gunnlaugsson í kjölfar birtingu Panamaskjalanna , segist ekki vita hvers vegna ákveðið var að segja að vefjunum væri haldið úti af stuðningsmönnum Sigmundar"

Gives the exception in the title.

Doesn't recognise letter in house number as part of address

Reynir doesn't identify alphabetic characters appended to house numbers as part of the address.

>>> s = r.parse_single("Hann býr á Bárugötu 14.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP-ADDR
          +-gata_þgf_kvk: 'Bárugötu'
          +-tala: '14'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14b.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP
          +-sérnafn: 'Bárugötu'
          +-tala: '14'
          +-no_et_þgf_hk: 'b'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14A.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP
          +-no_et_þgf_kvk: 'Bárugötu'
          +-NP-POSS
            +-NP-MEASURE
              +-mælieining: '14 A'
+-'.'
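
For illustration, the house-number forms in the three sentences above can be split with a small regular expression. This is a standalone sketch and not how the tokenizer handles addresses; `split_house_number` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Digits, an optional space, then at most one letter: "14", "14b", "14 A"
_HOUSE_NUMBER = re.compile(r"(\d+)\s?([^\W\d_])?")


def split_house_number(text: str) -> Optional[Tuple[int, Optional[str]]]:
    """Split a house number into (number, letter); letter is None if absent."""
    m = _HOUSE_NUMBER.fullmatch(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```

The character class `[^\W\d_]` accepts any single letter, including Icelandic ones, so both "14b" and "14A" are covered.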

NounPhrase class returns a wrong word

Thank you for the great project!

While using the Python package, I ran into a bug where the wrong word form is returned. The nominative case of Þór in "Ragnar Þór Valgeirsson" is sometimes wrong when using a NounPhrase.

How to reproduce

greynir_bug.py

from reynir import NounPhrase as Nl

nafn = Nl("Ragnar Þór Valgeirsson")

print(f"{nafn:nf}")

Results after executing ten times

for i in {1..10}; do python greynir_bug.py; done

Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
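
The shell loop above can also be written in Python so that the distinct outputs are counted. This is a sketch of a reproduction harness only; `tally_runs` is a hypothetical helper, and each run uses a fresh interpreter process just like the loop does.

```python
import subprocess
import sys
from collections import Counter
from typing import List


def tally_runs(cmd: List[str], runs: int = 10) -> Counter:
    """Run cmd in a fresh process `runs` times and count each distinct output."""
    counts: Counter = Counter()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        counts[result.stdout.strip()] += 1
    return counts


# Intended use, assuming greynir_bug.py is in the current directory:
#   tally_runs([sys.executable, "greynir_bug.py"])
```

For the ten runs listed above, the tally would be six times "Ragnar Þórr Valgeirsson" and four times "Ragnar Þór Valgeirsson".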

struct.error: unpack_from requires a buffer of at least 4 bytes

Steps:

apt install git-lfs
git clone https://github.com/mideind/ReynirPackage
cd ReynirPackage
python setup.py develop
pip install pytest
python -m pytest

Results in the following error:

Comma sensitivity in sentences

I’m not familiar with the parsing pipeline but I thought I would share an instance of where the parser tripped in a (to me) surprising way:

greynir.parse_single('Sótt er um leyfi til að byggja 50 leiguíbúðir fyrir námsmenn á lóð við Austurhlíð.').lemmas

This is fine and gives me the right lemmas for leiguíbúð, námsmaður etc.

greynir.parse_single('Sótt er um leyfi til að byggja 50 leiguíbúðir fyrir námsmenn, á lóð við Austurhlíð.').lemmas

The comma before "á lóð" gives me the "eiga" lemma for "á" instead of just "á".

Sorry if GitHub issues is the wrong place. I’m mainly curious about the roadmap, design and limitations. I assume Greynir uses commas to fragment sentences to keep down the parse pathways.

BTW this is a real world example.

Loving Greynir and following your progress! ✨

EDIT: Screenshot might help

Use company suffix abbreviations to label company entities

I’m writing some code to detect mentions of companies. The corpus uses the ehf/ohf/sf/etc. suffixes, so that’s a strong indicator for me, and potentially for Greynir too.

I know that the Greynir website has an entity recognizer, but it seems quite strongly coupled to the database. Is there a case for bintokenizer to adopt a new token type? Or perhaps for Greynir to become company-entity aware?

I have some interesting examples of company names if that’s useful. I’m currently using an imperfect regex to match company names and then using Greynir to go back to the indefinite form.

Miðbæjarhótel/Centerhotels ehf.
Reitir - hótel ehf.
105 Miðborg slhf.
Faxaflóahafnir sf.
Bjarg íbúðafélag hses.
Efstaleitis Apótek ehf.
Íþrótta- og sýningahöllin hf.
V-16 ehf.

These are the suffixes I’ve come across:

ehf.
slhf.
sf.
hses.
hf.
ohf.
bs.
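
A minimal sketch of suffix-based detection using only the suffixes listed above. It is a plain regex and not part of Greynir or the tokenizer; `company_suffix` is a hypothetical helper name.

```python
import re
from typing import Optional

# Suffixes listed above; extend as more are found
SUFFIXES = ("ehf", "slhf", "sf", "hses", "hf", "ohf", "bs")

# The suffix must be a separate word at the end of the name, followed by a period
_SUFFIX_RE = re.compile(r"(?:^|\s)(" + "|".join(SUFFIXES) + r")\.$")


def company_suffix(name: str) -> Optional[str]:
    """Return the legal-form suffix that ends the name, or None if there is none."""
    m = _SUFFIX_RE.search(name.strip())
    return m.group(1) if m else None
```

Requiring whitespace before the suffix keeps "ohf." and "slhf." from being read as "hf.".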

NounPhrase not behaving as expected

from reynir import NounPhrase

np_1 = NounPhrase('ýmsir menn, þar á meðal þessi')
print(f'Ég er í slagtogi með {np_1:þgf}.')
# Output: Ég er í slagtogi með ýmsum mönnum, þar á meðal þessum.
# Expected: Same as output

np_2 = NounPhrase('ýmsir menn, til dæmis þessi')
print(f'Ég er í slagtogi með {np_2:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, til dæmis þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, til dæmis þessum.

np_3 = NounPhrase('ýmsir menn, t.d. þessi')
print(f'Ég er í slagtogi með {np_3:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, t.d. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, t.d. þessum.

np_4 = NounPhrase('ýmsir menn, þ.á m. þessi')
print(f'Ég er í slagtogi með {np_4:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, þ.á m. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, þ.á m. þessum.

Document memory usage and provide a parsing parameter to limit max tokens

Parsing leads to extreme memory requirements when it encounters long sentences. Perhaps it should be documented that 8 GB is the minimum memory and that 100 tokens is the maximum sentence fragment length?
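
Until such a parameter exists, a caller could filter on length before parsing. This is a sketch under assumptions: `partition_by_length` is a hypothetical helper, the default of 100 is simply the figure suggested above, and whitespace splitting only approximates the tokenizer's token count.

```python
from typing import Iterable, List, Tuple


def partition_by_length(
    sentences: Iterable[str], max_tokens: int = 100
) -> Tuple[List[str], List[str]]:
    """Split sentences into (short enough to parse, too long to parse)."""
    to_parse: List[str] = []
    skipped: List[str] = []
    for sentence in sentences:
        # Whitespace-separated words approximate the real token count
        if len(sentence.split()) <= max_tokens:
            to_parse.append(sentence)
        else:
            skipped.append(sentence)
    return to_parse, skipped
```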

Incorrect declension of "skipulags- og byggingarlög" in version 3.5.5.

Using version 3.5.5, consider the following ipython log.

In [15]: from reynir import NounPhrase

In [16]: name = NounPhrase("skipulags- og byggingarlög")

In [17]: name
Out[17]: <reynir.NounPhrase('skipulags- og byggingarlög'), parsed>

In [18]: name.dative
Out[18]: 'skipulags- og byggingarlög'

In [19]: name.genitive
Out[19]: 'skipulags- og byggingarlög'

The dative is expected to be "skipulags- og byggingarlögum" and the genitive is expected to be "skipulags- og byggingarlaga".

Lookup verb form variants

Is there a way to get verb form variants in the same way you can get case variants for nouns?

Something like

>>> BIN_Db.lookup_past_participle("sækja")

The use case is for a results highlighter. Lemmas are indexed, but I would like to highlight the original forms based on search string lemmas. For this I need to potentially highlight derived word forms. I’m basically writing a get_all_meaning_wordforms function that returns a set of strings that should be highlighted.
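
Roughly what I have in mind for the highlighter, as a sketch. The `wordforms` mapping from lemma to inflected forms is hypothetical and would have to come from a BÍN lookup; the forms in the usage note are toy data.

```python
import re
from typing import Dict, Iterable, Set


def forms_to_highlight(lemmas: Iterable[str], wordforms: Dict[str, Set[str]]) -> Set[str]:
    """Collect every word form that should be highlighted for the given lemmas."""
    forms: Set[str] = set()
    for lemma in lemmas:
        forms.add(lemma)
        forms |= wordforms.get(lemma, set())
    return forms


def highlight(text: str, forms: Set[str], mark: str = "**") -> str:
    """Wrap each whole-word occurrence of any form in `mark`, ignoring case."""
    if not forms:
        return text
    # Longest forms first so that a longer form wins over its own prefix
    ordered = sorted(forms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in ordered) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: f"{mark}{m.group(0)}{mark}", text)
```

With `wordforms = {"sækja": {"sæki", "sótti", "sótt"}}`, a search for "sækja" would mark "Sótt" in "Sótt er um leyfi".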

auto_uppercase for 'sólar í dag í reykjavík'

Tokenizing the phrase 'hver er hæð sólar í dag í reykjavík' with auto_uppercase=True results in 'Sólar Í Dag Í Reykjavík' being interpreted as a single name token (TOK.PERSON).

This also happens when 'í reykjavík' is omitted (resulting in 'Sólar Í Dag').
('hæð' probably shouldn't be capitalized either in this case).

Add POS tags from BÍN to variants returned by Reynir

BÍN has POS tags such as present and past tense (NT, ÞT), and attached definite article (gr), which are not always returned in Reynir's variant lists since they are not significant for the parse as such. However this may well be useful information for Reynir clients. An augmentation feature should be added to Reynir that adds any significant missing BÍN tags to terminal variants before they are returned from Reynir, for instance in the _Sentence.terminals property.
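
The augmentation step could be as small as the sketch below. It is not Reynir's implementation: `augment_variants` is a hypothetical helper, the variant and tag lists are assumed to be plain lists of strings, and treating exactly the three tags named above as significant is an assumption.

```python
from typing import FrozenSet, Iterable, List

# Tags named above, lowercased to match the style of terminal variants
SIGNIFICANT_TAGS: FrozenSet[str] = frozenset({"nt", "þt", "gr"})


def augment_variants(
    variants: List[str],
    bin_tags: Iterable[str],
    significant: FrozenSet[str] = SIGNIFICANT_TAGS,
) -> List[str]:
    """Return variants plus any significant BÍN tags that are not already present."""
    present = set(variants)
    result = list(variants)
    for tag in bin_tags:
        tag = tag.lower()
        if tag in significant and tag not in present:
            result.append(tag)
            present.add(tag)
    return result
```

Existing variants keep their order and the added tags follow them, so clients that only read the leading variants are unaffected.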