mideind / greynirengine

A fast, efficient natural language processing engine for Icelandic.

Home Page: https://greynir.is

License: Other

Languages: Python 89.71%, C++ 10.01%, Shell 0.28%

Topics: earley, icelandic, natural-language-processing, nlp, parser, parsing, parsing-engine, parsing-library, python, python-library, python3

greynirengine's People

Contributors

demux, holado, jokull, sultur, sveinbjornt, thorunna, vthorsteinsson

greynirengine's Issues

Add a facility to lemmatize text for search indexing

Greynir makes it easy to lemmatize text. If the parser fails, I can fall back to the bintokenizer and get multiple lemmas covering all possible meanings of each word. This makes for a great search index, even if it contains some extra lemmas for the sentences where the parser failed.

Perhaps Greynir should provide a function out of the box to do this, as it will be a common use case? I can share my code if anyone wants to see it.
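A minimal sketch of the fallback idea described above, for illustration only. `TokenInfo` and `lemmas_for_index` are hypothetical names, not part of Greynir; the sketch assumes that for each token you already have the parser's chosen lemma (if the parse succeeded) and the full list of candidate lemmas from the tokenizer lookup.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class TokenInfo:
    """Hypothetical stand-in for one token; not a Greynir class."""
    text: str
    # Lemma chosen by the parser, or None if the sentence did not parse
    parsed_lemma: Optional[str] = None
    # Every lemma the tokenizer lookup offers for this word form
    candidate_lemmas: List[str] = field(default_factory=list)


def lemmas_for_index(tokens: Iterable[TokenInfo]) -> Set[str]:
    """Collect lemmas for a search index, preferring the parser's choice."""
    lemmas: Set[str] = set()
    for tok in tokens:
        if tok.parsed_lemma is not None:
            # Parse succeeded: index the single disambiguated lemma
            lemmas.add(tok.parsed_lemma.lower())
        elif tok.candidate_lemmas:
            # Parse failed: index every candidate lemma
            lemmas.update(lemma.lower() for lemma in tok.candidate_lemmas)
        else:
            # Unknown word: index the surface form itself
            lemmas.add(tok.text.lower())
    return lemmas


parsed = [TokenInfo("á", parsed_lemma="á", candidate_lemmas=["á", "eiga"])]
unparsed = [TokenInfo("á", candidate_lemmas=["á", "eiga"])]
print(lemmas_for_index(parsed))
print(lemmas_for_index(unparsed))
```

The trade-off is the one noted in the issue: the fallback path over-generates lemmas, which costs some precision but keeps recall when a sentence cannot be parsed.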

Common names interpreted as verbs

I ran the 100 most common first names in Iceland through greynir.parse. No female names are interpreted as verbs but there are a few male ones. See this gist for the code.

https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5

  • Einar → eina so
  • Árni → árna so
  • Helgi → helga so
  • Ragnar → ragna so
  • Óskar → óska so
  • Birgir → birgja so
  • Brynjar → brynja so
  • Rúnar → rúna so
  • Ómar → óma so
  • Reynir → reyna so
  • Garðar → garða so
  • Steinar → steina so

Doesn't recognise letter in house number as part of address

Reynir doesn't identify alphabetic characters appended to house numbers as part of the address.

>>> s = r.parse_single("Hann býr á Bárugötu 14.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP-ADDR
          +-gata_þgf_kvk: 'Bárugötu'
          +-tala: '14'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14b.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP
          +-sérnafn: 'Bárugötu'
          +-tala: '14'
          +-no_et_þgf_hk: 'b'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14A.")
>>> print(s.tree.view)
P
+-S-MAIN
  +-IP
    +-NP-SUBJ
      +-pfn_kk_et_nf: 'Hann'
    +-VP-SEQ
      +-VP
        +-so_0_et_p3: 'býr'
      +-PP
        +-fs_þgf: 'á'
        +-NP
          +-no_et_þgf_kvk: 'Bárugötu'
          +-NP-POSS
            +-NP-MEASURE
              +-mælieining: '14 A'
+-'.'
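To make the expected behaviour concrete, here is a small sketch of what "a house number with an optional letter" could look like as a pattern. This is not how Greynir's tokenizer or grammar handles addresses; `HOUSE_NUMBER` and `split_house_number` are hypothetical names, and the rule (digits followed by at most one letter, optionally separated by a space) is an assumption.

```python
import re

# Assumption: a house number is digits, optionally followed by a single
# letter, with or without a space in between, e.g. "14", "14b", "14A", "14 A".
HOUSE_NUMBER = re.compile(r"^(\d+)\s?([A-Za-zÁÐÉÍÓÚÝÞÆÖáðéíóúýþæö])?$")


def split_house_number(text):
    """Return (number, letter) for a house number string, or None if it does not match."""
    m = HOUSE_NUMBER.match(text.strip())
    if m is None:
        return None
    number, letter = m.groups()
    # Normalise the letter to upper case so "14b" and "14B" compare equal
    return int(number), (letter.upper() if letter else None)


for candidate in ["14", "14b", "14A", "14 A", "Bárugötu"]:
    print(candidate, split_house_number(candidate))
```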

NounPhrase class returns a wrong word

Thank you for the great project!

While using the Python package, I ran into a bug where the wrong word is returned. When "Ragnar Þór Valgeirsson" is wrapped in a NounPhrase, the nominative case of Þór is sometimes rendered as "Þórr", and the result varies between runs.

How to reproduce

greynir_bug.py

from reynir import NounPhrase as Nl

nafn = Nl("Ragnar Þór Valgeirsson")

print(f"{nafn:nf}")

Results after executing ten times

for i in {1..10}; do python greynir_bug.py; done

Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson

Comma sensitivity in sentences

I’m not familiar with the parsing pipeline, but I thought I would share an instance where the parser tripped in a way that surprised me:

greynir.parse_single('Sótt er um leyfi til  byggja 50 leiguíbúðir fyrir námsmenn á lóð við Austurhlíð.').lemmas

This is fine and gives me the right lemmas for leiguíbúð, námsmaður etc.

greynir.parse_single('Sótt er um leyfi til  byggja 50 leiguíbúðir fyrir námsmenn, á lóð við Austurhlíð.').lemmas

The comma before "á lóð" gives me the "eiga" lemma for "á" instead of just "á".

Sorry if GitHub issues is the wrong place. I’m mainly curious about the roadmap, design and limitations. I assume Greynir uses commas to fragment sentences to keep down the parse pathways.

BTW this is a real world example.

Loving Greynir and following your progress! ✨

EDIT: A screenshot might help (attached to the original issue, dated 2020-05-13).

Use company suffix abbreviations to label company entities

I’m writing some code to detect mentions of companies. The corpus uses the ehf/ohf/sf/etc. suffixes, so that’s a strong indicator for me, and potentially for Greynir too.

I know that the Greynir website has an entity recognizer, but it seems quite strongly coupled to the database. Is there a case for bintokenizer to adopt a new token type? Or perhaps for Greynir to become company-entity aware?

I have some interesting examples of company names if that’s useful. I’m currently using an imperfect regex to match company names and then using Greynir to go back to the indefinite form.

  • Miðbæjarhótel/Centerhotels ehf.
  • Reitir - hótel ehf.
  • 105 Miðborg slhf.
  • Faxaflóahafnir sf.
  • Bjarg íbúðafélag hses.
  • Efstaleitis Apótek ehf.
  • Íþrótta- og sýningahöllin hf.
  • V-16 ehf.

These are the suffixes I’ve come across:

  • ehf.
  • slhf.
  • sf.
  • hses.
  • hf.
  • ohf.
  • bs.
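A rough sketch of suffix-based detection using the suffixes listed above. This is not the regex mentioned in the issue and not a Greynir or bintokenizer feature; `COMPANY_SUFFIXES` and `split_company_name` are hypothetical names, and the rule (the last whitespace-separated token must be a known suffix) is an assumption. It only checks a candidate string that has already been isolated; finding where a company name starts inside running text is a separate problem.

```python
from typing import Optional, Tuple

# Suffixes taken from the list above; the matching rule itself is an assumption.
COMPANY_SUFFIXES = ("ehf.", "slhf.", "sf.", "hses.", "hf.", "ohf.", "bs.")


def split_company_name(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, suffix) if text ends with a known company suffix, else None."""
    # Split off the last whitespace-separated token so that "ehf." is never
    # mistaken for the shorter suffix "hf."
    parts = text.strip().rsplit(None, 1)
    if len(parts) != 2:
        return None
    name, last = parts
    if last.lower() in COMPANY_SUFFIXES:
        return name, last.lower()
    return None


for candidate in ["Reitir - hótel ehf.", "V-16 ehf.", "Faxaflóahafnir"]:
    print(candidate, "->", split_company_name(candidate))
```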

NounPhrase not behaving as expected

from reynir import NounPhrase

np_1 = NounPhrase('ýmsir menn, þar á meðal þessi')
print(f'Ég er í slagtogi með {np_1:þgf}.')
# Output: Ég er í slagtogi með ýmsum mönnum, þar á meðal þessum.
# Expected: Same as output

np_2 = NounPhrase('ýmsir menn, til dæmis þessi')
print(f'Ég er í slagtogi með {np_2:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, til dæmis þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, til dæmis þessum.

np_3 = NounPhrase('ýmsir menn, t.d. þessi')
print(f'Ég er í slagtogi með {np_3:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, t.d. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, t.d. þessum.

np_4 = NounPhrase('ýmsir menn, þ.á m. þessi')
print(f'Ég er í slagtogi með {np_4:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, þ.á m. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, þ.á m. þessum.

Incorrect declension of "skipulags- og byggingarlög" in version 3.5.5.

Using version 3.5.5, consider the following ipython log.

In [15]: from reynir import NounPhrase

In [16]: name = NounPhrase("skipulags- og byggingarlög")

In [17]: name
Out[17]: <reynir.NounPhrase('skipulags- og byggingarlög'), parsed>

In [18]: name.dative
Out[18]: 'skipulags- og byggingarlög'

In [19]: name.genitive
Out[19]: 'skipulags- og byggingarlög'

The dative is expected to be "skipulags- og byggingarlögum" and the genitive is expected to be "skipulags- og byggingarlaga".

Lookup verb form variants

Is there a way to get verb form variants in the same way you can get case variants for nouns?

Something like

>>> BIN_Db.lookup_past_participle("sækja")

The use case is for a results highlighter. Lemmas are indexed, but I would like to highlight the original forms based on search string lemmas. For this I need to potentially highlight derived word forms. I’m basically writing a get_all_meaning_wordforms function that returns a set of strings that should be highlighted.
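A sketch of the highlighter described above, assuming the word forms for a lemma are already available from somewhere. `SAMPLE_FORMS` is a small hand-written table for illustration only (a real table would come from BÍN), and `highlight` is a hypothetical helper; none of this is an existing Greynir or BIN_Db API.

```python
import re
from typing import Dict, Set

# Hand-written sample data for illustration; a real table would come from BÍN.
SAMPLE_FORMS: Dict[str, Set[str]] = {
    "sækja": {"sækja", "sæki", "sækir", "sótti", "sótt"},
}


def get_all_meaning_wordforms(
    lemma: str, forms: Dict[str, Set[str]] = SAMPLE_FORMS
) -> Set[str]:
    """Return every known word form for a lemma (just the lemma itself if unknown)."""
    return set(forms.get(lemma, set())) | {lemma}


def highlight(text: str, lemma: str) -> str:
    """Wrap each occurrence of any form of `lemma` in asterisks."""
    # Longest forms first, so that a longer form is tried before its prefix
    wordforms = sorted(get_all_meaning_wordforms(lemma), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, wordforms)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(r"*\1*", text)


print(highlight("Sótt er um leyfi", "sækja"))
```

Matching whole words with `\b` keeps a short form from being highlighted inside a longer, unrelated word.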

auto_uppercase for 'sólar í dag í reykjavík'

Tokenizing the phrase 'hver er hæð sólar í dag í reykjavík' with auto_uppercase=True results in 'Sólar Í Dag Í Reykjavík' being interpreted as a single name token (TOK.PERSON).

This also happens when 'í reykjavík' is omitted (resulting in 'Sólar Í Dag').
('hæð' probably shouldn't be capitalized either in this case).

Add POS tags from BÍN to variants returned by Reynir

BÍN has POS tags such as present and past tense (NT, ÞT) and attached definite article (gr). These are not always returned in Reynir's variant lists, since they are not significant for the parse as such. However, this information may well be useful for Reynir clients. An augmentation feature should be added to Reynir that adds any significant missing BÍN tags to terminal variants before they are returned from Reynir, for instance in the _Sentence.terminals property.
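The proposed augmentation could look roughly like the following sketch. It is not Reynir's implementation: `augment_variants` and `SIGNIFICANT_BIN_TAGS` are hypothetical names, the set of "significant" tags is limited to the three named in this issue, and the BÍN tag strings are assumed to look like "GM-FH-NT-3P-ET" or "NFETgr".

```python
from typing import List, Sequence

# Tags named in the issue text; treating exactly these as "significant"
# is an assumption made for this sketch.
SIGNIFICANT_BIN_TAGS = ("NT", "ÞT", "gr")


def augment_variants(variants: Sequence[str], bin_tag: str) -> List[str]:
    """Return the variants plus any significant BÍN tags missing from them.

    `bin_tag` is assumed to be a BÍN tag string such as "GM-FH-NT-3P-ET"
    (verb) or "NFETgr" (noun with attached article). Variants are assumed
    to be lower-case, so each BÍN tag is lower-cased before being added.
    """
    result = list(variants)
    for tag in SIGNIFICANT_BIN_TAGS:
        variant = tag.lower()
        # Simple substring check; a real implementation would parse the tag
        if tag in bin_tag and variant not in result:
            result.append(variant)
    return result


print(augment_variants(["et", "p3"], "GM-FH-NT-3P-ET"))
print(augment_variants(["et", "nf", "kk"], "NFETgr"))
```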
